LKML Archive on lore.kernel.org
* hackbench regression since 2.6.25-rc
@ 2008-03-13  7:46 Zhang, Yanmin
  2008-03-13  8:48 ` Andrew Morton
  2008-03-13 15:14 ` Greg KH
  0 siblings, 2 replies; 31+ messages in thread
From: Zhang, Yanmin @ 2008-03-13  7:46 UTC (permalink / raw)
  To: Kay Sievers, Greg Kroah-Hartman; +Cc: LKML

Comparing with 2.6.24, on my 16-core tigerton, hackbench process mode has about
40% regression with 2.6.25-rc1, and more than 20% regression with kernel
2.6.25-rc4, because rc4 includes the revert of the scheduler load-balance patch.

Command to start it:
#hackbench 100 process 2000
I ran it 3 times and summed the values.
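The run-and-sum procedure can be sketched as a small script; the summing helper and the output parsing are assumptions (hackbench's exact output format varies by version), and a stub runtime keeps the sketch runnable where hackbench is not installed:

```shell
sum_times() {                      # sum one number per line from stdin
    awk '{ s += $1 } END { printf "%.3f\n", s }'
}

run_once() {
    if command -v hackbench >/dev/null 2>&1; then
        # hackbench prints a line like "Time: 12.345"; keep the last field
        hackbench 100 process 2000 | awk '/Time/ { print $NF }'
    else
        echo 1.500                 # stub runtime so the sketch stays runnable
    fi
}

echo "sum of 3 runs: $(for i in 1 2 3; do run_once; done | sum_times)s"
```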

I tried to investigate it by bisecting.
Kernel up to tag 0f4dafc0563c6c49e17fe14b3f5f356e4c4b8806 has the 20% regression.
Kernel up to tag 6e90aa972dda8ef86155eefcdbdc8d34165b9f39 has no regression.

Any bisect between the above 2 tags causes a kernel hang. I tried to check out points between
these 2 tags manually many times and the kernel always panicked.

All patches between the 2 tags are on the kobject restructure. I guess such a restructure
creates more cache misses on the 16-core tigerton.

Any idea?

-yanmin




* Re: hackbench regression since 2.6.25-rc
  2008-03-13  7:46 hackbench regression since 2.6.25-rc Zhang, Yanmin
@ 2008-03-13  8:48 ` Andrew Morton
  2008-03-13  9:28   ` Zhang, Yanmin
  2008-03-13 15:14 ` Greg KH
  1 sibling, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2008-03-13  8:48 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: Kay Sievers, Greg Kroah-Hartman, LKML, Ingo Molnar

On Thu, 13 Mar 2008 15:46:57 +0800 "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> wrote:

> Comparing with 2.6.24, on my 16-core tigerton, hackbench process mode has about
> 40% regression with 2.6.25-rc1, and more than 20% regression with kernel
> 2.6.25-rc4, because rc4 includes the revert of the scheduler load-balance patch.
> 
> Command to start it:
> #hackbench 100 process 2000
> I ran it 3 times and summed the values.
> 
> I tried to investigate it by bisecting.
> Kernel up to tag 0f4dafc0563c6c49e17fe14b3f5f356e4c4b8806 has the 20% regression.
> Kernel up to tag 6e90aa972dda8ef86155eefcdbdc8d34165b9f39 has no regression.
> 
> Any bisect between the above 2 tags causes a kernel hang. I tried to check out points between
> these 2 tags manually many times and the kernel always panicked.
> 
> All patches between the 2 tags are on the kobject restructure. I guess such a restructure
> creates more cache misses on the 16-core tigerton.
> 

That's pretty surprising - hackbench spends most of its time in userspace
and zeroing out anonymous pages.  It shouldn't be fiddling with kobjects
much at all.

Some kernel profiling might be needed here..


* Re: hackbench regression since 2.6.25-rc
  2008-03-13  8:48 ` Andrew Morton
@ 2008-03-13  9:28   ` Zhang, Yanmin
  2008-03-13  9:52     ` Andrew Morton
  2008-03-14  0:16     ` Christoph Lameter
  0 siblings, 2 replies; 31+ messages in thread
From: Zhang, Yanmin @ 2008-03-13  9:28 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter
  Cc: Kay Sievers, Greg Kroah-Hartman, LKML, Ingo Molnar

On Thu, 2008-03-13 at 01:48 -0700, Andrew Morton wrote:
> On Thu, 13 Mar 2008 15:46:57 +0800 "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> wrote:
> 
> > Comparing with 2.6.24, on my 16-core tigerton, hackbench process mode has about
> > 40% regression with 2.6.25-rc1, and more than 20% regression with kernel
> > 2.6.25-rc4, because rc4 includes the revert of the scheduler load-balance patch.
> > 
> > Command to start it:
> > #hackbench 100 process 2000
> > I ran it 3 times and summed the values.
> > 
> > I tried to investigate it by bisecting.
> > Kernel up to tag 0f4dafc0563c6c49e17fe14b3f5f356e4c4b8806 has the 20% regression.
> > Kernel up to tag 6e90aa972dda8ef86155eefcdbdc8d34165b9f39 has no regression.
> > 
> > Any bisect between the above 2 tags causes a kernel hang. I tried to check out points between
> > these 2 tags manually many times and the kernel always panicked.
> > 
> > All patches between the 2 tags are on the kobject restructure. I guess such a restructure
> > creates more cache misses on the 16-core tigerton.
> > 
> 
> That's pretty surprising - hackbench spends most of its time in userspace
> and zeroing out anonymous pages.
No. vmstat showed hackbench spends almost 100% in sys.

>   It shouldn't be fiddling with kobjects
> much at all.
> 
> Some kernel profiling might be needed here..
Thanks for your kind reminder. I don't know why I forgot it.

2.6.24 oprofile data:
CPU: Core 2, speed 1602 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        image name               app name                 symbol name
40200494 43.3899  linux-2.6.24             linux-2.6.24             __slab_alloc
35338431 38.1421  linux-2.6.24             linux-2.6.24             add_partial_tail
2993156   3.2306  linux-2.6.24             linux-2.6.24             __slab_free
1365806   1.4742  linux-2.6.24             linux-2.6.24             sock_alloc_send_skb
1253820   1.3533  linux-2.6.24             linux-2.6.24             copy_user_generic_string
1141442   1.2320  linux-2.6.24             linux-2.6.24             unix_stream_recvmsg
846836    0.9140  linux-2.6.24             linux-2.6.24             unix_stream_sendmsg
777561    0.8393  linux-2.6.24             linux-2.6.24             kmem_cache_alloc
587127    0.6337  linux-2.6.24             linux-2.6.24             sock_def_readable




2.6.25-rc4 oprofile data:
CPU: Core 2, speed 1602 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        image name               app name                 symbol name
46746994 43.3801  linux-2.6.25-rc4         linux-2.6.25-rc4         __slab_alloc
45986635 42.6745  linux-2.6.25-rc4         linux-2.6.25-rc4         add_partial
2577578   2.3919  linux-2.6.25-rc4         linux-2.6.25-rc4         __slab_free
1301644   1.2079  linux-2.6.25-rc4         linux-2.6.25-rc4         sock_alloc_send_skb
1185888   1.1005  linux-2.6.25-rc4         linux-2.6.25-rc4         copy_user_generic_string
969847    0.9000  linux-2.6.25-rc4         linux-2.6.25-rc4         unix_stream_recvmsg
806665    0.7486  linux-2.6.25-rc4         linux-2.6.25-rc4         kmem_cache_alloc
731059    0.6784  linux-2.6.25-rc4         linux-2.6.25-rc4         unix_stream_sendmsg


-yanmin




* Re: hackbench regression since 2.6.25-rc
  2008-03-13  9:28   ` Zhang, Yanmin
@ 2008-03-13  9:52     ` Andrew Morton
  2008-03-14  0:16     ` Christoph Lameter
  1 sibling, 0 replies; 31+ messages in thread
From: Andrew Morton @ 2008-03-13  9:52 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Christoph Lameter, Kay Sievers, Greg Kroah-Hartman, LKML, Ingo Molnar

On Thu, 13 Mar 2008 17:28:58 +0800 "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> wrote:

> On Thu, 2008-03-13 at 01:48 -0700, Andrew Morton wrote:
> > On Thu, 13 Mar 2008 15:46:57 +0800 "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> wrote:
> > 
> > > Comparing with 2.6.24, on my 16-core tigerton, hackbench process mode has about
> > > 40% regression with 2.6.25-rc1, and more than 20% regression with kernel
> > > 2.6.25-rc4, because rc4 includes the revert of the scheduler load-balance patch.
> > > 
> > > Command to start it:
> > > #hackbench 100 process 2000
> > > I ran it 3 times and summed the values.
> > > 
> > > I tried to investigate it by bisecting.
> > > Kernel up to tag 0f4dafc0563c6c49e17fe14b3f5f356e4c4b8806 has the 20% regression.
> > > Kernel up to tag 6e90aa972dda8ef86155eefcdbdc8d34165b9f39 has no regression.
> > > 
> > > Any bisect between the above 2 tags causes a kernel hang. I tried to check out points between
> > > these 2 tags manually many times and the kernel always panicked.
> > > 
> > > All patches between the 2 tags are on the kobject restructure. I guess such a restructure
> > > creates more cache misses on the 16-core tigerton.
> > > 
> > 
> > That's pretty surprising - hackbench spends most of its time in userspace
> > and zeroing out anonymous pages.
> No. vmstat showed hackbench spends almost 100% in sys.

ah, I got confused about which test that is.

> >   It shouldn't be fiddling with kobjects
> > much at all.
> > 
> > Some kernel profiling might be needed here..
> Thanks for your kind reminder. I don't know why I forgot it.
> 
> 2.6.24 oprofile data:
> CPU: Core 2, speed 1602 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> samples  %        image name               app name                 symbol name
> 40200494 43.3899  linux-2.6.24             linux-2.6.24             __slab_alloc
> 35338431 38.1421  linux-2.6.24             linux-2.6.24             add_partial_tail
> 2993156   3.2306  linux-2.6.24             linux-2.6.24             __slab_free
> 1365806   1.4742  linux-2.6.24             linux-2.6.24             sock_alloc_send_skb
> 1253820   1.3533  linux-2.6.24             linux-2.6.24             copy_user_generic_string
> 1141442   1.2320  linux-2.6.24             linux-2.6.24             unix_stream_recvmsg
> 846836    0.9140  linux-2.6.24             linux-2.6.24             unix_stream_sendmsg
> 777561    0.8393  linux-2.6.24             linux-2.6.24             kmem_cache_alloc
> 587127    0.6337  linux-2.6.24             linux-2.6.24             sock_def_readable
> 
> 
> 
> 
> 2.6.25-rc4 oprofile data:
> CPU: Core 2, speed 1602 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> samples  %        image name               app name                 symbol name
> 46746994 43.3801  linux-2.6.25-rc4         linux-2.6.25-rc4         __slab_alloc
> 45986635 42.6745  linux-2.6.25-rc4         linux-2.6.25-rc4         add_partial
> 2577578   2.3919  linux-2.6.25-rc4         linux-2.6.25-rc4         __slab_free
> 1301644   1.2079  linux-2.6.25-rc4         linux-2.6.25-rc4         sock_alloc_send_skb
> 1185888   1.1005  linux-2.6.25-rc4         linux-2.6.25-rc4         copy_user_generic_string
> 969847    0.9000  linux-2.6.25-rc4         linux-2.6.25-rc4         unix_stream_recvmsg
> 806665    0.7486  linux-2.6.25-rc4         linux-2.6.25-rc4         kmem_cache_alloc
> 731059    0.6784  linux-2.6.25-rc4         linux-2.6.25-rc4         unix_stream_sendmsg
> 

So slub got a little slower?

(Is slab any better?)

Still, I don't think there are any kobject operations in these codepaths,
are there?  Maybe some related to the network device, but I doubt it -
networking tends to go it alone on those things, mainly for performance
reasons.





* Re: hackbench regression since 2.6.25-rc
  2008-03-13  7:46 hackbench regression since 2.6.25-rc Zhang, Yanmin
  2008-03-13  8:48 ` Andrew Morton
@ 2008-03-13 15:14 ` Greg KH
  2008-03-13 16:19   ` Randy Dunlap
  1 sibling, 1 reply; 31+ messages in thread
From: Greg KH @ 2008-03-13 15:14 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: Kay Sievers, LKML

On Thu, Mar 13, 2008 at 03:46:57PM +0800, Zhang, Yanmin wrote:
> Comparing with 2.6.24, on my 16-core tigerton, hackbench process mode has about
> 40% regression with 2.6.25-rc1, and more than 20% regression with kernel
> 2.6.25-rc4, because rc4 includes the revert of the scheduler load-balance patch.
> 
> Command to start it:
> #hackbench 100 process 2000
> I ran it 3 times and summed the values.
> 
> I tried to investigate it by bisecting.
> Kernel up to tag 0f4dafc0563c6c49e17fe14b3f5f356e4c4b8806 has the 20% regression.
> Kernel up to tag 6e90aa972dda8ef86155eefcdbdc8d34165b9f39 has no regression.
> 
> Any bisect between the above 2 tags causes a kernel hang. I tried to check out points between
> these 2 tags manually many times and the kernel always panicked.

Where is the kernel panicking?  The changeset right after the last one
above, bc87d2fe7a1190f1c257af8a91fc490b1ee35954, is a change to efivars;
are you using that in your .config?

> All patches between the 2 tags are on the kobject restructure. I guess such a restructure
> creates more cache misses on the 16-core tigerton.

Nothing should be creating kobjects on a normal load like this, so a
regression seems very odd.  Unless the /sys/kernel/uids/ stuff is
triggering this?

Do you have a link to where I can get hackbench (google seems to find
lots of reports with it, but not the source itself), so I can test to
see if we are accidentally creating kobjects with this load?

thanks,

greg k-h


* Re: hackbench regression since 2.6.25-rc
  2008-03-13 15:14 ` Greg KH
@ 2008-03-13 16:19   ` Randy Dunlap
  2008-03-13 17:12     ` Greg KH
  0 siblings, 1 reply; 31+ messages in thread
From: Randy Dunlap @ 2008-03-13 16:19 UTC (permalink / raw)
  To: Greg KH; +Cc: Zhang, Yanmin, Kay Sievers, LKML

On Thu, 13 Mar 2008 08:14:13 -0700 Greg KH wrote:

> On Thu, Mar 13, 2008 at 03:46:57PM +0800, Zhang, Yanmin wrote:
> > Comparing with 2.6.24, on my 16-core tigerton, hackbench process mode has about
> > 40% regression with 2.6.25-rc1, and more than 20% regression with kernel
> > 2.6.25-rc4, because rc4 includes the revert of the scheduler load-balance patch.
> > 
> > Command to start it:
> > #hackbench 100 process 2000
> > I ran it 3 times and summed the values.
> > 
> > I tried to investigate it by bisecting.
> > Kernel up to tag 0f4dafc0563c6c49e17fe14b3f5f356e4c4b8806 has the 20% regression.
> > Kernel up to tag 6e90aa972dda8ef86155eefcdbdc8d34165b9f39 has no regression.
> > 
> > Any bisect between the above 2 tags causes a kernel hang. I tried to check out points between
> > these 2 tags manually many times and the kernel always panicked.
> 
> Where is the kernel panicking?  The changeset right after the last one
> above, bc87d2fe7a1190f1c257af8a91fc490b1ee35954, is a change to efivars;
> are you using that in your .config?
> 
> > All patches between the 2 tags are on the kobject restructure. I guess such a restructure
> > creates more cache misses on the 16-core tigerton.
> 
> Nothing should be creating kobjects on a normal load like this, so a
> regression seems very odd.  Unless the /sys/kernel/uids/ stuff is
> triggering this?
> 
> Do you have a link to where I can get hackbench (google seems to find
> lots of reports with it, but not the source itself), so I can test to
> see if we are accidentally creating kobjects with this load?

The version that I see referenced most often (unscientifically :)
is somewhere under people.redhat.com/mingo/, like so:
http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c


---
~Randy


* Re: hackbench regression since 2.6.25-rc
  2008-03-13 16:19   ` Randy Dunlap
@ 2008-03-13 17:12     ` Greg KH
  2008-03-14  0:50       ` Zhang, Yanmin
  0 siblings, 1 reply; 31+ messages in thread
From: Greg KH @ 2008-03-13 17:12 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: Zhang, Yanmin, Kay Sievers, LKML

On Thu, Mar 13, 2008 at 09:19:21AM -0700, Randy Dunlap wrote:
> On Thu, 13 Mar 2008 08:14:13 -0700 Greg KH wrote:
> 
> > On Thu, Mar 13, 2008 at 03:46:57PM +0800, Zhang, Yanmin wrote:
> > > Comparing with 2.6.24, on my 16-core tigerton, hackbench process mode has about
> > > 40% regression with 2.6.25-rc1, and more than 20% regression with kernel
> > > 2.6.25-rc4, because rc4 includes the revert of the scheduler load-balance patch.
> > > 
> > > Command to start it:
> > > #hackbench 100 process 2000
> > > I ran it 3 times and summed the values.
> > > 
> > > I tried to investigate it by bisecting.
> > > Kernel up to tag 0f4dafc0563c6c49e17fe14b3f5f356e4c4b8806 has the 20% regression.
> > > Kernel up to tag 6e90aa972dda8ef86155eefcdbdc8d34165b9f39 has no regression.
> > > 
> > > Any bisect between the above 2 tags causes a kernel hang. I tried to check out points between
> > > these 2 tags manually many times and the kernel always panicked.
> > 
> > Where is the kernel panicking?  The changeset right after the last one
> > above, bc87d2fe7a1190f1c257af8a91fc490b1ee35954, is a change to efivars;
> > are you using that in your .config?
> > 
> > > All patches between the 2 tags are on the kobject restructure. I guess such a restructure
> > > creates more cache misses on the 16-core tigerton.
> > 
> > Nothing should be creating kobjects on a normal load like this, so a
> > regression seems very odd.  Unless the /sys/kernel/uids/ stuff is
> > triggering this?
> > 
> > Do you have a link to where I can get hackbench (google seems to find
> > lots of reports with it, but not the source itself), so I can test to
> > see if we are accidentally creating kobjects with this load?
> 
> The version that I see referenced most often (unscientifically :)
> is somewhere under people.redhat.com/mingo/, like so:
> http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c

Great, thanks for the link.

In using that version, I do not see any kobjects being created at all
when running the program.  So I don't see how a kobject change could
have caused any slowdown.

Yanmin, is the above link the version you are using?

Hm, running with "hackbench 100 process 2000" seems to lock up my
laptop, maybe I shouldn't run 4000 tasks at once on such a memory
starved machine...

thanks,

greg k-h


* Re: hackbench regression since 2.6.25-rc
  2008-03-13  9:28   ` Zhang, Yanmin
  2008-03-13  9:52     ` Andrew Morton
@ 2008-03-14  0:16     ` Christoph Lameter
  2008-03-14  3:04       ` Zhang, Yanmin
  1 sibling, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2008-03-14  0:16 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Andrew Morton, Kay Sievers, Greg Kroah-Hartman, LKML, Ingo Molnar

Could you recompile the kernel with slub performance statistics and post 
the output of

slabinfo -AD

?
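For reference, those statistics come from a build-time option; the sketch below assumes the 2.6.25-era option name and tool location, both of which would need checking against the actual tree:

```shell
# Kernel config needed for the per-cache SLUB event counters
# (assumed 2.6.25-era names):
#   CONFIG_SLUB=y
#   CONFIG_SLUB_STATS=y
# After booting the rebuilt kernel, build the userspace reader shipped in
# the kernel source and run it once the workload has finished:
gcc -o slabinfo Documentation/vm/slabinfo.c
./slabinfo -AD        # activity report, most active caches first
```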




* Re: hackbench regression since 2.6.25-rc
  2008-03-13 17:12     ` Greg KH
@ 2008-03-14  0:50       ` Zhang, Yanmin
  2008-03-14  5:01         ` Greg KH
  0 siblings, 1 reply; 31+ messages in thread
From: Zhang, Yanmin @ 2008-03-14  0:50 UTC (permalink / raw)
  To: Greg KH; +Cc: Randy Dunlap, Kay Sievers, LKML

On Thu, 2008-03-13 at 10:12 -0700, Greg KH wrote:
> On Thu, Mar 13, 2008 at 09:19:21AM -0700, Randy Dunlap wrote:
> > On Thu, 13 Mar 2008 08:14:13 -0700 Greg KH wrote:
> > 
> > > On Thu, Mar 13, 2008 at 03:46:57PM +0800, Zhang, Yanmin wrote:
> > > > Comparing with 2.6.24, on my 16-core tigerton, hackbench process mode has about
> > > > 40% regression with 2.6.25-rc1, and more than 20% regression with kernel
> > > > 2.6.25-rc4, because rc4 includes the revert of the scheduler load-balance patch.
> > > > 
> > > > Command to start it:
> > > > #hackbench 100 process 2000
> > > > I ran it 3 times and summed the values.
> > > > 
> > > > I tried to investigate it by bisecting.
> > > > Kernel up to tag 0f4dafc0563c6c49e17fe14b3f5f356e4c4b8806 has the 20% regression.
> > > > Kernel up to tag 6e90aa972dda8ef86155eefcdbdc8d34165b9f39 has no regression.
> > > > 
> > > > Any bisect between the above 2 tags causes a kernel hang. I tried to check out points between
> > > > these 2 tags manually many times and the kernel always panicked.
> > > 
> > > Where is the kernel panicking?  The changeset right after the last one
> > > above, bc87d2fe7a1190f1c257af8a91fc490b1ee35954, is a change to efivars;
> > > are you using that in your .config?
> > > 
> > > > All patches between the 2 tags are on the kobject restructure. I guess such a restructure
> > > > creates more cache misses on the 16-core tigerton.
> > > 
> > > Nothing should be creating kobjects on a normal load like this, so a
> > > regression seems very odd.  Unless the /sys/kernel/uids/ stuff is
> > > triggering this?
> > > 
> > > Do you have a link to where I can get hackbench (google seems to find
> > > lots of reports with it, but not the source itself), so I can test to
> > > see if we are accidentally creating kobjects with this load?
> > 
> > The version that I see referenced most often (unscientifically :)
> > is somewhere under people.redhat.com/mingo/, like so:
> > http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c
> 
> Great, thanks for the link.
> 
> In using that version, I do not see any kobjects being created at all
> when running the program.  So I don't see how a kobject change could
> have caused any slowdown.
> 
> Yanmin, is the above link the version you are using?
Yes.

> 
> Hm, running with "hackbench 100 process 2000" seems to lock up my
> laptop, maybe I shouldn't run 4000 tasks at once on such a memory
> starved machine...
The issue doesn't exist on my 8-core stoakley or on tulsa, so I don't think
you could reproduce it on a laptop.

From the oprofile data, perhaps we need to dig into SLUB first.

-yanmin




* Re: hackbench regression since 2.6.25-rc
  2008-03-14  0:16     ` Christoph Lameter
@ 2008-03-14  3:04       ` Zhang, Yanmin
  2008-03-14  3:30         ` Zhang, Yanmin
  2008-03-14  6:32         ` Christoph Lameter
  0 siblings, 2 replies; 31+ messages in thread
From: Zhang, Yanmin @ 2008-03-14  3:04 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Kay Sievers, Greg Kroah-Hartman, LKML, Ingo Molnar

On Thu, 2008-03-13 at 17:16 -0700, Christoph Lameter wrote:
> Could you recompile the kernel with slub performance statistics and post 
> the output of
> 
> slabinfo -AD
Before testing with kernel 2.6.25-rc5:
Name                   Objects    Alloc     Free   %Fast
vm_area_struct            2795   135185   132587  93  29
:0004096                    25   119045   119043  99  98
:0000064                 12257   119671   107742  98  50
:0000192                  3312    78563    75370  92  21
:0000128                  4648    48143    43738  97  53
dentry                   15217    46675    31527  95  72
:0000080                 12784    33674    21206  99  97
:0000016                  4367    25871    23705  99  78
:0000096                  3001    22591    20084  99  92
buffer_head               5536    18147    12884  97  42
anon_vma                  1729    14948    14130  99  73


After testing:
Name                   Objects    Alloc     Free   %Fast
:0000192                  3428 80093958 80090708  92   8
:0000512                   374 80016030 80015715  68   7
vm_area_struct            2875   224524   221868  94  20
:0000064                 12408   134273   122227  98  47
:0004096                    24   127397   127395  99  98
:0000128                  4596    57837    53432  97  48
dentry                   15659    51402    35824  95  64
:0000016                  4584    29327    27161  99  76
:0000080                 12784    33674    21206  99  97
:0000096                  2998    26264    23757  99  93


So the 192- and 512-byte caches are very active and their fast free percentage is low.
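A mechanical way to pull those rows out of a table like the one above; the helper name and thresholds are made up for illustration (the columns are: name, objects, allocs, frees, fast-alloc %, fast-free %):

```shell
# Flag caches with very heavy allocation traffic but a low fast-free
# percentage (thresholds are illustrative).
filter_hot_slow_free() {
    awk 'NR > 1 && $3 > 1000000 && $6 < 10 { print $1, $3, $6 }'
}

# prints the :0000192 and :0000512 rows only
filter_hot_slow_free <<'EOF'
Name                   Objects    Alloc     Free   %Fast
:0000192                  3428 80093958 80090708  92   8
:0000512                   374 80016030 80015715  68   7
vm_area_struct            2875   224524   221868  94  20
EOF
```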

-yanmin




* Re: hackbench regression since 2.6.25-rc
  2008-03-14  3:04       ` Zhang, Yanmin
@ 2008-03-14  3:30         ` Zhang, Yanmin
  2008-03-14  5:28           ` Zhang, Yanmin
  2008-03-14  6:34           ` Christoph Lameter
  2008-03-14  6:32         ` Christoph Lameter
  1 sibling, 2 replies; 31+ messages in thread
From: Zhang, Yanmin @ 2008-03-14  3:30 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Kay Sievers, Greg Kroah-Hartman, LKML, Ingo Molnar

On Fri, 2008-03-14 at 11:04 +0800, Zhang, Yanmin wrote:
> On Thu, 2008-03-13 at 17:16 -0700, Christoph Lameter wrote:
> > Could you recompile the kernel with slub performance statistics and post 
> > the output of
> > 
> > slabinfo -AD
> Before testing with kernel 2.6.25-rc5:
> Name                   Objects    Alloc     Free   %Fast
> vm_area_struct            2795   135185   132587  93  29
> :0004096                    25   119045   119043  99  98
> :0000064                 12257   119671   107742  98  50
> :0000192                  3312    78563    75370  92  21
> :0000128                  4648    48143    43738  97  53
> dentry                   15217    46675    31527  95  72
> :0000080                 12784    33674    21206  99  97
> :0000016                  4367    25871    23705  99  78
> :0000096                  3001    22591    20084  99  92
> buffer_head               5536    18147    12884  97  42
> anon_vma                  1729    14948    14130  99  73
> 
> 
> After testing:
> Name                   Objects    Alloc     Free   %Fast
> :0000192                  3428 80093958 80090708  92   8
> :0000512                   374 80016030 80015715  68   7
> vm_area_struct            2875   224524   221868  94  20
> :0000064                 12408   134273   122227  98  47
> :0004096                    24   127397   127395  99  98
> :0000128                  4596    57837    53432  97  48
> dentry                   15659    51402    35824  95  64
> :0000016                  4584    29327    27161  99  76
> :0000080                 12784    33674    21206  99  97
> :0000096                  2998    26264    23757  99  93
> 
> 
> So the 192- and 512-byte caches are very active and their fast free percentage is low.
On my 8-core stoakley, there is no such regression. Below is the data after testing.

[root@lkp-st02-x8664 ~]# slabinfo -AD
Name                   Objects    Alloc     Free   %Fast
:0000192                  3170 80055388 80052280  92   1 
:0000512                   316 80012750 80012466  69   1 
vm_area_struct            2642   194700   192193  94  16 
:0000064                  3846    74468    70820  97  53 
:0004096                    15    69014    69012  98  97 
:0000128                  1447    32920    31541  91   8 
dentry                   13485    33060    19652  92  42 
:0000080                 10639    23377    12953  98  98 
:0000096                  1662    16496    15036  99  94 
:0000832                   232    14422    14203  85  10 
:0000016                  2733    15102    13372  99  14 

So the 192- and 512-byte caches' fast free percentage is even smaller than on tigerton.

Oprofile data on stoakley:

CPU: Core 2, speed 2660 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        app name                 symbol name
2897265  25.7603  linux-2.6.25-rc5         __slab_alloc
2689900  23.9166  linux-2.6.25-rc5         add_partial
629355    5.5957  linux-2.6.25-rc5         copy_user_generic_string
552309    4.9107  linux-2.6.25-rc5         __slab_free
514792    4.5771  linux-2.6.25-rc5         sock_alloc_send_skb
500879    4.4534  linux-2.6.25-rc5         unix_stream_recvmsg
274798    2.4433  linux-2.6.25-rc5         __kmalloc_track_caller
230283    2.0475  linux-2.6.25-rc5         kfree
222286    1.9764  linux-2.6.25-rc5         unix_stream_sendmsg
217413    1.9331  linux-2.6.25-rc5         memset_c
211589    1.8813  linux-2.6.25-rc5         kmem_cache_alloc
151500    1.3470  linux-2.6.25-rc5         system_call
132262    1.1760  linux-2.6.25-rc5         sock_def_readable
123130    1.0948  linux-2.6.25-rc5         kmem_cache_free
109518    0.9738  linux-2.6.25-rc5         sock_wfree
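The two SLUB symbols dominate this profile; a small helper (illustrative, matching the opreport column layout quoted above, where the symbol is the last field) totals their share:

```shell
# Total the percentage column for the two dominant SLUB symbols in an
# opreport-style listing.
slub_share() {
    awk '$NF == "__slab_alloc" || $NF == "add_partial" { s += $2 }
         END { printf "%.1f%%\n", s }'
}

slub_share <<'EOF'   # -> 49.7%
2897265  25.7603  linux-2.6.25-rc5         __slab_alloc
2689900  23.9166  linux-2.6.25-rc5         add_partial
629355    5.5957  linux-2.6.25-rc5         copy_user_generic_string
EOF
```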


yanmin




* Re: hackbench regression since 2.6.25-rc
  2008-03-14  0:50       ` Zhang, Yanmin
@ 2008-03-14  5:01         ` Greg KH
  2008-03-14  5:32           ` Zhang, Yanmin
  0 siblings, 1 reply; 31+ messages in thread
From: Greg KH @ 2008-03-14  5:01 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: Randy Dunlap, Kay Sievers, LKML

On Fri, Mar 14, 2008 at 08:50:19AM +0800, Zhang, Yanmin wrote:
> On Thu, 2008-03-13 at 10:12 -0700, Greg KH wrote:
> > On Thu, Mar 13, 2008 at 09:19:21AM -0700, Randy Dunlap wrote:
> > > On Thu, 13 Mar 2008 08:14:13 -0700 Greg KH wrote:
> > > 
> > > > On Thu, Mar 13, 2008 at 03:46:57PM +0800, Zhang, Yanmin wrote:
> > > > > Comparing with 2.6.24, on my 16-core tigerton, hackbench process mode has about
> > > > > 40% regression with 2.6.25-rc1, and more than 20% regression with kernel
> > > > > 2.6.25-rc4, because rc4 includes the revert of the scheduler load-balance patch.
> > > > > 
> > > > > Command to start it:
> > > > > #hackbench 100 process 2000
> > > > > I ran it 3 times and summed the values.
> > > > > 
> > > > > I tried to investigate it by bisecting.
> > > > > Kernel up to tag 0f4dafc0563c6c49e17fe14b3f5f356e4c4b8806 has the 20% regression.
> > > > > Kernel up to tag 6e90aa972dda8ef86155eefcdbdc8d34165b9f39 has no regression.
> > > > > 
> > > > > Any bisect between the above 2 tags causes a kernel hang. I tried to check out points between
> > > > > these 2 tags manually many times and the kernel always panicked.
> > > > 
> > > > Where is the kernel panicking?  The changeset right after the last one
> > > > above, bc87d2fe7a1190f1c257af8a91fc490b1ee35954, is a change to efivars;
> > > > are you using that in your .config?
> > > > 
> > > > > All patches between the 2 tags are on the kobject restructure. I guess such a restructure
> > > > > creates more cache misses on the 16-core tigerton.
> > > > 
> > > > Nothing should be creating kobjects on a normal load like this, so a
> > > > regression seems very odd.  Unless the /sys/kernel/uids/ stuff is
> > > > triggering this?
> > > > 
> > > > Do you have a link to where I can get hackbench (google seems to find
> > > > lots of reports with it, but not the source itself), so I can test to
> > > > see if we are accidentally creating kobjects with this load?
> > > 
> > > The version that I see referenced most often (unscientifically :)
> > > is somewhere under people.redhat.com/mingo/, like so:
> > > http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c
> > 
> > Great, thanks for the link.
> > 
> > In using that version, I do not see any kobjects being created at all
> > when running the program.  So I don't see how a kobject change could
> > have caused any slowdown.
> > 
> > Yanmin, is the above link the version you are using?
> Yes.
> 
> > 
> > Hm, running with "hackbench 100 process 2000" seems to lock up my
> > laptop, maybe I shouldn't run 4000 tasks at once on such a memory
> > starved machine...
> The issue doesn't exist on my 8-core stoakley and on tulsa. So I don't think
> you could reproduce it on laptop.

But I should see the kobjects being created and destroyed if, as you are
thinking, that is the problem here, right?

And I don't see any, so I'm thinking that this is probably something
else.

I'm still interested in why your machine was oopsing when bisecting
through the kobject commits.  I thought it all should have worked
without problems, as I spent enough time trying to ensure it was so...

thanks,

greg k-h


* Re: hackbench regression since 2.6.25-rc
  2008-03-14  3:30         ` Zhang, Yanmin
@ 2008-03-14  5:28           ` Zhang, Yanmin
  2008-03-14  6:39             ` Christoph Lameter
  2008-03-14  6:34           ` Christoph Lameter
  1 sibling, 1 reply; 31+ messages in thread
From: Zhang, Yanmin @ 2008-03-14  5:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Kay Sievers, Greg Kroah-Hartman, LKML, Ingo Molnar

On Fri, 2008-03-14 at 11:30 +0800, Zhang, Yanmin wrote:
> On Fri, 2008-03-14 at 11:04 +0800, Zhang, Yanmin wrote:
> > On Thu, 2008-03-13 at 17:16 -0700, Christoph Lameter wrote:
> > > Could you recompile the kernel with slub performance statistics and post 
> > > the output of
> > > 
> > > slabinfo -AD
> > Before testing with kernel 2.6.25-rc5:
> > Name                   Objects    Alloc     Free   %Fast
> > vm_area_struct            2795   135185   132587  93  29
> > :0004096                    25   119045   119043  99  98
> > :0000064                 12257   119671   107742  98  50
> > :0000192                  3312    78563    75370  92  21
> > :0000128                  4648    48143    43738  97  53
> > dentry                   15217    46675    31527  95  72
> > :0000080                 12784    33674    21206  99  97
> > :0000016                  4367    25871    23705  99  78
> > :0000096                  3001    22591    20084  99  92
> > buffer_head               5536    18147    12884  97  42
> > anon_vma                  1729    14948    14130  99  73
> > 
> > 
> > After testing:
> > Name                   Objects    Alloc     Free   %Fast
> > :0000192                  3428 80093958 80090708  92   8
> > :0000512                   374 80016030 80015715  68   7
> > vm_area_struct            2875   224524   221868  94  20
> > :0000064                 12408   134273   122227  98  47
> > :0004096                    24   127397   127395  99  98
> > :0000128                  4596    57837    53432  97  48
> > dentry                   15659    51402    35824  95  64
> > :0000016                  4584    29327    27161  99  76
> > :0000080                 12784    33674    21206  99  97
> > :0000096                  2998    26264    23757  99  93
> > 
> > 
> > So blocks 192 and 512 are very active and their fast free percentage is low.
> On my 8-core stoakley, there is no such regression. Below data is after testing.
> 
> [root@lkp-st02-x8664 ~]# slabinfo -AD
> Name                   Objects    Alloc     Free   %Fast
> :0000192                  3170 80055388 80052280  92   1 
> :0000512                   316 80012750 80012466  69   1 
> vm_area_struct            2642   194700   192193  94  16 
> :0000064                  3846    74468    70820  97  53 
> :0004096                    15    69014    69012  98  97 
> :0000128                  1447    32920    31541  91   8 
> dentry                   13485    33060    19652  92  42 
> :0000080                 10639    23377    12953  98  98 
> :0000096                  1662    16496    15036  99  94 
> :0000832                   232    14422    14203  85  10 
> :0000016                  2733    15102    13372  99  14 
> 
> So the fast free percentages of blocks 192 and 512 are even smaller than the ones on tigerton.
> 
> Oprofile data on stoakley:
> 
> CPU: Core 2, speed 2660 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> samples  %        app name                 symbol name
> 2897265  25.7603  linux-2.6.25-rc5         __slab_alloc
> 2689900  23.9166  linux-2.6.25-rc5         add_partial
> 629355    5.5957  linux-2.6.25-rc5         copy_user_generic_string
> 552309    4.9107  linux-2.6.25-rc5         __slab_free
> 514792    4.5771  linux-2.6.25-rc5         sock_alloc_send_skb
> 500879    4.4534  linux-2.6.25-rc5         unix_stream_recvmsg
> 274798    2.4433  linux-2.6.25-rc5         __kmalloc_track_caller
> 230283    2.0475  linux-2.6.25-rc5         kfree
> 222286    1.9764  linux-2.6.25-rc5         unix_stream_sendmsg
> 217413    1.9331  linux-2.6.25-rc5         memset_c
> 211589    1.8813  linux-2.6.25-rc5         kmem_cache_alloc
> 151500    1.3470  linux-2.6.25-rc5         system_call
> 132262    1.1760  linux-2.6.25-rc5         sock_def_readable
> 123130    1.0948  linux-2.6.25-rc5         kmem_cache_free
> 109518    0.9738  linux-2.6.25-rc5         sock_wfree
On tigerton, if I add "slub_max_order=3 slub_min_objects=16" to kernel boot cmdline,
the result is improved significantly and it takes just 1/10 of the original testing time.

Below is the new output of slabinfo -AD.

Name                   Objects    Alloc     Free   %Fast
:0000192                  3192 80087199 80084141  92   8
kmalloc-512                773 80016203 80015888  97   9
vm_area_struct            2787   223100   220525  94  17
:0004096                    68   118322   118320  99  98
:0000064                 12215   123575   111669  98  42
:0000128                  4616    53826    49422  97  45
dentry                   12373    49568    37286  95  65
:0000080                 12823    33755    21206  99  97


So kmalloc-512 is the key.


Then, I tested it on stoakley with the same kernel command line. The improvement is about 50%.
One important thing is that without the boot parameter, hackbench on stoakley takes only 1/4
of the time of the run on tigerton. With the boot parameter, hackbench on tigerton is faster
than the one on stoakley.

Is it possible to initialize slub_min_objects based on the possible cpu number? I mean,
cpu_possible_map(). We could calculate slub_min_objects by a formula.

-yanmin



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: hackbench regression since 2.6.25-rc
  2008-03-14  5:01         ` Greg KH
@ 2008-03-14  5:32           ` Zhang, Yanmin
  0 siblings, 0 replies; 31+ messages in thread
From: Zhang, Yanmin @ 2008-03-14  5:32 UTC (permalink / raw)
  To: Greg KH; +Cc: Randy Dunlap, Kay Sievers, LKML

On Thu, 2008-03-13 at 22:01 -0700, Greg KH wrote:
> On Fri, Mar 14, 2008 at 08:50:19AM +0800, Zhang, Yanmin wrote:
> > On Thu, 2008-03-13 at 10:12 -0700, Greg KH wrote:
> > > On Thu, Mar 13, 2008 at 09:19:21AM -0700, Randy Dunlap wrote:
> > > > On Thu, 13 Mar 2008 08:14:13 -0700 Greg KH wrote:
> > > > 
> > > > > On Thu, Mar 13, 2008 at 03:46:57PM +0800, Zhang, Yanmin wrote:
> > > > > > Comparing with 2.6.24, on my 16-core tigerton, hackbench process mode has about
> > > > > > 40% regression with 2.6.25-rc1, and more than 20% regression with kernel
> > > > > > 2.6.25-rc4, because rc4 includes the reverting patch of scheduler load balance.
> > > > > > 
> > > > > > Command to start it.
> > > > > > #hackbench 100 process 2000
> > > > > > I ran it for 3 times and sum the values.
> > > > > > 
> > > > > > I tried to investiagte it by bisect.
> > > > > > Kernel up to tag 0f4dafc0563c6c49e17fe14b3f5f356e4c4b8806 has the 20% regression.
> > > > > > Kernel up to tag 6e90aa972dda8ef86155eefcdbdc8d34165b9f39 hasn't regression.
> > > > > > 
> > > > > > Any bisect between above 2 tags cause kernel hang. I tried to checkout to a point between
> > > > > > these 2 tags for many times manually and kernel always paniced.
> > > > > 
> > > > > Where is the kernel panicing?  The changeset right after the last one
> > > > > above: bc87d2fe7a1190f1c257af8a91fc490b1ee35954, is a change to efivars,
> > > > > are you using that in your .config?
> > > > > 
> > > > > > All patches between the 2 tags are on kobject restructure. I guess such restructure
> > > > > > creates more cache miss on the 16-core tigerton.
> > > > > 
> > > > > Nothing should be creating kobjects on a normal load like this, so a
> > > > > regression seems very odd.  Unless the /sys/kernel/uids/ stuff is
> > > > > triggering this?
> > > > > 
> > > > > Do you have a link to where I can get hackbench (google seems to find
> > > > > lots of reports with it, but not the source itself), so I can test to
> > > > > see if we are accidentally creating kobjects with this load?
> > > > 
> > > > The version that I see referenced most often (unscientifically :)
> > > > is somewhere under people.redhat.com/mingo/, like so:
> > > > http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c
> > > 
> > > Great, thanks for the link.
> > > 
> > > In using that version, I do not see any kobjects being created at all
> > > when running the program.  So I don't see how a kobject change could
> > > have caused any slowdown.
> > > 
> > > Yanmin, is the above link the version you are using?
> > Yes.
> > 
> > > 
> > > Hm, running with "hackbench 100 process 2000" seems to lock up my
> > > laptop, maybe I shouldn't run 4000 tasks at once on such a memory
> > > starved machine...
> > The issue doesn't exist on my 8-core stoakley or on tulsa, so I don't think
> > you could reproduce it on a laptop.
> 
> But I should see kobjects being created and destroyed if, as you think,
> that is the problem here, right?
Not just thinking; that's based on lots of testing. But as you know, performance
work is often complicated. Now I think kernel image changes may have altered cache line alignment.

> 
> And I don't see any, so I'm thinking that this is probably something
> else.
Yes.

> 
> I'm still interested in why your machine was oopsing when bisecting
> through the kobject commits.  I thought it all should have worked
> without problems, as I spent enough time trying to ensure it was so...
The kernel panicked after printing a warning in kref_get when executing add_disk
in rd_init.

Thanks,
Yanmin



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: hackbench regression since 2.6.25-rc
  2008-03-14  3:04       ` Zhang, Yanmin
  2008-03-14  3:30         ` Zhang, Yanmin
@ 2008-03-14  6:32         ` Christoph Lameter
  2008-03-14  7:14           ` Zhang, Yanmin
  1 sibling, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2008-03-14  6:32 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Andrew Morton, Kay Sievers, Greg Kroah-Hartman, LKML, Ingo Molnar


On Fri, 14 Mar 2008, Zhang, Yanmin wrote:

> After testing:
> Name                   Objects    Alloc     Free   %Fast
> :0000192                  3428 80093958 80090708  92   8
> :0000512                   374 80016030 80015715  68   7

Ahhh... Okay those slabs did not change for 2.6.25-rc. Is there
really a difference to 2.6.24?

> So blocks 192 and 512 are very active and their fast free percentage is low.

Yes but that is to be expected given that hackbench does allocate objects 
and then passes them to other processors for freeing.

Could you get me more details on the two critical slabs?

Do slabinfo -a and then pick one alias for each of those sizes.

Then do

slabinfo skbuff_head (whatever alias you want to use to refer to the slab)

for each of them. Should give some more insight as to how slub behaves
with these two slab caches.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: hackbench regression since 2.6.25-rc
  2008-03-14  3:30         ` Zhang, Yanmin
  2008-03-14  5:28           ` Zhang, Yanmin
@ 2008-03-14  6:34           ` Christoph Lameter
  2008-03-14  7:23             ` Zhang, Yanmin
  1 sibling, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2008-03-14  6:34 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Andrew Morton, Kay Sievers, Greg Kroah-Hartman, LKML, Ingo Molnar

On Fri, 14 Mar 2008, Zhang, Yanmin wrote:

> > So blocks 192 and 512 are very active and their fast free percentage 
> > is low.
> On my 8-core stoakley, there is no such regression. Below data is after testing.

Ok get the detailed statistics for this configuration as well. Then we 
can see what kind of slub behavior changes between both configurations.

The 16p is really one node? No strange variances in memory latencies? 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: hackbench regression since 2.6.25-rc
  2008-03-14  5:28           ` Zhang, Yanmin
@ 2008-03-14  6:39             ` Christoph Lameter
  2008-03-14  7:29               ` Zhang, Yanmin
  0 siblings, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2008-03-14  6:39 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Andrew Morton, Kay Sievers, Greg Kroah-Hartman, LKML, Ingo Molnar

On Fri, 14 Mar 2008, Zhang, Yanmin wrote:

> On tigerton, if I add "slub_max_order=3 slub_min_objects=16" to kernel 
> boot cmdline, the result is improved significantly and it takes just 
> 1/10 of the original testing time.

Hmmm... That means the updates to SLUB in mm will fix the regression that 
you are seeing, because there we can use larger slab orders with fallback
for all slab caches. But I am still interested in the details of
slub behavior on the 16p.

> So kmalloc-512 is the key.

Yeah in 2.6.26-rc kmalloc-512 has 8 objects per slab. The mm version 
increases that with a larger allocation size.

> > Then, I tested it on stoakley with the same kernel command line. 
> > The improvement is about 50%. One important thing is that without the boot 
> > parameter, hackbench on stoakley takes only 1/4 of the time of the run on 
> > tigerton. With the boot parameter, hackbench on tigerton is faster than 
> > the one on stoakley.
> 
> > Is it possible to initialize slub_min_objects based on the possible cpu 
> > number? I mean, cpu_possible_map(). We could calculate slub_min_objects 
> > by a formula.

Hmmm... Interesting. Lets first get the details for 2.6.25-rc. Then we can 
start toying around with the slub version in mm to configure slub in such 
a way that we get best results on both machines.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: hackbench regression since 2.6.25-rc
  2008-03-14  6:32         ` Christoph Lameter
@ 2008-03-14  7:14           ` Zhang, Yanmin
  2008-03-14 21:08             ` Christoph Lameter
  0 siblings, 1 reply; 31+ messages in thread
From: Zhang, Yanmin @ 2008-03-14  7:14 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Kay Sievers, Greg Kroah-Hartman, LKML, Ingo Molnar

On Thu, 2008-03-13 at 23:32 -0700, Christoph Lameter wrote:
> On Fri, 14 Mar 2008, Zhang, Yanmin wrote:
> 
> > After testing:
> > Name                   Objects    Alloc     Free   %Fast
> > :0000192                  3428 80093958 80090708  92   8
> > :0000512                   374 80016030 80015715  68   7
> 
> Ahhh... Okay those slabs did not change for 2.6.25-rc. Is there
> really a difference to 2.6.24?
As oprofile shows slub functions spend more than 80% cpu time, I would like
to focus on optimizing SLUB before going back to 2.6.24.

> 
> > So blocks 192 and 512 are very active and their fast free percentage is low.
> 
> Yes but that is to be expected given that hackbench does allocate objects 
> and then passes them to other processors for freeing.
> 
> Could you get me more details on the two critical slabs?
Yes, definitely.

> 
> Do slabinfo -a and then pick one alias for each of those sizes.
They are skbuff_head_cache and kmalloc-512.

> 
> Then do
> 
> slabinfo skbuff_head (whatever alias you want to use to refer to the slab)
Slabcache: skbuff_head_cache     Aliases:  7 Order :  0 Objects: 2848

Sizes (bytes)     Slabs              Debug                Memory
------------------------------------------------------------------------
Object :     192  Total  :     142   Sanity Checks : Off  Total:  581632
SlabObj:     192  Full   :     126   Redzoning     : Off  Used :  546816
SlabSiz:    4096  Partial:       0   Poisoning     : Off  Loss :   34816
Loss   :       0  CpuSlab:      16   Tracking      : Off  Lalig:       0
Align  :       8  Objects:      21   Tracing       : Off  Lpadd:    9088

skbuff_head_cache has no kmem_cache operations

skbuff_head_cache: Kernel object allocation
-----------------------------------------------------------------------
No Data

skbuff_head_cache: Kernel object freeing
------------------------------------------------------------------------
No Data

skbuff_head_cache: No NUMA information available.

Slab Perf Counter       Alloc     Free %Al %Fr
--------------------------------------------------
Fastpath             74048234  6259131  92   7
Slowpath              6031994 73818377   7  92
Page Alloc              19746    19603   0   0
Add partial                 0  4658709   0   5
Remove partial        4639106    19603   5   0
RemoteObj/SlabFrozen        0  3887872   0   4
Total                80080228 80077508

Refill  6031979
Deactivate Full=4658836(100%) Empty=0(0%) ToHead=0(0%) ToTail=0(0%)




Slabcache: kmalloc-512           Aliases:  1 Order :  0 Objects: 365

Sizes (bytes)     Slabs              Debug                Memory
------------------------------------------------------------------------
Object :     512  Total  :      61   Sanity Checks : Off  Total:  249856
SlabObj:     512  Full   :      36   Redzoning     : Off  Used :  186880
SlabSiz:    4096  Partial:       9   Poisoning     : Off  Loss :   62976
Loss   :       0  CpuSlab:      16   Tracking      : Off  Lalig:       0
Align  :       8  Objects:       8   Tracing       : Off  Lpadd:       0

kmalloc-512 has no kmem_cache operations

kmalloc-512: Kernel object allocation
-----------------------------------------------------------------------
No Data

kmalloc-512: Kernel object freeing
------------------------------------------------------------------------
No Data

kmalloc-512: No NUMA information available.

Slab Perf Counter       Alloc     Free %Al %Fr
--------------------------------------------------
Fastpath             55039159  5006829  68   6
Slowpath             24975754 75007769  31  93
Page Alloc              73840    73779   0   0
Add partial                 0 24341085   0  30
Remove partial       24267297    73779  30   0
RemoteObj/SlabFrozen        0   953614   0   1
Total                80014913 80014598

Refill 24975738
Deactivate Full=24341121(100%) Empty=0(0%) ToHead=0(0%) ToTail=0(0%)


> 
> for each of them. Should give some more insight as to how slub behaves
> with these two slab caches.
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: hackbench regression since 2.6.25-rc
  2008-03-14  6:34           ` Christoph Lameter
@ 2008-03-14  7:23             ` Zhang, Yanmin
  2008-03-14 21:06               ` Christoph Lameter
  0 siblings, 1 reply; 31+ messages in thread
From: Zhang, Yanmin @ 2008-03-14  7:23 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Kay Sievers, Greg Kroah-Hartman, LKML, Ingo Molnar

On Thu, 2008-03-13 at 23:34 -0700, Christoph Lameter wrote:
> On Fri, 14 Mar 2008, Zhang, Yanmin wrote:
> 
> > > So blocks 192 and 512 are very active and their fast free percentage 
> > > is low.
> > On my 8-core stoakley, there is no such regression. Below data is after testing.
> 
> Ok get the detailed statistics for this configuration as well. Then we 
> can see what kind of slub behavior changes between both configurations.
I pasted that data in a prior email; I'll copy it below.

On my 8-core stoakley, there is no such regression. Below data is after testing.
[root@lkp-st02-x8664 ~]# slabinfo -AD
Name                   Objects    Alloc     Free   %Fast
:0000192                  3170 80055388 80052280  92   1 
:0000512                   316 80012750 80012466  69   1 
vm_area_struct            2642   194700   192193  94  16 
:0000064                  3846    74468    70820  97  53 
:0004096                    15    69014    69012  98  97 
:0000128                  1447    32920    31541  91   8 
dentry                   13485    33060    19652  92  42 
:0000080                 10639    23377    12953  98  98 
:0000096                  1662    16496    15036  99  94 
:0000832                   232    14422    14203  85  10 
:0000016                  2733    15102    13372  99  14 

I ran it many times and got similar output from slabinfo.

> 
> The 16p is really one node?
Yes. It's an SMP machine.

>  No strange variances in memory latencies? 
No.



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: hackbench regression since 2.6.25-rc
  2008-03-14  6:39             ` Christoph Lameter
@ 2008-03-14  7:29               ` Zhang, Yanmin
  2008-03-14 21:05                 ` Christoph Lameter
  0 siblings, 1 reply; 31+ messages in thread
From: Zhang, Yanmin @ 2008-03-14  7:29 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Kay Sievers, Greg Kroah-Hartman, LKML, Ingo Molnar

On Thu, 2008-03-13 at 23:39 -0700, Christoph Lameter wrote:
> On Fri, 14 Mar 2008, Zhang, Yanmin wrote:
> 
> > On tigerton, if I add "slub_max_order=3 slub_min_objects=16" to kernel 
> > boot cmdline, the result is improved significantly and it takes just 
> > 1/10 of the original testing time.
> 
> Hmmm... That means the updates to SLUB in mm will fix the regression that 
> you are seeing, because there we can use larger slab orders with fallback
> for all slab caches. But I am still interested in the details of
> slub behavior on the 16p.
> 
> > So kmalloc-512 is the key.
> 
> Yeah in 2.6.26-rc kmalloc-512 has 8 objects per slab. The mm version 
> increases that with a larger allocation size.
Would you like to give me a pointer to the patch? Is it one patch, or many patches?

> 
> > Then, I tested it on stoakley with the same kernel command line. 
> > The improvement is about 50%. One important thing is that without the boot 
> > parameter, hackbench on stoakley takes only 1/4 of the time of the run on 
> > tigerton. With the boot parameter, hackbench on tigerton is faster than 
> > the one on stoakley.
> > 
> > Is it possible to initialize slub_min_objects based on the possible cpu 
> > number? I mean, cpu_possible_map(). We could calculate slub_min_objects 
> > by a formula.
> 
> Hmmm... Interesting. Lets first get the details for 2.6.25-rc. Then we can 
> start toying around with the slub version in mm to configure slub in such 
> a way that we get best results on both machines.
Boot parameter "slub_max_order=3 slub_min_objects=16" could boost performance
both on stoakley and on tigerton.

So should we scale slub_min_objects based on the possible cpu number? When a
machine has more cpus, more processes/threads will run on it, and they will
take more time competing for the same resources. The slab allocator is a
typical such resource.

-yanmin



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: hackbench regression since 2.6.25-rc
  2008-03-14  7:29               ` Zhang, Yanmin
@ 2008-03-14 21:05                 ` Christoph Lameter
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Lameter @ 2008-03-14 21:05 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Andrew Morton, Kay Sievers, Greg Kroah-Hartman, LKML, Ingo Molnar

On Fri, 14 Mar 2008, Zhang, Yanmin wrote:

> > Yeah in 2.6.26-rc kmalloc-512 has 8 objects per slab. The mm version 
> > increases that with a larger allocation size.
> Would you like to give me a pointer to the patch? Is it one patch, or many patches?

If you do a git pull on the slab-mm branch off my VM tree on kernel.org, then
you will have all you need. There will be an update in the next few days, though, 
since some of the data you gave me already suggests a couple of ways that
things may be made better.

> > Hmmm... Interesting. Lets first get the details for 2.6.25-rc. Then we can 
> > start toying around with the slub version in mm to configure slub in such 
> > a way that we get best results on both machines.
> Boot parameter "slub_max_order=3 slub_min_objects=16" could boost performance
> both on stoakley and on tigerton.

Well the current slab-mm tree already does order 4 and min_objects=60 
which is probably overkill. Next git push on slab-mm will reduce that
to the values you found to be sufficient.
 
> So should we scale slub_min_objects based on the possible cpu 
> number? When a machine has more cpus, more processes/threads 
> will run on it, and they will take more time competing for the same 
> resources. The slab allocator is a typical such resource.

We would have to do some experiments to see how cpu counts affect multiple 
benchmarks. If we can establish a consistent benefit from varying these
parameters based on processor count, then we should do so. There is already
one example in mm/vmstat.c of how this could be done.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: hackbench regression since 2.6.25-rc
  2008-03-14  7:23             ` Zhang, Yanmin
@ 2008-03-14 21:06               ` Christoph Lameter
  2008-03-17  7:50                 ` Zhang, Yanmin
  0 siblings, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2008-03-14 21:06 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Andrew Morton, Kay Sievers, Greg Kroah-Hartman, LKML, Ingo Molnar

On Fri, 14 Mar 2008, Zhang, Yanmin wrote:

> On my 8-core stoakley, there is no such regression. Below data is after 
> testing.

I was looking for the details on two slab caches. The comparison of the 
detailed statistics is likely very interesting because we will be able to
see how the doubling of processor counts affects the internal behavior of 
slub.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: hackbench regression since 2.6.25-rc
  2008-03-14  7:14           ` Zhang, Yanmin
@ 2008-03-14 21:08             ` Christoph Lameter
  2008-03-15  0:15               ` Christoph Lameter
  2008-03-17  3:05               ` Zhang, Yanmin
  0 siblings, 2 replies; 31+ messages in thread
From: Christoph Lameter @ 2008-03-14 21:08 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Andrew Morton, Kay Sievers, Greg Kroah-Hartman, LKML, Ingo Molnar

On Fri, 14 Mar 2008, Zhang, Yanmin wrote:

> > Ahhh... Okay those slabs did not change for 2.6.25-rc. Is there
> > really a difference to 2.6.24?
> As oprofile shows slub functions spend more than 80% cpu time, I would like
> to focus on optimizing SLUB before going back to 2.6.24.

I thought you wanted to address a regression vs 2.6.24?

> kmalloc-512: No NUMA information available.
> 
> Slab Perf Counter       Alloc     Free %Al %Fr
> --------------------------------------------------
> Fastpath             55039159  5006829  68   6
> Slowpath             24975754 75007769  31  93
> Page Alloc              73840    73779   0   0
> Add partial                 0 24341085   0  30
> Remove partial       24267297    73779  30   0

^^^ add partial/remove partial is likely the cause for 
trouble here. 30% is unacceptably high. The larger allocs will reduce the 
partial handling overhead. That is likely the effect that we see here.

> Refill 24975738

Duh refills at 50%? We could try to just switch to another slab instead of 
reusing the existing one. May also affect the add/remove partial 
situation.




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: hackbench regression since 2.6.25-rc
  2008-03-14 21:08             ` Christoph Lameter
@ 2008-03-15  0:15               ` Christoph Lameter
  2008-03-17  3:35                 ` Zhang, Yanmin
  2008-03-17  3:05               ` Zhang, Yanmin
  1 sibling, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2008-03-15  0:15 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Andrew Morton, Kay Sievers, Greg Kroah-Hartman, LKML, Ingo Molnar

Here is a patch to just not perform refills but switch slabs instead. 
Could check what effect doing so has on the statistics you see on the 16p?

---
 mm/slub.c |    5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2008-03-14 16:49:36.000000000 -0700
+++ linux-2.6/mm/slub.c	2008-03-14 16:50:04.000000000 -0700
@@ -1474,10 +1474,7 @@ static void *__slab_alloc(struct kmem_ca
 		goto new_slab;
 
 	slab_lock(c->page);
-	if (unlikely(!node_match(c, node)))
-		goto another_slab;
-
-	stat(c, ALLOC_REFILL);
+	goto another_slab;
 
 load_freelist:
 	object = c->page->freelist;

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: hackbench regression since 2.6.25-rc
  2008-03-14 21:08             ` Christoph Lameter
  2008-03-15  0:15               ` Christoph Lameter
@ 2008-03-17  3:05               ` Zhang, Yanmin
  1 sibling, 0 replies; 31+ messages in thread
From: Zhang, Yanmin @ 2008-03-17  3:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Kay Sievers, Greg Kroah-Hartman, LKML, Ingo Molnar

On Fri, 2008-03-14 at 14:08 -0700, Christoph Lameter wrote:
> On Fri, 14 Mar 2008, Zhang, Yanmin wrote:
> 
> > > Ahhh... Okay those slabs did not change for 2.6.25-rc. Is there
> > > really a difference to 2.6.24?
> > As oprofile shows slub functions spend more than 80% cpu time, I would like
> > to focus on optimizing SLUB before going back to 2.6.24.
> 
> I thought you wanted to address a regression vs 2.6.24?
Initially I wanted to do so, but oprofile data showed that neither 2.6.24 nor 2.6.25-rc
is good with hackbench on tigerton.

The slub_min_objects boot parameter can boost performance significantly, so I think
we need to optimize it before addressing the regression.

> 
> > kmalloc-512: No NUMA information available.
> > 
> > Slab Perf Counter       Alloc     Free %Al %Fr
> > --------------------------------------------------
> > Fastpath             55039159  5006829  68   6
> > Slowpath             24975754 75007769  31  93
> > Page Alloc              73840    73779   0   0
> > Add partial                 0 24341085   0  30
> > Remove partial       24267297    73779  30   0
> 
> ^^^ add partial/remove partial is likely the cause for 
> trouble here. 30% is unacceptably high. The larger allocs will reduce the 
> partial handling overhead. That is likely the effect that we see here.
> 
> > Refill 24975738
> 
> Duh refills at 50%? We could try to just switch to another slab instead of 
> reusing the existing one. May also affect the add/remove partial 
> situation.
> 
> 
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: hackbench regression since 2.6.25-rc
  2008-03-15  0:15               ` Christoph Lameter
@ 2008-03-17  3:35                 ` Zhang, Yanmin
  2008-03-17 17:27                   ` Christoph Lameter
  0 siblings, 1 reply; 31+ messages in thread
From: Zhang, Yanmin @ 2008-03-17  3:35 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Kay Sievers, Greg Kroah-Hartman, LKML, Ingo Molnar

On Fri, 2008-03-14 at 17:15 -0700, Christoph Lameter wrote:
> Here is a patch to just not perform refills but switch slabs instead. 
> Could check what effect doing so has on the statistics you see on the 16p?
> 
> ---
>  mm/slub.c |    5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c	2008-03-14 16:49:36.000000000 -0700
> +++ linux-2.6/mm/slub.c	2008-03-14 16:50:04.000000000 -0700
> @@ -1474,10 +1474,7 @@ static void *__slab_alloc(struct kmem_ca
>  		goto new_slab;
>  
>  	slab_lock(c->page);
> -	if (unlikely(!node_match(c, node)))
> -		goto another_slab;
> -
> -	stat(c, ALLOC_REFILL);
> +	goto another_slab;
>  
>  load_freelist:
>  	object = c->page->freelist;
It doesn't help much. In 2.6.25-rc5, REFILL counts refills both from c->page->freelist
and from another_slab, so its definition looks confusing. In the case of
hackbench, c->page->freelist is mostly NULL.

With #hackbench 100 process 2000, 100*20*2 (totally 4000) processes are started.
vmstat shows about 300~500 processes in the RUNNING state, so on the 16p tigerton
every processor runqueue has more than 20 runnable processes.


Below is the data with kernel 2.6.25-rc5+your_patch.

[ymzhang@lkp-tt01-x8664 ~]$ slabinfo kmalloc-512

Slabcache: kmalloc-512           Aliases:  1 Order :  0 Objects: 352

Sizes (bytes)     Slabs              Debug                Memory
------------------------------------------------------------------------
Object :     512  Total  :      56   Sanity Checks : Off  Total:  229376
SlabObj:     512  Full   :      36   Redzoning     : Off  Used :  180224
SlabSiz:    4096  Partial:       4   Poisoning     : Off  Loss :   49152
Loss   :       0  CpuSlab:      16   Tracking      : Off  Lalig:       0
Align  :       8  Objects:       8   Tracing       : Off  Lpadd:       0

kmalloc-512 has no kmem_cache operations

kmalloc-512: Kernel object allocation
-----------------------------------------------------------------------
No Data

kmalloc-512: Kernel object freeing
------------------------------------------------------------------------
No Data

kmalloc-512: No NUMA information available.

Slab Perf Counter       Alloc     Free %Al %Fr
--------------------------------------------------
Fastpath             55883575  6130576  69   7
Slowpath             24131134 73883818  30  92
Page Alloc              84844    84788   0   0
Add partial            270625 23860257   0  29
Remove partial       24046290    84752  30   0
RemoteObj/SlabFrozen   270825   439015   0   0
Total                80014709 80014394

Deactivate Full=23860293(98%) Empty=200(0%) ToHead=0(0%) ToTail=270625(1%)



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: hackbench regression since 2.6.25-rc
  2008-03-14 21:06               ` Christoph Lameter
@ 2008-03-17  7:50                 ` Zhang, Yanmin
  2008-03-17 17:32                   ` Christoph Lameter
  0 siblings, 1 reply; 31+ messages in thread
From: Zhang, Yanmin @ 2008-03-17  7:50 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Kay Sievers, Greg Kroah-Hartman, LKML, Ingo Molnar

On Fri, 2008-03-14 at 14:06 -0700, Christoph Lameter wrote:
> On Fri, 14 Mar 2008, Zhang, Yanmin wrote:
> 
> > On my 8-core stoakley, there is no such regression. Below is the data after 
> > testing.
> 
> I was looking for the details on two slab caches. The comparison of the 
> detailed statistics is likely very interesting because we will be able to
> see how the doubling of the processor count affects the internal behavior of 
> slub.
I collected more data on the 16p tigerton to try to find a possible relationship
between slub_min_objects and the processor count. Kernel is 2.6.25-rc5.

Command\slub_min_objects	|	slub_min_objects=8  |   16  |   32   |  64
-------------------------------------------------------------------------------------
./hackbench 100 process 2000    |  		250 seconds |	23  |   18.6 |  17.5
./hackbench 200 process 2000	|	      	532         |	44  |   35.6 |  33.5

                                
The first command line will start 4000 processes and the second will start 8000 processes.

As the problematic slab is kmalloc-512, slub_min_objects=8 is just the default configuration.

Oprofile data shows the combined ratio of __slab_alloc+__slab_free+add_partial doesn't differ
between the two command lines under the same kernel boot parameters.


slub_min_objects	                                   | 8      |   16   |   32   |  64
--------------------------------------------------------------------------------------------
slab(__slab_alloc+__slab_free+add_partial) cpu utilization | 88.00% | 44.00% | 13.00% | 12.00%


With slub_min_objects=32 we get a reasonable value; beyond 32, the improvement
is very small. 32 is exactly possible_cpu_number*2 on my tigerton.

It's hard to say hackbench simulates real applications closely, but it discloses a possible
performance bottleneck. Last year we captured a similar kmalloc-2048 issue with tbench, so the
default slub_min_objects needs to be revised. On the other hand, a slab is allocated by alloc_page
when its size is equal to or more than half a page, so enlarging slub_min_objects won't create
too many slab page buffers.

As for NUMA, perhaps we could set slub_min_objects to 2*max_cpu_number_per_node.

-yanmin



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: hackbench regression since 2.6.25-rc
  2008-03-17  3:35                 ` Zhang, Yanmin
@ 2008-03-17 17:27                   ` Christoph Lameter
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Lameter @ 2008-03-17 17:27 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Andrew Morton, Kay Sievers, Greg Kroah-Hartman, LKML, Ingo Molnar

On Mon, 17 Mar 2008, Zhang, Yanmin wrote:

> That doesn't help much. In 2.6.25-rc5, REFILL counts refills both from c->page->freelist
> and from another_slab, so its definition looks confusing. In the case of
> hackbench, c->page->freelist is mostly NULL.

REFILL means refilling the per cpu objects from the freelist of the 
per cpu slab page. That could be bad because it requires taking the slab 
lock on the slab page.

> Slab Perf Counter       Alloc     Free %Al %Fr
> --------------------------------------------------
> Fastpath             55883575  6130576  69   7
> Slowpath             24131134 73883818  30  92
> Page Alloc              84844    84788   0   0
> Add partial            270625 23860257   0  29
> Remove partial       24046290    84752  30   0

Hmmm... I was hoping that the add/remove partial numbers would come down. Ok, 
let's forget about the patch. Increasing min_objects does the trick.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: hackbench regression since 2.6.25-rc
  2008-03-17  7:50                 ` Zhang, Yanmin
@ 2008-03-17 17:32                   ` Christoph Lameter
  2008-03-18  3:28                     ` Zhang, Yanmin
  0 siblings, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2008-03-17 17:32 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Andrew Morton, Kay Sievers, Greg Kroah-Hartman, LKML, Ingo Molnar

On Mon, 17 Mar 2008, Zhang, Yanmin wrote:

> slub_min_objects	                                   | 8      |   16   |   32   |  64
> --------------------------------------------------------------------------------------------
> slab(__slab_alloc+__slab_free+add_partial) cpu utilization | 88.00% | 44.00% | 13.00% | 12%
> 
> 
> With slub_min_objects=32 we get a reasonable value; beyond 32, the improvement
> is very small. 32 is exactly possible_cpu_number*2 on my tigerton.

Interesting. What is the optimal configuration for your 8p? Could you 
figure out the optimal configuration for a 4p and a 2p setup as well?

> It's hard to say hackbench simulates real applications closely, but it discloses a possible
> performance bottleneck. Last year we captured a similar kmalloc-2048 issue with tbench, so the
> default slub_min_objects needs to be revised. On the other hand, a slab is allocated by alloc_page
> when its size is equal to or more than half a page, so enlarging slub_min_objects won't create
> too many slab page buffers.
> 
> As for NUMA, perhaps we could set slub_min_objects to 2*max_cpu_number_per_node.

Well, for a 4k-cpu config this would set min_objects to 8192. So I think 
we could implement a form of logarithmic scaling based on the cpu 
count, comparable to what is done for the statistics update in vmstat.c

fls(num_online_cpus()) = 4

So maybe

slub_min_objects= 8 + (2 + fls(num_online_cpus())) * 4

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: hackbench regression since 2.6.25-rc
  2008-03-17 17:32                   ` Christoph Lameter
@ 2008-03-18  3:28                     ` Zhang, Yanmin
  2008-03-18  4:07                       ` Christoph Lameter
  0 siblings, 1 reply; 31+ messages in thread
From: Zhang, Yanmin @ 2008-03-18  3:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Kay Sievers, Greg Kroah-Hartman, LKML, Ingo Molnar

On Mon, 2008-03-17 at 10:32 -0700, Christoph Lameter wrote:
> On Mon, 17 Mar 2008, Zhang, Yanmin wrote:
> 
> > slub_min_objects	                                   | 8      |   16   |   32   |  64
> > --------------------------------------------------------------------------------------------
> > slab(__slab_alloc+__slab_free+add_partial) cpu utilization | 88.00% | 44.00% | 13.00% | 12%
> > 
> > 
> > With slub_min_objects=32 we get a reasonable value; beyond 32, the improvement
> > is very small. 32 is exactly possible_cpu_number*2 on my tigerton.
> 
> Interesting. What is the optimal configuration for your 8p? Could you 
> figure out the optimal configuration for a 4p and a 2p setup as well?
I used the 8-core stoakley for testing, and also tried booting the kernel with maxcpus=4 and maxcpus=2.

Just ran ./hackbench 100 process 2000.

processor number\slub_min_objects        |       slub_min_objects=8  |   16  |   32   | 64
--------------------------------------------------------------------------------------------
                  8p                     |               60 seconds  |   30  |   28.5 | 26.5
--------------------------------------------------------------------------------------------
                  4p                     |               50 seconds  |   43  |   42   | 
--------------------------------------------------------------------------------------------
                  2p                     |               92 seconds  |   79  |        |


As stoakley is just a multi-core machine without hyper-threading, I also tested on an old
harwich machine which has 4 physical processors and 8 logical processors with hyper-threading.
processor number\slub_min_objects        |       slub_min_objects=8  |   16  |   32   | 64
--------------------------------------------------------------------------------------------
                  8p                     |               78.7 seconds|   77.5|        |


> 
> > It's hard to say hackbench simulates real applications closely, but it discloses a possible
> > performance bottleneck. Last year we captured a similar kmalloc-2048 issue with tbench, so the
> > default slub_min_objects needs to be revised. On the other hand, a slab is allocated by alloc_page
> > when its size is equal to or more than half a page, so enlarging slub_min_objects won't create
> > too many slab page buffers.
> > 
> > As for NUMA, perhaps we could set slub_min_objects to 2*max_cpu_number_per_node.
> 
> Well, for a 4k-cpu config this would set min_objects to 8192. So I think 
> we could implement a form of logarithmic scaling based on the cpu 
> count, comparable to what is done for the statistics update in vmstat.c
> 
> fls(num_online_cpus()) = 4
num_online_cpus() as the input parameter is ok. A potential issue is how to handle cpu hot-plug.

When num_online_cpus()=16, fls(num_online_cpus())=5.

> 
> So maybe
> 
> slub_min_objects= 8 + (2 + fls(num_online_cpus())) * 4
So slub_min_objects = 8 + (1 + fls(num_online_cpus())) * 4, which yields 32 with 16 online cpus.



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: hackbench regression since 2.6.25-rc
  2008-03-18  3:28                     ` Zhang, Yanmin
@ 2008-03-18  4:07                       ` Christoph Lameter
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Lameter @ 2008-03-18  4:07 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Andrew Morton, Kay Sievers, Greg Kroah-Hartman, LKML, Ingo Molnar

On Tue, 18 Mar 2008, Zhang, Yanmin wrote:

> num_online_cpus as the input parameter is ok. A potential issue is how to consider cpu hot-plug.

Yeah I used nr_cpu_ids instead in the patchset that I cced you on. Maybe 
continue discussion on that thread?


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2008-03-18  4:08 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-03-13  7:46 hackbench regression since 2.6.25-rc Zhang, Yanmin
2008-03-13  8:48 ` Andrew Morton
2008-03-13  9:28   ` Zhang, Yanmin
2008-03-13  9:52     ` Andrew Morton
2008-03-14  0:16     ` Christoph Lameter
2008-03-14  3:04       ` Zhang, Yanmin
2008-03-14  3:30         ` Zhang, Yanmin
2008-03-14  5:28           ` Zhang, Yanmin
2008-03-14  6:39             ` Christoph Lameter
2008-03-14  7:29               ` Zhang, Yanmin
2008-03-14 21:05                 ` Christoph Lameter
2008-03-14  6:34           ` Christoph Lameter
2008-03-14  7:23             ` Zhang, Yanmin
2008-03-14 21:06               ` Christoph Lameter
2008-03-17  7:50                 ` Zhang, Yanmin
2008-03-17 17:32                   ` Christoph Lameter
2008-03-18  3:28                     ` Zhang, Yanmin
2008-03-18  4:07                       ` Christoph Lameter
2008-03-14  6:32         ` Christoph Lameter
2008-03-14  7:14           ` Zhang, Yanmin
2008-03-14 21:08             ` Christoph Lameter
2008-03-15  0:15               ` Christoph Lameter
2008-03-17  3:35                 ` Zhang, Yanmin
2008-03-17 17:27                   ` Christoph Lameter
2008-03-17  3:05               ` Zhang, Yanmin
2008-03-13 15:14 ` Greg KH
2008-03-13 16:19   ` Randy Dunlap
2008-03-13 17:12     ` Greg KH
2008-03-14  0:50       ` Zhang, Yanmin
2008-03-14  5:01         ` Greg KH
2008-03-14  5:32           ` Zhang, Yanmin
