LKML Archive on lore.kernel.org
* The performance and behaviour of the anti-fragmentation related patches
@ 2007-03-01 10:12 Mel Gorman
  2007-03-02  1:52 ` Bill Irwin
       [not found] ` <20070301160915.6da876c5.akpm@linux-foundation.org>
  0 siblings, 2 replies; 104+ messages in thread
From: Mel Gorman @ 2007-03-01 10:12 UTC (permalink / raw)
  To: akpm, npiggin, clameter, mingo, jschopp, arjan, torvalds, mbligh
  Cc: linux-mm, linux-kernel

Hi all,

I've posted up patches that implement the two generally accepted approaches
for reducing external fragmentation in the buddy allocator. The first
(list-based) works by grouping pages of related mobility together, across
all existing zones.  The second (zone-based) creates a zone from which only
pages that can be migrated or reclaimed are allocated.

List-based requires no configuration other than setting min_free_kbytes to
16384, but workloads might exist that break it down, such that different
page types become badly mixed. It was suggested that zones should instead
be used to partition memory between the types to avoid this breakdown. This
works well and its behaviour is predictable, but it requires configuration at
boot time, which is a lot less flexible than the list-based approach. Both
approaches had their proponents and detractors.
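To make the list-based idea concrete, here is a minimal sketch (illustrative only, not kernel code; all names are invented) of keeping one free list per mobility type so that allocations of the same type cluster together, with fallback to other lists being the event that causes mixing:

```python
from collections import defaultdict

MIGRATE_TYPES = ("movable", "reclaimable", "unmovable")

class MobilityFreeLists:
    def __init__(self):
        # migratetype -> list of free page blocks of that type
        self.free = defaultdict(list)

    def free_block(self, block, migratetype):
        self.free[migratetype].append(block)

    def alloc_block(self, migratetype):
        # Prefer the matching list; fall back to the other types only
        # when the preferred list is empty (fallbacks cause mixing).
        if self.free[migratetype]:
            return self.free[migratetype].pop()
        for other in MIGRATE_TYPES:
            if other != migratetype and self.free[other]:
                return self.free[other].pop()
        return None

lists = MobilityFreeLists()
lists.free_block("block-A", "movable")
lists.free_block("block-B", "unmovable")
assert lists.alloc_block("movable") == "block-A"   # matching type first
assert lists.alloc_block("movable") == "block-B"   # fallback mixes types
```

The real patches do this per MAX_ORDER block inside the buddy allocator; the sketch only shows the list-selection policy.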

Hence, the two patchsets posted are no longer mutually exclusive and will work
together when they are both applied.  This means that without configuration,
external fragmentation will be reduced as much as possible. However,
if the system administrator has a workload which requires higher levels
of hugepage availability, or uses varying numbers of huge pages between jobs,
ZONE_MOVABLE can be sized to hold the maximum number of huge pages
required by any job.
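For illustration, the boot-time sizing described above looks something like the following (the `kernelcore=` parameter name matches what is used later in this mail; treat the exact syntax and units as an assumption about the posted patches):

```
# Kernel command line fragment: reserve 1G of memory for unmovable
# kernel allocations; the remainder becomes ZONE_MOVABLE and can be
# used for huge pages or reclaimed/migrated on demand.
kernelcore=1G
```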

This mail is intended to describe more about how the patches actually work
and provide some performance figures to the people who have made serious
comments about the patches in the past, mainly at VM Summit. I hope it will
help solidify discussions on these patch sets and ultimately lead to a decision
on which method or methods are worthy of merging to -mm for wider exposure.

In the past, it has been pointed out that the code is complicated and it is not
particularly clear what the end effect of the patches is or why they work. To
give people a better understanding of what the patches are actually doing,
Andy and I put together some tools that can graph how pages of
different mobility types are distributed in memory. An example image from
an x86_64 machine is at:

http://www.skynet.ie/~mel/anti-frag/2007-02-28/page_type_distribution.jpg

Each pixel represents a page of memory, and each block represents
MAX_ORDER_NR_PAGES pages. The color of a pixel indicates the
mobility type of the page.  In most cases, one box is one huge page. On
x86_64 and some x86, half a box will be a huge page. The legend for colors
is at the top, but for further clarification:

ZoneBoundary 	- A black outline to the left and above a box implies a zone
		  starts there
Movable 	- __GFP_MOVABLE pages
Reclaimable 	- __GFP_RECLAIMABLE pages
Pinned		- Bootmem-allocated pages and unmovable pages
per-cpu		- Allocated pages with no mappings or count. They are usually
		  per-cpu pages, but a high-order allocation can appear like this too.
Movable-under-reclaim - __GFP_MOVABLE pages on which IO is currently being performed

During tests, the data required to generate this image is collected every
2 seconds. At the end of the test, all the samples are gathered together
and a video is created. The video shows an approximate view of memory
fragmentation over time; it very clearly shows trends in the locations of
allocations and areas under reclaim.
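The rendering step described above can be sketched roughly as follows (the snapshot format, color choices, and a toy MAX_ORDER_NR_PAGES value are assumptions for illustration; the real tools differ):

```python
# Map each page's mobility type to an RGB pixel, grouping pages into
# MAX_ORDER-sized blocks, one block per row of the output image.
COLORS = {
    "movable": (0, 200, 0),                 # green
    "reclaimable": (0, 0, 200),             # blue
    "pinned": (200, 0, 0),
    "percpu": (200, 200, 0),
    "movable_under_reclaim": (0, 200, 200),
    "free": (255, 255, 255),
}
MAX_ORDER_NR_PAGES = 16  # toy value to keep the demo small

def render(sample):
    """Turn a list of per-page mobility types into rows of RGB pixels,
    one MAX_ORDER block per row."""
    rows = []
    for i in range(0, len(sample), MAX_ORDER_NR_PAGES):
        block = sample[i:i + MAX_ORDER_NR_PAGES]
        rows.append([COLORS[t] for t in block])
    return rows

sample = ["movable"] * 16 + ["pinned"] * 16
pixels = render(sample)
assert len(pixels) == 2 and len(pixels[0]) == 16
assert pixels[0][0] == (0, 200, 0) and pixels[1][0] == (200, 0, 0)
```

One such frame per 2-second sample, stitched together, gives the videos linked below.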

This is a video[*] of the vanilla kernel running a series of benchmarks on a
ppc64 machine.

video:  http://www.skynet.ie/~mel/anti-frag/2007-02-28/examplevideo-ppc64-2.6.20-mm2-vanilla.avi
frames: http://www.skynet.ie/~mel/anti-frag/2007-02-28/exampleframes-ppc64-2.6.20-mm2-vanilla.tar.gz


Notice that pages of different mobility types get scattered all over the
physical address space on the vanilla kernel because there is no effort made
to place the pages.  2% of memory was kept free with min_free_kbytes and this
results in the free block of pages towards the start of memory. Notice
also that the huge page allocations always come from here as well. *This* is
why setting min_free_kbytes to a high value allows higher-order allocations
to work for a period of time! As the buddy allocator favours small blocks for
allocation, the pages kept free were contiguous to begin with. This works
until that block gets split due to excessive memory pressure; after that,
high-order allocations start failing again. If min_free_kbytes were set to
a higher value once the system had been running for some time, high-order
allocations would continue to fail. This is why I've asserted before that
setting min_free_kbytes is not a fix for external fragmentation problems.
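The failure mode above can be demonstrated with a toy buddy-style simulation (illustrative only; free lists are counts per order, there is no coalescing, and the "reserve" stands in for the pages min_free_kbytes keeps free):

```python
def alloc(free_lists, order):
    """Allocate one block of `order`, splitting a higher-order block
    if needed. free_lists[o] counts free blocks of order o.
    Returns True on success."""
    for o in range(order, len(free_lists)):
        if free_lists[o]:
            free_lists[o] -= 1
            # put the split-off buddies back on lower-order lists
            while o > order:
                o -= 1
                free_lists[o] += 1
            return True
    return False

MAX_ORDER = 5
free_lists = [0] * MAX_ORDER
free_lists[MAX_ORDER - 1] = 1        # one contiguous reserved block

assert alloc(free_lists, 4)           # huge allocation works at first
free_lists[MAX_ORDER - 1] = 1        # refill the reserve
for _ in range(4):                    # memory pressure: order-0 allocs
    assert alloc(free_lists, 0)       # ...split the reserved block apart
assert not alloc(free_lists, 4)       # high-order allocation now fails
```

Once the reserved block has been split and the pieces handed out, no amount of raising the reserve brings the contiguity back, which is the point of the paragraph above.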

Next is a video of a kernel patched with both list-based and zone-based
patches applied.

video:  http://www.skynet.ie/~mel/anti-frag/2007-02-28/examplevideo-ppc64-2.6.20-mm2-antifrag.avi
frames: http://www.skynet.ie/~mel/anti-frag/2007-02-28/exampleframes-ppc64-2.6.20-mm2-antifrag.tar.gz

kernelcore has been set to 60% of memory, so you'll see the black
line indicating where the zone starts. Note how the higher zone is always
green (indicating it is being used for movable pages) - this is how
zone-based works.  In the remaining portions of memory, you'll see how
the boxes (i.e. MAX_ORDER areas) remain as solid colors the majority of
the time. This is the effect of list-based as it groups pages of similar
mobility together. Note also that, when allocating huge pages under load,
it fails to use all of ZONE_MOVABLE. This is a problem with reclaim,
which patches from Andy Whitcroft aim to fix. Two sets of figures are posted
below. The first set is just anti-frag related and the second test includes
patches from Andy.

It should be clear from the videos how and why anti-frag is successful at
what it does. To get higher success rates, defragmentation is needed to move
movable pages from sparsely populated hugepages to densely populated ones.
It should also be clear that slab reclaim needs to be a bit smarter, because
in the videos you'll see "blue" pages that are very sparsely populated
but not being reclaimed.

However, as anti-frag currently stands, it's very effective. The remaining
improvements are logical progressions rather than fixes to the fundamental
idea. For example, on gekko-lp4, only 1% of memory can be allocated as huge
pages on the vanilla kernel, which corresponds to min_free_kbytes. With both
patches applied, 51% of memory can be allocated as huge pages.

The following are performance figures based on a number of tests with
different machines.

Kernbench Total CPU Time
                          Vanilla Kernel   List-base Kernel  Zone-base Kernel  Combined Kernel
Machine       Arch           Seconds           Seconds           Seconds          Seconds
-------     ---------     --------------   ----------------  ----------------  ---------------
bl6-13      x86_64              121.34          119.00            119.60           ---    
elm3b14     x86-numaq          1527.57         1530.80           1529.26          1530.64 
elm3b245    x86_64              346.95          346.48            347.18           346.67  
gekko-lp1   ppc64               323.66          323.80            323.67           323.58  
gekko-lp4   ppc64               319.61          320.25            319.49           319.58  


Kernbench Total Elapsed Time
                          Vanilla Kernel   List-base Kernel  Zone-base Kernel  Combined Kernel
Machine       Arch           Seconds           Seconds           Seconds          Seconds
-------     ---------     --------------   ----------------  ----------------  ---------------
bl6-13      x86_64              36.32           37.78             35.14            ---    
elm3b14     x86-numaq          426.08          427.34            426.76           427.73  
elm3b245    x86_64              96.50           96.03             96.34            96.11   
gekko-lp1   ppc64              172.17          171.74            172.06           171.73  
gekko-lp4   ppc64              325.38          326.26            324.90           324.83  


Percentage of memory allocated as huge pages under load
                          Vanilla Kernel   List-base Kernel  Zone-base Kernel  Combined Kernel
Machine       Arch          Percentage       Percentage         Percentage       Percentage
-------     ---------     --------------   ----------------  ----------------  ---------------
bl6-13      x86_64              10              75                 17             0  
elm3b14     x86-numaq           21              27                 19             25 
elm3b245    x86_64              34              66                 27             62 
gekko-lp1   ppc64               2               14                 4              20 
gekko-lp4   ppc64               1               24                 3              17 


Percentage of memory allocated as huge pages at rest at end of test
                          Vanilla Kernel   List-base Kernel  Zone-base Kernel  Combined Kernel
Machine       Arch          Percentage       Percentage         Percentage       Percentage
-------     ---------     --------------   ----------------  ----------------  ---------------
bl6-13      x86_64              17              76                 22             0  
elm3b14     x86-numaq           69              84                 55             82 
elm3b245    x86_64              41              82                 44             82 
gekko-lp1   ppc64               3               61                 9              69 
gekko-lp4   ppc64               1               32                 4              51 

These are figures based on kernels patched with Andy Whitcroft's reclaim
patches. You will see that the zone-based kernel is getting success rates
closer to 40%, as one would expect, although there is still something amiss.

Kernbench Total CPU Time        
                          Vanilla Kernel   List-base Kernel  Zone-base Kernel  Combined Kernel
Machine       Arch           Seconds           Seconds           Seconds          Seconds
-------     ---------     --------------   ----------------  ----------------  ---------------
elm3b14     x86-numaq           1528.42         1531.25           1528.48          1531.04 
elm3b245    x86_64              347.48          346.09            346.67           346.04  
gekko-lp1   ppc64               323.74          323.79            323.45           323.77  
gekko-lp4   ppc64               319.65          319.72            319.74           319.70  
                                
                                
Kernbench Total Elapsed Time    
                          Vanilla Kernel   List-base Kernel  Zone-base Kernel  Combined Kernel
Machine       Arch           Seconds           Seconds           Seconds          Seconds
-------     ---------     --------------   ----------------  ----------------  ---------------
elm3b14     x86-numaq           427.00          427.85            426.18           427.42  
elm3b245    x86_64              96.72           96.03             96.58            96.27   
gekko-lp1   ppc64               172.07          172.07            171.96           172.72  
gekko-lp4   ppc64               325.41          324.97            325.71           324.94  
                                
                                
Percentage of memory allocated as huge pages under load
                          Vanilla Kernel   List-base Kernel  Zone-base Kernel  Combined Kernel
Machine       Arch          Percentage       Percentage         Percentage       Percentage
-------     ---------     --------------   ----------------  ----------------  ---------------
elm3b14     x86-numaq           24              29                 23             26 
elm3b245    x86_64              33              76                 42             75 
gekko-lp1   ppc64               2               23                 9              29 
gekko-lp4   ppc64               1               24                 24             40 
                                
                                
Percentage of memory allocated as huge pages at rest at end of test
                          Vanilla Kernel   List-base Kernel  Zone-base Kernel  Combined Kernel
Machine       Arch          Percentage       Percentage         Percentage       Percentage
-------     ---------     --------------   ----------------  ----------------  ---------------
elm3b14     x86-numaq           52              84                 64             82 
elm3b245    x86_64              51              87                 44             85 
gekko-lp1   ppc64               7               69                 25             67 
gekko-lp4   ppc64               3               43                 29             53 

The patches go a long way to making sure that high-order allocations work
and particularly that the hugepage pool can be resized once the system has
been running. With the clustering of high-order atomic allocations, I have
some confidence that allocating contiguous jumbo frames will work even with
loads performing lots of IO. I think the videos show how the patches actually
work in the clearest possible manner.

I am of the opinion that both approaches have their advantages and
disadvantages. Given a choice between the two, I prefer list-based
because of its flexibility, and it should also help high-order kernel
allocations. However, by applying both, the disadvantages of list-based are
covered and there still appears to be no performance loss as a result. Hence,
I'd like to see both merged.  Any opinion on merging these patches into -mm
for wider testing?




Here is a list of videos showing different patched kernels on each machine
for the curious. Be warned that they are all pretty large, which means the
guys hosting the machine are going to love me.

elm3b14-vanilla       http://www.skynet.ie/~mel/anti-frag/2007-02-28/elm3b14-vanilla.avi
elm3b14-list-based    http://www.skynet.ie/~mel/anti-frag/2007-02-28/elm3b14-listbased.avi
elm3b14-zone-based    http://www.skynet.ie/~mel/anti-frag/2007-02-28/elm3b14-zonebased.avi
elm3b14-combined      http://www.skynet.ie/~mel/anti-frag/2007-02-28/elm3b14-combined.avi

elm3b245-vanilla      http://www.skynet.ie/~mel/anti-frag/2007-02-28/elm3b245-vanilla.avi
elm3b245-list-based   http://www.skynet.ie/~mel/anti-frag/2007-02-28/elm3b245-listbased.avi
elm3b245-zone-based   http://www.skynet.ie/~mel/anti-frag/2007-02-28/elm3b245-zonebased.avi
elm3b245-combined     http://www.skynet.ie/~mel/anti-frag/2007-02-28/elm3b245-combined.avi

gekko-lp1-vanilla     http://www.skynet.ie/~mel/anti-frag/2007-02-28/gekkolp1-vanilla.avi
gekko-lp1-list-based  http://www.skynet.ie/~mel/anti-frag/2007-02-28/gekkolp1-listbased.avi
gekko-lp1-zone-based  http://www.skynet.ie/~mel/anti-frag/2007-02-28/gekkolp1-zonebased.avi
gekko-lp1-combined    http://www.skynet.ie/~mel/anti-frag/2007-02-28/gekkolp1-combined.avi

gekko-lp4-vanilla     http://www.skynet.ie/~mel/anti-frag/2007-02-28/gekkolp4-vanilla.avi
gekko-lp4-list-based  http://www.skynet.ie/~mel/anti-frag/2007-02-28/gekkolp4-listbased.avi
gekko-lp4-zone-based  http://www.skynet.ie/~mel/anti-frag/2007-02-28/gekkolp4-zonebased.avi
gekko-lp4-combined    http://www.skynet.ie/~mel/anti-frag/2007-02-28/gekkolp4-combined.avi

Notes;
1. The performance figures show small variances, both performance gains and
   regressions. The biggest gains tend to be on x86_64.
2. The x86 figures are based on a numaq which is an ancient machine. I didn't
   have a more modern machine available for running these tests on.
3. The Vanilla kernel is an unpatched 2.6.20-mm2 kernel.
4. List-base represents the "list-based" patches described above, which group
   pages by mobility type.
5. Zone-base represents the "zone-based" patches, which group movable pages
   together in one zone as described.
6. Combined is with both sets of patches applied.
7. The kernbench figures are based on an average of 3 iterations. The figures
   always show that the vanilla and patched kernels have similar performance.
   The anti-frag kernels are usually faster on x86_64.
8. The success rates for the allocation of hugepages should always be at least
   40%. Anything lower implies that reclaim is not reclaiming pages that it
   could. I've included figures above based on kernels patched with additional
   fixes to reclaim from Andy.
9. The bl6-13 figures are incomplete because the machine was deleted from
   the test grid and never came back. They're left in because it was a machine
   that showed reliable performance improvements from the patches.
10. The videos are a bit blurry due to the video quality. High-resolution
   images can be generated.

[*] On my Debian Etch system, xine-ui works for playing videos. On other
	systems, I found ffplay from the ffmpeg package worked. If neither
	of these work for you, the tar.gz contains the JPG files making up
	the frames and you can view them with any image viewer.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
       [not found] ` <20070301160915.6da876c5.akpm@linux-foundation.org>
@ 2007-03-02  1:39   ` Balbir Singh
  2007-03-02  2:34   ` KAMEZAWA Hiroyuki
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 104+ messages in thread
From: Balbir Singh @ 2007-03-02  1:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, npiggin, clameter, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

Andrew Morton wrote:
> So some urgent questions are: how are we going to do mem hotunplug and
> per-container RSS?
> 
> 
> 
> Our basic unit of memory management is the zone.  Right now, a zone maps
> onto some hardware-imposed thing.  But the zone-based MM works *well*.  I
> suspect that a good way to solve both per-container RSS and mem hotunplug
> is to split the zone concept away from its hardware limitations: create a
> "software zone" and a "hardware zone".  All the existing page allocator and
> reclaim code remains basically unchanged, and it operates on "software
> zones".  Each software zone always lies within a single hardware zone. 
> The software zones are resizeable.  For per-container RSS we give each
> container one (or perhaps multiple) resizeable software zones.
> 
> For memory hotunplug, some of the hardware zone's software zones are marked
> reclaimable and some are not; DIMMs which are wholly within reclaimable
> zones can be depopulated and powered off or removed.
> 
> NUMA and cpusets screwed up: they've gone and used nodes as their basic
> unit of memory management whereas they should have used zones.  This will
> need to be untangled.
> 
> 
> Anyway, that's just a shot in the dark.  Could be that we implement unplug
> and RSS control by totally different means.  But I do wish that we'd sort
> out what those means will be before we potentially complicate the story a
> lot by adding antifragmentation.
> 

Paul Menage had suggested something very similar in response to the RFC
for memory controllers I sent out, and it was suggested that we create
small zones (roughly 64 MB) to avoid the issue of a zone/node not being
shareable across containers. Even with a small size, there are some
issues. The following thread has the details discussed.

	http://lkml.org/lkml/2006/10/30/120

RSS accounting is very easy (with minimal changes to the core mm);
supplemented with an efficient per-container reclaimer, it should be
easy to implement a good per-container RSS controller.
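The accounting being described might be sketched as follows (all names invented; a real controller hooks the page-fault and unmap paths and triggers per-container reclaim instead of simply failing):

```python
class Container:
    """Toy per-container RSS accounting: charge on page fault,
    uncharge on unmap, refuse charges past the limit."""

    def __init__(self, limit_pages):
        self.limit = limit_pages
        self.rss = 0

    def charge(self):
        if self.rss >= self.limit:
            # Over limit: a real controller would run per-container
            # reclaim here rather than just failing the charge.
            return False
        self.rss += 1
        return True

    def uncharge(self):
        self.rss -= 1

c = Container(limit_pages=2)
assert c.charge() and c.charge()   # two faults fit under the limit
assert not c.charge()              # third fault exceeds the RSS limit
c.uncharge()                       # an unmap frees up headroom
assert c.charge()
```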

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: The performance and behaviour of the anti-fragmentation related patches
       [not found]   ` <Pine.LNX.4.64.0703011642190.12485@woody.linux-foundation.org>
@ 2007-03-02  1:52     ` Balbir Singh
  2007-03-02  3:44       ` Linus Torvalds
  2007-03-02 16:58     ` Mel Gorman
  2007-03-02 17:05     ` Joel Schopp
  2 siblings, 1 reply; 104+ messages in thread
From: Balbir Singh @ 2007-03-02  1:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Mel Gorman, npiggin, clameter, mingo, jschopp,
	arjan, mbligh, linux-mm, linux-kernel

Linus Torvalds wrote:
> On Thu, 1 Mar 2007, Andrew Morton wrote:
>> So some urgent questions are: how are we going to do mem hotunplug and
>> per-container RSS?
> 
> Also: how are we going to do this in virtualized environments? Usually the 
> people who care about memory hotunplug are exactly the same people who 
> also care (or claim to care, or _will_ care) about virtualization.
> 
> My personal opinion is that while I'm not a huge fan of virtualization, 
> these kinds of things really _can_ be handled more cleanly at that layer, 
> and not in the kernel at all. Afaik, it's what IBM already does, and has 
> been doing for a while. There's no shame in looking at what already works, 
> especially if it's simpler.

Could you please clarify as to what "that layer" means - is it the
firmware/hardware for virtualization? or does it refer to user space?
With virtualization the linux kernel would end up acting as a hypervisor
and resource management support like per-container RSS support needs to
be built into the kernel.

It would also be useful to have a resource controller like per-container
RSS control (container refers to a task grouping) within the kernel for
non-virtualized environments as well.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-01 10:12 The performance and behaviour of the anti-fragmentation related patches Mel Gorman
@ 2007-03-02  1:52 ` Bill Irwin
  2007-03-02 10:38   ` Mel Gorman
       [not found] ` <20070301160915.6da876c5.akpm@linux-foundation.org>
  1 sibling, 1 reply; 104+ messages in thread
From: Bill Irwin @ 2007-03-02  1:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: akpm, npiggin, clameter, mingo, jschopp, arjan, torvalds, mbligh,
	linux-mm, linux-kernel

On Thu, Mar 01, 2007 at 10:12:50AM +0000, Mel Gorman wrote:
> These are figures based on kernels patched with Andy Whitcroft's reclaim
> patches. You will see that the zone-based kernel is getting success rates
> closer to 40%, as one would expect, although there is still something amiss.

Yes, combining the two should do at least as well as either in
isolation. Are there videos of each of the two in isolation? Maybe that
would give someone insight into what's happening.


On Thu, Mar 01, 2007 at 10:12:50AM +0000, Mel Gorman wrote:
> Kernbench Total CPU Time        

Oh dear. How do the other benchmarks look?


On Thu, Mar 01, 2007 at 10:12:50AM +0000, Mel Gorman wrote:
> The patches go a long way to making sure that high-order allocations work
> and particularly that the hugepage pool can be resized once the system has
> been running. With the clustering of high-order atomic allocations, I have
> some confidence that allocating contiguous jumbo frames will work even with
> loads performing lots of IO. I think the videos show how the patches actually
> work in the clearest possible manner.
> I am of the opinion that both approaches have their advantages and
> disadvantages. Given a choice between the two, I prefer list-based
> because of it's flexibility and it should also help high-order kernel
> allocations. However, by applying both, the disadvantages of list-based are
> covered and there still appears to be no performance loss as a result. Hence,
> I'd like to see both merged.  Any opinion on merging these patches into -mm
> for wider testing?

Exhibiting a workload where the list patch breaks down and the zone
patch rescues it might help if it's felt that the combination isn't as
good as lists in isolation. I'm sure one can be dredged up somewhere.
Either that or someone will eventually spot why the combination doesn't
get as many available maximally contiguous regions as the list patch.
By and large I'm happy to see anything go in that inches hugetlbfs
closer to a backward compatibility wrapper over ramfs.


-- wli


* Re: The performance and behaviour of the anti-fragmentation related patches
       [not found] ` <20070301160915.6da876c5.akpm@linux-foundation.org>
  2007-03-02  1:39   ` Balbir Singh
@ 2007-03-02  2:34   ` KAMEZAWA Hiroyuki
  2007-03-02  3:05   ` Christoph Lameter
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 104+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-03-02  2:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mel, npiggin, clameter, mingo, jschopp, arjan, torvalds, mbligh,
	linux-mm, linux-kernel

On Thu, 1 Mar 2007 16:09:15 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Thu, 1 Mar 2007 10:12:50 +0000
> mel@skynet.ie (Mel Gorman) wrote:
> 
> > Any opinion on merging these patches into -mm
> > for wider testing?
> 
> I'm a little reluctant to make changes to -mm's core mm unless those
> changes are reasonably certain to be on track for mainline, so let's talk
> about that.
> 
> What worries me is memory hot-unplug and per-container RSS limits.  We
> don't know how we're going to do either of these yet, and it could well be
> that the anti-frag work significantly complicates whatever we end up
> doing there.
> 
> For prioritisation purposes I'd judge that memory hot-unplug is of similar
> value to the antifrag work (because memory hot-unplug permits DIMM
> poweroff).

Regarding memory hot-unplug, I'm now writing a new patch set for memory
unplug to show my overview and roadmap. I'm now debugging it, and I think
I will be able to post it as an RFC in a week.

At the least, ZONE_MOVABLE (or something that partitions memory) is necessary
for memory hot-unplug such as DIMM power-off. (I'm now using my own
ZONE_MOVABLE patch, but it is O.K. to migrate to Mel's one if it's ready to
be merged.)


> Our basic unit of memory management is the zone.  Right now, a zone maps
> onto some hardware-imposed thing.  But the zone-based MM works *well*.  I
> suspect that a good way to solve both per-container RSS and mem hotunplug
> is to split the zone concept away from its hardware limitations: create a
> "software zone" and a "hardware zone".  All the existing page allocator and
> reclaim code remains basically unchanged, and it operates on "software
> zones".  Each software zone always lies within a single hardware zone. 
> The software zones are resizeable.  For per-container RSS we give each
> container one (or perhaps multiple) resizeable software zones.
> 
> For memory hotunplug, some of the hardware zone's software zones are marked
> reclaimable and some are not; DIMMs which are wholly within reclaimable
> zones can be depopulated and powered off or removed.
> 
Hmm... a software zone seems attractive.
I remember someone posted a pseudo-zone (pzone) patch in the past.

-Kame



* Re: The performance and behaviour of the anti-fragmentation related patches
       [not found] ` <20070301160915.6da876c5.akpm@linux-foundation.org>
  2007-03-02  1:39   ` Balbir Singh
  2007-03-02  2:34   ` KAMEZAWA Hiroyuki
@ 2007-03-02  3:05   ` Christoph Lameter
  2007-03-02  3:57     ` Nick Piggin
  2007-03-02 13:50   ` Arjan van de Ven
                     ` (3 subsequent siblings)
  6 siblings, 1 reply; 104+ messages in thread
From: Christoph Lameter @ 2007-03-02  3:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, npiggin, mingo, jschopp, arjan, torvalds, mbligh,
	linux-mm, linux-kernel

On Thu, 1 Mar 2007, Andrew Morton wrote:

> What worries me is memory hot-unplug and per-container RSS limits.  We
> don't know how we're going to do either of these yet, and it could well be
> that the anti-frag work significantly complicates whatever we end up
> doing there.

Right now it seems that the per-container RSS limits differ from the
statistics calculated per zone. There would be a conceptual overlap, but
the containers are optional and track numbers differently. There is no RSS
counter in a zone, for example.

Memory hot-unplug would directly tap into the anti-frag work. Essentially,
only the zone with movable pages would be unpluggable without additional
measures. Making slab items and other allocations that are fixed movable
requires work anyway. A new zone concept will not help.

> For prioritisation purposes I'd judge that memory hot-unplug is of similar
> value to the antifrag work (because memory hot-unplug permits DIMM
> poweroff).

I would say that anti-frag / defrag enables memory unplug.

> And I'd judge that per-container RSS limits are of considerably more value
> than antifrag (in fact per-container RSS might be a superset of antifrag,
> in the sense that per-container RSS and containers could be abused to fix
> the i-cant-get-any-hugepages problem, dunno).

How do they relate? How can a container perform antifrag? Meaning a container
reserves a portion of a hardware zone and becomes a software zone?

> So some urgent questions are: how are we going to do mem hotunplug and
> per-container RSS?

Separately. There is no need to mingle these two together.

> Our basic unit of memory management is the zone.  Right now, a zone maps
> onto some hardware-imposed thing.  But the zone-based MM works *well*.  I

That's a value judgement that I doubt. Zone-based balancing is bad and has
been repeatedly patched up so that it works with the usual loads.

> suspect that a good way to solve both per-container RSS and mem hotunplug
> is to split the zone concept away from its hardware limitations: create a
> "software zone" and a "hardware zone".  All the existing page allocator and
> reclaim code remains basically unchanged, and it operates on "software
> zones".  Each software zone always lies within a single hardware zone. 
> The software zones are resizeable.  For per-container RSS we give each
> container one (or perhaps multiple) resizeable software zones.

Resizable software zones? Are they contiguous or not? If not then we
add another layer to the defrag problem.

> For memory hotunplug, some of the hardware zone's software zones are marked
> reclaimable and some are not; DIMMs which are wholly within reclaimable
> zones can be depopulated and powered off or removed.

So subzones indeed. How about calling the MAX_ORDER entities that Mel's 
patches create "software zones"?

> NUMA and cpusets screwed up: they've gone and used nodes as their basic
> unit of memory management whereas they should have used zones.  This will
> need to be untangled.

Zones have hardware characteristics at their core. In a NUMA setting, zones
determine the performance of loads from those areas. I would like to have
zones and nodes merged. Maybe extend node numbers into the negative range:
-1 = DMA, -2 = DMA32, etc.? All systems then manage the "nones" (nodes/zones
merged). One could create additional "virtual" nones after the real nones
that have hardware characteristics behind them. The virtual nones would be
something like the software zones? Containing MAX_ORDER portions of hardware
nones?

> Anyway, that's just a shot in the dark.  Could be that we implement unplug
> and RSS control by totally different means.  But I do wish that we'd sort
> out what those means will be before we potentially complicate the story a
> lot by adding antifragmentation.

Hmmm.... My shot:

1. Merge zones/nodes

2. Create new virtual zones/nodes that are subsets of MAX_ORDER blocks of 
the real zones/nodes. These may then have additional characteristics such
as:

A. moveable/unmovable
B. DMA restrictions
C. container assignment.
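
Christoph's two-step sketch above can be made concrete with a toy model
(all names and flag values here are hypothetical illustrations, not
anything from the posted patches): a merged node/zone hands out
MAX_ORDER blocks tagged with characteristics, and allocation requests
are matched against those tags.

```python
# Toy model of the proposal: a merged node/zone ("none") is a pool of
# MAX_ORDER blocks, each tagged with characteristics such as movability,
# DMA reachability, or a container assignment. Illustrative names only.

MOVABLE = 1 << 0           # only movable/reclaimable pages allowed here
DMA32 = 1 << 1             # physically below 4GB

class MergedNode:
    def __init__(self, nr_blocks):
        # one flags word per MAX_ORDER block; a container id could be
        # carried the same way
        self.block_flags = [0] * nr_blocks

    def tag(self, idx, flags):
        self.block_flags[idx] |= flags

    def find_block(self, required):
        """Return the first block satisfying all required flags."""
        for idx, flags in enumerate(self.block_flags):
            if flags & required == required:
                return idx
        return None

node = MergedNode(nr_blocks=8)
node.tag(0, DMA32)             # low memory, unmovable allocations ok
node.tag(5, MOVABLE)           # user pages only: can be migrated away
node.tag(6, MOVABLE | DMA32)

# A movable allocation that must sit below 4GB can only use block 6.
assert node.find_block(MOVABLE | DMA32) == 6
# A plain DMA32 request is satisfied by block 0.
assert node.find_block(DMA32) == 0
```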


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  1:52     ` Balbir Singh
@ 2007-03-02  3:44       ` Linus Torvalds
  2007-03-02  3:59         ` Andrew Morton
                           ` (3 more replies)
  0 siblings, 4 replies; 104+ messages in thread
From: Linus Torvalds @ 2007-03-02  3:44 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Andrew Morton, Mel Gorman, npiggin, clameter, mingo, jschopp,
	arjan, mbligh, linux-mm, linux-kernel



On Fri, 2 Mar 2007, Balbir Singh wrote:
>
> > My personal opinion is that while I'm not a huge fan of virtualization,
> > these kinds of things really _can_ be handled more cleanly at that layer,
> > and not in the kernel at all. Afaik, it's what IBM already does, and has
> > been doing for a while. There's no shame in looking at what already works,
> > especially if it's simpler.
> 
> Could you please clarify as to what "that layer" means - is it the
> firmware/hardware for virtualization? or does it refer to user space?

Virtualization in general. We don't know what it is - in IBM machines it's 
a hypervisor. With Xen and VMware, it's usually a hypervisor too. With 
KVM, it's obviously a host Linux kernel/user-process combination.

The point being that in the guests, hotunplug is almost useless (for 
bigger ranges), and we're much better off just telling the virtualization 
hosts on a per-page level whether we care about a page or not, than to 
worry about fragmentation.

And in hosts, we usually don't care EITHER, since it's usually done in a 
hypervisor.

> It would also be useful to have a resource controller like per-container
> RSS control (container refers to a task grouping) within the kernel or
> non-virtualized environments as well.

.. but this has again no impact on anti-fragmentation.

In other words, I really don't see a huge upside. I see *lots* of 
downsides, but upsides? Not so much. Almost everybody who wants unplug 
wants virtualization, and right now none of the "big virtualization" 
people would want to have kernel-level anti-fragmentation anyway since 
they'd need to do it on their own.

		Linus


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  3:05   ` Christoph Lameter
@ 2007-03-02  3:57     ` Nick Piggin
  2007-03-02  4:06       ` Christoph Lameter
  2007-03-02  4:20       ` Paul Mundt
  0 siblings, 2 replies; 104+ messages in thread
From: Nick Piggin @ 2007-03-02  3:57 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Thu, Mar 01, 2007 at 07:05:48PM -0800, Christoph Lameter wrote:
> On Thu, 1 Mar 2007, Andrew Morton wrote:
> > For prioritisation purposes I'd judge that memory hot-unplug is of similar
> > value to the antifrag work (because memory hot-unplug permits DIMM
> > poweroff).
> 
> I would say that anti-frag / defrag enables memory unplug.

Well that really depends. If you want to have any sort of guaranteed
amount of unplugging or shrinking (or hugepage allocating), then antifrag
doesn't work because it is a heuristic.

One thing that worries me about anti-fragmentation is that people might
actually start _using_ higher order pages in the kernel. Then fragmentation
comes back, and it's worse because now it is not just the fringe hugepage or
unplug users (who can anyway work around the fragmentation by allocating
from reserve zones).

> > Our basic unit of memory management is the zone.  Right now, a zone maps
> > onto some hardware-imposed thing.  But the zone-based MM works *well*.  I
> 
> That's a value judgement that I doubt. Zone-based balancing is bad and has 
> been repeatedly patched up so that it works with the usual loads.

Shouldn't we fix it instead of deciding it is broken and add another layer
on top that supposedly does better balancing?

> > suspect that a good way to solve both per-container RSS and mem hotunplug
> > is to split the zone concept away from its hardware limitations: create a
> > "software zone" and a "hardware zone".  All the existing page allocator and
> > reclaim code remains basically unchanged, and it operates on "software
> > zones".  Each software zones always lies within a single hardware zone. 
> > The software zones are resizeable.  For per-container RSS we give each
> > container one (or perhaps multiple) resizeable software zones.
> 
> Resizable software zones? Are they contiguous or not? If not then we
> add another layer to the defrag problem.

I think Andrew is proposing that we work out what the problem is first.
I don't know what the defrag problem is, but I know that fragmentation
is unavoidable unless you have fixed size areas for each different size
of unreclaimable allocation.
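
Nick's point, that scattered unreclaimable allocations defeat any
heuristic while confining them to fixed areas does not, can be shown
with a toy simulation (purely illustrative numbers, not from any
benchmark):

```python
# A handful of scattered unmovable pages destroys every large contiguous
# free range, even though almost all memory is free; the same pages
# confined to one fixed region leave most large blocks intact.

PAGES = 1024            # total pages in the toy memory
ORDER = 8               # we want 2^8 = 256-page contiguous blocks
BLOCK = 1 << ORDER

pinned = set(range(100, PAGES, 200))   # ~0.5% of pages, spread out

def free_high_order_blocks(pinned_pages):
    """Count aligned 2^ORDER blocks containing no pinned page."""
    count = 0
    for base in range(0, PAGES, BLOCK):
        if not any(p in pinned_pages for p in range(base, base + BLOCK)):
            count += 1
    return count

# Fewer than 1% of pages are pinned...
assert len(pinned) < PAGES // 100
# ...yet not a single order-8 block remains allocatable.
assert free_high_order_blocks(pinned) == 0
# The same number of pages confined to one fixed area: 3 of 4 blocks free.
confined = set(range(len(pinned)))
assert free_high_order_blocks(confined) == 3
```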

> > NUMA and cpusets screwed up: they've gone and used nodes as their basic
> > unit of memory management whereas they should have used zones.  This will
> > need to be untangled.
> 
> zones have hardware characteristics at their core. In a NUMA setting zones 
> determine the performance of loads from those areas. I would like to have
> zones and nodes merged. Maybe extend node numbers into the negative range:
> -1 = DMA, -2 = DMA32, etc.? All systems would then manage the "nones" 
> (nodes/zones merged). One could create additional "virtual" nones after the 
> real nones that have hardware characteristics behind them. The virtual nones 
> would be something like the software zones: they would contain MAX_ORDER 
> portions of hardware nones.

But just because zones are hardware _now_ doesn't mean they have to stay
that way. The upshot is that a lot of work for zones is already there.

> > Anyway, that's just a shot in the dark.  Could be that we implement unplug
> > and RSS control by totally different means.  But I do wish that we'd sort
> > out what those means will be before we potentially complicate the story a
> > lot by adding antifragmentation.
> 
> Hmmm.... My shot:
> 
> 1. Merge zones/nodes
> 
> 2. Create new virtual zones/nodes that are subsets of MAX_ORDER blocks of 
> the real zones/nodes. These may then have additional characteristics such
> as:
> 
> A. moveable/unmovable
> B. DMA restrictions
> C. container assignment.

There are alternatives to adding a new layer of virtual zones. We could try
using zones, even.

zones aren't perfect right now, but they are quite similar to what you
want (ie. blocks of memory). I think we should first try to generalise what
we have rather than adding another layer.



* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  3:44       ` Linus Torvalds
@ 2007-03-02  3:59         ` Andrew Morton
  2007-03-02  5:11           ` Linus Torvalds
  2007-03-02  4:18         ` Balbir Singh
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 104+ messages in thread
From: Andrew Morton @ 2007-03-02  3:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Balbir Singh, Mel Gorman, npiggin, clameter, mingo, jschopp,
	arjan, mbligh, linux-mm, linux-kernel

On Thu, 1 Mar 2007 19:44:27 -0800 (PST) Linus Torvalds <torvalds@linux-foundation.org> wrote:

> In other words, I really don't see a huge upside. I see *lots* of 
> downsides, but upsides? Not so much. Almost everybody who wants unplug 
> wants virtualization, and right now none of the "big virtualization" 
> people would want to have kernel-level anti-fragmentation anyway since 
> they'd need to do it on their own.

Agree with all that, but you're missing the other application: power
saving.  FBDIMMs take eight watts a pop.  If we can turn them off when the
system is unloaded we save either four or all eight watts (assuming we can
get Intel to part with the information which is needed to do this.  I fear
an ACPI method will ensue).

There's a whole lot of complexity and work in all of this, but 24*8 watts
is a lot of watts, and it's worth striving for.



* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  3:57     ` Nick Piggin
@ 2007-03-02  4:06       ` Christoph Lameter
  2007-03-02  4:21         ` Nick Piggin
  2007-03-02  4:29         ` Andrew Morton
  2007-03-02  4:20       ` Paul Mundt
  1 sibling, 2 replies; 104+ messages in thread
From: Christoph Lameter @ 2007-03-02  4:06 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, Nick Piggin wrote:

> > I would say that anti-frag / defrag enables memory unplug.
> 
> Well that really depends. If you want to have any sort of guaranteed
> amount of unplugging or shrinking (or hugepage allocating), then antifrag
> doesn't work because it is a heuristic.

We would need additional measures such as real defrag and making more 
structures movable.

> One thing that worries me about anti-fragmentation is that people might
> actually start _using_ higher order pages in the kernel. Then fragmentation
> comes back, and it's worse because now it is not just the fringe hugepage or
> unplug users (who can anyway work around the fragmentation by allocating
> from reserve zones).

Yes, we (SGI) need exactly that: Use of higher order pages in the kernel 
in order to reduce overhead of managing page structs for large I/O and 
large memory applications. We need appropriate measures to deal with the 
fragmentation problem.

> > That's a value judgement that I doubt. Zone-based balancing is bad and has 
> > been repeatedly patched up so that it works with the usual loads.
> 
> Shouldn't we fix it instead of deciding it is broken and add another layer
> on top that supposedly does better balancing?

We need to reduce the real hardware zones as much as possible. Most high 
performance architectures, for example, have no need for additional DMA zones
and do not have to deal with the complexities that arise there.

> But just because zones are hardware _now_ doesn't mean they have to stay
> that way. The upshot is that a lot of work for zones is already there.

Well you cannot get there without the nodes. The control of memory 
allocations with user space support etc only comes with the nodes.

> > A. moveable/unmovable
> > B. DMA restrictions
> > C. container assignment.
> 
> There are alternatives to adding a new layer of virtual zones. We could try
> using zones, even.

No, merge them into one thing and handle them as one. No difference between 
zones and nodes anymore.
 
> zones aren't perfect right now, but they are quite similar to what you
> want (ie. blocks of memory). I think we should first try to generalise what
> we have rather than adding another layer.

Yes that would mean merging nodes and zones. So "nones".



* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  3:44       ` Linus Torvalds
  2007-03-02  3:59         ` Andrew Morton
@ 2007-03-02  4:18         ` Balbir Singh
  2007-03-02  5:13         ` Jeremy Fitzhardinge
  2007-03-06  4:16         ` Paul Mackerras
  3 siblings, 0 replies; 104+ messages in thread
From: Balbir Singh @ 2007-03-02  4:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Mel Gorman, npiggin, clameter, mingo, jschopp,
	arjan, mbligh, linux-mm, linux-kernel

Linus Torvalds wrote:
> 
> On Fri, 2 Mar 2007, Balbir Singh wrote:
>>> My personal opinion is that while I'm not a huge fan of virtualization,
>>> these kinds of things really _can_ be handled more cleanly at that layer,
>>> and not in the kernel at all. Afaik, it's what IBM already does, and has
>>> been doing for a while. There's no shame in looking at what already works,
>>> especially if it's simpler.
>> Could you please clarify as to what "that layer" means - is it the
>> firmware/hardware for virtualization? or does it refer to user space?
> 
> Virtualization in general. We don't know what it is - in IBM machines it's 
> a hypervisor. With Xen and VMware, it's usually a hypervisor too. With 
> KVM, it's obviously a host Linux kernel/user-process combination.
> 

Thanks for clarifying.

> The point being that in the guests, hotunplug is almost useless (for 
> bigger ranges), and we're much better off just telling the virtualization 
> hosts on a per-page level whether we care about a page or not, than to 
> worry about fragmentation.
> 
> And in hosts, we usually don't care EITHER, since it's usually done in a 
> hypervisor.
> 
>> It would also be useful to have a resource controller like per-container
>> RSS control (container refers to a task grouping) within the kernel or
>> non-virtualized environments as well.
> 
> .. but this has again no impact on anti-fragmentation.
> 

Yes, I agree that anti-fragmentation and resource management are independent
of each other. I must admit to being a bit selfish here, in that my main
interest is in resource management and we would love to see a well-written
and easy-to-understand resource management infrastructure and
controllers to control CPU and memory usage. Since the issue of
per-container RSS control came up, I wanted to ensure that we do not mix
up resource control and anti-fragmentation.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  3:57     ` Nick Piggin
  2007-03-02  4:06       ` Christoph Lameter
@ 2007-03-02  4:20       ` Paul Mundt
  1 sibling, 0 replies; 104+ messages in thread
From: Paul Mundt @ 2007-03-02  4:20 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Lameter, Andrew Morton, Mel Gorman, mingo, jschopp,
	arjan, torvalds, mbligh, linux-mm, linux-kernel

On Fri, Mar 02, 2007 at 04:57:51AM +0100, Nick Piggin wrote:
> On Thu, Mar 01, 2007 at 07:05:48PM -0800, Christoph Lameter wrote:
> > On Thu, 1 Mar 2007, Andrew Morton wrote:
> > > For prioritisation purposes I'd judge that memory hot-unplug is of similar
> > > value to the antifrag work (because memory hot-unplug permits DIMM
> > > poweroff).
> > 
> > I would say that anti-frag / defrag enables memory unplug.
> 
> Well that really depends. If you want to have any sort of guaranteed
> amount of unplugging or shrinking (or hugepage allocating), then antifrag
> doesn't work because it is a heuristic.
> 
> One thing that worries me about anti-fragmentation is that people might
> actually start _using_ higher order pages in the kernel. Then fragmentation
> comes back, and it's worse because now it is not just the fringe hugepage or
> unplug users (who can anyway work around the fragmentation by allocating
> from reserve zones).
> 
There are two sides to that: the ability to use higher order pages in the
kernel also means that it's possible to use larger TLB entries while
keeping the base page size small, too. There are already many places in
the kernel that attempt to use the largest possible size when setting up
the entries, and this is something that those of us with tiny
software-managed TLBs are a huge fan of -- some platforms have even opted
to do perverse things such as scanning for contiguous PTEs and bumping to
the next order automatically at set_pte() time.
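
The contiguous-PTE scan Paul mentions can be sketched roughly as follows
(hypothetical helper name and pure-Python page tables; real
implementations do this in arch-specific set_pte paths):

```python
# At PTE-set time, check whether the naturally aligned run of small
# pages containing this one is also physically contiguous and aligned,
# in which case a single larger TLB entry could cover the whole run.

def can_promote(ptes, vpn, order):
    """True if the aligned 2^order run containing virtual page vpn
    maps physically contiguous, suitably aligned frames."""
    run = 1 << order
    base = vpn & ~(run - 1)                # align down to run boundary
    first = ptes.get(base)
    if first is None or first % run != 0:  # frame must be aligned too
        return False
    return all(ptes.get(base + i) == first + i for i in range(run))

# Virtual pages 0..3 map frames 8..11: contiguous and aligned, so the
# 4-page run qualifies for one large TLB entry.
ptes = {0: 8, 1: 9, 2: 10, 3: 11, 4: 20}
assert can_promote(ptes, 2, 2)
# Pages 4..7 have holes in the mapping: no promotion possible.
assert not can_promote(ptes, 4, 2)
```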

Unplug is also interesting from a power management point of view.
Powering off is still more attractive than self-refresh, for example, but
could also be used at run-time depending on the workload.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  4:06       ` Christoph Lameter
@ 2007-03-02  4:21         ` Nick Piggin
  2007-03-02  4:31           ` Christoph Lameter
  2007-03-02  4:29         ` Andrew Morton
  1 sibling, 1 reply; 104+ messages in thread
From: Nick Piggin @ 2007-03-02  4:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Thu, Mar 01, 2007 at 08:06:25PM -0800, Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Nick Piggin wrote:
> 
> > > I would say that anti-frag / defrag enables memory unplug.
> > 
> > Well that really depends. If you want to have any sort of guaranteed
> > amount of unplugging or shrinking (or hugepage allocating), then antifrag
> > doesn't work because it is a heuristic.
> 
> We would need additional measures such as real defrag and making more 
> structures movable.
> 
> > One thing that worries me about anti-fragmentation is that people might
> > actually start _using_ higher order pages in the kernel. Then fragmentation
> > comes back, and it's worse because now it is not just the fringe hugepage or
> > unplug users (who can anyway work around the fragmentation by allocating
> > from reserve zones).
> 
> Yes, we (SGI) need exactly that: Use of higher order pages in the kernel 
> in order to reduce overhead of managing page structs for large I/O and 
> large memory applications. We need appropriate measures to deal with the 
> fragmentation problem.

I don't understand why, out of any architecture, ia64 would have to hack
around this in software :(

> > > That's a value judgement that I doubt. Zone-based balancing is bad and has 
> > > been repeatedly patched up so that it works with the usual loads.
> > 
> > Shouldn't we fix it instead of deciding it is broken and add another layer
> > on top that supposedly does better balancing?
> 
> We need to reduce the real hardware zones as much as possible. Most high 
> performance architectures, for example, have no need for additional DMA zones
> and do not have to deal with the complexities that arise there.

And then you want to add something else on top of them?

> > But just because zones are hardware _now_ doesn't mean they have to stay
> > that way. The upshot is that a lot of work for zones is already there.
> 
> Well you cannot get there without the nodes. The control of memory 
> allocations with user space support etc only comes with the nodes.
> 
> > > A. moveable/unmovable
> > > B. DMA restrictions
> > > C. container assignment.
> > 
> > There are alternatives to adding a new layer of virtual zones. We could try
> > using zones, even.
> 
> No, merge them into one thing and handle them as one. No difference between 
> zones and nodes anymore.
>  
> > zones aren't perfect right now, but they are quite similar to what you
> > want (ie. blocks of memory). I think we should first try to generalise what
> > we have rather than adding another layer.
> 
> Yes that would mean merging nodes and zones. So "nones".

Yes, this is what Andrew just said. But you then wanted to add virtual zones
or something on top. I just don't understand why. You agree that merging
nodes and zones is a good idea. Did I miss the important post where some
bright person discovered why merging zones and "virtual zones" is a bad
idea?



* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  4:06       ` Christoph Lameter
  2007-03-02  4:21         ` Nick Piggin
@ 2007-03-02  4:29         ` Andrew Morton
  2007-03-02  4:33           ` Christoph Lameter
  1 sibling, 1 reply; 104+ messages in thread
From: Andrew Morton @ 2007-03-02  4:29 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Mel Gorman, mingo, jschopp, arjan, torvalds, mbligh,
	linux-mm, linux-kernel

On Thu, 1 Mar 2007 20:06:25 -0800 (PST) Christoph Lameter <clameter@engr.sgi.com> wrote:

> No, merge them into one thing and handle them as one. No difference between 
> zones and nodes anymore.

Sorry, but this is crap.  zones and nodes are distinct, physical concepts
and you're kidding yourself if you think you can somehow fudge things to make
one of them just go away.

Think: ZONE_DMA32 on an Opteron machine.  I don't think there is a sane way
in which we can fudge away the distinction between
bus-addresses-which-have-the-32-upper-bits-zero and
memory-which-is-local-to-each-socket.

No matter how hard those hands are waving.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  4:21         ` Nick Piggin
@ 2007-03-02  4:31           ` Christoph Lameter
  2007-03-02  5:06             ` Nick Piggin
  0 siblings, 1 reply; 104+ messages in thread
From: Christoph Lameter @ 2007-03-02  4:31 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, Nick Piggin wrote:

> > Yes, we (SGI) need exactly that: Use of higher order pages in the kernel 
> > in order to reduce overhead of managing page structs for large I/O and 
> > large memory applications. We need appropriate measures to deal with the 
> > fragmentation problem.
> 
> I don't understand why, out of any architecture, ia64 would have to hack
> around this in software :(

Ummm... We have x86_64 platforms with the 4k page problem. 4k pages are 
very useful for the large number of small files that are around. But for 
the large streams of data you would want other methods of handling these.

If I want to write 1 terabyte (2^50) to disk then the I/O subsystem has 
to handle 2^(50-12) = 2^38 = 256 million page structs! This limits I/O 
bandwidths and leads to huge scatter-gather lists (and we are limited in 
terms of the number of items on those lists in many drivers). Our future 
platforms have up to several petabytes of memory. There needs to be some 
way to handle these capacities in an efficient way. We cannot wait 
an hour for the terabyte to reach the disk.
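
For reference, the raw arithmetic behind these figures, for both
readings (1 terabyte is 2^40 bytes and 2^50 is a petabyte, which is
where the million/billion discrepancy later in the thread comes from):

```python
# Page struct counts for a given amount of memory with 4KiB (2^12) pages.
PAGE_SHIFT = 12

def page_structs(bytes_shift):
    """Number of page structs needed to cover 2^bytes_shift bytes."""
    return 1 << (bytes_shift - PAGE_SHIFT)

# 1 terabyte = 2^40 bytes -> 2^28 = 256Mi page structs ("256 million")
assert page_structs(40) == 1 << 28
# 1 petabyte = 2^50 bytes -> 2^38 page structs, roughly 256 "binary
# billion", the figure that matches the 2^(50-12) written in the mail
assert page_structs(50) == 1 << 38
```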
 
> > We need to reduce the real hardware zones as much as possible. Most high 
> > performance architectures, for example, have no need for additional DMA zones
> > and do not have to deal with the complexities that arise there.
> 
> And then you want to add something else on top of them?

zones are basically managing a number of MAX_ORDER chunks. The adding of 
something here is dealing with the categorization of these MAX_ORDER 
chunks in order to ensure movability and thus defragmentability of
most of them. Or the upper layer may limit the number of those chunks
assigned to a certain container.

> > Yes that would mean merging nodes and zones. So "nones".
> 
> Yes, this is what Andrew just said. But you then wanted to add virtual zones
> or something on top. I just don't understand why. You agree that merging
> nodes and zones is a good idea. Did I miss the important post where some
> bright person discovered why merging zones and "virtual zones" is a bad
> idea?

Hmmm.. I usually talk about the "virtual zones" as virtual nodes. But we 
are basically at the same point there. Node level controls and APIs exist and 
can even be used from user space. A container could just be a special node 
and then the allocations to this container could be controlled via the 
existing APIs.

A virtual zone/node would be assigned a number of MAX_ORDER blocks from 
real zones/nodes. Then it may hopefully be managed like a real node. In 
the original zone/node these MAX_ORDER blocks would show up as 
unavailable. The "upper" layer therefore is the existing node/zone layer. 
The virtual zones/nodes just steal memory from the real ones.




* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  4:29         ` Andrew Morton
@ 2007-03-02  4:33           ` Christoph Lameter
  2007-03-02  4:58             ` Andrew Morton
  0 siblings, 1 reply; 104+ messages in thread
From: Christoph Lameter @ 2007-03-02  4:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Mel Gorman, mingo, jschopp, arjan, torvalds, mbligh,
	linux-mm, linux-kernel

On Thu, 1 Mar 2007, Andrew Morton wrote:

> Sorry, but this is crap.  zones and nodes are distinct, physical concepts
> and you're kidding yourself if you think you can somehow fudge things to make
> one of them just go away.
> 
> Think: ZONE_DMA32 on an Opteron machine.  I don't think there is a sane way
> in which we can fudge away the distinction between
> bus-addresses-which-have-the-32-upper-bits-zero and
> memory-which-is-local-to-each-socket.

Of course you can. Add a virtual DMA and DMA32 zone/node and extract the 
relevant memory from the base zone/node.



* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  4:33           ` Christoph Lameter
@ 2007-03-02  4:58             ` Andrew Morton
  0 siblings, 0 replies; 104+ messages in thread
From: Andrew Morton @ 2007-03-02  4:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Mel Gorman, mingo, jschopp, arjan, torvalds, mbligh,
	linux-mm, linux-kernel

On Thu, 1 Mar 2007 20:33:04 -0800 (PST) Christoph Lameter <clameter@engr.sgi.com> wrote:

> On Thu, 1 Mar 2007, Andrew Morton wrote:
> 
> > Sorry, but this is crap.  zones and nodes are distinct, physical concepts
> > and you're kidding yourself if you think you can somehow fudge things to make
> > one of them just go away.
> > 
> > Think: ZONE_DMA32 on an Opteron machine.  I don't think there is a sane way
> > in which we can fudge away the distinction between
> > bus-addresses-which-have-the-32-upper-bits-zero and
> > memory-which-is-local-to-each-socket.
> 
> Of course you can. Add a virtual DMA and DMA32 zone/node and extract the 
> relevant memory from the base zone/node.

You're using terms which I've never seen described anywhere.

Please, just stop here.  Give us a complete design proposal which we can
understand and review.



* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  4:31           ` Christoph Lameter
@ 2007-03-02  5:06             ` Nick Piggin
  2007-03-02  5:40               ` Christoph Lameter
  2007-03-02  5:50               ` Christoph Lameter
  0 siblings, 2 replies; 104+ messages in thread
From: Nick Piggin @ 2007-03-02  5:06 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Thu, Mar 01, 2007 at 08:31:24PM -0800, Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Nick Piggin wrote:
> 
> > > Yes, we (SGI) need exactly that: Use of higher order pages in the kernel 
> > > in order to reduce overhead of managing page structs for large I/O and 
> > > large memory applications. We need appropriate measures to deal with the 
> > > fragmentation problem.
> > 
> > I don't understand why, out of any architecture, ia64 would have to hack
> > around this in software :(
> 
> Ummm... We have x86_64 platforms with the 4k page problem. 4k pages are 
> very useful for the large number of small files that are around. But for 
> the large streams of data you would want other methods of handling these.
> 
> If I want to write 1 terabyte (2^50) to disk then the I/O subsystem has 
> to handle 2^(50-12) = 2^38 = 256 million page structs! This limits I/O 
> bandwidths and leads to huge scatter-gather lists (and we are limited in 
> terms of the number of items on those lists in many drivers). Our future 
> platforms have up to several petabytes of memory. There needs to be some 
> way to handle these capacities in an efficient way. We cannot wait 
> an hour for the terabyte to reach the disk.

I guess you mean 256 billion page structs.

So what do you mean by efficient? I guess you aren't talking about CPU
efficiency, because even if you make the IO subsystem submit larger
physical IOs, you still have to deal with 256 billion TLB entries, the
pagecache has to deal with 256 billion struct pages, so does the
filesystem code to build the bios.

So you are having problems with your IO controller's handling of sg
lists?




* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  3:59         ` Andrew Morton
@ 2007-03-02  5:11           ` Linus Torvalds
  2007-03-02  5:50             ` KAMEZAWA Hiroyuki
  2007-03-02 16:20             ` Mark Gross
  0 siblings, 2 replies; 104+ messages in thread
From: Linus Torvalds @ 2007-03-02  5:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Balbir Singh, Mel Gorman, npiggin, clameter, mingo, jschopp,
	arjan, mbligh, linux-mm, linux-kernel



On Thu, 1 Mar 2007, Andrew Morton wrote:
>
> On Thu, 1 Mar 2007 19:44:27 -0800 (PST) Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > In other words, I really don't see a huge upside. I see *lots* of 
> > downsides, but upsides? Not so much. Almost everybody who wants unplug 
> > wants virtualization, and right now none of the "big virtualization" 
> > people would want to have kernel-level anti-fragmentation anyway since 
> > they'd need to do it on their own.
> 
> Agree with all that, but you're missing the other application: power
> saving.  FBDIMMs take eight watts a pop.

This is a hardware problem. Let's see how long it takes for Intel to 
realize that FBDIMM's were a hugely bad idea from a power perspective.

Yes, the same issues exist for other DRAM forms too, but to a *much* 
smaller degree.

Also, IN PRACTICE you're never ever going to see this anyway. Almost 
everybody wants bank interleaving, because it's a huge performance win on 
many loads. That, in turn, means that your memory will be spread out over 
multiple DIMM's even for a single page, much less any bigger area.
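
The interleaving argument can be illustrated with a simplified
address-to-DIMM model (a classic low-order cacheline interleave, not
any real memory controller):

```python
# With cacheline-granular interleave across DIMMs, even a single 4KiB
# page is spread over every DIMM, so no individual DIMM is ever idle
# enough to power down while any memory at all is in use.

CACHELINE = 64
DIMMS = 4

def dimm_of(addr):
    # low-order interleave: consecutive cachelines rotate through DIMMs
    return (addr // CACHELINE) % DIMMS

page_base = 0x1000
lines = range(page_base, page_base + 4096, CACHELINE)
dimms_touched = {dimm_of(a) for a in lines}

# One page already touches all four DIMMs.
assert dimms_touched == set(range(DIMMS))
```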

In other words - forget about DRAM power savings. It's not realistic. And 
if you want low-power, don't use FBDIMM's. It really *is* that simple.

(And yes, maybe FBDIMM controllers in a few years won't use 8 W per 
buffer. I kind of doubt that, since FBDIMM fairly fundamentally is highish 
voltage swings at high frequencies.)

Also, on a *truly* idle system, we'll see the power savings whatever we 
do, because the working set will fit in D$, and to get those DRAM power 
savings in reality you need to have the DRAM controller shut down on its 
own anyway (ie sw would only help a bit).

The whole DRAM power story is a bedtime story for gullible children. Don't 
fall for it. It's not realistic. The hardware support for it DOES NOT 
EXIST today, and probably won't for several years. And the real fix is 
elsewhere anyway (ie people will have to do a FBDIMM-2 interface, which 
is against the whole point of FBDIMM in the first place, but that's what 
you get when you ignore power in the first version!).

		Linus


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  3:44       ` Linus Torvalds
  2007-03-02  3:59         ` Andrew Morton
  2007-03-02  4:18         ` Balbir Singh
@ 2007-03-02  5:13         ` Jeremy Fitzhardinge
  2007-03-06  4:16         ` Paul Mackerras
  3 siblings, 0 replies; 104+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-02  5:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Balbir Singh, Andrew Morton, Mel Gorman, npiggin, clameter,
	mingo, jschopp, arjan, mbligh, linux-mm, linux-kernel

Linus Torvalds wrote:
> Virtualization in general. We don't know what it is - in IBM machines it's 
> a hypervisor. With Xen and VMware, it's usually a hypervisor too. With 
> KVM, it's obviously a host Linux kernel/user-process combination.
>
> The point being that in the guests, hotunplug is almost useless (for 
> bigger ranges), and we're much better off just telling the virtualization 
> hosts on a per-page level whether we care about a page or not, than to 
> worry about fragmentation.
>
> And in hosts, we usually don't care EITHER, since it's usually done in a 
> hypervisor.
>   

The paravirt_ops patches I just posted implement all the machinery
required to create a pseudo-physical to machine address mapping under
the kernel.  This is used under Xen because it directly exposes the
pagetables to its guests, but there's no reason why you couldn't use
this layer to implement the same mapping without an underlying
hypervisor.  This allows the kernel to see a normal linear "physical"
address space which is in fact mapped over a discontiguous set of
machine ("real physical") pages.

Andrew and I discussed using it for a kdump kernel, so that you could
load it into a random bunch of pages, and set things up so that it sees
itself as being contiguous.

The mapping is pretty simple.  It intercepts __pte (__pmd, etc) to map
the "physical" page to the real machine page, and pte_val does the
reverse mapping.
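A minimal user-space sketch of that mapping (hypothetical names and table contents; make_pte/pte_phys stand in for the kernel's __pte/pte_val hooks, and the real p2m table is populated by the hypervisor or boot code):

```c
#include <assert.h>
#include <stdint.h>

#define NPAGES     8
#define PAGE_SHIFT 12
#define PFN_MASK   0xfffff000u

/* Pseudo-physical frame -> machine frame, and the reverse map.
 * The contents here are made up for illustration. */
static uint32_t p2m[NPAGES] = { 7, 3, 5, 0, 2, 6, 1, 4 };
static uint32_t m2p[NPAGES];

static void build_m2p(void)
{
    for (uint32_t pfn = 0; pfn < NPAGES; pfn++)
        m2p[p2m[pfn]] = pfn;
}

/* __pte()-style hook: the kernel hands us a "physical" address; we
 * emit a pte holding the machine frame plus the low flag bits. */
static uint32_t make_pte(uint32_t phys, uint32_t flags)
{
    uint32_t pfn = phys >> PAGE_SHIFT;
    return (p2m[pfn] << PAGE_SHIFT) | (flags & ~PFN_MASK);
}

/* pte_val()-style hook: reverse the mapping, so the kernel sees its
 * linear pseudo-physical address space again. */
static uint32_t pte_phys(uint32_t pte)
{
    uint32_t mfn = pte >> PAGE_SHIFT;
    return (m2p[mfn] << PAGE_SHIFT) | (pte & ~PFN_MASK);
}
```

The round trip pte_phys(make_pte(p, f)) returns p with flags f, so the kernel never observes that its "physical" pages are scattered.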

You could implement this today as a fairly simple, thin paravirt_ops
backend.  The main tricky part is making sure all the device drivers are
correct in using bus addresses (which are mapped to real machine
addresses), and that they don't assume that adjacent kernel virtual
pages are physically adjacent.

    J


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  5:06             ` Nick Piggin
@ 2007-03-02  5:40               ` Christoph Lameter
  2007-03-02  5:49                 ` Nick Piggin
  2007-03-02  5:50               ` Christoph Lameter
  1 sibling, 1 reply; 104+ messages in thread
From: Christoph Lameter @ 2007-03-02  5:40 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, Nick Piggin wrote:

> So what do you mean by efficient? I guess you aren't talking about CPU
> efficiency, because even if you make the IO subsystem submit larger
> physical IOs, you still have to deal with 256 billion TLB entries, the
> pagecache has to deal with 256 billion struct pages, so does the
> filesystem code to build the bios.

You do not have to deal with TLB entries if you do buffered I/O.

For mmapped I/O you would want to transparently use 2M TLBs if the 
page size is large.

> So you are having problems with your IO controller's handling of sg
> lists?

We currently have problems with the kernel limits of 128 SG 
entries but the fundamental issue is that we can only do 2 Meg of I/O in 
one go given the default limits of the block layer. Typically the number 
of hardware SG entries is also limited. We never will be able to put a
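The limits quoted above can be put into a tiny model (a sketch; the 128-entry and 2 Meg figures are from this message, and it assumes the worst case of one 4k page per scatter-gather entry):

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on bytes per request: whichever bites first, the
 * scatter-gather entry count (worst case one page per entry) or the
 * block layer's per-request byte cap. */
static uint64_t max_request_bytes(uint64_t sg_entries, uint64_t page_size,
                                  uint64_t blk_cap_bytes)
{
    uint64_t by_sg = sg_entries * page_size;
    return by_sg < blk_cap_bytes ? by_sg : blk_cap_bytes;
}

/* Requests needed to move a given amount of data, rounding up. */
static uint64_t requests_needed(uint64_t total_bytes, uint64_t req_bytes)
{
    return (total_bytes + req_bytes - 1) / req_bytes;
}
```

With 4 KiB pages and 128 SG entries a request tops out at 512 KiB before the ~2 MiB block-layer cap even matters; with 2 MiB compound pages a single entry already covers the cap.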





* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  5:40               ` Christoph Lameter
@ 2007-03-02  5:49                 ` Nick Piggin
  2007-03-02  5:53                   ` Christoph Lameter
  0 siblings, 1 reply; 104+ messages in thread
From: Nick Piggin @ 2007-03-02  5:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Thu, Mar 01, 2007 at 09:40:45PM -0800, Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Nick Piggin wrote:
> 
> > So what do you mean by efficient? I guess you aren't talking about CPU
> > efficiency, because even if you make the IO subsystem submit larger
> > physical IOs, you still have to deal with 256 billion TLB entries, the
> > pagecache has to deal with 256 billion struct pages, so does the
> > filesystem code to build the bios.
> 
> You do not have to deal with TLB entries if you do buffered I/O.

Where does the data come from?

> For mmapped I/O you would want to transparently use 2M TLBs if the 
> page size is large.
> 
> > So you are having problems with your IO controller's handling of sg
> > lists?
> 
> We currently have problems with the kernel limits of 128 SG 
> entries but the fundamental issue is that we can only do 2 Meg of I/O in 
> one go given the default limits of the block layer. Typically the number 
> of hardware SG entrie is also limited. We never will be able to put a 

Seems like changing the default limits would be the easiest way to
fix it then?

As far as hardware limits go, I don't think you need to scale that
number linearly with the amount of memory you have, or even with the
IO throughput. You should reach a point where your command overhead
is amortised sufficiently, and the controller will be pipelining the
commands.
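The amortisation argument can be illustrated with a toy model (the microsecond figures are invented; only the shape of the curve matters): once per-command setup cost is small relative to transfer time, allowing more SG entries per command stops helping.

```c
#include <assert.h>

/* Toy model of command-overhead amortisation: each command costs a
 * fixed setup time plus transfer time proportional to its payload.
 * Efficiency = useful transfer time / total time. */
static double sg_efficiency(double setup_us, double us_per_page,
                            int pages_per_cmd)
{
    double transfer = us_per_page * pages_per_cmd;
    return transfer / (transfer + setup_us);
}
```

At 10 us setup and 1 us per page, 128 pages per command already gives ~93% efficiency; going to 1024 pages gains only a few more points, which is why the SG entry count need not scale with memory size.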



* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  5:06             ` Nick Piggin
  2007-03-02  5:40               ` Christoph Lameter
@ 2007-03-02  5:50               ` Christoph Lameter
  1 sibling, 0 replies; 104+ messages in thread
From: Christoph Lameter @ 2007-03-02  5:50 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, Nick Piggin wrote:

> So what do you mean by efficient? I guess you aren't talking about CPU
> efficiency, because even if you make the IO subsystem submit larger
> physical IOs, you still have to deal with 256 billion TLB entries, the
> pagecache has to deal with 256 billion struct pages, so does the
> filesystem code to build the bios.

Re the page cache: It also needs to be able to handle large page sizes, of 
course. Scanning gazillions of page structs in vmscan.c will make the 
system slow as a dog. The number of page structs needs to be drastically 
reduced for large I/O. I think this can be done by allowing compound 
pages to be handled throughout the VM. The defrag issue then becomes very 
pressing indeed.

We have discussed the idea of going to a kernel with a 2M base page size on 
x86_64, but that step is a bit drastic and the overhead for small files 
would be tremendous.

Support for compound pages already exists in the page allocator and the 
slab allocator. Maybe we could extend that support to the I/O subsystem? 
We would also then have more contiguous writes which will further speed up 
I/O efficiency.
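The scale being argued about here is easy to put numbers on (a sketch; the 64-byte struct page size is an assumption, as it varies by kernel config):

```c
#include <assert.h>
#include <stdint.h>

/* struct pages needed to describe mem_bytes of memory at a given
 * base page size. */
static uint64_t npage_structs(uint64_t mem_bytes, uint64_t page_size)
{
    return mem_bytes / page_size;
}

/* Bytes of metadata those struct pages themselves occupy. */
static uint64_t metadata_bytes(uint64_t mem_bytes, uint64_t page_size,
                               uint64_t struct_page_size)
{
    return npage_structs(mem_bytes, page_size) * struct_page_size;
}
```

One petabyte at 4 KiB pages is 2^38 (the "256 billion" figure in this thread) page structs, or 16 TiB of metadata at 64 bytes each; treating memory in 2 MiB compound units cuts the count by a factor of 512.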



* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  5:11           ` Linus Torvalds
@ 2007-03-02  5:50             ` KAMEZAWA Hiroyuki
  2007-03-02  6:15               ` Paul Mundt
  2007-03-02 16:20             ` Mark Gross
  1 sibling, 1 reply; 104+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-03-02  5:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: akpm, balbir, mel, npiggin, clameter, mingo, jschopp, arjan,
	mbligh, linux-mm, linux-kernel

On Thu, 1 Mar 2007 21:11:58 -0800 (PST)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> The whole DRAM power story is a bedtime story for gullible children. Don't 
> fall for it. It's not realistic. The hardware support for it DOES NOT 
> EXIST today, and probably won't for several years. And the real fix is 
> elsewhere anyway (ie people will have to do a FBDIMM-2 interface, which 
> is against the whole point of FBDIMM in the first place, but that's what 
> you get when you ignore power in the first version!).
> 

At first, we have memory hot-add now. So I want to implement hot-removing of
hot-added memory, at least. (In this case, we don't have to write invasive
patches to the memory-init core.)

Our (Fujitsu's) product, an ia64 NUMA server, has a feature to offline memory.
It supports dynamic reconfiguration of nodes, i.e. node hotplug.

But there is no *shipped* firmware for hotplug yet. RHEL4 couldn't boot on
such hotplug-supporting firmware... so the firmware team was in no hurry.
It will be shipped after RHEL5 comes.
IMHO, firmware which supports memory hot-add is ready to support memory
hot-remove if the OS can handle it.

Note:
I heard embedded people often design their own memory-power-off control on
embedded Linux (but it never seems to be posted to the list). But I don't know
whether they are interested in generic memory hot-remove or not.

Thanks,
-Kame





* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  5:49                 ` Nick Piggin
@ 2007-03-02  5:53                   ` Christoph Lameter
  2007-03-02  6:08                     ` Nick Piggin
  0 siblings, 1 reply; 104+ messages in thread
From: Christoph Lameter @ 2007-03-02  5:53 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, Nick Piggin wrote:

> > You do not have to deal with TLB entries if you do buffered I/O.
> 
> Where does the data come from?

From the I/O controller and from the application.

> > We currently have problems with the kernel limits of 128 SG 
> > entries but the fundamental issue is that we can only do 2 Meg of I/O in 
> > one go given the default limits of the block layer. Typically the number 
> > of hardware SG entrie is also limited. We never will be able to put a 
> 
> Seems like changing the default limits would be the easiest way to
> fix it then?

This would only be a temporary fix, pushing the limits to double or so?
 
> As far as hardware limits go, I don't think you need to scale that
> number linearly with the amount of memory you have, or even with the
> IO throughput. You should reach a point where your command overhead
> is amortised sufficiently, and the controller will be pipelining the
> commands.

Amortized? The controller still would have to hunt down the 4kb page 
pieces that we have to feed him right now. Result: Huge scatter gather 
lists that may themselves create issues with higher page order.



* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  5:53                   ` Christoph Lameter
@ 2007-03-02  6:08                     ` Nick Piggin
  2007-03-02  6:19                       ` Christoph Lameter
  0 siblings, 1 reply; 104+ messages in thread
From: Nick Piggin @ 2007-03-02  6:08 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Thu, Mar 01, 2007 at 09:53:42PM -0800, Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Nick Piggin wrote:
> 
> > > You do not have to deal with TLB entries if you do buffered I/O.
> > 
> > Where does the data come from?
> 
> From the I/O controller and from the application.

Why doesn't the application need to deal with TLB entries?


> > > We currently have problems with the kernel limits of 128 SG 
> > > entries but the fundamental issue is that we can only do 2 Meg of I/O in 
> > > one go given the default limits of the block layer. Typically the number 
> > > of hardware SG entrie is also limited. We never will be able to put a 
> > 
> > Seems like changing the default limits would be the easiest way to
> > fix it then?
> 
> This would only be a temporary fix, pushing the limits to double or so?

And using slightly larger page sizes isn't?

> > As far as hardware limits go, I don't think you need to scale that
> > number linearly with the amount of memory you have, or even with the
> > IO throughput. You should reach a point where your command overhead
> > is amortised sufficiently, and the controller will be pipelining the
> > commands.
> 
> Amortized? The controller still would have to hunt down the 4kb page 
> pieces that we have to feed him right now. Result: Huge scatter gather 
> lists that may themselves create issues with higher page order.

What sort of numbers do you have for these controllers that aren't
very good at doing sg?

Wasn't the issue something like your IO controllers have only a
limited number of sg entries, which is fine with 16K pages, but with
4K pages that doesn't give enough data to cover your RAID stripe?

We're never going to do a variable sized pagecache just because of that.



* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  5:50             ` KAMEZAWA Hiroyuki
@ 2007-03-02  6:15               ` Paul Mundt
  2007-03-02 17:01                 ` Mel Gorman
  0 siblings, 1 reply; 104+ messages in thread
From: Paul Mundt @ 2007-03-02  6:15 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Linus Torvalds, akpm, balbir, mel, npiggin, clameter, mingo,
	jschopp, arjan, mbligh, linux-mm, linux-kernel

On Fri, Mar 02, 2007 at 02:50:29PM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 1 Mar 2007 21:11:58 -0800 (PST)
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > The whole DRAM power story is a bedtime story for gullible children. Don't 
> > fall for it. It's not realistic. The hardware support for it DOES NOT 
> > EXIST today, and probably won't for several years. And the real fix is 
> > elsewhere anyway (ie people will have to do a FBDIMM-2 interface, which 
> > is against the whole point of FBDIMM in the first place, but that's what 
> > you get when you ignore power in the first version!).
> > 
> 
> Note:
> I heard embedded people often design their own memory-power-off control on
> embedded Linux (but it never seems to be posted to the list). But I don't know
> whether they are interested in generic memory hot-remove or not.
> 
Yes, this is not that uncommon of a thing. People tend to do this in a
couple of different ways. In some cases the system is too loaded to ever
make doing such a thing at run-time worthwhile, and in those cases these
sorts of things tend to be munged in with the suspend code. Unfortunately
it tends to be quite difficult in practice to keep pages in one place,
so people rely instead on lame chip-select hacks and on limiting the amount
of memory that the kernel treats as RAM, so it never ends up being an
issue. Having some sort of a balance would certainly be nice, though.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  6:08                     ` Nick Piggin
@ 2007-03-02  6:19                       ` Christoph Lameter
  2007-03-02  6:29                         ` Nick Piggin
  0 siblings, 1 reply; 104+ messages in thread
From: Christoph Lameter @ 2007-03-02  6:19 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, Nick Piggin wrote:

> > From the I/O controller and from the application.
> 
> Why doesn't the application need to deal with TLB entries?

Because it may only operate on a small section of the file and hopefully 
splice the rest through? But yes support for mmapped I/O would be 
necessary.

> > This would only be a temporary fix, pushing the limits to double or so?
> 
> And using slightly larger page sizes isn't?

There was no talk about slightly. 1G page size would actually be quite 
convenient for some applications.

> > Amortized? The controller still would have to hunt down the 4kb page 
> > pieces that we have to feed him right now. Result: Huge scatter gather 
> > lists that may themselves create issues with higher page order.
> 
> What sort of numbers do you have for these controllers that aren't
> very good at doing sg?

Writing a terabyte of memory to disk with handling 256 billion page 
structs? In case of a system with 1 petabyte of memory this may be rather 
typical and necessary for the application to be able to save its state
on disk.

> Wasn't the issue something like your IO controllers have only a
> limited number of sg entries, which is fine with 16K pages, but with
> 4K pages that doesn't give enough data to cover your RAID stripe?
> 
> We're never going to do a variable sized pagecache just because of that.

No, we need support for larger page sizes than 16k. 16k has not been fine 
for a couple of years. We only agreed to 16k because that was the common 
consensus. Best performance was always at 64k 4 years ago (but then we 
have no numbers for higher page sizes yet). Now we would prefer much 
larger sizes.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  6:19                       ` Christoph Lameter
@ 2007-03-02  6:29                         ` Nick Piggin
  2007-03-02  6:51                           ` Christoph Lameter
  0 siblings, 1 reply; 104+ messages in thread
From: Nick Piggin @ 2007-03-02  6:29 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Thu, Mar 01, 2007 at 10:19:48PM -0800, Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Nick Piggin wrote:
> 
> > > From the I/O controller and from the application.
> > 
> > Why doesn't the application need to deal with TLB entries?
> 
> Because it may only operate on a small section of the file and hopefully 
> splice the rest through? But yes support for mmapped I/O would be 
> necessary.

So you're talking about copying a file from one location to another?


> > > This would only be a temporary fix, pushing the limits to double or so?
> > 
> > And using slightly larger page sizes isn't?
> 
> There was no talk about slightly. 1G page size would actually be quite 
> convenient for some applications.

But it is far from convenient for the kernel. So we have hugepages, so
we can stay out of the hair of those applications and they can stay out
of ours.

> > > Amortized? The controller still would have to hunt down the 4kb page 
> > > pieces that we have to feed him right now. Result: Huge scatter gather 
> > > lists that may themselves create issues with higher page order.
> > 
> > What sort of numbers do you have for these controllers that aren't
> > very good at doing sg?
> 
> Writing a terabyte of memory to disk with handling 256 billion page 
> structs? In case of a system with 1 petabyte of memory this may be rather 
> typical and necessary for the application to be able to save its state
> on disk.

But you will have newer IO controllers, faster CPUs...

Is it a problem or isn't it? Waving around the 256 billion number isn't
impressive because it doesn't really say anything.

> > Wasn't the issue something like your IO controllers have only a
> > limited number of sg entries, which is fine with 16K pages, but with
> > 4K pages that doesn't give enough data to cover your RAID stripe?
> > 
> > We're never going to do a variable sized pagecache just because of that.
> 
> No, we need support for larger page sizes than 16k. 16k has not been fine 
> for a couple of years. We only agreed to 16k because that was the common 
> consensus. Best performance was always at 64k 4 years ago (but then we 
> have no numbers for higher page sizes yet). Now we would prefer much 
> larger sizes.

But you are in a tiny minority, so it is not so much a question of what
you prefer, but what you can make do with without being too intrusive.

I understand you have controllers (or maybe it is a block layer limit)
that doesn't work well with 4K pages, but works OK with 16K pages.
This is not something that we would introduce variable sized pagecache
for, surely.



* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  6:29                         ` Nick Piggin
@ 2007-03-02  6:51                           ` Christoph Lameter
  2007-03-02  7:03                             ` Andrew Morton
  2007-03-02  7:19                             ` Nick Piggin
  0 siblings, 2 replies; 104+ messages in thread
From: Christoph Lameter @ 2007-03-02  6:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, Nick Piggin wrote:

> > There was no talk about slightly. 1G page size would actually be quite 
> > convenient for some applications.
> 
> But it is far from convenient for the kernel. So we have hugepages, so
> we can stay out of the hair of those applications and they can stay out
> of ours.

Huge pages cannot do I/O so we would get back to the gazillions of pages 
to be handled for I/O. I'd love to have I/O support for huge pages. This 
would address some of the issues.

> > Writing a terabyte of memory to disk with handling 256 billion page 
> > structs? In case of a system with 1 petabyte of memory this may be rather 
> > typical and necessary for the application to be able to save its state
> > on disk.
> 
> But you will have newer IO controllers, faster CPUs...

Sure we will. And you believe that the newer controllers will be able 
to magically shrink the SG lists somehow? We will offload the 
coalescing of the page structs into bios in hardware or some such thing? 
And the vmscans etc too?

> Is it a problem or isn't it? Waving around the 256 billion number isn't
> impressive because it doesn't really say anything.

It is the number of items that needs to be handled by the I/O layer and 
likely by the SG engine.
 
> I understand you have controllers (or maybe it is a block layer limit)
> that doesn't work well with 4K pages, but works OK with 16K pages.

Really? This is the first that I have heard about it.

> This is not something that we would introduce variable sized pagecache
> for, surely.

I am not sure where you get the idea that this is the sole reason why we 
need to be able to handle larger contiguous chunks of memory.

How about coming up with a response to the issue at hand? How do I write 
back 1 Terabyte effectively? Ok this may be an exotic configuration today 
but in one year this may be much more common. Memory sizes keep on 
increasing, and so does the number of page structs to be handled for I/O. At 
some point we need a solution here.



* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  6:51                           ` Christoph Lameter
@ 2007-03-02  7:03                             ` Andrew Morton
  2007-03-02  7:19                             ` Nick Piggin
  1 sibling, 0 replies; 104+ messages in thread
From: Andrew Morton @ 2007-03-02  7:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Mel Gorman, mingo, jschopp, arjan, torvalds, mbligh,
	linux-mm, linux-kernel

On Thu, 1 Mar 2007 22:51:00 -0800 (PST) Christoph Lameter <clameter@engr.sgi.com> wrote:

> I'd love to have I/O support for huge pages.

direct-IO works.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  6:51                           ` Christoph Lameter
  2007-03-02  7:03                             ` Andrew Morton
@ 2007-03-02  7:19                             ` Nick Piggin
  2007-03-02  7:44                               ` Christoph Lameter
  1 sibling, 1 reply; 104+ messages in thread
From: Nick Piggin @ 2007-03-02  7:19 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Thu, Mar 01, 2007 at 10:51:00PM -0800, Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Nick Piggin wrote:
> 
> > > There was no talk about slightly. 1G page size would actually be quite 
> > > convenient for some applications.
> > 
> > But it is far from convenient for the kernel. So we have hugepages, so
> > we can stay out of the hair of those applications and they can stay out
> > of ours.
> 
> Huge pages cannot do I/O so we would get back to the gazillions of pages 
> to be handled for I/O. I'd love to have I/O support for huge pages. This 
> would address some of the issues.

Can't direct IO from a hugepage?

> > > Writing a terabyte of memory to disk with handling 256 billion page 
> > > structs? In case of a system with 1 petabyte of memory this may be rather 
> > > typical and necessary for the application to be able to save its state
> > > on disk.
> > 
> > But you will have newer IO controllers, faster CPUs...
> 
> Sure we will. And you believe that the newer controllers will be able 
> to magically shrink the SG lists somehow? We will offload the 
> coalescing of the page structs into bios in hardware or some such thing? 
> And the vmscans etc too?

As far as pagecache page management goes, is that an issue for you?
I don't want to know about how many billions of pages for some operation,
just some profiles.

> > Is it a problem or isn't it? Waving around the 256 billion number isn't
> > impressive because it doesn't really say anything.
> 
> It is the number of items that needs to be handled by the I/O layer and 
> likely by the SG engine.

The number is irrelevant, it is the rate that is important.

> > I understand you have controllers (or maybe it is a block layer limit)
> > that doesn't work well with 4K pages, but works OK with 16K pages.
> 
> Really? This is the first that I have heard about it.
>

Maybe that's the issue you're running into.

> > This is not something that we would introduce variable sized pagecache
> > for, surely.
> 
> I am not sure where you get the idea that this is the sole reason why we 
> need to be able to handle larger contiguous chunks of memory.

I'm not saying that. You brought up this subject of variable sized pagecache.

> How about coming up with a response to the issue at hand? How do I write 
> back 1 Terabyte effectively? Ok this may be an exotic configuration today 
> but in one year this may be much more common. Memory sizes keep on 
> increasing, and so does the number of page structs to be handled for I/O. At 
> some point we need a solution here.

Considering you're just handwaving about the actual problems, I
don't know. I assume you're sitting in front of some workload that has
gone wrong, so can't you elaborate?

Eventually, increasing x86 page size a bit might be an idea. We could even
do it in software if CPU manufacturers don't for us.

That doesn't buy us a great deal if you think there is this huge looming
problem with struct page management though.



* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  7:19                             ` Nick Piggin
@ 2007-03-02  7:44                               ` Christoph Lameter
  2007-03-02  8:12                                 ` Nick Piggin
  0 siblings, 1 reply; 104+ messages in thread
From: Christoph Lameter @ 2007-03-02  7:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, Nick Piggin wrote:

> > Sure we will. And you believe that the newer controllers will be able 
> > to magically shrink the SG lists somehow? We will offload the 
> > coalescing of the page structs into bios in hardware or some such thing? 
> > And the vmscans etc too?
> 
> As far as pagecache page management goes, is that an issue for you?
> I don't want to know about how many billions of pages for some operation,
> just some profiles.

If there are billions of pages in the system and we are allocating and 
deallocating then pages need to be aged. If there are just a few pages 
freeable then we run into issues.

> > > I understand you have controllers (or maybe it is a block layer limit)
> > > that doesn't work well with 4K pages, but works OK with 16K pages.
> > Really? This is the first that I have heard about it.
> Maybe that's the issue you're running into.

Oh, I am running into an issue on a system that does not yet exist? I am 
extrapolating from the problems that we commonly see now. Those will get 
worse as memory increases.

> > > This is not something that we would introduce variable sized pagecache
> > > for, surely.
> > I am not sure where you get the idea that this is the sole reason why we 
> > need to be able to handle larger contiguous chunks of memory.
> I'm not saying that. You brought up this subject of variable sized pagecache.

You keep bringing up the 4k/16k issue into this for some reason. I want 
just the ability to handle large amounts of memory. Larger page sizes are 
a way to accomplish that.

> Eventually, increasing x86 page size a bit might be an idea. We could even
> do it in software if CPU manufacturers don't for us.

A bit? Are we back to the 4k/16k issue? We need to reach 2M at minimum. 
Some way to handle contiguous memory segments of 1GB and larger 
effectively would be great.
  
> That doesn't buy us a great deal if you think there is this huge looming
> problem with struct page management though.

I am not the first one.... See Rik's posts regarding the reasons for his 
new page replacement algorithms.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  7:44                               ` Christoph Lameter
@ 2007-03-02  8:12                                 ` Nick Piggin
  2007-03-02  8:21                                   ` Christoph Lameter
  2007-03-04  1:26                                   ` Rik van Riel
  0 siblings, 2 replies; 104+ messages in thread
From: Nick Piggin @ 2007-03-02  8:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Thu, Mar 01, 2007 at 11:44:05PM -0800, Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Nick Piggin wrote:
> 
> > > Sure we will. And you believe that the newer controllers will be able 
> > > to magically shrink the SG lists somehow? We will offload the 
> > > coalescing of the page structs into bios in hardware or some such thing? 
> > > And the vmscans etc too?
> > 
> > As far as pagecache page management goes, is that an issue for you?
> > I don't want to know about how many billions of pages for some operation,
> > just some profiles.
> 
> If there are billions of pages in the system and we are allocating and 
> deallocating then pages need to be aged. If there are just a few pages 
> freeable then we run into issues.

page writeout and vmscan don't work too badly. What are the issues?

> > > > I understand you have controllers (or maybe it is a block layer limit)
> > > > that doesn't work well with 4K pages, but works OK with 16K pages.
> > > Really? This is the first that I have heard about it.
> > Maybe that's the issue you're running into.
> 
> Oh, I am running into an issue on a system that does not yet exist? I am 
> extrapolating from the problems that we commonly see now. Those will get 
> worse as memory increases.

So what problems do you commonly see now? Some of us here don't
have 4TB of memory, so you actually have to tell us ;)

> > > > This is not something that we would introduce variable sized pagecache
> > > > for, surely.
> > > I am not sure where you get the idea that this is the sole reason why we 
> > > need to be able to handle larger contiguous chunks of memory.
> > I'm not saying that. You brought up this subject of variable sized pagecache.
> 
> You keep bringing up the 4k/16k issue into this for some reason. I want 
> just the ability to handle large amounts of memory. Larger page sizes are 
> a way to accomplish that.

As I said in my other mail to you, Linux runs on systems with 6 orders
of magnitude more struct pages than when it was first created. What's
the problem?

> > Eventually, increasing x86 page size a bit might be an idea. We could even
> > do it in software if CPU manufacturers don't do it for us.
> 
> A bit? Are we back to the 4k/16k issue? We need to reach 2MB at minimum. 
> Some way to handle contiguous memory segments of 1GB and larger 
> effectively would be great.

How did you come up with that 2MB number?

Anyway, we have hugetlbfs for things like that.

> > That doesn't buy us a great deal if you think there is this huge looming
> > problem with struct page management though.
> 
> I am not the first one.... See Rik's posts regarding the reasons for his 
> new page replacement algorithms.

Different issue, isn't it? Rik wants to be smarter in figuring out which
pages to throw away. More work per page == worse for you.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  8:12                                 ` Nick Piggin
@ 2007-03-02  8:21                                   ` Christoph Lameter
  2007-03-02  8:38                                     ` Nick Piggin
  2007-03-04  1:26                                   ` Rik van Riel
  1 sibling, 1 reply; 104+ messages in thread
From: Christoph Lameter @ 2007-03-02  8:21 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, Nick Piggin wrote:

> > If there are billions of pages in the system and we are allocating and 
> > deallocating then pages need to be aged. If there are just a few pages 
> > freeable then we run into issues.
> 
> page writeout and vmscan don't work too badly. What are the issues?

Slowdowns, up to livelocks, with large memory configurations.

> So what problems do you commonly see now? Some of us here don't
> have 4TB of memory, so you actually have to tell us ;)

Oh just run a 32GB SMP system with sparsely freeable pages and lots of 
allocs and frees and you will see it too. E.g. try Linus' tree and mlock 
a large portion of the memory and then see the fun starting. See also 
Rik's list of pathological cases on this.

> How did you come up with that 2MB number?

Huge page size. The only basic choice on x86_64.

> Anyway, we have hugetlbfs for things like that.

Good to know that direct io works.

> > I am not the first one.... See Rik's posts regarding the reasons for his 
> > new page replacement algorithms.
> 
> Different issue, isn't it? Rik wants to be smarter in figuring out which
> pages to throw away. More work per page == worse for you.

Rik is trying to solve the same issue in a different way. He is trying to 
manage a gazillion entries better instead of reducing the number of entries 
to be managed. That can only work in a limited way. Drastic reductions in 
the entries to be managed have good effects in multiple ways: reduced 
management overhead, improved I/O throughput, etc.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  8:21                                   ` Christoph Lameter
@ 2007-03-02  8:38                                     ` Nick Piggin
  2007-03-02 17:09                                       ` Christoph Lameter
  0 siblings, 1 reply; 104+ messages in thread
From: Nick Piggin @ 2007-03-02  8:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Fri, Mar 02, 2007 at 12:21:49AM -0800, Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Nick Piggin wrote:
> 
> > > If there are billions of pages in the system and we are allocating and 
> > > deallocating then pages need to be aged. If there are just a few pages 
> > > freeable then we run into issues.
> > 
> > page writeout and vmscan don't work too badly. What are the issues?
> 
> Slowdowns, up to livelocks, with large memory configurations.
> 
> > So what problems do you commonly see now? Some of us here don't
> > have 4TB of memory, so you actually have to tell us ;)
> 
> Oh just run a 32GB SMP system with sparsely freeable pages and lots of 
> allocs and frees and you will see it too. E.g. try Linus' tree and mlock 
> a large portion of the memory and then see the fun starting. See also 
> Rik's list of pathological cases on this.

Ah, so your problem is lots of unreclaimable pages. There are heaps
of things we can try to reduce the rate at which we scan those.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  1:52 ` Bill Irwin
@ 2007-03-02 10:38   ` Mel Gorman
  2007-03-02 16:31     ` Joel Schopp
  0 siblings, 1 reply; 104+ messages in thread
From: Mel Gorman @ 2007-03-02 10:38 UTC (permalink / raw)
  To: Bill Irwin
  Cc: akpm, npiggin, clameter, mingo, Joel Schopp, arjan, torvalds,
	mbligh, Linux Memory Management List, Linux Kernel Mailing List

On Thu, 1 Mar 2007, Bill Irwin wrote:

> On Thu, Mar 01, 2007 at 10:12:50AM +0000, Mel Gorman wrote:
>> These are figures based on kernel patches with Andy Whitcroft's reclaim
>> patches. You will see that the zone-based kernel is getting success rates
>> closer to 40% as one would expect although there is still something amiss.
>
> Yes, combining the two should do at least as well as either in
> isolation. Are there videos of each of the two in isolation?

Yes. Towards the end of the mail, I give links to all of the images like 
this for example;

elm3b14-vanilla       http://www.skynet.ie/~mel/anti-frag/2007-02-28/elm3b14-vanilla.avi
elm3b14-list-based    http://www.skynet.ie/~mel/anti-frag/2007-02-28/elm3b14-listbased.avi
elm3b14-zone-based    http://www.skynet.ie/~mel/anti-frag/2007-02-28/elm3b14-zonebased.avi
elm3b14-combined      http://www.skynet.ie/~mel/anti-frag/2007-02-28/elm3b14-combined.avi

In the zone-based figures, there are pages there that could be reclaimed, 
but are ignored by page reclaim because watermarks are satisfied.

> Maybe that
> would give someone insight into what's happening.
>
>
> On Thu, Mar 01, 2007 at 10:12:50AM +0000, Mel Gorman wrote:
>> Kernbench Total CPU Time
>
> Oh dear. How do the other benchmarks look?
>

Tell me what other figures you would like to see and I'll generate them. 
Often, kernbench is all people look at for this type of thing.

"Oh dear" implies you think the figures are bad. But on ppc64 and x86_64 
at least, the total CPU times are slightly lower with both 
anti-fragmentation patches - that's not bad. On NUMA-Q (which is no longer 
used or even sold), it's very marginally slower.

These are the AIM9 figures I have

AIM9 Results
                                    Vanilla Kernel  List-based Kernel  Zone-based Kernel  Combined Kernel
Machine     Arch        Test               Seconds            Seconds            Seconds          Seconds
---------   ---------   ---------   --------------  -----------------  -----------------  ---------------
elm3b14     x86-numaq   page_test        115108.30          112955.68          109773.37        108073.65
elm3b14     x86-numaq   brk_test         520593.14          494251.92          496801.07        488141.24
elm3b14     x86-numaq   fork_test          2007.99            2005.66            2011.00          1986.35
elm3b14     x86-numaq   exec_test            57.11              57.15              57.27            57.01
elm3b245    x86_64      page_test        220490.00          218166.67          224371.67        224164.31
elm3b245    x86_64      brk_test        2178186.97         2337110.48         3025495.75       2445733.33
elm3b245    x86_64      fork_test          4854.19            4957.51            4900.03          5001.67
elm3b245    x86_64      exec_test           194.55             196.30             195.55           195.90
gekko-lp1   ppc64       page_test        300368.27          310651.56          300673.33        308720.00
gekko-lp1   ppc64       brk_test        1328895.18         1403448.85         1431489.50       1408263.91
gekko-lp1   ppc64       fork_test          3374.42            3395.00            3367.77          3396.64
gekko-lp1   ppc64       exec_test           152.87             153.12             151.92           153.39
gekko-lp4   ppc64       page_test        291643.06          306906.67          294872.52        303796.03
gekko-lp4   ppc64       brk_test        1322946.18         1366572.24         1378470.25       1403116.15
gekko-lp4   ppc64       fork_test          3326.11            3335.00            3315.56          3333.33
gekko-lp4   ppc64       exec_test           149.01             149.90             149.48           149.87

Many of these are showing performance improvements as well, not 
regressions.

>
> On Thu, Mar 01, 2007 at 10:12:50AM +0000, Mel Gorman wrote:
>> The patches go a long way to making sure that high-order allocations work
>> and particularly that the hugepage pool can be resized once the system has
>> been running. With the clustering of high-order atomic allocations, I have
>> some confidence that allocating contiguous jumbo frames will work even with
>> loads performing lots of IO. I think the videos show how the patches actually
>> work in the clearest possible manner.
>> I am of the opinion that both approaches have their advantages and
>> disadvantages. Given a choice between the two, I prefer list-based
>> because of its flexibility and it should also help high-order kernel
>> allocations. However, by applying both, the disadvantages of list-based are
>> covered and there still appears to be no performance loss as a result. Hence,
>> I'd like to see both merged.  Any opinion on merging these patches into -mm
>> for wider testing?
>
> Exhibiting a workload where the list patch breaks down and the zone
> patch rescues it might help if it's felt that the combination isn't as
> good as lists in isolation. I'm sure one can be dredged up somewhere.

I can't think of a workload that totally makes a mess out of list-based. 
However, list-based makes no guarantees on availability. If a system 
administrator knows they need between 10,000 and 100,000 huge pages and 
doesn't want to waste memory pinning too many huge pages at boot-time, the 
zone-based mechanism would be what they want.
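As an illustrative sketch of that configuration (the boot parameter names here are taken from the zone-based patchset and should be treated as assumptions; the exact spelling may differ between versions):

```shell
# Hypothetical boot-time sizing of ZONE_MOVABLE. "movablecore=" and
# "kernelcore=" are the parameter names proposed by the zone-based
# patches; treat them as illustrative, not final.
#
# On the kernel command line, reserve a movable region big enough for
# the largest huge page pool any job will need, e.g.:
#   kernel /vmlinuz-2.6.20 root=/dev/sda1 movablecore=16G
#
# After boot, the huge page pool can be grown or shrunk at runtime;
# the pages are taken from (and returned to) the movable zone:
echo 8192 > /proc/sys/vm/nr_hugepages
grep -i huge /proc/meminfo
```

The point is that the movable zone only caps the pool; the administrator is not forced to pin the maximum number of huge pages at boot.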

> Either that or someone will eventually spot why the combination doesn't
> get as many available maximally contiguous regions as the list patch.
> By and large I'm happy to see anything go in that inches hugetlbfs
> closer to a backward compatibility wrapper over ramfs.
>

Good to hear

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
       [not found] ` <20070301160915.6da876c5.akpm@linux-foundation.org>
                     ` (2 preceding siblings ...)
  2007-03-02  3:05   ` Christoph Lameter
@ 2007-03-02 13:50   ` Arjan van de Ven
  2007-03-02 15:29   ` Rik van Riel
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 104+ messages in thread
From: Arjan van de Ven @ 2007-03-02 13:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, npiggin, clameter, mingo, jschopp, torvalds, mbligh,
	linux-mm, linux-kernel

On Thu, 2007-03-01 at 16:09 -0800, Andrew Morton wrote:
> And I'd judge that per-container RSS limits are of considerably more value
> than antifrag (in fact per-container RSS might be a superset of antifrag,
> in the sense that per-container RSS and containers could be abused to fix
> the i-cant-get-any-hugepages problem, dunno).


Hi,

The RSS thing is... funky. I'm saying that because we have not been able 
to define what RSS means, so that needs solving before we expand how RSS 
is used.

This is relevant for the pagetable sharing patches: if RSS can exclude
shared, they're relatively easy. If RSS has to include shared always, we
have currently a problem because hugepages aren't part of RSS right now.

I would really, really like to see this ambiguity sorted out at the
concept level before going through massive changes in the code based
on something so fundamentally unclear.

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
       [not found] ` <20070301160915.6da876c5.akpm@linux-foundation.org>
                     ` (3 preceding siblings ...)
  2007-03-02 13:50   ` Arjan van de Ven
@ 2007-03-02 15:29   ` Rik van Riel
  2007-03-02 16:58     ` Andrew Morton
  2007-03-02 16:32   ` Mel Gorman
       [not found]   ` <Pine.LNX.4.64.0703011642190.12485@woody.linux-foundation.org>
  6 siblings, 1 reply; 104+ messages in thread
From: Rik van Riel @ 2007-03-02 15:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, npiggin, clameter, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

Andrew Morton wrote:

> And I'd judge that per-container RSS limits are of considerably more value
> than antifrag (in fact per-container RSS might be a superset of antifrag,
> in the sense that per-container RSS and containers could be abused to fix
> the i-cant-get-any-hugepages problem, dunno).

The RSS bits really worry me, since it looks like they could
exacerbate the scalability problems that we are already running
into on very large memory systems.

Linux is *not* happy on 256GB systems.  Even on some 32GB systems
the swappiness setting *needs* to be tweaked before Linux will even
run in a reasonable way.

Pageout scanning needs to be more efficient, not less.  The RSS
bits are worrisome...

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  5:11           ` Linus Torvalds
  2007-03-02  5:50             ` KAMEZAWA Hiroyuki
@ 2007-03-02 16:20             ` Mark Gross
  2007-03-02 17:07               ` Andrew Morton
  2007-03-02 17:16               ` Linus Torvalds
  1 sibling, 2 replies; 104+ messages in thread
From: Mark Gross @ 2007-03-02 16:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Balbir Singh, Mel Gorman, npiggin, clameter,
	mingo, jschopp, arjan, mbligh, linux-mm, linux-kernel

On Thu, Mar 01, 2007 at 09:11:58PM -0800, Linus Torvalds wrote:
> 
> On Thu, 1 Mar 2007, Andrew Morton wrote:
> >
> > On Thu, 1 Mar 2007 19:44:27 -0800 (PST) Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > 
> > > In other words, I really don't see a huge upside. I see *lots* of 
> > > downsides, but upsides? Not so much. Almost everybody who wants unplug 
> > > wants virtualization, and right now none of the "big virtualization" 
> > > people would want to have kernel-level anti-fragmentation anyway sicne 
> > > they'd need to do it on their own.
> > 
> > Agree with all that, but you're missing the other application: power
> > saving.  FBDIMMs take eight watts a pop.
> 
> This is a hardware problem. Let's see how long it takes for Intel to 
> realize that FBDIMM's were a hugely bad idea from a power perspective.
> 
> Yes, the same issues exist for other DRAM forms too, but to a *much* 
> smaller degree.

DDR3-1333 may be better than FBDIMM's but don't count on it being much
better.

> 
> Also, IN PRACTICE you're never ever going to see this anyway. Almost 
> everybody wants bank interleaving, because it's a huge performance win on 
> many loads. That, in turn, means that your memory will be spread out over 
> multiple DIMM's even for a single page, much less any bigger area.

4-way interleave across banks may not be as common as you might think in
future chipsets.  2-way interleave across DIMMs within a bank will stay.

Also, the performance gains between 2-way and 4-way interleave have been
shown to be hard to measure.  It may be counterintuitive, but it's not the
huge performance win you might expect; at least some of the test cases I've
seen reported showed it to be under the noise floor of the lmbench tests.


> 
> In other words - forget about DRAM power savings. It's not realistic. And 
> if you want low-power, don't use FBDIMM's. It really *is* that simple.
>

DDR3-1333 won't be much better.  

> (And yes, maybe FBDIMM controllers in a few years won't use 8 W per 
> buffer. I kind of doubt that, since FBDIMM fairly fundamentally is highish 
> voltage swings at high frequencies.)
> 
> Also, on a *truly* idle system, we'll see the power savings whatever we 
> do, because the working set will fit in D$, and to get those DRAM power 
> savings in reality you need to have the DRAM controller shut down on its 
> own anyway (ie sw would only help a bit).
> 
> The whole DRAM power story is a bedtime story for gullible children. Don't 
> fall for it. It's not realistic. The hardware support for it DOES NOT 
> EXIST today, and probably won't for several years. And the real fix is 
> elsewhere anyway (ie people will have to do a FBDIMM-2 interface, which 
> is against the whole point of FBDIMM in the first place, but that's what 
> you get when you ignore power in the first version!).
>

Hardware support for some of this is coming this year in the ATCA space
on the MPCBL0050.  The feature is a bit experimental, and
power/performance benefits will be workload and configuration
dependent.  It's not a bedtime story.

--mgross

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 10:38   ` Mel Gorman
@ 2007-03-02 16:31     ` Joel Schopp
  2007-03-02 21:37       ` Bill Irwin
  0 siblings, 1 reply; 104+ messages in thread
From: Joel Schopp @ 2007-03-02 16:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Bill Irwin, akpm, npiggin, clameter, mingo, arjan, torvalds,
	mbligh, Linux Memory Management List, Linux Kernel Mailing List

>> Exhibiting a workload where the list patch breaks down and the zone
>> patch rescues it might help if it's felt that the combination isn't as
>> good as lists in isolation. I'm sure one can be dredged up somewhere.
> 
> I can't think of a workload that totally makes a mess out of list-based. 
> However, list-based makes no guarantees on availability. If a system 
> administrator knows they need between 10,000 and 100,000 huge pages and 
> doesn't want to waste memory pinning too many huge pages at boot-time, 
> the zone-based mechanism would be what they want.

From our testing with earlier versions of list-based for memory hot-unplug on 
pSeries machines, we were able to hot-unplug huge amounts of memory after 
running the nastiest workloads we could find for over a week.  Without the 
patches, we were unable to hot-unplug anything within minutes of running the 
same workloads.

If something works for 99.999% of people (list-based) and there is an easy way 
to configure it for the other 0.001% ("zone"-based), I call that a great 
solution.  I really don't understand what the resistance is to these patches.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
       [not found] ` <20070301160915.6da876c5.akpm@linux-foundation.org>
                     ` (4 preceding siblings ...)
  2007-03-02 15:29   ` Rik van Riel
@ 2007-03-02 16:32   ` Mel Gorman
  2007-03-02 17:19     ` Christoph Lameter
  2007-03-03  4:54     ` KAMEZAWA Hiroyuki
       [not found]   ` <Pine.LNX.4.64.0703011642190.12485@woody.linux-foundation.org>
  6 siblings, 2 replies; 104+ messages in thread
From: Mel Gorman @ 2007-03-02 16:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: npiggin, clameter, mingo, jschopp, arjan, torvalds, mbligh,
	kamezawa.hiroyu, linux-kernel


On (01/03/07 16:09), Andrew Morton didst pronounce:
> On Thu, 1 Mar 2007 10:12:50 +0000
> mel@skynet.ie (Mel Gorman) wrote:
> 
> > Any opinion on merging these patches into -mm
> > for wider testing?
> 
> I'm a little reluctant to make changes to -mm's core mm unless those
> changes are reasonably certain to be on track for mainline, so let's talk
> about that.
> 

Sounds reasonable.

> What worries me is memory hot-unplug and per-container RSS limits.  We
> don't know how we're going to do either of these yet, and it could well be
> that the anti-frag work significantly complicates whatever we end up
> doing there.
> 

Ok. I am going to assume as well that all these issues are not mutually
exclusive. To start with, anti-fragmentation is now really two things -
anti-fragmentation and memory partitioning:

o Memory partitioning creates an additional zone with hard limits on usage
o Anti-fragmentation groups free pages based on mobility in MAX_ORDER blocks

They both help different things in different ways so it is important not
to conflate them as being the same thing. I would like them both in because
they complement each other nicely from a hugepages perspective.

> For prioritisation purposes I'd judge that memory hot-unplug is of similar
> value to the antifrag work (because memory hot-unplug permits DIMM
> poweroff).
> 
> And I'd judge that per-container RSS limits are of considerably more value
> than antifrag (in fact per-container RSS might be a superset of antifrag,
> in the sense that per-container RSS and containers could be abused to fix
> the i-cant-get-any-hugepages problem, dunno).
> 

It would be serious abuse and would be too easy to trigger OOM-like conditions
because of the constraints containers must work under to be useful. I'll
come back to this briefly later.

> So some urgent questions are: how are we going to do mem hotunplug and
> per-container RSS?
> 

The zone-based patches for memory partitioning should be providing what is
required for memory hot-remove of an entire DIMM or bank of memory (PPC64
also cares about removing smaller blocks of memory but zones are overkill
there and anti-fragmentation on its own is good enough).  Pages hot-added
to ZONE_MOVABLE will always be reclaimable or migratable in the case of
mlock(). Kamezawa Hiroyuki has indicated that his hot-remove patches also
do something like ZONE_MOVABLE. I would hope that his patches could be
easily based on top of my memory partitioning set of patches. The markup
of pages has been tested and the zone definitely works. I've added
kamezawa.hiroyu@jp.fujitsu.com to the cc list so he can comment :)

What I do not do in my patchset is hot-add to ZONE_MOVABLE because I couldn't
be certain it's what the hotplug people wanted. They will of course need to
hot-add to that zone if they want to be able to remove it later.

For node-based memory hot-add and hot-remove, the node would consist of just
one populated zone - ZONE_MOVABLE.

For the removal of DIMMs, anti-fragmentation has something additional
to offer. The later patches in the anti-fragmentation patchset bias the
placement of unmovable pages towards the lower PFNs. It's not very strict
about this because being strict would cost. A mechanism could be put in place
that enforced the placement of unmovable pages at low PFNs. Due to the cost,
it would need to be disabled by default and enabled on request. On the plus
side, the cost would only be incurred when splitting a MAX_ORDER block of
pages which is a rare event.

One of the reasons anti-frag doesn't negatively impact kernbench figures
in the majority of cases is that it is actually rare for it to kick in to do
anything. Once pages are on the appropriate free lists, everything progresses
as normal.

> Our basic unit of memory management is the zone.  Right now, a zone maps
> onto some hardware-imposed thing.  But the zone-based MM works *well*.  I
> suspect that a good way to solve both per-container RSS and mem hotunplug
> is to split the zone concept away from its hardware limitations: create a
> "software zone" and a "hardware zone".

Ok, let's explore this notion a bit. I am thinking about this both in
terms of making RSS limits work properly and seeing if it collides with
anti-fragmentation or not.

Let's assume that a "hardware zone" is a management structure of pages that
have some addressing limitation. It might be 16MB for ISA devices, 4GB for
32 bit devices etc.

A "software zone" is a collection of pages belonging to a subsystem or
a process.  Containers are an example of a software zone.

That gives us a set of structures like

        Node
	 |
      --------
     /        \
hardware    hardware
zone          zone
              |   \--------------
              |    \             \
           main     container     container
   software zone   software zone  software zone

i.e. each node has one hardware zone per "zone" that exists today (ZONE_DMA,
ZONE_NORMAL, etc.). Each hardware zone consists of at least one software zone
and an additional software zone per container.

> All the existing page allocator and
> reclaim code remains basically unchanged, and it operates on "software
> zones". 

Ok. It means that counters and statistics will be per-software-zone instead
of per-hardware-zone. That means that most of the current struct zone would
be split in two.

However, to enforce the containers, software zones should not share pages. The
balancing problems would be considerable; worse, containers could interfere
with each other, reducing their effectiveness. This is why anti-fragmentation
also should not be expressed as software zones. The case where the reclaimable
zone is too large and the movable zone is too compressed becomes difficult
to deal with in a sane manner.

> Each software zones always lies within a single hardware zone. 
> The software zones are resizeable.  For per-container RSS we give each
> container one (or perhaps multiple) resizeable software zones.
> 

Ok. That seems fair. I'll take a detour here and discuss how the number
of pages used by a container could be enforced. I'll then come back to
anti-fragmentation.

(reads through the RSS patches) The RSS patches have some notions vaguely
related to software zones already.  I'm not convinced they enforce the number
of pages used by a container.

The hardware and software zones would look something like this (mainly taken
from the existing definition of struct zone). I've put big markers around
"new" stuff.

/* Hardware zones */
struct hardware_zone {

/* ==== NEW STUFF HERE ==== */
	/*
	 * Main software_zones. All allocations and pages by default
	 * come from this software_zone. Containers create their own
	 * and associate them with their mm_struct
	 *
	 * kswapd only operates within this software_zone. Containers
	 * are responsible for performing their own reclaim when they
	 * run out of pages
	 *
	 * If a new container is created and the required memory
	 * is not available in the hardware zone, it is taken from this
	 * software_zone. The pages taken from the software zone are
	 * contiguous if at all possible because otherwise it adds a new
	 * layer of fun to external fragmentation problems.
	 */
	struct software_zone 	*software_zone;
/* ==== END OF NEW STUFF ==== */

#ifdef CONFIG_NUMA
	int node;
#endif

	/*
	 * zone_start_pfn, spanned_pages and present_pages are all
	 * protected by span_seqlock.  It is a seqlock because it has
	 * to be read outside of zone->lock, and it is done in the main
	 * allocator path.  But, it is written quite infrequently.
	 *
	 * The lock is declared along with zone->lock because it is
	 * frequently read in proximity to zone->lock.  It's good to
	 * give them a chance of being in the same cacheline.
	 */
	unsigned long		spanned_pages;	/* total size, including holes */
	unsigned long		present_pages;	/* amount of memory (excluding holes) */

/* ==== MORE NEW STUFF ==== */
	/*
	 * These are free pages not associated with any software_zone
	 * yet. Software zones can request more pages if necessary
	 * but may not steal pages from each other. This is to 
	 * prevent containers from kicking each other
	 *
	 * Software zones should be given pages from here in blocks as
	 * large and contiguous as possible. This is to keep container
	 * memory grouped together as much as possible.
	 */
	spinlock_t		lock;
#ifdef CONFIG_MEMORY_HOTPLUG
	/* see spanned/present_pages for more description */
	seqlock_t		span_seqlock;
#endif
	struct free_area	free_pages[MAX_ORDER];
/* ==== END OF MORE NEW STUFF */

	/*
	 * rarely used fields:
	 */
	const char		*name;

};



/* Software zones */
struct software_zone {

/* ==== NEW STUFF ==== */
	/*
	 * Flags affecting how the software_zone behaves. These flags
	 * determine if the software_zone is allowed to request more
	 * pages from the hardware zone or not for example
	 *
	 * SZONE_CAN_TAKE_MORE: Takes more pages from hardware_zone
	 *			when low on memory instead of reclaiming.
	 *			The main software_zone has this, the
	 *			containers should not. They must reclaim
	 *			within the container or die
	 *
	 * SZONE_MOVABLE_ONLY: Only __GFP_MOVABLE pages are allocated
	 *			from here. If there are no pages left,
	 *			then the hardware zone is allocated
	 *			from if SZONE_CAN_TAKE_MORE is set.
	 *			This is very close to enforcing RSS
	 *			limits and a more sensible way of
	 *			doing things than the current RSS
	 *			patches
	 */
	unsigned long flags;

	/* The minimum size the software zone must be */
	unsigned long		min_nr_pages;

/* ==== END OF NEW STUFF ==== */

	/* Fields commonly accessed by the page allocator */
	unsigned long		pages_min, pages_low, pages_high;

	/*
	 * We don't know if the memory that we're going to allocate will be
	 * freeable or/and it will be released eventually, so to avoid totally
	 * wasting several GB of ram we must reserve some of the lower zone
	 * memory (otherwise we risk to run OOM on the lower zones despite
	 * there's tons of freeable ram on the higher zones). This array is
	 * recalculated at runtime if the sysctl_lowmem_reserve_ratio sysctl
	 * changes
	 */
	unsigned long		lowmem_reserve[MAX_NR_ZONES];

/* === LITTLE BIT OF NEW STUFF === */
	/* The parent hardware_zone */
	struct zone 		*hardware_zone;
/* === END AGAIN === */

#ifdef CONFIG_NUMA
	/*
	 * zone reclaim becomes active if more unmapped pages exist.
	 */
	unsigned long		min_unmapped_pages;
	unsigned long		min_slab_pages;
	struct per_cpu_pageset	*pageset[NR_CPUS];
#else
	struct per_cpu_pageset	pageset[NR_CPUS];
#endif
	/*
	 * free areas of different sizes
	 */
	spinlock_t		lock;
#ifdef CONFIG_MEMORY_HOTPLUG
	/* see spanned/present_pages for more description */
	seqlock_t		span_seqlock;
#endif
	struct free_area	free_area[MAX_ORDER];


	ZONE_PADDING(_pad1_)

	/* Fields commonly accessed by the page reclaim scanner */
	spinlock_t		lru_lock;	
	struct list_head	active_list;
	struct list_head	inactive_list;
	unsigned long		nr_scan_active;
	unsigned long		nr_scan_inactive;
	unsigned long		pages_scanned;	   /* since last reclaim */
	unsigned long		total_scanned;	   /* accumulated, may overflow */
	int			all_unreclaimable; /* All pages pinned */

	/* A count of how many reclaimers are scanning this zone */
	atomic_t		reclaim_in_progress;

	/* Zone statistics */
	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];

	/*
	 * prev_priority holds the scanning priority for this zone.  It is
	 * defined as the scanning priority at which we achieved our reclaim
	 * target at the previous try_to_free_pages() or balance_pgdat()
	 * invocation.
	 *
	 * We use prev_priority as a measure of how much stress page reclaim is
	 * under - it drives the swappiness decision: whether to unmap mapped
	 * pages.
	 *
	 * Access to this field is quite racy even on uniprocessor.  But
	 * it is expected to average out OK.
	 */
	int prev_priority;


	ZONE_PADDING(_pad2_)
	/* Rarely used or read-mostly fields */

	/*
	 * wait_table		-- the array holding the hash table
	 * wait_table_hash_nr_entries	-- the size of the hash table array
	 * wait_table_bits	-- wait_table_hash_nr_entries == (1 << wait_table_bits)
	 *
	 * The purpose of all these is to keep track of the people
	 * waiting for a page to become available and make them
	 * runnable again when possible. The trouble is that this
	 * consumes a lot of space, especially when so few things
	 * wait on pages at a given time. So instead of using
	 * per-page waitqueues, we use a waitqueue hash table.
	 *
	 * The bucket discipline is to sleep on the same queue when
	 * colliding and wake all in that wait queue when removing.
	 * When something wakes, it must check to be sure its page is
	 * truly available, a la thundering herd. The cost of a
	 * collision is great, but given the expected load of the
	 * table, they should be so rare as to be outweighed by the
	 * benefits from the saved space.
	 *
	 * __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
	 * primary users of these fields, and in mm/page_alloc.c
	 * free_area_init_core() performs the initialization of them.
	 */
	wait_queue_head_t	* wait_table;
	unsigned long		wait_table_hash_nr_entries;
	unsigned long		wait_table_bits;

	/*
	 * Discontig memory support fields.
	 */
	struct pglist_data	*zone_pgdat;
	/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
	unsigned long		zone_start_pfn;

} ____cacheline_internodealigned_in_smp;



Hardware zones always have one software zone, and that is all the majority
of systems will have. Today's behaviour is preserved; it is only when
containers come into play that things change.

A container could create a new software_zone with kmalloc() (or a slab cache)
and associate it with its container management structure. It then sets
the software zone's min_nr_pages so that the zone gets populated with pages
from the hardware zone. If the hardware zone is depleted, it will take memory
from the main software zone in large contiguous chunks where possible. If
anti-frag is in place, the contiguous chunks will be available.

On allocation, buffered_rmqueue() checks for the presence of a software_zone
for the current mm and, if one exists, uses it. Otherwise zone->software_zone
is used to satisfy the allocation. This allows containers to use the core
page allocation code without altering it significantly (the main
alteration is to use software_zone instead of zone).

As software zones are used at allocation time, RSS limits are enforced
without worrying about the definition of RSS, because it's all about the
availability of pages. This can be done with the existing infrastructure,
not additional bean counting. There is confusion about the exact definition
of RSS that was never really pinned down, from what I can see. My
understanding also is that we don't care about individual processes as such,
or even RSS, but about the number of pages all the processes within a
container are using. With software zones, hard limits on the number of
usable pages can be enforced, and all the zone-related infrastructure is
there to help.

If shared page accounting is a concern, then what is needed are pairs of
software zones. A container has a private software zone that bounds
allocations of private pages. It then uses a second software_zone for
shared pages. Multiple containers may share a software zone for shared
pages so that one container is not unfairly penalised for being the first
container to mmap() MAP_SHARED, for example.

If a container runs low on memory, it reclaims pages within the LRU lists
of the software zones it has access to. If it fails to reclaim, it has a
mini-OOM, and the OOM killer would need to be sure to target processes in
the container, not elsewhere. I'm not dealing with slab reclaim here because
it is an issue deserving a discussion all to itself.

All this avoids creating an unbounded number of zones that can fall back to
each other. While there may be a large number of zones, there is no
additional search complexity because they are not related to each other
except in that they belong to the same hardware zone.

On free, the software_zone the page belongs to needs to be discovered. There
are a few options that should be no surprise to anyone.

1. Use page bits to store the ID of the software zone. Only doable on 64 bit
2. Use an additional field in struct page for 32 bit (yuck)
3. If anti-frag is in place, only populate software zones with MAX_ORDER sized
   blocks. Use the pageblock bits (patch 1 in anti-frag) to map blocks of pages
   with software zones. This *might* limit the creation of new containers at
   runtime because of external fragmentation.
4. Have kswapd and asynchronous reclaim always go to the main software
   zone. Containers will always be doing direct reclaim because otherwise
   they can interfere with other containers. On free, they know the software
   zone to free to. If IO must be performed, they block waiting for the page
   to free and then free it to the correct software zone.

Of these, 4 seems like a reasonable choice. Containers that are
under-provisioned will feel the penalty directly by having to wait to
perform their own IO. They will not be able to steal pages from other
containers in any fashion, so the damage will be contained (pun not
intended).

I think this covers the RSS-limitation requirements in the context of
software zones.


Now... how does this interact adversely with anti-fragmentation - well, it
doesn't really. Anti-fragmentation is mainly concerned with grouping pages
within the main software_zone. It will also group within the container
software zones, although the containers probably do not care. The crossover
is that when the hardware zone takes pages for a new container, it will
interact with anti-fragmentation to get large contiguous regions if
possible. We will not be creating a software zone for each migrate type
because sizing and balancing those becomes a serious, and unnecessary,
issue.

Anti-fragmentation's work is really at the __rmqueue() level, not the higher
levels. The higher levels are concerned with selecting the right software
zone to use, and that is where containers should be working because they
are concerned about reclaim as well as page availability.

So. Even if the software_zone concept was developed and anti-fragmentation
was already in place, I do not believe they would collide in a
horrible-to-deal-with fashion.

> For memory hotunplug, some of the hardware zone's software zones are marked
> reclaimable and some are not; DIMMs which are wholly within reclaimable
> zones can be depopulated and powered off or removed.
> 
> NUMA and cpusets screwed up: they've gone and used nodes as their basic
> unit of memory management whereas they should have used zones.  This will
> need to be untangled.
> 

I guess nodes could become hardware zones because they contain essentially
the same information. In that case, what would happen then is that a number
of "main" software zones would be created at boot time for each of the
hardware-based limitations on PFN ranges. buffered_rmqueue would be responsible
for selecting the right software zone to use based on the __GFP flags.

It's not massively different from what we do currently, except it plays
nicely with containers.

> Anyway, that's just a shot in the dark.  Could be that we implement unplug
> and RSS control by totally different means.

The unplug can use ZONE_MOVABLE as it stands today. RSS control can be done
as software zones as described above but it doesn't affect anti-fragmentation.

> But I do wish that we'd sort
> out what those means will be before we potentially complicate the story a
> lot by adding antifragmentation.

I believe the difficulty of the problem remains the same whether
anti-fragmentation is entered into the equation or not. Hence, I'd like
to see both anti-fragmentation and the zone-based work go in because they
should not interfere with the RSS work and they help huge pages a lot by
allowing the pool to be resized.

However, if that is objectionable, I'd at least like to see zone-based patches
go into -mm on the expectation that the memory hot-remove patches will be
able to use the infrastructure. It's not ideal for hugepages and it is not my
first preference, but it's a step in the right direction. Is this reasonable?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 15:29   ` Rik van Riel
@ 2007-03-02 16:58     ` Andrew Morton
  2007-03-02 17:09       ` Mel Gorman
  2007-03-02 17:23       ` Christoph Lameter
  0 siblings, 2 replies; 104+ messages in thread
From: Andrew Morton @ 2007-03-02 16:58 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mel Gorman, npiggin, clameter, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Fri, 02 Mar 2007 10:29:58 -0500 Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
> 
> > And I'd judge that per-container RSS limits are of considerably more value
> > than antifrag (in fact per-container RSS might be a superset of antifrag,
> > in the sense that per-container RSS and containers could be abused to fix
> > the i-cant-get-any-hugepages problem, dunno).
> 
> The RSS bits really worry me, since it looks like they could
> exacerbate the scalability problems that we are already running
> into on very large memory systems.

Using a zone-per-container or N-64MB-zones-per-container should actually
move us in the direction of *fixing* any such problems.  Because, to a
first-order, the scanning of such a zone has the same behaviour as a 64MB
machine.

(We'd run into a few other problems, some related to the globalness of the
dirty-memory management, but that's fixable).

> Linux is *not* happy on 256GB systems.  Even on some 32GB systems
> the swappiness setting *needs* to be tweaked before Linux will even
> run in a reasonable way.

Please send testcases.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
       [not found]   ` <Pine.LNX.4.64.0703011642190.12485@woody.linux-foundation.org>
  2007-03-02  1:52     ` Balbir Singh
@ 2007-03-02 16:58     ` Mel Gorman
  2007-03-02 17:05     ` Joel Schopp
  2 siblings, 0 replies; 104+ messages in thread
From: Mel Gorman @ 2007-03-02 16:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, npiggin, clameter, mingo, jschopp, arjan, mbligh,
	linux-mm, linux-kernel

On (01/03/07 16:44), Linus Torvalds didst pronounce:
> 
> 
> On Thu, 1 Mar 2007, Andrew Morton wrote:
> > 
> > So some urgent questions are: how are we going to do mem hotunplug and
> > per-container RSS?
> 
> Also: how are we going to do this in virtualized environments? Usually the 
> people who care about memory hotunplug are exactly the same people who 
> also care (or claim to care, or _will_ care) about virtualization.
> 

I sent a mail out with a fairly detailed treatment of how RSS could be done.
Essentially, I feel that containers should simply limit the number of
pages used by the container, and not try to do anything magic with a
poorly defined concept like RSS. It would do this by creating a
"software zone" and taking pages from a "hardware zone" at creation
time. It has a similar effect to RSS limits except it's better defined.

In that setup, a virtualized environment would create its own software
zone. It would hand that over to the guest OS and the guest OS could do
whatever it liked. It would be responsible for its own reclaim and so on,
and would not have to worry about other containers (or virtualized
environments for that matter) or kswapd interfering with it.

> My personal opinion is that while I'm not a huge fan of virtualization, 
> these kinds of things really _can_ be handled more cleanly at that layer, 
> and not in the kernel at all. Afaik, it's what IBM already does, and has 
> been doing for a while. There's no shame in looking at what already works, 
> especially if it's simpler.
> 
> 		Linus

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  6:15               ` Paul Mundt
@ 2007-03-02 17:01                 ` Mel Gorman
  0 siblings, 0 replies; 104+ messages in thread
From: Mel Gorman @ 2007-03-02 17:01 UTC (permalink / raw)
  To: Paul Mundt, KAMEZAWA Hiroyuki, Linus Torvalds, akpm, balbir,
	npiggin, clameter, mingo, jschopp, arjan, mbligh, linux-mm,
	linux-kernel

On (02/03/07 15:15), Paul Mundt didst pronounce:
> On Fri, Mar 02, 2007 at 02:50:29PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Thu, 1 Mar 2007 21:11:58 -0800 (PST)
> > Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > 
> > > The whole DRAM power story is a bedtime story for gullible children. Don't 
> > > fall for it. It's not realistic. The hardware support for it DOES NOT 
> > > EXIST today, and probably won't for several years. And the real fix is 
> > > elsewhere anyway (ie people will have to do a FBDIMM-2 interface, which 
> > > is against the whole point of FBDIMM in the first place, but that's what 
> > > you get when you ignore power in the first version!).
> > > 
> > 
> > Note:
> > I heard embedded people often design their own memory-power-off control on
> > embedded Linux (but it never seems to be posted to the list). But I don't know
> > whether they are interested in generic memory hot-remove or not.
> > 
> Yes, this is not that uncommon of a thing. People tend to do this in a
> couple of different ways, in some cases the system is too loaded to ever
> make doing such a thing at run-time worthwhile, and in those cases these
> sorts of things tend to be munged in with the suspend code. Unfortunately
> it tends to be quite difficult in practice to keep pages in one place,
> so people rely on lame chip-select hacks and limiting the amount of
> memory that the kernel treats as RAM instead so it never ends up being an
> issue. Having some sort of a balance would certainly be nice, though.

If the range of memory you want to offline is MAX_ORDER_NR_PAGES,
anti-fragmentation should group reclaimable pages into chunks of that
size. It might reduce the number of hacks you have to perform to
limit where the kernel uses memory.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
       [not found]   ` <Pine.LNX.4.64.0703011642190.12485@woody.linux-foundation.org>
  2007-03-02  1:52     ` Balbir Singh
  2007-03-02 16:58     ` Mel Gorman
@ 2007-03-02 17:05     ` Joel Schopp
  2007-03-05  3:21       ` Nick Piggin
  2 siblings, 1 reply; 104+ messages in thread
From: Joel Schopp @ 2007-03-02 17:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Mel Gorman, npiggin, clameter, mingo, arjan,
	mbligh, linux-mm, linux-kernel

Linus Torvalds wrote:
> 
> On Thu, 1 Mar 2007, Andrew Morton wrote:
>> So some urgent questions are: how are we going to do mem hotunplug and
>> per-container RSS?

The people who were trying to do memory hot-unplug basically all stopped,
waiting for these patches, or something similar, to solve the fragmentation
problem.  Our last working set of patches built on top of an earlier version
of Mel's list-based solution.

> 
> Also: how are we going to do this in virtualized environments? Usually the 
> people who care about memory hotunplug are exactly the same people who 
> also care (or claim to care, or _will_ care) about virtualization.

Yes, we are.  And we are very much in favor of these patches.  At last
year's OLS, developers from IBM, HP, and Xen co-authored a paper titled
"Resizing Memory with Balloons and Hotplug".
http://www.linuxsymposium.org/2006/linuxsymposium_procv2.pdf
Our conclusion was that ballooning is simply not good enough and we need
memory hot-unplug.  Here is a quote from the article I find relevant to
today's discussion:

"Memory Hotplug remove is not in mainline.
Patches exist, released under the GPL, but are
only occasionally rebased. To be worthwhile
the existing patches would need either a remappable
kernel, which remains highly doubtful, or
a fragmentation avoidance strategy to keep migrateable
and non-migrateable pages clumped
together nicely."

At IBM all of our Power4, Power5, and future hardware supports a lot of 
virtualization features.  This hardware took "Best Virtualization Solution" at 
LinuxWorld Expo, so we aren't talking research projects here. 
http://www-03.ibm.com/press/us/en/pressrelease/20138.wss

> My personal opinion is that while I'm not a huge fan of virtualization, 
> these kinds of things really _can_ be handled more cleanly at that layer, 
> and not in the kernel at all. Afaik, it's what IBM already does, and has 
> been doing for a while. There's no shame in looking at what already works, 
> especially if it's simpler.

I believe you are talking about the zSeries (aka mainframe) because the rest of IBM 
needs these patches.  zSeries built their whole processor instruction set, memory 
model, etc around their form of virtualization, and I doubt the rest of us are going 
to change our processor instruction set that drastically.  I've had a lot of talks 
with Martin Schwidefsky (the maintainer of Linux on zSeries) about how we could do 
more of what they do and the basic answer is we can't because what they do is so 
fundamentally incompatible.

While I appreciate that we should all dump our current hardware and buy
mainframes, it seems to me that an easier solution is to take a few patches
from Mel and work with the hardware we already have.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 16:20             ` Mark Gross
@ 2007-03-02 17:07               ` Andrew Morton
  2007-03-02 17:35                 ` Mark Gross
  2007-03-02 17:16               ` Linus Torvalds
  1 sibling, 1 reply; 104+ messages in thread
From: Andrew Morton @ 2007-03-02 17:07 UTC (permalink / raw)
  To: mgross
  Cc: Linus Torvalds, Balbir Singh, Mel Gorman, npiggin, clameter,
	mingo, jschopp, arjan, mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007 08:20:23 -0800 Mark Gross <mgross@linux.intel.com> wrote:

> > The whole DRAM power story is a bedtime story for gullible children. Don't 
> > fall for it. It's not realistic. The hardware support for it DOES NOT 
> > EXIST today, and probably won't for several years. And the real fix is 
> > elsewhere anyway (ie people will have to do a FBDIMM-2 interface, which 
> > is against the whole point of FBDIMM in the first place, but that's what 
> > you get when you ignore power in the first version!).
> >
> 
> Hardware support for some of this is coming this year in the ATCA space
> on the MPCBL0050.  The feature is a bit experimental, and
> power/performance benefits will be workload and configuration
> dependent.  Its not a bed time story.

What is the plan for software support?

Will it be possible to just power the DIMMs off?  I don't see much point in
some half-power non-destructive mode.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  8:38                                     ` Nick Piggin
@ 2007-03-02 17:09                                       ` Christoph Lameter
  0 siblings, 0 replies; 104+ messages in thread
From: Christoph Lameter @ 2007-03-02 17:09 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Mel Gorman, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, Nick Piggin wrote:

> > Oh just run a 32GB SMP system with sparsely freeable pages and lots of 
> > allocs and frees and you will see it too. F.e try Linus tree and mlock 
> > a large portion of the memory and then see the fun starting. See also 
> > Rik's list of pathological cases on this.
> 
> Ah, so your problem is lots of unreclaimable pages. There are heaps
> of things we can try to reduce the rate at which we scan those.

Well, this is one possible symptom of the basic issue of having too many
page structs. I wonder how long we can patch things up.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 16:58     ` Andrew Morton
@ 2007-03-02 17:09       ` Mel Gorman
  2007-03-02 17:23       ` Christoph Lameter
  1 sibling, 0 replies; 104+ messages in thread
From: Mel Gorman @ 2007-03-02 17:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, npiggin, clameter, mingo, jschopp, arjan, torvalds,
	mbligh, linux-mm, linux-kernel

On (02/03/07 08:58), Andrew Morton didst pronounce:
> On Fri, 02 Mar 2007 10:29:58 -0500 Rik van Riel <riel@redhat.com> wrote:
> 
> > Andrew Morton wrote:
> > 
> > > And I'd judge that per-container RSS limits are of considerably more value
> > > than antifrag (in fact per-container RSS might be a superset of antifrag,
> > > in the sense that per-container RSS and containers could be abused to fix
> > > the i-cant-get-any-hugepages problem, dunno).
> > 
> > The RSS bits really worry me, since it looks like they could
> > exacerbate the scalability problems that we are already running
> > into on very large memory systems.
> 
> Using a zone-per-container or N-64MB-zones-per-container should actually
> move us in the direction of *fixing* any such problems.  Because, to a
> first-order, the scanning of such a zone has the same behaviour as a 64MB
> machine.
> 

Quite possibly. Taking software zones from the other large mail I sent,
one could get the 64MB effect by increasing MAX_ORDER_NR_PAGES to 64MB
worth of pages. To avoid external fragmentation issues, I'd of course
prefer these container zones to consist mainly of contiguous memory, but
with anti-fragmentation, that is possible.

> (We'd run into a few other problems, some related to the globalness of the
> dirty-memory management, but that's fixable).
> 

It would be fixable, especially if containers do their own reclaim on their
container zones and not kswapd. Writing dirty data back periodically would
still need to be global in nature but that's no different to today.

> > Linux is *not* happy on 256GB systems.  Even on some 32GB systems
> > the swappiness setting *needs* to be tweaked before Linux will even
> > run in a reasonable way.
> 
> Please send testcases.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 16:20             ` Mark Gross
  2007-03-02 17:07               ` Andrew Morton
@ 2007-03-02 17:16               ` Linus Torvalds
  2007-03-02 18:45                 ` Mark Gross
  2007-03-02 23:58                 ` Martin J. Bligh
  1 sibling, 2 replies; 104+ messages in thread
From: Linus Torvalds @ 2007-03-02 17:16 UTC (permalink / raw)
  To: Mark Gross
  Cc: Andrew Morton, Balbir Singh, Mel Gorman, npiggin, clameter,
	mingo, jschopp, arjan, mbligh, linux-mm, linux-kernel



On Fri, 2 Mar 2007, Mark Gross wrote:
> > 
> > Yes, the same issues exist for other DRAM forms too, but to a *much* 
> > smaller degree.
> 
> DDR3-1333 may be better than FBDIMM's but don't count on it being much
> better.

Hey, fair enough. But it's not a problem (and it doesn't have a solution) 
today. I'm not sure it's going to have a solution tomorrow either.

> > Also, IN PRACTICE you're never ever going to see this anyway. Almost 
> > everybody wants bank interleaving, because it's a huge performance win on 
> > many loads. That, in turn, means that your memory will be spread out over 
> > multiple DIMM's even for a single page, much less any bigger area.
> 
> 4-way interleave across banks on systems may not be as common as you may
> think for future chip sets.  2-way interleave across DIMMs within a bank
> will stay.

.. and think about a realistic future.

EVERYBODY will do on-die memory controllers. Yes, Intel doesn't do it 
today, but in the one- to two-year timeframe even Intel will.

What does that mean? It means that in bigger systems, you will no longer 
even *have* 8 or 16 banks where turning off a few banks makes sense. 
You'll quite often have just a few DIMM's per die, because that's what you 
want for latency. Then you'll have CSI or HT or another interconnect.

And with a few DIMM's per die, you're back where even just 2-way 
interleaving basically means that in order to turn off your DIMM, you 
probably need to remove HALF the memory for that CPU.

In other words: TURNING OFF DIMM's IS A BEDTIME STORY FOR DIMWITTED 
CHILDREN.

There are maybe a couple machines IN EXISTENCE TODAY that can do it. But 
nobody actually does it in practice, and nobody even knows if it's going 
to be viable (yes, DRAM takes energy, but trying to keep memory free will 
likely waste power *too*, and I doubt anybody has any real idea of how 
much any of this would actually help in practice).

And I don't think that will change. See above. The future is *not* moving 
towards more and more DIMMS. Quite the reverse. On workstations, we are 
currently in the "one or two DIMM's per die". Do you really think that 
will change? Hell no. And in big servers, pretty much everybody agrees 
that we will move towards that, rather than away from it.

So:
 - forget about turning DIMM's off. There is *no* actual data supporting 
   the notion that it's a good idea today, and I seriously doubt you can 
   really argue that it will be a good idea in five or ten years. It's a 
   hardware hack for a hardware problem, and the problems are way too 
   complex for us to solve in time for the solution to be relevant.

 - aim for NUMA memory allocation and turning off whole *nodes*. That's 
   much more likely to be productive in the longer timeframe. And yes, we 
   may well want to do memory compaction for that too, but I suspect that 
   the issues are going to be different (ie the way to do it is to simply 
   prefer certain nodes for certain allocations, and then try to keep the 
   jobs that you know can be idle on other nodes)

Do you actually have real data supporting the notion that turning DIMM's 
off will be reasonable and worthwhile? 

			Linus

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 16:32   ` Mel Gorman
@ 2007-03-02 17:19     ` Christoph Lameter
  2007-03-02 17:28       ` Mel Gorman
  2007-03-03  4:54     ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 104+ messages in thread
From: Christoph Lameter @ 2007-03-02 17:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, npiggin, mingo, jschopp, arjan, torvalds, mbligh,
	kamezawa.hiroyu, linux-kernel

On Fri, 2 Mar 2007, Mel Gorman wrote:

> However, if that is objectionable, I'd at least like to see zone-based patches
> go into -mm on the expectation that the memory hot-remove patches will be
> able to use the infrastructure. It's not ideal for hugepages and it is not my
> first preference, but it's a step in the right direction. Is this reasonable?

I still think that the list-based approach is sufficient for memory 
hotplug if one restricts the location of the unmovable MAX_ORDER chunks 
so that they do not overlap the memory area from which we would like to 
be able to remove memory. In very pressing memory situations where we 
have too much unmovable memory, we could dynamically disable memory 
hotplug. There would be no need for this partitioning and additional 
zones.



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 16:58     ` Andrew Morton
  2007-03-02 17:09       ` Mel Gorman
@ 2007-03-02 17:23       ` Christoph Lameter
  2007-03-02 17:35         ` Andrew Morton
  1 sibling, 1 reply; 104+ messages in thread
From: Christoph Lameter @ 2007-03-02 17:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, npiggin, mingo, jschopp, arjan,
	torvalds, mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, Andrew Morton wrote:

> > Linux is *not* happy on 256GB systems.  Even on some 32GB systems
> > the swappiness setting *needs* to be tweaked before Linux will even
> > run in a reasonable way.
> 
> Please send testcases.

It is not happy if you put 256GB into one zone. We are fine with 1k nodes 
with 8GB each and a 16k page size (which reduces the number of 
page_structs to manage to a fourth). So the total memory is 8TB, which is 
significantly larger than 256GB.

If we do this node/zone merging and reassign MAX_ORDER blocks to virtual 
node/zones for containers (with their own LRU etc) then this would also 
reduce the number of page_structs on the list and may make things a bit 
easier.

We would then produce the same effect as the partitioning via NUMA nodes 
on our 8TB boxes. However, then you still have a bandwidth issue, since 
your 256GB box likely has only a single bus and all memory traffic for the 
node/zones has to go through this single bottleneck. That bottleneck does 
not exist on NUMA machines.



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 17:19     ` Christoph Lameter
@ 2007-03-02 17:28       ` Mel Gorman
  2007-03-02 17:48         ` Christoph Lameter
  0 siblings, 1 reply; 104+ messages in thread
From: Mel Gorman @ 2007-03-02 17:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, npiggin, mingo, jschopp, arjan, torvalds, mbligh,
	kamezawa.hiroyu, linux-kernel

On (02/03/07 09:19), Christoph Lameter didst pronounce:
> On Fri, 2 Mar 2007, Mel Gorman wrote:
> 
> > However, if that is objectionable, I'd at least like to see zone-based patches
> > go into -mm on the expectation that the memory hot-remove patches will be
> > able to use the infrastructure. It's not ideal for hugepages and it is not my
> > first preference, but it's a step in the right direction. Is this reasonable?
> 
> I still think that the list based approach is sufficient for memory 
> hotplug if one restricts  the location of the unmovable MAX_ORDER chunks 
> to not overlap the memory area where we would like to be able to remove 
> memory.

Yes, true. In the part where I bias placement of unmovable pages at
lower PFNs, additional steps would need to be taken. Specifically, the
lowest MAX_ORDER_NR_PAGES blocks used for movable pages would need to be
reclaimed for unmovable allocations.

> In very pressing memory situations where we have too much 
> unmovable memory we could dynamically disable  memory hotplug. There 
> would be no need for this partitioning and additional zones.
> 

It's simply more complex. I believe it's doable. The main plus going for
the zone is that it is a clearly understood concept and it gives hard
guarantees.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 17:23       ` Christoph Lameter
@ 2007-03-02 17:35         ` Andrew Morton
  2007-03-02 17:43           ` Rik van Riel
  0 siblings, 1 reply; 104+ messages in thread
From: Andrew Morton @ 2007-03-02 17:35 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Rik van Riel, Mel Gorman, npiggin, mingo, jschopp, arjan,
	torvalds, mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007 09:23:49 -0800 (PST) Christoph Lameter <clameter@engr.sgi.com> wrote:

> On Fri, 2 Mar 2007, Andrew Morton wrote:
> 
> > > Linux is *not* happy on 256GB systems.  Even on some 32GB systems
> > > the swappiness setting *needs* to be tweaked before Linux will even
> > > run in a reasonable way.
> > 
> > Please send testcases.
> 
> It is not happy if you put 256GB into one zone.

Oh come on.  What's the workload?  What happens?  system time?  user time?
kernel profiles?


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 17:07               ` Andrew Morton
@ 2007-03-02 17:35                 ` Mark Gross
  2007-03-02 18:02                   ` Andrew Morton
  0 siblings, 1 reply; 104+ messages in thread
From: Mark Gross @ 2007-03-02 17:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Balbir Singh, Mel Gorman, npiggin, clameter,
	mingo, jschopp, arjan, mbligh, linux-mm, linux-kernel

On Fri, Mar 02, 2007 at 09:07:53AM -0800, Andrew Morton wrote:
> On Fri, 2 Mar 2007 08:20:23 -0800 Mark Gross <mgross@linux.intel.com> wrote:
> 
> > > The whole DRAM power story is a bedtime story for gullible children. Don't 
> > > fall for it. It's not realistic. The hardware support for it DOES NOT 
> > > EXIST today, and probably won't for several years. And the real fix is 
> > > elsewhere anyway (ie people will have to do a FBDIMM-2 interface, which 
> > > is against the whole point of FBDIMM in the first place, but that's what 
> > > you get when you ignore power in the first version!).
> > >
> > 
> > Hardware support for some of this is coming this year in the ATCA space
> > on the MPCBL0050.  The feature is a bit experimental, and
> > power/performance benefits will be workload and configuration
> > dependent.  It's not a bedtime story.
> 
> What is the plan for software support?

The plan is the typical layered approach to enabling.  Post the basic
enabling patch, followed by a patch or software to actually exercise the
feature.

The code to exercise the feature is complicated by the fact that the
memory will need re-training as it comes out of low power state.  The
code doing this is still a bit confidential.

I have the base enabling patch ready for RFC review.
I'm working on the RFC now.

> 
> Will it be possible to just power the DIMMs off?  I don't see much point in
> some half-power non-destructive mode.

I think so, but need to double check with the HW folks.

Technically, the DIMMs could be powered off, or put into two different low
power non-destructive states (standby and suspend), but putting them
in a low power non-destructive mode has much less latency and provides
good bang for the buck for the LOC change needed to make it work.

Which lower power mode an application chooses will depend on latency
tolerances of the app.  For the POC activities we are looking at we are
targeting the lower latency option, but that doesn't lock out folks from
trying to do something with the other options.

--mgross



* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 17:35         ` Andrew Morton
@ 2007-03-02 17:43           ` Rik van Riel
  2007-03-02 18:06             ` Andrew Morton
  0 siblings, 1 reply; 104+ messages in thread
From: Rik van Riel @ 2007-03-02 17:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Mel Gorman, npiggin, mingo, jschopp, arjan,
	torvalds, mbligh, linux-mm, linux-kernel

Andrew Morton wrote:
> On Fri, 2 Mar 2007 09:23:49 -0800 (PST) Christoph Lameter <clameter@engr.sgi.com> wrote:
> 
>> On Fri, 2 Mar 2007, Andrew Morton wrote:
>>
>>>> Linux is *not* happy on 256GB systems.  Even on some 32GB systems
>>>> the swappiness setting *needs* to be tweaked before Linux will even
>>>> run in a reasonable way.
>>> Please send testcases.
>> It is not happy if you put 256GB into one zone.
> 
> Oh come on.  What's the workload?  What happens?  system time?  user time?
> kernel profiles?

I can't share all the details, since a lot of the problems are customer
workloads.

One particular case is a 32GB system with a database that takes most
of memory.  The amount of actually freeable page cache memory is in
the hundreds of MB.   With swappiness at the default level of 60, kswapd
ends up eating most of a CPU, and other tasks also dive into the pageout
code.  Even with swappiness as high as 98, that system still has
problems with the CPU use in the pageout code!

Another typical problem is that people want to back up their database
servers.  During the backup, parts of the working set get evicted from
the VM and performance is horrible.

A third scenario is where a system has way more RAM than swap, and not
a whole lot of freeable page cache.  In this case, the VM ends up
spending WAY too much CPU time scanning and shuffling around essentially
unswappable anonymous memory and tmpfs files.

I have briefly characterized some of these working sets on:

http://linux-mm.org/ProblemWorkloads

One thing I do not yet have are easily runnable test cases.  I know
the problems that happen because customers run into them, but it is
not as easy to reproduce on test systems...

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 17:28       ` Mel Gorman
@ 2007-03-02 17:48         ` Christoph Lameter
  2007-03-02 17:59           ` Mel Gorman
  0 siblings, 1 reply; 104+ messages in thread
From: Christoph Lameter @ 2007-03-02 17:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, npiggin, mingo, jschopp, arjan, torvalds, mbligh,
	kamezawa.hiroyu, linux-kernel

On Fri, 2 Mar 2007, Mel Gorman wrote:

> > I still think that the list based approach is sufficient for memory 
> > hotplug if one restricts  the location of the unmovable MAX_ORDER chunks 
> > to not overlap the memory area where we would like to be able to remove 
> > memory.
> 
> Yes, true. In the part where I bias placements of unmovable pages at
> lower PFNs, additional steps would need to be taken. Specifically, the
> lowest MAX_ORDER_NR_PAGES block used for movable pages would need to be
> reclaimed for unmovable allocations.

I think sparsemem can provide some memory maps that show where there are 
sections of memory that are hot-pluggable. So the MAX_ORDER blocks need
to be categorized as to whether they are in such a section or not. If you 
need another MAX_ORDER block for an unmovable type of allocation then make 
sure that it is not marked as hotpluggable by sparsemem. If we are in an 
emergency situation where we must use a MAX_ORDER block that is currently 
hotpluggable for unmovable allocations then we need to trigger something 
in sparsemem that disables hotplug for that memory section.
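
A toy model of that policy (hypothetical names; this is not sparsemem's
actual API): prefer non-hotpluggable sections for unmovable MAX_ORDER
blocks, and only in an emergency take a hotpluggable one, disabling
hotplug for that section:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the policy above; names are made up, not sparsemem's API. */
struct mem_section_model {
	bool hotpluggable;	/* section lies in removable memory  */
	bool has_free_block;	/* a MAX_ORDER block here is free    */
};

/*
 * Choose a section to host an unmovable MAX_ORDER block: prefer sections
 * that are not hotpluggable; only in an emergency fall back to a
 * hotpluggable one, and then disable hotplug for that section so it is
 * never offered for removal later.
 */
static int pick_unmovable_section(struct mem_section_model *s, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (s[i].has_free_block && !s[i].hotpluggable)
			return (int)i;
	for (size_t i = 0; i < n; i++)
		if (s[i].has_free_block) {	/* emergency fallback */
			s[i].hotpluggable = false;
			return (int)i;
		}
	return -1;
}
```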

> It's simply more complex. I believe it's doable. The main plus going for
> the zone is that it is a clearly understood concept and it gives hard
> guarantees.

And it gives the sysadmin headaches and increases VM management 
overhead because we now have more bits in the page struct that tell us 
about the zone that the page belongs to. Another distinction to worry 
about in the VM. If the limit is set too high then we have memory that is 
actually movable but, since it's on the wrong side of the limit, we cannot 
use it. If the limit is set too low then the system will crash.




* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 17:48         ` Christoph Lameter
@ 2007-03-02 17:59           ` Mel Gorman
  0 siblings, 0 replies; 104+ messages in thread
From: Mel Gorman @ 2007-03-02 17:59 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, Andrew Morton, npiggin, mingo, Joel Schopp, arjan,
	torvalds, mbligh, kamezawa.hiroyu, Linux Kernel Mailing List

On Fri, 2 Mar 2007, Christoph Lameter wrote:

> On Fri, 2 Mar 2007, Mel Gorman wrote:
>
>>> I still think that the list based approach is sufficient for memory
>>> hotplug if one restricts  the location of the unmovable MAX_ORDER chunks
>>> to not overlap the memory area where we would like to be able to remove
>>> memory.
>>
>> Yes, true. In the part where I bias placements of unmovable pages at
>> lower PFNs, additional steps would need to be taken. Specifically, the
>> lowest MAX_ORDER_NR_PAGES block used for movable pages would need to be
>> reclaimed for unmovable allocations.
>
> I think sparsemem can provide some memory maps that show where there are
> sections of memory that are hot-pluggable. So the MAX_ORDER blocks need
> to be categorized as to whether they are in such a section or not.

That makes the problem slightly easier. If sparsemem sections are aware of 
whether they are hotpluggable or not, __rmqueue_fallback() (from the 
list-based anti-frag patches) can be taught to never use those sections 
for unmovable allocations.

> If you
> need another MAX_ORDER block for an unmovable type of allocation then make
> sure that it is not marked as hotpluggable by sparsemem. If we are in an
> emergency situation where we must use a MAX_ORDER block that is currently
> hotpluggable for unmovable allocations then we need to trigger something
> in sparsemem that disables hotplug for that memory section.
>

Which should be doable.

>> It's simply more complex. I believe it's doable. The main plus going for
>> the zone is that it is a clearly understood concept and it gives hard
>> guarantees.
>
> And it gives the sysadmin headaches and increases VM management
> overhead because we now have more bits in the page struct that tell us
> about the zone that the page belongs to. Another distinction to worry
> about in the VM. If the limit is set too high then we have memory that is
> actually movable but, since it's on the wrong side of the limit, we cannot
> use it. If the limit is set too low then the system will crash.
>

I'm aware of this. I believe it could all be done in the context of 
list-based - just that it requires more code. Zones are easier to 
understand for most people and their behavior is better understood. If a 
workload is discovered that list-based doesn't handle, the zone can be 
used until the problem is solved.

This is why the anti-fragmentation and zone-based approaches are no longer 
mutually exclusive as they were in earlier versions.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 17:35                 ` Mark Gross
@ 2007-03-02 18:02                   ` Andrew Morton
  2007-03-02 19:02                     ` Mark Gross
  0 siblings, 1 reply; 104+ messages in thread
From: Andrew Morton @ 2007-03-02 18:02 UTC (permalink / raw)
  To: mgross
  Cc: Linus Torvalds, Balbir Singh, Mel Gorman, npiggin, clameter,
	mingo, jschopp, arjan, mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007 09:35:27 -0800
Mark Gross <mgross@linux.intel.com> wrote:

> > 
> > Will it be possible to just power the DIMMs off?  I don't see much point in
> > some half-power non-destructive mode.
> 
> I think so, but need to double check with the HW folks.
> 
> Technically, the DIMMs could be powered off, or put into two different low
> power non-destructive states (standby and suspend), but putting them
> in a low power non-destructive mode has much less latency and provides
> good bang for the buck for the LOC change needed to make it work.
> 
> Which lower power mode an application chooses will depend on latency
> tolerances of the app.  For the POC activities we are looking at we are
> targeting the lower latency option, but that doesn't lock out folks from
> trying to do something with the other options.
> 

If we don't evacuate all live data from all of the DIMM, we'll never be
able to power the thing down in many situations.

Given that we _have_ emptied the DIMM, we can just turn it off.  And
refilling it will be slow - often just disk speed.

So I don't see a useful use-case for non-destructive states.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 17:43           ` Rik van Riel
@ 2007-03-02 18:06             ` Andrew Morton
  2007-03-02 18:15               ` Christoph Lameter
  2007-03-02 20:59               ` Bill Irwin
  0 siblings, 2 replies; 104+ messages in thread
From: Andrew Morton @ 2007-03-02 18:06 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Christoph Lameter, Mel Gorman, npiggin, mingo, jschopp, arjan,
	torvalds, mbligh, linux-mm, linux-kernel

On Fri, 02 Mar 2007 12:43:42 -0500
Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
> > On Fri, 2 Mar 2007 09:23:49 -0800 (PST) Christoph Lameter <clameter@engr.sgi.com> wrote:
> > 
> >> On Fri, 2 Mar 2007, Andrew Morton wrote:
> >>
> >>>> Linux is *not* happy on 256GB systems.  Even on some 32GB systems
> >>>> the swappiness setting *needs* to be tweaked before Linux will even
> >>>> run in a reasonable way.
> >>> Please send testcases.
> >> It is not happy if you put 256GB into one zone.
> > 
> > Oh come on.  What's the workload?  What happens?  system time?  user time?
> > kernel profiles?
> 
> I can't share all the details, since a lot of the problems are customer
> workloads.
> 
> One particular case is a 32GB system with a database that takes most
> of memory.  The amount of actually freeable page cache memory is in
> the hundreds of MB.

Where's the rest of the memory? tmpfs?  mlocked?  hugetlb?

>   With swappiness at the default level of 60, kswapd
> ends up eating most of a CPU, and other tasks also dive into the pageout
> code.  Even with swappiness as high as 98, that system still has
> problems with the CPU use in the pageout code!
> 
> Another typical problem is that people want to back up their database
> servers.  During the backup, parts of the working set get evicted from
> the VM and performance is horrible.

userspace fixes for this are far, far better than any magic goo the kernel
can implement.  We really need to get off our butts and start educating
people.

> A third scenario is where a system has way more RAM than swap, and not
> a whole lot of freeable page cache.  In this case, the VM ends up
> spending WAY too much CPU time scanning and shuffling around essentially
> unswappable anonymous memory and tmpfs files.

Well we've allegedly fixed that, but it isn't going anywhere without
testing.




* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 18:06             ` Andrew Morton
@ 2007-03-02 18:15               ` Christoph Lameter
  2007-03-02 18:23                 ` Andrew Morton
  2007-03-02 18:23                 ` Rik van Riel
  2007-03-02 20:59               ` Bill Irwin
  1 sibling, 2 replies; 104+ messages in thread
From: Christoph Lameter @ 2007-03-02 18:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Mel Gorman, npiggin, mingo, jschopp, arjan,
	torvalds, mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, Andrew Morton wrote:

> > One particular case is a 32GB system with a database that takes most
> > of memory.  The amount of actually freeable page cache memory is in
> > the hundreds of MB.
> 
> Where's the rest of the memory? tmpfs?  mlocked?  hugetlb?

The memory is likely in use but there is enough memory free in unmapped 
clean pagecache pages so that we occasionally are able to free pages. Then 
the app is reading more from disk replenishing that ...
Thus we are forever cycling through the LRU lists moving pages between 
the lists aging etc etc. Can lead to a livelock.

> > A third scenario is where a system has way more RAM than swap, and not
> > a whole lot of freeable page cache.  In this case, the VM ends up
> > spending WAY too much CPU time scanning and shuffling around essentially
> > unswappable anonymous memory and tmpfs files.
> 
> Well we've allegedly fixed that, but it isn't going anywhere without
> testing.

We have fixed the case in which we compile the kernel without swap. Then 
anonymous pages behave like mlocked pages. Did we do more than that?



* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 18:15               ` Christoph Lameter
@ 2007-03-02 18:23                 ` Andrew Morton
  2007-03-02 18:23                 ` Rik van Riel
  1 sibling, 0 replies; 104+ messages in thread
From: Andrew Morton @ 2007-03-02 18:23 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Rik van Riel, Mel Gorman, npiggin, mingo, jschopp, arjan,
	torvalds, mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007 10:15:36 -0800 (PST)
Christoph Lameter <clameter@engr.sgi.com> wrote:

> On Fri, 2 Mar 2007, Andrew Morton wrote:
> 
> > > One particular case is a 32GB system with a database that takes most
> > > of memory.  The amount of actually freeable page cache memory is in
> > > the hundreds of MB.
> > 
> > Where's the rest of the memory? tmpfs?  mlocked?  hugetlb?
> 
> The memory is likely in use but there is enough memory free in unmapped 
> clean pagecache pages so that we occasionally are able to free pages. Then 
> the app is reading more from disk replenishing that ...
> Thus we are forever cycling through the LRU lists moving pages between 
> the lists aging etc etc. Can lead to a livelock.

Guys, with this level of detail these problems will never be fixed.

> > > A third scenario is where a system has way more RAM than swap, and not
> > > a whole lot of freeable page cache.  In this case, the VM ends up
> > > spending WAY too much CPU time scanning and shuffling around essentially
> > > unswappable anonymous memory and tmpfs files.
> > 
> > Well we've allegedly fixed that, but it isn't going anywhere without
> > testing.
> 
> We have fixed the case in which we compile the kernel without swap. Then 
> anonymous pages behave like mlocked pages. Did we do more than that?

oh yeah, we took the ran-out-of-swapcache code out.  But if we're going to
do this thing, we should find some way to bring it back.



* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 18:15               ` Christoph Lameter
  2007-03-02 18:23                 ` Andrew Morton
@ 2007-03-02 18:23                 ` Rik van Riel
  2007-03-02 19:31                   ` Christoph Lameter
  2007-03-02 21:12                   ` Bill Irwin
  1 sibling, 2 replies; 104+ messages in thread
From: Rik van Riel @ 2007-03-02 18:23 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Mel Gorman, npiggin, mingo, jschopp, arjan,
	torvalds, mbligh, linux-mm, linux-kernel

Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Andrew Morton wrote:
> 
>>> One particular case is a 32GB system with a database that takes most
>>> of memory.  The amount of actually freeable page cache memory is in
>>> the hundreds of MB.
>> Where's the rest of the memory? tmpfs?  mlocked?  hugetlb?
> 
> The memory is likely in use but there is enough memory free in unmapped 
> clean pagecache pages so that we occasionally are able to free pages. Then 
> the app is reading more from disk replenishing that ...
> Thus we are forever cycling through the LRU lists moving pages between 
> the lists aging etc etc. Can lead to a livelock.

In this particular case, the system even has swap free.

The kernel just chooses not to use it until it has scanned
some memory, due to the way the swappiness algorithm works.

With 32 CPUs diving into the page reclaim simultaneously,
each trying to scan a fraction of memory, this is disastrous
for performance.  A 256GB system should be even worse.
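
The heuristic in question, simplified from the mm/vmscan.c of this era
(constants here are illustrative): mapped pages are only reclaimed once
the mapped ratio, the scanning "distress" and the swappiness tunable sum
to at least 100, and distress only rises after repeated scanning at
increasing priority:

```c
#include <assert.h>

/*
 * Simplified from the shrink_active_list()/refill_inactive_zone() logic
 * of 2.6-era mm/vmscan.c.  Mapped (anonymous) pages are only considered
 * for pageout once the sum below reaches 100.  "distress" is
 * 100 >> priority, and priority only drops from 12 toward 0 as repeated
 * scans fail -- so swap is not touched until a lot of scanning has
 * already happened.
 */
static int reclaim_mapped(int mapped_ratio, int priority, int swappiness)
{
	int distress = 100 >> priority;
	int swap_tendency = mapped_ratio / 2 + distress + swappiness;

	return swap_tendency >= 100;
}
```

With 70% of memory mapped and swappiness at 60, the first passes (priority
12, distress 0) give a tendency of 95 and refuse to touch mapped pages, so
every CPU entering reclaim burns time in fruitless scans first.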

>>> A third scenario is where a system has way more RAM than swap, and not
>>> a whole lot of freeable page cache.  In this case, the VM ends up
>>> spending WAY too much CPU time scanning and shuffling around essentially
>>> unswappable anonymous memory and tmpfs files.
>> Well we've allegedly fixed that, but it isn't going anywhere without
>> testing.
> 
> We have fixed the case in which we compile the kernel without swap. Then 
> anonymous pages behave like mlocked pages. Did we do more than that?

Not AFAIK.

I would like to see separate pageout selection queues
for anonymous/tmpfs and page cache backed pages.  That
way we can simply scan only what we want to scan.

There are several ways available to balance pressure
between both sets of lists.

Splitting them out will also make it possible to do
proper use-once replacement for the page cache pages.
Ie. leaving the really active page cache pages on the
page cache active list, instead of deactivating them
because they're lower priority than anonymous pages.

That way we can do a backup without losing the page
cache working set.
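
A minimal data-structure sketch of the proposal (a toy model with
hypothetical names): per-type active/inactive queues, so that scanning
one type never has to wade through pages of the other:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model (hypothetical names) of the proposal: anonymous/tmpfs pages
 * and page-cache pages get their own pageout queues.
 */
enum lru_type { LRU_ANON, LRU_FILE };

struct pageout_queues {
	size_t active[2];	/* page counts stand in for the lists */
	size_t inactive[2];
};

/* Scan only the list we actually want to shrink; how pressure is
 * balanced between the two sets is a separate policy decision. */
static size_t scan_inactive(struct pageout_queues *q, enum lru_type type,
			    size_t nr_to_scan)
{
	size_t n = nr_to_scan < q->inactive[type] ?
		   nr_to_scan : q->inactive[type];

	q->inactive[type] -= n;	/* pretend every scanned page was freed */
	return n;
}
```

A backup's streaming I/O would then churn only the LRU_FILE lists and
leave the anonymous working set alone.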

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 17:16               ` Linus Torvalds
@ 2007-03-02 18:45                 ` Mark Gross
  2007-03-02 19:03                   ` Linus Torvalds
  2007-03-02 23:58                 ` Martin J. Bligh
  1 sibling, 1 reply; 104+ messages in thread
From: Mark Gross @ 2007-03-02 18:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Balbir Singh, Mel Gorman, npiggin, clameter,
	mingo, jschopp, arjan, mbligh, linux-mm, linux-kernel

On Fri, Mar 02, 2007 at 09:16:17AM -0800, Linus Torvalds wrote:
> 
> 
> On Fri, 2 Mar 2007, Mark Gross wrote:
> > > 
> > > Yes, the same issues exist for other DRAM forms too, but to a *much* 
> > > smaller degree.
> > 
> > DDR3-1333 may be better than FBDIMM's but don't count on it being much
> > better.
> 
> Hey, fair enough. But it's not a problem (and it doesn't have a solution) 
> today. I'm not sure it's going to have a solution tomorrow either.
> 
> > > Also, IN PRACTICE you're never ever going to see this anyway. Almost 
> > > everybody wants bank interleaving, because it's a huge performance win on 
> > > many loads. That, in turn, means that your memory will be spread out over 
> > > multiple DIMM's even for a single page, much less any bigger area.
> > 
> > 4-way interleave across banks on systems may not be as common as you may
> > think for future chip sets.  2-way interleave across DIMMs within a bank
> > will stay.
> 
> .. and think about a realistic future.
> 
> EVERYBODY will do on-die memory controllers. Yes, Intel doesn't do it 
> today, but in the one- to two-year timeframe even Intel will.

True.

> 
> What does that mean? It means that in bigger systems, you will no longer 
> even *have* 8 or 16 banks where turning off a few banks makes sense. 
> You'll quite often have just a few DIMM's per die, because that's what you 
> want for latency. Then you'll have CSI or HT or another interconnect.
> 
> And with a few DIMM's per die, you're back where even just 2-way 
> interleaving basically means that in order to turn off your DIMM, you 
> probably need to remove HALF the memory for that CPU.

I think there will be more than just 2 DIMMs per CPU socket on systems
that care about this type of capability.

> 
> In other words: TURNING OFF DIMM's IS A BEDTIME STORY FOR DIMWITTED 
> CHILDREN.


It's very true that taking advantage of the first incarnations of this
type of thing will be limited to specific workloads you personally don't
care about, but it's got applications and customers.

BTW, I hope we aren't talking past each other; there are low power states
where the RAM contents are preserved.

> 
> There are maybe a couple machines IN EXISTENCE TODAY that can do it. But 
> nobody actually does it in practice, and nobody even knows if it's going 
> to be viable (yes, DRAM takes energy, but trying to keep memory free will 
> likely waste power *too*, and I doubt anybody has any real idea of how 
> much any of this would actually help in practice).
> 
> And I don't think that will change. See above. The future is *not* moving 
> towards more and more DIMMS. Quite the reverse. On workstations, we are 
> currently in the "one or two DIMM's per die". Do you really think that 
> will change? Hell no. And in big servers, pretty much everybody agrees 
> that we will move towards that, rather than away from it.
> 
> So:
>  - forget about turning DIMM's off. There is *no* actual data supporting 
>    the notion that it's a good idea today, and I seriously doubt you can 
>    really argue that it will be a good idea in five or ten years. It's a 
>    hardware hack for a hardware problem, and the problems are way too 
>    complex for us to solve in time for the solution to be relevant.
> 
>  - aim for NUMA memory allocation and turning off whole *nodes*. That's 
>    much more likely to be productive in the longer timeframe. And yes, we 
>    may well want to do memory compaction for that too, but I suspect that 
>    the issues are going to be different (ie the way to do it is to simply 
>    prefer certain nodes for certain allocations, and then try to keep the 
>    jobs that you know can be idle on other nodes)

We are doing the NUMA approach.

> 
> Do you actually have real data supporting the notion that turning DIMM's 
> off will be reasonable and worthwhile? 
> 

Yes, we have data from our internal and external customers showing that
this stuff is worthwhile for specific workloads that some people care
about.  However, you need to understand that it's by definition marketing data.

--mgross


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 18:02                   ` Andrew Morton
@ 2007-03-02 19:02                     ` Mark Gross
  0 siblings, 0 replies; 104+ messages in thread
From: Mark Gross @ 2007-03-02 19:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Balbir Singh, Mel Gorman, npiggin, clameter,
	mingo, jschopp, arjan, mbligh, linux-mm, linux-kernel

On Fri, Mar 02, 2007 at 10:02:57AM -0800, Andrew Morton wrote:
> On Fri, 2 Mar 2007 09:35:27 -0800
> Mark Gross <mgross@linux.intel.com> wrote:
> 
> > > 
> > > Will it be possible to just power the DIMMs off?  I don't see much point in
> > > some half-power non-destructive mode.
> > 
> > I think so, but need to double check with the HW folks.
> > 
> > Technically, the DIMMs could be powered off, or put into two different low
> > power non-destructive states (standby and suspend), but putting them
> > in a low power non-destructive mode has much less latency and provides
> > good bang for the buck for the LOC change needed to make it work.
> > 
> > Which lower power mode an application chooses will depend on latency
> > tolerances of the app.  For the POC activities we are looking at we are
> > targeting the lower latency option, but that doesn't lock out folks from
> > trying to do something with the other options.
> > 
> 
> If we don't evacuate all live data from all of the DIMM, we'll never be
> able to power the thing down in many situations.
> 
> Given that we _have_ emptied the DIMM, we can just turn it off.  And
> refilling it will be slow - often just disk speed.
> 
> So I don't see a useful use-case for non-destructive states.

I'll post the RFC very soon to provide a better thread context for this
line of discussion, but to answer your question:

There are 2 power management policies we are looking at.  The first one
is allocation-based PM, and the other is access-based PM.  The access-based
PM needs chipset support, which is coming at a TBD date.

--mgross


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 18:45                 ` Mark Gross
@ 2007-03-02 19:03                   ` Linus Torvalds
  0 siblings, 0 replies; 104+ messages in thread
From: Linus Torvalds @ 2007-03-02 19:03 UTC (permalink / raw)
  To: Mark Gross
  Cc: Andrew Morton, Balbir Singh, Mel Gorman, npiggin, clameter,
	mingo, jschopp, arjan, mbligh, linux-mm, linux-kernel



On Fri, 2 Mar 2007, Mark Gross wrote:
> 
> I think there will be more than just 2 DIMMs per CPU socket on systems
> that care about this type of capability.

I agree. I think you'll have a nice mix of 2 and 4, although not likely a 
lot more. You want to have independent channels, and then within a channel 
you want to have as close to point-to-point as possible. 

But the reason that I think you're better off looking at a "node level" is 
that 

 (a) describing the DIMM setup is a total disaster. The interleaving is 
     part of it, but even in the absence of interleaving, we have so far 
     seen that describing DIMM mapping simply isn't a realistic thing to 
     be widely deployed, judging by the fact that we cannot even get a 
     first-order approximate mapping for the ECC error events.

     Going node-level means that we just piggy-back on the existing node 
     mapping, which is a lot more likely to actually be correct and 
     available (ie you may not know which bank is bank0 and how the 
     interleaving works, but you usually *do* know which bank is connected 
     to which CPU package)

     (Btw, I shouldn't have used the word "die", since it's really about 
     package - Intel obviously has a penchant for putting two dies per 
     package)

 (b) especially if you can actually shut down the memory, going node-wide 
     may mean that you can shut down the CPU's too (ie per-package sleep). 
     I bet the people who care enough to care about DIMM's would want to 
     have that *anyway*, so tying them together simplifies the problem.

> BTW, I hope we aren't talking past each other; there are low power states
> where the RAM contents are preserved.

Yes. They are almost as hard to handle, but the advantage is that if we 
get things wrong, it can still work most of the time (ie we don't have to 
migrate everything off, we just need to try to migrate the stuff that gets 
*used* off a DIMM, and hardware will hopefully end up quiescing the right 
memory controller channel totally automatically, without us having to know 
the exact mapping or even having to 100% always get it 100% right).

With FBDIMM in particular, I guess the biggest power cost isn't actually 
the DRAM content, but just the controllers.

Of course, I wonder how much actual point there is to FBDIMM's once you 
have on-die memory controllers and thus the reason for deep queueing is 
basically gone (since you'd spread out the memory rather than having it 
behind a few central controllers).

		Linus


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 18:23                 ` Rik van Riel
@ 2007-03-02 19:31                   ` Christoph Lameter
  2007-03-02 19:40                     ` Rik van Riel
  2007-03-02 21:12                   ` Bill Irwin
  1 sibling, 1 reply; 104+ messages in thread
From: Christoph Lameter @ 2007-03-02 19:31 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Mel Gorman, npiggin, mingo, jschopp, arjan,
	torvalds, mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, Rik van Riel wrote:

> I would like to see separate pageout selection queues
> for anonymous/tmpfs and page cache backed pages.  That
> way we can simply scan only what we want to scan.
> 
> There are several ways available to balance pressure
> between both sets of lists.
> 
> Splitting them out will also make it possible to do
> proper use-once replacement for the page cache pages.
> Ie. leaving the really active page cache pages on the
> page cache active list, instead of deactivating them
> because they're lower priority than anonymous pages.

Well, I would expect this to have marginal improvements and delay the 
inevitable for a while until we have even bigger memory. If the app uses 
mmapped data areas then the problem is still there. And such tinkering 
does not solve the issue of large-scale I/O requiring the handling of 
gazillions of page structs. I do not think there is a way around 
handling larger chunks of memory in an easier way. We already 
handle larger page sizes for some limited purposes, and with huge pages we 
already have a larger page size. Mel's defrag/anti-frag patches are 
necessary to allow us to deal with the resulting fragmentation problems.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 19:31                   ` Christoph Lameter
@ 2007-03-02 19:40                     ` Rik van Riel
  0 siblings, 0 replies; 104+ messages in thread
From: Rik van Riel @ 2007-03-02 19:40 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Mel Gorman, npiggin, mingo, jschopp, arjan,
	torvalds, mbligh, linux-mm, linux-kernel

Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Rik van Riel wrote:
> 
>> I would like to see separate pageout selection queues
>> for anonymous/tmpfs and page cache backed pages.  That
>> way we can simply scan only what we want to scan.
>>
>> There are several ways available to balance pressure
>> between both sets of lists.
>>
>> Splitting them out will also make it possible to do
>> proper use-once replacement for the page cache pages.
>> Ie. leaving the really active page cache pages on the
>> page cache active list, instead of deactivating them
>> because they're lower priority than anonymous pages.
> 
> Well I would expect this to have marginal improvements and delay the 
> inevitable for awhile until we have even bigger memory. If the app uses 
> mmapped data areas then the problem is still there.

I suspect we would not need to treat mapped file-backed memory any
differently from page cache that's not mapped.  After all, if we do
proper use-once accounting, the working set will be on the active
list and other cache will be flushed out the inactive list quickly.

Also, the IO cost for mmapped data areas is the same as the IO
cost for unmapped files, so there's no IO reason to treat them
differently, either.
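The split pageout queues being discussed can be modeled in miniature. The following is an illustrative userspace sketch, not the kernel's actual structures; `lru_lists`, `lru_add` and `lru_promote` are hypothetical names. It shows the core idea: separate active/inactive lists per page type, with use-once insertion so that streaming file I/O cannot displace the working set, and so that scanning one type's lists never touches the other's.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of split pageout selection queues: separate
 * active/inactive LRU lists for anonymous and file-backed pages.
 * Illustrative only -- not the kernel's real data structures. */

enum lru_type { LRU_ANON, LRU_FILE, LRU_NR_TYPES };

struct page {
	int active;          /* which list the page currently sits on */
	struct page *next;
};

struct lru_lists {
	struct page *active[LRU_NR_TYPES];
	struct page *inactive[LRU_NR_TYPES];
};

/* push onto a singly linked list head */
static void lru_push(struct page **head, struct page *p)
{
	p->next = *head;
	*head = p;
}

/* New pages start on the inactive list: a page must be referenced
 * again before promotion (use-once), so a one-shot streaming read
 * cannot flush the active working set. */
static void lru_add(struct lru_lists *lru, enum lru_type type,
		    struct page *p)
{
	p->active = 0;
	lru_push(&lru->inactive[type], p);
}

/* Promote a re-referenced page to the active list of the *same*
 * type; reclaim scanning file pages never disturbs the anon lists.
 * (This sketch omits unlinking from the inactive list.) */
static void lru_promote(struct lru_lists *lru, enum lru_type type,
			struct page *p)
{
	p->active = 1;
	lru_push(&lru->active[type], p);
}
```

With this shape, "scan only what we want to scan" falls out directly: page-cache reclaim walks only the `LRU_FILE` lists.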


-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 18:06             ` Andrew Morton
  2007-03-02 18:15               ` Christoph Lameter
@ 2007-03-02 20:59               ` Bill Irwin
  1 sibling, 0 replies; 104+ messages in thread
From: Bill Irwin @ 2007-03-02 20:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Christoph Lameter, Mel Gorman, npiggin, mingo,
	jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

On Fri, 02 Mar 2007 12:43:42 -0500 Rik van Riel <riel@redhat.com> wrote:
>> I can't share all the details, since a lot of the problems are customer
>> workloads.
>> One particular case is a 32GB system with a database that takes most
>> of memory.  The amount of actually freeable page cache memory is in
>> the hundreds of MB.

On Fri, Mar 02, 2007 at 10:06:19AM -0800, Andrew Morton wrote:
> Where's the rest of the memory? tmpfs?  mlocked?  hugetlb?

I know of one case that sounds similar to this, where unreclaimable pages are
pinned by refcounts held by bio's spread across about 850 spindles.
It's mostly read traffic. Several different tunables could be used
to work around it, nr_requests in particular, but also clamping down
on dirty limits to preposterously low levels and setting preposterously
large values of min_free_kbytes. Their kernel is, of course,
substantially downrev (2.6.9-based IIRC), so douse things heavily with
grains of salt.


-- wli

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 18:23                 ` Rik van Riel
  2007-03-02 19:31                   ` Christoph Lameter
@ 2007-03-02 21:12                   ` Bill Irwin
  2007-03-02 21:19                     ` Rik van Riel
  1 sibling, 1 reply; 104+ messages in thread
From: Bill Irwin @ 2007-03-02 21:12 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Christoph Lameter, Andrew Morton, Mel Gorman, npiggin, mingo,
	jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

On Fri, Mar 02, 2007 at 01:23:28PM -0500, Rik van Riel wrote:
> With 32 CPUs diving into the page reclaim simultaneously,
> each trying to scan a fraction of memory, this is disastrous
> for performance.  A 256GB system should be even worse.

Thundering herds of a sort pounding the LRU locks from direct reclaim
have set off the NMI oopser for users here.


-- wli

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 21:12                   ` Bill Irwin
@ 2007-03-02 21:19                     ` Rik van Riel
  2007-03-02 21:52                       ` Andrew Morton
  0 siblings, 1 reply; 104+ messages in thread
From: Rik van Riel @ 2007-03-02 21:19 UTC (permalink / raw)
  To: Bill Irwin, Rik van Riel, Christoph Lameter, Andrew Morton,
	Mel Gorman, npiggin, mingo, jschopp, arjan, torvalds, mbligh,
	linux-mm, linux-kernel

Bill Irwin wrote:
> On Fri, Mar 02, 2007 at 01:23:28PM -0500, Rik van Riel wrote:
>> With 32 CPUs diving into the page reclaim simultaneously,
>> each trying to scan a fraction of memory, this is disastrous
>> for performance.  A 256GB system should be even worse.
> 
> Thundering herds of a sort pounding the LRU locks from direct reclaim
> have set off the NMI oopser for users here.

Ditto here.

The main reason they end up pounding the LRU locks is the
swappiness heuristic.  They scan too much before deciding
that it would be a good idea to actually swap something
out, and with 32 CPUs doing such scanning simultaneously...
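The heuristic Rik describes can be modeled roughly as follows. This is a simplified sketch of the 2.6-era swappiness calculation (modeled on `refill_inactive_zone()`; the real kernel code has more inputs). Until the combined tendency crosses 100, mapped pages are skipped, so each CPU keeps rescanning instead of swapping:

```c
#include <assert.h>

/* Simplified model of the 2.6-era swappiness decision, not the exact
 * kernel code.  mapped_ratio is the percentage of memory that is
 * mapped; distress grows from 0 toward 100 as reclaim priority
 * drops.  Below the threshold, reclaim skips mapped pages and keeps
 * scanning -- the behaviour that pounds the LRU locks. */

static const int vm_swappiness = 60;	/* /proc/sys/vm/swappiness default */

static int should_reclaim_mapped(int mapped_ratio, int distress)
{
	int swap_tendency = mapped_ratio / 2 + distress + vm_swappiness;

	return swap_tendency >= 100;
}
```

With a mostly mapped workload, `distress` has to climb before this returns true, and 32 CPUs all pay the scanning cost in the meantime.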

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 16:31     ` Joel Schopp
@ 2007-03-02 21:37       ` Bill Irwin
  0 siblings, 0 replies; 104+ messages in thread
From: Bill Irwin @ 2007-03-02 21:37 UTC (permalink / raw)
  To: Joel Schopp
  Cc: Mel Gorman, Bill Irwin, akpm, npiggin, clameter, mingo, arjan,
	torvalds, mbligh, Linux Memory Management List,
	Linux Kernel Mailing List

At some point in the past, Mel Gorman wrote:
>> I can't think of a workload that totally makes a mess out of list-based. 
>> However, list-based makes no guarantees on availability. If a system 
>> administrator knows they need between 10,000 and 100,000 huge pages and 
>> doesn't want to waste memory pinning too many huge pages at boot-time, 
>> the zone-based mechanism would be what he wanted.

On Fri, Mar 02, 2007 at 10:31:39AM -0600, Joel Schopp wrote:
> From our testing with earlier versions of list based for memory hot-unplug 
> on pSeries machines we were able to hot-unplug huge amounts of memory after 
> running the nastiest workloads we could find for over a week.  Without the 
> patches we were unable to hot-unplug anything within minutes of running the 
> same workloads.
> If something works for 99.999% of people (list based) and there is an easy 
> way to configure it for the other 0.001% of the people ("zone" based) I 
> call that a great solution.  I really don't understand what the resistance 
> is to these patches.

Sorry if I was unclear; I was anticipating others' objections and
offering to assist in responding to them. I myself have no concerns
about the above strategy, apart from generally wanting to recover the
list-based patch's hugepage availability without demanding it as a
merging criterion.


-- wli

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 21:19                     ` Rik van Riel
@ 2007-03-02 21:52                       ` Andrew Morton
  2007-03-02 22:03                         ` Rik van Riel
  0 siblings, 1 reply; 104+ messages in thread
From: Andrew Morton @ 2007-03-02 21:52 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Bill Irwin, Christoph Lameter, Mel Gorman, npiggin, mingo,
	jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

On Fri, 02 Mar 2007 16:19:19 -0500
Rik van Riel <riel@redhat.com> wrote:

> Bill Irwin wrote:
> > On Fri, Mar 02, 2007 at 01:23:28PM -0500, Rik van Riel wrote:
> >> With 32 CPUs diving into the page reclaim simultaneously,
> >> each trying to scan a fraction of memory, this is disastrous
> >> for performance.  A 256GB system should be even worse.
> > 
> > Thundering herds of a sort pounding the LRU locks from direct reclaim
> > have set off the NMI oopser for users here.
> 
> Ditto here.

Opterons?

> The main reason they end up pounding the LRU locks is the
> swappiness heuristic.  They scan too much before deciding
> that it would be a good idea to actually swap something
> out, and with 32 CPUs doing such scanning simultaneously...

What kernel version?

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 21:52                       ` Andrew Morton
@ 2007-03-02 22:03                         ` Rik van Riel
  2007-03-02 22:22                           ` Andrew Morton
  0 siblings, 1 reply; 104+ messages in thread
From: Rik van Riel @ 2007-03-02 22:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Bill Irwin, Christoph Lameter, Mel Gorman, npiggin, mingo,
	jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

Andrew Morton wrote:
> On Fri, 02 Mar 2007 16:19:19 -0500
> Rik van Riel <riel@redhat.com> wrote:
>> Bill Irwin wrote:
>>> On Fri, Mar 02, 2007 at 01:23:28PM -0500, Rik van Riel wrote:
>>>> With 32 CPUs diving into the page reclaim simultaneously,
>>>> each trying to scan a fraction of memory, this is disastrous
>>>> for performance.  A 256GB system should be even worse.
>>> Thundering herds of a sort pounding the LRU locks from direct reclaim
>>> have set off the NMI oopser for users here.
>> Ditto here.
> 
> Opterons?

It's happened on IA64, too.  Probably on Intel x86-64 as well.

>> The main reason they end up pounding the LRU locks is the
>> swappiness heuristic.  They scan too much before deciding
>> that it would be a good idea to actually swap something
>> out, and with 32 CPUs doing such scanning simultaneously...
> 
> What kernel version?

Customers are on the 2.6.9 based RHEL4 kernel, but I believe
we have reproduced the problem on 2.6.18 too during stress
tests.

I have no reason to believe we should stick our heads in the
sand and pretend it no longer exists on 2.6.21.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 22:03                         ` Rik van Riel
@ 2007-03-02 22:22                           ` Andrew Morton
  2007-03-02 22:34                             ` Rik van Riel
                                               ` (2 more replies)
  0 siblings, 3 replies; 104+ messages in thread
From: Andrew Morton @ 2007-03-02 22:22 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Bill Irwin, Christoph Lameter, Mel Gorman, npiggin, mingo,
	jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

On Fri, 02 Mar 2007 17:03:10 -0500
Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
> > On Fri, 02 Mar 2007 16:19:19 -0500
> > Rik van Riel <riel@redhat.com> wrote:
> >> Bill Irwin wrote:
> >>> On Fri, Mar 02, 2007 at 01:23:28PM -0500, Rik van Riel wrote:
> >>>> With 32 CPUs diving into the page reclaim simultaneously,
> >>>> each trying to scan a fraction of memory, this is disastrous
> >>>> for performance.  A 256GB system should be even worse.
> >>> Thundering herds of a sort pounding the LRU locks from direct reclaim
> >>> have set off the NMI oopser for users here.
> >> Ditto here.
> > 
> > Opterons?
> 
> It's happened on IA64, too.  Probably on Intel x86-64 as well.

Opterons seem to be particularly prone to lock starvation where a cacheline
gets captured in a single package forever.

> >> The main reason they end up pounding the LRU locks is the
> >> swappiness heuristic.  They scan too much before deciding
> >> that it would be a good idea to actually swap something
> >> out, and with 32 CPUs doing such scanning simultaneously...
> > 
> > What kernel version?
> 
> Customers are on the 2.6.9 based RHEL4 kernel, but I believe
> we have reproduced the problem on 2.6.18 too during stress
> tests.

The prev_priority fixes were post-2.6.18

> I have no reason to believe we should stick our heads in the
> sand and pretend it no longer exists on 2.6.21.

I have no reason to believe anything.  All I see is handwaviness,
speculation and grand plans to rewrite vast amounts of stuff without even a
testcase to demonstrate that said rewrite improved anything.

None of this is going anywhere, is it?

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 22:22                           ` Andrew Morton
@ 2007-03-02 22:34                             ` Rik van Riel
  2007-03-02 22:51                               ` Martin Bligh
                                                 ` (2 more replies)
  2007-03-02 23:16                             ` [PATCH] : Optimizes timespec_trunc() Eric Dumazet
  2007-03-03  0:33                             ` The performance and behaviour of the anti-fragmentation related patches William Lee Irwin III
  2 siblings, 3 replies; 104+ messages in thread
From: Rik van Riel @ 2007-03-02 22:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Bill Irwin, Christoph Lameter, Mel Gorman, npiggin, mingo,
	jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

Andrew Morton wrote:
> On Fri, 02 Mar 2007 17:03:10 -0500
> Rik van Riel <riel@redhat.com> wrote:
> 
>> Andrew Morton wrote:
>>> On Fri, 02 Mar 2007 16:19:19 -0500
>>> Rik van Riel <riel@redhat.com> wrote:
>>>> Bill Irwin wrote:
>>>>> On Fri, Mar 02, 2007 at 01:23:28PM -0500, Rik van Riel wrote:
>>>>>> With 32 CPUs diving into the page reclaim simultaneously,
>>>>>> each trying to scan a fraction of memory, this is disastrous
>>>>>> for performance.  A 256GB system should be even worse.
>>>>> Thundering herds of a sort pounding the LRU locks from direct reclaim
>>>>> have set off the NMI oopser for users here.
>>>> Ditto here.
>>> Opterons?
>> It's happened on IA64, too.  Probably on Intel x86-64 as well.
> 
> Opterons seem to be particularly prone to lock starvation where a cacheline
> gets captured in a single package forever.
> 
>>>> The main reason they end up pounding the LRU locks is the
>>>> swappiness heuristic.  They scan too much before deciding
>>>> that it would be a good idea to actually swap something
>>>> out, and with 32 CPUs doing such scanning simultaneously...
>>> What kernel version?
>> Customers are on the 2.6.9 based RHEL4 kernel, but I believe
>> we have reproduced the problem on 2.6.18 too during stress
>> tests.
> 
> The prev_priority fixes were post-2.6.18

We tested them.  They only alleviate the problem slightly in
good situations, but things still fall apart badly with less
friendly workloads.

>> I have no reason to believe we should stick our heads in the
>> sand and pretend it no longer exists on 2.6.21.
> 
> I have no reason to believe anything.  All I see is handwaviness,
> speculation and grand plans to rewrite vast amounts of stuff without even a
> testcase to demonstrate that said rewrite improved anything.

Your attitude is exactly why the VM keeps falling apart over
and over again.

Fixing "a testcase" in the VM tends to introduce problems for
other test cases, ad infinitum. There's a reason we end up
fixing the same bugs over and over again.

I have been looking through a few hundred VM related bugzillas
and have found the same bugs persist over many different
versions of Linux, sometimes temporarily fixed, but they seem
to always come back eventually...

> None of this is going anywhere, is it?

I will test my changes before I send them to you, but I cannot
promise you that you'll have the computers or software needed
to reproduce the problems.  I doubt I'll have full time access
to such systems myself, either.

32GB is pretty much the minimum size to reproduce some of these
problems. Some workloads may need larger systems to easily trigger
them.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 22:34                             ` Rik van Riel
@ 2007-03-02 22:51                               ` Martin Bligh
  2007-03-02 22:54                                 ` Rik van Riel
  2007-03-02 22:52                               ` Chuck Ebbert
  2007-03-02 22:59                               ` Andrew Morton
  2 siblings, 1 reply; 104+ messages in thread
From: Martin Bligh @ 2007-03-02 22:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Bill Irwin, Christoph Lameter, Mel Gorman,
	npiggin, mingo, jschopp, arjan, torvalds, linux-mm, linux-kernel

>> None of this is going anywhere, is it?
> 
> I will test my changes before I send them to you, but I cannot
> promise you that you'll have the computers or software needed
> to reproduce the problems.  I doubt I'll have full time access
> to such systems myself, either.
> 
> 32GB is pretty much the minimum size to reproduce some of these
> problems. Some workloads may need larger systems to easily trigger
> them.

We can find a 32GB system here pretty easily to test things on if
need be.  Setting up large commercial databases is much harder.

I don't have such a machine in the public set of machines we're going
to push to test.kernel.org from at the moment, but will see if I can
arrange it in the future if it's important.


M.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 22:34                             ` Rik van Riel
  2007-03-02 22:51                               ` Martin Bligh
@ 2007-03-02 22:52                               ` Chuck Ebbert
  2007-03-02 22:59                               ` Andrew Morton
  2 siblings, 0 replies; 104+ messages in thread
From: Chuck Ebbert @ 2007-03-02 22:52 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Bill Irwin, Christoph Lameter, Mel Gorman,
	npiggin, mingo, jschopp, arjan, torvalds, mbligh, linux-mm,
	linux-kernel

Rik van Riel wrote:
> 32GB is pretty much the minimum size to reproduce some of these
> problems. Some workloads may need larger systems to easily trigger
> them.
> 

Hundreds of disks all doing IO at once may also be needed, as
wli points out. Such systems are not readily available for testing.




^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 22:51                               ` Martin Bligh
@ 2007-03-02 22:54                                 ` Rik van Riel
  2007-03-02 23:28                                   ` Martin J. Bligh
  0 siblings, 1 reply; 104+ messages in thread
From: Rik van Riel @ 2007-03-02 22:54 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Andrew Morton, Bill Irwin, Christoph Lameter, Mel Gorman,
	npiggin, mingo, jschopp, arjan, torvalds, linux-mm, linux-kernel

Martin Bligh wrote:
>>> None of this is going anywhere, is it?
>>
>> I will test my changes before I send them to you, but I cannot
>> promise you that you'll have the computers or software needed
>> to reproduce the problems.  I doubt I'll have full time access
>> to such systems myself, either.
>>
>> 32GB is pretty much the minimum size to reproduce some of these
>> problems. Some workloads may need larger systems to easily trigger
>> them.
> 
> We can find a 32GB system here pretty easily to test things on if
> need be.  Setting up large commercial databases is much harder.

That's my problem, too.

There does not seem to exist any single set of test cases that
accurately predicts how the VM will behave with customer
workloads.

The one thing I can do relatively easily is go through a few
hundred bugzillas and figure out what kinds of problems have
been plaguing the VM consistently over the last few years.
I just finished doing that, and am trying to come up with
fixes for the problems that just don't seem to be easily
fixable with bandaids...

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 22:34                             ` Rik van Riel
  2007-03-02 22:51                               ` Martin Bligh
  2007-03-02 22:52                               ` Chuck Ebbert
@ 2007-03-02 22:59                               ` Andrew Morton
  2007-03-02 23:20                                 ` Rik van Riel
  2007-03-03  1:40                                 ` William Lee Irwin III
  2 siblings, 2 replies; 104+ messages in thread
From: Andrew Morton @ 2007-03-02 22:59 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Bill Irwin, Christoph Lameter, Mel Gorman, npiggin, mingo,
	jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

On Fri, 02 Mar 2007 17:34:31 -0500
Rik van Riel <riel@redhat.com> wrote:

> >>>> The main reason they end up pounding the LRU locks is the
> >>>> swappiness heuristic.  They scan too much before deciding
> >>>> that it would be a good idea to actually swap something
> >>>> out, and with 32 CPUs doing such scanning simultaneously...
> >>> What kernel version?
> >> Customers are on the 2.6.9 based RHEL4 kernel, but I believe
> >> we have reproduced the problem on 2.6.18 too during stress
> >> tests.
> > 
> > The prev_priority fixes were post-2.6.18
> 
> We tested them.  They only alleviate the problem slightly in
> good situations, but things still fall apart badly with less
> friendly workloads.

What is it with vendors finding MM problems and either not fixing them or
kludging around them and not telling the upstream maintainers about *any*
of it?

> >> I have no reason to believe we should stick our heads in the
> >> sand and pretend it no longer exists on 2.6.21.
> > 
> > I have no reason to believe anything.  All I see is handwaviness,
> > speculation and grand plans to rewrite vast amounts of stuff without even a
> > testcase to demonstrate that said rewrite improved anything.
> 
> Your attitude is exactly why the VM keeps falling apart over
> and over again.
> 
> Fixing "a testcase" in the VM tends to introduce problems for
> other test cases, ad infinitum.

In that case it was a bad fix.  The aim is to fix known problems without
introducing regressions in other areas.  A perfectly legitimate approach.

You seem to be saying that we'd be worse off if we actually had a testcase.

> There's a reason we end up
> fixing the same bugs over and over again.

No we don't.

> I have been looking through a few hundred VM related bugzillas
> and have found the same bugs persist over many different
> versions of Linux, sometimes temporarily fixed, but they seem
> to always come back eventually...
> 
> > None of this is going anywhere, is it?
> 
> I will test my changes before I send them to you, but I cannot
> promise you that you'll have the computers or software needed
> to reproduce the problems.  I doubt I'll have full time access
> to such systems myself, either.
> 
> 32GB is pretty much the minimum size to reproduce some of these
> problems. Some workloads may need larger systems to easily trigger

32GB isn't particularly large.

Somehow I don't believe that a person or organisation which is incapable of
preparing even a simple testcase will be capable of fixing problems such as
this without breaking things.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* [PATCH] : Optimizes timespec_trunc() 
  2007-03-02 22:22                           ` Andrew Morton
  2007-03-02 22:34                             ` Rik van Riel
@ 2007-03-02 23:16                             ` Eric Dumazet
  2007-03-03  0:33                             ` The performance and behaviour of the anti-fragmentation related patches William Lee Irwin III
  2 siblings, 0 replies; 104+ messages in thread
From: Eric Dumazet @ 2007-03-02 23:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 872 bytes --]

The first thing timespec_trunc() does is:

  if (gran <= jiffies_to_usecs(1) * 1000)

This should really be a test against a constant known at compile time.

Alas, it isn't. jiffies_to_usecs() was not inlined, so the C compiler emits
a function call and a multiply to compute: a CONSTANT.

mov    $0x1,%edi
mov    %rbx,0xffffffffffffffe8(%rbp)
mov    %r12,0xfffffffffffffff0(%rbp)
mov    %edx,%ebx
mov    %rsi,0xffffffffffffffc8(%rbp)
mov    %rsi,%r12
callq  ffffffff80232010 <jiffies_to_usecs>
imul   $0x3e8,%eax,%eax
cmp    %ebx,%eax

This patch reorders kernel/time.c a bit so that jiffies_to_usecs() is defined 
before timespec_trunc(), so that the compiler now generates:

cmp    $0x3d0900,%edx  (HZ=250 on my machine)

This gives better code (timespec_trunc() becomes a leaf function) and a 
smaller kernel as well.
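The effect is easy to reproduce in userspace. The sketch below hard-codes the kernel's HZ=250 constants (it is an illustration, not kernel code): because the inline definition is visible before its caller, the compiler can fold `jiffies_to_usecs(1) * 1000` down to the literal 4000000 (0x3d0900).

```c
#include <assert.h>

/* Userspace illustration of the patch's point, with kernel constants
 * hard-coded for an HZ=250 build.  With the inline definition visible
 * before its caller, jiffies_to_usecs(1) * 1000 folds to the literal
 * 4000000 (0x3d0900) at compile time. */

#define HZ 250
#define USEC_PER_SEC 1000000UL

static inline unsigned int jiffies_to_usecs(const unsigned long j)
{
#if HZ <= USEC_PER_SEC && !(USEC_PER_SEC % HZ)
	return (USEC_PER_SEC / HZ) * j;		/* 4000 * j at HZ=250 */
#else
	return (j * USEC_PER_SEC) / HZ;
#endif
}

/* Models the opening test of timespec_trunc() */
static int gran_is_subjiffy(unsigned long gran)
{
	return gran <= jiffies_to_usecs(1) * 1000;
}
```

Compiled at -O1 or above, gran_is_subjiffy() contains a single compare against the constant, with no call and no multiply.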

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

[-- Attachment #2: time.patch --]
[-- Type: text/plain, Size: 2100 bytes --]

diff --git a/kernel/time.c b/kernel/time.c
index c6c80ea..a0cf037 100644
--- a/kernel/time.c
+++ b/kernel/time.c
@@ -247,6 +247,36 @@ struct timespec current_fs_time(struct s
 }
 EXPORT_SYMBOL(current_fs_time);
 
+/*
+ * Convert jiffies to milliseconds and back.
+ *
+ * Avoid unnecessary multiplications/divisions in the
+ * two most common HZ cases:
+ */
+unsigned int inline jiffies_to_msecs(const unsigned long j)
+{
+#if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
+	return (MSEC_PER_SEC / HZ) * j;
+#elif HZ > MSEC_PER_SEC && !(HZ % MSEC_PER_SEC)
+	return (j + (HZ / MSEC_PER_SEC) - 1)/(HZ / MSEC_PER_SEC);
+#else
+	return (j * MSEC_PER_SEC) / HZ;
+#endif
+}
+EXPORT_SYMBOL(jiffies_to_msecs);
+
+unsigned int inline jiffies_to_usecs(const unsigned long j)
+{
+#if HZ <= USEC_PER_SEC && !(USEC_PER_SEC % HZ)
+	return (USEC_PER_SEC / HZ) * j;
+#elif HZ > USEC_PER_SEC && !(HZ % USEC_PER_SEC)
+	return (j + (HZ / USEC_PER_SEC) - 1)/(HZ / USEC_PER_SEC);
+#else
+	return (j * USEC_PER_SEC) / HZ;
+#endif
+}
+EXPORT_SYMBOL(jiffies_to_usecs);
+
 /**
  * timespec_trunc - Truncate timespec to a granularity
  * @t: Timespec
@@ -471,36 +501,6 @@ struct timeval ns_to_timeval(const s64 n
 }
 
 /*
- * Convert jiffies to milliseconds and back.
- *
- * Avoid unnecessary multiplications/divisions in the
- * two most common HZ cases:
- */
-unsigned int jiffies_to_msecs(const unsigned long j)
-{
-#if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
-	return (MSEC_PER_SEC / HZ) * j;
-#elif HZ > MSEC_PER_SEC && !(HZ % MSEC_PER_SEC)
-	return (j + (HZ / MSEC_PER_SEC) - 1)/(HZ / MSEC_PER_SEC);
-#else
-	return (j * MSEC_PER_SEC) / HZ;
-#endif
-}
-EXPORT_SYMBOL(jiffies_to_msecs);
-
-unsigned int jiffies_to_usecs(const unsigned long j)
-{
-#if HZ <= USEC_PER_SEC && !(USEC_PER_SEC % HZ)
-	return (USEC_PER_SEC / HZ) * j;
-#elif HZ > USEC_PER_SEC && !(HZ % USEC_PER_SEC)
-	return (j + (HZ / USEC_PER_SEC) - 1)/(HZ / USEC_PER_SEC);
-#else
-	return (j * USEC_PER_SEC) / HZ;
-#endif
-}
-EXPORT_SYMBOL(jiffies_to_usecs);
-
-/*
  * When we convert to jiffies then we interpret incoming values
  * the following way:
  *

^ permalink raw reply related	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 22:59                               ` Andrew Morton
@ 2007-03-02 23:20                                 ` Rik van Riel
  2007-03-03  1:40                                 ` William Lee Irwin III
  1 sibling, 0 replies; 104+ messages in thread
From: Rik van Riel @ 2007-03-02 23:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Bill Irwin, Christoph Lameter, Mel Gorman, npiggin, mingo,
	jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

Andrew Morton wrote:

> Somehow I don't believe that a person or organisation which is incapable of
> preparing even a simple testcase will be capable of fixing problems such as
> this without breaking things.

I don't believe anybody who relies on one simple test case will
ever be capable of evaluating a patch without breaking things.

Test cases can show problems, but fixing a test case is no
guarantee at all that your VM will behave ok with real world
workloads.  Test cases for the VM can *never* be relied on
to show that a problem went away.

I'll do my best, but I can't promise a simple test case
for every single problem that's plaguing the VM.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 22:54                                 ` Rik van Riel
@ 2007-03-02 23:28                                   ` Martin J. Bligh
  2007-03-03  0:24                                     ` Andrew Morton
  0 siblings, 1 reply; 104+ messages in thread
From: Martin J. Bligh @ 2007-03-02 23:28 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Bill Irwin, Christoph Lameter, Mel Gorman,
	npiggin, mingo, jschopp, arjan, torvalds, linux-mm, linux-kernel

>>> 32GB is pretty much the minimum size to reproduce some of these
>>> problems. Some workloads may need larger systems to easily trigger
>>> them.
>>
>> We can find a 32GB system here pretty easily to test things on if
>> need be.  Setting up large commercial databases is much harder.
> 
> That's my problem, too.
> 
> There does not seem to exist any single set of test cases that
> accurately predicts how the VM will behave with customer
> workloads.

Tracing might help? Showing Andrew traces of what happened in
production for the prev_priority change made it much easier to
demonstrate and explain the real problem ...

M.



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 17:16               ` Linus Torvalds
  2007-03-02 18:45                 ` Mark Gross
@ 2007-03-02 23:58                 ` Martin J. Bligh
  1 sibling, 0 replies; 104+ messages in thread
From: Martin J. Bligh @ 2007-03-02 23:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Gross, Andrew Morton, Balbir Singh, Mel Gorman, npiggin,
	clameter, mingo, jschopp, arjan, linux-mm, linux-kernel

> .. and think about a realistic future.
> 
> EVERYBODY will do on-die memory controllers. Yes, Intel doesn't do it 
> today, but in the one- to two-year timeframe even Intel will.
> 
> What does that mean? It means that in bigger systems, you will no longer 
> even *have* 8 or 16 banks where turning off a few banks makes sense. 
> You'll quite often have just a few DIMM's per die, because that's what you 
> want for latency. Then you'll have CSI or HT or another interconnect.
> 
> And with a few DIMM's per die, you're back where even just 2-way 
> interleaving basically means that in order to turn off your DIMM, you 
> probably need to remove HALF the memory for that CPU.
> 
> In other words: TURNING OFF DIMM's IS A BEDTIME STORY FOR DIMWITTED 
> CHILDREN.

Even with only 4 banks per CPU, and 2-way interleaving, we could still
power off half the DIMMs in the system. That's a huge impact on the
power budget for a large cluster.

No, it's not ideal, but what was that quote again ... "perfect is the
enemy of good"? Something like that ;-)

> There are maybe a couple machines IN EXISTENCE TODAY that can do it. But 
> nobody actually does it in practice, and nobody even knows if it's going 
> to be viable (yes, DRAM takes energy, but trying to keep memory free will 
> likely waste power *too*, and I doubt anybody has any real idea of how 
> much any of this would actually help in practice).

Batch jobs across clusters have spikes at different times of the day,
etc., that are fairly predictable in many cases.

M.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 23:28                                   ` Martin J. Bligh
@ 2007-03-03  0:24                                     ` Andrew Morton
  0 siblings, 0 replies; 104+ messages in thread
From: Andrew Morton @ 2007-03-03  0:24 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Rik van Riel, Bill Irwin, Christoph Lameter, Mel Gorman, npiggin,
	mingo, jschopp, arjan, torvalds, linux-mm, linux-kernel

On Fri, 02 Mar 2007 15:28:43 -0800
"Martin J. Bligh" <mbligh@mbligh.org> wrote:

> >>> 32GB is pretty much the minimum size to reproduce some of these
> >>> problems. Some workloads may need larger systems to easily trigger
> >>> them.
> >>
> >> We can find a 32GB system here pretty easily to test things on if
> >> need be.  Setting up large commercial databases is much harder.
> > 
> > That's my problem, too.
> > 
> > There does not seem to exist any single set of test cases that
> > accurately predicts how the VM will behave with customer
> > workloads.
> 
> Tracing might help? Showing Andrew traces of what happened in
> production for the prev_priority change made it much easier to
> demonstrate and explain the real problem ...
> 

Tracing is one way.

The other way is the old scientific method:

- develop a theory
- add sufficient instrumentation to prove or disprove that theory
- run workload, crunch on numbers
- repeat

Of course, multiple theories can be proven/disproven in a single pass.

Practically, this means adding one new /proc/vmstat entry for each `goto
keep*' in shrink_page_list().  And more instrumentation in
shrink_active_list() to determine the behaviour of swap_tendency.

Once that process is finished, we should have a thorough understanding of
what the problem is.  We can then construct a testcase (it'll be a couple
hundred lines only) and use that testcase to determine what implementation
changes are needed, and whether it actually worked.

Then go back to the real workload, verify that it's still fixed.

Then do whitebox testing of other workloads to check that they haven't
regressed.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 22:22                           ` Andrew Morton
  2007-03-02 22:34                             ` Rik van Riel
  2007-03-02 23:16                             ` [PATCH] : Optimizes timespec_trunc() Eric Dumazet
@ 2007-03-03  0:33                             ` William Lee Irwin III
  2007-03-03  0:54                               ` Andrew Morton
  2007-03-03  3:15                               ` Christoph Lameter
  2 siblings, 2 replies; 104+ messages in thread
From: William Lee Irwin III @ 2007-03-03  0:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Bill Irwin, Christoph Lameter, Mel Gorman, npiggin,
	mingo, jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

On Fri, Mar 02, 2007 at 02:22:56PM -0800, Andrew Morton wrote:
> Opterons seem to be particularly prone to lock starvation where a cacheline
> gets captured in a single package for ever.

AIUI that phenomenon is universal to NUMA. Maybe it's time we
reexamined our locking algorithms in the light of fairness
considerations.


-- wli

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-03  0:33                             ` The performance and behaviour of the anti-fragmentation related patches William Lee Irwin III
@ 2007-03-03  0:54                               ` Andrew Morton
  2007-03-03  3:15                               ` Christoph Lameter
  1 sibling, 0 replies; 104+ messages in thread
From: Andrew Morton @ 2007-03-03  0:54 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Rik van Riel, Bill Irwin, Christoph Lameter, Mel Gorman, npiggin,
	mingo, jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007 16:33:19 -0800
William Lee Irwin III <wli@holomorphy.com> wrote:

> On Fri, Mar 02, 2007 at 02:22:56PM -0800, Andrew Morton wrote:
> > Opterons seem to be particularly prone to lock starvation where a cacheline
> > gets captured in a single package for ever.
> 
> AIUI that phenomenon is universal to NUMA. Maybe it's time we
> reexamined our locking algorithms in the light of fairness
> considerations.
> 

It's also a multicore thing.  iirc Kiran was seeing it on Intel CPUs.

I expect the phenomenon would be observable on a number of locks in the
kernel, given the appropriate workload.  We just hit it first on lru_lock.

I'd have thought that increasing SWAP_CLUSTER_MAX by two or four orders of
magnitude would plug it, simply by decreasing the acquisition frequency,
but I think Kiran fiddled with that to no effect.


See below for Linus's thoughts, forwarded without permission..





Begin forwarded message:

Date: Mon, 22 Jan 2007 13:49:02 -0800 (PST)
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Andrew Morton <akpm@osdl.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>, Ravikiran G Thirumalai <kiran@scalex86.org>
Subject: Re: High lock spin time for zone->lru_lock under extreme conditions



On Mon, 22 Jan 2007, Andrew Morton wrote:
> 
> Please review the whole thread sometime.  I think we're pretty screwed, and
> the problem will only become worse as more cores get rolled out and I don't
> know what to do about it apart from whining to Intel, but that won't fix
> anything.

I think people need to realize that spinlocks are always going to be 
unfair, and *extremely* so under some conditions. And yes, multi-core 
brought those conditions home to roost for some people (two or more cores 
much closer to each other than others, and able to basically ping-pong the 
spinlock to each other, with nobody else ever able to get it).

There's only a few possible solutions:

 - use the much slower semaphores, which actually try to do fairness. 

 - if you cannot sleep, introduce a separate "fair spinlock" type. It's 
   going to be appreciably slower (and will possibly have a bigger memory 
   footprint) than a regular spinlock, though. But it's certainly a 
   possible thing to do.

 - make sure no lock that you care about ever has high enough contention 
   to matter. NOTE! back-off etc simply will not help. This is not a 
   back-off issue. Back-off helps keep down coherency traffic, but it 
   doesn't help fairness.

If somebody wants to play with fair spinlocks, go wild. I looked at it at 
one point, and it was not wonderful. It's pretty complicated to do, and 
the best way I could come up with was literally a list of waiting CPU's 
(but you only need one static list entry per CPU). I didn't bother to 
think a whole lot about it.

The "never enough contention" is the real solution. For example, anything 
that drops and just re-takes the lock again (which some paths do for 
latency reduction) won't do squat. The same CPU that dropped the lock will 
basically always be able to retake it (and multi-core just means that is 
even more true, with the lock staying within one die even if some other 
core can get it).

Of course, "never enough contention" may not be possible for all locks. 
Which is why a "fair spinlock" may be the solution - use it for the few 
locks that care (and the VM locks could easily be it).

What CANNOT work: timeouts. A watchdog won't work. If you have workloads 
with enough contention, once you have enough CPU's, there's no upper bound 
on one of the cores not being able to get the lock.

On the other hand, what CAN work is: not caring. If it's ok to not be 
fair, and it only happens under extreme load, then "we don't care" is a 
perfectly fine option. 

In the "it could work" corner, I used to hope that cache coherency 
protocols in hw would do some kind of fairness thing, but I've come to the 
conclusion that it's just too hard. It's hard enough for software, it's 
likely really painful for hw too. So not only does hw generally not do it 
today (although certain platforms are better at it than others), I don't 
really expect this to change.

If anything, we'll see more of it, since multicore is one thing that makes 
things worse (as does multiple levels of caching - NUMA machines tend to 
have this problem even without multi-core, simply because they don't have 
a shared bus, which happens to hide many cases).

I'm personally in the "don't care" camp, until somebody shows a real-life 
workload. I'd often prefer to disable a watchdog if that's the biggest 
problem, for example. But if there's a real load that shows this as a real 
problem...

			Linus

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 22:59                               ` Andrew Morton
  2007-03-02 23:20                                 ` Rik van Riel
@ 2007-03-03  1:40                                 ` William Lee Irwin III
  2007-03-03  1:58                                   ` Andrew Morton
  1 sibling, 1 reply; 104+ messages in thread
From: William Lee Irwin III @ 2007-03-03  1:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Bill Irwin, Christoph Lameter, Mel Gorman, npiggin,
	mingo, jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

On Fri, Mar 02, 2007 at 02:59:06PM -0800, Andrew Morton wrote:
> What is it with vendors finding MM problems and either not fixing them or
> kludging around them and not telling the upstream maintainers about *any*
> of it?

I'm not in the business of defending vendors, but a lot of times the
base is so far downrev it's difficult to relate it to much of anything
current. It may be best not to say precisely how far downrev things can
get, since some of these things are so old even distro vendors won't
touch them.


On Fri, Mar 02, 2007 at 02:59:06PM -0800, Andrew Morton wrote:
> Somehow I don't believe that a person or organisation which is incapable of
> preparing even a simple testcase will be capable of fixing problems such as
> this without breaking things.

My gut feeling is to agree, but I get nagging doubts when I try to
think of how to boil things like [major benchmarks whose names are
trademarked/copyrighted/etc. censored] down to simple testcases. Some
other things are obvious but require vast resources, like zillions of
disks fooling throttling/etc. heuristics of ancient downrev kernels.
I guess for those sorts of things the voodoo incantations, chicken
blood, and carcasses of freshly slaughtered goats come out. Might as
well throw in a Tarot reading and some tea leaves while I'm at it.

My tack on basic stability was usually testbooting on several arches,
which various people have an active disinterest in (suggesting, for
example, that I throw out all of my sparc32 systems and replace them
with Opterons, or that anything that goes wrong on ia64 is not only
irrelevant but also that neither I nor anyone else should ever fix them;
you know who you are). It's become clear to me that this is insufficient,
and that I'll need to start using some sort of suite of regression tests,
at the very least to save myself the embarrassment of acking a patch that
oopses when exercised, but also to elevate the standard.


-- wli

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-03  1:40                                 ` William Lee Irwin III
@ 2007-03-03  1:58                                   ` Andrew Morton
  2007-03-03  3:55                                     ` William Lee Irwin III
  0 siblings, 1 reply; 104+ messages in thread
From: Andrew Morton @ 2007-03-03  1:58 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Rik van Riel, Bill Irwin, Christoph Lameter, Mel Gorman, npiggin,
	mingo, jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007 17:40:04 -0800
William Lee Irwin III <wli@holomorphy.com> wrote:

> On Fri, Mar 02, 2007 at 02:59:06PM -0800, Andrew Morton wrote:
> > Somehow I don't believe that a person or organisation which is incapable of
> > preparing even a simple testcase will be capable of fixing problems such as
> > this without breaking things.
> 
> My gut feeling is to agree, but I get nagging doubts when I try to
> think of how to boil things like [major benchmarks whose names are
> trademarked/copyrighted/etc. censored] down to simple testcases. Some
> other things are obvious but require vast resources, like zillions of
> disks fooling throttling/etc. heuristics of ancient downrev kernels.

noooooooooo.  You're approaching it from the wrong direction.

Step 1 is to understand what is happening on the affected production
system.  Completely.  Once that is fully understood then it is a relatively
simple matter to concoct a test case which triggers the same failure mode.

It is very hard to go the other way: to poke around with various stress
tests which you think are doing something similar to what you think the
application does in the hope that similar symptoms will trigger so you can
then work out what the kernel is doing.  yuk.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-03  0:33                             ` The performance and behaviour of the anti-fragmentation related patches William Lee Irwin III
  2007-03-03  0:54                               ` Andrew Morton
@ 2007-03-03  3:15                               ` Christoph Lameter
  2007-03-03  4:19                                 ` William Lee Irwin III
  2007-03-03 17:16                                 ` Martin J. Bligh
  1 sibling, 2 replies; 104+ messages in thread
From: Christoph Lameter @ 2007-03-03  3:15 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Andrew Morton, Rik van Riel, Bill Irwin, Mel Gorman, npiggin,
	mingo, jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, William Lee Irwin III wrote:

> On Fri, Mar 02, 2007 at 02:22:56PM -0800, Andrew Morton wrote:
> > Opterons seem to be particularly prone to lock starvation where a cacheline
> > gets captured in a single package for ever.
> 
> AIUI that phenomenon is universal to NUMA. Maybe it's time we
> reexamined our locking algorithms in the light of fairness
> considerations.

This is a phenomenon that is usually addressed at the cache logic level. 
It's a hardware maturation issue. A certain package should not be allowed
to hold onto a cacheline forever, and other packages must have a minimum 
time during which they can operate on that cacheline.




^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-03  1:58                                   ` Andrew Morton
@ 2007-03-03  3:55                                     ` William Lee Irwin III
  0 siblings, 0 replies; 104+ messages in thread
From: William Lee Irwin III @ 2007-03-03  3:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Bill Irwin, Christoph Lameter, Mel Gorman, npiggin,
	mingo, jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007 17:40:04 -0800 William Lee Irwin III <wli@holomorphy.com> wrote:
>> My gut feeling is to agree, but I get nagging doubts when I try to
>> think of how to boil things like [major benchmarks whose names are
>> trademarked/copyrighted/etc. censored] down to simple testcases. Some
>> other things are obvious but require vast resources, like zillions of
>> disks fooling throttling/etc. heuristics of ancient downrev kernels.

On Fri, Mar 02, 2007 at 05:58:56PM -0800, Andrew Morton wrote:
> noooooooooo.  You're approaching it from the wrong direction.
> Step 1 is to understand what is happening on the affected production
> system.  Completely.  Once that is fully understood then it is a relatively
> simple matter to concoct a test case which triggers the same failure mode.
> It is very hard to go the other way: to poke around with various stress
> tests which you think are doing something similar to what you think the
> application does in the hope that similar symptoms will trigger so you can
> then work out what the kernel is doing.  yuk.

Yeah, it's really great when it's possible to get debug info out of
people e.g. they're willing to boot into a kernel instrumented with
the appropriate printk's/etc. Most of the time it's all guesswork.
People who post to lkml are much better about all this on average.

I never truly understood the point of kprobes/jprobes/dprobes (or
whatever the probing letter is), crash dumps, and so on until I ran
into this, not that I personally use them (though I may yet start).
Most of the time I just read the code instead and smoke out what
could be going on by something like the process of devising
counterexamples. For instance, I told that colouroff patch guy about
the possibility of getting the wrong page for the start of the buffer
from virt_to_page() on a cache colored buffer pointer (clearly
cache->gfporder >= 4 in such a case). Deriving the head page without
__GFP_COMP might be considered to be ugly-looking, though.


-- wli

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-03  3:15                               ` Christoph Lameter
@ 2007-03-03  4:19                                 ` William Lee Irwin III
  2007-03-03 17:16                                 ` Martin J. Bligh
  1 sibling, 0 replies; 104+ messages in thread
From: William Lee Irwin III @ 2007-03-03  4:19 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Rik van Riel, Bill Irwin, Mel Gorman, npiggin,
	mingo, jschopp, arjan, torvalds, mbligh, linux-mm, linux-kernel

On Fri, 2 Mar 2007, William Lee Irwin III wrote:
>> AIUI that phenomenon is universal to NUMA. Maybe it's time we
>> reexamined our locking algorithms in the light of fairness
>> considerations.

On Fri, Mar 02, 2007 at 07:15:38PM -0800, Christoph Lameter wrote:
> This is a phenomenon that is usually addressed at the cache logic level. 
> It's a hardware maturation issue. A certain package should not be allowed
> to hold onto a cacheline forever, and other packages must have a minimum 
> time during which they can operate on that cacheline.

I think when I last asked about that I was told "cache directories are
too expensive" or something on that order, if I'm not botching this,
too. In any event, the above shows a gross inaccuracy in my statement.


-- wli

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 16:32   ` Mel Gorman
  2007-03-02 17:19     ` Christoph Lameter
@ 2007-03-03  4:54     ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 104+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-03-03  4:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: akpm, npiggin, clameter, mingo, jschopp, arjan, torvalds, mbligh,
	linux-kernel

On Fri, 2 Mar 2007 16:32:07 +0000
mel@skynet.ie (Mel Gorman) wrote:

> The zone-based patches for memory partitioning should be providing what is
> required for memory hot-remove of an entire DIMM or bank of memory (PPC64
> also cares about removing smaller blocks of memory but zones are overkill
> there and anti-fragmentation on its own is good enough).  Pages hot-added
> to ZONE_MOVABLE will always be reclaimable or migratable in the case of
> mlock(). Kamezawa Hiroyuki has indicated that his hot-remove patches also
> do something like ZONE_MOVABLE. I would hope that his patches could be
> easily based on top of my memory partitioning set of patches. The markup
> of pages has been tested and the zone definitely works. I've added the
> kamezawa.hiroyu@jp.fujitsu.com to the cc list so he can comment :)

Thanks. As you wrote, I'm planning to write a patch based on ZONE_MOVABLE.
I'm using my own version for now just because I can't handle too many patches.
My version has a few just-for-cleanup patches, and it may be a bit different
from yours. I'll cc you when I post.
It's not ready to send yet... (found some panics at offlining memory ;)

> What I do not do in my patchset is hot-add to ZONE_MOVABLE because I couldn't
> be certain it's what the hotplug people wanted. They will of course need to
> hot-add to that zone if they want to be able to remove it later.
> 
I'm planning to make hot-added memory ZONE_MOVABLE. I believe we can find and
move movable pages from ZONE_NOT_MOVABLE to ZONE_MOVABLE (or kswapd/pageout
will do this in a slow way). The reason is that my first purpose is removing
hot-added memory.

I think I can add a knob to choose a zone for hot-added memory, but I won't
do it until someone wants it.

> For node-based memory hot-add and hot-remove, the node would consist of just
> one populated zone - ZONE_MOVABLE.
> 
yes.

Note:
We are considering allocating well-known structures like mem_map and pgdat
from hot-added memory, so we can remove them when a node is unplugged.
But there are no plans/patches yet.


> For the removal of DIMMs, anti-fragmentation has something additional
> to offer. The later patches in the anti-fragmentation patchset bias the
> placement of unmovable pages towards the lower PFNs. It's not very strict
> about this because being strict would cost. A mechanism could be put in place
> that enforced the placement of unmovables pages at low PFNS. Due to the cost,
> it would need to be disabled by default and enabled on request. On the plus
> side, the cost would only be incurred when splitting a MAX_ORDER block of
> pages which is a rare event.

And a kernel on a VM (IBM's uses 16MB sparsemem sections for memory hotplug)
can get enough benefit. While Xen's balloon driver can do memory unplug, I
heard it causes memory fragmentation. The anti-frag patches will help with
this.

-Kame


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-03  3:15                               ` Christoph Lameter
  2007-03-03  4:19                                 ` William Lee Irwin III
@ 2007-03-03 17:16                                 ` Martin J. Bligh
  2007-03-03 17:50                                   ` Christoph Lameter
  1 sibling, 1 reply; 104+ messages in thread
From: Martin J. Bligh @ 2007-03-03 17:16 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: William Lee Irwin III, Andrew Morton, Rik van Riel, Bill Irwin,
	Mel Gorman, npiggin, mingo, jschopp, arjan, torvalds, linux-mm,
	linux-kernel

Christoph Lameter wrote:
> On Fri, 2 Mar 2007, William Lee Irwin III wrote:
> 
>> On Fri, Mar 02, 2007 at 02:22:56PM -0800, Andrew Morton wrote:
>>> Opterons seem to be particularly prone to lock starvation where a cacheline
>>> gets captured in a single package for ever.
>> AIUI that phenomenon is universal to NUMA. Maybe it's time we
>> reexamined our locking algorithms in the light of fairness
>> considerations.
> 
> This is a phenomenon that is usually addressed at the cache logic level. 
> It's a hardware maturation issue. A certain package should not be allowed
> to hold onto a cacheline forever, and other packages must have a minimum 
> time during which they can operate on that cacheline.

That'd be nice. Unfortunately we're stuck in the real world with
real hardware, and the situation is likely to remain thus for
quite some time ...

M.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-03 17:16                                 ` Martin J. Bligh
@ 2007-03-03 17:50                                   ` Christoph Lameter
  0 siblings, 0 replies; 104+ messages in thread
From: Christoph Lameter @ 2007-03-03 17:50 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: William Lee Irwin III, Andrew Morton, Rik van Riel, Bill Irwin,
	Mel Gorman, npiggin, mingo, jschopp, arjan, torvalds, linux-mm,
	linux-kernel

On Sat, 3 Mar 2007, Martin J. Bligh wrote:

> That'd be nice. Unfortunately we're stuck in the real world with
> real hardware, and the situation is likely to remain thus for
> quite some time ...

Our real hardware does behave as described and therefore does not suffer 
from the problem.

If you want a software solution then you may want to look at Zoran 
Radovic's work on Hierarchical Backoff locks. I had a draft of a patch a 
couple of years back that showed some promise to reduce lock contention. 
HBO locks can solve starvation issues by stopping local lock takers.

See Zoran Radovic "Software Techniques for Distributed Shared Memory", 
Uppsala Universitet, 2005 ISBN 91-554-6385-1.

http://www.gelato.org/pdf/may2005/gelato_may2005_numa_lameter_sgi.pdf

http://www.gelato.unsw.edu.au/archives/linux-ia64/0506/14368.html

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  8:12                                 ` Nick Piggin
  2007-03-02  8:21                                   ` Christoph Lameter
@ 2007-03-04  1:26                                   ` Rik van Riel
  2007-03-04  1:51                                     ` Andrew Morton
  1 sibling, 1 reply; 104+ messages in thread
From: Rik van Riel @ 2007-03-04  1:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Lameter, Andrew Morton, Mel Gorman, mingo, jschopp,
	arjan, torvalds, mbligh, linux-mm, linux-kernel

Nick Piggin wrote:

> Different issue, isn't it? Rik wants to be smarter in figuring out which
> pages to throw away. More work per page == worse for you.

Being smarter about figuring out which pages to evict does
not equate to spending more work.  One big component is
sorting the pages beforehand, so we do not end up scanning
through (and randomizing the LRU order of) anonymous pages
when we do not want to, or cannot, evict them anyway.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-04  1:26                                   ` Rik van Riel
@ 2007-03-04  1:51                                     ` Andrew Morton
  2007-03-04  1:58                                       ` Rik van Riel
  0 siblings, 1 reply; 104+ messages in thread
From: Andrew Morton @ 2007-03-04  1:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Nick Piggin, Christoph Lameter, Mel Gorman, mingo, jschopp,
	arjan, torvalds, mbligh, linux-mm, linux-kernel

On Sat, 03 Mar 2007 20:26:15 -0500 Rik van Riel <riel@redhat.com> wrote:

> Nick Piggin wrote:
> 
> > Different issue, isn't it? Rik wants to be smarter in figuring out which
> > pages to throw away. More work per page == worse for you.
> 
> Being smarter about figuring out which pages to evict does
> not equate to spending more work.  One big component is
> sorting the pages beforehand, so we do not end up scanning
> through (and randomizing the LRU order of) anonymous pages
> when we do not want to, or cannot, evict them anyway.
> 

My gut feel is that we could afford to expend a lot more cycles-per-page
doing stuff to avoid IO than we presently do.

At least, reclaim normally just doesn't figure in system CPU time, except
for when it's gone completely stupid.

It could well be that we sleep too much in there though.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-04  1:51                                     ` Andrew Morton
@ 2007-03-04  1:58                                       ` Rik van Riel
  0 siblings, 0 replies; 104+ messages in thread
From: Rik van Riel @ 2007-03-04  1:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Christoph Lameter, Mel Gorman, mingo, jschopp,
	arjan, torvalds, mbligh, linux-mm, linux-kernel

Andrew Morton wrote:
> On Sat, 03 Mar 2007 20:26:15 -0500 Rik van Riel <riel@redhat.com> wrote:
>> Nick Piggin wrote:
>>
>>> Different issue, isn't it? Rik wants to be smarter in figuring out which
>>> pages to throw away. More work per page == worse for you.
>> Being smarter about figuring out which pages to evict does
>> not equate to spending more work.  One big component is
>> sorting the pages beforehand, so we do not end up scanning
>> through (and randomizing the LRU order of) anonymous pages
>> when we do not want to, or cannot, evict them anyway.
>>
> 
> My gut feel is that we could afford to expend a lot more cycles-per-page
> doing stuff to avoid IO than we presently do.

In general, yes.

In the specific "128GB RAM, 90GB anon/shm/... and 2GB swap" case, no :)

> At least, reclaim normally just doesn't figure in system CPU time, except
> for when it's gone completely stupid.
> 
> It could well be that we sleep too much in there though.

It's all about minimizing IO, I suspect.

Not just the total amount of IO though, also the amount of
pageout IO that's in flight at once, so we do not introduce
stupidly high latencies.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02 17:05     ` Joel Schopp
@ 2007-03-05  3:21       ` Nick Piggin
  2007-03-05 15:20         ` Joel Schopp
  0 siblings, 1 reply; 104+ messages in thread
From: Nick Piggin @ 2007-03-05  3:21 UTC (permalink / raw)
  To: Joel Schopp
  Cc: Linus Torvalds, Andrew Morton, Mel Gorman, clameter, mingo,
	arjan, mbligh, linux-mm, linux-kernel

On Fri, Mar 02, 2007 at 11:05:15AM -0600, Joel Schopp wrote:
> Linus Torvalds wrote:
> >
> >On Thu, 1 Mar 2007, Andrew Morton wrote:
> >>So some urgent questions are: how are we going to do mem hotunplug and
> >>per-container RSS?
> 
> The people who were trying to do memory hot-unplug basically all stopped, 
> waiting for these patches, or something similar, to solve the fragmentation 
> problem.  Our last working set of patches built on top of an earlier 
> version of Mel's list-based solution.
> 
> >
> >Also: how are we going to do this in virtualized environments? Usually the 
> >people who care about memory hotunplug are exactly the same people who 
> >also care (or claim to care, or _will_ care) about virtualization.
> 
> Yes, we are.  And we are very much in favor of these patches.  At last 
> year's OLS, developers from IBM, HP, and Xen coauthored a paper titled 
> "Resizing Memory with Balloons and Hotplug".  
> http://www.linuxsymposium.org/2006/linuxsymposium_procv2.pdf  Our 
> conclusion was that ballooning is simply not good enough and we need memory 
> hot-unplug.  Here is a quote from the article I find relevant to today's 
> discussion:

But if you don't require a lot of higher order allocations anyway, then
guest fragmentation caused by ballooning doesn't seem like much problem.

If you need higher order allocations, then ballooning is bad because of
fragmentation, so you need memory unplug, so you need higher order
allocations. Goto 1.

Ballooning probably does skew memory management stats and watermarks, but
that's just because it is implemented as a module. A couple of hooks
should be enough to allow things to be adjusted?



* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-05  3:21       ` Nick Piggin
@ 2007-03-05 15:20         ` Joel Schopp
  2007-03-05 16:01           ` Nick Piggin
  2007-05-03  8:49           ` Andy Whitcroft
  0 siblings, 2 replies; 104+ messages in thread
From: Joel Schopp @ 2007-03-05 15:20 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Andrew Morton, Mel Gorman, clameter, mingo,
	arjan, mbligh, linux-mm, linux-kernel

> But if you don't require a lot of higher order allocations anyway, then
> guest fragmentation caused by ballooning doesn't seem like much problem.

If you only need allocations of one page or smaller, then no, it's not a 
problem.  As soon as you go above that, it will be.  You don't need to go all 
the way up to MAX_ORDER size to see an impact; it just gets increasingly 
severe as you move from 1 page toward MAX_ORDER.
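
To put rough numbers on that range (a sketch assuming 4 KiB pages and a
MAX_ORDER of 11; both values vary by architecture and config):

```python
PAGE_SIZE = 4096   # assumed 4 KiB pages; architecture-dependent
MAX_ORDER = 11     # assumed; the largest buddy block is order MAX_ORDER - 1

def order_bytes(order):
    # A buddy block of a given order covers 2^order contiguous pages.
    return (1 << order) * PAGE_SIZE

# Print the size of each block from a single page up to the largest order.
for order in range(MAX_ORDER):
    print("order %2d: %4d pages, %6d KiB"
          % (order, 1 << order, order_bytes(order) // 1024))
```

Under those assumptions the blocks run from 4 KiB at order 0 up to 4 MiB at
order 10, which is the span over which the severity grows.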

> 
> If you need higher order allocations, then ballooning is bad because of
> fragmentation, so you need memory unplug, so you need higher order
> allocations. Goto 1.

Yes, it's a closed loop.  But hotplug isn't the only one that needs higher order 
allocations.  In fact it's pretty far down the list.  I look at it like this: a lot 
of users need high order allocations for better performance and things like on-demand 
hugepages.  As a bonus you get memory hot-remove.

> Ballooning probably does skew memory management stats and watermarks, but
> that's just because it is implemented as a module. A couple of hooks
> should be enough to allow things to be adjusted?

That is a good idea independent of the current discussion.



* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-05 15:20         ` Joel Schopp
@ 2007-03-05 16:01           ` Nick Piggin
  2007-03-05 16:45             ` Joel Schopp
  2007-05-03  8:49           ` Andy Whitcroft
  1 sibling, 1 reply; 104+ messages in thread
From: Nick Piggin @ 2007-03-05 16:01 UTC (permalink / raw)
  To: Joel Schopp
  Cc: Linus Torvalds, Andrew Morton, Mel Gorman, clameter, mingo,
	arjan, mbligh, linux-mm, linux-kernel

On Mon, Mar 05, 2007 at 09:20:10AM -0600, Joel Schopp wrote:
> >But if you don't require a lot of higher order allocations anyway, then
> >guest fragmentation caused by ballooning doesn't seem like much problem.
> 
> If you only need allocations of one page or smaller, then no, it's not a 
> problem.  As soon as you go above that, it will be.  You don't need to go 
> all the way up to MAX_ORDER size to see an impact; it just gets 
> increasingly severe as you move from 1 page toward MAX_ORDER.

We allocate order 1 and 2 pages for stuff without too much problem.

> >If you need higher order allocations, then ballooning is bad because of
> >fragmentation, so you need memory unplug, so you need higher order
> >allocations. Goto 1.
> 
> Yes, it's a closed loop.  But hotplug isn't the only one that needs higher 
> order allocations.  In fact it's pretty far down the list.  I look at it 
> like this: a lot of users need high order allocations for better 
> performance and things like on-demand hugepages.  As a bonus you get memory 
> hot-remove.

on-demand hugepages could be done better anyway by having the hypervisor
defrag physical memory and provide some way for the guest to ask for a
hugepage, no?

> >Ballooning probably does skew memory management stats and watermarks, but
> >that's just because it is implemented as a module. A couple of hooks
> >should be enough to allow things to be adjusted?
> 
> That is a good idea independent of the current discussion.

Well it shouldn't be too difficult. If you cc linux-mm and/or me with
any thoughts or requirements then I could try to help with it.

Thanks,
Nick


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-05 16:01           ` Nick Piggin
@ 2007-03-05 16:45             ` Joel Schopp
  0 siblings, 0 replies; 104+ messages in thread
From: Joel Schopp @ 2007-03-05 16:45 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Andrew Morton, Mel Gorman, clameter, mingo,
	arjan, mbligh, linux-mm, linux-kernel

>> If you only need allocations of one page or smaller, then no, it's not a 
>> problem.  As soon as you go above that, it will be.  You don't need to go 
>> all the way up to MAX_ORDER size to see an impact; it just gets 
>> increasingly severe as you move from 1 page toward MAX_ORDER.
> 
> We allocate order 1 and 2 pages for stuff without too much problem.

What I want to know is: where do you draw the line as to what is acceptable to 
allocate as a single contiguous block?

1 page?  8 pages?  256 pages?  4K pages?  Obviously 1 page works fine.  With a 4K 
page size and a 16MB MAX_ORDER block, an allocation of 4K pages (4096 contiguous 
pages) is theoretically supported, but it doesn't work under almost any 
circumstances (unless you use Mel's patches).
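
As a quick sanity check of the arithmetic in that example:

```python
# The example's figures: 4K page size, 16MB maximum buddy block.
PAGE_SIZE = 4 * 1024
MAX_BLOCK = 16 * 1024 * 1024

pages_per_block = MAX_BLOCK // PAGE_SIZE   # the "4K pages" figure: 4096 pages
order = pages_per_block.bit_length() - 1   # the buddy order of that block: 12

print(pages_per_block, order)
```

So the 16MB block in the example is a single order-12 allocation of 4096
contiguous pages.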

> on-demand hugepages could be done better anyway by having the hypervisor
> defrag physical memory and provide some way for the guest to ask for a
> hugepage, no?

Unless you break the 1:1 virt-phys mapping, it doesn't matter whether the hypervisor 
can defrag this for you, as the kernel will have the physical address cached away 
somewhere and will expect the data not to move.

I'm a big fan of making this somebody else's problem and the hypervisor would be a 
good place.  I just can't figure out how to actually do it at that layer without 
changing Linux in a way that is unacceptable to the community at large.




* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-02  3:44       ` Linus Torvalds
                           ` (2 preceding siblings ...)
  2007-03-02  5:13         ` Jeremy Fitzhardinge
@ 2007-03-06  4:16         ` Paul Mackerras
  3 siblings, 0 replies; 104+ messages in thread
From: Paul Mackerras @ 2007-03-06  4:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Balbir Singh, Andrew Morton, Mel Gorman, npiggin, clameter,
	mingo, jschopp, arjan, mbligh, linux-mm, linux-kernel

Linus Torvalds writes:

> The point being that in the guests, hotunplug is almost useless (for 
> bigger ranges), and we're much better off just telling the virtualization 
> hosts on a per-page level whether we care about a page or not, than to 
> worry about fragmentation.

We don't have that luxury on IBM System p machines, where the
hypervisor manages memory in much larger units than a page.  Typically
the size of memory block that the hypervisor uses to manage memory is
16MB or more -- which makes sense from the point of view that if the
hypervisor had to manage individual pages, it would end up adding a
lot more overhead than it does.

Paul.


* Re: The performance and behaviour of the anti-fragmentation related patches
  2007-03-05 15:20         ` Joel Schopp
  2007-03-05 16:01           ` Nick Piggin
@ 2007-05-03  8:49           ` Andy Whitcroft
  1 sibling, 0 replies; 104+ messages in thread
From: Andy Whitcroft @ 2007-05-03  8:49 UTC (permalink / raw)
  To: Joel Schopp
  Cc: Nick Piggin, Linus Torvalds, Andrew Morton, Mel Gorman, clameter,
	mingo, arjan, mbligh, linux-mm, linux-kernel

Joel Schopp wrote:
>> But if you don't require a lot of higher order allocations anyway, then
>> guest fragmentation caused by ballooning doesn't seem like much problem.
> 
> If you only need to allocate 1 page size and smaller allocations then no
> it's not a problem.  As soon as you go above that it will be.  You don't
> need to go all the way up to MAX_ORDER size to see an impact, it's just
> increasingly more severe as you get away from 1 page and towards MAX_ORDER.

Yep, the allocator treats anything below order-4 as "easy to
obtain", in that it is willing to wait indefinitely for one to appear;
above that, blocks are not expected to appear.  With random placement the
chance of finding a free block tends to 0 pretty quickly as order
increases.  That was the motivation for the linear reclaim/lumpy reclaim
patch series, which make it significantly more possible to get higher
orders.  However, very high orders such as we see with huge pages are
still almost impossible to obtain without placement controls in place.
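
That collapse toward 0 can be sketched with a crude independence model (each
page free with some fixed probability, nothing like the real allocator's
behaviour, but it shows the shape of the curve):

```python
def expected_free_blocks(total_pages, order, free_fraction):
    """Expected number of naturally aligned, fully free order-N blocks
    when each page is free independently with the given probability --
    a crude random-placement model, not the real allocator."""
    block_pages = 1 << order
    candidates = total_pages // block_pages        # aligned slots to check
    return candidates * free_fraction ** block_pages

# 1 GiB of 4 KiB pages (262144 pages) with half of memory free:
for order in (0, 2, 4, 6, 10):
    print(order, expected_free_blocks(262144, order, 0.5))
```

Even with half of memory free, the expected count of free aligned blocks
drops below one by around order 4 in this model, and is effectively zero at
huge-page orders, which matches the order-4 threshold mentioned above.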

-apw


end of thread, other threads:[~2007-05-03  8:50 UTC | newest]

Thread overview: 104+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-03-01 10:12 The performance and behaviour of the anti-fragmentation related patches Mel Gorman
2007-03-02  1:52 ` Bill Irwin
2007-03-02 10:38   ` Mel Gorman
2007-03-02 16:31     ` Joel Schopp
2007-03-02 21:37       ` Bill Irwin
     [not found] ` <20070301160915.6da876c5.akpm@linux-foundation.org>
2007-03-02  1:39   ` Balbir Singh
2007-03-02  2:34   ` KAMEZAWA Hiroyuki
2007-03-02  3:05   ` Christoph Lameter
2007-03-02  3:57     ` Nick Piggin
2007-03-02  4:06       ` Christoph Lameter
2007-03-02  4:21         ` Nick Piggin
2007-03-02  4:31           ` Christoph Lameter
2007-03-02  5:06             ` Nick Piggin
2007-03-02  5:40               ` Christoph Lameter
2007-03-02  5:49                 ` Nick Piggin
2007-03-02  5:53                   ` Christoph Lameter
2007-03-02  6:08                     ` Nick Piggin
2007-03-02  6:19                       ` Christoph Lameter
2007-03-02  6:29                         ` Nick Piggin
2007-03-02  6:51                           ` Christoph Lameter
2007-03-02  7:03                             ` Andrew Morton
2007-03-02  7:19                             ` Nick Piggin
2007-03-02  7:44                               ` Christoph Lameter
2007-03-02  8:12                                 ` Nick Piggin
2007-03-02  8:21                                   ` Christoph Lameter
2007-03-02  8:38                                     ` Nick Piggin
2007-03-02 17:09                                       ` Christoph Lameter
2007-03-04  1:26                                   ` Rik van Riel
2007-03-04  1:51                                     ` Andrew Morton
2007-03-04  1:58                                       ` Rik van Riel
2007-03-02  5:50               ` Christoph Lameter
2007-03-02  4:29         ` Andrew Morton
2007-03-02  4:33           ` Christoph Lameter
2007-03-02  4:58             ` Andrew Morton
2007-03-02  4:20       ` Paul Mundt
2007-03-02 13:50   ` Arjan van de Ven
2007-03-02 15:29   ` Rik van Riel
2007-03-02 16:58     ` Andrew Morton
2007-03-02 17:09       ` Mel Gorman
2007-03-02 17:23       ` Christoph Lameter
2007-03-02 17:35         ` Andrew Morton
2007-03-02 17:43           ` Rik van Riel
2007-03-02 18:06             ` Andrew Morton
2007-03-02 18:15               ` Christoph Lameter
2007-03-02 18:23                 ` Andrew Morton
2007-03-02 18:23                 ` Rik van Riel
2007-03-02 19:31                   ` Christoph Lameter
2007-03-02 19:40                     ` Rik van Riel
2007-03-02 21:12                   ` Bill Irwin
2007-03-02 21:19                     ` Rik van Riel
2007-03-02 21:52                       ` Andrew Morton
2007-03-02 22:03                         ` Rik van Riel
2007-03-02 22:22                           ` Andrew Morton
2007-03-02 22:34                             ` Rik van Riel
2007-03-02 22:51                               ` Martin Bligh
2007-03-02 22:54                                 ` Rik van Riel
2007-03-02 23:28                                   ` Martin J. Bligh
2007-03-03  0:24                                     ` Andrew Morton
2007-03-02 22:52                               ` Chuck Ebbert
2007-03-02 22:59                               ` Andrew Morton
2007-03-02 23:20                                 ` Rik van Riel
2007-03-03  1:40                                 ` William Lee Irwin III
2007-03-03  1:58                                   ` Andrew Morton
2007-03-03  3:55                                     ` William Lee Irwin III
2007-03-02 23:16                             ` [PATCH] : Optimizes timespec_trunc() Eric Dumazet
2007-03-03  0:33                             ` The performance and behaviour of the anti-fragmentation related patches William Lee Irwin III
2007-03-03  0:54                               ` Andrew Morton
2007-03-03  3:15                               ` Christoph Lameter
2007-03-03  4:19                                 ` William Lee Irwin III
2007-03-03 17:16                                 ` Martin J. Bligh
2007-03-03 17:50                                   ` Christoph Lameter
2007-03-02 20:59               ` Bill Irwin
2007-03-02 16:32   ` Mel Gorman
2007-03-02 17:19     ` Christoph Lameter
2007-03-02 17:28       ` Mel Gorman
2007-03-02 17:48         ` Christoph Lameter
2007-03-02 17:59           ` Mel Gorman
2007-03-03  4:54     ` KAMEZAWA Hiroyuki
     [not found]   ` <Pine.LNX.4.64.0703011642190.12485@woody.linux-foundation.org>
2007-03-02  1:52     ` Balbir Singh
2007-03-02  3:44       ` Linus Torvalds
2007-03-02  3:59         ` Andrew Morton
2007-03-02  5:11           ` Linus Torvalds
2007-03-02  5:50             ` KAMEZAWA Hiroyuki
2007-03-02  6:15               ` Paul Mundt
2007-03-02 17:01                 ` Mel Gorman
2007-03-02 16:20             ` Mark Gross
2007-03-02 17:07               ` Andrew Morton
2007-03-02 17:35                 ` Mark Gross
2007-03-02 18:02                   ` Andrew Morton
2007-03-02 19:02                     ` Mark Gross
2007-03-02 17:16               ` Linus Torvalds
2007-03-02 18:45                 ` Mark Gross
2007-03-02 19:03                   ` Linus Torvalds
2007-03-02 23:58                 ` Martin J. Bligh
2007-03-02  4:18         ` Balbir Singh
2007-03-02  5:13         ` Jeremy Fitzhardinge
2007-03-06  4:16         ` Paul Mackerras
2007-03-02 16:58     ` Mel Gorman
2007-03-02 17:05     ` Joel Schopp
2007-03-05  3:21       ` Nick Piggin
2007-03-05 15:20         ` Joel Schopp
2007-03-05 16:01           ` Nick Piggin
2007-03-05 16:45             ` Joel Schopp
2007-05-03  8:49           ` Andy Whitcroft
