LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [patch 00/19] VM pageout scalability improvements
@ 2008-01-08 20:59 Rik van Riel
  2008-01-08 20:59 ` [patch 01/19] move isolate_lru_page() to vmscan.c Rik van Riel
                   ` (22 more replies)
  0 siblings, 23 replies; 88+ messages in thread
From: Rik van Riel @ 2008-01-08 20:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm

On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory. Not
only does it use up CPU time, but it also provokes lock contention
and can leave large systems under memory presure in a catatonic state.

Against 2.6.24-rc6-mm1

This patch series improves VM scalability by:

1) making the locking a little more scalable

2) putting filesystem backed, swap backed and non-reclaimable pages
   onto their own LRUs, so the system only scans the pages that it
   can/should evict from memory

3) switching to SEQ replacement for the anonymous LRUs, so the
   number of pages that need to be scanned when the system
   starts swapping is bound to a reasonable number

More info on the overall design can be found at:

	http://linux-mm.org/PageReplacementDesign


Changelog:
- merge memcontroller split LRU code into the main split LRU patch,
  since it is not functionally different (it was split up only to help
  people who had seen the last version of the patch series review it)
- drop the page_file_cache debugging patch, since it never triggered
- reintroduce code to not scan anon list if swap is full
- add code to scan anon list if page cache is very small already
- use lumpy reclaim more aggressively for smaller order > 1 allocations

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [patch 01/19] move isolate_lru_page() to vmscan.c
  2008-01-08 20:59 [patch 00/19] VM pageout scalability improvements Rik van Riel
@ 2008-01-08 20:59 ` Rik van Riel
  2008-01-08 22:03   ` Christoph Lameter
  2008-01-08 20:59 ` [patch 02/19] free swap space on swap-in/activation Rik van Riel
                   ` (21 subsequent siblings)
  22 siblings, 1 reply; 88+ messages in thread
From: Rik van Riel @ 2008-01-08 20:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, Nick Piggin, Lee Schermerhorn

[-- Attachment #1: np-01-move-and-rework-isolate_lru_page-v2.patch --]
[-- Type: text/plain, Size: 7134 bytes --]

V1 -> V2 [lts]:
+  fix botched merge -- add back "get_page_unless_zero()"

  From: Nick Piggin <npiggin@suse.de>
  To: Linux Memory Management <linux-mm@kvack.org>
  Subject: [patch 1/4] mm: move and rework isolate_lru_page
  Date:	Mon, 12 Mar 2007 07:38:44 +0100 (CET)

isolate_lru_page logically belongs to be in vmscan.c than migrate.c.

It is tough, because we don't need that function without memory migration
so there is a valid argument to have it in migrate.c. However a subsequent
patch needs to make use of it in the core mm, so we can happily move it
to vmscan.c.

Also, make the function a little more generic by not requiring that it
adds an isolated page to a given list. Callers can do that.

	Note that we now have '__isolate_lru_page()', that does
	something quite different, visible outside of vmscan.c
	for use with memory controller.  Methinks we need to
	rationalize these names/purposes.	--lts

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

Index: linux-2.6.24-rc6-mm1/include/linux/migrate.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/migrate.h	2008-01-02 12:37:12.000000000 -0500
+++ linux-2.6.24-rc6-mm1/include/linux/migrate.h	2008-01-02 12:37:14.000000000 -0500
@@ -25,7 +25,6 @@ static inline int vma_migratable(struct 
 	return 1;
 }
 
-extern int isolate_lru_page(struct page *p, struct list_head *pagelist);
 extern int putback_lru_pages(struct list_head *l);
 extern int migrate_page(struct address_space *,
 			struct page *, struct page *);
@@ -42,8 +41,6 @@ extern int migrate_vmas(struct mm_struct
 static inline int vma_migratable(struct vm_area_struct *vma)
 					{ return 0; }
 
-static inline int isolate_lru_page(struct page *p, struct list_head *list)
-					{ return -ENOSYS; }
 static inline int putback_lru_pages(struct list_head *l) { return 0; }
 static inline int migrate_pages(struct list_head *l, new_page_t x,
 		unsigned long private) { return -ENOSYS; }
Index: linux-2.6.24-rc6-mm1/mm/internal.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/internal.h	2008-01-02 12:37:12.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/internal.h	2008-01-02 12:37:14.000000000 -0500
@@ -34,6 +34,8 @@ static inline void __put_page(struct pag
 	atomic_dec(&page->_count);
 }
 
+extern int isolate_lru_page(struct page *page);
+
 extern void __init __free_pages_bootmem(struct page *page,
 						unsigned int order);
 
Index: linux-2.6.24-rc6-mm1/mm/migrate.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/migrate.c	2008-01-02 12:37:12.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/migrate.c	2008-01-02 12:37:14.000000000 -0500
@@ -36,36 +36,6 @@
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
 
 /*
- * Isolate one page from the LRU lists. If successful put it onto
- * the indicated list with elevated page count.
- *
- * Result:
- *  -EBUSY: page not on LRU list
- *  0: page removed from LRU list and added to the specified list.
- */
-int isolate_lru_page(struct page *page, struct list_head *pagelist)
-{
-	int ret = -EBUSY;
-
-	if (PageLRU(page)) {
-		struct zone *zone = page_zone(page);
-
-		spin_lock_irq(&zone->lru_lock);
-		if (PageLRU(page) && get_page_unless_zero(page)) {
-			ret = 0;
-			ClearPageLRU(page);
-			if (PageActive(page))
-				del_page_from_active_list(zone, page);
-			else
-				del_page_from_inactive_list(zone, page);
-			list_add_tail(&page->lru, pagelist);
-		}
-		spin_unlock_irq(&zone->lru_lock);
-	}
-	return ret;
-}
-
-/*
  * migrate_prep() needs to be called before we start compiling a list of pages
  * to be migrated using isolate_lru_page().
  */
@@ -853,14 +823,17 @@ static int do_move_pages(struct mm_struc
 				!migrate_all)
 			goto put_and_set;
 
-		err = isolate_lru_page(page, &pagelist);
+		err = isolate_lru_page(page);
+		if (err) {
 put_and_set:
-		/*
-		 * Either remove the duplicate refcount from
-		 * isolate_lru_page() or drop the page ref if it was
-		 * not isolated.
-		 */
-		put_page(page);
+			/*
+			 * Either remove the duplicate refcount from
+			 * isolate_lru_page() or drop the page ref if it was
+			 * not isolated.
+			 */
+			put_page(page);
+		} else
+			list_add_tail(&page->lru, &pagelist);
 set_status:
 		pp->status = err;
 	}
Index: linux-2.6.24-rc6-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/vmscan.c	2008-01-02 12:37:12.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/vmscan.c	2008-01-02 12:37:14.000000000 -0500
@@ -829,6 +829,47 @@ static unsigned long clear_active_flags(
 	return nr_active;
 }
 
+/**
+ * isolate_lru_page(@page)
+ *
+ * Isolate one @page from the LRU lists. Must be called with an elevated
+ * refcount on the page, which is a fundamentnal difference from
+ * isolate_lru_pages (which is called without a stable reference).
+ *
+ * The returned page will have PageLru() cleared, and PageActive set,
+ * if it was found on the active list. This flag generally will need to be
+ * cleared by the caller before letting the page go.
+ *
+ * The vmstat page counts corresponding to the list on which the page was
+ * found will be decremented.
+ *
+ * lru_lock must not be held, interrupts must be enabled.
+ *
+ * Returns:
+ *  -EBUSY: page not on LRU list
+ *  0: page removed from LRU list.
+ */
+int isolate_lru_page(struct page *page)
+{
+	int ret = -EBUSY;
+
+	if (PageLRU(page)) {
+		struct zone *zone = page_zone(page);
+
+		spin_lock_irq(&zone->lru_lock);
+		if (PageLRU(page) && get_page_unless_zero(page)) {
+			ret = 0;
+			ClearPageLRU(page);
+			if (PageActive(page))
+				del_page_from_active_list(zone, page);
+			else
+				del_page_from_inactive_list(zone, page);
+		}
+		spin_unlock_irq(&zone->lru_lock);
+	}
+	return ret;
+}
+
 /*
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
Index: linux-2.6.24-rc6-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/mempolicy.c	2008-01-02 12:37:12.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/mempolicy.c	2008-01-02 12:37:14.000000000 -0500
@@ -93,6 +93,8 @@
 #include <asm/tlbflush.h>
 #include <asm/uaccess.h>
 
+#include "internal.h"
+
 /* Internal flags */
 #define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0)	/* Skip checks for continuous vmas */
 #define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1)		/* Invert check for nodemask */
@@ -603,8 +605,12 @@ static void migrate_page_add(struct page
 	/*
 	 * Avoid migrating a page that is shared with others.
 	 */
-	if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1)
-		isolate_lru_page(page, pagelist);
+	if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
+		if (!isolate_lru_page(page)) {
+			get_page(page);
+			list_add_tail(&page->lru, pagelist);
+		}
+	}
 }
 
 static struct page *new_node_page(struct page *page, unsigned long node, int **x)

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [patch 02/19] free swap space on swap-in/activation
  2008-01-08 20:59 [patch 00/19] VM pageout scalability improvements Rik van Riel
  2008-01-08 20:59 ` [patch 01/19] move isolate_lru_page() to vmscan.c Rik van Riel
@ 2008-01-08 20:59 ` Rik van Riel
  2008-01-08 22:10   ` Christoph Lameter
  2008-01-08 20:59 ` [patch 03/19] define page_file_cache() function Rik van Riel
                   ` (20 subsequent siblings)
  22 siblings, 1 reply; 88+ messages in thread
From: Rik van Riel @ 2008-01-08 20:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, Lee Schermerhorn

[-- Attachment #1: rvr-00-linux-2.6-swapfree.patch --]
[-- Type: text/plain, Size: 2997 bytes --]

+ lts' convert anon_vma list lock to reader/write lock patch
+ Nick Piggin's move and rework isolate_lru_page() patch

Free swap cache entries when swapping in pages if vm_swap_full()
[swap space > 1/2 used?].  Uses new pagevec to reduce pressure
on locks.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

Index: linux-2.6.24-rc6-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/vmscan.c	2008-01-02 12:37:14.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/vmscan.c	2008-01-02 12:37:18.000000000 -0500
@@ -632,6 +632,9 @@ free_it:
 		continue;
 
 activate_locked:
+		/* Not a candidate for swapping, so reclaim swap space. */
+		if (PageSwapCache(page) && vm_swap_full())
+			remove_exclusive_swap_page(page);
 		SetPageActive(page);
 		pgactivate++;
 keep_locked:
@@ -1214,6 +1217,8 @@ static void shrink_active_list(unsigned 
 			__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
 			pgmoved = 0;
 			spin_unlock_irq(&zone->lru_lock);
+			if (vm_swap_full())
+				pagevec_swap_free(&pvec);
 			__pagevec_release(&pvec);
 			spin_lock_irq(&zone->lru_lock);
 		}
@@ -1223,6 +1228,8 @@ static void shrink_active_list(unsigned 
 	__count_zone_vm_events(PGREFILL, zone, pgscanned);
 	__count_vm_events(PGDEACTIVATE, pgdeactivate);
 	spin_unlock_irq(&zone->lru_lock);
+	if (vm_swap_full())
+		pagevec_swap_free(&pvec);
 
 	pagevec_release(&pvec);
 }
Index: linux-2.6.24-rc6-mm1/mm/swap.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/swap.c	2008-01-02 12:37:12.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/swap.c	2008-01-02 12:37:18.000000000 -0500
@@ -465,6 +465,24 @@ void pagevec_strip(struct pagevec *pvec)
 	}
 }
 
+/*
+ * Try to free swap space from the pages in a pagevec
+ */
+void pagevec_swap_free(struct pagevec *pvec)
+{
+	int i;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+
+		if (PageSwapCache(page) && !TestSetPageLocked(page)) {
+			if (PageSwapCache(page))
+				remove_exclusive_swap_page(page);
+			unlock_page(page);
+		}
+	}
+}
+
 /**
  * pagevec_lookup - gang pagecache lookup
  * @pvec:	Where the resulting pages are placed
Index: linux-2.6.24-rc6-mm1/include/linux/pagevec.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/pagevec.h	2008-01-02 12:37:12.000000000 -0500
+++ linux-2.6.24-rc6-mm1/include/linux/pagevec.h	2008-01-02 12:37:18.000000000 -0500
@@ -26,6 +26,7 @@ void __pagevec_free(struct pagevec *pvec
 void __pagevec_lru_add(struct pagevec *pvec);
 void __pagevec_lru_add_active(struct pagevec *pvec);
 void pagevec_strip(struct pagevec *pvec);
+void pagevec_swap_free(struct pagevec *pvec);
 unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
 		pgoff_t start, unsigned nr_pages);
 unsigned pagevec_lookup_tag(struct pagevec *pvec,

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [patch 03/19] define page_file_cache() function
  2008-01-08 20:59 [patch 00/19] VM pageout scalability improvements Rik van Riel
  2008-01-08 20:59 ` [patch 01/19] move isolate_lru_page() to vmscan.c Rik van Riel
  2008-01-08 20:59 ` [patch 02/19] free swap space on swap-in/activation Rik van Riel
@ 2008-01-08 20:59 ` Rik van Riel
  2008-01-08 22:18   ` Christoph Lameter
  2008-01-08 20:59 ` [patch 04/19] Use an indexed array for LRU variables Rik van Riel
                   ` (19 subsequent siblings)
  22 siblings, 1 reply; 88+ messages in thread
From: Rik van Riel @ 2008-01-08 20:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, Lee Schermerhorn

[-- Attachment #1: rvr-01-linux-2.6-page_file_cache.patch --]
[-- Type: text/plain, Size: 6943 bytes --]

Define page_file_cache() function to answer the question:
	is page backed by a file?

Originally part of Rik van Riel's split-lru patch.  Extracted
to make available for other, independent reclaim patches.

Moved inline function to linux/mm_inline.h where it will
be needed by subsequent "split LRU" and "noreclaim" patches.  

Unfortunately this needs to use a page flag, since the
PG_swapbacked state needs to be preserved all the way
to the point where the page is last removed from the
LRU.  Trying to derive the status from other info in
the page resulted in wrong VM statistics in earlier
split VM patchsets.


Signed-off-by:  Rik van Riel <riel@redhat.com>
Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>


Index: linux-2.6.24-rc6-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/mm_inline.h	2008-01-02 12:37:11.000000000 -0500
+++ linux-2.6.24-rc6-mm1/include/linux/mm_inline.h	2008-01-02 12:37:22.000000000 -0500
@@ -1,3 +1,24 @@
+#ifndef LINUX_MM_INLINE_H
+#define LINUX_MM_INLINE_H
+
+/**
+ * page_file_cache(@page)
+ * Returns !0 if @page is page cache page backed by a regular filesystem,
+ * or 0 if @page is anonymous, tmpfs or otherwise ram or swap backed.
+ *
+ * We would like to get this info without a page flag, but the state
+ * needs to survive until the page is last deleted from the LRU, which
+ * could be as far down as __page_cache_release.
+ */
+static inline int page_file_cache(struct page *page)
+{
+	if (PageSwapBacked(page))
+		return 0;
+
+	/* The page is page cache backed by a normal filesystem. */
+	return 2;
+}
+
 static inline void
 add_page_to_active_list(struct zone *zone, struct page *page)
 {
@@ -38,3 +59,4 @@ del_page_from_lru(struct zone *zone, str
 	}
 }
 
+#endif
Index: linux-2.6.24-rc6-mm1/mm/shmem.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/shmem.c	2008-01-02 12:37:11.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/shmem.c	2008-01-02 12:37:22.000000000 -0500
@@ -1377,6 +1377,7 @@ repeat:
 				goto failed;
 			}
 
+			SetPageSwapBacked(filepage);
 			spin_lock(&info->lock);
 			entry = shmem_swp_alloc(info, idx, sgp);
 			if (IS_ERR(entry))
Index: linux-2.6.24-rc6-mm1/include/linux/page-flags.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/page-flags.h	2008-01-02 12:37:11.000000000 -0500
+++ linux-2.6.24-rc6-mm1/include/linux/page-flags.h	2008-01-02 12:37:22.000000000 -0500
@@ -89,6 +89,7 @@
 #define PG_mappedtodisk		16	/* Has blocks allocated on-disk */
 #define PG_reclaim		17	/* To be reclaimed asap */
 #define PG_buddy		19	/* Page is free, on buddy lists */
+#define PG_swapbacked		20	/* Page is backed by RAM/swap */
 
 /* PG_readahead is only used for file reads; PG_reclaim is only for writes */
 #define PG_readahead		PG_reclaim /* Reminder to do async read-ahead */
@@ -216,6 +217,10 @@ static inline void SetPageUptodate(struc
 #define ClearPageReclaim(page)	clear_bit(PG_reclaim, &(page)->flags)
 #define TestClearPageReclaim(page) test_and_clear_bit(PG_reclaim, &(page)->flags)
 
+#define PageSwapBacked(page)	test_bit(PG_swapbacked, &(page)->flags)
+#define SetPageSwapBacked(page)	set_bit(PG_swapbacked, &(page)->flags)
+#define __ClearPageSwapBacked(page)	__clear_bit(PG_swapbacked, &(page)->flags)
+
 #define PageCompound(page)	test_bit(PG_compound, &(page)->flags)
 #define __SetPageCompound(page)	__set_bit(PG_compound, &(page)->flags)
 #define __ClearPageCompound(page) __clear_bit(PG_compound, &(page)->flags)
Index: linux-2.6.24-rc6-mm1/mm/memory.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/memory.c	2008-01-02 12:37:11.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/memory.c	2008-01-02 12:37:22.000000000 -0500
@@ -1664,6 +1664,7 @@ gotten:
 		ptep_clear_flush(vma, address, page_table);
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
+		SetPageSwapBacked(new_page);
 		lru_cache_add_active(new_page);
 		page_add_new_anon_rmap(new_page, vma, address);
 
@@ -2131,6 +2132,7 @@ static int do_anonymous_page(struct mm_s
 	if (!pte_none(*page_table))
 		goto release;
 	inc_mm_counter(mm, anon_rss);
+	SetPageSwapBacked(page);
 	lru_cache_add_active(page);
 	page_add_new_anon_rmap(page, vma, address);
 	set_pte_at(mm, address, page_table, entry);
@@ -2284,6 +2286,7 @@ static int __do_fault(struct mm_struct *
 		set_pte_at(mm, address, page_table, entry);
 		if (anon) {
                         inc_mm_counter(mm, anon_rss);
+			SetPageSwapBacked(page);
                         lru_cache_add_active(page);
                         page_add_new_anon_rmap(page, vma, address);
 		} else {
Index: linux-2.6.24-rc6-mm1/mm/swap_state.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/swap_state.c	2008-01-02 12:37:11.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/swap_state.c	2008-01-02 12:37:22.000000000 -0500
@@ -82,6 +82,7 @@ int add_to_swap_cache(struct page *page,
 		if (!error) {
 			page_cache_get(page);
 			SetPageSwapCache(page);
+			SetPageSwapBacked(page);
 			set_page_private(page, entry.val);
 			total_swapcache_pages++;
 			__inc_zone_page_state(page, NR_FILE_PAGES);
Index: linux-2.6.24-rc6-mm1/mm/migrate.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/migrate.c	2008-01-02 12:37:14.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/migrate.c	2008-01-02 12:37:22.000000000 -0500
@@ -546,6 +546,8 @@ static int move_to_new_page(struct page 
 	/* Prepare mapping for the new page.*/
 	newpage->index = page->index;
 	newpage->mapping = page->mapping;
+	if (PageSwapBacked(page))
+		SetPageSwapBacked(newpage);
 
 	mapping = page_mapping(page);
 	if (!mapping)
Index: linux-2.6.24-rc6-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/page_alloc.c	2008-01-02 12:37:11.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/page_alloc.c	2008-01-02 12:37:22.000000000 -0500
@@ -253,6 +253,7 @@ static void bad_page(struct page *page)
 			1 << PG_slab    |
 			1 << PG_swapcache |
 			1 << PG_writeback |
+			1 << PG_swapbacked |
 			1 << PG_buddy );
 	set_page_count(page, 0);
 	reset_page_mapcount(page);
@@ -485,6 +486,8 @@ static inline int free_pages_check(struc
 		bad_page(page);
 	if (PageDirty(page))
 		__ClearPageDirty(page);
+	if (PageSwapBacked(page))
+		__ClearPageSwapBacked(page);
 	/*
 	 * For now, we report if PG_reserved was found set, but do not
 	 * clear it, and do not free the page.  But we shall soon need
@@ -631,6 +634,7 @@ static int prep_new_page(struct page *pa
 			1 << PG_swapcache |
 			1 << PG_writeback |
 			1 << PG_reserved |
+			1 << PG_swapbacked |
 			1 << PG_buddy ))))
 		bad_page(page);
 

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [patch 04/19] Use an indexed array for LRU variables
  2008-01-08 20:59 [patch 00/19] VM pageout scalability improvements Rik van Riel
                   ` (2 preceding siblings ...)
  2008-01-08 20:59 ` [patch 03/19] define page_file_cache() function Rik van Riel
@ 2008-01-08 20:59 ` Rik van Riel
  2008-01-08 20:59 ` [patch 05/19] split LRU lists into anon & file sets Rik van Riel
                   ` (18 subsequent siblings)
  22 siblings, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2008-01-08 20:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, Lee Schermerhorn, Christoph Lameter

[-- Attachment #1: cl-use-indexed-array-of-lru-lists.patch --]
[-- Type: text/plain, Size: 14661 bytes --]

V1 -> V2 [lts]:
+ Remove extraneous  __dec_zone_state(zone, NR_ACTIVE) pointed
  out by Mel G.

>From clameter@sgi.com Wed Aug 29 11:39:51 2007

Currently we are defining explicit variables for the inactive
and active list. An indexed array can be more generic and avoid
repeating similar code in several places in the reclaim code.

We are saving a few bytes in terms of code size:

Before:

   text    data     bss     dec     hex filename
4097753  573120 4092484 8763357  85b7dd vmlinux

After:

   text    data     bss     dec     hex filename
4097729  573120 4092484 8763333  85b7c5 vmlinux

Having an easy way to add new lru lists may ease future work on
the reclaim code.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

 include/linux/mm_inline.h |   34 ++++++++---
 include/linux/mmzone.h    |   17 +++--
 mm/page_alloc.c           |    9 +--
 mm/swap.c                 |    2 
 mm/vmscan.c               |  132 ++++++++++++++++++++++------------------------
 mm/vmstat.c               |    3 -
 6 files changed, 107 insertions(+), 90 deletions(-)

Index: linux-2.6.24-rc6-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/mmzone.h	2008-01-02 12:37:11.000000000 -0500
+++ linux-2.6.24-rc6-mm1/include/linux/mmzone.h	2008-01-02 12:37:32.000000000 -0500
@@ -80,8 +80,8 @@ struct zone_padding {
 enum zone_stat_item {
 	/* First 128 byte cacheline (assuming 64 bit words) */
 	NR_FREE_PAGES,
-	NR_INACTIVE,
-	NR_ACTIVE,
+	NR_INACTIVE,	/* must match order of LRU_[IN]ACTIVE */
+	NR_ACTIVE,	/*  "     "     "   "       "         */
 	NR_ANON_PAGES,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
 			   only modified from process context */
@@ -105,6 +105,13 @@ enum zone_stat_item {
 #endif
 	NR_VM_ZONE_STAT_ITEMS };
 
+enum lru_list {
+	LRU_INACTIVE,	/* must match order of NR_[IN]ACTIVE */
+	LRU_ACTIVE,	/*  "     "     "   "       "        */
+	NR_LRU_LISTS };
+
+#define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)
+
 struct per_cpu_pages {
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
@@ -258,10 +265,8 @@ struct zone {
 
 	/* Fields commonly accessed by the page reclaim scanner */
 	spinlock_t		lru_lock;	
-	struct list_head	active_list;
-	struct list_head	inactive_list;
-	unsigned long		nr_scan_active;
-	unsigned long		nr_scan_inactive;
+	struct list_head	list[NR_LRU_LISTS];
+	unsigned long		nr_scan[NR_LRU_LISTS];
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	unsigned long		flags;		   /* zone flags, see below */
 
Index: linux-2.6.24-rc6-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/mm_inline.h	2008-01-02 12:37:27.000000000 -0500
+++ linux-2.6.24-rc6-mm1/include/linux/mm_inline.h	2008-01-02 12:37:32.000000000 -0500
@@ -30,43 +30,55 @@ static inline int page_file_cache(struct
 }
 
 static inline void
+add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
+{
+	list_add(&page->lru, &zone->list[l]);
+	__inc_zone_state(zone, NR_INACTIVE + l);
+}
+
+static inline void
+del_page_from_lru_list(struct zone *zone, struct page *page, enum lru_list l)
+{
+	list_del(&page->lru);
+	__dec_zone_state(zone, NR_INACTIVE + l);
+}
+
+
+static inline void
 add_page_to_active_list(struct zone *zone, struct page *page)
 {
-	list_add(&page->lru, &zone->active_list);
-	__inc_zone_state(zone, NR_ACTIVE);
+	add_page_to_lru_list(zone, page, LRU_ACTIVE);
 }
 
 static inline void
 add_page_to_inactive_list(struct zone *zone, struct page *page)
 {
-	list_add(&page->lru, &zone->inactive_list);
-	__inc_zone_state(zone, NR_INACTIVE);
+	add_page_to_lru_list(zone, page, LRU_INACTIVE);
 }
 
 static inline void
 del_page_from_active_list(struct zone *zone, struct page *page)
 {
-	list_del(&page->lru);
-	__dec_zone_state(zone, NR_ACTIVE);
+	del_page_from_lru_list(zone, page, LRU_ACTIVE);
 }
 
 static inline void
 del_page_from_inactive_list(struct zone *zone, struct page *page)
 {
-	list_del(&page->lru);
-	__dec_zone_state(zone, NR_INACTIVE);
+	del_page_from_lru_list(zone, page, LRU_INACTIVE);
 }
 
 static inline void
 del_page_from_lru(struct zone *zone, struct page *page)
 {
+	enum lru_list l = LRU_INACTIVE;
+
 	list_del(&page->lru);
 	if (PageActive(page)) {
 		__ClearPageActive(page);
-		__dec_zone_state(zone, NR_ACTIVE);
-	} else {
-		__dec_zone_state(zone, NR_INACTIVE);
+		l = LRU_ACTIVE;
 	}
+	__dec_zone_state(zone, NR_INACTIVE + l);
 }
 
 #endif
Index: linux-2.6.24-rc6-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/page_alloc.c	2008-01-02 12:37:22.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/page_alloc.c	2008-01-02 12:37:32.000000000 -0500
@@ -3413,6 +3413,7 @@ static void __meminit free_area_init_cor
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
 		unsigned long size, realsize, memmap_pages;
+		enum lru_list l;
 
 		size = zone_spanned_pages_in_node(nid, j, zones_size);
 		realsize = size - zone_absent_pages_in_node(nid, j,
@@ -3462,10 +3463,10 @@ static void __meminit free_area_init_cor
 		zone->prev_priority = DEF_PRIORITY;
 
 		zone_pcp_init(zone);
-		INIT_LIST_HEAD(&zone->active_list);
-		INIT_LIST_HEAD(&zone->inactive_list);
-		zone->nr_scan_active = 0;
-		zone->nr_scan_inactive = 0;
+		for_each_lru(l) {
+			INIT_LIST_HEAD(&zone->list[l]);
+			zone->nr_scan[l] = 0;
+		}
 		zap_zone_vm_stats(zone);
 		zone->flags = 0;
 		if (!size)
Index: linux-2.6.24-rc6-mm1/mm/swap.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/swap.c	2008-01-02 12:37:18.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/swap.c	2008-01-02 12:37:32.000000000 -0500
@@ -118,7 +118,7 @@ static void pagevec_move_tail(struct pag
 			spin_lock(&zone->lru_lock);
 		}
 		if (PageLRU(page) && !PageActive(page)) {
-			list_move_tail(&page->lru, &zone->inactive_list);
+			list_move_tail(&page->lru, &zone->list[LRU_INACTIVE]);
 			pgmoved++;
 		}
 	}
Index: linux-2.6.24-rc6-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/vmscan.c	2008-01-02 12:37:18.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/vmscan.c	2008-01-02 12:37:32.000000000 -0500
@@ -807,10 +807,10 @@ static unsigned long isolate_pages_globa
 					int active)
 {
 	if (active)
-		return isolate_lru_pages(nr, &z->active_list, dst,
+		return isolate_lru_pages(nr, &z->list[LRU_ACTIVE], dst,
 						scanned, order, mode);
 	else
-		return isolate_lru_pages(nr, &z->inactive_list, dst,
+		return isolate_lru_pages(nr, &z->list[LRU_INACTIVE], dst,
 						scanned, order, mode);
 }
 
@@ -957,10 +957,7 @@ static unsigned long shrink_inactive_lis
 			VM_BUG_ON(PageLRU(page));
 			SetPageLRU(page);
 			list_del(&page->lru);
-			if (PageActive(page))
-				add_page_to_active_list(zone, page);
-			else
-				add_page_to_inactive_list(zone, page);
+			add_page_to_lru_list(zone, page, PageActive(page));
 			if (!pagevec_add(&pvec, page)) {
 				spin_unlock_irq(&zone->lru_lock);
 				__pagevec_release(&pvec);
@@ -1128,11 +1125,14 @@ static void shrink_active_list(unsigned 
 	int pgdeactivate = 0;
 	unsigned long pgscanned;
 	LIST_HEAD(l_hold);	/* The pages which were snipped off */
-	LIST_HEAD(l_inactive);	/* Pages to go onto the inactive_list */
-	LIST_HEAD(l_active);	/* Pages to go onto the active_list */
+	struct list_head list[NR_LRU_LISTS];
 	struct page *page;
 	struct pagevec pvec;
 	int reclaim_mapped = 0;
+	enum lru_list l;
+
+	for_each_lru(l)
+		INIT_LIST_HEAD(&list[l]);
 
 	if (sc->may_swap)
 		reclaim_mapped = calc_reclaim_mapped(sc, zone, priority);
@@ -1160,28 +1160,28 @@ static void shrink_active_list(unsigned 
 			if (!reclaim_mapped ||
 			    (total_swap_pages == 0 && PageAnon(page)) ||
 			    page_referenced(page, 0, sc->mem_cgroup)) {
-				list_add(&page->lru, &l_active);
+				list_add(&page->lru, &list[LRU_ACTIVE]);
 				continue;
 			}
 		} else if (TestClearPageReferenced(page)) {
-			list_add(&page->lru, &l_active);
+			list_add(&page->lru, &list[LRU_ACTIVE]);
 			continue;
 		}
-		list_add(&page->lru, &l_inactive);
+		list_add(&page->lru, &list[LRU_INACTIVE]);
 	}
 
 	pagevec_init(&pvec, 1);
 	pgmoved = 0;
 	spin_lock_irq(&zone->lru_lock);
-	while (!list_empty(&l_inactive)) {
-		page = lru_to_page(&l_inactive);
-		prefetchw_prev_lru_page(page, &l_inactive, flags);
+	while (!list_empty(&list[LRU_INACTIVE])) {
+		page = lru_to_page(&list[LRU_INACTIVE]);
+		prefetchw_prev_lru_page(page, &list[LRU_INACTIVE], flags);
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 		VM_BUG_ON(!PageActive(page));
 		ClearPageActive(page);
 
-		list_move(&page->lru, &zone->inactive_list);
+		list_move(&page->lru, &zone->list[LRU_INACTIVE]);
 		mem_cgroup_move_lists(page_get_page_cgroup(page), false);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
@@ -1204,13 +1204,13 @@ static void shrink_active_list(unsigned 
 	}
 
 	pgmoved = 0;
-	while (!list_empty(&l_active)) {
-		page = lru_to_page(&l_active);
-		prefetchw_prev_lru_page(page, &l_active, flags);
+	while (!list_empty(&list[LRU_ACTIVE])) {
+		page = lru_to_page(&list[LRU_ACTIVE]);
+		prefetchw_prev_lru_page(page, &list[LRU_ACTIVE], flags);
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 		VM_BUG_ON(!PageActive(page));
-		list_move(&page->lru, &zone->active_list);
+		list_move(&page->lru, &zone->list[LRU_ACTIVE]);
 		mem_cgroup_move_lists(page_get_page_cgroup(page), true);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
@@ -1234,65 +1234,64 @@ static void shrink_active_list(unsigned 
 	pagevec_release(&pvec);
 }
 
+static unsigned long shrink_list(enum lru_list l, unsigned long nr_to_scan,
+	struct zone *zone, struct scan_control *sc, int priority)
+{
+	if (l == LRU_ACTIVE) {
+		shrink_active_list(nr_to_scan, zone, sc, priority);
+		return 0;
+	}
+	return shrink_inactive_list(nr_to_scan, zone, sc);
+}
+
 /*
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
 static unsigned long shrink_zone(int priority, struct zone *zone,
 				struct scan_control *sc)
 {
-	unsigned long nr_active;
-	unsigned long nr_inactive;
+	unsigned long nr[NR_LRU_LISTS];
 	unsigned long nr_to_scan;
 	unsigned long nr_reclaimed = 0;
+	enum lru_list l;
 
 	if (scan_global_lru(sc)) {
 		/*
 		 * Add one to nr_to_scan just to make sure that the kernel
 		 * will slowly sift through the active list.
 		 */
-		zone->nr_scan_active +=
-			(zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
-		nr_active = zone->nr_scan_active;
-		zone->nr_scan_inactive +=
-			(zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
-		nr_inactive = zone->nr_scan_inactive;
-		if (nr_inactive >= sc->swap_cluster_max)
-			zone->nr_scan_inactive = 0;
-		else
-			nr_inactive = 0;
-
-		if (nr_active >= sc->swap_cluster_max)
-			zone->nr_scan_active = 0;
-		else
-			nr_active = 0;
+		for_each_lru(l) {
+			zone->nr_scan[l] += (zone_page_state(zone,
+					NR_INACTIVE + l)  >> priority) + 1;
+			nr[l] = zone->nr_scan[l];
+			if (nr[l] >= sc->swap_cluster_max)
+				zone->nr_scan[l] = 0;
+			else
+				nr[l] = 0;
+		}
 	} else {
 		/*
 		 * This reclaim occurs not because zone memory shortage but
 		 * because memory controller hits its limit.
 		 * Then, don't modify zone reclaim related data.
 		 */
-		nr_active = mem_cgroup_calc_reclaim_active(sc->mem_cgroup,
+		nr[LRU_ACTIVE] = mem_cgroup_calc_reclaim_active(sc->mem_cgroup,
 					zone, priority);
 
-		nr_inactive = mem_cgroup_calc_reclaim_inactive(sc->mem_cgroup,
+		nr[LRU_INACTIVE] = mem_cgroup_calc_reclaim_inactive(sc->mem_cgroup,
 					zone, priority);
 	}
 
-
-	while (nr_active || nr_inactive) {
-		if (nr_active) {
-			nr_to_scan = min(nr_active,
+	while (nr[LRU_ACTIVE] || nr[LRU_INACTIVE]) {
+		for_each_lru(l) {
+			if (nr[l]) {
+				nr_to_scan = min(nr[l],
 					(unsigned long)sc->swap_cluster_max);
-			nr_active -= nr_to_scan;
-			shrink_active_list(nr_to_scan, zone, sc, priority);
-		}
+				nr[l] -= nr_to_scan;
 
-		if (nr_inactive) {
-			nr_to_scan = min(nr_inactive,
-					(unsigned long)sc->swap_cluster_max);
-			nr_inactive -= nr_to_scan;
-			nr_reclaimed += shrink_inactive_list(nr_to_scan, zone,
-								sc);
+				nr_reclaimed += shrink_list(l, nr_to_scan,
+							zone, sc, priority);
+			}
 		}
 	}
 
@@ -1809,6 +1808,7 @@ static unsigned long shrink_all_zones(un
 {
 	struct zone *zone;
 	unsigned long nr_to_scan, ret = 0;
+	enum lru_list l;
 
 	for_each_zone(zone) {
 
@@ -1818,28 +1818,25 @@ static unsigned long shrink_all_zones(un
 		if (zone_is_all_unreclaimable(zone) && prio != DEF_PRIORITY)
 			continue;
 
-		/* For pass = 0 we don't shrink the active list */
-		if (pass > 0) {
-			zone->nr_scan_active +=
-				(zone_page_state(zone, NR_ACTIVE) >> prio) + 1;
-			if (zone->nr_scan_active >= nr_pages || pass > 3) {
-				zone->nr_scan_active = 0;
+		for_each_lru(l) {
+			/* For pass = 0 we don't shrink the active list */
+			if (pass == 0 && l == LRU_ACTIVE)
+				continue;
+
+			zone->nr_scan[l] +=
+				(zone_page_state(zone, NR_INACTIVE + l)
+								>> prio) + 1;
+			if (zone->nr_scan[l] >= nr_pages || pass > 3) {
+				zone->nr_scan[l] = 0;
 				nr_to_scan = min(nr_pages,
-					zone_page_state(zone, NR_ACTIVE));
-				shrink_active_list(nr_to_scan, zone, sc, prio);
+					zone_page_state(zone,
+							NR_INACTIVE + l));
+				ret += shrink_list(l, nr_to_scan, zone,
+								sc, prio);
+				if (ret >= nr_pages)
+					return ret;
 			}
 		}
-
-		zone->nr_scan_inactive +=
-			(zone_page_state(zone, NR_INACTIVE) >> prio) + 1;
-		if (zone->nr_scan_inactive >= nr_pages || pass > 3) {
-			zone->nr_scan_inactive = 0;
-			nr_to_scan = min(nr_pages,
-				zone_page_state(zone, NR_INACTIVE));
-			ret += shrink_inactive_list(nr_to_scan, zone, sc);
-			if (ret >= nr_pages)
-				return ret;
-		}
 	}
 
 	return ret;
Index: linux-2.6.24-rc6-mm1/mm/vmstat.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/vmstat.c	2008-01-02 12:37:11.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/vmstat.c	2008-01-02 12:37:32.000000000 -0500
@@ -758,7 +758,8 @@ static void zoneinfo_show_print(struct s
 		   zone->pages_low,
 		   zone->pages_high,
 		   zone->pages_scanned,
-		   zone->nr_scan_active, zone->nr_scan_inactive,
+		   zone->nr_scan[LRU_ACTIVE],
+		   zone->nr_scan[LRU_INACTIVE],
 		   zone->spanned_pages,
 		   zone->present_pages);
 

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [patch 05/19] split LRU lists into anon & file sets
  2008-01-08 20:59 [patch 00/19] VM pageout scalability improvements Rik van Riel
                   ` (3 preceding siblings ...)
  2008-01-08 20:59 ` [patch 04/19] Use an indexed array for LRU variables Rik van Riel
@ 2008-01-08 20:59 ` Rik van Riel
  2008-01-08 22:22   ` Christoph Lameter
                     ` (6 more replies)
  2008-01-08 20:59 ` [patch 06/19] SEQ replacement for anonymous pages Rik van Riel
                   ` (17 subsequent siblings)
  22 siblings, 7 replies; 88+ messages in thread
From: Rik van Riel @ 2008-01-08 20:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, Lee Schermerhorn

[-- Attachment #1: rvr-02-linux-2.6-vm-split-lrus.patch --]
[-- Type: text/plain, Size: 70335 bytes --]

Split the LRU lists in two, one set for pages that are backed by
real file systems ("file") and one for pages that are backed by
memory and swap ("anon").  The latter includes tmpfs.

Eventually mlocked pages will be taken off the LRUs alltogether.
A patch for that already exists and just needs to be integrated
into this series.

This patch mostly has the infrastructure and a basic policy to
balance how much we scan the anon lists and how much we scan
the file lists. The big policy changes are in separate patches.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

Index: linux-2.6.24-rc6-mm1/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/fs/proc/proc_misc.c	2008-01-07 11:55:09.000000000 -0500
+++ linux-2.6.24-rc6-mm1/fs/proc/proc_misc.c	2008-01-07 17:31:18.000000000 -0500
@@ -153,43 +153,47 @@ static int meminfo_read_proc(char *page,
 	 * Tagged format, for easy grepping and expansion.
 	 */
 	len = sprintf(page,
-		"MemTotal:     %8lu kB\n"
-		"MemFree:      %8lu kB\n"
-		"Buffers:      %8lu kB\n"
-		"Cached:       %8lu kB\n"
-		"SwapCached:   %8lu kB\n"
-		"Active:       %8lu kB\n"
-		"Inactive:     %8lu kB\n"
+		"MemTotal:       %8lu kB\n"
+		"MemFree:        %8lu kB\n"
+		"Buffers:        %8lu kB\n"
+		"Cached:         %8lu kB\n"
+		"SwapCached:     %8lu kB\n"
+		"Active(anon):   %8lu kB\n"
+		"Inactive(anon): %8lu kB\n"
+		"Active(file):   %8lu kB\n"
+		"Inactive(file): %8lu kB\n"
 #ifdef CONFIG_HIGHMEM
-		"HighTotal:    %8lu kB\n"
-		"HighFree:     %8lu kB\n"
-		"LowTotal:     %8lu kB\n"
-		"LowFree:      %8lu kB\n"
-#endif
-		"SwapTotal:    %8lu kB\n"
-		"SwapFree:     %8lu kB\n"
-		"Dirty:        %8lu kB\n"
-		"Writeback:    %8lu kB\n"
-		"AnonPages:    %8lu kB\n"
-		"Mapped:       %8lu kB\n"
-		"Slab:         %8lu kB\n"
-		"SReclaimable: %8lu kB\n"
-		"SUnreclaim:   %8lu kB\n"
-		"PageTables:   %8lu kB\n"
-		"NFS_Unstable: %8lu kB\n"
-		"Bounce:       %8lu kB\n"
-		"CommitLimit:  %8lu kB\n"
-		"Committed_AS: %8lu kB\n"
-		"VmallocTotal: %8lu kB\n"
-		"VmallocUsed:  %8lu kB\n"
-		"VmallocChunk: %8lu kB\n",
+		"HighTotal:      %8lu kB\n"
+		"HighFree:       %8lu kB\n"
+		"LowTotal:       %8lu kB\n"
+		"LowFree:        %8lu kB\n"
+#endif
+		"SwapTotal:      %8lu kB\n"
+		"SwapFree:       %8lu kB\n"
+		"Dirty:          %8lu kB\n"
+		"Writeback:      %8lu kB\n"
+		"AnonPages:      %8lu kB\n"
+		"Mapped:         %8lu kB\n"
+		"Slab:           %8lu kB\n"
+		"SReclaimable:   %8lu kB\n"
+		"SUnreclaim:     %8lu kB\n"
+		"PageTables:     %8lu kB\n"
+		"NFS_Unstable:   %8lu kB\n"
+		"Bounce:         %8lu kB\n"
+		"CommitLimit:    %8lu kB\n"
+		"Committed_AS:   %8lu kB\n"
+		"VmallocTotal:   %8lu kB\n"
+		"VmallocUsed:    %8lu kB\n"
+		"VmallocChunk:   %8lu kB\n",
 		K(i.totalram),
 		K(i.freeram),
 		K(i.bufferram),
 		K(cached),
 		K(total_swapcache_pages),
-		K(global_page_state(NR_ACTIVE)),
-		K(global_page_state(NR_INACTIVE)),
+		K(global_page_state(NR_ACTIVE_ANON)),
+		K(global_page_state(NR_INACTIVE_ANON)),
+		K(global_page_state(NR_ACTIVE_FILE)),
+		K(global_page_state(NR_INACTIVE_FILE)),
 #ifdef CONFIG_HIGHMEM
 		K(i.totalhigh),
 		K(i.freehigh),
Index: linux-2.6.24-rc6-mm1/fs/cifs/file.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/fs/cifs/file.c	2008-01-07 11:55:09.000000000 -0500
+++ linux-2.6.24-rc6-mm1/fs/cifs/file.c	2008-01-07 17:31:18.000000000 -0500
@@ -1783,7 +1783,7 @@ static void cifs_copy_cache_pages(struct
 		SetPageUptodate(page);
 		unlock_page(page);
 		if (!pagevec_add(plru_pvec, page))
-			__pagevec_lru_add(plru_pvec);
+			__pagevec_lru_add_file(plru_pvec);
 		data += PAGE_CACHE_SIZE;
 	}
 	return;
@@ -1921,7 +1921,7 @@ static int cifs_readpages(struct file *f
 		bytes_read = 0;
 	}
 
-	pagevec_lru_add(&lru_pvec);
+	pagevec_lru_add_file(&lru_pvec);
 
 /* need to free smb_read_data buf before exit */
 	if (smb_read_data) {
Index: linux-2.6.24-rc6-mm1/fs/ntfs/file.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/fs/ntfs/file.c	2008-01-07 11:55:09.000000000 -0500
+++ linux-2.6.24-rc6-mm1/fs/ntfs/file.c	2008-01-07 17:31:18.000000000 -0500
@@ -439,7 +439,7 @@ static inline int __ntfs_grab_cache_page
 			pages[nr] = *cached_page;
 			page_cache_get(*cached_page);
 			if (unlikely(!pagevec_add(lru_pvec, *cached_page)))
-				__pagevec_lru_add(lru_pvec);
+				__pagevec_lru_add_file(lru_pvec);
 			*cached_page = NULL;
 		}
 		index++;
@@ -2084,7 +2084,7 @@ err_out:
 						OSYNC_METADATA|OSYNC_DATA);
 		}
   	}
-	pagevec_lru_add(&lru_pvec);
+	pagevec_lru_add_file(&lru_pvec);
 	ntfs_debug("Done.  Returning %s (written 0x%lx, status %li).",
 			written ? "written" : "status", (unsigned long)written,
 			(long)status);
Index: linux-2.6.24-rc6-mm1/fs/nfs/dir.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/fs/nfs/dir.c	2008-01-07 11:55:09.000000000 -0500
+++ linux-2.6.24-rc6-mm1/fs/nfs/dir.c	2008-01-07 17:31:18.000000000 -0500
@@ -1497,7 +1497,7 @@ static int nfs_symlink(struct inode *dir
 	if (!add_to_page_cache(page, dentry->d_inode->i_mapping, 0,
 							GFP_KERNEL)) {
 		pagevec_add(&lru_pvec, page);
-		pagevec_lru_add(&lru_pvec);
+		pagevec_lru_add_file(&lru_pvec);
 		SetPageUptodate(page);
 		unlock_page(page);
 	} else
Index: linux-2.6.24-rc6-mm1/fs/ramfs/file-nommu.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/fs/ramfs/file-nommu.c	2008-01-07 11:55:09.000000000 -0500
+++ linux-2.6.24-rc6-mm1/fs/ramfs/file-nommu.c	2008-01-07 17:31:18.000000000 -0500
@@ -111,12 +111,12 @@ static int ramfs_nommu_expand_for_mappin
 			goto add_error;
 
 		if (!pagevec_add(&lru_pvec, page))
-			__pagevec_lru_add(&lru_pvec);
+			__pagevec_lru_add_file(&lru_pvec);
 
 		unlock_page(page);
 	}
 
-	pagevec_lru_add(&lru_pvec);
+	pagevec_lru_add_file(&lru_pvec);
 	return 0;
 
  fsize_exceeded:
Index: linux-2.6.24-rc6-mm1/drivers/base/node.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/drivers/base/node.c	2008-01-07 11:55:09.000000000 -0500
+++ linux-2.6.24-rc6-mm1/drivers/base/node.c	2008-01-07 17:31:18.000000000 -0500
@@ -45,33 +45,37 @@ static ssize_t node_read_meminfo(struct 
 	si_meminfo_node(&i, nid);
 
 	n = sprintf(buf, "\n"
-		       "Node %d MemTotal:     %8lu kB\n"
-		       "Node %d MemFree:      %8lu kB\n"
-		       "Node %d MemUsed:      %8lu kB\n"
-		       "Node %d Active:       %8lu kB\n"
-		       "Node %d Inactive:     %8lu kB\n"
+		       "Node %d MemTotal:       %8lu kB\n"
+		       "Node %d MemFree:        %8lu kB\n"
+		       "Node %d MemUsed:        %8lu kB\n"
+		       "Node %d Active(anon):   %8lu kB\n"
+		       "Node %d Inactive(anon): %8lu kB\n"
+		       "Node %d Active(file):   %8lu kB\n"
+		       "Node %d Inactive(file): %8lu kB\n"
 #ifdef CONFIG_HIGHMEM
-		       "Node %d HighTotal:    %8lu kB\n"
-		       "Node %d HighFree:     %8lu kB\n"
-		       "Node %d LowTotal:     %8lu kB\n"
-		       "Node %d LowFree:      %8lu kB\n"
+		       "Node %d HighTotal:      %8lu kB\n"
+		       "Node %d HighFree:       %8lu kB\n"
+		       "Node %d LowTotal:       %8lu kB\n"
+		       "Node %d LowFree:        %8lu kB\n"
 #endif
-		       "Node %d Dirty:        %8lu kB\n"
-		       "Node %d Writeback:    %8lu kB\n"
-		       "Node %d FilePages:    %8lu kB\n"
-		       "Node %d Mapped:       %8lu kB\n"
-		       "Node %d AnonPages:    %8lu kB\n"
-		       "Node %d PageTables:   %8lu kB\n"
-		       "Node %d NFS_Unstable: %8lu kB\n"
-		       "Node %d Bounce:       %8lu kB\n"
-		       "Node %d Slab:         %8lu kB\n"
-		       "Node %d SReclaimable: %8lu kB\n"
-		       "Node %d SUnreclaim:   %8lu kB\n",
+		       "Node %d Dirty:          %8lu kB\n"
+		       "Node %d Writeback:      %8lu kB\n"
+		       "Node %d FilePages:      %8lu kB\n"
+		       "Node %d Mapped:         %8lu kB\n"
+		       "Node %d AnonPages:      %8lu kB\n"
+		       "Node %d PageTables:     %8lu kB\n"
+		       "Node %d NFS_Unstable:   %8lu kB\n"
+		       "Node %d Bounce:         %8lu kB\n"
+		       "Node %d Slab:           %8lu kB\n"
+		       "Node %d SReclaimable:   %8lu kB\n"
+		       "Node %d SUnreclaim:     %8lu kB\n",
 		       nid, K(i.totalram),
 		       nid, K(i.freeram),
 		       nid, K(i.totalram - i.freeram),
-		       nid, node_page_state(nid, NR_ACTIVE),
-		       nid, node_page_state(nid, NR_INACTIVE),
+		       nid, node_page_state(nid, NR_ACTIVE_ANON),
+		       nid, node_page_state(nid, NR_INACTIVE_ANON),
+		       nid, node_page_state(nid, NR_ACTIVE_FILE),
+		       nid, node_page_state(nid, NR_INACTIVE_FILE),
 #ifdef CONFIG_HIGHMEM
 		       nid, K(i.totalhigh),
 		       nid, K(i.freehigh),
Index: linux-2.6.24-rc6-mm1/mm/memory.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/memory.c	2008-01-07 17:30:25.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/memory.c	2008-01-07 17:31:18.000000000 -0500
@@ -1665,7 +1665,7 @@ gotten:
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
 		SetPageSwapBacked(new_page);
-		lru_cache_add_active(new_page);
+		lru_cache_add_active_anon(new_page);
 		page_add_new_anon_rmap(new_page, vma, address);
 
 		/* Free the old page.. */
@@ -2133,7 +2133,7 @@ static int do_anonymous_page(struct mm_s
 		goto release;
 	inc_mm_counter(mm, anon_rss);
 	SetPageSwapBacked(page);
-	lru_cache_add_active(page);
+	lru_cache_add_active_anon(page);
 	page_add_new_anon_rmap(page, vma, address);
 	set_pte_at(mm, address, page_table, entry);
 
@@ -2287,7 +2287,7 @@ static int __do_fault(struct mm_struct *
 		if (anon) {
                         inc_mm_counter(mm, anon_rss);
 			SetPageSwapBacked(page);
-                        lru_cache_add_active(page);
+                        lru_cache_add_active_anon(page);
                         page_add_new_anon_rmap(page, vma, address);
 		} else {
 			inc_mm_counter(mm, file_rss);
Index: linux-2.6.24-rc6-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/page_alloc.c	2008-01-07 17:30:50.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/page_alloc.c	2008-01-07 17:31:18.000000000 -0500
@@ -1889,10 +1889,13 @@ void show_free_areas(void)
 		}
 	}
 
-	printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu\n"
+	printk("Active_anon:%lu active_file:%lu inactive_anon%lu\n"
+		" inactive_file:%lu dirty:%lu writeback:%lu unstable:%lu\n"
 		" free:%lu slab:%lu mapped:%lu pagetables:%lu bounce:%lu\n",
-		global_page_state(NR_ACTIVE),
-		global_page_state(NR_INACTIVE),
+		global_page_state(NR_ACTIVE_ANON),
+		global_page_state(NR_ACTIVE_FILE),
+		global_page_state(NR_INACTIVE_ANON),
+		global_page_state(NR_INACTIVE_FILE),
 		global_page_state(NR_FILE_DIRTY),
 		global_page_state(NR_WRITEBACK),
 		global_page_state(NR_UNSTABLE_NFS),
@@ -1915,8 +1918,10 @@ void show_free_areas(void)
 			" min:%lukB"
 			" low:%lukB"
 			" high:%lukB"
-			" active:%lukB"
-			" inactive:%lukB"
+			" active_anon:%lukB"
+			" inactive_anon:%lukB"
+			" active_file:%lukB"
+			" inactive_file:%lukB"
 			" present:%lukB"
 			" pages_scanned:%lu"
 			" all_unreclaimable? %s"
@@ -1926,8 +1931,10 @@ void show_free_areas(void)
 			K(zone->pages_min),
 			K(zone->pages_low),
 			K(zone->pages_high),
-			K(zone_page_state(zone, NR_ACTIVE)),
-			K(zone_page_state(zone, NR_INACTIVE)),
+			K(zone_page_state(zone, NR_ACTIVE_ANON)),
+			K(zone_page_state(zone, NR_INACTIVE_ANON)),
+			K(zone_page_state(zone, NR_ACTIVE_FILE)),
+			K(zone_page_state(zone, NR_INACTIVE_FILE)),
 			K(zone->present_pages),
 			zone->pages_scanned,
 			(zone_is_all_unreclaimable(zone) ? "yes" : "no")
@@ -3467,6 +3474,9 @@ static void __meminit free_area_init_cor
 			INIT_LIST_HEAD(&zone->list[l]);
 			zone->nr_scan[l] = 0;
 		}
+		zone->recent_rotated_anon = 0;
+		zone->recent_rotated_file = 0;
+//TODO recent_scanned_* ???
 		zap_zone_vm_stats(zone);
 		zone->flags = 0;
 		if (!size)
Index: linux-2.6.24-rc6-mm1/mm/swap.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/swap.c	2008-01-07 17:30:50.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/swap.c	2008-01-07 17:31:18.000000000 -0500
@@ -34,8 +34,10 @@
 /* How many pages do we try to swap or page in/out together? */
 int page_cluster;
 
-static DEFINE_PER_CPU(struct pagevec, lru_add_pvecs) = { 0, };
-static DEFINE_PER_CPU(struct pagevec, lru_add_active_pvecs) = { 0, };
+static DEFINE_PER_CPU(struct pagevec, lru_add_file_pvecs) = { 0, };
+static DEFINE_PER_CPU(struct pagevec, lru_add_active_file_pvecs) = { 0, };
+static DEFINE_PER_CPU(struct pagevec, lru_add_anon_pvecs) = { 0, };
+static DEFINE_PER_CPU(struct pagevec, lru_add_active_anon_pvecs) = { 0, };
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs) = { 0, };
 
 /*
@@ -118,7 +120,13 @@ static void pagevec_move_tail(struct pag
 			spin_lock(&zone->lru_lock);
 		}
 		if (PageLRU(page) && !PageActive(page)) {
-			list_move_tail(&page->lru, &zone->list[LRU_INACTIVE]);
+			if (page_file_cache(page)) {
+				list_move_tail(&page->lru,
+						&zone->list[LRU_INACTIVE_FILE]);
+			} else {
+				list_move_tail(&page->lru,
+						&zone->list[LRU_INACTIVE_ANON]);
+			}
 			pgmoved++;
 		}
 	}
@@ -172,9 +180,13 @@ void activate_page(struct page *page)
 
 	spin_lock_irq(&zone->lru_lock);
 	if (PageLRU(page) && !PageActive(page)) {
-		del_page_from_inactive_list(zone, page);
+		int lru = LRU_BASE;
+		lru += page_file_cache(page);
+		del_page_from_lru_list(zone, page, lru);
+
 		SetPageActive(page);
-		add_page_to_active_list(zone, page);
+		lru += LRU_ACTIVE;
+		add_page_to_lru_list(zone, page, lru);
 		__count_vm_event(PGACTIVATE);
 		mem_cgroup_move_lists(page_get_page_cgroup(page), true);
 	}
@@ -204,24 +216,44 @@ EXPORT_SYMBOL(mark_page_accessed);
  * lru_cache_add: add a page to the page lists
  * @page: the page to add
  */
-void lru_cache_add(struct page *page)
+void lru_cache_add_anon(struct page *page)
 {
-	struct pagevec *pvec = &get_cpu_var(lru_add_pvecs);
+	struct pagevec *pvec = &get_cpu_var(lru_add_anon_pvecs);
 
 	page_cache_get(page);
 	if (!pagevec_add(pvec, page))
-		__pagevec_lru_add(pvec);
-	put_cpu_var(lru_add_pvecs);
+		__pagevec_lru_add_anon(pvec);
+	put_cpu_var(lru_add_anon_pvecs);
 }
 
-void lru_cache_add_active(struct page *page)
+void lru_cache_add_file(struct page *page)
 {
-	struct pagevec *pvec = &get_cpu_var(lru_add_active_pvecs);
+	struct pagevec *pvec = &get_cpu_var(lru_add_file_pvecs);
 
 	page_cache_get(page);
 	if (!pagevec_add(pvec, page))
-		__pagevec_lru_add_active(pvec);
-	put_cpu_var(lru_add_active_pvecs);
+		__pagevec_lru_add_file(pvec);
+	put_cpu_var(lru_add_file_pvecs);
+}
+
+void lru_cache_add_active_anon(struct page *page)
+{
+	struct pagevec *pvec = &get_cpu_var(lru_add_active_anon_pvecs);
+
+	page_cache_get(page);
+	if (!pagevec_add(pvec, page))
+		__pagevec_lru_add_active_anon(pvec);
+	put_cpu_var(lru_add_active_anon_pvecs);
+}
+
+void lru_cache_add_active_file(struct page *page)
+{
+	struct pagevec *pvec = &get_cpu_var(lru_add_active_file_pvecs);
+
+	page_cache_get(page);
+	if (!pagevec_add(pvec, page))
+		__pagevec_lru_add_active_file(pvec);
+	put_cpu_var(lru_add_active_file_pvecs);
 }
 
 /*
@@ -233,13 +265,21 @@ static void drain_cpu_pagevecs(int cpu)
 {
 	struct pagevec *pvec;
 
-	pvec = &per_cpu(lru_add_pvecs, cpu);
+	pvec = &per_cpu(lru_add_file_pvecs, cpu);
+	if (pagevec_count(pvec))
+		__pagevec_lru_add_file(pvec);
+
+	pvec = &per_cpu(lru_add_anon_pvecs, cpu);
 	if (pagevec_count(pvec))
-		__pagevec_lru_add(pvec);
+		__pagevec_lru_add_anon(pvec);
 
-	pvec = &per_cpu(lru_add_active_pvecs, cpu);
+	pvec = &per_cpu(lru_add_active_file_pvecs, cpu);
 	if (pagevec_count(pvec))
-		__pagevec_lru_add_active(pvec);
+		__pagevec_lru_add_active_file(pvec);
+
+	pvec = &per_cpu(lru_add_active_anon_pvecs, cpu);
+	if (pagevec_count(pvec))
+		__pagevec_lru_add_active_anon(pvec);
 
 	pvec = &per_cpu(lru_rotate_pvecs, cpu);
 	if (pagevec_count(pvec)) {
@@ -393,7 +433,7 @@ void __pagevec_release_nonlru(struct pag
  * Add the passed pages to the LRU, then drop the caller's refcount
  * on them.  Reinitialises the caller's pagevec.
  */
-void __pagevec_lru_add(struct pagevec *pvec)
+void __pagevec_lru_add_file(struct pagevec *pvec)
 {
 	int i;
 	struct zone *zone = NULL;
@@ -410,7 +450,7 @@ void __pagevec_lru_add(struct pagevec *p
 		}
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
-		add_page_to_inactive_list(zone, page);
+		add_page_to_inactive_file_list(zone, page);
 	}
 	if (zone)
 		spin_unlock_irq(&zone->lru_lock);
@@ -418,9 +458,60 @@ void __pagevec_lru_add(struct pagevec *p
 	pagevec_reinit(pvec);
 }
 
-EXPORT_SYMBOL(__pagevec_lru_add);
+EXPORT_SYMBOL(__pagevec_lru_add_file);
+void __pagevec_lru_add_active_file(struct pagevec *pvec)
+{
+	int i;
+	struct zone *zone = NULL;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		struct zone *pagezone = page_zone(page);
+
+		if (pagezone != zone) {
+			if (zone)
+				spin_unlock_irq(&zone->lru_lock);
+			zone = pagezone;
+			spin_lock_irq(&zone->lru_lock);
+		}
+		VM_BUG_ON(PageLRU(page));
+		SetPageLRU(page);
+		VM_BUG_ON(PageActive(page));
+		SetPageActive(page);
+		add_page_to_active_file_list(zone, page);
+	}
+	if (zone)
+		spin_unlock_irq(&zone->lru_lock);
+	release_pages(pvec->pages, pvec->nr, pvec->cold);
+	pagevec_reinit(pvec);
+}
+
+void __pagevec_lru_add_anon(struct pagevec *pvec)
+{
+	int i;
+	struct zone *zone = NULL;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		struct zone *pagezone = page_zone(page);
+
+		if (pagezone != zone) {
+			if (zone)
+				spin_unlock_irq(&zone->lru_lock);
+			zone = pagezone;
+			spin_lock_irq(&zone->lru_lock);
+		}
+		VM_BUG_ON(PageLRU(page));
+		SetPageLRU(page);
+		add_page_to_inactive_anon_list(zone, page);
+	}
+	if (zone)
+		spin_unlock_irq(&zone->lru_lock);
+	release_pages(pvec->pages, pvec->nr, pvec->cold);
+	pagevec_reinit(pvec);
+}
 
-void __pagevec_lru_add_active(struct pagevec *pvec)
+void __pagevec_lru_add_active_anon(struct pagevec *pvec)
 {
 	int i;
 	struct zone *zone = NULL;
@@ -439,7 +530,7 @@ void __pagevec_lru_add_active(struct pag
 		SetPageLRU(page);
 		VM_BUG_ON(PageActive(page));
 		SetPageActive(page);
-		add_page_to_active_list(zone, page);
+		add_page_to_active_anon_list(zone, page);
 	}
 	if (zone)
 		spin_unlock_irq(&zone->lru_lock);
Index: linux-2.6.24-rc6-mm1/mm/migrate.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/migrate.c	2008-01-07 17:30:25.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/migrate.c	2008-01-07 17:31:18.000000000 -0500
@@ -60,9 +60,15 @@ static inline void move_to_lru(struct pa
 		 * the PG_active bit is off.
 		 */
 		ClearPageActive(page);
-		lru_cache_add_active(page);
+		if (page_file_cache(page))
+			lru_cache_add_active_file(page);
+		else
+			lru_cache_add_active_anon(page);
 	} else {
-		lru_cache_add(page);
+		if (page_file_cache(page))
+			lru_cache_add_file(page);
+		else
+			lru_cache_add_anon(page);
 	}
 	put_page(page);
 }
Index: linux-2.6.24-rc6-mm1/mm/readahead.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/readahead.c	2008-01-07 11:55:09.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/readahead.c	2008-01-07 17:31:18.000000000 -0500
@@ -229,7 +229,7 @@ int do_page_cache_readahead(struct addre
  */
 unsigned long max_sane_readahead(unsigned long nr)
 {
-	return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE)
+	return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
 		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
 }
 
Index: linux-2.6.24-rc6-mm1/mm/filemap.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/filemap.c	2008-01-07 11:55:09.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/filemap.c	2008-01-07 17:31:18.000000000 -0500
@@ -34,6 +34,7 @@
 #include <linux/cpuset.h>
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
+#include <linux/mm_inline.h> /* for page_file_cache() */
 #include "internal.h"
 
 /*
@@ -493,8 +494,12 @@ int add_to_page_cache_lru(struct page *p
 				pgoff_t offset, gfp_t gfp_mask)
 {
 	int ret = add_to_page_cache(page, mapping, offset, gfp_mask);
-	if (ret == 0)
-		lru_cache_add(page);
+	if (ret == 0) {
+		if (page_file_cache(page))
+			lru_cache_add_file(page);
+		else
+			lru_cache_add_active_anon(page);
+	}
 	return ret;
 }
 
Index: linux-2.6.24-rc6-mm1/mm/vmstat.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/vmstat.c	2008-01-07 17:30:50.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/vmstat.c	2008-01-07 17:31:18.000000000 -0500
@@ -686,8 +686,10 @@ const struct seq_operations pagetypeinfo
 static const char * const vmstat_text[] = {
 	/* Zoned VM counters */
 	"nr_free_pages",
-	"nr_inactive",
-	"nr_active",
+	"nr_inactive_anon",
+	"nr_active_anon",
+	"nr_inactive_file",
+	"nr_active_file",
 	"nr_anon_pages",
 	"nr_mapped",
 	"nr_file_pages",
@@ -750,7 +752,7 @@ static void zoneinfo_show_print(struct s
 		   "\n        min      %lu"
 		   "\n        low      %lu"
 		   "\n        high     %lu"
-		   "\n        scanned  %lu (a: %lu i: %lu)"
+		   "\n        scanned  %lu (aa: %lu ia: %lu af: %lu if: %lu)"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
 		   zone_page_state(zone, NR_FREE_PAGES),
@@ -758,8 +760,10 @@ static void zoneinfo_show_print(struct s
 		   zone->pages_low,
 		   zone->pages_high,
 		   zone->pages_scanned,
-		   zone->nr_scan[LRU_ACTIVE],
-		   zone->nr_scan[LRU_INACTIVE],
+		   zone->nr_scan[LRU_ACTIVE_ANON],
+		   zone->nr_scan[LRU_INACTIVE_ANON],
+		   zone->nr_scan[LRU_ACTIVE_FILE],
+		   zone->nr_scan[LRU_INACTIVE_FILE],
 		   zone->spanned_pages,
 		   zone->present_pages);
 
Index: linux-2.6.24-rc6-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/vmscan.c	2008-01-07 17:30:50.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/vmscan.c	2008-01-07 17:32:53.000000000 -0500
@@ -71,6 +71,9 @@ struct scan_control {
 
 	int order;
 
+	/* The number of pages moved to the active list this pass. */
+	int activated;
+
 	/*
 	 * Pages that have (or should have) IO pending.  If we run into
 	 * a lot of these, we're better off waiting a little for IO to
@@ -85,7 +88,7 @@ struct scan_control {
 	unsigned long (*isolate_pages)(unsigned long nr, struct list_head *dst,
 			unsigned long *scanned, int order, int mode,
 			struct zone *z, struct mem_cgroup *mem_cont,
-			int active);
+			int active, int file);
 };
 
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
@@ -243,27 +246,6 @@ unsigned long shrink_slab(unsigned long 
 	return ret;
 }
 
-/* Called without lock on whether page is mapped, so answer is unstable */
-static inline int page_mapping_inuse(struct page *page)
-{
-	struct address_space *mapping;
-
-	/* Page is in somebody's page tables. */
-	if (page_mapped(page))
-		return 1;
-
-	/* Be more reluctant to reclaim swapcache than pagecache */
-	if (PageSwapCache(page))
-		return 1;
-
-	mapping = page_mapping(page);
-	if (!mapping)
-		return 0;
-
-	/* File is mmap'd by somebody? */
-	return mapping_mapped(mapping);
-}
-
 static inline int is_page_cache_freeable(struct page *page)
 {
 	return page_count(page) - !!PagePrivate(page) == 2;
@@ -527,8 +509,7 @@ static unsigned long shrink_page_list(st
 
 		referenced = page_referenced(page, 1, sc->mem_cgroup);
 		/* In active use or really unfreeable?  Activate it. */
-		if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
-					referenced && page_mapping_inuse(page))
+		if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && referenced)
 			goto activate_locked;
 
 #ifdef CONFIG_SWAP
@@ -559,8 +540,6 @@ static unsigned long shrink_page_list(st
 		}
 
 		if (PageDirty(page)) {
-			if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && referenced)
-				goto keep_locked;
 			if (!may_enter_fs) {
 				sc->nr_io_pages++;
 				goto keep_locked;
@@ -647,6 +626,7 @@ keep:
 	if (pagevec_count(&freed_pvec))
 		__pagevec_release_nonlru(&freed_pvec);
 	count_vm_events(PGACTIVATE, pgactivate);
+	sc->activated = pgactivate;
 	return nr_reclaimed;
 }
 
@@ -665,7 +645,7 @@ keep:
  *
  * returns 0 on success, -ve errno on failure.
  */
-int __isolate_lru_page(struct page *page, int mode)
+int __isolate_lru_page(struct page *page, int mode, int file)
 {
 	int ret = -EINVAL;
 
@@ -681,6 +661,9 @@ int __isolate_lru_page(struct page *page
 	if (mode != ISOLATE_BOTH && (!PageActive(page) != !mode))
 		return ret;
 
+	if (mode != ISOLATE_BOTH && (!page_file_cache(page) != !file))
+		return ret;
+
 	ret = -EBUSY;
 	if (likely(get_page_unless_zero(page))) {
 		/*
@@ -711,12 +694,13 @@ int __isolate_lru_page(struct page *page
  * @scanned:	The number of pages that were scanned.
  * @order:	The caller's attempted allocation order
  * @mode:	One of the LRU isolation modes
+ * @file:	True [1] if isolating file [!anon] pages
  *
  * returns how many pages were moved onto *@dst.
  */
 static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		struct list_head *src, struct list_head *dst,
-		unsigned long *scanned, int order, int mode)
+		unsigned long *scanned, int order, int mode, int file)
 {
 	unsigned long nr_taken = 0;
 	unsigned long scan;
@@ -733,7 +717,7 @@ static unsigned long isolate_lru_pages(u
 
 		VM_BUG_ON(!PageLRU(page));
 
-		switch (__isolate_lru_page(page, mode)) {
+		switch (__isolate_lru_page(page, mode, file)) {
 		case 0:
 			list_move(&page->lru, dst);
 			nr_taken++;
@@ -776,10 +760,11 @@ static unsigned long isolate_lru_pages(u
 				break;
 
 			cursor_page = pfn_to_page(pfn);
+
 			/* Check that we have not crossed a zone boundary. */
 			if (unlikely(page_zone_id(cursor_page) != zone_id))
 				continue;
-			switch (__isolate_lru_page(cursor_page, mode)) {
+			switch (__isolate_lru_page(cursor_page, mode, file)) {
 			case 0:
 				list_move(&cursor_page->lru, dst);
 				nr_taken++;
@@ -804,30 +789,37 @@ static unsigned long isolate_pages_globa
 					unsigned long *scanned, int order,
 					int mode, struct zone *z,
 					struct mem_cgroup *mem_cont,
-					int active)
+					int active, int file)
 {
+	int lru = LRU_BASE;
 	if (active)
-		return isolate_lru_pages(nr, &z->list[LRU_ACTIVE], dst,
-						scanned, order, mode);
-	else
-		return isolate_lru_pages(nr, &z->list[LRU_INACTIVE], dst,
-						scanned, order, mode);
+		lru += LRU_ACTIVE;
+	if (file)
+		lru += LRU_FILE;
+	return isolate_lru_pages(nr, &z->list[lru], dst, scanned, order,
+								mode, !!file);
 }
 
 /*
  * clear_active_flags() is a helper for shrink_active_list(), clearing
  * any active bits from the pages in the list.
  */
-static unsigned long clear_active_flags(struct list_head *page_list)
+static unsigned long clear_active_flags(struct list_head *page_list,
+					unsigned int *count)
 {
 	int nr_active = 0;
+	int lru;
 	struct page *page;
 
-	list_for_each_entry(page, page_list, lru)
+	list_for_each_entry(page, page_list, lru) {
+		lru = page_file_cache(page);
 		if (PageActive(page)) {
+			lru += LRU_ACTIVE;
 			ClearPageActive(page);
 			nr_active++;
 		}
+		count[lru]++;
+	}
 
 	return nr_active;
 }
@@ -861,12 +853,12 @@ int isolate_lru_page(struct page *page)
 
 		spin_lock_irq(&zone->lru_lock);
 		if (PageLRU(page) && get_page_unless_zero(page)) {
+			int lru = LRU_BASE;
 			ret = 0;
 			ClearPageLRU(page);
-			if (PageActive(page))
-				del_page_from_active_list(zone, page);
-			else
-				del_page_from_inactive_list(zone, page);
+
+			lru += page_file_cache(page) + !!PageActive(page);
+			del_page_from_lru_list(zone, page, lru);
 		}
 		spin_unlock_irq(&zone->lru_lock);
 	}
@@ -878,7 +870,7 @@ int isolate_lru_page(struct page *page)
  * of reclaimed pages
  */
 static unsigned long shrink_inactive_list(unsigned long max_scan,
-				struct zone *zone, struct scan_control *sc)
+			struct zone *zone, struct scan_control *sc, int file)
 {
 	LIST_HEAD(page_list);
 	struct pagevec pvec;
@@ -895,18 +887,25 @@ static unsigned long shrink_inactive_lis
 		unsigned long nr_scan;
 		unsigned long nr_freed;
 		unsigned long nr_active;
+		unsigned int count[NR_LRU_LISTS] = { 0, };
+		int mode = (sc->order > PAGE_ALLOC_COSTLY_ORDER) ?
+					ISOLATE_BOTH : ISOLATE_INACTIVE;
 
 		nr_taken = sc->isolate_pages(sc->swap_cluster_max,
-			     &page_list, &nr_scan, sc->order,
-			     (sc->order > PAGE_ALLOC_COSTLY_ORDER)?
-					     ISOLATE_BOTH : ISOLATE_INACTIVE,
-				zone, sc->mem_cgroup, 0);
-		nr_active = clear_active_flags(&page_list);
+			     &page_list, &nr_scan, sc->order, mode,
+				zone, sc->mem_cgroup, 0, file);
+		nr_active = clear_active_flags(&page_list, count);
 		__count_vm_events(PGDEACTIVATE, nr_active);
 
-		__mod_zone_page_state(zone, NR_ACTIVE, -nr_active);
-		__mod_zone_page_state(zone, NR_INACTIVE,
-						-(nr_taken - nr_active));
+		__mod_zone_page_state(zone, NR_ACTIVE_FILE,
+						-count[LRU_ACTIVE_FILE]);
+		__mod_zone_page_state(zone, NR_INACTIVE_FILE,
+						-count[LRU_INACTIVE_FILE]);
+		__mod_zone_page_state(zone, NR_ACTIVE_ANON,
+						-count[LRU_ACTIVE_ANON]);
+		__mod_zone_page_state(zone, NR_INACTIVE_ANON,
+						-count[LRU_INACTIVE_ANON]);
+
 		if (scan_global_lru(sc))
 			zone->pages_scanned += nr_scan;
 		spin_unlock_irq(&zone->lru_lock);
@@ -928,7 +927,7 @@ static unsigned long shrink_inactive_lis
 			 * The attempt at page out may have made some
 			 * of the pages active, mark them inactive again.
 			 */
-			nr_active = clear_active_flags(&page_list);
+			nr_active = clear_active_flags(&page_list, count);
 			count_vm_events(PGDEACTIVATE, nr_active);
 
 			nr_freed += shrink_page_list(&page_list, sc,
@@ -953,11 +952,20 @@ static unsigned long shrink_inactive_lis
 		 * Put back any unfreeable pages.
 		 */
 		while (!list_empty(&page_list)) {
+			int lru = LRU_BASE;
 			page = lru_to_page(&page_list);
 			VM_BUG_ON(PageLRU(page));
 			SetPageLRU(page);
 			list_del(&page->lru);
-			add_page_to_lru_list(zone, page, PageActive(page));
+			if (page_file_cache(page)) {
+				lru += LRU_FILE;
+				zone->recent_rotated_file++;
+			} else {
+				zone->recent_rotated_anon++;
+			}
+			if (PageActive(page))
+				lru += LRU_ACTIVE;
+			add_page_to_lru_list(zone, page, lru);
 			if (!pagevec_add(&pvec, page)) {
 				spin_unlock_irq(&zone->lru_lock);
 				__pagevec_release(&pvec);
@@ -988,115 +996,7 @@ static inline void note_zone_scanning_pr
 
 static inline int zone_is_near_oom(struct zone *zone)
 {
-	return zone->pages_scanned >= (zone_page_state(zone, NR_ACTIVE)
-				+ zone_page_state(zone, NR_INACTIVE))*3;
-}
-
-/*
- * Determine we should try to reclaim mapped pages.
- * This is called only when sc->mem_cgroup is NULL.
- */
-static int calc_reclaim_mapped(struct scan_control *sc, struct zone *zone,
-				int priority)
-{
-	long mapped_ratio;
-	long distress;
-	long swap_tendency;
-	long imbalance;
-	int reclaim_mapped = 0;
-	int prev_priority;
-
-	if (scan_global_lru(sc) && zone_is_near_oom(zone))
-		return 1;
-	/*
-	 * `distress' is a measure of how much trouble we're having
-	 * reclaiming pages.  0 -> no problems.  100 -> great trouble.
-	 */
-	if (scan_global_lru(sc))
-		prev_priority = zone->prev_priority;
-	else
-		prev_priority = mem_cgroup_get_reclaim_priority(sc->mem_cgroup);
-
-	distress = 100 >> min(prev_priority, priority);
-
-	/*
-	 * The point of this algorithm is to decide when to start
-	 * reclaiming mapped memory instead of just pagecache.  Work out
-	 * how much memory
-	 * is mapped.
-	 */
-	if (scan_global_lru(sc))
-		mapped_ratio = ((global_page_state(NR_FILE_MAPPED) +
-				global_page_state(NR_ANON_PAGES)) * 100) /
-					vm_total_pages;
-	else
-		mapped_ratio = mem_cgroup_calc_mapped_ratio(sc->mem_cgroup);
-
-	/*
-	 * Now decide how much we really want to unmap some pages.  The
-	 * mapped ratio is downgraded - just because there's a lot of
-	 * mapped memory doesn't necessarily mean that page reclaim
-	 * isn't succeeding.
-	 *
-	 * The distress ratio is important - we don't want to start
-	 * going oom.
-	 *
-	 * A 100% value of vm_swappiness overrides this algorithm
-	 * altogether.
-	 */
-	swap_tendency = mapped_ratio / 2 + distress + sc->swappiness;
-
-	/*
-	 * If there's huge imbalance between active and inactive
-	 * (think active 100 times larger than inactive) we should
-	 * become more permissive, or the system will take too much
-	 * cpu before it start swapping during memory pressure.
-	 * Distress is about avoiding early-oom, this is about
-	 * making swappiness graceful despite setting it to low
-	 * values.
-	 *
-	 * Avoid div by zero with nr_inactive+1, and max resulting
-	 * value is vm_total_pages.
-	 */
-	if (scan_global_lru(sc)) {
-		imbalance  = zone_page_state(zone, NR_ACTIVE);
-		imbalance /= zone_page_state(zone, NR_INACTIVE) + 1;
-	} else
-		imbalance = mem_cgroup_reclaim_imbalance(sc->mem_cgroup);
-
-	/*
-	 * Reduce the effect of imbalance if swappiness is low,
-	 * this means for a swappiness very low, the imbalance
-	 * must be much higher than 100 for this logic to make
-	 * the difference.
-	 *
-	 * Max temporary value is vm_total_pages*100.
-	 */
-	imbalance *= (vm_swappiness + 1);
-	imbalance /= 100;
-
-	/*
-	 * If not much of the ram is mapped, makes the imbalance
-	 * less relevant, it's high priority we refill the inactive
-	 * list with mapped pages only in presence of high ratio of
-	 * mapped pages.
-	 *
-	 * Max temporary value is vm_total_pages*100.
-	 */
-	imbalance *= mapped_ratio;
-	imbalance /= 100;
-
-	/* apply imbalance feedback to swap_tendency */
-	swap_tendency += imbalance;
-
-	/*
-	 * Now use this metric to decide whether to start moving mapped
-	 * memory onto the inactive list.
-	 */
-	if (swap_tendency >= 100)
-		reclaim_mapped = 1;
-
-	return reclaim_mapped;
+	return zone->pages_scanned >= (zone_lru_pages(zone) * 3);
 }
 
 /*
@@ -1116,10 +1016,8 @@ static int calc_reclaim_mapped(struct sc
  * The downside is that we have to touch page->_count against each page.
  * But we had to alter page->flags anyway.
  */
-
-
 static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
-				struct scan_control *sc, int priority)
+				struct scan_control *sc, int priority, int file)
 {
 	unsigned long pgmoved;
 	int pgdeactivate = 0;
@@ -1128,64 +1026,65 @@ static void shrink_active_list(unsigned 
 	struct list_head list[NR_LRU_LISTS];
 	struct page *page;
 	struct pagevec pvec;
-	int reclaim_mapped = 0;
-	enum lru_list l;
+	enum lru_list lru;
 
-	for_each_lru(l)
-		INIT_LIST_HEAD(&list[l]);
-
-	if (sc->may_swap)
-		reclaim_mapped = calc_reclaim_mapped(sc, zone, priority);
+	for_each_lru(lru)
+		INIT_LIST_HEAD(&list[lru]);
 
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
 	pgmoved = sc->isolate_pages(nr_pages, &l_hold, &pgscanned, sc->order,
 					ISOLATE_ACTIVE, zone,
-					sc->mem_cgroup, 1);
+					sc->mem_cgroup, 1, file);
 	/*
 	 * zone->pages_scanned is used for detect zone's oom
 	 * mem_cgroup remembers nr_scan by itself.
 	 */
 	if (scan_global_lru(sc))
 		zone->pages_scanned += pgscanned;
-
-	__mod_zone_page_state(zone, NR_ACTIVE, -pgmoved);
+	if (file)
+		__mod_zone_page_state(zone, NR_ACTIVE_FILE, -pgmoved);
+	else
+		__mod_zone_page_state(zone, NR_ACTIVE_ANON, -pgmoved);
 	spin_unlock_irq(&zone->lru_lock);
 
+	/*
+	 * For sorting active vs inactive pages, we'll use the 'anon'
+	 * elements of the local list[] array and sort out the file vs
+	 * anon pages below.
+	 */
 	while (!list_empty(&l_hold)) {
+		lru = LRU_INACTIVE_ANON;
 		cond_resched();
 		page = lru_to_page(&l_hold);
 		list_del(&page->lru);
-		if (page_mapped(page)) {
-			if (!reclaim_mapped ||
-			    (total_swap_pages == 0 && PageAnon(page)) ||
-			    page_referenced(page, 0, sc->mem_cgroup)) {
-				list_add(&page->lru, &list[LRU_ACTIVE]);
-				continue;
-			}
-		} else if (TestClearPageReferenced(page)) {
-			list_add(&page->lru, &list[LRU_ACTIVE]);
-			continue;
-		}
-		list_add(&page->lru, &list[LRU_INACTIVE]);
+		if (page_referenced(page, 0, sc->mem_cgroup))
+			lru = LRU_ACTIVE_ANON;
+		list_add(&page->lru, &list[lru]);
 	}
 
+	/*
+	 * Now put the pages back to the appropriate [file or anon] inactive
+	 * and active lists.
+	 */
 	pagevec_init(&pvec, 1);
 	pgmoved = 0;
+	lru = LRU_BASE + file * LRU_FILE;
 	spin_lock_irq(&zone->lru_lock);
-	while (!list_empty(&list[LRU_INACTIVE])) {
-		page = lru_to_page(&list[LRU_INACTIVE]);
-		prefetchw_prev_lru_page(page, &list[LRU_INACTIVE], flags);
+	while (!list_empty(&list[LRU_INACTIVE_ANON])) {
+		page = lru_to_page(&list[LRU_INACTIVE_ANON]);
+		prefetchw_prev_lru_page(page, &list[LRU_INACTIVE_ANON], flags);
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 		VM_BUG_ON(!PageActive(page));
 		ClearPageActive(page);
 
-		list_move(&page->lru, &zone->list[LRU_INACTIVE]);
+		list_move(&page->lru, &zone->list[lru]);
 		mem_cgroup_move_lists(page_get_page_cgroup(page), false);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
-			__mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
+			__mod_zone_page_state(zone, NR_INACTIVE_ANON + lru,
+								pgmoved);
 			spin_unlock_irq(&zone->lru_lock);
 			pgdeactivate += pgmoved;
 			pgmoved = 0;
@@ -1195,7 +1094,7 @@ static void shrink_active_list(unsigned 
 			spin_lock_irq(&zone->lru_lock);
 		}
 	}
-	__mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
+	__mod_zone_page_state(zone, NR_INACTIVE_ANON + lru, pgmoved);
 	pgdeactivate += pgmoved;
 	if (buffer_heads_over_limit) {
 		spin_unlock_irq(&zone->lru_lock);
@@ -1204,17 +1103,19 @@ static void shrink_active_list(unsigned 
 	}
 
 	pgmoved = 0;
-	while (!list_empty(&list[LRU_ACTIVE])) {
-		page = lru_to_page(&list[LRU_ACTIVE]);
-		prefetchw_prev_lru_page(page, &list[LRU_ACTIVE], flags);
+	lru = LRU_ACTIVE + file * LRU_FILE;
+	while (!list_empty(&list[LRU_ACTIVE_ANON])) {
+		page = lru_to_page(&list[LRU_ACTIVE_ANON]);
+		prefetchw_prev_lru_page(page, &list[LRU_ACTIVE_ANON], flags);
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 		VM_BUG_ON(!PageActive(page));
-		list_move(&page->lru, &zone->list[LRU_ACTIVE]);
+		list_move(&page->lru, &zone->list[lru]);
 		mem_cgroup_move_lists(page_get_page_cgroup(page), true);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
-			__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
+			__mod_zone_page_state(zone, NR_INACTIVE_ANON + lru,
+								pgmoved);
 			pgmoved = 0;
 			spin_unlock_irq(&zone->lru_lock);
 			if (vm_swap_full())
@@ -1223,7 +1124,12 @@ static void shrink_active_list(unsigned 
 			spin_lock_irq(&zone->lru_lock);
 		}
 	}
-	__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
+	__mod_zone_page_state(zone, NR_INACTIVE_ANON + lru, pgmoved);
+	if (file) {
+		zone->recent_rotated_file += pgmoved;
+	} else {
+		zone->recent_rotated_anon += pgmoved;
+	}
 
 	__count_zone_vm_events(PGREFILL, zone, pgscanned);
 	__count_vm_events(PGDEACTIVATE, pgdeactivate);
@@ -1234,16 +1140,82 @@ static void shrink_active_list(unsigned 
 	pagevec_release(&pvec);
 }
 
-static unsigned long shrink_list(enum lru_list l, unsigned long nr_to_scan,
+static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 	struct zone *zone, struct scan_control *sc, int priority)
 {
-	if (l == LRU_ACTIVE) {
-		shrink_active_list(nr_to_scan, zone, sc, priority);
+	int file = is_file_lru(lru);
+
+	if (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE) {
+		shrink_active_list(nr_to_scan, zone, sc, priority, file);
 		return 0;
 	}
-	return shrink_inactive_list(nr_to_scan, zone, sc);
+	return shrink_inactive_list(nr_to_scan, zone, sc, file);
+}
+
+/*
+ * The utility of the anon and file memory corresponds to the fraction
+ * of pages that were recently referenced in each category.  Pageout
+ * pressure is distributed according to the size of each set, the fraction
+ * of recently referenced pages (except used-once file pages) and the
+ * swappiness parameter.
+ *
+ * We return the relative pressures as percentages so shrink_zone can
+ * easily use them.
+ */
+static void get_scan_ratio(struct zone *zone, struct scan_control * sc,
+					unsigned long *percent)
+{
+	unsigned long anon, file;
+	unsigned long anon_prio, file_prio;
+	unsigned long rotate_sum;
+	unsigned long ap, fp;
+
+	anon  = zone_page_state(zone, NR_ACTIVE_ANON) +
+		zone_page_state(zone, NR_INACTIVE_ANON);
+	file  = zone_page_state(zone, NR_ACTIVE_FILE) +
+		zone_page_state(zone, NR_INACTIVE_FILE);
+
+	rotate_sum = zone->recent_rotated_file + zone->recent_rotated_anon;
+
+	/* Keep a floating average of RECENT references. */
+	if (unlikely(rotate_sum > min(anon, file))) {
+		spin_lock_irq(&zone->lru_lock);
+		zone->recent_rotated_file /= 2;
+		zone->recent_rotated_anon /= 2;
+		spin_unlock_irq(&zone->lru_lock);
+		rotate_sum /= 2;
+	}
+
+	/*
+	 * With swappiness at 100, anonymous and file have the same priority.
+	 * This scanning priority is essentially the inverse of IO cost.
+	 */
+	anon_prio = sc->swappiness;
+	file_prio = 200 - sc->swappiness;
+
+	/*
+	 *                  anon       recent_rotated_anon
+	 * %anon = 100 * ----------- / ------------------- * IO cost
+	 *               anon + file       rotate_sum
+	 */
+	ap = (anon_prio * anon) / (anon + file + 1);
+	ap *= rotate_sum / (zone->recent_rotated_anon + 1);
+	if (ap == 0)
+		ap = 1;
+	else if (ap > 100)
+		ap = 100;
+	percent[0] = ap;
+
+	fp = (file_prio * file) / (anon + file + 1);
+	fp *= rotate_sum / (zone->recent_rotated_file + 1);
+	if (fp == 0)
+		fp = 1;
+	else if (fp > 100)
+		fp = 100;
+	percent[1] = fp;
 }
 
+
 /*
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
@@ -1253,36 +1225,38 @@ static unsigned long shrink_zone(int pri
 	unsigned long nr[NR_LRU_LISTS];
 	unsigned long nr_to_scan;
 	unsigned long nr_reclaimed = 0;
+	unsigned long percent[2];       /* anon @ 0; file @ 1 */
 	enum lru_list l;
 
-	if (scan_global_lru(sc)) {
-		/*
-		 * Add one to nr_to_scan just to make sure that the kernel
-		 * will slowly sift through the active list.
-		 */
-		for_each_lru(l) {
+	get_scan_ratio(zone, sc, percent);
+
+	for_each_lru(l) {
+		if (scan_global_lru(sc)) {
+			int file = is_file_lru(l);
+			/*
+			 * Add one to nr_to_scan just to make sure that the
+			 * kernel will slowly sift through the active list.
+			 */
 			zone->nr_scan[l] += (zone_page_state(zone,
-					NR_INACTIVE + l)  >> priority) + 1;
-			nr[l] = zone->nr_scan[l];
+				NR_INACTIVE_ANON + l) >> priority) + 1;
+			nr[l] = zone->nr_scan[l] * percent[file] / 100;
 			if (nr[l] >= sc->swap_cluster_max)
 				zone->nr_scan[l] = 0;
 			else
 				nr[l] = 0;
+		} else {
+			/*
+			 * This reclaim occurs not because zone memory shortage
+			 * but because memory controller hits its limit.
+			 * Then, don't modify zone reclaim related data.
+			 */
+		nr[l] = mem_cgroup_calc_reclaim(sc->mem_cgroup, zone,
+							priority, l);
 		}
-	} else {
-		/*
-		 * This reclaim occurs not because zone memory shortage but
-		 * because memory controller hits its limit.
-		 * Then, don't modify zone reclaim related data.
-		 */
-		nr[LRU_ACTIVE] = mem_cgroup_calc_reclaim_active(sc->mem_cgroup,
-					zone, priority);
-
-		nr[LRU_INACTIVE] = mem_cgroup_calc_reclaim_inactive(sc->mem_cgroup,
-					zone, priority);
 	}
 
-	while (nr[LRU_ACTIVE] || nr[LRU_INACTIVE]) {
+	while (nr[LRU_ACTIVE_ANON] || nr[LRU_INACTIVE_ANON] ||
+				nr[LRU_ACTIVE_FILE] || nr[LRU_INACTIVE_FILE]) {
 		for_each_lru(l) {
 			if (nr[l]) {
 				nr_to_scan = min(nr[l],
@@ -1356,7 +1330,7 @@ static unsigned long shrink_zones(int pr
 
 	return nr_reclaimed;
 }
- 
+
 /*
  * This is the main entry point to direct page reclaim.
  *
@@ -1393,8 +1367,7 @@ static unsigned long do_try_to_free_page
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
 
-			lru_pages += zone_page_state(zone, NR_ACTIVE)
-					+ zone_page_state(zone, NR_INACTIVE);
+			lru_pages += zone_lru_pages(zone);
 		}
 	}
 
@@ -1599,8 +1572,7 @@ loop_again:
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
 
-			lru_pages += zone_page_state(zone, NR_ACTIVE)
-					+ zone_page_state(zone, NR_INACTIVE);
+			lru_pages += zone_lru_pages(zone);
 		}
 
 		/*
@@ -1644,8 +1616,7 @@ loop_again:
 			if (zone_is_all_unreclaimable(zone))
 				continue;
 			if (nr_slab == 0 && zone->pages_scanned >=
-				(zone_page_state(zone, NR_ACTIVE)
-				+ zone_page_state(zone, NR_INACTIVE)) * 6)
+						(zone_lru_pages(zone) * 6))
 					zone_set_flag(zone,
 						      ZONE_ALL_UNRECLAIMABLE);
 			/*
@@ -1700,7 +1671,7 @@ out:
 
 /*
  * The background pageout daemon, started as a kernel thread
- * from the init process. 
+ * from the init process.
  *
  * This basically trickles out pages so that we have _some_
  * free memory available even if there is no other activity
@@ -1820,17 +1791,18 @@ static unsigned long shrink_all_zones(un
 
 		for_each_lru(l) {
 			/* For pass = 0 we don't shrink the active list */
-			if (pass == 0 && l == LRU_ACTIVE)
+			if (pass == 0 &&
+				(l == LRU_ACTIVE_ANON || l == LRU_ACTIVE_FILE))
 				continue;
 
 			zone->nr_scan[l] +=
-				(zone_page_state(zone, NR_INACTIVE + l)
+				(zone_page_state(zone, NR_INACTIVE_ANON + l)
 								>> prio) + 1;
 			if (zone->nr_scan[l] >= nr_pages || pass > 3) {
 				zone->nr_scan[l] = 0;
 				nr_to_scan = min(nr_pages,
 					zone_page_state(zone,
-							NR_INACTIVE + l));
+							NR_INACTIVE_ANON + l));
 				ret += shrink_list(l, nr_to_scan, zone,
 								sc, prio);
 				if (ret >= nr_pages)
@@ -1842,9 +1814,12 @@ static unsigned long shrink_all_zones(un
 	return ret;
 }
 
-static unsigned long count_lru_pages(void)
+unsigned long global_lru_pages(void)
 {
-	return global_page_state(NR_ACTIVE) + global_page_state(NR_INACTIVE);
+	return global_page_state(NR_ACTIVE_ANON)
+		+ global_page_state(NR_ACTIVE_FILE)
+		+ global_page_state(NR_INACTIVE_ANON)
+		+ global_page_state(NR_INACTIVE_FILE);
 }
 
 /*
@@ -1872,7 +1847,7 @@ unsigned long shrink_all_memory(unsigned
 
 	current->reclaim_state = &reclaim_state;
 
-	lru_pages = count_lru_pages();
+	lru_pages = global_lru_pages();
 	nr_slab = global_page_state(NR_SLAB_RECLAIMABLE);
 	/* If slab caches are huge, it's better to hit them first */
 	while (nr_slab >= lru_pages) {
@@ -1915,7 +1890,7 @@ unsigned long shrink_all_memory(unsigned
 
 			reclaim_state.reclaimed_slab = 0;
 			shrink_slab(sc.nr_scanned, sc.gfp_mask,
-					count_lru_pages());
+					global_lru_pages());
 			ret += reclaim_state.reclaimed_slab;
 			if (ret >= nr_pages)
 				goto out;
@@ -1932,7 +1907,7 @@ unsigned long shrink_all_memory(unsigned
 	if (!ret) {
 		do {
 			reclaim_state.reclaimed_slab = 0;
-			shrink_slab(nr_pages, sc.gfp_mask, count_lru_pages());
+			shrink_slab(nr_pages, sc.gfp_mask, global_lru_pages());
 			ret += reclaim_state.reclaimed_slab;
 		} while (ret < nr_pages && reclaim_state.reclaimed_slab > 0);
 	}
Index: linux-2.6.24-rc6-mm1/mm/swap_state.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/swap_state.c	2008-01-07 17:30:25.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/swap_state.c	2008-01-07 17:31:18.000000000 -0500
@@ -300,7 +300,7 @@ struct page *read_swap_cache_async(swp_e
 			/*
 			 * Initiate read into locked page and return.
 			 */
-			lru_cache_add_active(new_page);
+			lru_cache_add_active_anon(new_page);
 			swap_readpage(NULL, new_page);
 			return new_page;
 		}
Index: linux-2.6.24-rc6-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/mmzone.h	2008-01-07 17:30:50.000000000 -0500
+++ linux-2.6.24-rc6-mm1/include/linux/mmzone.h	2008-01-07 17:31:18.000000000 -0500
@@ -80,21 +80,23 @@ struct zone_padding {
 enum zone_stat_item {
 	/* First 128 byte cacheline (assuming 64 bit words) */
 	NR_FREE_PAGES,
-	NR_INACTIVE,	/* must match order of LRU_[IN]ACTIVE */
-	NR_ACTIVE,	/*  "     "     "   "       "         */
+	NR_INACTIVE_ANON,	/* must match order of LRU_[IN]ACTIVE_* */
+	NR_ACTIVE_ANON,		/*  "     "     "   "       "           */
+	NR_INACTIVE_FILE,	/*  "     "     "   "       "           */
+	NR_ACTIVE_FILE,		/*  "     "     "   "       "           */
 	NR_ANON_PAGES,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
 			   only modified from process context */
 	NR_FILE_PAGES,
 	NR_FILE_DIRTY,
 	NR_WRITEBACK,
-	/* Second 128 byte cacheline */
 	NR_SLAB_RECLAIMABLE,
 	NR_SLAB_UNRECLAIMABLE,
 	NR_PAGETABLE,		/* used for pagetables */
 	NR_UNSTABLE_NFS,	/* NFS unstable pages */
 	NR_BOUNCE,
 	NR_VMSCAN_WRITE,
+	/* Second 128 byte cacheline */
 #ifdef CONFIG_NUMA
 	NUMA_HIT,		/* allocated in intended node */
 	NUMA_MISS,		/* allocated in non intended node */
@@ -105,13 +107,32 @@ enum zone_stat_item {
 #endif
 	NR_VM_ZONE_STAT_ITEMS };
 
+/*
+ * We do arithmetic on the LRU lists in various places in the code,
+ * so it is important to keep the active lists LRU_ACTIVE higher in
+ * the array than the corresponding inactive lists, and to keep
+ * the *_FILE lists LRU_FILE higher than the corresponding _ANON lists.
+ */
+#define LRU_BASE 0
+#define LRU_ANON LRU_BASE
+#define LRU_ACTIVE 1
+#define LRU_FILE 2
+
 enum lru_list {
-	LRU_INACTIVE,	/* must match order of NR_[IN]ACTIVE */
-	LRU_ACTIVE,	/*  "     "     "   "       "        */
+	LRU_INACTIVE_ANON = LRU_BASE,
+	LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
+	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
+	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
 	NR_LRU_LISTS };
 
 #define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)
 
+static inline int is_file_lru(enum lru_list l)
+{
+	BUILD_BUG_ON(LRU_INACTIVE_FILE != 2 || LRU_ACTIVE_FILE != 3);
+	return (l/2 == 1);
+}
+
 struct per_cpu_pages {
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
@@ -267,6 +288,10 @@ struct zone {
 	spinlock_t		lru_lock;	
 	struct list_head	list[NR_LRU_LISTS];
 	unsigned long		nr_scan[NR_LRU_LISTS];
+
+	unsigned long		recent_rotated_anon;
+	unsigned long		recent_rotated_file;
+
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	unsigned long		flags;		   /* zone flags, see below */
 
Index: linux-2.6.24-rc6-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/mm_inline.h	2008-01-07 17:30:50.000000000 -0500
+++ linux-2.6.24-rc6-mm1/include/linux/mm_inline.h	2008-01-07 17:31:18.000000000 -0500
@@ -16,59 +16,84 @@ static inline int page_file_cache(struct
 		return 0;
 
 	/* The page is page cache backed by a normal filesystem. */
-	return 2;
+	return LRU_FILE;
 }
 
 static inline void
 add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
 {
 	list_add(&page->lru, &zone->list[l]);
-	__inc_zone_state(zone, NR_INACTIVE + l);
+	__inc_zone_state(zone, NR_INACTIVE_ANON + l);
 }
 
 static inline void
 del_page_from_lru_list(struct zone *zone, struct page *page, enum lru_list l)
 {
 	list_del(&page->lru);
-	__dec_zone_state(zone, NR_INACTIVE + l);
+	__dec_zone_state(zone, NR_INACTIVE_ANON + l);
 }
 
+//TODO:  eventually these can all go away?  just use above 2 fcns?
+static inline void
+add_page_to_active_anon_list(struct zone *zone, struct page *page)
+{
+	add_page_to_lru_list(zone, page, LRU_ACTIVE_ANON);
+}
+
+static inline void
+add_page_to_inactive_anon_list(struct zone *zone, struct page *page)
+{
+	add_page_to_lru_list(zone, page, LRU_INACTIVE_ANON);
+}
+
+static inline void
+del_page_from_active_anon_list(struct zone *zone, struct page *page)
+{
+	del_page_from_lru_list(zone, page, LRU_ACTIVE_ANON);
+}
+
+static inline void
+del_page_from_inactive_anon_list(struct zone *zone, struct page *page)
+{
+	del_page_from_lru_list(zone, page, LRU_INACTIVE_ANON);
+}
 
 static inline void
-add_page_to_active_list(struct zone *zone, struct page *page)
+add_page_to_active_file_list(struct zone *zone, struct page *page)
 {
-	add_page_to_lru_list(zone, page, LRU_ACTIVE);
+	add_page_to_lru_list(zone, page, LRU_ACTIVE_FILE);
 }
 
 static inline void
-add_page_to_inactive_list(struct zone *zone, struct page *page)
+add_page_to_inactive_file_list(struct zone *zone, struct page *page)
 {
-	add_page_to_lru_list(zone, page, LRU_INACTIVE);
+	add_page_to_lru_list(zone, page, LRU_INACTIVE_FILE);
 }
 
 static inline void
-del_page_from_active_list(struct zone *zone, struct page *page)
+del_page_from_active_file_list(struct zone *zone, struct page *page)
 {
-	del_page_from_lru_list(zone, page, LRU_ACTIVE);
+	del_page_from_lru_list(zone, page, LRU_ACTIVE_FILE);
 }
 
 static inline void
-del_page_from_inactive_list(struct zone *zone, struct page *page)
+del_page_from_inactive_file_list(struct zone *zone, struct page *page)
 {
-	del_page_from_lru_list(zone, page, LRU_INACTIVE);
+	del_page_from_lru_list(zone, page, LRU_INACTIVE_FILE);
 }
 
 static inline void
 del_page_from_lru(struct zone *zone, struct page *page)
 {
-	enum lru_list l = LRU_INACTIVE;
+	enum lru_list l = LRU_INACTIVE_ANON;
 
 	list_del(&page->lru);
 	if (PageActive(page)) {
 		__ClearPageActive(page);
-		l = LRU_ACTIVE;
+		l = LRU_ACTIVE_ANON;
 	}
-	__dec_zone_state(zone, NR_INACTIVE + l);
+	l += page_file_cache(page);
+	__dec_zone_state(zone, NR_INACTIVE_ANON + l);
 }
 
 #endif
Index: linux-2.6.24-rc6-mm1/include/linux/pagevec.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/pagevec.h	2008-01-07 17:30:03.000000000 -0500
+++ linux-2.6.24-rc6-mm1/include/linux/pagevec.h	2008-01-07 17:31:18.000000000 -0500
@@ -23,8 +23,10 @@ struct pagevec {
 void __pagevec_release(struct pagevec *pvec);
 void __pagevec_release_nonlru(struct pagevec *pvec);
 void __pagevec_free(struct pagevec *pvec);
-void __pagevec_lru_add(struct pagevec *pvec);
-void __pagevec_lru_add_active(struct pagevec *pvec);
+void __pagevec_lru_add_file(struct pagevec *pvec);
+void __pagevec_lru_add_active_file(struct pagevec *pvec);
+void __pagevec_lru_add_anon(struct pagevec *pvec);
+void __pagevec_lru_add_active_anon(struct pagevec *pvec);
 void pagevec_strip(struct pagevec *pvec);
 void pagevec_swap_free(struct pagevec *pvec);
 unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
@@ -82,10 +84,16 @@ static inline void pagevec_free(struct p
 		__pagevec_free(pvec);
 }
 
-static inline void pagevec_lru_add(struct pagevec *pvec)
+static inline void pagevec_lru_add_file(struct pagevec *pvec)
 {
 	if (pagevec_count(pvec))
-		__pagevec_lru_add(pvec);
+		__pagevec_lru_add_file(pvec);
+}
+
+static inline void pagevec_lru_add_anon(struct pagevec *pvec)
+{
+	if (pagevec_count(pvec))
+		__pagevec_lru_add_anon(pvec);
 }
 
 #endif /* _LINUX_PAGEVEC_H */
Index: linux-2.6.24-rc6-mm1/include/linux/vmstat.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/vmstat.h	2008-01-07 11:55:09.000000000 -0500
+++ linux-2.6.24-rc6-mm1/include/linux/vmstat.h	2008-01-07 17:31:18.000000000 -0500
@@ -149,6 +149,16 @@ static inline unsigned long zone_page_st
 	return x;
 }
 
+extern unsigned long global_lru_pages(void);
+
+static inline unsigned long zone_lru_pages(struct zone *zone)
+{
+	return (zone_page_state(zone, NR_ACTIVE_ANON)
+		+ zone_page_state(zone, NR_ACTIVE_FILE)
+		+ zone_page_state(zone, NR_INACTIVE_ANON)
+		+ zone_page_state(zone, NR_INACTIVE_FILE));
+}
+
 #ifdef CONFIG_NUMA
 /*
  * Determine the per node value of a stat item. This function
Index: linux-2.6.24-rc6-mm1/mm/page-writeback.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/page-writeback.c	2008-01-07 11:55:09.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/page-writeback.c	2008-01-07 17:31:18.000000000 -0500
@@ -270,9 +270,7 @@ static unsigned long highmem_dirtyable_m
 		struct zone *z =
 			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
 
-		x += zone_page_state(z, NR_FREE_PAGES)
-			+ zone_page_state(z, NR_INACTIVE)
-			+ zone_page_state(z, NR_ACTIVE);
+		x += zone_page_state(z, NR_FREE_PAGES) + zone_lru_pages(z);
 	}
 	/*
 	 * Make sure that the number of highmem pages is never larger
@@ -290,9 +288,7 @@ static unsigned long determine_dirtyable
 {
 	unsigned long x;
 
-	x = global_page_state(NR_FREE_PAGES)
-		+ global_page_state(NR_INACTIVE)
-		+ global_page_state(NR_ACTIVE);
+	x = global_page_state(NR_FREE_PAGES) + global_lru_pages();
 
 	if (!vm_highmem_is_dirtyable)
 		x -= highmem_dirtyable_memory(x);
Index: linux-2.6.24-rc6-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/swap.h	2008-01-07 11:55:09.000000000 -0500
+++ linux-2.6.24-rc6-mm1/include/linux/swap.h	2008-01-07 17:31:18.000000000 -0500
@@ -171,8 +171,10 @@ extern unsigned int nr_free_pagecache_pa
 
 
 /* linux/mm/swap.c */
-extern void FASTCALL(lru_cache_add(struct page *));
-extern void FASTCALL(lru_cache_add_active(struct page *));
+extern void FASTCALL(lru_cache_add_file(struct page *));
+extern void FASTCALL(lru_cache_add_anon(struct page *));
+extern void FASTCALL(lru_cache_add_active_file(struct page *));
+extern void FASTCALL(lru_cache_add_active_anon(struct page *));
 extern void FASTCALL(activate_page(struct page *));
 extern void FASTCALL(mark_page_accessed(struct page *));
 extern void lru_add_drain(void);
@@ -185,7 +187,7 @@ extern unsigned long try_to_free_pages(s
 					gfp_t gfp_mask);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
 							gfp_t gfp_mask);
-extern int __isolate_lru_page(struct page *page, int mode);
+extern int __isolate_lru_page(struct page *page, int mode, int file);
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
Index: linux-2.6.24-rc6-mm1/include/linux/memcontrol.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/memcontrol.h	2008-01-07 11:55:09.000000000 -0500
+++ linux-2.6.24-rc6-mm1/include/linux/memcontrol.h	2008-01-07 17:32:53.000000000 -0500
@@ -42,7 +42,7 @@ extern unsigned long mem_cgroup_isolate_
 					unsigned long *scanned, int order,
 					int mode, struct zone *z,
 					struct mem_cgroup *mem_cont,
-					int active);
+					int active, int file);
 extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask);
@@ -69,10 +69,8 @@ extern void mem_cgroup_note_reclaim_prio
 extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
 							int priority);
 
-extern long mem_cgroup_calc_reclaim_active(struct mem_cgroup *mem,
-				struct zone *zone, int priority);
-extern long mem_cgroup_calc_reclaim_inactive(struct mem_cgroup *mem,
-				struct zone *zone, int priority);
+extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
+					int priority, enum lru_list lru);
 
 #else /* CONFIG_CGROUP_MEM_CONT */
 static inline void mm_init_cgroup(struct mm_struct *mm,
@@ -170,14 +168,9 @@ static inline void mem_cgroup_record_rec
 {
 }
 
-static inline long mem_cgroup_calc_reclaim_active(struct mem_cgroup *mem,
-					struct zone *zone, int priority)
-{
-	return 0;
-}
-
-static inline long mem_cgroup_calc_reclaim_inactive(struct mem_cgroup *mem,
-					struct zone *zone, int priority)
+static inline long mem_cgroup_calc_reclaim(struct mem_cgroup *mem,
+					struct zone *zone, int priority,
+					int active, int file)
 {
 	return 0;
 }
Index: linux-2.6.24-rc6-mm1/mm/memcontrol.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/memcontrol.c	2008-01-07 11:55:09.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/memcontrol.c	2008-01-07 17:32:53.000000000 -0500
@@ -30,6 +30,7 @@
 #include <linux/spinlock.h>
 #include <linux/fs.h>
 #include <linux/seq_file.h>
+#include <linux/mm_inline.h>
 
 #include <asm/uaccess.h>
 
@@ -80,22 +81,13 @@ static s64 mem_cgroup_read_stat(struct m
 /*
  * per-zone information in memory controller.
  */
-
-enum mem_cgroup_zstat_index {
-	MEM_CGROUP_ZSTAT_ACTIVE,
-	MEM_CGROUP_ZSTAT_INACTIVE,
-
-	NR_MEM_CGROUP_ZSTAT,
-};
-
 struct mem_cgroup_per_zone {
 	/*
 	 * spin_lock to protect the per cgroup LRU
 	 */
 	spinlock_t		lru_lock;
-	struct list_head	active_list;
-	struct list_head	inactive_list;
-	unsigned long count[NR_MEM_CGROUP_ZSTAT];
+	struct list_head	lists[NR_LRU_LISTS];
+	unsigned long		count[NR_LRU_LISTS];
 };
 /* Macro for accessing counter */
 #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
@@ -160,6 +152,7 @@ struct page_cgroup {
 };
 #define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache */
 #define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* page is active in this cgroup */
+#define PAGE_CGROUP_FLAG_FILE	(0x4)	/* page is file system backed */
 
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
@@ -220,7 +213,7 @@ page_cgroup_zoneinfo(struct page_cgroup 
 }
 
 static unsigned long mem_cgroup_get_all_zonestat(struct mem_cgroup *mem,
-					enum mem_cgroup_zstat_index idx)
+					enum lru_list idx)
 {
 	int nid, zid;
 	struct mem_cgroup_per_zone *mz;
@@ -346,13 +339,15 @@ static struct page_cgroup *clear_page_cg
 
 static void __mem_cgroup_remove_list(struct page_cgroup *pc)
 {
-	int from = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
+	int lru = LRU_BASE;
 	struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
 
-	if (from)
-		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE) -= 1;
-	else
-		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) -= 1;
+	if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+		lru += LRU_ACTIVE;
+	if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+		lru += LRU_FILE;
+
+	MEM_CGROUP_ZSTAT(mz, lru) -= 1;
 
 	mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, false);
 	list_del_init(&pc->lru);
@@ -360,38 +355,37 @@ static void __mem_cgroup_remove_list(str
 
 static void __mem_cgroup_add_list(struct page_cgroup *pc)
 {
-	int to = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
 	struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
+	int lru = LRU_BASE;
+
+	if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+		lru += LRU_ACTIVE;
+	if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+		lru += LRU_FILE;
+
+	MEM_CGROUP_ZSTAT(mz, lru) += 1;
+	list_add(&pc->lru, &mz->lists[lru]);
 
-	if (!to) {
-		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) += 1;
-		list_add(&pc->lru, &mz->inactive_list);
-	} else {
-		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE) += 1;
-		list_add(&pc->lru, &mz->active_list);
-	}
 	mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, true);
 }
 
 static void __mem_cgroup_move_lists(struct page_cgroup *pc, bool active)
 {
 	int from = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
+	int file = pc->flags & PAGE_CGROUP_FLAG_FILE;
+	int lru = LRU_FILE * !!file + !!from;
 	struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
 
-	if (from)
-		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE) -= 1;
-	else
-		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) -= 1;
+	MEM_CGROUP_ZSTAT(mz, lru) -= 1;
 
-	if (active) {
-		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE) += 1;
+	if (active)
 		pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
-		list_move(&pc->lru, &mz->active_list);
-	} else {
-		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) += 1;
+	else
 		pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
-		list_move(&pc->lru, &mz->inactive_list);
-	}
+
+	lru = LRU_FILE * !!file + !!active;
+	MEM_CGROUP_ZSTAT(mz, lru) += 1;
+	list_move(&pc->lru, &mz->lists[lru]);
 }
 
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
@@ -437,20 +431,6 @@ int mem_cgroup_calc_mapped_ratio(struct 
 	rss = (long)mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
 	return (int)((rss * 100L) / total);
 }
-/*
- * This function is called from vmscan.c. In page reclaiming loop. balance
- * between active and inactive list is calculated. For memory controller
- * page reclaiming, we should use using mem_cgroup's imbalance rather than
- * zone's global lru imbalance.
- */
-long mem_cgroup_reclaim_imbalance(struct mem_cgroup *mem)
-{
-	unsigned long active, inactive;
-	/* active and inactive are the number of pages. 'long' is ok.*/
-	active = mem_cgroup_get_all_zonestat(mem, MEM_CGROUP_ZSTAT_ACTIVE);
-	inactive = mem_cgroup_get_all_zonestat(mem, MEM_CGROUP_ZSTAT_INACTIVE);
-	return (long) (active / (inactive + 1));
-}
 
 /*
  * prev_priority control...this will be used in memory reclaim path.
@@ -479,29 +459,16 @@ void mem_cgroup_record_reclaim_priority(
  * (see include/linux/mmzone.h)
  */
 
-long mem_cgroup_calc_reclaim_active(struct mem_cgroup *mem,
-				   struct zone *zone, int priority)
-{
-	long nr_active;
-	int nid = zone->zone_pgdat->node_id;
-	int zid = zone_idx(zone);
-	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid);
-
-	nr_active = MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE);
-	return (nr_active >> priority);
-}
-
-long mem_cgroup_calc_reclaim_inactive(struct mem_cgroup *mem,
-					struct zone *zone, int priority)
+long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
+				int priority, enum lru_list lru)
 {
-	long nr_inactive;
+	long nr_pages;
 	int nid = zone->zone_pgdat->node_id;
 	int zid = zone_idx(zone);
 	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid);
 
-	nr_inactive = MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE);
-
-	return (nr_inactive >> priority);
+	nr_pages = MEM_CGROUP_ZSTAT(mz, lru);
+	return (nr_pages >> priority);
 }
 
 unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
@@ -509,7 +476,7 @@ unsigned long mem_cgroup_isolate_pages(u
 					unsigned long *scanned, int order,
 					int mode, struct zone *z,
 					struct mem_cgroup *mem_cont,
-					int active)
+					int active, int file)
 {
 	unsigned long nr_taken = 0;
 	struct page *page;
@@ -519,13 +486,12 @@ unsigned long mem_cgroup_isolate_pages(u
 	struct page_cgroup *pc, *tmp;
 	int nid = z->zone_pgdat->node_id;
 	int zid = zone_idx(z);
+	int lru = LRU_FILE * !!file + !!active;
 	struct mem_cgroup_per_zone *mz;
 
+	/* TODO: split file and anon LRUs - Rik */
 	mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
-	if (active)
-		src = &mz->active_list;
-	else
-		src = &mz->inactive_list;
+	src = &mz->lists[lru];
 
 
 	spin_lock(&mz->lru_lock);
@@ -539,6 +505,9 @@ unsigned long mem_cgroup_isolate_pages(u
 		if (unlikely(!PageLRU(page)))
 			continue;
 
+		/*
+		 * TODO: play better with lumpy reclaim, grabbing anything.
+		 */
 		if (PageActive(page) && !active) {
 			__mem_cgroup_move_lists(pc, true);
 			continue;
@@ -551,7 +520,7 @@ unsigned long mem_cgroup_isolate_pages(u
 		scan++;
 		list_move(&pc->lru, &pc_list);
 
-		if (__isolate_lru_page(page, mode) == 0) {
+		if (__isolate_lru_page(page, mode, file) == 0) {
 			list_move(&page->lru, dst);
 			nr_taken++;
 		}
@@ -664,6 +633,8 @@ retry:
 	pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
 	if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE)
 		pc->flags |= PAGE_CGROUP_FLAG_CACHE;
+	if (page_file_cache(page))
+		pc->flags |= PAGE_CGROUP_FLAG_FILE;
 
 	if (!page || page_cgroup_assign_new_page_cgroup(page, pc)) {
 		/*
@@ -833,18 +804,17 @@ retry:
 static void
 mem_cgroup_force_empty_list(struct mem_cgroup *mem,
 			    struct mem_cgroup_per_zone *mz,
-			    int active)
+			    int active, int file)
 {
 	struct page_cgroup *pc;
 	struct page *page;
 	int count;
 	unsigned long flags;
 	struct list_head *list;
+	int lru;
 
-	if (active)
-		list = &mz->active_list;
-	else
-		list = &mz->inactive_list;
+	lru = LRU_FILE * !!file + !!active;
+	list = &mz->lists[lru];
 
 	if (list_empty(list))
 		return;
@@ -895,10 +865,14 @@ int mem_cgroup_force_empty(struct mem_cg
 			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
 				struct mem_cgroup_per_zone *mz;
 				mz = mem_cgroup_zoneinfo(mem, node, zid);
-				/* drop all page_cgroup in active_list */
-				mem_cgroup_force_empty_list(mem, mz, 1);
-				/* drop all page_cgroup in inactive_list */
-				mem_cgroup_force_empty_list(mem, mz, 0);
+				/* drop all page_cgroup in ACTIVE_ANON */
+				mem_cgroup_force_empty_list(mem, mz, 1, 0);
+				/* drop all page_cgroup in INACTIVE_ANON */
+				mem_cgroup_force_empty_list(mem, mz, 0, 0);
+				/* drop all page_cgroup in ACTIVE_FILE */
+				mem_cgroup_force_empty_list(mem, mz, 1, 1);
+				/* drop all page_cgroup in INACTIVE_FILE */
+				mem_cgroup_force_empty_list(mem, mz, 0, 1);
 			}
 	}
 	ret = 0;
@@ -991,14 +965,21 @@ static int mem_control_stat_show(struct 
 	}
 	/* showing # of active pages */
 	{
-		unsigned long active, inactive;
+		unsigned long active_anon, inactive_anon;
+		unsigned long active_file, inactive_file;
 
-		inactive = mem_cgroup_get_all_zonestat(mem_cont,
-						MEM_CGROUP_ZSTAT_INACTIVE);
-		active = mem_cgroup_get_all_zonestat(mem_cont,
-						MEM_CGROUP_ZSTAT_ACTIVE);
-		seq_printf(m, "active %ld\n", (active) * PAGE_SIZE);
-		seq_printf(m, "inactive %ld\n", (inactive) * PAGE_SIZE);
+		inactive_anon = mem_cgroup_get_all_zonestat(mem_cont,
+						LRU_INACTIVE_ANON);
+		active_anon = mem_cgroup_get_all_zonestat(mem_cont,
+						LRU_ACTIVE_ANON);
+		inactive_file = mem_cgroup_get_all_zonestat(mem_cont,
+						LRU_INACTIVE_FILE);
+		active_file = mem_cgroup_get_all_zonestat(mem_cont,
+						LRU_ACTIVE_FILE);
+		seq_printf(m, "active_anon %ld\n", (active_anon) * PAGE_SIZE);
+		seq_printf(m, "inactive_anon %ld\n", (inactive_anon) * PAGE_SIZE);
+		seq_printf(m, "active_file %ld\n", (active_file) * PAGE_SIZE);
+		seq_printf(m, "inactive_file %ld\n", (inactive_file) * PAGE_SIZE);
 	}
 	return 0;
 }
@@ -1052,6 +1033,7 @@ static int alloc_mem_cgroup_per_zone_inf
 {
 	struct mem_cgroup_per_node *pn;
 	struct mem_cgroup_per_zone *mz;
+	int i;
 	int zone;
 	/*
 	 * This routine is called against possible nodes.
@@ -1073,8 +1055,8 @@ static int alloc_mem_cgroup_per_zone_inf
 
 	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
 		mz = &pn->zoneinfo[zone];
-		INIT_LIST_HEAD(&mz->active_list);
-		INIT_LIST_HEAD(&mz->inactive_list);
+		for (i = 0; i < NR_LRU_LISTS ; i++)
+			INIT_LIST_HEAD(&mz->lists[i]);
 		spin_lock_init(&mz->lru_lock);
 	}
 	return 0;

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [patch 06/19] SEQ replacement for anonymous pages
  2008-01-08 20:59 [patch 00/19] VM pageout scalability improvements Rik van Riel
                   ` (4 preceding siblings ...)
  2008-01-08 20:59 ` [patch 05/19] split LRU lists into anon & file sets Rik van Riel
@ 2008-01-08 20:59 ` Rik van Riel
  2008-01-08 20:59 ` [patch 07/19] (NEW) add some sanity checks to get_scan_ratio Rik van Riel
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2008-01-08 20:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm

[-- Attachment #1: rvr-03-linux-2.6-vm-anon-seq.patch --]
[-- Type: text/plain, Size: 7330 bytes --]

We avoid evicting and scanning anonymous pages for the most part, but
under some workloads we can end up with most of memory filled with
anonymous pages.  At that point, we suddenly need to clear the referenced
bits on all of memory, which can take ages on very large memory systems.

We can reduce the maximum number of pages that need to be scanned by
not taking the referenced state into account when deactivating an
anonymous page.  After all, every anonymous page starts out referenced,
so why check?

If an anonymous page gets referenced again before it reaches the end
of the inactive list, we move it back to the active list.

To keep the maximum amount of necessary work reasonable, we scale the
active to inactive ratio with the size of memory, using the formula
active:inactive ratio = sqrt(memory in GB * 10).

Kswapd CPU use now seems to scale by the amount of pageout bandwidth,
instead of by the amount of memory present in the system.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Index: linux-2.6.24-rc6-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/mm_inline.h	2008-01-02 15:55:33.000000000 -0500
+++ linux-2.6.24-rc6-mm1/include/linux/mm_inline.h	2008-01-02 16:00:39.000000000 -0500
@@ -106,4 +106,16 @@ del_page_from_lru(struct zone *zone, str
 	__dec_zone_state(zone, NR_INACTIVE_ANON + l);
 }
 
+static inline int inactive_anon_low(struct zone *zone)
+{
+	unsigned long active, inactive;
+
+	active = zone_page_state(zone, NR_ACTIVE_ANON);
+	inactive = zone_page_state(zone, NR_INACTIVE_ANON);
+
+	if (inactive * zone->inactive_ratio < active)
+		return 1;
+
+	return 0;
+}
 #endif
Index: linux-2.6.24-rc6-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/mmzone.h	2008-01-02 15:55:33.000000000 -0500
+++ linux-2.6.24-rc6-mm1/include/linux/mmzone.h	2008-01-02 16:00:39.000000000 -0500
@@ -313,6 +313,11 @@ struct zone {
 	 */
 	int prev_priority;
 
+	/*
+	 * The ratio of active to inactive pages.
+	 */
+	unsigned int inactive_ratio;
+
 
 	ZONE_PADDING(_pad2_)
 	/* Rarely used or read-mostly fields */
Index: linux-2.6.24-rc6-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/page_alloc.c	2008-01-02 15:55:33.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/page_alloc.c	2008-01-02 16:00:39.000000000 -0500
@@ -4230,6 +4230,45 @@ void setup_per_zone_pages_min(void)
 	calculate_totalreserve_pages();
 }
 
+/**
+ * setup_per_zone_inactive_ratio - called when min_free_kbytes changes.
+ *
+ * The inactive anon list should be small enough that the VM never has to
+ * do too much work, but large enough that each inactive page has a chance
+ * to be referenced again before it is swapped out.
+ *
+ * The inactive_anon ratio is the ratio of active to inactive anonymous
+ * pages.  Ie. a ratio of 3 means 3:1 or 25% of the anonymous pages are
+ * on the inactive list.
+ *
+ * total     return    max
+ * memory    value     inactive anon
+ * -------------------------------------
+ *   10MB       1         5MB
+ *  100MB       1        50MB
+ *    1GB       3       250MB
+ *   10GB      10       0.9GB
+ *  100GB      31         3GB
+ *    1TB     101        10GB
+ *   10TB     320        32GB
+ */
+void setup_per_zone_inactive_ratio(void)
+{
+	struct zone *zone;
+
+	for_each_zone(zone) {
+		unsigned int gb, ratio;
+
+		/* Zone size in gigabytes */
+		gb = zone->present_pages >> (30 - PAGE_SHIFT);
+		ratio = int_sqrt(10 * gb);
+		if (!ratio)
+			ratio = 1;
+
+		zone->inactive_ratio = ratio;
+	}
+}
+
 /*
  * Initialise min_free_kbytes.
  *
@@ -4267,6 +4306,7 @@ static int __init init_per_zone_pages_mi
 		min_free_kbytes = 65536;
 	setup_per_zone_pages_min();
 	setup_per_zone_lowmem_reserve();
+	setup_per_zone_inactive_ratio();
 	return 0;
 }
 module_init(init_per_zone_pages_min)
Index: linux-2.6.24-rc6-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/vmscan.c	2008-01-02 15:56:00.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/vmscan.c	2008-01-02 16:00:39.000000000 -0500
@@ -1019,7 +1019,7 @@ static inline int zone_is_near_oom(struc
 static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 				struct scan_control *sc, int priority, int file)
 {
-	unsigned long pgmoved;
+	unsigned long pgmoved = 0;
 	int pgdeactivate = 0;
 	unsigned long pgscanned;
 	LIST_HEAD(l_hold);	/* The pages which were snipped off */
@@ -1058,12 +1058,25 @@ static void shrink_active_list(unsigned 
 		cond_resched();
 		page = lru_to_page(&l_hold);
 		list_del(&page->lru);
-		if (page_referenced(page, 0, sc->mem_cgroup))
-			lru = LRU_ACTIVE_ANON;
+		if (page_referenced(page, 0, sc->mem_cgroup)) {
+			if (file)
+				/* Referenced file pages stay active. */
+				lru = LRU_ACTIVE_ANON;
+			else
+				/* Anonymous pages always get deactivated. */
+				pgmoved++;
+		}
 		list_add(&page->lru, &list[lru]);
 	}
 
 	/*
+	 * Count the referenced anon pages as rotated, to balance pageout
+	 * scan pressure between file and anonymous pages in get_scan_ratio.
+	 */
+	if (!file)
+		zone->recent_rotated_anon += pgmoved;
+
+	/*
 	 * Now put the pages back to the appropriate [file or anon] inactive
 	 * and active lists.
 	 */
@@ -1145,7 +1158,11 @@ static unsigned long shrink_list(enum lr
 {
 	int file = is_file_lru(lru);
 
-	if (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE) {
+	if (lru == LRU_ACTIVE_FILE) {
+		shrink_active_list(nr_to_scan, zone, sc, priority, file);
+		return 0;
+	}
+	if (lru == LRU_ACTIVE_ANON && inactive_anon_low(zone)) {
 		shrink_active_list(nr_to_scan, zone, sc, priority, file);
 		return 0;
 	}
@@ -1255,8 +1272,8 @@ static unsigned long shrink_zone(int pri
 		}
 	}
 
-	while (nr[LRU_ACTIVE_ANON] || nr[LRU_INACTIVE_ANON] ||
-				nr[LRU_ACTIVE_FILE] || nr[LRU_INACTIVE_FILE]) {
+	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
+						 nr[LRU_INACTIVE_FILE]) {
 		for_each_lru(l) {
 			if (nr[l]) {
 				nr_to_scan = min(nr[l],
@@ -1560,6 +1577,14 @@ loop_again:
 			    priority != DEF_PRIORITY)
 				continue;
 
+			/*
+			 * Do some background aging of the anon list, to give
+			 * pages a chance to be referenced before reclaiming.
+			 */
+			if (inactive_anon_low(zone))
+				shrink_active_list(SWAP_CLUSTER_MAX, zone,
+							&sc, priority, 0);
+
 			if (!zone_watermark_ok(zone, order, zone->pages_high,
 					       0, 0)) {
 				end_zone = i;
Index: linux-2.6.24-rc6-mm1/mm/vmstat.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/vmstat.c	2008-01-02 15:55:33.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/vmstat.c	2008-01-02 15:56:07.000000000 -0500
@@ -800,10 +800,12 @@ static void zoneinfo_show_print(struct s
 	seq_printf(m,
 		   "\n  all_unreclaimable: %u"
 		   "\n  prev_priority:     %i"
-		   "\n  start_pfn:         %lu",
+		   "\n  start_pfn:         %lu"
+		   "\n  inactive_ratio:    %u",
 			   zone_is_all_unreclaimable(zone),
 		   zone->prev_priority,
-		   zone->zone_start_pfn);
+		   zone->zone_start_pfn,
+		   zone->inactive_ratio);
 	seq_putc(m, '\n');
 }
 

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [patch 07/19] (NEW) add some sanity checks to get_scan_ratio
  2008-01-08 20:59 [patch 00/19] VM pageout scalability improvements Rik van Riel
                   ` (5 preceding siblings ...)
  2008-01-08 20:59 ` [patch 06/19] SEQ replacement for anonymous pages Rik van Riel
@ 2008-01-08 20:59 ` Rik van Riel
  2008-01-09  4:16   ` KAMEZAWA Hiroyuki
  2008-01-08 20:59 ` [patch 08/19] add newly swapped in pages to the inactive list Rik van Riel
                   ` (15 subsequent siblings)
  22 siblings, 1 reply; 88+ messages in thread
From: Rik van Riel @ 2008-01-08 20:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm

[-- Attachment #1: rvr-04-linux-2.6-scan-ratio-fixes.patch --]
[-- Type: text/plain, Size: 1445 bytes --]

The access ratio based scan rate determination in get_scan_ratio
works ok in most situations, but needs to be corrected in some
corner cases:
- if we run out of swap space, do not bother scanning the anon LRUs
- if we have already freed all of the page cache, we need to scan
  the anon LRUs

Signed-off-by: Rik van Riel <riel@redhat.com>

Index: linux-2.6.24-rc6-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/vmscan.c	2008-01-07 17:33:50.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/vmscan.c	2008-01-07 17:57:49.000000000 -0500
@@ -1182,7 +1182,7 @@ static unsigned long shrink_list(enum lr
 static void get_scan_ratio(struct zone *zone, struct scan_control * sc,
 					unsigned long *percent)
 {
-	unsigned long anon, file;
+	unsigned long anon, file, free;
 	unsigned long anon_prio, file_prio;
 	unsigned long rotate_sum;
 	unsigned long ap, fp;
@@ -1230,6 +1230,20 @@ static void get_scan_ratio(struct zone *
 	else if (fp > 100)
 		fp = 100;
 	percent[1] = fp;
+
+	free = zone_page_state(zone, NR_FREE_PAGES);
+
+	/*
+	 * If we have no swap space, do not bother scanning anon pages
+	 */
+	if (nr_swap_pages <= 0)
+		percent[0] = 0;
+	/*
+	 * If we already freed most file pages, scan the anon pages
+	 * regardless of the page access ratios or swappiness setting.
+	 */
+	else if (file + free <= zone->pages_high)
+		percent[0] = 100;
 }
 
 

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [patch 08/19] add newly swapped in pages to the inactive list
  2008-01-08 20:59 [patch 00/19] VM pageout scalability improvements Rik van Riel
                   ` (6 preceding siblings ...)
  2008-01-08 20:59 ` [patch 07/19] (NEW) add some sanity checks to get_scan_ratio Rik van Riel
@ 2008-01-08 20:59 ` Rik van Riel
  2008-01-08 22:28   ` Christoph Lameter
  2008-01-08 20:59 ` [patch 09/19] (NEW) more aggressively use lumpy reclaim Rik van Riel
                   ` (14 subsequent siblings)
  22 siblings, 1 reply; 88+ messages in thread
From: Rik van Riel @ 2008-01-08 20:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm

[-- Attachment #1: rvr-swapin-inactive.patch --]
[-- Type: text/plain, Size: 996 bytes --]

Swapin_readahead can read in a lot of data that the processes in
memory never need.  Adding swap cache pages to the inactive list
prevents them from putting too much pressure on the working set.

This has the potential to help the programs that are already in
memory, but it could also be a disadvantage to processes that
are trying to get swapped in.

In short, this patch needs testing.

Signed-off-by: Rik van Riel <riel@redhat.com>

Index: linux-2.6.24-rc6-mm1/mm/swap_state.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/swap_state.c	2008-01-02 12:37:38.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/swap_state.c	2008-01-02 12:37:52.000000000 -0500
@@ -300,7 +300,7 @@ struct page *read_swap_cache_async(swp_e
 			/*
 			 * Initiate read into locked page and return.
 			 */
-			lru_cache_add_active_anon(new_page);
+			lru_cache_add_anon(new_page);
 			swap_readpage(NULL, new_page);
 			return new_page;
 		}

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [patch 09/19] (NEW) more aggressively use lumpy reclaim
  2008-01-08 20:59 [patch 00/19] VM pageout scalability improvements Rik van Riel
                   ` (7 preceding siblings ...)
  2008-01-08 20:59 ` [patch 08/19] add newly swapped in pages to the inactive list Rik van Riel
@ 2008-01-08 20:59 ` Rik van Riel
  2008-01-08 22:30   ` Christoph Lameter
  2008-01-08 20:59 ` [patch 10/19] No Reclaim LRU Infrastructure Rik van Riel
                   ` (13 subsequent siblings)
  22 siblings, 1 reply; 88+ messages in thread
From: Rik van Riel @ 2008-01-08 20:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm

[-- Attachment #1: lumpy-reclaim-lower-order.patch --]
[-- Type: text/plain, Size: 2224 bytes --]

During an AIM7 run on a 16GB system, fork started failing around
32000 threads, despite the system having plenty of free swap and
15GB of pageable memory.

If normal pageout does not result in contiguous free pages for
kernel stacks, fall back to lumpy reclaim instead of failing fork
or doing excessive pageout IO.

I do not know whether this change is needed due to the extreme
stress test or because the inactive list is a smaller fraction
of system memory on huge systems.

Signed-off-by: Rik van Riel <riel@redhat.com>

Index: linux-2.6.24-rc6-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/vmscan.c	2008-01-08 12:08:03.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/vmscan.c	2008-01-08 12:21:04.000000000 -0500
@@ -870,7 +870,8 @@ int isolate_lru_page(struct page *page)
  * of reclaimed pages
  */
 static unsigned long shrink_inactive_list(unsigned long max_scan,
-			struct zone *zone, struct scan_control *sc, int file)
+			struct zone *zone, struct scan_control *sc,
+			int priority, int file)
 {
 	LIST_HEAD(page_list);
 	struct pagevec pvec;
@@ -888,8 +889,19 @@ static unsigned long shrink_inactive_lis
 		unsigned long nr_freed;
 		unsigned long nr_active;
 		unsigned int count[NR_LRU_LISTS] = { 0, };
-		int mode = (sc->order > PAGE_ALLOC_COSTLY_ORDER) ?
-					ISOLATE_BOTH : ISOLATE_INACTIVE;
+		int mode = ISOLATE_INACTIVE;
+
+		/*
+		 * If we need a large contiguous chunk of memory, or have
+		 * trouble getting a small set of contiguous pages, we
+		 * will reclaim both active and inactive pages.
+		 *
+		 * We use the same threshold as pageout congestion_wait below.
+		 */
+		if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+			mode = ISOLATE_BOTH;
+		else if (sc->order && priority < DEF_PRIORITY - 2)
+			mode = ISOLATE_BOTH;
 
 		nr_taken = sc->isolate_pages(sc->swap_cluster_max,
 			     &page_list, &nr_scan, sc->order, mode,
@@ -1166,7 +1178,7 @@ static unsigned long shrink_list(enum lr
 		shrink_active_list(nr_to_scan, zone, sc, priority, file);
 		return 0;
 	}
-	return shrink_inactive_list(nr_to_scan, zone, sc, file);
+	return shrink_inactive_list(nr_to_scan, zone, sc, priority, file);
 }
 
 /*

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [patch 10/19] No Reclaim LRU Infrastructure
  2008-01-08 20:59 [patch 00/19] VM pageout scalability improvements Rik van Riel
                   ` (8 preceding siblings ...)
  2008-01-08 20:59 ` [patch 09/19] (NEW) more aggressively use lumpy reclaim Rik van Riel
@ 2008-01-08 20:59 ` Rik van Riel
  2008-01-11  4:36   ` KOSAKI Motohiro
  2008-01-08 20:59 ` [patch 11/19] Non-reclaimable page statistics Rik van Riel
                   ` (12 subsequent siblings)
  22 siblings, 1 reply; 88+ messages in thread
From: Rik van Riel @ 2008-01-08 20:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, Lee Schermerhorn

[-- Attachment #1: noreclaim-01.1-no-reclaim-infrastructure.patch --]
[-- Type: text/plain, Size: 26748 bytes --]

V1 -> V3:
+ rebase to 23-mm1 atop RvR's split LRU series
+ define NR_NORECLAIM and LRU_NORECLAIM to avoid errors when not
  configured.

V1 -> V2:
+  handle review comments -- various typos and errors.
+  extract "putback_all_noreclaim_pages()" into a separate patch
   and rework as "scan_all_zones_noreclaim_pages().

Infrastructure to manage pages excluded from reclaim--i.e., hidden
from vmscan.  Based on a patch by Larry Woodman of Red Hat. Reworked
to maintain "nonreclaimable" pages on a separate per-zone LRU list,
to "hide" them from vmscan.  A separate noreclaim pagevec is provided
for shrink_active_list() to move nonreclaimable pages to the noreclaim
list without over burdening the zone lru_lock.

Pages on the noreclaim list have both PG_noreclaim and PG_lru set.
Thus, PG_noreclaim is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.  

The noreclaim infrastructure is enabled by a new mm Kconfig option
[CONFIG_]NORECLAIM.

A new function 'page_reclaimable(page, vma)' in vmscan.c tests whether
or not a page is reclaimable.  Subsequent patches will add the various
!reclaimable tests.  We'll want to keep these tests light-weight for
use in shrink_active_list() and, possibly, the fault path.

Notes:

1.  for now, use bit 30 in page flags.  This restricts the no reclaim
    infrastructure to 64-bit systems.  [The mlock patch, later in this
    series, uses another of these 64-bit-system-only flags.]

    Rationale:  32-bit systems have no free page flags and are less
    likely to have the large amounts of memory that exhibit the problems
    this series attempts to solve.  [I'm sure someone will disabuse me
    of this notion.]

    Thus, NORECLAIM currently depends on [CONFIG_]64BIT.

2.  The pagevec to move pages to the noreclaim list results in another
    loop at the end of shrink_active_list().  If we ultimately adopt Rik
    van Riel's split lru approach, I think we'll need to find a way to
    factor all of these loops into some common code.

3.  TODO:  Memory Controllers maintain separate active and inactive lists.
    Need to consider whether they should also maintain a noreclaim list.  
    Also, convert to use Christoph's array of indexed lru variables?

    See //TODO note in mm/memcontrol.c re:  isolating non-reclaimable
    pages. 

4.  TODO:  more factoring of lru list handling.  But, I want to get this
    as close to functionally correct as possible before introducing those
    perturbations.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.24-rc6-mm1/mm/Kconfig
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/Kconfig	2008-01-08 12:08:03.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/Kconfig	2008-01-08 12:17:10.000000000 -0500
@@ -193,3 +193,13 @@ config NR_QUICK
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config NORECLAIM
+	bool "Track non-reclaimable pages (EXPERIMENTAL; 64BIT only)"
+	depends on EXPERIMENTAL && 64BIT
+	help
+	  Supports tracking of non-reclaimable pages off the [in]active lists
+	  to avoid excessive reclaim overhead on large memory systems.  Pages
+	  may be non-reclaimable because:  they are locked into memory, they
+	  are anonymous pages for which no swap space exists, or they are anon
+	  pages that are expensive to unmap [long anon_vma "related vma" list.]
Index: linux-2.6.24-rc6-mm1/include/linux/page-flags.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/page-flags.h	2008-01-08 12:08:03.000000000 -0500
+++ linux-2.6.24-rc6-mm1/include/linux/page-flags.h	2008-01-08 12:17:10.000000000 -0500
@@ -94,6 +94,7 @@
 /* PG_readahead is only used for file reads; PG_reclaim is only for writes */
 #define PG_readahead		PG_reclaim /* Reminder to do async read-ahead */
 
+
 /* PG_owner_priv_1 users should have descriptive aliases */
 #define PG_checked		PG_owner_priv_1 /* Used by some filesystems */
 #define PG_pinned		PG_owner_priv_1	/* Xen pinned pagetable */
@@ -107,6 +108,8 @@
  *         63                            32                              0
  */
 #define PG_uncached		31	/* Page has been mapped as uncached */
+
+#define PG_noreclaim		30	/* Page is "non-reclaimable"  */
 #endif
 
 /*
@@ -160,6 +163,7 @@ static inline void SetPageUptodate(struc
 #define SetPageActive(page)	set_bit(PG_active, &(page)->flags)
 #define ClearPageActive(page)	clear_bit(PG_active, &(page)->flags)
 #define __ClearPageActive(page)	__clear_bit(PG_active, &(page)->flags)
+#define TestClearPageActive(page) test_and_clear_bit(PG_active, &(page)->flags)
 
 #define PageSlab(page)		test_bit(PG_slab, &(page)->flags)
 #define __SetPageSlab(page)	__set_bit(PG_slab, &(page)->flags)
@@ -261,6 +265,21 @@ static inline void __ClearPageTail(struc
 #define PageSwapCache(page)	0
 #endif
 
+#ifdef CONFIG_NORECLAIM
+#define PageNoreclaim(page)	test_bit(PG_noreclaim, &(page)->flags)
+#define SetPageNoreclaim(page)	set_bit(PG_noreclaim, &(page)->flags)
+#define ClearPageNoreclaim(page) clear_bit(PG_noreclaim, &(page)->flags)
+#define __ClearPageNoreclaim(page) __clear_bit(PG_noreclaim, &(page)->flags)
+#define TestClearPageNoreclaim(page) test_and_clear_bit(PG_noreclaim, \
+							 &(page)->flags)
+#else
+#define PageNoreclaim(page)	0
+#define SetPageNoreclaim(page)
+#define ClearPageNoreclaim(page)
+#define __ClearPageNoreclaim(page)
+#define TestClearPageNoreclaim(page) 0
+#endif
+
 #define PageUncached(page)	test_bit(PG_uncached, &(page)->flags)
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
Index: linux-2.6.24-rc6-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/mmzone.h	2008-01-08 12:08:03.000000000 -0500
+++ linux-2.6.24-rc6-mm1/include/linux/mmzone.h	2008-01-08 12:17:10.000000000 -0500
@@ -84,6 +84,11 @@ enum zone_stat_item {
 	NR_ACTIVE_ANON,		/*  "     "     "   "       "           */
 	NR_INACTIVE_FILE,	/*  "     "     "   "       "           */
 	NR_ACTIVE_FILE,		/*  "     "     "   "       "           */
+#ifdef CONFIG_NORECLAIM
+	NR_NORECLAIM,	/*  "     "     "   "       "         */
+#else
+	NR_NORECLAIM=NR_ACTIVE_FILE, /* avoid compiler errors in dead code */
+#endif
 	NR_ANON_PAGES,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
 			   only modified from process context */
@@ -123,10 +128,18 @@ enum lru_list {
 	LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
 	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
 	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
-	NR_LRU_LISTS };
+#ifdef CONFIG_NORECLAIM
+	LRU_NORECLAIM,
+#else
+	LRU_NORECLAIM=LRU_ACTIVE_FILE,	/* avoid compiler errors in dead code */
+#endif
+	NR_LRU_LISTS
+};
 
 #define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)
 
+#define for_each_reclaimable_lru(l) for (l = 0; l <= LRU_ACTIVE_FILE; l++)
+
 static inline int is_file_lru(enum lru_list l)
 {
 	BUILD_BUG_ON(LRU_INACTIVE_FILE != 2 || LRU_ACTIVE_FILE != 3);
Index: linux-2.6.24-rc6-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/page_alloc.c	2008-01-08 12:08:03.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/page_alloc.c	2008-01-08 12:17:10.000000000 -0500
@@ -248,6 +248,9 @@ static void bad_page(struct page *page)
 			1 << PG_private |
 			1 << PG_locked	|
 			1 << PG_active	|
+#ifdef CONFIG_NORECLAIM
+			1 << PG_noreclaim	|
+#endif
 			1 << PG_dirty	|
 			1 << PG_reclaim |
 			1 << PG_slab    |
@@ -482,6 +485,9 @@ static inline int free_pages_check(struc
 			1 << PG_swapcache |
 			1 << PG_writeback |
 			1 << PG_reserved |
+#ifdef CONFIG_NORECLAIM
+			1 << PG_noreclaim |
+#endif
 			1 << PG_buddy ))))
 		bad_page(page);
 	if (PageDirty(page))
@@ -629,6 +635,9 @@ static int prep_new_page(struct page *pa
 			1 << PG_private	|
 			1 << PG_locked	|
 			1 << PG_active	|
+#ifdef CONFIG_NORECLAIM
+			1 << PG_noreclaim	|
+#endif
 			1 << PG_dirty	|
 			1 << PG_slab    |
 			1 << PG_swapcache |
Index: linux-2.6.24-rc6-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/mm_inline.h	2008-01-08 12:08:03.000000000 -0500
+++ linux-2.6.24-rc6-mm1/include/linux/mm_inline.h	2008-01-08 12:17:10.000000000 -0500
@@ -82,13 +82,36 @@ del_page_from_inactive_file_list(struct 
 	del_page_from_lru_list(zone, page, LRU_INACTIVE_FILE);
 }
 
+#ifdef CONFIG_NORECLAIM
+static inline void
+add_page_to_noreclaim_list(struct zone *zone, struct page *page)
+{
+	add_page_to_lru_list(zone, page, LRU_NORECLAIM);
+}
+
+static inline void
+del_page_from_noreclaim_list(struct zone *zone, struct page *page)
+{
+	del_page_from_lru_list(zone, page, LRU_NORECLAIM);
+}
+#else
+static inline void
+add_page_to_noreclaim_list(struct zone *zone, struct page *page) { }
+
+static inline void
+del_page_from_noreclaim_list(struct zone *zone, struct page *page) { }
+#endif
+
 static inline void
 del_page_from_lru(struct zone *zone, struct page *page)
 {
 	enum lru_list l = LRU_INACTIVE_ANON;
 
 	list_del(&page->lru);
-	if (PageActive(page)) {
+	if (PageNoreclaim(page)) {
+		__ClearPageNoreclaim(page);
+		l = LRU_NORECLAIM;
+	} else if (PageActive(page)) {
 		__ClearPageActive(page);
 		l = LRU_ACTIVE_ANON;
 	}
Index: linux-2.6.24-rc6-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/swap.h	2008-01-08 12:08:03.000000000 -0500
+++ linux-2.6.24-rc6-mm1/include/linux/swap.h	2008-01-08 12:17:10.000000000 -0500
@@ -175,6 +175,13 @@ extern void FASTCALL(lru_cache_add_file(
 extern void FASTCALL(lru_cache_add_anon(struct page *));
 extern void FASTCALL(lru_cache_add_active_file(struct page *));
 extern void FASTCALL(lru_cache_add_active_anon(struct page *));
+extern void FASTCALL(lru_cache_add_active_or_noreclaim(struct page *page,
+						struct vm_area_struct *vma));
+#ifdef CONFIG_NORECLAIM
+extern void FASTCALL(lru_cache_add_noreclaim(struct page *page));
+#else
+static inline void lru_cache_add_noreclaim(struct page *page) { }
+#endif
 extern void FASTCALL(activate_page(struct page *));
 extern void FASTCALL(mark_page_accessed(struct page *));
 extern void lru_add_drain(void);
@@ -206,6 +213,16 @@ static inline int zone_reclaim(struct zo
 }
 #endif
 
+#ifdef CONFIG_NORECLAIM
+extern int page_reclaimable(struct page *page, struct vm_area_struct *vma);
+#else
+static inline int page_reclaimable(struct page *page,
+						struct vm_area_struct *vma)
+{
+	return 1;
+}
+#endif
+
 extern int kswapd_run(int nid);
 
 #ifdef CONFIG_MMU
Index: linux-2.6.24-rc6-mm1/include/linux/pagevec.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/pagevec.h	2008-01-08 12:08:03.000000000 -0500
+++ linux-2.6.24-rc6-mm1/include/linux/pagevec.h	2008-01-08 12:17:10.000000000 -0500
@@ -27,6 +27,11 @@ void __pagevec_lru_add_file(struct pagev
 void __pagevec_lru_add_active_file(struct pagevec *pvec);
 void __pagevec_lru_add_anon(struct pagevec *pvec);
 void __pagevec_lru_add_active_anon(struct pagevec *pvec);
+#ifdef CONFIG_NORECLAIM
+void __pagevec_lru_add_noreclaim(struct pagevec *pvec);
+#else
+static inline void __pagevec_lru_add_noreclaim(struct pagevec *pvec) { }
+#endif
 void pagevec_strip(struct pagevec *pvec);
 void pagevec_swap_free(struct pagevec *pvec);
 unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
Index: linux-2.6.24-rc6-mm1/mm/swap.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/swap.c	2008-01-08 12:08:03.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/swap.c	2008-01-08 12:17:10.000000000 -0500
@@ -119,7 +119,8 @@ static void pagevec_move_tail(struct pag
 			zone = pagezone;
 			spin_lock(&zone->lru_lock);
 		}
-		if (PageLRU(page) && !PageActive(page)) {
+	 	if (PageLRU(page) && !PageActive(page) && \
+					!PageNoreclaim(page)) {
 			if (page_file_cache(page)) {
 				list_move_tail(&page->lru,
 						&zone->list[LRU_INACTIVE_FILE]);
@@ -153,7 +154,7 @@ int rotate_reclaimable_page(struct page 
 		return 1;
 	if (PageDirty(page))
 		return 1;
-	if (PageActive(page))
+	if (PageActive(page) || PageNoreclaim(page))
 		return 1;
 	if (!PageLRU(page))
 		return 1;
@@ -179,7 +180,7 @@ void activate_page(struct page *page)
 	struct zone *zone = page_zone(page);
 
 	spin_lock_irq(&zone->lru_lock);
-	if (PageLRU(page) && !PageActive(page)) {
+	if (PageLRU(page) && !PageActive(page) && !PageNoreclaim(page)) {
 		int lru = LRU_BASE;
 		lru += page_file_cache(page);
 		del_page_from_lru_list(zone, page, lru);
@@ -202,7 +203,8 @@ void activate_page(struct page *page)
  */
 void mark_page_accessed(struct page *page)
 {
-	if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
+	if (!PageActive(page) && !PageNoreclaim(page) &&
+			PageReferenced(page) && PageLRU(page)) {
 		activate_page(page);
 		ClearPageReferenced(page);
 	} else if (!PageReferenced(page)) {
@@ -256,6 +258,50 @@ void lru_cache_add_active_file(struct pa
 	put_cpu_var(lru_add_active_file_pvecs);
 }
 
+#ifdef CONFIG_NORECLAIM
+static DEFINE_PER_CPU(struct pagevec, lru_add_noreclaim_pvecs) = { 0, };
+
+void fastcall lru_cache_add_noreclaim(struct page *page)
+{
+	struct pagevec *pvec = &get_cpu_var(lru_add_noreclaim_pvecs);
+
+	page_cache_get(page);
+	if (!pagevec_add(pvec, page))
+		__pagevec_lru_add_noreclaim(pvec);
+	put_cpu_var(lru_add_noreclaim_pvecs);
+}
+
+void fastcall lru_cache_add_active_or_noreclaim(struct page *page,
+					struct vm_area_struct *vma)
+{
+	if (page_reclaimable(page, vma)) {
+		if (page_file_cache(page))
+			lru_cache_add_active_file(page);
+		else
+			lru_cache_add_active_anon(page);
+	} else
+		lru_cache_add_noreclaim(page);
+}
+
+static inline void __drain_noreclaim_pvec(struct pagevec **pvec, int cpu)
+{
+	*pvec = &per_cpu(lru_add_noreclaim_pvecs, cpu);
+	if (pagevec_count(*pvec))
+		__pagevec_lru_add_noreclaim(*pvec);
+}
+#else
+void fastcall lru_cache_add_active_or_noreclaim(struct page *page,
+					struct vm_area_struct *vma)
+{
+	if (page_file_cache(page))
+		lru_cache_add_active_file(page);
+	else
+		lru_cache_add_active_anon(page);
+}
+
+static inline void __drain_noreclaim_pvec(struct pagevec **pvec, int cpu) { }
+#endif
+
 /*
  * Drain pages out of the cpu's pagevecs.
  * Either "cpu" is the current CPU, and preemption has already been
@@ -290,6 +336,8 @@ static void drain_cpu_pagevecs(int cpu)
 		pagevec_move_tail(pvec);
 		local_irq_restore(flags);
 	}
+
+	__drain_noreclaim_pvec(&pvec, cpu);
 }
 
 void lru_add_drain(void)
@@ -361,6 +409,8 @@ void release_pages(struct page **pages, 
 
 		if (PageLRU(page)) {
 			struct zone *pagezone = page_zone(page);
+			int is_lru_page;
+
 			if (pagezone != zone) {
 				if (zone)
 					spin_unlock_irqrestore(&zone->lru_lock,
@@ -368,8 +418,10 @@ void release_pages(struct page **pages, 
 				zone = pagezone;
 				spin_lock_irqsave(&zone->lru_lock, flags);
 			}
-			VM_BUG_ON(!PageLRU(page));
-			__ClearPageLRU(page);
+			is_lru_page = PageLRU(page);
+			VM_BUG_ON(!(is_lru_page));
+			if (is_lru_page)
+				__ClearPageLRU(page);
 			del_page_from_lru(zone, page);
 		}
 
@@ -448,6 +500,7 @@ void __pagevec_lru_add_file(struct pagev
 			zone = pagezone;
 			spin_lock_irq(&zone->lru_lock);
 		}
+		VM_BUG_ON(PageActive(page) || PageNoreclaim(page));
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 		add_page_to_inactive_file_list(zone, page);
@@ -476,7 +529,7 @@ void __pagevec_lru_add_active_file(struc
 		}
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
-		VM_BUG_ON(PageActive(page));
+		VM_BUG_ON(PageActive(page) || PageNoreclaim(page));
 		SetPageActive(page);
 		add_page_to_active_file_list(zone, page);
 	}
@@ -538,6 +591,35 @@ void __pagevec_lru_add_active_anon(struc
 	pagevec_reinit(pvec);
 }
 
+#ifdef CONFIG_NORECLAIM
+void __pagevec_lru_add_noreclaim(struct pagevec *pvec)
+{
+	int i;
+	struct zone *zone = NULL;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		struct zone *pagezone = page_zone(page);
+
+		if (pagezone != zone) {
+			if (zone)
+				spin_unlock_irq(&zone->lru_lock);
+			zone = pagezone;
+			spin_lock_irq(&zone->lru_lock);
+		}
+		VM_BUG_ON(PageLRU(page));
+		SetPageLRU(page);
+		VM_BUG_ON(PageActive(page) || PageNoreclaim(page));
+		SetPageNoreclaim(page);
+		add_page_to_noreclaim_list(zone, page);
+	}
+	if (zone)
+		spin_unlock_irq(&zone->lru_lock);
+	release_pages(pvec->pages, pvec->nr, pvec->cold);
+	pagevec_reinit(pvec);
+}
+#endif
+
 /*
  * Try to drop buffers from the pages in a pagevec
  */
Index: linux-2.6.24-rc6-mm1/mm/migrate.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/migrate.c	2008-01-08 12:08:03.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/migrate.c	2008-01-08 12:17:10.000000000 -0500
@@ -52,9 +52,18 @@ int migrate_prep(void)
 	return 0;
 }
 
+/*
+ * move_to_lru() - place @page onto appropriate lru list
+ * based on preserved page flags:  active, noreclaim, none
+ */
 static inline void move_to_lru(struct page *page)
 {
-	if (PageActive(page)) {
+	if (PageNoreclaim(page)) {
+		VM_BUG_ON(PageActive(page));
+		ClearPageNoreclaim(page);
+		lru_cache_add_noreclaim(page);
+	} else if (PageActive(page)) {
+		VM_BUG_ON(PageNoreclaim(page));	/* race ? */
 		/*
 		 * lru_cache_add_active checks that
 		 * the PG_active bit is off.
@@ -65,6 +74,7 @@ static inline void move_to_lru(struct pa
 		else
 			lru_cache_add_active_anon(page);
 	} else {
+		VM_BUG_ON(PageNoreclaim(page));	/* race ? */
 		if (page_file_cache(page))
 			lru_cache_add_file(page);
 		else
@@ -341,8 +351,11 @@ static void migrate_page_copy(struct pag
 		SetPageReferenced(newpage);
 	if (PageUptodate(page))
 		SetPageUptodate(newpage);
-	if (PageActive(page))
+	if (TestClearPageActive(page)) {
+		VM_BUG_ON(PageNoreclaim(page));
 		SetPageActive(newpage);
+	} else if (TestClearPageNoreclaim(page))
+		SetPageNoreclaim(newpage);
 	if (PageChecked(page))
 		SetPageChecked(newpage);
 	if (PageMappedToDisk(page))
@@ -356,7 +369,6 @@ static void migrate_page_copy(struct pag
 #ifdef CONFIG_SWAP
 	ClearPageSwapCache(page);
 #endif
-	ClearPageActive(page);
 	ClearPagePrivate(page);
 	set_page_private(page, 0);
 	page->mapping = NULL;
Index: linux-2.6.24-rc6-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/vmscan.c	2008-01-08 12:14:30.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/vmscan.c	2008-01-08 12:17:10.000000000 -0500
@@ -480,6 +480,11 @@ static unsigned long shrink_page_list(st
 
 		sc->nr_scanned++;
 
+		if (!page_reclaimable(page, NULL)) {
+			SetPageNoreclaim(page);
+			goto keep_locked;
+		}
+
 		if (!sc->may_swap && page_mapped(page))
 			goto keep_locked;
 
@@ -582,7 +587,7 @@ static unsigned long shrink_page_list(st
 		 * possible for a page to have PageDirty set, but it is actually
 		 * clean (all its buffers are clean).  This happens if the
 		 * buffers were written out directly, with submit_bh(). ext3
-		 * will do this, as well as the blockdev mapping. 
+		 * will do this, as well as the blockdev mapping.
 		 * try_to_release_page() will discover that cleanness and will
 		 * drop the buffers and mark the page clean - it can be freed.
 		 *
@@ -614,6 +619,7 @@ activate_locked:
 		/* Not a candidate for swapping, so reclaim swap space. */
 		if (PageSwapCache(page) && vm_swap_full())
 			remove_exclusive_swap_page(page);
+		VM_BUG_ON(PageActive(page));
 		SetPageActive(page);
 		pgactivate++;
 keep_locked:
@@ -664,6 +670,14 @@ int __isolate_lru_page(struct page *page
 	if (mode != ISOLATE_BOTH && (!page_file_cache(page) != !file))
 		return ret;
 
+	/*
+	 * Non-reclaimable pages shouldn't make it onto either the active
+	 * nor the inactive list. However, when doing lumpy reclaim of
+	 * higher order pages we can still run into them.
+	 */
+	if (PageNoreclaim(page))
+		return ret;
+
 	ret = -EBUSY;
 	if (likely(get_page_unless_zero(page))) {
 		/*
@@ -775,7 +789,7 @@ static unsigned long isolate_lru_pages(u
 				/* else it is being freed elsewhere */
 				list_move(&cursor_page->lru, src);
 			default:
-				break;
+				break;	/* ! on LRU or wrong list */
 			}
 		}
 	}
@@ -831,9 +845,10 @@ static unsigned long clear_active_flags(
  * refcount on the page, which is a fundamentnal difference from
  * isolate_lru_pages (which is called without a stable reference).
  *
- * The returned page will have PageLru() cleared, and PageActive set,
- * if it was found on the active list. This flag generally will need to be
- * cleared by the caller before letting the page go.
+ * The returned page will have the PageLru() cleared, and the PageActive or
+ * PageNoreclaim will be set, if it was found on the active or noreclaim list,
+ * respectively. This flag generally will need to be cleared by the caller
+ * before letting the page go.
  *
  * The vmstat page counts corresponding to the list on which the page was
  * found will be decremented.
@@ -857,7 +872,13 @@ int isolate_lru_page(struct page *page)
 			ret = 0;
 			ClearPageLRU(page);
 
+			/* Calculate the LRU list for normal pages ... */
 			lru += page_file_cache(page) + !!PageActive(page);
+
+			/* ... except NoReclaim, which has its own list. */
+			if (PageNoreclaim(page))
+				lru = LRU_NORECLAIM;
+
 			del_page_from_lru_list(zone, page, lru);
 		}
 		spin_unlock_irq(&zone->lru_lock);
@@ -967,14 +988,19 @@ static unsigned long shrink_inactive_lis
 			VM_BUG_ON(PageLRU(page));
 			SetPageLRU(page);
 			list_del(&page->lru);
-			if (page_file_cache(page)) {
-				lru += LRU_FILE;
-				zone->recent_rotated_file++;
+			if (PageNoreclaim(page)) {
+				VM_BUG_ON(PageActive(page));
+				lru = LRU_NORECLAIM;
 			} else {
-				zone->recent_rotated_anon++;
+				if (page_file_cache(page)) {
+					lru += LRU_FILE;
+					zone->recent_rotated_file++;
+				} else {
+					zone->recent_rotated_anon++;
+				}
+				if (PageActive(page))
+					lru += LRU_ACTIVE;
 			}
-			if (PageActive(page))
-				lru += LRU_ACTIVE;
 			add_page_to_lru_list(zone, page, lru);
 			if (!pagevec_add(&pvec, page)) {
 				spin_unlock_irq(&zone->lru_lock);
@@ -1068,6 +1094,13 @@ static void shrink_active_list(unsigned 
 		cond_resched();
 		page = lru_to_page(&l_hold);
 		list_del(&page->lru);
+
+		if (!page_reclaimable(page, NULL)) {
+			/* Non-reclaimable pages go onto their own list. */
+			list_add(&page->lru, &list[LRU_NORECLAIM]);
+			continue;
+		}
+
 		if (page_referenced(page, 0, sc->mem_cgroup)) {
 			if (file)
 				/* Referenced file pages stay active. */
@@ -1154,6 +1187,33 @@ static void shrink_active_list(unsigned 
 		zone->recent_rotated_anon += pgmoved;
 	}
 
+#ifdef CONFIG_NORECLAIM
+	pgmoved = 0;
+	while (!list_empty(&list[LRU_NORECLAIM])) {
+		page = lru_to_page(&list[LRU_NORECLAIM]);
+		prefetchw_prev_lru_page(page, &list[LRU_NORECLAIM], flags);
+
+		VM_BUG_ON(PageLRU(page));
+		SetPageLRU(page);
+		VM_BUG_ON(!PageActive(page));
+		ClearPageActive(page);
+		VM_BUG_ON(PageNoreclaim(page));
+		SetPageNoreclaim(page);
+
+		list_move(&page->lru, &zone->list[LRU_NORECLAIM]);
+		pgmoved++;
+		if (!pagevec_add(&pvec, page)) {
+			__mod_zone_page_state(zone, NR_NORECLAIM, pgmoved);
+//TODO:  count these as deactivations?
+			pgmoved = 0;
+			spin_unlock_irq(&zone->lru_lock);
+			__pagevec_release(&pvec);
+			spin_lock_irq(&zone->lru_lock);
+		}
+	}
+	__mod_zone_page_state(zone, NR_NORECLAIM, pgmoved);
+#endif
+
 	__count_zone_vm_events(PGREFILL, zone, pgscanned);
 	__count_vm_events(PGDEACTIVATE, pgdeactivate);
 	spin_unlock_irq(&zone->lru_lock);
@@ -1271,7 +1331,7 @@ static unsigned long shrink_zone(int pri
 
 	get_scan_ratio(zone, sc, percent);
 
-	for_each_lru(l) {
+	for_each_reclaimable_lru(l) {
 		if (scan_global_lru(sc)) {
 			int file = is_file_lru(l);
 			/*
@@ -1297,8 +1357,8 @@ static unsigned long shrink_zone(int pri
 	}
 
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
-						 nr[LRU_INACTIVE_FILE]) {
-		for_each_lru(l) {
+					nr[LRU_INACTIVE_FILE]) {
+		for_each_reclaimable_lru(l) {
 			if (nr[l]) {
 				nr_to_scan = min(nr[l],
 					(unsigned long)sc->swap_cluster_max);
@@ -1838,8 +1898,8 @@ static unsigned long shrink_all_zones(un
 		if (zone_is_all_unreclaimable(zone) && prio != DEF_PRIORITY)
 			continue;
 
-		for_each_lru(l) {
-			/* For pass = 0 we don't shrink the active list */
+		for_each_reclaimable_lru(l) {
+			/* For pass = 0, we don't shrink the active list */
 			if (pass == 0 &&
 				(l == LRU_ACTIVE_ANON || l == LRU_ACTIVE_FILE))
 				continue;
@@ -2185,3 +2245,29 @@ int zone_reclaim(struct zone *zone, gfp_
 	return ret;
 }
 #endif
+
+#ifdef CONFIG_NORECLAIM
+/*
+ * page_reclaimable(struct page *page, struct vm_area_struct *vma)
+ * Test whether page is reclaimable--i.e., should be placed on active/inactive
+ * lists vs noreclaim list.
+ *
+ * @page       - page to test
+ * @vma        - vm area in which page is/will be mapped.  May be NULL.
+ *               If !NULL, called from fault path.
+ *
+ * Reasons page might not be reclaimable:
+ * TODO - later patches
+ *
+ * TODO:  specify locking assumptions
+ */
+int page_reclaimable(struct page *page, struct vm_area_struct *vma)
+{
+
+	VM_BUG_ON(PageNoreclaim(page));
+
+	/* TODO:  test page [!]reclaimable conditions */
+
+	return 1;
+}
+#endif
Index: linux-2.6.24-rc6-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/mempolicy.c	2008-01-08 12:08:03.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/mempolicy.c	2008-01-08 12:17:10.000000000 -0500
@@ -1912,7 +1912,7 @@ static void gather_stats(struct page *pa
 	if (PageSwapCache(page))
 		md->swapcache++;
 
-	if (PageActive(page))
+	if (PageActive(page) || PageNoreclaim(page))
 		md->active++;
 
 	if (PageWriteback(page))
Index: linux-2.6.24-rc6-mm1/mm/memcontrol.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/memcontrol.c	2008-01-08 12:08:03.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/memcontrol.c	2008-01-08 12:17:10.000000000 -0500
@@ -520,6 +520,10 @@ unsigned long mem_cgroup_isolate_pages(u
 		scan++;
 		list_move(&pc->lru, &pc_list);
 
+//TODO:  for now, don't isolate non-reclaimable pages.  When/if
+// mem controller supports a noreclaim list, we'll need to make
+// at least ISOLATE_ACTIVE visible outside of vm_scan and pass
+// the 'take_nonreclaimable' flag accordingly.
 		if (__isolate_lru_page(page, mode, file) == 0) {
 			list_move(&page->lru, dst);
 			nr_taken++;

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [patch 11/19] Non-reclaimable page statistics
  2008-01-08 20:59 [patch 00/19] VM pageout scalability improvements Rik van Riel
                   ` (9 preceding siblings ...)
  2008-01-08 20:59 ` [patch 10/19] No Reclaim LRU Infrastructure Rik van Riel
@ 2008-01-08 20:59 ` Rik van Riel
  2008-01-08 20:59 ` [patch 12/19] scan noreclaim list for reclaimable pages Rik van Riel
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2008-01-08 20:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, Lee Schermerhorn

[-- Attachment #1: noreclaim-01.2-report-nonreclaimable-memory.patch --]
[-- Type: text/plain, Size: 4994 bytes --]

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split LRU series

V1 -> V2:
	no changes

Report non-reclaimable pages per zone and system wide.

Note:  may want to track/report some specific reasons for 
nonreclaimability for deciding when to splice the noreclaim
lists back to the normal lru.  That will be tricky,
especially in shrink_active_list(), where we'd need someplace
to save the per page reason for non-reclaimability until the
pages are dumped back onto the noreclaim list from the pagevec.

Note:  my tests indicate that NR_NORECLAIM and probably the
other LRU stats aren't being maintained properly--especially
with large amounts of mlocked memory and the mlock patch in
this series installed.  Can't be sure of this, as I don't 
know why the pages are on the noreclaim list. Needs further
investigation.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.24-rc6-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/page_alloc.c	2008-01-02 12:37:58.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/page_alloc.c	2008-01-02 12:38:03.000000000 -0500
@@ -1899,12 +1899,20 @@ void show_free_areas(void)
 	}
 
 	printk("Active_anon:%lu active_file:%lu inactive_anon%lu\n"
-		" inactive_file:%lu dirty:%lu writeback:%lu unstable:%lu\n"
+		" inactive_file:%lu"
+//TODO:  check/adjust line lengths
+#ifdef CONFIG_NORECLAIM
+		" noreclaim:%lu"
+#endif
+		" dirty:%lu writeback:%lu unstable:%lu\n"
 		" free:%lu slab:%lu mapped:%lu pagetables:%lu bounce:%lu\n",
 		global_page_state(NR_ACTIVE_ANON),
 		global_page_state(NR_ACTIVE_FILE),
 		global_page_state(NR_INACTIVE_ANON),
 		global_page_state(NR_INACTIVE_FILE),
+#ifdef CONFIG_NORECLAIM
+		global_page_state(NR_NORECLAIM),
+#endif
 		global_page_state(NR_FILE_DIRTY),
 		global_page_state(NR_WRITEBACK),
 		global_page_state(NR_UNSTABLE_NFS),
@@ -1931,6 +1939,9 @@ void show_free_areas(void)
 			" inactive_anon:%lukB"
 			" active_file:%lukB"
 			" inactive_file:%lukB"
+#ifdef CONFIG_NORECLAIM
+			" noreclaim:%lukB"
+#endif
 			" present:%lukB"
 			" pages_scanned:%lu"
 			" all_unreclaimable? %s"
@@ -1944,6 +1955,9 @@ void show_free_areas(void)
 			K(zone_page_state(zone, NR_INACTIVE_ANON)),
 			K(zone_page_state(zone, NR_ACTIVE_FILE)),
 			K(zone_page_state(zone, NR_INACTIVE_FILE)),
+#ifdef CONFIG_NORECLAIM
+			K(zone_page_state(zone, NR_NORECLAIM)),
+#endif
 			K(zone->present_pages),
 			zone->pages_scanned,
 			(zone_is_all_unreclaimable(zone) ? "yes" : "no")
Index: linux-2.6.24-rc6-mm1/mm/vmstat.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/vmstat.c	2008-01-02 12:37:48.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/vmstat.c	2008-01-02 12:38:03.000000000 -0500
@@ -690,6 +690,9 @@ static const char * const vmstat_text[] 
 	"nr_active_anon",
 	"nr_inactive_file",
 	"nr_active_file",
+#ifdef CONFIG_NORECLAIM
+	"nr_noreclaim",
+#endif
 	"nr_anon_pages",
 	"nr_mapped",
 	"nr_file_pages",
Index: linux-2.6.24-rc6-mm1/drivers/base/node.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/drivers/base/node.c	2008-01-02 12:37:38.000000000 -0500
+++ linux-2.6.24-rc6-mm1/drivers/base/node.c	2008-01-02 12:38:03.000000000 -0500
@@ -52,6 +52,9 @@ static ssize_t node_read_meminfo(struct 
 		       "Node %d Inactive(anon): %8lu kB\n"
 		       "Node %d Active(file):   %8lu kB\n"
 		       "Node %d Inactive(file): %8lu kB\n"
+#ifdef CONFIG_NORECLAIM
+		       "Node %d Noreclaim:    %8lu kB\n"
+#endif
 #ifdef CONFIG_HIGHMEM
 		       "Node %d HighTotal:      %8lu kB\n"
 		       "Node %d HighFree:       %8lu kB\n"
@@ -76,6 +79,9 @@ static ssize_t node_read_meminfo(struct 
 		       nid, node_page_state(nid, NR_INACTIVE_ANON),
 		       nid, node_page_state(nid, NR_ACTIVE_FILE),
 		       nid, node_page_state(nid, NR_INACTIVE_FILE),
+#ifdef CONFIG_NORECLAIM
+		       nid, node_page_state(nid, NR_NORECLAIM),
+#endif
 #ifdef CONFIG_HIGHMEM
 		       nid, K(i.totalhigh),
 		       nid, K(i.freehigh),
Index: linux-2.6.24-rc6-mm1/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/fs/proc/proc_misc.c	2008-01-02 12:37:38.000000000 -0500
+++ linux-2.6.24-rc6-mm1/fs/proc/proc_misc.c	2008-01-02 12:38:03.000000000 -0500
@@ -162,6 +162,9 @@ static int meminfo_read_proc(char *page,
 		"Inactive(anon): %8lu kB\n"
 		"Active(file):   %8lu kB\n"
 		"Inactive(file): %8lu kB\n"
+#ifdef CONFIG_NORECLAIM
+		"Noreclaim:    %8lu kB\n"
+#endif
 #ifdef CONFIG_HIGHMEM
 		"HighTotal:      %8lu kB\n"
 		"HighFree:       %8lu kB\n"
@@ -194,6 +197,9 @@ static int meminfo_read_proc(char *page,
 		K(global_page_state(NR_INACTIVE_ANON)),
 		K(global_page_state(NR_ACTIVE_FILE)),
 		K(global_page_state(NR_INACTIVE_FILE)),
+#ifdef CONFIG_NORECLAIM
+		K(global_page_state(NR_NORECLAIM)),
+#endif
 #ifdef CONFIG_HIGHMEM
 		K(i.totalhigh),
 		K(i.freehigh),

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [patch 12/19] scan noreclaim list for reclaimable pages
  2008-01-08 20:59 [patch 00/19] VM pageout scalability improvements Rik van Riel
                   ` (10 preceding siblings ...)
  2008-01-08 20:59 ` [patch 11/19] Non-reclaimable page statistics Rik van Riel
@ 2008-01-08 20:59 ` Rik van Riel
  2008-01-08 20:59 ` [patch 13/19] ramfs pages are non-reclaimable Rik van Riel
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2008-01-08 20:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, Lee Schermerhorn

[-- Attachment #1: noreclaim-01.3-scan-noreclaim-list-for-reclaimable-pages.patch --]
[-- Type: text/plain, Size: 8849 bytes --]

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split LRU series

New in V2

This patch adds a function to scan individual or all zones' noreclaim
lists and move any pages that have become reclaimable onto the respective
zone's inactive list, where shrink_inactive_list() will deal with them.

This replaces the function to splice the entire noreclaim list onto the
active list for rescan by shrink_active_list().  That method had problems
with vmstat accounting and complicated '[__]isolate_lru_pages()'.  Now,
__isolate_lru_page() will never isolate a non-reclaimable page.  The
only time it should see one is when scanning nearby pages for lumpy
reclaim.

  TODO:  This approach may still need some refinement.
         E.g., put back to active list?

DEBUGGING ONLY: NOT FOR UPSTREAM MERGE

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by:  Rik van Riel <riel@redhat.com>


Index: linux-2.6.24-rc6-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/swap.h	2008-01-08 12:17:10.000000000 -0500
+++ linux-2.6.24-rc6-mm1/include/linux/swap.h	2008-01-08 12:17:17.000000000 -0500
@@ -7,6 +7,7 @@
 #include <linux/list.h>
 #include <linux/sched.h>
 #include <linux/memcontrol.h>
+#include <linux/node.h>
 
 #include <asm/atomic.h>
 #include <asm/page.h>
@@ -215,12 +216,26 @@ static inline int zone_reclaim(struct zo
 
 #ifdef CONFIG_NORECLAIM
 extern int page_reclaimable(struct page *page, struct vm_area_struct *vma);
+extern void scan_zone_noreclaim_pages(struct zone *);
+extern void scan_all_zones_noreclaim_pages(void);
+extern unsigned long scan_noreclaim_pages;
+extern int scan_noreclaim_handler(struct ctl_table *, int, struct file *,
+					void __user *, size_t *, loff_t *);
+extern int scan_noreclaim_register_node(struct node *node);
+extern void scan_noreclaim_unregister_node(struct node *node);
 #else
 static inline int page_reclaimable(struct page *page,
 						struct vm_area_struct *vma)
 {
 	return 1;
 }
+static inline void scan_zone_noreclaim_pages(struct zone *z) { }
+static inline void scan_all_zones_noreclaim_pages(void) { }
+static inline int scan_noreclaim_register_node(struct node *node)
+{
+	return 0;
+}
+static inline void scan_noreclaim_unregister_node(struct node *node) { }
 #endif
 
 extern int kswapd_run(int nid);
Index: linux-2.6.24-rc6-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/vmscan.c	2008-01-08 12:17:10.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/vmscan.c	2008-01-08 12:17:17.000000000 -0500
@@ -39,6 +39,7 @@
 #include <linux/kthread.h>
 #include <linux/freezer.h>
 #include <linux/memcontrol.h>
+#include <linux/sysctl.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -2270,4 +2271,144 @@ int page_reclaimable(struct page *page, 
 
 	return 1;
 }
+
+/**
+ * scan_zone_noreclaim_pages(@zone)
+ * @zone - zone to scan
+ *
+ * Scan @zone's noreclaim LRU lists to check for pages that have become
+ * reclaimable.  Move those that have to @zone's inactive list where they
+ * become candidates for reclaim, unless shrink_inactive_zone() decides
+ * to reactivate them.  Pages that are still non-reclaimable are rotated
+ * back onto @zone's noreclaim list.
+ */
+#define SCAN_NORECLAIM_BATCH_SIZE 16UL	/* arbitrary lock hold batch size */
+void scan_zone_noreclaim_pages(struct zone *zone)
+{
+	struct list_head *l_noreclaim = &zone->list[LRU_NORECLAIM];
+	struct list_head *l_inactive_anon  = &zone->list[LRU_INACTIVE_ANON];
+	struct list_head *l_inactive_file  = &zone->list[LRU_INACTIVE_FILE];
+	unsigned long scan;
+	unsigned long nr_to_scan = zone_page_state(zone, NR_NORECLAIM);
+
+	while (nr_to_scan > 0) {
+		unsigned long batch_size = min(nr_to_scan,
+						SCAN_NORECLAIM_BATCH_SIZE);
+
+		spin_lock_irq(&zone->lru_lock);
+		for (scan = 0;  scan < batch_size; scan++) {
+			struct page* page = lru_to_page(l_noreclaim);
+
+			if (unlikely(!PageLRU(page) || !PageNoreclaim(page)))
+				continue;
+
+			prefetchw_prev_lru_page(page, l_noreclaim, flags);
+
+			ClearPageNoreclaim(page); /* for page_reclaimable() */
+			if(page_reclaimable(page, NULL)) {
+				__dec_zone_state(zone, NR_NORECLAIM);
+				if (page_file_cache(page)) {
+					list_move(&page->lru, l_inactive_file);
+					__inc_zone_state(zone, NR_INACTIVE_FILE);
+				} else {
+					list_move(&page->lru, l_inactive_anon);
+					__inc_zone_state(zone, NR_INACTIVE_ANON);
+				}
+			} else {
+				SetPageNoreclaim(page);
+				list_move(&page->lru, l_noreclaim);
+			}
+
+		}
+		spin_unlock_irq(&zone->lru_lock);
+
+		nr_to_scan -= batch_size;
+	}
+}
+
+
+/**
+ * scan_all_zones_noreclaim_pages()
+ *
+ * A really big hammer:  scan all zones' noreclaim LRU lists to check for
+ * pages that have become reclaimable.  Move those back to the zones'
+ * inactive list where they become candidates for reclaim.
+ * This occurs when, e.g., we have unswappable pages on the noreclaim lists,
+ * and we add swap to the system.  As such, it runs in the context of a task
+ * that has possibly/probably made some previously non-reclaimable pages
+ * reclaimable.
+//TODO:  or as a last resort under extreme memory pressure--before OOM?
+ */
+void scan_all_zones_noreclaim_pages(void)
+{
+	struct zone *zone;
+
+	for_each_zone(zone) {
+		scan_zone_noreclaim_pages(zone);
+	}
+}
+
+/*
+ * scan_noreclaim_pages [vm] sysctl handler.  On demand re-scan of
+ * all nodes' noreclaim lists for reclaimable pages
+ */
+unsigned long scan_noreclaim_pages;
+
+int scan_noreclaim_handler( struct ctl_table *table, int write,
+			   struct file *file, void __user *buffer,
+			   size_t *length, loff_t *ppos)
+{
+	proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
+
+	if (write && *(unsigned long *)table->data)
+		scan_all_zones_noreclaim_pages();
+
+	scan_noreclaim_pages = 0;
+	return 0;
+}
+
+/*
+ * per node 'scan_noreclaim_pages' attribute.  On demand re-scan of
+ * a specified node's per zone noreclaim lists for reclaimable pages.
+ */
+
+static ssize_t read_scan_noreclaim_node(struct sys_device *dev, char *buf)
+{
+	return sprintf(buf, "0\n");	/* always zero; should fit... */
+}
+
+static ssize_t write_scan_noreclaim_node(struct sys_device *dev,
+                                       const char *buf, size_t count)
+{
+	struct zone *node_zones = NODE_DATA(dev->id)->node_zones;
+	struct zone *zone;
+	unsigned long req = simple_strtoul(buf, NULL, 10);
+
+	if (!req)
+		return 1;	/* zero is no-op */
+
+	for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
+		if (!populated_zone(zone))
+			continue;
+		scan_zone_noreclaim_pages(zone);
+	}
+	return 1;
+}
+
+
+static SYSDEV_ATTR(scan_noreclaim_pages, S_IRUGO | S_IWUSR,
+			read_scan_noreclaim_node,
+			write_scan_noreclaim_node);
+
+int scan_noreclaim_register_node(struct node *node)
+{
+	return sysdev_create_file(&node->sysdev, &attr_scan_noreclaim_pages);
+}
+
+void scan_noreclaim_unregister_node(struct node *node)
+{
+	sysdev_remove_file(&node->sysdev, &attr_scan_noreclaim_pages);
+}
+
+
 #endif
Index: linux-2.6.24-rc6-mm1/kernel/sysctl.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/kernel/sysctl.c	2008-01-08 12:08:02.000000000 -0500
+++ linux-2.6.24-rc6-mm1/kernel/sysctl.c	2008-01-08 12:17:17.000000000 -0500
@@ -1151,6 +1151,16 @@ static struct ctl_table vm_table[] = {
 		.extra2		= &one,
 	},
 #endif
+#ifdef CONFIG_NORECLAIM
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "scan_noreclaim_pages",
+		.data		= &scan_noreclaim_pages,
+		.maxlen		= sizeof(scan_noreclaim_pages),
+		.mode		= 0644,
+		.proc_handler	= &scan_noreclaim_handler,
+	},
+#endif
 /*
  * NOTE: do not add new entries to this table unless you have read
  * Documentation/sysctl/ctl_unnumbered.txt
Index: linux-2.6.24-rc6-mm1/drivers/base/node.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/drivers/base/node.c	2008-01-08 12:17:14.000000000 -0500
+++ linux-2.6.24-rc6-mm1/drivers/base/node.c	2008-01-08 12:17:17.000000000 -0500
@@ -13,6 +13,7 @@
 #include <linux/nodemask.h>
 #include <linux/cpu.h>
 #include <linux/device.h>
+#include <linux/swap.h>
 
 static struct sysdev_class node_class = {
 	.name = "node",
@@ -162,6 +163,8 @@ int register_node(struct node *node, int
 		sysdev_create_file(&node->sysdev, &attr_meminfo);
 		sysdev_create_file(&node->sysdev, &attr_numastat);
 		sysdev_create_file(&node->sysdev, &attr_distance);
+
+		scan_noreclaim_register_node(node);
 	}
 	return error;
 }
@@ -180,6 +183,8 @@ void unregister_node(struct node *node)
 	sysdev_remove_file(&node->sysdev, &attr_numastat);
 	sysdev_remove_file(&node->sysdev, &attr_distance);
 
+	scan_noreclaim_unregister_node(node);
+
 	sysdev_unregister(&node->sysdev);
 }
 

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [patch 13/19] ramfs pages are non-reclaimable
  2008-01-08 20:59 [patch 00/19] VM pageout scalability improvements Rik van Riel
                   ` (11 preceding siblings ...)
  2008-01-08 20:59 ` [patch 12/19] scan noreclaim list for reclaimable pages Rik van Riel
@ 2008-01-08 20:59 ` Rik van Riel
  2008-01-08 20:59 ` [patch 14/19] SHM_LOCKED pages are nonreclaimable Rik van Riel
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2008-01-08 20:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, Lee Schermerhorn

[-- Attachment #1: noreclaim-02-ramdisk-and-ramfs-pages-are-nonreclaimable.patch --]
[-- Type: text/plain, Size: 4711 bytes --]

V3 -> V4:
+ drivers/block/rd.c was replaced by brd.c in 24-rc4-mm1.
  Update patch to add brd_open() to mark mapping as nonreclaimable

V2 -> V3:
+  rebase to 23-mm1 atop RvR's split LRU series [no changes]

V1 -> V2:
+  add ramfs pages to this class of non-reclaimable pages by
   marking ramfs address_space [mapping] as non-reclaimble.

Christoph Lameter pointed out that ram disk pages also clutter the
LRU lists.  When vmscan finds them dirty and tries to clean them,
the ram disk writeback function just redirties the page so that it
goes back onto the active list.  Round and round she goes...

Define new address_space flag [shares address_space flags member
with mapping's gfp mask] to indicate that the address space contains
all non-reclaimable pages.  This will provide for efficient testing
of ramdisk pages in page_reclaimable().

Also provide wrapper functions to set/test the noreclaim state to
minimize #ifdefs in ramdisk driver and any other users of this
facility.

Set the noreclaim state on address_space structures for new
ramdisk inodes.  Test the noreclaim state in page_reclaimable()
to cull non-reclaimable pages.

Similarly, ramfs pages are non-reclaimable.  Set the 'noreclaim'
address_space flag for new ramfs inodes.

These changes depend on [CONFIG_]NORECLAIM.


Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by:  Rik van Riel <riel@redhat.com>

Index: linux-2.6.24-rc6-mm1/include/linux/pagemap.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/pagemap.h	2008-01-08 12:08:02.000000000 -0500
+++ linux-2.6.24-rc6-mm1/include/linux/pagemap.h	2008-01-08 12:17:21.000000000 -0500
@@ -30,6 +30,28 @@ static inline void mapping_set_error(str
 	}
 }
 
+#ifdef CONFIG_NORECLAIM
+#define AS_NORECLAIM	(__GFP_BITS_SHIFT + 2)	/* e.g., ramdisk, SHM_LOCK */
+
+static inline void mapping_set_noreclaim(struct address_space *mapping)
+{
+	set_bit(AS_NORECLAIM, &mapping->flags);
+}
+
+static inline int mapping_non_reclaimable(struct address_space *mapping)
+{
+	if (mapping && (mapping->flags & AS_NORECLAIM))
+		return 1;
+	return 0;
+}
+#else
+static inline void mapping_set_noreclaim(struct address_space *mapping) { }
+static inline int mapping_non_reclaimable(struct address_space *mapping)
+{
+	return 0;
+}
+#endif
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
 	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
Index: linux-2.6.24-rc6-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/vmscan.c	2008-01-08 12:17:17.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/vmscan.c	2008-01-08 12:17:21.000000000 -0500
@@ -2258,6 +2258,7 @@ int zone_reclaim(struct zone *zone, gfp_
  *               If !NULL, called from fault path.
  *
  * Reasons page might not be reclaimable:
+ * + page's mapping marked non-reclaimable
  * TODO - later patches
  *
  * TODO:  specify locking assumptions
@@ -2267,6 +2268,9 @@ int page_reclaimable(struct page *page, 
 
 	VM_BUG_ON(PageNoreclaim(page));
 
+	if (mapping_non_reclaimable(page_mapping(page)))
+		return 0;
+
 	/* TODO:  test page [!]reclaimable conditions */
 
 	return 1;
Index: linux-2.6.24-rc6-mm1/fs/ramfs/inode.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/fs/ramfs/inode.c	2008-01-08 12:08:02.000000000 -0500
+++ linux-2.6.24-rc6-mm1/fs/ramfs/inode.c	2008-01-08 12:17:21.000000000 -0500
@@ -61,6 +61,7 @@ struct inode *ramfs_get_inode(struct sup
 		inode->i_mapping->a_ops = &ramfs_aops;
 		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
 		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+		mapping_set_noreclaim(inode->i_mapping);
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		switch (mode & S_IFMT) {
 		default:
Index: linux-2.6.24-rc6-mm1/drivers/block/brd.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/drivers/block/brd.c	2008-01-08 12:08:02.000000000 -0500
+++ linux-2.6.24-rc6-mm1/drivers/block/brd.c	2008-01-08 12:17:21.000000000 -0500
@@ -373,8 +373,21 @@ static int brd_ioctl(struct inode *inode
 	return error;
 }
 
+/*
+ * brd_open():
+ * Just mark the mapping as containing non-reclaimable pages
+ */
+static int brd_open(struct inode *inode, struct file *filp)
+{
+	struct address_space *mapping = inode->i_mapping;
+
+	mapping_set_noreclaim(mapping);
+	return 0;
+}
+
 static struct block_device_operations brd_fops = {
 	.owner =		THIS_MODULE,
+	.open  =		brd_open,
 	.ioctl =		brd_ioctl,
 #ifdef CONFIG_BLK_DEV_XIP
 	.direct_access =	brd_direct_access,

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [patch 14/19] SHM_LOCKED pages are nonreclaimable
  2008-01-08 20:59 [patch 00/19] VM pageout scalability improvements Rik van Riel
                   ` (12 preceding siblings ...)
  2008-01-08 20:59 ` [patch 13/19] ramfs pages are non-reclaimable Rik van Riel
@ 2008-01-08 20:59 ` Rik van Riel
  2008-01-08 20:59 ` [patch 15/19] non-reclaimable mlocked pages Rik van Riel
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2008-01-08 20:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, Lee Schermerhorn

[-- Attachment #1: noreclaim-03-SHM_LOCKed-pages-are-nonreclaimable.patch --]
[-- Type: text/plain, Size: 8017 bytes --]

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split LRU series.
+ Use scan_mapping_noreclaim_page() on unlock.  See below.

V1 -> V2:
+  modify to use reworked 'scan_all_zones_noreclaim_pages()'
   See 'TODO' below - still pending.

While working with Nick Piggin's mlock patches, I noticed that
shmem segments locked via shmctl(SHM_LOCKED) were not being handled.
SHM_LOCKed pages work like ramdisk pages--the writeback function
just redirties the page so that it can't be reclaimed.  Deal with
these using the same approach as for ram disk pages.

Use the AS_NORECLAIM flag to mark address_space of SHM_LOCKed
shared memory regions as non-reclaimable.  Then these pages
will be culled off the normal LRU lists during vmscan.

Add new wrapper function to clear the mapping's noreclaim state
when/if shared memory segment is munlocked.

Add 'scan_mapping_noreclaim_page()' to mm/vmscan.c to scan all
pages in the shmem segment's mapping [struct address_space] for
reclaimability now that they're no longer locked.  If so, move
them to the appropriate zone lru list.

Changes depend on [CONFIG_]NORECLAIM.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by:  Rik van Riel <riel@redhat.com>

Index: linux-2.6.24-rc6-mm1/mm/shmem.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/shmem.c	2008-01-08 12:08:02.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/shmem.c	2008-01-08 12:17:25.000000000 -0500
@@ -1468,10 +1468,13 @@ int shmem_lock(struct file *file, int lo
 		if (!user_shm_lock(inode->i_size, user))
 			goto out_nomem;
 		info->flags |= VM_LOCKED;
+		mapping_set_noreclaim(file->f_mapping);
 	}
 	if (!lock && (info->flags & VM_LOCKED) && user) {
 		user_shm_unlock(inode->i_size, user);
 		info->flags &= ~VM_LOCKED;
+		mapping_clear_noreclaim(file->f_mapping);
+		scan_mapping_noreclaim_pages(file->f_mapping);
 	}
 	retval = 0;
 out_nomem:
Index: linux-2.6.24-rc6-mm1/include/linux/pagemap.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/pagemap.h	2008-01-08 12:17:21.000000000 -0500
+++ linux-2.6.24-rc6-mm1/include/linux/pagemap.h	2008-01-08 12:17:25.000000000 -0500
@@ -38,14 +38,20 @@ static inline void mapping_set_noreclaim
 	set_bit(AS_NORECLAIM, &mapping->flags);
 }
 
+static inline void mapping_clear_noreclaim(struct address_space *mapping)
+{
+	clear_bit(AS_NORECLAIM, &mapping->flags);
+}
+
 static inline int mapping_non_reclaimable(struct address_space *mapping)
 {
-	if (mapping && (mapping->flags & AS_NORECLAIM))
-		return 1;
+	if (mapping)
+		return test_bit(AS_NORECLAIM, &mapping->flags);
 	return 0;
 }
 #else
 static inline void mapping_set_noreclaim(struct address_space *mapping) { }
+static inline void mapping_clear_noreclaim(struct address_space *mapping) { }
 static inline int mapping_non_reclaimable(struct address_space *mapping)
 {
 	return 0;
Index: linux-2.6.24-rc6-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/vmscan.c	2008-01-08 12:17:21.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/vmscan.c	2008-01-08 12:17:25.000000000 -0500
@@ -2276,6 +2276,30 @@ int page_reclaimable(struct page *page, 
 	return 1;
 }
 
+/*
+ * check_move_noreclaim_page() -- check @page for reclaimability and move
+ * to appropriate @zone lru list.
+ * @zone->lru_lock held on entry/exit.
+ * @page is on LRU and has PageNoreclaim true
+ */
+static void check_move_noreclaim_page(struct page *page, struct zone* zone)
+{
+
+	ClearPageNoreclaim(page); /* for page_reclaimable() */
+	if(page_reclaimable(page, NULL)) {
+		enum lru_list l = LRU_INACTIVE_ANON + page_file_cache(page);
+		__dec_zone_state(zone, NR_NORECLAIM);
+		list_move(&page->lru, &zone->list[l]);
+		__inc_zone_state(zone, NR_INACTIVE_ANON + l);
+	} else {
+		/*
+		 * rotate noreclaim list
+		 */
+		SetPageNoreclaim(page);
+		list_move(&page->lru, &zone->list[LRU_NORECLAIM]);
+	}
+}
+
 /**
  * scan_zone_noreclaim_pages(@zone)
  * @zone - zone to scan
@@ -2290,8 +2314,6 @@ int page_reclaimable(struct page *page, 
 void scan_zone_noreclaim_pages(struct zone *zone)
 {
 	struct list_head *l_noreclaim = &zone->list[LRU_NORECLAIM];
-	struct list_head *l_inactive_anon  = &zone->list[LRU_INACTIVE_ANON];
-	struct list_head *l_inactive_file  = &zone->list[LRU_INACTIVE_FILE];
 	unsigned long scan;
 	unsigned long nr_to_scan = zone_page_state(zone, NR_NORECLAIM);
 
@@ -2303,26 +2325,15 @@ void scan_zone_noreclaim_pages(struct zo
 		for (scan = 0;  scan < batch_size; scan++) {
 			struct page* page = lru_to_page(l_noreclaim);
 
-			if (unlikely(!PageLRU(page) || !PageNoreclaim(page)))
+			if (TestSetPageLocked(page))
 				continue;
 
 			prefetchw_prev_lru_page(page, l_noreclaim, flags);
 
-			ClearPageNoreclaim(page); /* for page_reclaimable() */
-			if(page_reclaimable(page, NULL)) {
-				__dec_zone_state(zone, NR_NORECLAIM);
-				if (page_file_cache(page)) {
-					list_move(&page->lru, l_inactive_file);
-					__inc_zone_state(zone, NR_INACTIVE_FILE);
-				} else {
-					list_move(&page->lru, l_inactive_anon);
-					__inc_zone_state(zone, NR_INACTIVE_ANON);
-				}
-			} else {
-				SetPageNoreclaim(page);
-				list_move(&page->lru, l_noreclaim);
-			}
+			if (likely(PageLRU(page) && PageNoreclaim(page)))
+				check_move_noreclaim_page(page, zone);
 
+			unlock_page(page);
 		}
 		spin_unlock_irq(&zone->lru_lock);
 
@@ -2352,6 +2363,62 @@ void scan_all_zones_noreclaim_pages(void
 	}
 }
 
+/**
+ * scan_mapping_noreclaim_pages(mapping)
+ * @mapping - struct address_space to scan for reclaimable pages
+ *
+ * scan all pages in mapping.  check non-reclaimable pages for
+ * reclaimabililty and move them to the appropriate zone lru list.
+ */
+void scan_mapping_noreclaim_pages(struct address_space *mapping)
+{
+	pgoff_t next = 0;
+	pgoff_t end   = i_size_read(mapping->host);
+	struct zone *zone;
+	struct pagevec pvec;
+
+	if (mapping->nrpages == 0)
+		return;
+
+	pagevec_init(&pvec, 0);
+	while (next < end &&
+		pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+		int i;
+
+		zone = NULL;
+
+		for (i = 0; i < pagevec_count(&pvec); i++) {
+			struct page *page = pvec.pages[i];
+			pgoff_t page_index = page->index;
+			struct zone *pagezone = page_zone(page);
+
+			if (page_index > next)
+				next = page_index;
+			next++;
+
+			if (TestSetPageLocked(page))
+				continue;
+
+			if (pagezone != zone) {
+				if (zone)
+					spin_unlock(&zone->lru_lock);
+				zone = pagezone;
+				spin_lock(&zone->lru_lock);
+			}
+
+			if (PageLRU(page) && PageNoreclaim(page))
+				check_move_noreclaim_page(page, zone);
+
+			unlock_page(page);
+
+		}
+		if (zone)
+			spin_unlock(&zone->lru_lock);
+		pagevec_release(&pvec);
+	}
+
+}
+
 /*
  * scan_noreclaim_pages [vm] sysctl handler.  On demand re-scan of
  * all nodes' noreclaim lists for reclaimable pages
Index: linux-2.6.24-rc6-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/swap.h	2008-01-08 12:17:17.000000000 -0500
+++ linux-2.6.24-rc6-mm1/include/linux/swap.h	2008-01-08 12:17:25.000000000 -0500
@@ -218,6 +218,7 @@ static inline int zone_reclaim(struct zo
 extern int page_reclaimable(struct page *page, struct vm_area_struct *vma);
 extern void scan_zone_noreclaim_pages(struct zone *);
 extern void scan_all_zones_noreclaim_pages(void);
+extern void scan_mapping_noreclaim_pages(struct address_space *);
 extern unsigned long scan_noreclaim_pages;
 extern int scan_noreclaim_handler(struct ctl_table *, int, struct file *,
 					void __user *, size_t *, loff_t *);
@@ -231,6 +232,9 @@ static inline int page_reclaimable(struc
 }
 static inline void scan_zone_noreclaim_pages(struct zone *z) { }
 static inline void scan_all_zones_noreclaim_pages(void) { }
+static inline void scan_mapping_noreclaim_pages(struct address_space *mapping)
+{
+}
 static inline int scan_noreclaim_register_node(struct node *node)
 {
 	return 0;

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [patch 15/19] non-reclaimable mlocked pages
  2008-01-08 20:59 [patch 00/19] VM pageout scalability improvements Rik van Riel
                   ` (13 preceding siblings ...)
  2008-01-08 20:59 ` [patch 14/19] SHM_LOCKED pages are nonreclaimable Rik van Riel
@ 2008-01-08 20:59 ` Rik van Riel
  2008-01-08 20:59 ` [patch 16/19] mlock vma pages under mmap_sem held for read Rik van Riel
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2008-01-08 20:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, Lee Schermerhorn

[-- Attachment #1: noreclaim-04.1-prepare-for-mlocked-pages.patch --]
[-- Type: text/plain, Size: 31493 bytes --]

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series
+ fix page flags macros for *PageMlocked() when not configured.
+ ensure lru_add_drain_all() runs on all cpus when NORECLIM_MLOCK
  configured.  Was just for NUMA.

V1 -> V2:
+ moved this patch [and related patches] up to right after
  ramdisk/ramfs and SHM_LOCKed patches.
+ add [back] missing put_page() in putback_lru_page().
  This solved page leakage as seen by stats in previous
  version.
+ fix up munlock_vma_page() to isolate page from lru
  before calling try_to_unlock().  Think I detected a
  race here.
+ use TestClearPageMlock() on old page in migrate.c's
  migrate_page_copy() to clean up old page.
+ live dangerously:  remove TestSetPageLocked() in 
  is_mlocked_vma()--should only be called on new pages in
  the fault path--iff we chose to cull there [later patch].
+ Add PG_mlocked to free_pages_check() etc to detect mlock
  state mismanagement.
  NOTE:  temporarily [???] commented out--tripping over it
  under load.  Why?

Rework of a patch by Nick Piggin -- part 1 of 2.

This patch:

1) defines the [CONFIG_]NORECLAIM_MLOCK sub-option and the
   stub version of the mlock/noreclaim APIs when it's
   not configured.  Depends on [CONFIG_]NORECLAIM.

2) add yet another page flag--PG_mlocked--to indicate that
   the page is locked for efficient testing in vmscan and,
   optionally, fault path.  This allows early culling of
   nonreclaimable pages, preventing them from getting to
   page_referenced()/try_to_unmap().  Also allows separate
   accounting of mlock'd pages, as Nick's original patch
   did.

   Uses a bit available only to 64-bit systems.

   Note:  Nick's original mlock patch used a PG_mlocked
   flag.  I had removed this in favor of the PG_noreclaim
   flag + an mlock_count [new page struct member].  I
   restored the PG_mlocked flag to eliminate the new
   count field.

3) add the mlock/noreclaim infrastructure to mm/mlock.c,
   with internal APIs in mm/internal.h.  This is a rework
   of Nick's original patch to these files, taking into
   account that mlocked pages are now kept on noreclaim
   LRU list.

4) update vmscan.c:page_reclaimable() to check PageMlocked()
   and, if vma passed in, the vm_flags.  Note that the vma
   will only be passed in for new pages in the fault path;
   and then only if the "cull nonreclaimable pages in fault
   path" patch is included.

5) add try_to_unlock() to rmap.c to walk a page's rmap and
   ClearPageMlocked() if no other vmas have it mlocked.  
   Reuses as much of try_to_unmap() as possible.  This
   effectively replaces the use of one of the lru list links
   as an mlock count.  If this mechanism let's pages in mlocked
   vmas leak through w/o PG_mlocked set [I don't know that it
   does], we should catch them later in try_to_unmap().  One
   hopes this will be rare, as it will be relatively expensive.

mm/internal.h and mm/mlock.c changes:
Originally Signed-off-by: Nick Piggin <npiggin@suse.de>

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by:  Rik van Riel <riel@redhat.com>


Index: linux-2.6.24-rc6-mm1/mm/Kconfig
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/Kconfig	2008-01-08 12:17:10.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/Kconfig	2008-01-08 12:17:30.000000000 -0500
@@ -203,3 +203,17 @@ config NORECLAIM
 	  may be non-reclaimable because:  they are locked into memory, they
 	  are anonymous pages for which no swap space exists, or they are anon
 	  pages that are expensive to unmap [long anon_vma "related vma" list.]
+
+config NORECLAIM_MLOCK
+	bool "Exclude mlock'ed pages from reclaim"
+	depends on NORECLAIM
+	help
+	  Treats mlock'ed pages as no-reclaimable.  Removing these pages from
+	  the LRU [in]active lists avoids the overhead of attempting to reclaim
+	  them.  Pages marked non-reclaimable for this reason will become
+	  reclaimable again when the last mlock is removed.
+	  when no swap space exists.  Removing these pages from the LRU lists
+	  avoids the overhead of attempting to reclaim them.  Pages marked
+	  non-reclaimable for this reason will become reclaimable again when/if
+	  sufficient swap space is added to the system.
+
Index: linux-2.6.24-rc6-mm1/mm/internal.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/internal.h	2008-01-08 12:08:02.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/internal.h	2008-01-08 12:17:30.000000000 -0500
@@ -39,6 +39,64 @@ extern int isolate_lru_page(struct page 
 extern void __init __free_pages_bootmem(struct page *page,
 						unsigned int order);
 
+#ifdef CONFIG_NORECLAIM_MLOCK
+/*
+ * in mm/vmscan.c -- currently only used for NORECLAIM_MLOCK
+ */
+extern void putback_lru_page(struct page *page);
+
+/*
+ * called only for new pages in fault path
+ */
+extern int is_mlocked_vma(struct vm_area_struct *, struct page *);
+
+/*
+ * must be called with vma's mmap_sem held for read, and page locked.
+ */
+extern void mlock_vma_page(struct page *page);
+
+extern int __mlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end, int lock);
+
+/*
+ * mlock all pages in this vma range.  For mmap()/mremap()/...
+ */
+static inline void mlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end)
+{
+	__mlock_vma_pages_range(vma, start, end, 1);
+}
+
+/*
+ * munlock range of pages.   For munmap() and exit().
+ * Always called to operate on a full vma that is being unmapped.
+ */
+static inline void munlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end)
+{
+// TODO:  verify my assumption.  Should we just drop the start/end args?
+	VM_BUG_ON(start != vma->vm_start || end != vma->vm_end);
+
+	vma->vm_flags &= ~VM_LOCKED;    /* try_to_unlock() needs this */
+	__mlock_vma_pages_range(vma, start, end, 0);
+}
+
+extern void clear_page_mlock(struct page *page);
+
+#else /* CONFIG_NORECLAIM_MLOCK */
+static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
+{
+	return 0;
+}
+static inline void clear_page_mlock(struct page *page) { }
+static inline void mlock_vma_page(struct page *page) { }
+static inline void mlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end) { }
+static inline void munlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end) { }
+
+#endif /* CONFIG_NORECLAIM_MLOCK */
+
 /*
  * function for dealing with page's order in buddy system.
  * zone->lock is already acquired when we use these.
Index: linux-2.6.24-rc6-mm1/mm/mlock.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/mlock.c	2008-01-08 12:08:02.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/mlock.c	2008-01-08 12:17:30.000000000 -0500
@@ -8,10 +8,16 @@
 #include <linux/capability.h>
 #include <linux/mman.h>
 #include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/pagemap.h>
 #include <linux/mempolicy.h>
 #include <linux/syscalls.h>
 #include <linux/sched.h>
 #include <linux/module.h>
+#include <linux/rmap.h>
+#include <linux/mmzone.h>
+
+#include "internal.h"
 
 int can_do_mlock(void)
 {
@@ -23,19 +29,209 @@ int can_do_mlock(void)
 }
 EXPORT_SYMBOL(can_do_mlock);
 
+#ifdef CONFIG_NORECLAIM_MLOCK
+/*
+ * Mlocked pages are marked with PageMlocked() flag for efficient testing
+ * in vmscan and, possibly, the fault path.
+ *
+ * An mlocked page [PageMlocked(page)] is non-reclaimable.  As such, it will
+ * be placed on the LRU "noreclaim" list, rather than the [in]active lists.
+ * The noreclaim list is an LRU sibling list to the [in]active lists.
+ * PageNoreclaim is set to indicate the non-reclaimable state.
+ *
+//TODO:  no longer counting, but does this still apply to lazy setting
+// of PageMlocked() ??
+ * When lazy incrementing via vmscan, it is important to ensure that the
+ * vma's VM_LOCKED status is not concurrently being modified, otherwise we
+ * may have elevated mlock_count of a page that is being munlocked. So lazy
+ * mlocked must take the mmap_sem for read, and verify that the vma really
+ * is locked (see mm/rmap.c).
+ */
+
+/*
+ * Clear the page's PageMlocked().  This can be useful in a situation where
+ * we want to unconditionally remove a page from the pagecache.
+ *
+ * It is legal to call this function for any page, mlocked or not.
+ * If called for a page that is still mapped by mlocked vmas, all we do
+ * is revert to lazy LRU behaviour -- semantics are not broken.
+ */
+void clear_page_mlock(struct page *page)
+{
+	BUG_ON(!PageLocked(page));
+
+	if (likely(!PageMlocked(page)))
+		return;
+	ClearPageMlocked(page);
+	if (!isolate_lru_page(page))
+		putback_lru_page(page);
+}
+
+/*
+ * Mark page as mlocked if not already.
+ * If page on LRU, isolate and putback to move to noreclaim list.
+ */
+void mlock_vma_page(struct page *page)
+{
+	BUG_ON(!PageLocked(page));
+
+	if (!TestSetPageMlocked(page) && !isolate_lru_page(page))
+			putback_lru_page(page);
+}
+
+/*
+ * called from munlock()/munmap() path with page supposedly on the LRU.
+ *
+ * Note:  unlike mlock_vma_page(), we can't just clear the PageMlocked
+ * [in try_to_unlock()] and then attempt to isolate the page.  We must
+ * isolate the page() to keep others from messing with its noreclaim
+ * and mlocked state while trying to unlock.  However, we pre-clear the
+ * mlocked state anyway as we might lose the isolation race and we might
+ * not get another chance to clear PageMlocked.  If we successfully
+ * isolate the page and try_to_unlock() detects other VM_LOCKED vmas
+ * mapping the page, we just restore the PageMlocked state.  If we lose
+ * the isolation race, and the page is mapped by other VM_LOCKED vmas,
+ * we'll detect this in try_to_unmap() and we'll call mlock_vma_page()
+ * above, if/when we try to reclaim the page.
+ */
+static void munlock_vma_page(struct page *page)
+{
+	BUG_ON(!PageLocked(page));
+
+	if (TestClearPageMlocked(page) && !isolate_lru_page(page)) {
+		if (try_to_unlock(page) == SWAP_MLOCK)
+			SetPageMlocked(page);	/* still VM_LOCKED */
+		putback_lru_page(page);
+	}
+}
+
+/*
+ * Called in fault path via page_reclaimable() for a new page
+ * to determine if it's being mapped into a LOCKED vma.
+ * If so, mark page as mlocked.
+ */
+int is_mlocked_vma(struct vm_area_struct *vma, struct page *page)
+{
+	VM_BUG_ON(PageMlocked(page));	// TODO:  needed?
+	VM_BUG_ON(PageLRU(page));
+
+	if (likely(!(vma->vm_flags & VM_LOCKED)))
+		return 0;
+
+	SetPageMlocked(page);
+	return 1;
+}
+
+/*
+ * mlock or munlock a range of pages in the vma depending on whether
+ * @lock is 1 or 0, respectively.  @lock must match vm_flags VM_LOCKED
+ * state.
+TODO:   we don't really need @lock, as we can determine it from vm_flags
+ *
+ * This takes care of making the pages present too.
+ *
+ * vma->vm_mm->mmap_sem must be held for write.
+ */
+int __mlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end, int lock)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long addr = start;
+	struct page *pages[16]; /* 16 gives a reasonable batch */
+	int write = !!(vma->vm_flags & VM_WRITE);
+	int nr_pages;
+	int ret = 0;
+
+	BUG_ON(start & ~PAGE_MASK || end & ~PAGE_MASK);
+	VM_BUG_ON(lock != !!(vma->vm_flags & VM_LOCKED));
+
+	if (vma->vm_flags & VM_IO)
+		return ret;
+
+	lru_add_drain_all();	/* push cached pages to LRU */
+
+	nr_pages = (end - start) / PAGE_SIZE;
+
+	while (nr_pages > 0) {
+		int i;
+
+		cond_resched();
+
+		/*
+		 * get_user_pages makes pages present if we are
+		 * setting mlock.
+		 */
+		ret = get_user_pages(current, mm, addr,
+				min_t(int, nr_pages, ARRAY_SIZE(pages)),
+				write, 0, pages, NULL);
+		if (ret < 0)
+			break;
+		if (ret == 0) {
+			/*
+			 * We know the vma is there, so the only time
+			 * we cannot get a single page should be an
+			 * error (ret < 0) case.
+			 */
+			WARN_ON(1);
+			ret = -EFAULT;
+			break;
+		}
+
+		lru_add_drain();	/* push cached pages to LRU */
+
+		for (i = 0; i < ret; i++) {
+			struct page *page = pages[i];
+
+			lock_page(page);
+			if (lock)
+				mlock_vma_page(page);
+			else
+				munlock_vma_page(page);
+			unlock_page(page);
+			put_page(page);		/* ref from get_user_pages() */
+
+			addr += PAGE_SIZE;	/* for next get_user_pages() */
+			nr_pages--;
+		}
+	}
+
+	lru_add_drain_all();	/* to update stats */
+
+	return ret;
+}
+
+#else /* CONFIG_NORECLAIM_MLOCK */
+
+/*
+ * Just make pages present if @lock true.  No-op if unlocking.
+ */
+int __mlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end, int lock)
+{
+	int ret = 0;
+
+	if (!lock || vma->vm_flags & VM_IO)
+		return ret;
+
+	return make_pages_present(start, end);
+}
+#endif /* CONFIG_NORECLAIM_MLOCK */
+
 static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	unsigned long start, unsigned long end, unsigned int newflags)
 {
-	struct mm_struct * mm = vma->vm_mm;
+	struct mm_struct *mm = vma->vm_mm;
 	pgoff_t pgoff;
-	int pages;
+	int nr_pages;
 	int ret = 0;
+	int lock;
 
 	if (newflags == vma->vm_flags) {
 		*prev = vma;
 		goto out;
 	}
 
+//TODO:  linear_page_index() ?   non-linear pages?
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
 			  vma->vm_file, pgoff, vma_policy(vma));
@@ -59,24 +255,25 @@ static int mlock_fixup(struct vm_area_st
 	}
 
 success:
+	lock = !!(newflags & VM_LOCKED);
+
+	/*
+	 * Keep track of amount of locked VM.
+	 */
+	nr_pages = (end - start) >> PAGE_SHIFT;
+	if (!lock)
+		nr_pages = -nr_pages;
+	mm->locked_vm += nr_pages;
+
 	/*
 	 * vm_flags is protected by the mmap_sem held in write mode.
 	 * It's okay if try_to_unmap_one unmaps a page just after we
-	 * set VM_LOCKED, make_pages_present below will bring it back.
+	 * set VM_LOCKED, __mlock_vma_pages_range will bring it back.
 	 */
 	vma->vm_flags = newflags;
 
-	/*
-	 * Keep track of amount of locked VM.
-	 */
-	pages = (end - start) >> PAGE_SHIFT;
-	if (newflags & VM_LOCKED) {
-		pages = -pages;
-		if (!(newflags & VM_IO))
-			ret = make_pages_present(start, end);
-	}
+	__mlock_vma_pages_range(vma, start, end, lock);
 
-	mm->locked_vm -= pages;
 out:
 	if (ret == -ENOMEM)
 		ret = -EAGAIN;
Index: linux-2.6.24-rc6-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/vmscan.c	2008-01-08 12:17:25.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/vmscan.c	2008-01-08 12:17:30.000000000 -0500
@@ -887,6 +887,44 @@ int isolate_lru_page(struct page *page)
 	return ret;
 }
 
+#ifdef CONFIG_NORECLAIM_MLOCK
+/**
+ * putback_lru_page(@page)
+ *
+ * Add previously isolated @page to appropriate LRU list.
+ * Page may still be non-reclaimable for other reasons.
+ *
+ * The vmstat page counts corresponding to the list on which the page
+ * will be placed will be incremented.
+ *
+ * lru_lock must not be held, interrupts must be enabled.
+ */
+void putback_lru_page(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+	int lru = LRU_INACTIVE_ANON;
+
+	VM_BUG_ON(PageLRU(page));
+
+	ClearPageNoreclaim(page);
+	ClearPageActive(page);
+
+	spin_lock_irq(&zone->lru_lock);
+	if (page_reclaimable(page, NULL)) {
+		lru += page_file_cache(page);
+	} else {
+		lru = LRU_NORECLAIM;
+		SetPageNoreclaim(page);
+	}
+
+	SetPageLRU(page);
+	add_page_to_lru_list(zone, page, lru);
+	put_page(page);		/* drop ref from isolate */
+
+	spin_unlock_irq(&zone->lru_lock);
+}
+#endif
+
 /*
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
@@ -2255,10 +2293,11 @@ int zone_reclaim(struct zone *zone, gfp_
  *
  * @page       - page to test
  * @vma        - vm area in which page is/will be mapped.  May be NULL.
- *               If !NULL, called from fault path.
+ *               If !NULL, called from fault path for a new page.
  *
  * Reasons page might not be reclaimable:
- * + page's mapping marked non-reclaimable
+ * 1) page's mapping marked non-reclaimable
+ * 2) page is mlock'ed into memory.
  * TODO - later patches
  *
  * TODO:  specify locking assumptions
@@ -2271,6 +2310,11 @@ int page_reclaimable(struct page *page, 
 	if (mapping_non_reclaimable(page_mapping(page)))
 		return 0;
 
+#ifdef CONFIG_NORECLAIM_MLOCK
+	if (PageMlocked(page) || (vma && is_mlocked_vma(vma, page)))
+		return 0;
+#endif
+
 	/* TODO:  test page [!]reclaimable conditions */
 
 	return 1;
Index: linux-2.6.24-rc6-mm1/include/linux/page-flags.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/page-flags.h	2008-01-08 12:17:10.000000000 -0500
+++ linux-2.6.24-rc6-mm1/include/linux/page-flags.h	2008-01-08 12:17:30.000000000 -0500
@@ -110,6 +110,7 @@
 #define PG_uncached		31	/* Page has been mapped as uncached */
 
 #define PG_noreclaim		30	/* Page is "non-reclaimable"  */
+#define PG_mlocked		29	/* Page is vma mlocked */
 #endif
 
 /*
@@ -163,6 +164,7 @@ static inline void SetPageUptodate(struc
 #define SetPageActive(page)	set_bit(PG_active, &(page)->flags)
 #define ClearPageActive(page)	clear_bit(PG_active, &(page)->flags)
 #define __ClearPageActive(page)	__clear_bit(PG_active, &(page)->flags)
+#define TestSetPageActive(page) test_and_set_bit(PG_active, &(page)->flags)
 #define TestClearPageActive(page) test_and_clear_bit(PG_active, &(page)->flags)
 
 #define PageSlab(page)		test_bit(PG_slab, &(page)->flags)
@@ -270,8 +272,17 @@ static inline void __ClearPageTail(struc
 #define SetPageNoreclaim(page)	set_bit(PG_noreclaim, &(page)->flags)
 #define ClearPageNoreclaim(page) clear_bit(PG_noreclaim, &(page)->flags)
 #define __ClearPageNoreclaim(page) __clear_bit(PG_noreclaim, &(page)->flags)
-#define TestClearPageNoreclaim(page) test_and_clear_bit(PG_noreclaim, \
-							 &(page)->flags)
+#define TestClearPageNoreclaim(page) \
+				test_and_clear_bit(PG_noreclaim, &(page)->flags)
+#ifdef CONFIG_NORECLAIM_MLOCK
+#define PageMlocked(page)	test_bit(PG_mlocked, &(page)->flags)
+#define SetPageMlocked(page)	set_bit(PG_mlocked, &(page)->flags)
+#define ClearPageMlocked(page) clear_bit(PG_mlocked, &(page)->flags)
+#define __ClearPageMlocked(page) __clear_bit(PG_mlocked, &(page)->flags)
+#define TestSetPageMlocked(page) test_and_set_bit(PG_mlocked, &(page)->flags)
+#define TestClearPageMlocked(page) \
+				test_and_clear_bit(PG_mlocked, &(page)->flags)
+#endif
 #else
 #define PageNoreclaim(page)	0
 #define SetPageNoreclaim(page)
@@ -279,6 +290,14 @@ static inline void __ClearPageTail(struc
 #define __ClearPageNoreclaim(page)
 #define TestClearPageNoreclaim(page) 0
 #endif
+#ifndef CONFIG_NORECLAIM_MLOCK
+#define PageMlocked(page)	0
+#define SetPageMlocked(page)
+#define ClearPageMlocked(page)
+#define __ClearPageMlocked(page)
+#define TestSetPageMlocked(page) 0
+#define TestClearPageMlocked(page) 0
+#endif
 
 #define PageUncached(page)	test_bit(PG_uncached, &(page)->flags)
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
Index: linux-2.6.24-rc6-mm1/include/linux/rmap.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/rmap.h	2008-01-08 12:08:02.000000000 -0500
+++ linux-2.6.24-rc6-mm1/include/linux/rmap.h	2008-01-08 12:17:30.000000000 -0500
@@ -109,6 +109,17 @@ unsigned long page_address_in_vma(struct
  */
 int page_mkclean(struct page *);
 
+#ifdef CONFIG_NORECLAIM_MLOCK
+/*
+ * called in munlock()/munmap() path to check for other vmas holding
+ * the page mlocked.
+ */
+int try_to_unlock(struct page *);
+#define TRY_TO_UNLOCK 1
+#else
+#define TRY_TO_UNLOCK 0		/* for compiler -- dead code elimination */
+#endif
+
 #else	/* !CONFIG_MMU */
 
 #define anon_vma_init()		do {} while (0)
@@ -132,5 +143,6 @@ static inline int page_mkclean(struct pa
 #define SWAP_SUCCESS	0
 #define SWAP_AGAIN	1
 #define SWAP_FAIL	2
+#define SWAP_MLOCK	3
 
 #endif	/* _LINUX_RMAP_H */
Index: linux-2.6.24-rc6-mm1/mm/rmap.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/rmap.c	2008-01-08 12:08:02.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/rmap.c	2008-01-08 12:17:30.000000000 -0500
@@ -52,6 +52,8 @@
 
 #include <asm/tlbflush.h>
 
+#include "internal.h"
+
 struct kmem_cache *anon_vma_cachep;
 
 /* This must be called under the mmap_sem. */
@@ -284,10 +286,17 @@ static int page_referenced_one(struct pa
 	if (!pte)
 		goto out;
 
+	/*
+	 * Don't want to elevate referenced for mlocked page that gets this far,
+	 * in order that it progresses to try_to_unmap and is moved to the
+	 * noreclaim list.
+	 */
 	if (vma->vm_flags & VM_LOCKED) {
-		referenced++;
 		*mapcount = 1;	/* break early from loop */
-	} else if (ptep_clear_flush_young(vma, address, pte))
+		goto out_unmap;
+	}
+
+	if (ptep_clear_flush_young(vma, address, pte))
 		referenced++;
 
 	/* Pretend the page is referenced if the task has the
@@ -296,6 +305,7 @@ static int page_referenced_one(struct pa
 			rwsem_is_locked(&mm->mmap_sem))
 		referenced++;
 
+out_unmap:
 	(*mapcount)--;
 	pte_unmap_unlock(pte, ptl);
 out:
@@ -384,11 +394,6 @@ static int page_referenced_file(struct p
 		 */
 		if (mem_cont && (mm_cgroup(vma->vm_mm) != mem_cont))
 			continue;
-		if ((vma->vm_flags & (VM_LOCKED|VM_MAYSHARE))
-				  == (VM_LOCKED|VM_MAYSHARE)) {
-			referenced++;
-			break;
-		}
 		referenced += page_referenced_one(page, vma, &mapcount);
 		if (!mapcount)
 			break;
@@ -712,10 +717,15 @@ static int try_to_unmap_one(struct page 
 	 * If it's recently referenced (perhaps page_referenced
 	 * skipped over this mm) then we should reactivate it.
 	 */
-	if (!migration && ((vma->vm_flags & VM_LOCKED) ||
-			(ptep_clear_flush_young(vma, address, pte)))) {
-		ret = SWAP_FAIL;
-		goto out_unmap;
+	if (!migration) {
+		if (vma->vm_flags & VM_LOCKED) {
+			ret = SWAP_MLOCK;
+			goto out_unmap;
+		}
+		if (ptep_clear_flush_young(vma, address, pte)) {
+			ret = SWAP_FAIL;
+			goto out_unmap;
+		}
 	}
 
 	/* Nuke the page table entry. */
@@ -797,6 +807,10 @@ out:
  * For very sparsely populated VMAs this is a little inefficient - chances are
  * there there won't be many ptes located within the scan cluster.  In this case
  * maybe we could scan further - to the end of the pte page, perhaps.
+ *
+TODO:  still accurate with noreclaim infrastructure?
+ * Mlocked pages also aren't handled very well at the moment: they aren't
+ * moved off the LRU like they are for linear pages.
  */
 #define CLUSTER_SIZE	min(32*PAGE_SIZE, PMD_SIZE)
 #define CLUSTER_MASK	(~(CLUSTER_SIZE - 1))
@@ -868,10 +882,28 @@ static void try_to_unmap_cluster(unsigne
 	pte_unmap_unlock(pte - 1, ptl);
 }
 
-static int try_to_unmap_anon(struct page *page, int migration)
+/**
+ * try_to_unmap_anon - unmap or unlock anonymous page using the object-based
+ * rmap method
+ * @page: the page to unmap/unlock
+ * @unlock:  request for unlock rather than unmap [unlikely]
+ * @migration:  unmapping for migration - ignored if @unlock
+ *
+ * Find all the mappings of a page using the mapping pointer and the vma chains
+ * contained in the anon_vma struct it points to.
+ *
+ * This function is only called from try_to_unmap/try_to_unlock for
+ * anonymous pages.
+ * When called from try_to_unlock(), the mmap_sem of the mm containing the vma
+ * where the page was found will be held for write.  So, we won't recheck
+ * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
+ * 'LOCKED.
+ */
+static int try_to_unmap_anon(struct page *page, int unlock, int migration)
 {
 	struct anon_vma *anon_vma;
 	struct vm_area_struct *vma;
+	unsigned int mlocked = 0;
 	int ret = SWAP_AGAIN;
 
 	anon_vma = page_lock_anon_vma(page);
@@ -879,25 +911,53 @@ static int try_to_unmap_anon(struct page
 		return ret;
 
 	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
-		ret = try_to_unmap_one(page, vma, migration);
+		if (TRY_TO_UNLOCK && unlikely(unlock)) {
+			if (!(vma->vm_flags & VM_LOCKED))
+				continue;	/* must visit all vmas */
+			mlocked++;
+			break;			/* no need to look further */
+		} else
+			ret = try_to_unmap_one(page, vma, migration);
 		if (ret == SWAP_FAIL || !page_mapped(page))
 			break;
+		if (ret == SWAP_MLOCK) {
+			if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
+				if (vma->vm_flags & VM_LOCKED) {
+					mlock_vma_page(page);
+					mlocked++;
+				}
+				up_read(&vma->vm_mm->mmap_sem);
+			}
+		}
 	}
-
 	page_unlock_anon_vma(anon_vma);
+
+	if (mlocked)
+		ret = SWAP_MLOCK;
+	else if (ret == SWAP_MLOCK)
+		ret = SWAP_AGAIN;
+
 	return ret;
 }
 
 /**
- * try_to_unmap_file - unmap file page using the object-based rmap method
- * @page: the page to unmap
+ * try_to_unmap_file - unmap or unlock file page using the object-based
+ * rmap method
+ * @page: the page to unmap/unlock
+ * @unlock:  request for unlock rather than unmap [unlikely]
+ * @migration:  unmapping for migration - ignored if @unlock
  *
  * Find all the mappings of a page using the mapping pointer and the vma chains
  * contained in the address_space struct it points to.
  *
- * This function is only called from try_to_unmap for object-based pages.
+ * This function is only called from try_to_unmap/try_to_unlock for
+ * object-based pages.
+ * When called from try_to_unlock(), the mmap_sem of the mm containing the vma
+ * where the page was found will be held for write.  So, we won't recheck
+ * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
+ * 'LOCKED.
  */
-static int try_to_unmap_file(struct page *page, int migration)
+static int try_to_unmap_file(struct page *page, int unlock, int migration)
 {
 	struct address_space *mapping = page->mapping;
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -908,19 +968,46 @@ static int try_to_unmap_file(struct page
 	unsigned long max_nl_cursor = 0;
 	unsigned long max_nl_size = 0;
 	unsigned int mapcount;
+	unsigned int mlocked = 0;
 
 	spin_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
-		ret = try_to_unmap_one(page, vma, migration);
+		if (TRY_TO_UNLOCK && unlikely(unlock)) {
+			if (!(vma->vm_flags & VM_LOCKED))
+				continue;	/* must visit all vmas */
+			mlocked++;
+			break;			/* no need to look further */
+		} else
+			ret = try_to_unmap_one(page, vma, migration);
 		if (ret == SWAP_FAIL || !page_mapped(page))
 			goto out;
+		if (ret == SWAP_MLOCK) {
+			if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
+				if (vma->vm_flags & VM_LOCKED) {
+					mlock_vma_page(page);
+					mlocked++;
+				}
+				up_read(&vma->vm_mm->mmap_sem);
+			}
+			if (unlikely(unlock))
+				break;  /* stop on 1st mlocked vma */
+		}
 	}
 
+	if (mlocked)
+		goto out;
+
 	if (list_empty(&mapping->i_mmap_nonlinear))
 		goto out;
 
 	list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
 						shared.vm_set.list) {
+		if (TRY_TO_UNLOCK && unlikely(unlock)) {
+			if (!(vma->vm_flags & VM_LOCKED))
+				continue;	/* must visit all vmas */
+			mlocked++;
+			goto out;		/* no need to look further */
+		}
 		if ((vma->vm_flags & VM_LOCKED) && !migration)
 			continue;
 		cursor = (unsigned long) vma->vm_private_data;
@@ -955,8 +1042,6 @@ static int try_to_unmap_file(struct page
 	do {
 		list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
 						shared.vm_set.list) {
-			if ((vma->vm_flags & VM_LOCKED) && !migration)
-				continue;
 			cursor = (unsigned long) vma->vm_private_data;
 			while ( cursor < max_nl_cursor &&
 				cursor < vma->vm_end - vma->vm_start) {
@@ -981,6 +1066,10 @@ static int try_to_unmap_file(struct page
 		vma->vm_private_data = NULL;
 out:
 	spin_unlock(&mapping->i_mmap_lock);
+	if (mlocked)
+		ret = SWAP_MLOCK;
+	else if (ret == SWAP_MLOCK)
+		ret = SWAP_AGAIN;
 	return ret;
 }
 
@@ -995,6 +1084,7 @@ out:
  * SWAP_SUCCESS	- we succeeded in removing all mappings
  * SWAP_AGAIN	- we missed a mapping, try again later
  * SWAP_FAIL	- the page is unswappable
+ * SWAP_MLOCK	- page is mlocked.
  */
 int try_to_unmap(struct page *page, int migration)
 {
@@ -1003,12 +1093,32 @@ int try_to_unmap(struct page *page, int 
 	BUG_ON(!PageLocked(page));
 
 	if (PageAnon(page))
-		ret = try_to_unmap_anon(page, migration);
+		ret = try_to_unmap_anon(page, 0, migration);
 	else
-		ret = try_to_unmap_file(page, migration);
-
-	if (!page_mapped(page))
+		ret = try_to_unmap_file(page, 0, migration);
+	if (ret != SWAP_MLOCK && !page_mapped(page))
 		ret = SWAP_SUCCESS;
 	return ret;
 }
 
+#ifdef CONFIG_NORECLAIM_MLOCK
+/**
+ * try_to_unlock - Check page's rmap for other vma's holding page locked.
+ * @page: the page to be unlocked.   will be returned with PG_mlocked
+ * cleared if no vmas are VM_LOCKED.
+ *
+ * Return values are:
+ *
+ * SWAP_SUCCESS	- no vma's holding page locked.
+ * SWAP_MLOCK	- page is mlocked.
+ */
+int try_to_unlock(struct page *page)
+{
+	VM_BUG_ON(!PageLocked(page) || PageLRU(page));
+
+	if (PageAnon(page))
+		return(try_to_unmap_anon(page, 1, 0));
+	else
+		return(try_to_unmap_file(page, 1, 0));
+}
+#endif
Index: linux-2.6.24-rc6-mm1/mm/migrate.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/migrate.c	2008-01-08 12:17:10.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/migrate.c	2008-01-08 12:17:30.000000000 -0500
@@ -366,6 +366,9 @@ static void migrate_page_copy(struct pag
 		set_page_dirty(newpage);
  	}
 
+	if (TestClearPageMlocked(page))
+		SetPageMlocked(newpage);
+
 #ifdef CONFIG_SWAP
 	ClearPageSwapCache(page);
 #endif
Index: linux-2.6.24-rc6-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/page_alloc.c	2008-01-08 12:17:14.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/page_alloc.c	2008-01-08 12:17:30.000000000 -0500
@@ -257,6 +257,7 @@ static void bad_page(struct page *page)
 			1 << PG_swapcache |
 			1 << PG_writeback |
 			1 << PG_swapbacked |
+			1 << PG_mlocked |
 			1 << PG_buddy );
 	set_page_count(page, 0);
 	reset_page_mapcount(page);
@@ -488,6 +489,9 @@ static inline int free_pages_check(struc
 #ifdef CONFIG_NORECLAIM
 			1 << PG_noreclaim |
 #endif
+// TODO:  always trip this under heavy workloads.
+//  Why isn't this being cleared on last unmap/unlock?
+//  			1 << PG_mlocked |
 			1 << PG_buddy ))))
 		bad_page(page);
 	if (PageDirty(page))
@@ -644,6 +648,8 @@ static int prep_new_page(struct page *pa
 			1 << PG_writeback |
 			1 << PG_reserved |
 			1 << PG_swapbacked |
+//TODO:  why hitting this?
+//			1 << PG_mlocked |
 			1 << PG_buddy ))))
 		bad_page(page);
 
@@ -656,7 +662,9 @@ static int prep_new_page(struct page *pa
 
 	page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_readahead |
 			1 << PG_referenced | 1 << PG_arch_1 |
-			1 << PG_owner_priv_1 | 1 << PG_mappedtodisk);
+			1 << PG_owner_priv_1 | 1 << PG_mappedtodisk |
+//TODO take care of it here, for now.
+			1 << PG_mlocked );
 	set_page_private(page, 0);
 	set_page_refcounted(page);
 
Index: linux-2.6.24-rc6-mm1/mm/swap.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/swap.c	2008-01-08 12:17:10.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/swap.c	2008-01-08 12:17:30.000000000 -0500
@@ -346,7 +346,7 @@ void lru_add_drain(void)
 	put_cpu();
 }
 
-#ifdef CONFIG_NUMA
+#if defined(CONFIG_NUMA) || defined(CONFIG_NORECLAIM_MLOCK)
 static void lru_add_drain_per_cpu(struct work_struct *dummy)
 {
 	lru_add_drain();

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [patch 16/19] mlock vma pages under mmap_sem held for read
  2008-01-08 20:59 [patch 00/19] VM pageout scalability improvements Rik van Riel
                   ` (14 preceding siblings ...)
  2008-01-08 20:59 ` [patch 15/19] non-reclaimable mlocked pages Rik van Riel
@ 2008-01-08 20:59 ` Rik van Riel
  2008-01-08 20:59 ` [patch 17/19] handle mlocked pages during map/unmap and truncate Rik van Riel
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2008-01-08 20:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, Lee Schermerhorn

[-- Attachment #1: noreclaim-04.1a-lock-vma-pages-under-read-lock.patch --]
[-- Type: text/plain, Size: 6681 bytes --]

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series [no change]
+ fix function return types [void -> int] to fix build when
  not configured.

New in V2.

We need to hold the mmap_sem for write to initiatate mlock()/munlock()
because we may need to merge/split vmas.  However, this can lead to
very long lock hold times attempting to fault in a large memory region
to mlock it into memory.   This can hold off other faults against the
mm [multithreaded tasks] and other scans of the mm, such as via /proc.
To alleviate this, downgrade the mmap_sem to read mode during the 
population of the region for locking.  This is especially the case 
if we need to reclaim memory to lock down the region.  We [probably?]
don't need to do this for unlocking as all of the pages should be
resident--they're already mlocked.

Now, the caller's of the mlock functions [mlock_fixup() and 
mlock_vma_pages_range()] expect the mmap_sem to be returned in write
mode.  Changing all callers appears to be way too much effort at this
point.  So, restore write mode before returning.  Note that this opens
a window where the mmap list could change in a multithreaded process.
So, at least for mlock_fixup(), where we could be called in a loop over
multiple vmas, we check that a vma still exists at the start address
and that vma still covers the page range [start,end).  If not, we return
an error, -EAGAIN, and let the caller deal with it.

Return -EAGAIN from mlock_vma_pages_range() function and mlock_fixup()
if the vma at 'start' disappears or changes so that the page range
[start,end) is no longer contained in the vma.  Again, let the caller
deal with it.  Looks like only sys_remap_file_pages() [via mmap_region()]
should actually care.

With this patch, I no longer see processes like ps(1) blocked for seconds
or minutes at a time waiting for a large [multiple gigabyte] region to be
locked down.  

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by:  Rik van Riel <riel@redhat.com>

Index: linux-2.6.24-rc6-mm1/mm/mlock.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/mlock.c	2008-01-02 14:59:18.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/mlock.c	2008-01-02 15:06:32.000000000 -0500
@@ -200,6 +200,37 @@ int __mlock_vma_pages_range(struct vm_ar
 	return ret;
 }
 
+/**
+ * mlock_vma_pages_range
+ * @vma - vm area to mlock into memory
+ * @start - start address in @vma of range to mlock,
+ * @end   - end address in @vma of range
+ *
+ * Called with current->mm->mmap_sem held write locked.  Downgrade to read
+ * for faulting in pages.  This can take a looong time for large segments.
+ *
+ * We need to restore the mmap_sem to write locked because our callers'
+ * callers expect this.	 However, because the mmap could have changed
+ * [in a multi-threaded process], we need to recheck.
+ */
+int mlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end)
+{
+	struct mm_struct *mm = vma->vm_mm;
+
+	downgrade_write(&mm->mmap_sem);
+	__mlock_vma_pages_range(vma, start, end, 1);
+
+	up_read(&mm->mmap_sem);
+	/* vma can change or disappear */
+	down_write(&mm->mmap_sem);
+	vma = find_vma(mm, start);
+	/* non-NULL vma must contain @start, but need to check @end */
+	if (!vma ||  end > vma->vm_end)
+		return -EAGAIN;
+	return 0;
+}
+
 #else /* CONFIG_NORECLAIM_MLOCK */
 
 /*
@@ -266,14 +297,38 @@ success:
 	mm->locked_vm += nr_pages;
 
 	/*
-	 * vm_flags is protected by the mmap_sem held in write mode.
+	 * vm_flags is protected by the mmap_sem held for write.
 	 * It's okay if try_to_unmap_one unmaps a page just after we
 	 * set VM_LOCKED, __mlock_vma_pages_range will bring it back.
 	 */
 	vma->vm_flags = newflags;
 
+	/*
+	 * mmap_sem is currently held for write.  If we're locking pages,
+	 * downgrade the write lock to a read lock so that other faults,
+	 * mmap scans, ... while we fault in all pages.
+	 */
+	if (lock)
+		downgrade_write(&mm->mmap_sem);
+
 	__mlock_vma_pages_range(vma, start, end, lock);
 
+	if (lock) {
+		/*
+		 * Need to reacquire mmap sem in write mode, as our callers
+		 * expect this.  We have no support for atomically upgrading
+		 * a sem to write, so we need to check for changes while sem
+		 * is unlocked.
+		 */
+		up_read(&mm->mmap_sem);
+		/* vma can change or disappear */
+		down_write(&mm->mmap_sem);
+		*prev = find_vma(mm, start);
+		/* non-NULL *prev must contain @start, but need to check @end */
+		if (!(*prev) || end > (*prev)->vm_end)
+			ret = -EAGAIN;
+	}
+
 out:
 	if (ret == -ENOMEM)
 		ret = -EAGAIN;
Index: linux-2.6.24-rc6-mm1/mm/internal.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/internal.h	2008-01-02 14:58:22.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/internal.h	2008-01-02 15:07:37.000000000 -0500
@@ -61,24 +61,21 @@ extern int __mlock_vma_pages_range(struc
 /*
  * mlock all pages in this vma range.  For mmap()/mremap()/...
  */
-static inline void mlock_vma_pages_range(struct vm_area_struct *vma,
-			unsigned long start, unsigned long end)
-{
-	__mlock_vma_pages_range(vma, start, end, 1);
-}
+extern int mlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end);
 
 /*
  * munlock range of pages.   For munmap() and exit().
  * Always called to operate on a full vma that is being unmapped.
  */
-static inline void munlock_vma_pages_range(struct vm_area_struct *vma,
+static inline int munlock_vma_pages_range(struct vm_area_struct *vma,
 			unsigned long start, unsigned long end)
 {
 // TODO:  verify my assumption.  Should we just drop the start/end args?
 	VM_BUG_ON(start != vma->vm_start || end != vma->vm_end);
 
 	vma->vm_flags &= ~VM_LOCKED;    /* try_to_unlock() needs this */
-	__mlock_vma_pages_range(vma, start, end, 0);
+	return __mlock_vma_pages_range(vma, start, end, 0);
 }
 
 extern void clear_page_mlock(struct page *page);
@@ -90,10 +87,10 @@ static inline int is_mlocked_vma(struct 
 }
 static inline void clear_page_mlock(struct page *page) { }
 static inline void mlock_vma_page(struct page *page) { }
-static inline void mlock_vma_pages_range(struct vm_area_struct *vma,
-			unsigned long start, unsigned long end) { }
-static inline void munlock_vma_pages_range(struct vm_area_struct *vma,
-			unsigned long start, unsigned long end) { }
+static inline int mlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end) { return 0; }
+static inline int munlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end) { return 0; }
 
 #endif /* CONFIG_NORECLAIM_MLOCK */
 

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [patch 17/19] handle mlocked pages during map/unmap and truncate
  2008-01-08 20:59 [patch 00/19] VM pageout scalability improvements Rik van Riel
                   ` (15 preceding siblings ...)
  2008-01-08 20:59 ` [patch 16/19] mlock vma pages under mmap_sem held for read Rik van Riel
@ 2008-01-08 20:59 ` Rik van Riel
  2008-01-08 20:59 ` [patch 18/19] account mlocked pages Rik van Riel
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2008-01-08 20:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, Lee Schermerhorn

[-- Attachment #1: noreclaim-04.2-move-mlocked-pages-off-the-LRU.patch --]
[-- Type: text/plain, Size: 6985 bytes --]

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series [no changes]

V1 -> V2:
+  modified mmap.c:mmap_region() to return error if mlock_vma_pages_range()
   does.  This can only occur if the vma gets removed/changed while
   we're switching mmap_sem lock modes.   Most callers don't care, but
   sys_remap_file_pages() appears to.

Rework of Nick Piggins's "mm: move mlocked pages off the LRU" patch
-- part 2 0f 2.

Remove mlocked pages from the LRU using "NoReclaim infrastructure"
during mmap()/mremap().  Try to move back to normal LRU lists on
munmap() when last locked mapping removed.  Removed PageMlocked()
status when page truncated from file.


Originally Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by:  Rik van Riel <riel@redhat.com>

Index: linux-2.6.24-rc6-mm1/mm/mmap.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/mmap.c	2007-12-23 23:45:44.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/mmap.c	2008-01-02 15:08:07.000000000 -0500
@@ -32,6 +32,8 @@
 #include <asm/tlb.h>
 #include <asm/mmu_context.h>
 
+#include "internal.h"
+
 #ifndef arch_mmap_check
 #define arch_mmap_check(addr, len, flags)	(0)
 #endif
@@ -1201,9 +1203,13 @@ out:	
 	vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
 		mm->locked_vm += len >> PAGE_SHIFT;
-		make_pages_present(addr, addr + len);
-	}
-	if ((flags & MAP_POPULATE) && !(flags & MAP_NONBLOCK))
+		/*
+		 * makes pages present; downgrades, drops, requires mmap_sem
+		 */
+		error = mlock_vma_pages_range(vma, addr, addr + len);
+		if (error)
+			return error;	/* vma gone! */
+	} else if ((flags & MAP_POPULATE) && !(flags & MAP_NONBLOCK))
 		make_pages_present(addr, addr + len);
 	return addr;
 
@@ -1886,6 +1892,19 @@ int do_munmap(struct mm_struct *mm, unsi
 	vma = prev? prev->vm_next: mm->mmap;
 
 	/*
+	 * unlock any mlock()ed ranges before detaching vmas
+	 */
+	if (mm->locked_vm) {
+		struct vm_area_struct *tmp = vma;
+		while (tmp && tmp->vm_start < end) {
+			if (tmp->vm_flags & VM_LOCKED)
+				munlock_vma_pages_range(tmp,
+						 tmp->vm_start, tmp->vm_end);
+			tmp = tmp->vm_next;
+		}
+	}
+
+	/*
 	 * Remove the vma's, and unmap the actual pages
 	 */
 	detach_vmas_to_be_unmapped(mm, vma, prev, end);
@@ -2021,7 +2040,7 @@ out:
 	mm->total_vm += len >> PAGE_SHIFT;
 	if (flags & VM_LOCKED) {
 		mm->locked_vm += len >> PAGE_SHIFT;
-		make_pages_present(addr, addr + len);
+		mlock_vma_pages_range(vma, addr, addr + len);
 	}
 	return addr;
 }
@@ -2032,13 +2051,26 @@ EXPORT_SYMBOL(do_brk);
 void exit_mmap(struct mm_struct *mm)
 {
 	struct mmu_gather *tlb;
-	struct vm_area_struct *vma = mm->mmap;
+	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
 	unsigned long end;
 
 	/* mm's last user has gone, and its about to be pulled down */
 	arch_exit_mmap(mm);
 
+	if (mm->locked_vm) {
+		vma = mm->mmap;
+		while (vma) {
+			if (vma->vm_flags & VM_LOCKED)
+				munlock_vma_pages_range(vma,
+						vma->vm_start, vma->vm_end);
+			vma = vma->vm_next;
+		}
+	}
+
+	vma = mm->mmap;
+
+
 	lru_add_drain();
 	flush_cache_mm(mm);
 	tlb = tlb_gather_mmu(mm, 1);
Index: linux-2.6.24-rc6-mm1/mm/mremap.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/mremap.c	2007-12-23 23:45:36.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/mremap.c	2008-01-02 15:08:07.000000000 -0500
@@ -23,6 +23,8 @@
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
 
+#include "internal.h"
+
 static pmd_t *get_old_pmd(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pgd;
@@ -232,8 +234,8 @@ static unsigned long move_vma(struct vm_
 	if (vm_flags & VM_LOCKED) {
 		mm->locked_vm += new_len >> PAGE_SHIFT;
 		if (new_len > old_len)
-			make_pages_present(new_addr + old_len,
-					   new_addr + new_len);
+			mlock_vma_pages_range(vma, new_addr + old_len,
+						   new_addr + new_len);
 	}
 
 	return new_addr;
@@ -373,7 +375,7 @@ unsigned long do_mremap(unsigned long ad
 			vm_stat_account(mm, vma->vm_flags, vma->vm_file, pages);
 			if (vma->vm_flags & VM_LOCKED) {
 				mm->locked_vm += pages;
-				make_pages_present(addr + old_len,
+				mlock_vma_pages_range(vma, addr + old_len,
 						   addr + new_len);
 			}
 			ret = addr;
Index: linux-2.6.24-rc6-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/vmscan.c	2008-01-02 15:04:11.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/vmscan.c	2008-01-02 15:08:07.000000000 -0500
@@ -540,6 +540,10 @@ static unsigned long shrink_page_list(st
 				goto activate_locked;
 			case SWAP_AGAIN:
 				goto keep_locked;
+			case SWAP_MLOCK:
+				ClearPageActive(page);
+				SetPageNoreclaim(page);
+				goto keep_locked;	/* to noreclaim list */
 			case SWAP_SUCCESS:
 				; /* try to free the page below */
 			}
Index: linux-2.6.24-rc6-mm1/mm/filemap.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/filemap.c	2008-01-02 12:37:38.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/filemap.c	2008-01-02 15:08:07.000000000 -0500
@@ -2525,8 +2525,16 @@ generic_file_direct_IO(int rw, struct ki
 	if (rw == WRITE) {
 		write_len = iov_length(iov, nr_segs);
 		end = (offset + write_len - 1) >> PAGE_CACHE_SHIFT;
-	       	if (mapping_mapped(mapping))
+		if (mapping_mapped(mapping)) {
+			/*
+			 * Calling unmap_mapping_range like this is wrong,
+			 * because it can lead to mlocked pages being
+			 * discarded (this is true even before the Noreclaim
+			 * mlock work). direct-IO vs pagecache is a load of
+			 * junk anyway, so who cares.
+			 */
 			unmap_mapping_range(mapping, offset, write_len, 0);
+		}
 	}
 
 	retval = filemap_write_and_wait(mapping);
Index: linux-2.6.24-rc6-mm1/mm/truncate.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/truncate.c	2007-12-23 23:45:44.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/truncate.c	2008-01-02 15:08:07.000000000 -0500
@@ -18,6 +18,7 @@
 #include <linux/task_io_accounting_ops.h>
 #include <linux/buffer_head.h>	/* grr. try_to_release_page,
 				   do_invalidatepage */
+#include "internal.h"
 
 
 /**
@@ -104,6 +105,7 @@ truncate_complete_page(struct address_sp
 	cancel_dirty_page(page, PAGE_CACHE_SIZE);
 
 	remove_from_page_cache(page);
+	clear_page_mlock(page);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);
 	page_cache_release(page);	/* pagecache ref */
@@ -128,6 +130,7 @@ invalidate_complete_page(struct address_
 	if (PagePrivate(page) && !try_to_release_page(page, 0))
 		return 0;
 
+	clear_page_mlock(page);
 	ret = remove_mapping(mapping, page);
 
 	return ret;
@@ -354,6 +357,7 @@ invalidate_complete_page2(struct address
 	if (PageDirty(page))
 		goto failed;
 
+	clear_page_mlock(page);
 	BUG_ON(PagePrivate(page));
 	__remove_from_page_cache(page);
 	write_unlock_irq(&mapping->tree_lock);

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [patch 18/19] account mlocked pages
  2008-01-08 20:59 [patch 00/19] VM pageout scalability improvements Rik van Riel
                   ` (16 preceding siblings ...)
  2008-01-08 20:59 ` [patch 17/19] handle mlocked pages during map/unmap and truncate Rik van Riel
@ 2008-01-08 20:59 ` Rik van Riel
  2008-01-11 12:51   ` Balbir Singh
  2008-01-08 20:59 ` [patch 19/19] cull non-reclaimable anon pages from the LRU at fault time Rik van Riel
                   ` (4 subsequent siblings)
  22 siblings, 1 reply; 88+ messages in thread
From: Rik van Riel @ 2008-01-08 20:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, Nick Piggin, Lee Schermerhorn

[-- Attachment #1: noreclaim-04.3-account-mlocked-pages.patch --]
[-- Type: text/plain, Size: 6895 bytes --]

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series
+ fix definitions of NR_MLOCK to fix build errors when not configured.

V1 -> V2:
+  new in V2 -- pulled in & reworked from Nick's previous series

  From: Nick Piggin <npiggin@suse.de>
  To: Linux Memory Management <linux-mm@kvack.org>
  Cc: Nick Piggin <npiggin@suse.de>, Andrew Morton <akpm@osdl.org>
  Subject: [patch 4/4] mm: account mlocked pages
  Date:	Mon, 12 Mar 2007 07:39:14 +0100 (CET)

Add NR_MLOCK zone page state, which provides a (conservative) count of
mlocked pages (actually, the number of mlocked pages moved off the LRU).

Reworked by lts to fit in with the modified mlock page support in the
Reclaim Scalability series.  I don't know whether we'll want to keep
these stats in the long run, but during testing of this series, I find
them useful.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>


Index: linux-2.6.24-rc6-mm1/drivers/base/node.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/drivers/base/node.c	2008-01-02 17:08:16.000000000 -0500
+++ linux-2.6.24-rc6-mm1/drivers/base/node.c	2008-01-02 17:08:17.000000000 -0500
@@ -55,6 +55,9 @@ static ssize_t node_read_meminfo(struct 
 		       "Node %d Inactive(file): %8lu kB\n"
 #ifdef CONFIG_NORECLAIM
 		       "Node %d Noreclaim:    %8lu kB\n"
+#ifdef CONFIG_NORECLAIM_MLOCK
+		       "Node %d Mlocked:       %8lu kB\n"
+#endif
 #endif
 #ifdef CONFIG_HIGHMEM
 		       "Node %d HighTotal:      %8lu kB\n"
@@ -82,6 +85,9 @@ static ssize_t node_read_meminfo(struct 
 		       nid, node_page_state(nid, NR_INACTIVE_FILE),
 #ifdef CONFIG_NORECLAIM
 		       nid, node_page_state(nid, NR_NORECLAIM),
+#ifdef CONFIG_NORECLAIM_MLOCK
+		       nid, K(node_page_state(nid, NR_MLOCK)),
+#endif
 #endif
 #ifdef CONFIG_HIGHMEM
 		       nid, K(i.totalhigh),
Index: linux-2.6.24-rc6-mm1/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/fs/proc/proc_misc.c	2008-01-02 16:28:35.000000000 -0500
+++ linux-2.6.24-rc6-mm1/fs/proc/proc_misc.c	2008-01-02 17:08:17.000000000 -0500
@@ -164,6 +164,9 @@ static int meminfo_read_proc(char *page,
 		"Inactive(file): %8lu kB\n"
 #ifdef CONFIG_NORECLAIM
 		"Noreclaim:    %8lu kB\n"
+#ifdef CONFIG_NORECLAIM_MLOCK
+		"Mlocked:      %8lu kB\n"
+#endif
 #endif
 #ifdef CONFIG_HIGHMEM
 		"HighTotal:      %8lu kB\n"
@@ -199,6 +202,9 @@ static int meminfo_read_proc(char *page,
 		K(global_page_state(NR_INACTIVE_FILE)),
 #ifdef CONFIG_NORECLAIM
 		K(global_page_state(NR_NORECLAIM)),
+#ifdef CONFIG_NORECLAIM_MLOCK
+		K(global_page_state(NR_MLOCK)),
+#endif
 #endif
 #ifdef CONFIG_HIGHMEM
 		K(i.totalhigh),
Index: linux-2.6.24-rc6-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.24-rc6-mm1.orig/include/linux/mmzone.h	2008-01-02 16:28:35.000000000 -0500
+++ linux-2.6.24-rc6-mm1/include/linux/mmzone.h	2008-01-02 17:08:17.000000000 -0500
@@ -86,8 +86,12 @@ enum zone_stat_item {
 	NR_ACTIVE_FILE,		/*  "     "     "   "       "           */
 #ifdef CONFIG_NORECLAIM
 	NR_NORECLAIM,	/*  "     "     "   "       "         */
+#ifdef CONFIG_NORECLAIM_MLOCK
+	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
+#endif
 #else
-	NR_NORECLAIM=NR_ACTIVE_FILE, /* avoid compiler errors in dead code */
+	NR_NORECLAIM=NR_ACTIVE_FILE,	/* avoid compiler errors in dead code */
+	NR_MLOCK=NR_ACTIVE_FILE,	/* avoid compiler errors... */
 #endif
 	NR_ANON_PAGES,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
Index: linux-2.6.24-rc6-mm1/mm/mlock.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/mlock.c	2008-01-02 17:08:17.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/mlock.c	2008-01-02 17:08:17.000000000 -0500
@@ -60,11 +60,11 @@ void clear_page_mlock(struct page *page)
 {
 	BUG_ON(!PageLocked(page));
 
-	if (likely(!PageMlocked(page)))
-		return;
-	ClearPageMlocked(page);
-	if (!isolate_lru_page(page))
-		putback_lru_page(page);
+	if (unlikely(TestClearPageMlocked(page))) {
+		dec_zone_page_state(page, NR_MLOCK);
+		if (!isolate_lru_page(page))
+			putback_lru_page(page);
+	}
 }
 
 /*
@@ -75,8 +75,11 @@ void mlock_vma_page(struct page *page)
 {
 	BUG_ON(!PageLocked(page));
 
-	if (!TestSetPageMlocked(page) && !isolate_lru_page(page))
+	if (!TestSetPageMlocked(page)) {
+		inc_zone_page_state(page, NR_MLOCK);
+		if (!isolate_lru_page(page))
 			putback_lru_page(page);
+	}
 }
 
 /*
@@ -98,10 +101,22 @@ static void munlock_vma_page(struct page
 {
 	BUG_ON(!PageLocked(page));
 
-	if (TestClearPageMlocked(page) && !isolate_lru_page(page)) {
-		if (try_to_unlock(page) == SWAP_MLOCK)
-			SetPageMlocked(page);	/* still VM_LOCKED */
-		putback_lru_page(page);
+	if (TestClearPageMlocked(page)) {
+		dec_zone_page_state(page, NR_MLOCK);
+		if (!isolate_lru_page(page)) {
+			if (try_to_unlock(page) == SWAP_MLOCK) {
+				SetPageMlocked(page);	/* still VM_LOCKED */
+				inc_zone_page_state(page, NR_MLOCK);
+			}
+			putback_lru_page(page);
+		}
+		/*
+		 * Else we lost the race.  let try_to_unmap() deal with it.
+		 * At least we get the page state and mlock stats right.
+		 * However, page is still on the noreclaim list.  We'll fix
+		 * that up when the page is eventually freed or we scan the
+		 * noreclaim list.
+		 */
 	}
 }
 
@@ -118,7 +133,8 @@ int is_mlocked_vma(struct vm_area_struct
 	if (likely(!(vma->vm_flags & VM_LOCKED)))
 		return 0;
 
-	SetPageMlocked(page);
+	if (!TestSetPageMlocked(page))
+		inc_zone_page_state(page, NR_MLOCK);
 	return 1;
 }
 
Index: linux-2.6.24-rc6-mm1/mm/migrate.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/migrate.c	2008-01-02 17:08:17.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/migrate.c	2008-01-02 17:08:17.000000000 -0500
@@ -366,8 +366,15 @@ static void migrate_page_copy(struct pag
 		set_page_dirty(newpage);
  	}
 
-	if (TestClearPageMlocked(page))
+	if (TestClearPageMlocked(page)) {
+		unsigned long flags;
+
+		local_irq_save(flags);
+		__dec_zone_page_state(page, NR_MLOCK);
 		SetPageMlocked(newpage);
+		__inc_zone_page_state(newpage, NR_MLOCK);
+		local_irq_restore(flags);
+	}
 
 #ifdef CONFIG_SWAP
 	ClearPageSwapCache(page);
Index: linux-2.6.24-rc6-mm1/mm/vmstat.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/vmstat.c	2008-01-02 16:01:21.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/vmstat.c	2008-01-02 17:09:20.000000000 -0500
@@ -693,6 +693,9 @@ static const char * const vmstat_text[] 
 #ifdef CONFIG_NORECLAIM
 	"nr_noreclaim",
 #endif
+#ifdef CONFIG_NORECLAIM_MLOCK
+	"nr_mlock",
+#endif
 	"nr_anon_pages",
 	"nr_mapped",
 	"nr_file_pages",

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [patch 19/19] cull non-reclaimable anon pages from the LRU at fault time
  2008-01-08 20:59 [patch 00/19] VM pageout scalability improvements Rik van Riel
                   ` (17 preceding siblings ...)
  2008-01-08 20:59 ` [patch 18/19] account mlocked pages Rik van Riel
@ 2008-01-08 20:59 ` Rik van Riel
  2008-01-10  4:39 ` [patch 00/19] VM pageout scalability improvements Mike Snitzer
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2008-01-08 20:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, Lee Schermerhorn

[-- Attachment #1: noreclaim-07-cull-nonreclaimable-anon-pages-in-fault-path.patch --]
[-- Type: text/plain, Size: 3840 bytes --]

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series.

V1 -> V2:
+  no changes

Optional part of "noreclaim infrastructure"

In the fault paths that install new anonymous pages, check whether
the page is reclaimable or not using lru_cache_add_active_or_noreclaim().
If the page is reclaimable, just add it to the active lru list [via
the pagevec cache], else add it to the noreclaim list.  

This "proactive" culling in the fault path mimics the handling of
mlocked pages in Nick Piggin's series to keep mlocked pages off
the lru lists.

Notes:

1) This patch is optional--e.g., if one is concerned about the
   additional test in the fault path.  We can defer the moving of
   nonreclaimable pages until when vmscan [shrink_*_list()]
   encounters them.  Vmscan will only need to handle such pages
   once.

2) I moved the call to page_add_new_anon_rmap() to before the test
   for page_reclaimable() and thus before the calls to
   lru_cache_add_{active|noreclaim}(), so that page_reclaimable()
   could recognize the page as anon, thus obviating, I think, the
   vma arg to page_reclaimable() for this purpose.  Still needed for
   culling mlocked pages in fault path [later patch].
   TBD:   I think this reordering is OK, but the previous order may
   have existed to close some obscure race?

3) With this and other patches above installed, any anon pages
   created before swap is added--e.g., init's anonymous memory--
   will be declared non-reclaimable and placed on the noreclaim
   LRU list.  Need to add mechanism to bring such pages back when
   swap becomes available.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by:  Rik van Riel <riel@redhat.com>

Index: linux-2.6.24-rc6-mm1/mm/memory.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/memory.c	2008-01-02 12:37:38.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/memory.c	2008-01-02 15:14:31.000000000 -0500
@@ -1665,7 +1665,7 @@ gotten:
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
 		SetPageSwapBacked(new_page);
-		lru_cache_add_active_anon(new_page);
+		lru_cache_add_active_or_noreclaim(new_page, vma);
 		page_add_new_anon_rmap(new_page, vma, address);
 
 		/* Free the old page.. */
@@ -2133,7 +2133,7 @@ static int do_anonymous_page(struct mm_s
 		goto release;
 	inc_mm_counter(mm, anon_rss);
 	SetPageSwapBacked(page);
-	lru_cache_add_active_anon(page);
+	lru_cache_add_active_or_noreclaim(page, vma);
 	page_add_new_anon_rmap(page, vma, address);
 	set_pte_at(mm, address, page_table, entry);
 
@@ -2285,10 +2285,10 @@ static int __do_fault(struct mm_struct *
 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		set_pte_at(mm, address, page_table, entry);
 		if (anon) {
-                        inc_mm_counter(mm, anon_rss);
+			inc_mm_counter(mm, anon_rss);
 			SetPageSwapBacked(page);
-                        lru_cache_add_active_anon(page);
-                        page_add_new_anon_rmap(page, vma, address);
+			lru_cache_add_active_or_noreclaim(page, vma);
+			page_add_new_anon_rmap(page, vma, address);
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(page);
Index: linux-2.6.24-rc6-mm1/mm/swap_state.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/swap_state.c	2008-01-02 12:37:52.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/swap_state.c	2008-01-02 15:14:31.000000000 -0500
@@ -300,7 +300,10 @@ struct page *read_swap_cache_async(swp_e
 			/*
 			 * Initiate read into locked page and return.
 			 */
-			lru_cache_add_anon(new_page);
+			if (!page_reclaimable(new_page, vma))
+				lru_cache_add_noreclaim(new_page);
+			else
+				lru_cache_add_anon(new_page);
 			swap_readpage(NULL, new_page);
 			return new_page;
 		}

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 01/19] move isolate_lru_page() to vmscan.c
  2008-01-08 20:59 ` [patch 01/19] move isolate_lru_page() to vmscan.c Rik van Riel
@ 2008-01-08 22:03   ` Christoph Lameter
  0 siblings, 0 replies; 88+ messages in thread
From: Christoph Lameter @ 2008-01-08 22:03 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm, Nick Piggin, Lee Schermerhorn

Reviewed-by: Christoph Lameter <clameter@sgi.com>



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 02/19] free swap space on swap-in/activation
  2008-01-08 20:59 ` [patch 02/19] free swap space on swap-in/activation Rik van Riel
@ 2008-01-08 22:10   ` Christoph Lameter
  0 siblings, 0 replies; 88+ messages in thread
From: Christoph Lameter @ 2008-01-08 22:10 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm, Lee Schermerhorn

On Tue, 8 Jan 2008, Rik van Riel wrote:

> Free swap cache entries when swapping in pages if vm_swap_full()
> [swap space > 1/2 used?].  Uses new pagevec to reduce pressure
> on locks.

The pagevec function would be faster if the swap removal could be batched 
inside of the locks taken in remove_exclusive_swap_page.

Reviewed-by: Christoph Lameter <clameter@sgi.com>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 03/19] define page_file_cache() function
  2008-01-08 20:59 ` [patch 03/19] define page_file_cache() function Rik van Riel
@ 2008-01-08 22:18   ` Christoph Lameter
  2008-01-08 22:28     ` Rik van Riel
  0 siblings, 1 reply; 88+ messages in thread
From: Christoph Lameter @ 2008-01-08 22:18 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm, Lee Schermerhorn

On Tue, 8 Jan 2008, Rik van Riel wrote:

> Define page_file_cache() function to answer the question:
> 	is page backed by a file?

> +static inline int page_file_cache(struct page *page)
> +{
> +	if (PageSwapBacked(page))
> +		return 0;

Could we call this PageNotFileBacked or so? PageSwapBacked is true for 
pages that are RAM based. Its a bit confusing.

> Index: linux-2.6.24-rc6-mm1/mm/migrate.c
> ===================================================================
> --- linux-2.6.24-rc6-mm1.orig/mm/migrate.c	2008-01-02 12:37:14.000000000 -0500
> +++ linux-2.6.24-rc6-mm1/mm/migrate.c	2008-01-02 12:37:22.000000000 -0500
> @@ -546,6 +546,8 @@ static int move_to_new_page(struct page 
>  	/* Prepare mapping for the new page.*/
>  	newpage->index = page->index;
>  	newpage->mapping = page->mapping;
> +	if (PageSwapBacked(page))
> +		SetPageSwapBacked(newpage);
>  
>  	mapping = page_mapping(page);
>  	if (!mapping)

That hunk belongs into migrate_page_copy()? Or is there a reason that we 
need this flag that early?

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-08 20:59 ` [patch 05/19] split LRU lists into anon & file sets Rik van Riel
@ 2008-01-08 22:22   ` Christoph Lameter
  2008-01-08 22:36     ` Rik van Riel
  2008-01-09  4:41   ` KAMEZAWA Hiroyuki
                     ` (5 subsequent siblings)
  6 siblings, 1 reply; 88+ messages in thread
From: Christoph Lameter @ 2008-01-08 22:22 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm, Lee Schermerhorn

It may be good to coordinate this with Andrea Arcangeli's OOM fixes.

Also would it be possible to create generic functions that can move pages 
in pagevecs to an arbitrary lru list?



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 08/19] add newly swapped in pages to the inactive list
  2008-01-08 20:59 ` [patch 08/19] add newly swapped in pages to the inactive list Rik van Riel
@ 2008-01-08 22:28   ` Christoph Lameter
  0 siblings, 0 replies; 88+ messages in thread
From: Christoph Lameter @ 2008-01-08 22:28 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm

On Tue, 8 Jan 2008, Rik van Riel wrote:

> In short, this patch needs testing.

Seems to be a good idea.

Reviewed-by: Christoph Lameter <clameter@sgi.com>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 03/19] define page_file_cache() function
  2008-01-08 22:18   ` Christoph Lameter
@ 2008-01-08 22:28     ` Rik van Riel
  2008-01-09  4:26       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 88+ messages in thread
From: Rik van Riel @ 2008-01-08 22:28 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, linux-mm, Lee Schermerhorn

On Tue, 8 Jan 2008 14:18:40 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> On Tue, 8 Jan 2008, Rik van Riel wrote:
> 
> > Define page_file_cache() function to answer the question:
> > 	is page backed by a file?
> 
> > +static inline int page_file_cache(struct page *page)
> > +{
> > +	if (PageSwapBacked(page))
> > +		return 0;
> 
> Could we call this PageNotFileBacked or so? PageSwapBacked is true for 
> pages that are RAM based. Its a bit confusing.

PageNotFileBacked confuses me a little, since shared memory segments live
in tmpfs and are kinda sorta file backed, but go to swap instead of to a
filesystem when there is memory pressure.

I'm always open to better naming ideas, though.

> > Index: linux-2.6.24-rc6-mm1/mm/migrate.c
> > ===================================================================
> > --- linux-2.6.24-rc6-mm1.orig/mm/migrate.c	2008-01-02 12:37:14.000000000 -0500
> > +++ linux-2.6.24-rc6-mm1/mm/migrate.c	2008-01-02 12:37:22.000000000 -0500
> > @@ -546,6 +546,8 @@ static int move_to_new_page(struct page 
> >  	/* Prepare mapping for the new page.*/
> >  	newpage->index = page->index;
> >  	newpage->mapping = page->mapping;
> > +	if (PageSwapBacked(page))
> > +		SetPageSwapBacked(newpage);
> >  
> >  	mapping = page_mapping(page);
> >  	if (!mapping)
> 
> That hunk belongs into migrate_page_copy()? Or is there a reason that we 
> need this flag that early?

We want the page added to the right LRU list.  I'll have to re-read the
migration code to make sure whether the above can or cannot be done in
migrate_page_copy() - I agree it would fit in better there.

Thanks for the suggestions.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 09/19] (NEW) more aggressively use lumpy reclaim
  2008-01-08 20:59 ` [patch 09/19] (NEW) more aggressively use lumpy reclaim Rik van Riel
@ 2008-01-08 22:30   ` Christoph Lameter
  2008-01-14 15:28     ` Mel Gorman
  0 siblings, 1 reply; 88+ messages in thread
From: Christoph Lameter @ 2008-01-08 22:30 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm, mel

On Tue, 8 Jan 2008, Rik van Riel wrote:

> If normal pageout does not result in contiguous free pages for
> kernel stacks, fall back to lumpy reclaim instead of failing fork
> or doing excessive pageout IO.

Good. Ccing Mel. This is going to help higher order pages which is useful 
for a couple of other projects.

Reviewed-by: Christoph Lameter <clameter@sgi.com>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-08 22:22   ` Christoph Lameter
@ 2008-01-08 22:36     ` Rik van Riel
  2008-01-08 22:42       ` Christoph Lameter
  0 siblings, 1 reply; 88+ messages in thread
From: Rik van Riel @ 2008-01-08 22:36 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, linux-mm, Lee Schermerhorn

On Tue, 8 Jan 2008 14:22:38 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> It may be good to coordinate this with Andrea Arcangeli's OOM fixes.

Probably.  With the split LRU lists (and the noreclaim LRUs), we can
simplify the OOM test a lot:

If free + file_active + file_inactive <= zone->pages_high and swap
space is full, the system is doomed.  No need for guesswork.

> Also would it be possible to create generic functions that can move pages 
> in pagevecs to an arbitrary lru list?

What would you use those functions for?

Or am I simply misunderstanding your idea?

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-08 22:36     ` Rik van Riel
@ 2008-01-08 22:42       ` Christoph Lameter
  2008-01-09  2:45         ` Rik van Riel
  0 siblings, 1 reply; 88+ messages in thread
From: Christoph Lameter @ 2008-01-08 22:42 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm, Lee Schermerhorn

On Tue, 8 Jan 2008, Rik van Riel wrote:

> > Also would it be possible to create generic functions that can move pages 
> > in pagevecs to an arbitrary lru list?
> 
> What would you use those functions for?

We keep on duplicating the pagevec lru operation functions in mm/swap.c. 
Some generic stuff would reduce the code size.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-08 22:42       ` Christoph Lameter
@ 2008-01-09  2:45         ` Rik van Riel
  0 siblings, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2008-01-09  2:45 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, linux-mm, Lee Schermerhorn

On Tue, 8 Jan 2008 14:42:03 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:
> On Tue, 8 Jan 2008, Rik van Riel wrote:
> 
> > > Also would it be possible to create generic functions that can move pages 
> > > in pagevecs to an arbitrary lru list?
> > 
> > What would you use those functions for?
> 
> We keep on duplicating the pagevec lru operation functions in mm/swap.c. 
> Some generic stuff would reduce the code size.

Good idea.  Added to my TODO list :)

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 07/19] (NEW) add some sanity checks to get_scan_ratio
  2008-01-08 20:59 ` [patch 07/19] (NEW) add some sanity checks to get_scan_ratio Rik van Riel
@ 2008-01-09  4:16   ` KAMEZAWA Hiroyuki
  2008-01-09 12:53     ` Rik van Riel
  0 siblings, 1 reply; 88+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-01-09  4:16 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm

On Tue, 08 Jan 2008 15:59:46 -0500
Rik van Riel <riel@redhat.com> wrote:

> The access ratio based scan rate determination in get_scan_ratio
> works ok in most situations, but needs to be corrected in some
> corner cases:
> - if we run out of swap space, do not bother scanning the anon LRUs
> - if we have already freed all of the page cache, we need to scan
>   the anon LRUs
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> 
> Index: linux-2.6.24-rc6-mm1/mm/vmscan.c
> ===================================================================
> --- linux-2.6.24-rc6-mm1.orig/mm/vmscan.c	2008-01-07 17:33:50.000000000 -0500
> +++ linux-2.6.24-rc6-mm1/mm/vmscan.c	2008-01-07 17:57:49.000000000 -0500
> @@ -1182,7 +1182,7 @@ static unsigned long shrink_list(enum lr
>  static void get_scan_ratio(struct zone *zone, struct scan_control * sc,
>  					unsigned long *percent)
>  {
> -	unsigned long anon, file;
> +	unsigned long anon, file, free;
>  	unsigned long anon_prio, file_prio;
>  	unsigned long rotate_sum;
>  	unsigned long ap, fp;
> @@ -1230,6 +1230,20 @@ static void get_scan_ratio(struct zone *
>  	else if (fp > 100)
>  		fp = 100;
>  	percent[1] = fp;
> +
> +	free = zone_page_state(zone, NR_FREE_PAGES);
> +
> +	/*
> +	 * If we have no swap space, do not bother scanning anon pages
> +	 */
> +	if (nr_swap_pages <= 0)
> +		percent[0] = 0;
Doesn't this mean that swap-cache in ACTIVE_ANON_LIST is not scanned ?
Or swap-cache is in File-Cache list ?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 03/19] define page_file_cache() function
  2008-01-08 22:28     ` Rik van Riel
@ 2008-01-09  4:26       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 88+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-01-09  4:26 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Christoph Lameter, linux-kernel, linux-mm, Lee Schermerhorn

On Tue, 8 Jan 2008 17:28:56 -0500
Rik van Riel <riel@redhat.com> wrote:

> On Tue, 8 Jan 2008 14:18:40 -0800 (PST)
> Christoph Lameter <clameter@sgi.com> wrote:
> 
> > On Tue, 8 Jan 2008, Rik van Riel wrote:
> > 
> > > Define page_file_cache() function to answer the question:
> > > 	is page backed by a file?
> > 
> > > +static inline int page_file_cache(struct page *page)
> > > +{
> > > +	if (PageSwapBacked(page))
> > > +		return 0;
> > 
> > Could we call this PageNotFileBacked or so? PageSwapBacked is true for 
> > pages that are RAM based. Its a bit confusing.
> 
> PageNotFileBacked confuses me a little, since shared memory segments live
> in tmpfs and are kinda sorta file backed, but go to swap instead of to a
> filesystem when there is memory pressure.
> 
How about PageIsNotCache() ? :)

When a page is a cache, there is an original data somewhere and can be dropped
out.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-08 20:59 ` [patch 05/19] split LRU lists into anon & file sets Rik van Riel
  2008-01-08 22:22   ` Christoph Lameter
@ 2008-01-09  4:41   ` KAMEZAWA Hiroyuki
  2008-01-10  2:21     ` Balbir Singh
  2008-01-10  2:28   ` KAMEZAWA Hiroyuki
                     ` (4 subsequent siblings)
  6 siblings, 1 reply; 88+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-01-09  4:41 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm, Lee Schermerhorn, balbir

I like this patch set thank you.

On Tue, 08 Jan 2008 15:59:44 -0500
Rik van Riel <riel@redhat.com> wrote:
> Index: linux-2.6.24-rc6-mm1/mm/memcontrol.c
> ===================================================================
> --- linux-2.6.24-rc6-mm1.orig/mm/memcontrol.c	2008-01-07 11:55:09.000000000 -0500
> +++ linux-2.6.24-rc6-mm1/mm/memcontrol.c	2008-01-07 17:32:53.000000000 -0500
<snip>

> -enum mem_cgroup_zstat_index {
> -	MEM_CGROUP_ZSTAT_ACTIVE,
> -	MEM_CGROUP_ZSTAT_INACTIVE,
> -
> -	NR_MEM_CGROUP_ZSTAT,
> -};
> -
>  struct mem_cgroup_per_zone {
>  	/*
>  	 * spin_lock to protect the per cgroup LRU
>  	 */
>  	spinlock_t		lru_lock;
> -	struct list_head	active_list;
> -	struct list_head	inactive_list;
> -	unsigned long count[NR_MEM_CGROUP_ZSTAT];
> +	struct list_head	lists[NR_LRU_LISTS];
> +	unsigned long		count[NR_LRU_LISTS];
>  };
>  /* Macro for accessing counter */
>  #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
> @@ -160,6 +152,7 @@ struct page_cgroup {
>  };
>  #define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache */
>  #define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* page is active in this cgroup */
> +#define PAGE_CGROUP_FLAG_FILE	(0x4)	/* page is file system backed */
> 

Now, we don't have control_type and a feature for accounting only CACHE.
Balbir-san, do you have some new plan ?

BTW, is it better to use PageSwapBacked(pc->page) rather than adding a new flag
PAGE_CGROUP_FLAG_FILE ?


PAGE_CGROUP_FLAG_ACTIVE is used because global reclaim can change
ACTIVE/INACTIVE attribute without accessing memory cgroup.
(Then, we cannot trust PageActive(pc->page))

ANON <-> FILE attribute can be changed dinamically (after added to LRU) ?

If no, using page_file_cache(pc->page) will be easy.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 07/19] (NEW) add some sanity checks to get_scan_ratio
  2008-01-09  4:16   ` KAMEZAWA Hiroyuki
@ 2008-01-09 12:53     ` Rik van Riel
  0 siblings, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2008-01-09 12:53 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel, linux-mm

On Wed, 9 Jan 2008 13:16:42 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> > +
> > +	free = zone_page_state(zone, NR_FREE_PAGES);
> > +
> > +	/*
> > +	 * If we have no swap space, do not bother scanning anon pages
> > +	 */
> > +	if (nr_swap_pages <= 0)
> > +		percent[0] = 0;
> Doesn't this mean that swap-cache in ACTIVE_ANON_LIST is not scanned ?
> Or swap-cache is in File-Cache list ?

You are right, the swap cache will not be scanned once we run
completely out of swap space.  To compensate for that, this
patch series has a patch that does scanning of swap cache and
freeing of swap space used by pages on the LRU list while there
is still space free.

Scanning all of the anon LRU lists could be a lot of work for
very little gain.  A typical large server will have 32GB or
more of RAM, but only the default 2GB of swap.

All we accomplish by scanning the anonymous memory on a system
like that (once swap is full) is eating up CPU time and causing
lock contention.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-09  4:41   ` KAMEZAWA Hiroyuki
@ 2008-01-10  2:21     ` Balbir Singh
  2008-01-10  2:36       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 88+ messages in thread
From: Balbir Singh @ 2008-01-10  2:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Rik van Riel, linux-kernel, linux-mm, Lee Schermerhorn

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-01-09 13:41:32]:

> I like this patch set thank you.
> 
> On Tue, 08 Jan 2008 15:59:44 -0500
> Rik van Riel <riel@redhat.com> wrote:
> > Index: linux-2.6.24-rc6-mm1/mm/memcontrol.c
> > ===================================================================
> > --- linux-2.6.24-rc6-mm1.orig/mm/memcontrol.c	2008-01-07 11:55:09.000000000 -0500
> > +++ linux-2.6.24-rc6-mm1/mm/memcontrol.c	2008-01-07 17:32:53.000000000 -0500
> <snip>
> 
> > -enum mem_cgroup_zstat_index {
> > -	MEM_CGROUP_ZSTAT_ACTIVE,
> > -	MEM_CGROUP_ZSTAT_INACTIVE,
> > -
> > -	NR_MEM_CGROUP_ZSTAT,
> > -};
> > -
> >  struct mem_cgroup_per_zone {
> >  	/*
> >  	 * spin_lock to protect the per cgroup LRU
> >  	 */
> >  	spinlock_t		lru_lock;
> > -	struct list_head	active_list;
> > -	struct list_head	inactive_list;
> > -	unsigned long count[NR_MEM_CGROUP_ZSTAT];
> > +	struct list_head	lists[NR_LRU_LISTS];
> > +	unsigned long		count[NR_LRU_LISTS];
> >  };
> >  /* Macro for accessing counter */
> >  #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
> > @@ -160,6 +152,7 @@ struct page_cgroup {
> >  };
> >  #define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache */
> >  #define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* page is active in this cgroup */
> > +#define PAGE_CGROUP_FLAG_FILE	(0x4)	/* page is file system backed */
> > 
> 
> Now, we don't have control_type and a feature for accounting only CACHE.
> Balbir-san, do you have some new plan ?
>

Hi, KAMEZAWA-San,

The control_type feature is gone. We still have cached page
accounting, but we do not allow control of only RSS pages anymore. We
need to control both RSS+cached pages. I do not understand your
question about new plan? Is it about adding back control_type?

 
> BTW, is it better to use PageSwapBacked(pc->page) rather than adding a new flag
> PAGE_CGROUP_FLAG_FILE ?
> 
> 
> PAGE_CGROUP_FLAG_ACTIVE is used because global reclaim can change
> ACTIVE/INACTIVE attribute without accessing memory cgroup.
> (Then, we cannot trust PageActive(pc->page))
> 

Yes, correct. A page active on the node's zone LRU need not be active
in the memory cgroup.

> ANON <-> FILE attribute can be changed dinamically (after added to LRU) ?
> 
> If no, using page_file_cache(pc->page) will be easy.
> 
> Thanks,
> -Kame
> 

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-08 20:59 ` [patch 05/19] split LRU lists into anon & file sets Rik van Riel
  2008-01-08 22:22   ` Christoph Lameter
  2008-01-09  4:41   ` KAMEZAWA Hiroyuki
@ 2008-01-10  2:28   ` KAMEZAWA Hiroyuki
  2008-01-10  2:37     ` Rik van Riel
  2008-01-11  3:59   ` KOSAKI Motohiro
                     ` (3 subsequent siblings)
  6 siblings, 1 reply; 88+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-01-10  2:28 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm, Lee Schermerhorn

On Tue, 08 Jan 2008 15:59:44 -0500
Rik van Riel <riel@redhat.com> wrote:

> +	rotate_sum = zone->recent_rotated_file + zone->recent_rotated_anon;
> +
> +	/* Keep a floating average of RECENT references. */
> +	if (unlikely(rotate_sum > min(anon, file))) {
> +		spin_lock_irq(&zone->lru_lock);
> +		zone->recent_rotated_file /= 2;
> +		zone->recent_rotated_anon /= 2;
> +		spin_unlock_irq(&zone->lru_lock);
> +		rotate_sum /= 2;
> +	}
> +
> +	/*
> +	 * With swappiness at 100, anonymous and file have the same priority.
> +	 * This scanning priority is essentially the inverse of IO cost.
> +	 */
> +	anon_prio = sc->swappiness;
> +	file_prio = 200 - sc->swappiness;
> +
> +	/*
> +	 *                  anon       recent_rotated_anon
> +	 * %anon = 100 * ----------- / ------------------- * IO cost
> +	 *               anon + file       rotate_sum
> +	 */
> +	ap = (anon_prio * anon) / (anon + file + 1);
> +	ap *= rotate_sum / (zone->recent_rotated_anon + 1);
> +	if (ap == 0)
> +		ap = 1;
> +	else if (ap > 100)
> +		ap = 100;
> +	percent[0] = ap;
> +

Hmm, it seems..

When a program copies large amount of files, recent_rotated_file increases
rapidly and 

    rotate_sum
    ----------
recent_rotated_anon

will be very big.

And %ap will be big regardless of vm_swappiness  if it's not 0.

I think # of recent_successful_pageout(anon/file) should be took into account...

I'm sorry if I miss something.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-10  2:21     ` Balbir Singh
@ 2008-01-10  2:36       ` KAMEZAWA Hiroyuki
  2008-01-10  3:26         ` Balbir Singh
  0 siblings, 1 reply; 88+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-01-10  2:36 UTC (permalink / raw)
  To: balbir; +Cc: Rik van Riel, linux-kernel, linux-mm, Lee Schermerhorn

On Thu, 10 Jan 2008 07:51:33 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> > >  #define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache */
> > >  #define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* page is active in this cgroup */
> > > +#define PAGE_CGROUP_FLAG_FILE	(0x4)	/* page is file system backed */
> > > 
> > 
> > Now, we don't have control_type and a feature for accounting only CACHE.
> > Balbir-san, do you have some new plan ?
> >
> 
> Hi, KAMEZAWA-San,
> 
> The control_type feature is gone. We still have cached page
> accounting, but we do not allow control of only RSS pages anymore. We
> need to control both RSS+cached pages. I do not understand your
> question about new plan? Is it about adding back control_type?
> 
Ah, just wanted to confirm that we can drop PAGE_CGROUP_FLAG_CACHE
if page_file_cache() function and split-LRU is introduced.


Thanks you.

-Kame


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-10  2:28   ` KAMEZAWA Hiroyuki
@ 2008-01-10  2:37     ` Rik van Riel
  0 siblings, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2008-01-10  2:37 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel, linux-mm, Lee Schermerhorn

On Thu, 10 Jan 2008 11:28:49 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> Hmm, it seems..
> 
> When a program copies large amount of files, recent_rotated_file increases
> rapidly and 
> 
>     rotate_sum
>     ----------
> recent_rotated_anon
> 
> will be very big.
> 
> And %ap will be big regardless of vm_swappiness  if it's not 0.
> 
> I think # of recent_successful_pageout(anon/file) should be took into account...
> 
> I'm sorry if I miss something.

You are right.  I wonder if this, again, is a case of myself or
Lee forward porting old code.  I remember having (had) a very
different version of get_scan_ratio() at some point in the past,
but I cannot remember if we discarded this version for that other
version, or the other way around :(

Lee, would you by any chance still have some alternative versions
of get_scan_ratio() around?  I'm searching through my systems, but
have not found it yet...

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-10  2:36       ` KAMEZAWA Hiroyuki
@ 2008-01-10  3:26         ` Balbir Singh
  2008-01-10  4:23           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 88+ messages in thread
From: Balbir Singh @ 2008-01-10  3:26 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Rik van Riel, linux-kernel, linux-mm, Lee Schermerhorn

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-01-10 11:36:18]:

> On Thu, 10 Jan 2008 07:51:33 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > > >  #define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache */
> > > >  #define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* page is active in this cgroup */
> > > > +#define PAGE_CGROUP_FLAG_FILE	(0x4)	/* page is file system backed */
> > > > 
> > > 
> > > Now, we don't have control_type and a feature for accounting only CACHE.
> > > Balbir-san, do you have some new plan ?
> > >
> > 
> > Hi, KAMEZAWA-San,
> > 
> > The control_type feature is gone. We still have cached page
> > accounting, but we do not allow control of only RSS pages anymore. We
> > need to control both RSS+cached pages. I do not understand your
> > question about new plan? Is it about adding back control_type?
> > 
> Ah, just wanted to confirm that we can drop PAGE_CGROUP_FLAG_CACHE
> if page_file_cache() function and split-LRU is introduced.
> 

Earlier we would have had a problem, since we even accounted for swap
cache with PAGE_CGROUP_FLAG_CACHE and I think page_file_cache() does
not account swap cache pages with page_file_cache(). Our accounting
is based on mapped vs unmapped whereas the new code from Rik accounts
file vs anonymous. I suspect we could live a little while longer
with PAGE_CGROUP_FLAG_CACHE and then if we do not need it at all,
we can mark it down for removal. What do you think?


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-10  3:26         ` Balbir Singh
@ 2008-01-10  4:23           ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 88+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-01-10  4:23 UTC (permalink / raw)
  To: balbir; +Cc: Rik van Riel, linux-kernel, linux-mm, Lee Schermerhorn

On Thu, 10 Jan 2008 08:56:31 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> > > The control_type feature is gone. We still have cached page
> > > accounting, but we do not allow control of only RSS pages anymore. We
> > > need to control both RSS+cached pages. I do not understand your
> > > question about new plan? Is it about adding back control_type?
> > > 
> > Ah, just wanted to confirm that we can drop PAGE_CGROUP_FLAG_CACHE
> > if page_file_cache() function and split-LRU is introduced.
> > 
> 
> Earlier we would have had a problem, since we even accounted for swap
> cache with PAGE_CGROUP_FLAG_CACHE and I think page_file_cache() does
> not account swap cache pages with page_file_cache(). Our accounting
> is based on mapped vs unmapped whereas the new code from Rik accounts
> file vs anonymous. I suspect we could live a little while longer
> with PAGE_CGROUP_FLAG_CACHE and then if we do not need it at all,
> we can mark it down for removal. What do you think?

Okay, I have no objection. 

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 00/19] VM pageout scalability improvements
  2008-01-08 20:59 [patch 00/19] VM pageout scalability improvements Rik van Riel
                   ` (18 preceding siblings ...)
  2008-01-08 20:59 ` [patch 19/19] cull non-reclaimable anon pages from the LRU at fault time Rik van Riel
@ 2008-01-10  4:39 ` Mike Snitzer
  2008-01-10 15:41   ` Rik van Riel
  2008-01-11 10:41 ` Balbir Singh
                   ` (2 subsequent siblings)
  22 siblings, 1 reply; 88+ messages in thread
From: Mike Snitzer @ 2008-01-10  4:39 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm

On Jan 8, 2008 3:59 PM, Rik van Riel <riel@redhat.com> wrote:
> On large memory systems, the VM can spend way too much time scanning
> through pages that it cannot (or should not) evict from memory. Not
> only does it use up CPU time, but it also provokes lock contention
> and can leave large systems under memory presure in a catatonic state.
>
> Against 2.6.24-rc6-mm1

Hi Rik,

How much trouble am I asking for if I were to try to get your patchset
to fly on a fairly recent "stable" kernel (e.g. 2.6.22.15)?  If
workable, is such an effort before it's time relative to your TODO?

I see that you have an old port to a FC7-based 2.6.21 here:
http://people.redhat.com/riel/vmsplit/

Also, do you have a public git repo that you regularly publish to for
this patchset?  If not a git repo do you put the raw patchset on some
http/ftp server?

thanks,
Mike

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 00/19] VM pageout scalability improvements
  2008-01-10  4:39 ` [patch 00/19] VM pageout scalability improvements Mike Snitzer
@ 2008-01-10 15:41   ` Rik van Riel
  2008-01-10 16:08     ` Mike Snitzer
  0 siblings, 1 reply; 88+ messages in thread
From: Rik van Riel @ 2008-01-10 15:41 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: linux-kernel, linux-mm

On Wed, 9 Jan 2008 23:39:02 -0500
"Mike Snitzer" <snitzer@gmail.com> wrote:

> How much trouble am I asking for if I were to try to get your patchset
> to fly on a fairly recent "stable" kernel (e.g. 2.6.22.15)?  If
> workable, is such an effort before it's time relative to your TODO?

Quite a bit :)

The -mm kernel has the memory controller code, which means the
mm/ directory is fairly different.  My patch set sits on top
of that.

Chances are that once the -mm kernel goes upstream (in 2.6.25-rc1),
I can start building on top of that.

OTOH, maybe I could get my patch series onto a recent 2.6.23.X with
minimal chainsaw effort.

> I see that you have an old port to a FC7-based 2.6.21 here:
> http://people.redhat.com/riel/vmsplit/
> 
> Also, do you have a public git repo that you regularly publish to for
> this patchset?  If not a git repo do you put the raw patchset on some
> http/ftp server?

Up to now I have only emailed out the patches. Since there is demand
for them to be downloadable from somewhere, I'll also start putting
them on http://people.redhat.com/riel/

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 00/19] VM pageout scalability improvements
  2008-01-10 15:41   ` Rik van Riel
@ 2008-01-10 16:08     ` Mike Snitzer
  0 siblings, 0 replies; 88+ messages in thread
From: Mike Snitzer @ 2008-01-10 16:08 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm

On Jan 10, 2008 10:41 AM, Rik van Riel <riel@redhat.com> wrote:
>
> On Wed, 9 Jan 2008 23:39:02 -0500
> "Mike Snitzer" <snitzer@gmail.com> wrote:
>
> > How much trouble am I asking for if I were to try to get your patchset
> > to fly on a fairly recent "stable" kernel (e.g. 2.6.22.15)?  If
> > workable, is such an effort before it's time relative to your TODO?
>
> Quite a bit :)
>
> The -mm kernel has the memory controller code, which means the
> mm/ directory is fairly different.  My patch set sits on top
> of that.
>
> Chances are that once the -mm kernel goes upstream (in 2.6.25-rc1),
> I can start building on top of that.
>
> OTOH, maybe I could get my patch series onto a recent 2.6.23.X with
> minimal chainsaw effort.

That would be great!  I can't speak for others but -mm poses a problem
for testing your patchset because it is so bleeding.  Let me know if
you take the plunge on a 2.6.23.x backport; I'd really appreciate it.

Is anyone else interested in consuming a 2.6.23.x backport of Rik's
patchset?  If so please speak up.

> > I see that you have an old port to a FC7-based 2.6.21 here:
> > http://people.redhat.com/riel/vmsplit/
> >
> > Also, do you have a public git repo that you regularly publish to for
> > this patchset?  If not a git repo do you put the raw patchset on some
> > http/ftp server?
>
> Up to now I have only emailed out the patches. Since there is demand
> for them to be downloadable from somewhere, I'll also start putting
> them on http://people.redhat.com/riel/

Great, thanks.

Mike

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-08 20:59 ` [patch 05/19] split LRU lists into anon & file sets Rik van Riel
                     ` (2 preceding siblings ...)
  2008-01-10  2:28   ` KAMEZAWA Hiroyuki
@ 2008-01-11  3:59   ` KOSAKI Motohiro
  2008-01-11 15:37     ` Rik van Riel
  2008-01-11  6:24   ` KOSAKI Motohiro
                     ` (2 subsequent siblings)
  6 siblings, 1 reply; 88+ messages in thread
From: KOSAKI Motohiro @ 2008-01-11  3:59 UTC (permalink / raw)
  To: Rik van Riel; +Cc: kosaki.motohiro, linux-kernel, linux-mm, Lee Schermerhorn

Hi Rik

> -static inline long mem_cgroup_calc_reclaim_inactive(struct mem_cgroup *mem,
> -					struct zone *zone, int priority)
> +static inline long mem_cgroup_calc_reclaim(struct mem_cgroup *mem,
> +					struct zone *zone, int priority,
> +					int active, int file)
>  {
>  	return 0;
>  }

it can't compile if memcgroup turn off.

because current mem_cgroup_calc_reclaim type is below.

	long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
				int priority, enum lru_list lru)

after patched below, it can compile.
I hope you don't think unpleasant by a trivial point out.

regard.

- kosaki


Index: linux-2.6.24-rc6-mm1-rvr/include/linux/memcontrol.h
===================================================================
--- linux-2.6.24-rc6-mm1-rvr.orig/include/linux/memcontrol.h    2008-01-11 11:10:16.000000000 +0900
+++ linux-2.6.24-rc6-mm1-rvr/include/linux/memcontrol.h 2008-01-11 12:08:29.000000000 +0900
@@ -168,9 +168,8 @@
 {
 }

-static inline long mem_cgroup_calc_reclaim(struct mem_cgroup *mem,
-                                       struct zone *zone, int priority,
-                                       int active, int file)
+static inline long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
+                                       int priority, enum lru_list lru)
 {
        return 0;
 }





^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 10/19] No Reclaim LRU Infrastructure
  2008-01-08 20:59 ` [patch 10/19] No Reclaim LRU Infrastructure Rik van Riel
@ 2008-01-11  4:36   ` KOSAKI Motohiro
  2008-01-11 15:43     ` Lee Schermerhorn
  0 siblings, 1 reply; 88+ messages in thread
From: KOSAKI Motohiro @ 2008-01-11  4:36 UTC (permalink / raw)
  To: Rik van Riel; +Cc: kosaki.motohiro, linux-kernel, linux-mm, Lee Schermerhorn

Hi Rik

> +config NORECLAIM
> +	bool "Track non-reclaimable pages (EXPERIMENTAL; 64BIT only)"
> +	depends on EXPERIMENTAL && 64BIT
> +	help
> +	  Supports tracking of non-reclaimable pages off the [in]active lists
> +	  to avoid excessive reclaim overhead on large memory systems.  Pages
> +	  may be non-reclaimable because:  they are locked into memory, they
> +	  are anonymous pages for which no swap space exists, or they are anon
> +	  pages that are expensive to unmap [long anon_vma "related vma" list.]

Why do you select to default is NO ?
I think this is really improvement and no one of 64bit user
hope turn off without NORECLAIM developer :)


- kosaki



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-08 20:59 ` [patch 05/19] split LRU lists into anon & file sets Rik van Riel
                     ` (3 preceding siblings ...)
  2008-01-11  3:59   ` KOSAKI Motohiro
@ 2008-01-11  6:24   ` KOSAKI Motohiro
  2008-01-11 15:42     ` Rik van Riel
  2008-01-11 15:50     ` Lee Schermerhorn
  2008-01-11  7:35   ` KOSAKI Motohiro
  2008-01-30  3:25   ` KOSAKI Motohiro
  6 siblings, 2 replies; 88+ messages in thread
From: KOSAKI Motohiro @ 2008-01-11  6:24 UTC (permalink / raw)
  To: Rik van Riel; +Cc: kosaki.motohiro, linux-kernel, linux-mm, Lee Schermerhorn

Hi Rik

> +static inline int is_file_lru(enum lru_list l)
> +{
> +	BUILD_BUG_ON(LRU_INACTIVE_FILE != 2 || LRU_ACTIVE_FILE != 3);
> +	return (l/2 == 1);
> +}

below patch is a bit cleanup proposal.
i think LRU_FILE is more clarify than "/2".

What do you think it?



Index: linux-2.6.24-rc6-mm1-rvr/include/linux/mmzone.h
===================================================================
--- linux-2.6.24-rc6-mm1-rvr.orig/include/linux/mmzone.h        2008-01-11 11:10:30.000000000 +0900
+++ linux-2.6.24-rc6-mm1-rvr/include/linux/mmzone.h     2008-01-11 14:40:31.000000000 +0900
@@ -147,7 +147,7 @@
 static inline int is_file_lru(enum lru_list l)
 {
        BUILD_BUG_ON(LRU_INACTIVE_FILE != 2 || LRU_ACTIVE_FILE != 3);
-       return (l/2 == 1);
+       return !!(l & LRU_FILE);
 }

 struct per_cpu_pages {




^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-08 20:59 ` [patch 05/19] split LRU lists into anon & file sets Rik van Riel
                     ` (4 preceding siblings ...)
  2008-01-11  6:24   ` KOSAKI Motohiro
@ 2008-01-11  7:35   ` KOSAKI Motohiro
  2008-01-11 15:46     ` Rik van Riel
  2008-01-30  3:25   ` KOSAKI Motohiro
  6 siblings, 1 reply; 88+ messages in thread
From: KOSAKI Motohiro @ 2008-01-11  7:35 UTC (permalink / raw)
  To: Rik van Riel; +Cc: kosaki.motohiro, linux-kernel, linux-mm, Lee Schermerhorn

Hi Rik

> @@ -1128,64 +1026,65 @@ static void shrink_active_list(unsigned 
  (snip)

> +	/*
> +	 * For sorting active vs inactive pages, we'll use the 'anon'
> +	 * elements of the local list[] array and sort out the file vs
> +	 * anon pages below.
> +	 */
>  	while (!list_empty(&l_hold)) {
> +		lru = LRU_INACTIVE_ANON;
>  		cond_resched();
>  		page = lru_to_page(&l_hold);
>  		list_del(&page->lru);
> -		if (page_mapped(page)) {
> -			if (!reclaim_mapped ||
> -			    (total_swap_pages == 0 && PageAnon(page)) ||
> -			    page_referenced(page, 0, sc->mem_cgroup)) {
> -				list_add(&page->lru, &list[LRU_ACTIVE]);
> -				continue;
> -			}
> -		} else if (TestClearPageReferenced(page)) {
> -			list_add(&page->lru, &list[LRU_ACTIVE]);
> -			continue;
> -		}
> -		list_add(&page->lru, &list[LRU_INACTIVE]);
> +		if (page_referenced(page, 0, sc->mem_cgroup))
> +			lru = LRU_ACTIVE_ANON;
> +		list_add(&page->lru, &list[lru]);
>  	}

Why drop (total_swap_pages == 0 && PageAnon(page)) condition?
in embedded sysmtem, 
CONFIG_NORECLAIM is OFF (because almost embedded cpu is 32bit) and
that anon move to inactive list is meaningless because it doesn't have swap.

below code is more good, may be.
but I don't understand yet why ignore page_referenced() result at anon page ;-)


- kosaki


---
 mm/vmscan.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.24-rc6-mm1-rvr/mm/vmscan.c
===================================================================
--- linux-2.6.24-rc6-mm1-rvr.orig/mm/vmscan.c   2008-01-11 13:59:12.000000000 +0900
+++ linux-2.6.24-rc6-mm1-rvr/mm/vmscan.c        2008-01-11 16:16:44.000000000 +0900
@@ -1147,7 +1147,7 @@ static void shrink_active_list(unsigned
                }

                if (page_referenced(page, 0, sc->mem_cgroup)) {
-                       if (file)
+                       if (file || (total_swap_pages == 0))
                                /* Referenced file pages stay active. */
                                lru = LRU_ACTIVE_ANON;
                        else



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 00/19] VM pageout scalability improvements
  2008-01-08 20:59 [patch 00/19] VM pageout scalability improvements Rik van Riel
                   ` (19 preceding siblings ...)
  2008-01-10  4:39 ` [patch 00/19] VM pageout scalability improvements Mike Snitzer
@ 2008-01-11 10:41 ` Balbir Singh
  2008-01-11 15:38   ` Rik van Riel
  2008-01-11 11:47 ` Balbir Singh
  2008-01-16  6:17 ` rvr split LRU minor regression ? KOSAKI Motohiro
  22 siblings, 1 reply; 88+ messages in thread
From: Balbir Singh @ 2008-01-11 10:41 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm

* Rik van Riel <riel@redhat.com> [2008-01-08 15:59:39]:

> On large memory systems, the VM can spend way too much time scanning
> through pages that it cannot (or should not) evict from memory. Not
> only does it use up CPU time, but it also provokes lock contention
> and can leave large systems under memory presure in a catatonic state.
> 
> Against 2.6.24-rc6-mm1
> 
> This patch series improves VM scalability by:
> 
> 1) making the locking a little more scalable
> 
> 2) putting filesystem backed, swap backed and non-reclaimable pages
>    onto their own LRUs, so the system only scans the pages that it
>    can/should evict from memory
> 
> 3) switching to SEQ replacement for the anonymous LRUs, so the
>    number of pages that need to be scanned when the system
>    starts swapping is bound to a reasonable number
> 
> More info on the overall design can be found at:
> 
> 	http://linux-mm.org/PageReplacementDesign
> 
> 
> Changelog:
> - merge memcontroller split LRU code into the main split LRU patch,
>   since it is not functionally different (it was split up only to help
>   people who had seen the last version of the patch series review it)
> - drop the page_file_cache debugging patch, since it never triggered
> - reintroduce code to not scan anon list if swap is full
> - add code to scan anon list if page cache is very small already
> - use lumpy reclaim more aggressively for smaller order > 1 allocations
>

Hi, Rik,

I've just started the patch series, the compile fails for me on a
powerpc box. global_lru_pages() is defined under CONFIG_PM, but used
else where in mm/page-writeback.c. None of the global_lru_pages()
parameters depend on CONFIG_PM. Here's a simple patch to fix it.

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b14e188..39e6aef 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1920,6 +1920,14 @@ void wakeup_kswapd(struct zone *zone, int order)
 	wake_up_interruptible(&pgdat->kswapd_wait);
 }
 
+unsigned long global_lru_pages(void)
+{
+	return global_page_state(NR_ACTIVE_ANON)
+		+ global_page_state(NR_ACTIVE_FILE)
+		+ global_page_state(NR_INACTIVE_ANON)
+		+ global_page_state(NR_INACTIVE_FILE);
+}
+
 #ifdef CONFIG_PM
 /*
  * Helper function for shrink_all_memory().  Tries to reclaim 'nr_pages' pages
@@ -1968,14 +1976,6 @@ static unsigned long shrink_all_zones(unsigned long nr_pages, int prio,
 	return ret;
 }
 
-unsigned long global_lru_pages(void)
-{
-	return global_page_state(NR_ACTIVE_ANON)
-		+ global_page_state(NR_ACTIVE_FILE)
-		+ global_page_state(NR_INACTIVE_ANON)
-		+ global_page_state(NR_INACTIVE_FILE);
-}
-
 /*
  * Try to free `nr_pages' of memory, system-wide, and return the number of
  * freed pages.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 00/19] VM pageout scalability improvements
  2008-01-08 20:59 [patch 00/19] VM pageout scalability improvements Rik van Riel
                   ` (20 preceding siblings ...)
  2008-01-11 10:41 ` Balbir Singh
@ 2008-01-11 11:47 ` Balbir Singh
  2008-01-16  6:17 ` rvr split LRU minor regression ? KOSAKI Motohiro
  22 siblings, 0 replies; 88+ messages in thread
From: Balbir Singh @ 2008-01-11 11:47 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm

* Rik van Riel <riel@redhat.com> [2008-01-08 15:59:39]:

> Changelog:
> - merge memcontroller split LRU code into the main split LRU patch,
>   since it is not functionally different (it was split up only to help
>   people who had seen the last version of the patch series review it)

Hi, Rik,

I see a strange behaviour with this patchset. I have a program
(pagetest from Vaidy), that does the following

1. Can allocate different kinds of memory, mapped, malloc'ed or shared
2. Allocates and touches all the memory in a loop (2 times)

I mount the memory controller and limit it to 400M and run pagetest
and ask it to touch 1000M. Without this patchset everything runs fine,
but with this patchset installed, I immediately see

 pagetest invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
 Call Trace:
 [c0000000e5aef400] [c00000000000eb24] .show_stack+0x70/0x1bc (unreliable)
 [c0000000e5aef4b0] [c0000000000bbbbc] .oom_kill_process+0x80/0x260
 [c0000000e5aef570] [c0000000000bc498] .mem_cgroup_out_of_memory+0x6c/0x98
 [c0000000e5aef610] [c0000000000f2574] .mem_cgroup_charge_common+0x1e0/0x414
 [c0000000e5aef6e0] [c0000000000b852c] .add_to_page_cache+0x48/0x164
 [c0000000e5aef780] [c0000000000b8664] .add_to_page_cache_lru+0x1c/0x68
 [c0000000e5aef810] [c00000000012db50] .mpage_readpages+0xbc/0x15c
 [c0000000e5aef940] [c00000000018bdac] .ext3_readpages+0x28/0x40
 [c0000000e5aef9c0] [c0000000000c3978] .__do_page_cache_readahead+0x158/0x260
 [c0000000e5aefa90] [c0000000000bac44] .filemap_fault+0x18c/0x3d4
 [c0000000e5aefb70] [c0000000000cd510] .__do_fault+0xb0/0x588
 [c0000000e5aefc80] [c0000000005653cc] .do_page_fault+0x440/0x620
 [c0000000e5aefe30] [c000000000005408] handle_page_fault+0x20/0x58
 Mem-info:
 Node 0 DMA per-cpu:
 CPU    0: hi:    6, btch:   1 usd:   4
 CPU    1: hi:    6, btch:   1 usd:   0
 CPU    2: hi:    6, btch:   1 usd:   3
 CPU    3: hi:    6, btch:   1 usd:   4
 Active_anon:9099 active_file:1523 inactive_anon0
  inactive_file:2869 noreclaim:0 dirty:20 writeback
:0 unstable:0
  free:44210 slab:639 mapped:1724 pagetables:475 bo
unce:0
 Node 0 DMA free:2829440kB min:7808kB low:9728kB hi
gh:11712kB active_anon:582336kB inactive_anon:0kB active_file:97472kB inactive_f
ile:183616kB noreclaim:0kB present:3813760kB pages_scanned:0 all_unreclaimable?
no
 lowmem_reserve[]: 0 0 0
 Node 0 DMA: 3*64kB 5*128kB 5*256kB 4*512kB 2*1024k
B 4*2048kB 3*4096kB 2*8192kB 170*16384kB = 2828352kB
 Swap cache: add 0, delete 0, find 0/0
 Free swap  = 3148608kB
 Total swap = 3148608kB
 Free swap:       3148608kB
 59648 pages of RAM
 677 reserved pages
 28165 pages shared
 0 pages swap cached
 Memory cgroup out of memory: kill process 6593 (pagetest) score 1003 or a child
 Killed process 6593 (pagetest)

I am using a powerpc box with 64K size pages. I'll try and investigate further,
just a heads up on the failure I am seeing.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 18/19] account mlocked pages
  2008-01-08 20:59 ` [patch 18/19] account mlocked pages Rik van Riel
@ 2008-01-11 12:51   ` Balbir Singh
  2008-01-13  5:18     ` Rik van Riel
  0 siblings, 1 reply; 88+ messages in thread
From: Balbir Singh @ 2008-01-11 12:51 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm, Nick Piggin, Lee Schermerhorn

* Rik van Riel <riel@redhat.com> [2008-01-08 15:59:57]:

The following patch is required to compile the code with
CONFIG_NORECLAIM enabled and CONFIG_NORECLAIM_MLOCK disabled.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c8ccf8f..fb08ee8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -88,6 +88,8 @@ enum zone_stat_item {
 	NR_NORECLAIM,	/*  "     "     "   "       "         */
 #ifdef CONFIG_NORECLAIM_MLOCK
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
+#else
+	NR_MLOCK=NR_ACTIVE_FILE,	/* avoid compiler errors... */
 #endif
 #else
 	NR_NORECLAIM=NR_ACTIVE_FILE,	/* avoid compiler errors in dead code */

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-11  3:59   ` KOSAKI Motohiro
@ 2008-01-11 15:37     ` Rik van Riel
  0 siblings, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2008-01-11 15:37 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: kosaki.motohiro, linux-kernel, linux-mm, Lee Schermerhorn

On Fri, 11 Jan 2008 12:59:31 +0900
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> Hi Rik
> 
> > -static inline long mem_cgroup_calc_reclaim_inactive(struct mem_cgroup *mem,
> > -					struct zone *zone, int priority)
> > +static inline long mem_cgroup_calc_reclaim(struct mem_cgroup *mem,
> > +					struct zone *zone, int priority,
> > +					int active, int file)
> >  {
> >  	return 0;
> >  }
> 
> it can't compile if memcgroup turn off.

Doh!  Good point.

Thank you for pointing out this error.  I applied your fix to my tree,
it will be in the next version.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 00/19] VM pageout scalability improvements
  2008-01-11 10:41 ` Balbir Singh
@ 2008-01-11 15:38   ` Rik van Riel
  0 siblings, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2008-01-11 15:38 UTC (permalink / raw)
  To: balbir; +Cc: linux-kernel, linux-mm

On Fri, 11 Jan 2008 16:11:15 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> I've just started the patch series, the compile fails for me on a
> powerpc box. global_lru_pages() is defined under CONFIG_PM, but used
> else where in mm/page-writeback.c. None of the global_lru_pages()
> parameters depend on CONFIG_PM. Here's a simple patch to fix it.

Thank you for the fix.  I have applied it to my tree.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-11  6:24   ` KOSAKI Motohiro
@ 2008-01-11 15:42     ` Rik van Riel
  2008-01-11 15:59       ` Lee Schermerhorn
  2008-01-11 15:50     ` Lee Schermerhorn
  1 sibling, 1 reply; 88+ messages in thread
From: Rik van Riel @ 2008-01-11 15:42 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: kosaki.motohiro, linux-kernel, linux-mm, Lee Schermerhorn

On Fri, 11 Jan 2008 15:24:34 +0900
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> below patch is a bit cleanup proposal.
> i think LRU_FILE is more clarify than "/2".
> 
> What do you think it?

Thank you for the cleanup, your version looks a lot nicer.  
I have applied your patch to my series.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 10/19] No Reclaim LRU Infrastructure
  2008-01-11  4:36   ` KOSAKI Motohiro
@ 2008-01-11 15:43     ` Lee Schermerhorn
  2008-01-15  0:06       ` KOSAKI Motohiro
  0 siblings, 1 reply; 88+ messages in thread
From: Lee Schermerhorn @ 2008-01-11 15:43 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Rik van Riel, linux-kernel, linux-mm

On Fri, 2008-01-11 at 13:36 +0900, KOSAKI Motohiro wrote:
> Hi Rik
> 
> > +config NORECLAIM
> > +	bool "Track non-reclaimable pages (EXPERIMENTAL; 64BIT only)"
> > +	depends on EXPERIMENTAL && 64BIT
> > +	help
> > +	  Supports tracking of non-reclaimable pages off the [in]active lists
> > +	  to avoid excessive reclaim overhead on large memory systems.  Pages
> > +	  may be non-reclaimable because:  they are locked into memory, they
> > +	  are anonymous pages for which no swap space exists, or they are anon
> > +	  pages that are expensive to unmap [long anon_vma "related vma" list.]
> 
> Why do you select to default is NO ?
> I think this is really improvement and no one of 64bit user
> hope turn off without NORECLAIM developer :)
> 

Hello, Kosaki-san:

This was my doing.  I left the default == NO during
development/experimemental stage so that one would have to take explicit
action to enable this function.  If the feature makes it into mainline
and we decide that the default should be 'yes', that will be an easy
change.

Thanks for looking at this,
Lee Schermerhorn


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-11  7:35   ` KOSAKI Motohiro
@ 2008-01-11 15:46     ` Rik van Riel
  2008-01-14 23:57       ` KOSAKI Motohiro
  0 siblings, 1 reply; 88+ messages in thread
From: Rik van Riel @ 2008-01-11 15:46 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: kosaki.motohiro, linux-kernel, linux-mm, Lee Schermerhorn

On Fri, 11 Jan 2008 16:35:24 +0900
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> Why drop (total_swap_pages == 0 && PageAnon(page)) condition?
> in embedded sysmtem, 
> CONFIG_NORECLAIM is OFF (because almost embedded cpu is 32bit) and
> that anon move to inactive list is meaningless because it doesn't have swap.

That was a mistake, kind of.  Since all swap backed pages are on their
own LRU lists, we should not scan those lists at all any more if we are
out of swap space.

The patch that fixes get_scan_ratio() adds that test.

Having said that, with the nr_swap_pages==0 test in get_scan_ratio(),
we no longer need to test for that condition in shrink_active_list().

> below code is more good, may be.
> but I don't understand yet why ignore page_referenced() result at anon page ;-)

On modern systems, swapping out anonymous pages is a relatively rare
event.  All anonymous pages start out as active and referenced, so
testing for that condition does (1) not add any information and (2)
mean we need to scan ALL of the anonymous pages, in order to find one
candidate to swap out (since they are all referenced).

Simply deactivating a few pages and checking whether they were referenced
again while on the (smaller) inactive_anon_list means we can find candidates
to page out with a lot less CPU time used.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-11  6:24   ` KOSAKI Motohiro
  2008-01-11 15:42     ` Rik van Riel
@ 2008-01-11 15:50     ` Lee Schermerhorn
  2008-01-11 16:06       ` Rik van Riel
  1 sibling, 1 reply; 88+ messages in thread
From: Lee Schermerhorn @ 2008-01-11 15:50 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Rik van Riel, linux-kernel, linux-mm

On Fri, 2008-01-11 at 15:24 +0900, KOSAKI Motohiro wrote:
> Hi Rik
> 
> > +static inline int is_file_lru(enum lru_list l)
> > +{
> > +	BUILD_BUG_ON(LRU_INACTIVE_FILE != 2 || LRU_ACTIVE_FILE != 3);
> > +	return (l/2 == 1);
> > +}
> 
> below patch is a bit cleanup proposal.
> i think LRU_FILE is more clarify than "/2".
> 
> What do you think it?
> 
> 
> 
> Index: linux-2.6.24-rc6-mm1-rvr/include/linux/mmzone.h
> ===================================================================
> --- linux-2.6.24-rc6-mm1-rvr.orig/include/linux/mmzone.h        2008-01-11 11:10:30.000000000 +0900
> +++ linux-2.6.24-rc6-mm1-rvr/include/linux/mmzone.h     2008-01-11 14:40:31.000000000 +0900
> @@ -147,7 +147,7 @@
>  static inline int is_file_lru(enum lru_list l)
>  {
>         BUILD_BUG_ON(LRU_INACTIVE_FILE != 2 || LRU_ACTIVE_FILE != 3);
> -       return (l/2 == 1);
> +       return !!(l & LRU_FILE);
>  }
> 
>  struct per_cpu_pages {
> 

Kosaki-san:

Again, my doing.  I agree that the calculation is a bit strange, but I
wanted to "future-proof" this function in case we ever get to a value of
'6' for the lru_list enum.  In that case, the AND will evaluate to
non-zero for what may not be a file LRU.  Between the build time
assertion and the division [which could just be a 'l >> 1', I suppose]
we should be safe.

Thanks,
Lee


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-11 15:42     ` Rik van Riel
@ 2008-01-11 15:59       ` Lee Schermerhorn
  2008-01-11 16:15         ` Rik van Riel
  0 siblings, 1 reply; 88+ messages in thread
From: Lee Schermerhorn @ 2008-01-11 15:59 UTC (permalink / raw)
  To: Rik van Riel; +Cc: KOSAKI Motohiro, linux-kernel, linux-mm

On Fri, 2008-01-11 at 10:42 -0500, Rik van Riel wrote:
> On Fri, 11 Jan 2008 15:24:34 +0900
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > below patch is a bit cleanup proposal.
> > i think LRU_FILE is more clarify than "/2".
> > 
> > What do you think it?
> 
> Thank you for the cleanup, your version looks a lot nicer.  
> I have applied your patch to my series.
> 

Rik:  

I think we also want to do something like:

-	BUILD_BUG_ON(LRU_INACTIVE_FILE != 2 || LRU_ACTIVE_FILE != 3);
+	BUILD_BUG_ON(LRU_INACTIVE_FILE != 2 || LRU_ACTIVE_FILE != 3 ||
+		NR_LRU_LISTS > 6);

Then we'll be warned if future change might break our implicit
assumption that any lru_list value with '0x2' set is a file lru.

Lee


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-11 15:50     ` Lee Schermerhorn
@ 2008-01-11 16:06       ` Rik van Riel
  0 siblings, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2008-01-11 16:06 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: KOSAKI Motohiro, linux-kernel, linux-mm

On Fri, 11 Jan 2008 10:50:09 -0500
Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:

> Again, my doing.  I agree that the calculation is a bit strange, but I
> wanted to "future-proof" this function in case we ever get to a value of
> '6' for the lru_list enum.  In that case, the AND will evaluate to
> non-zero for what may not be a file LRU.  Between the build time
> assertion and the division [which could just be a 'l >> 1', I suppose]
> we should be safe.

Good point.  I did not guess that.

I'll restore the code to your original test.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-11 15:59       ` Lee Schermerhorn
@ 2008-01-11 16:15         ` Rik van Riel
  2008-01-11 19:51           ` Lee Schermerhorn
  0 siblings, 1 reply; 88+ messages in thread
From: Rik van Riel @ 2008-01-11 16:15 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: KOSAKI Motohiro, linux-kernel, linux-mm

On Fri, 11 Jan 2008 10:59:18 -0500
Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:

> On Fri, 2008-01-11 at 10:42 -0500, Rik van Riel wrote:
> > On Fri, 11 Jan 2008 15:24:34 +0900
> > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > 
> > > below patch is a bit cleanup proposal.
> > > i think LRU_FILE is more clarify than "/2".
> > > 
> > > What do you think it?
> > 
> > Thank you for the cleanup, your version looks a lot nicer.  
> > I have applied your patch to my series.
> > 
> 
> Rik:  
> 
> I think we also want to do something like:
> 
> -	BUILD_BUG_ON(LRU_INACTIVE_FILE != 2 || LRU_ACTIVE_FILE != 3);
> +	BUILD_BUG_ON(LRU_INACTIVE_FILE != 2 || LRU_ACTIVE_FILE != 3 ||
> +		NR_LRU_LISTS > 6);
> 
> Then we'll be warned if future change might break our implicit
> assumption that any lru_list value with '0x2' set is a file lru.

Restoring the code to your original version makes things work again.

OTOH, I almost wonder if we should not simply define it to

	return (l == LRU_INACTIVE_FILE || l == LRU_ACTIVE_FILE)

and just deal with it.

Your version of the code is correct and probably faster, but not as
easy to read and probably not in a hot path :)

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-11 16:15         ` Rik van Riel
@ 2008-01-11 19:51           ` Lee Schermerhorn
  0 siblings, 0 replies; 88+ messages in thread
From: Lee Schermerhorn @ 2008-01-11 19:51 UTC (permalink / raw)
  To: Rik van Riel; +Cc: KOSAKI Motohiro, linux-kernel, linux-mm

On Fri, 2008-01-11 at 11:15 -0500, Rik van Riel wrote:
> On Fri, 11 Jan 2008 10:59:18 -0500
> Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> 
> > On Fri, 2008-01-11 at 10:42 -0500, Rik van Riel wrote:
> > > On Fri, 11 Jan 2008 15:24:34 +0900
> > > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > > 
> > > > below patch is a bit cleanup proposal.
> > > > i think LRU_FILE is more clarify than "/2".
> > > > 
> > > > What do you think it?
> > > 
> > > Thank you for the cleanup, your version looks a lot nicer.  
> > > I have applied your patch to my series.
> > > 
> > 
> > Rik:  
> > 
> > I think we also want to do something like:
> > 
> > -	BUILD_BUG_ON(LRU_INACTIVE_FILE != 2 || LRU_ACTIVE_FILE != 3);
> > +	BUILD_BUG_ON(LRU_INACTIVE_FILE != 2 || LRU_ACTIVE_FILE != 3 ||
> > +		NR_LRU_LISTS > 6);
> > 
> > Then we'll be warned if future change might break our implicit
> > assumption that any lru_list value with '0x2' set is a file lru.
> 
> Restoring the code to your original version makes things work again.
> 
> OTOH, I almost wonder if we should not simply define it to
> 
> 	return (l == LRU_INACTIVE_FILE || l == LRU_ACTIVE_FILE)
> 
> and just deal with it.
> 
> Your version of the code is correct and probably faster, but not as
> easy to read and probably not in a hot path :)

Sure.  Whatever you think will fly...

Lee
> 


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 18/19] account mlocked pages
  2008-01-11 12:51   ` Balbir Singh
@ 2008-01-13  5:18     ` Rik van Riel
  0 siblings, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2008-01-13  5:18 UTC (permalink / raw)
  To: balbir; +Cc: linux-kernel, linux-mm, Nick Piggin, Lee Schermerhorn

On Fri, 11 Jan 2008 18:21:09 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * Rik van Riel <riel@redhat.com> [2008-01-08 15:59:57]:
> 
> The following patch is required to compile the code with
> CONFIG_NORECLAIM enabled and CONFIG_NORECLAIM_MLOCK disabled.

I have untangled the #ifdefs to make things compile with
all combinations of config settings.  Thanks for pointing
out this problem.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 09/19] (NEW) more aggressively use lumpy reclaim
  2008-01-08 22:30   ` Christoph Lameter
@ 2008-01-14 15:28     ` Mel Gorman
  0 siblings, 0 replies; 88+ messages in thread
From: Mel Gorman @ 2008-01-14 15:28 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Rik van Riel, linux-kernel, linux-mm

On (08/01/08 14:30), Christoph Lameter didst pronounce:
> On Tue, 8 Jan 2008, Rik van Riel wrote:
> 
> > If normal pageout does not result in contiguous free pages for
> > kernel stacks, fall back to lumpy reclaim instead of failing fork
> > or doing excessive pageout IO.
> 
> Good. Ccing Mel. This is going to help higher order pages which is useful 
> for a couple of other projects.
> 

Well, the patch only has any impact when the order you are reclaiming is
less than PAGE_ALLOC_COSTLY_ORDER so I would not have considered it of major
impact to other projects interested in high order allocations.  However, in
isolation I have no problem with this patch and I can see how it makes sense
for the problem scenario described. I rebased just this patch to 2.6.24-rc7
and found no problems but I have not had the chance to review the whole set.

> Reviewed-by: Christoph Lameter <clameter@sgi.com>
> 

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-11 15:46     ` Rik van Riel
@ 2008-01-14 23:57       ` KOSAKI Motohiro
  0 siblings, 0 replies; 88+ messages in thread
From: KOSAKI Motohiro @ 2008-01-14 23:57 UTC (permalink / raw)
  To: Rik van Riel; +Cc: kosaki.motohiro, linux-kernel, linux-mm, Lee Schermerhorn

Hi

> > Why drop (total_swap_pages == 0 && PageAnon(page)) condition?
> > in embedded sysmtem, 
> > CONFIG_NORECLAIM is OFF (because almost embedded cpu is 32bit) and
> > that anon move to inactive list is meaningless because it doesn't have swap.
> 
> That was a mistake, kind of.  Since all swap backed pages are on their
> own LRU lists, we should not scan those lists at all any more if we are
> out of swap space.
> 
> The patch that fixes get_scan_ratio() adds that test.
> 
> Having said that, with the nr_swap_pages==0 test in get_scan_ratio(),
> we no longer need to test for that condition in shrink_active_list().

Oh I see!
thank you for your kindful lecture.

your implementation is very cute.


> > below code is more good, may be.
> > but I don't understand yet why ignore page_referenced() result at anon page ;-)
> 
> On modern systems, swapping out anonymous pages is a relatively rare
> event.  All anonymous pages start out as active and referenced, so
> testing for that condition does (1) not add any information and (2)
> mean we need to scan ALL of the anonymous pages, in order to find one
> candidate to swap out (since they are all referenced).
> 
> Simply deactivating a few pages and checking whether they were referenced
> again while on the (smaller) inactive_anon_list means we can find candidates
> to page out with a lot less CPU time used.

thanks, I understand, may be.


- kosaki


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 10/19] No Reclaim LRU Infrastructure
  2008-01-11 15:43     ` Lee Schermerhorn
@ 2008-01-15  0:06       ` KOSAKI Motohiro
  0 siblings, 0 replies; 88+ messages in thread
From: KOSAKI Motohiro @ 2008-01-15  0:06 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: kosaki.motohiro, Rik van Riel, linux-kernel, linux-mm

Hi Lee-san

> > > +config NORECLAIM
> > > +	bool "Track non-reclaimable pages (EXPERIMENTAL; 64BIT only)"
> > > +	depends on EXPERIMENTAL && 64BIT
> > > +	help
> > > +	  Supports tracking of non-reclaimable pages off the [in]active lists
> > > +	  to avoid excessive reclaim overhead on large memory systems.  Pages
> > > +	  may be non-reclaimable because:  they are locked into memory, they
> > > +	  are anonymous pages for which no swap space exists, or they are anon
> > > +	  pages that are expensive to unmap [long anon_vma "related vma" list.]
> > 
> > Why do you select to default is NO ?
> > I think this is really improvement and no one of 64bit user
> > hope turn off without NORECLAIM developer :)
> 
> This was my doing.  I left the default == NO during
> development/experimemental stage so that one would have to take explicit
> action to enable this function.  If the feature makes it into mainline
> and we decide that the default should be 'yes', that will be an easy
> change.

Oh I see.
I will help testing too for it merges to mainline early. 

thanks.


- kosaki



^ permalink raw reply	[flat|nested] 88+ messages in thread

* rvr split LRU minor regression ?
  2008-01-08 20:59 [patch 00/19] VM pageout scalability improvements Rik van Riel
                   ` (21 preceding siblings ...)
  2008-01-11 11:47 ` Balbir Singh
@ 2008-01-16  6:17 ` KOSAKI Motohiro
  22 siblings, 0 replies; 88+ messages in thread
From: KOSAKI Motohiro @ 2008-01-16  6:17 UTC (permalink / raw)
  To: Rik van Riel, Lee Schermerhorn; +Cc: kosaki.motohiro, linux-kernel, linux-mm

Hi Rik

I tested new hackbench on rvr split LRU patch.

new hackbench URL is
   http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c


method of test

(1) $ ./hackbench 150 process 1000
(2) # sync; echo 3 > /proc/sys/vm/drop_caches
    $ dd if=tmp10G of=/dev/null
    $ ./hackbench 150 process 1000

test machine:
  CPU:    x86_64 1.86GHz x2
  memory: 6GB


result:

         2.6.24-rc6-mm1      +rvr-split-lru      ratio
                                                (small is faster)
-------------------------------------------------------------------
(1)      364.981             359.386            98.47%
(2)      364.461             387.471           106.31%


more detail:
1. /usr/bin/time command output

vanilla 2.6.24-rc6-mm1
	33.74user 703.10system 6:09.56elapsed 199%CPU (0avgtext+0avgdata 0maxresident)k
	0inputs+0outputs (0major+372467minor)pagefaults 0swaps

2.6.24-rc6-mm1 + rvr-split-lru
	36.22user 731.30system 6:35.16elapsed 194%CPU (0avgtext+0avgdata 0maxresident)k
	0inputs+0outputs (804major+389524minor)pagefaults 0swaps

It seems increase page fault.


2.
after test (2), cat /proc/meminfo

vanilla 2.6.24-rc6-mm1

	MemTotal:      5931808 kB
	MemFree:       1751632 kB
	Buffers:          4360 kB
	Cached:        3930020 kB
	SwapCached:          0 kB
	Active:          46396 kB
	Inactive:      3924108 kB
	SwapTotal:    20972848 kB
	SwapFree:     20972720 kB
	Dirty:               0 kB
	Writeback:           0 kB
	AnonPages:       36140 kB
	Mapped:          10104 kB
	Slab:           160020 kB
	SReclaimable:     3460 kB
	SUnreclaim:     156560 kB
	PageTables:       3712 kB
	NFS_Unstable:        0 kB
	Bounce:              0 kB
	CommitLimit:  23938752 kB
	Committed_AS:    78940 kB
	VmallocTotal: 34359738367 kB
	VmallocUsed:     57220 kB
	VmallocChunk: 34359680999 kB
	HugePages_Total:     0
	HugePages_Free:      0
	HugePages_Rsvd:      0
	HugePages_Surp:      0
	Hugepagesize:     2048 kB


2.6.24-rc6-mm1 + rvr-split-lru

	MemTotal:        5931356 kB
	MemFree:         1771800 kB
	Buffers:            2776 kB
	Cached:          3914800 kB
	SwapCached:         7940 kB
	Active(anon):      21868 kB
	Inactive(anon):     6560 kB
	Active(file):    1722888 kB
	Inactive(file):  2192128 kB
	Noreclaim:        3472 kB
	Mlocked:          3724 kB
	SwapTotal:      20972848 kB
	SwapFree:       20935032 kB
	Dirty:                 8 kB
	Writeback:             0 kB
	AnonPages:         23912 kB
	Mapped:             9500 kB
	Slab:             162188 kB
	SReclaimable:       5544 kB
	SUnreclaim:       156644 kB
	PageTables:         4444 kB
	NFS_Unstable:          0 kB
	Bounce:                0 kB
	CommitLimit:    23938524 kB
	Committed_AS:     106816 kB
	VmallocTotal:   34359738367 kB
	VmallocUsed:       57220 kB
	VmallocChunk:   34359680999 kB
	HugePages_Total:     0
	HugePages_Free:      0
	HugePages_Rsvd:      0
	HugePages_Surp:      0
	Hugepagesize:     2048 kB


It seems used once memory incorrect activation increased.
What do you think it?



- kosaki



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-08 20:59 ` [patch 05/19] split LRU lists into anon & file sets Rik van Riel
                     ` (5 preceding siblings ...)
  2008-01-11  7:35   ` KOSAKI Motohiro
@ 2008-01-30  3:25   ` KOSAKI Motohiro
  2008-01-30  8:57     ` KOSAKI Motohiro
  6 siblings, 1 reply; 88+ messages in thread
From: KOSAKI Motohiro @ 2008-01-30  3:25 UTC (permalink / raw)
  To: Rik van Riel, Lee Schermerhorn; +Cc: kosaki.motohiro, linux-kernel, linux-mm

Hi Rik, Lee

I tested new hackbench on rvr split LRU patch.
   http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c

method of test

(1) $ ./hackbench 150 process 1000
(2) # sync; echo 3 > /proc/sys/vm/drop_caches
    $ dd if=tmp10G of=/dev/null
    $ ./hackbench 150 process 1000

test machine
	CPU: Itanium2 x4 (logical 8cpu)
	MEM: 8GB

A. vanilla 2.6.24-rc8-mm1
	(1) 127.540
	(2) 727.548

B. 2.6.24-rc8-mm1 + split-lru-patch-series
	(1) 92.730
	(2) 758.369

   comment:
    (1) active/inactive anon ratio improve performance significant.
    (2) incorrect page activation reduce performance.


I investigate reason and found reason is [05/19] change.
I tested a bit porton reverted split-lru-patch-series again.

C. 2.6.24-rc8-mm1 + split-lru-patch-series + my-revert-patch
 	(1) 83.014
	(2) 717.009


Of course, We need reintroduce this portion after new page LRU
(aka LRU for used only page).
but now is too early.

I hope this patch series merge to -mm ASAP.
therefore, I hope remove any corner case regression.

Thanks!


- kosaki



Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

---
 mm/vmscan.c |   26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

Index: b/mm/vmscan.c
===================================================================
--- a/mm/vmscan.c       2008-01-29 15:59:17.000000000 +0900
+++ b/mm/vmscan.c       2008-01-30 11:53:42.000000000 +0900
@@ -247,6 +247,27 @@
        return ret;
 }

+/* Called without lock on whether page is mapped, so answer is unstable */
+static inline int page_mapping_inuse(struct page *page)
+{
+       struct address_space *mapping;
+
+       /* Page is in somebody's page tables. */
+       if (page_mapped(page))
+               return 1;
+
+       /* Be more reluctant to reclaim swapcache than pagecache */
+       if (PageSwapCache(page))
+               return 1;
+
+       mapping = page_mapping(page);
+       if (!mapping)
+               return 0;
+
+       /* File is mmap'd by somebody? */
+       return mapping_mapped(mapping);
+}
+
 static inline int is_page_cache_freeable(struct page *page)
 {
        return page_count(page) - !!PagePrivate(page) == 2;
@@ -515,7 +536,8 @@

                referenced = page_referenced(page, 1, sc->mem_cgroup);
                /* In active use or really unfreeable?  Activate it. */
-               if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && referenced)
+               if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
+                                       referenced && page_mapping_inuse(page))
                        goto activate_locked;

 #ifdef CONFIG_SWAP
@@ -550,6 +572,8 @@
                }

                if (PageDirty(page)) {
+                       if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && referenced)
+                               goto keep_locked;
                        if (!may_enter_fs) {
                                sc->nr_io_pages++;
                                goto keep_locked;






^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-30  3:25   ` KOSAKI Motohiro
@ 2008-01-30  8:57     ` KOSAKI Motohiro
  2008-01-30 14:29       ` Lee Schermerhorn
  2008-02-07  0:35       ` Rik van Riel
  0 siblings, 2 replies; 88+ messages in thread
From: KOSAKI Motohiro @ 2008-01-30  8:57 UTC (permalink / raw)
  To: Rik van Riel, Lee Schermerhorn; +Cc: kosaki.motohiro, linux-kernel, linux-mm

Hi Rik, Lee

I found number of scan pages calculation bug.

1. wrong calculation order

	ap *= rotate_sum / (zone->recent_rotated_anon + 1);

   when recent_rotated_anon = 100 and recent_rotated_file = 0,
   
     rotate_sum / (zone->recent_rotated_anon + 1)
   = 100 / 101
   = 0

   at that time, ap become 0.

2. wrong fraction omission

	nr[l] = zone->nr_scan[l] * percent[file] / 100;

	when percent is very small,
	nr[l] become 0.

Test Result:
(1) $ ./hackbench 150 process 1000
(2) # sync; echo 3 > /proc/sys/vm/drop_caches
    $ dd if=tmp10G of=/dev/null
    $ ./hackbench 150 process 1000

rvr-split-lru + revert patch of previous mail
 	(1) 83.014
	(2) 717.009

rvr-split-lru + revert patch of previous mail + below patch
	(1) 61.965
	(2) 85.444 !!


Now, We got 1000% performance improvement against 2.6.24-rc8-mm1 :)



- kosaki


Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

---
 mm/vmscan.c |   11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

Index: b/mm/vmscan.c
===================================================================
--- a/mm/vmscan.c	2008-01-30 15:22:10.000000000 +0900
+++ b/mm/vmscan.c	2008-01-30 16:03:28.000000000 +0900
@@ -1355,7 +1355,7 @@ static void get_scan_ratio(struct zone *
 	 *               anon + file       rotate_sum
 	 */
 	ap = (anon_prio * anon) / (anon + file + 1);
-	ap *= rotate_sum / (zone->recent_rotated_anon + 1);
+	ap = (ap * rotate_sum) / (zone->recent_rotated_anon + 1);
 	if (ap == 0)
 		ap = 1;
 	else if (ap > 100)
@@ -1363,7 +1363,7 @@ static void get_scan_ratio(struct zone *
 	percent[0] = ap;
 
 	fp = (file_prio * file) / (anon + file + 1);
-	fp *= rotate_sum / (zone->recent_rotated_file + 1);
+	fp = (fp * rotate_sum) / (zone->recent_rotated_file + 1);
 	if (fp == 0)
 		fp = 1;
 	else if (fp > 100)
@@ -1402,6 +1402,7 @@ static unsigned long shrink_zone(int pri
 
 	for_each_reclaimable_lru(l) {
 		if (scan_global_lru(sc)) {
+			unsigned long nr_max_scan;
 			int file = is_file_lru(l);
 			/*
 			 * Add one to nr_to_scan just to make sure that the
@@ -1409,7 +1410,11 @@ static unsigned long shrink_zone(int pri
 			 */
 			zone->nr_scan[l] += (zone_page_state(zone,
 				NR_INACTIVE_ANON + l) >> priority) + 1;
-			nr[l] = zone->nr_scan[l] * percent[file] / 100;
+			nr[l] = (zone->nr_scan[l] * percent[file] / 100) + 1;
+			nr_max_scan = zone_page_state(zone, NR_INACTIVE_ANON+l);
+			if (nr[l] > nr_max_scan)
+				nr[l] = nr_max_scan;
+
 			if (nr[l] >= sc->swap_cluster_max)
 				zone->nr_scan[l] = 0;
 			else




^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-30  8:57     ` KOSAKI Motohiro
@ 2008-01-30 14:29       ` Lee Schermerhorn
  2008-01-31  1:17         ` KOSAKI Motohiro
  2008-02-07  0:35       ` Rik van Riel
  1 sibling, 1 reply; 88+ messages in thread
From: Lee Schermerhorn @ 2008-01-30 14:29 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Rik van Riel, linux-kernel, linux-mm

On Wed, 2008-01-30 at 17:57 +0900, KOSAKI Motohiro wrote:
> Hi Rik, Lee
> 
> I found number of scan pages calculation bug.
> 
> 1. wrong calculation order
> 
> 	ap *= rotate_sum / (zone->recent_rotated_anon + 1);
> 
>    when recent_rotated_anon = 100 and recent_rotated_file = 0,
>    
>      rotate_sum / (zone->recent_rotated_anon + 1)
>    = 100 / 101
>    = 0
> 
>    at that time, ap become 0.
> 
> 2. wrong fraction omission
> 
> 	nr[l] = zone->nr_scan[l] * percent[file] / 100;
> 
> 	when percent is very small,
> 	nr[l] become 0.
> 
> Test Result:
> (1) $ ./hackbench 150 process 1000
> (2) # sync; echo 3 > /proc/sys/vm/drop_caches
>     $ dd if=tmp10G of=/dev/null
>     $ ./hackbench 150 process 1000
> 
> rvr-split-lru + revert patch of previous mail
>  	(1) 83.014
> 	(2) 717.009
> 
> rvr-split-lru + revert patch of previous mail + below patch
> 	(1) 61.965
> 	(2) 85.444 !!
> 
> 
> Now, We got 1000% performance improvement against 2.6.24-rc8-mm1 :)
> 
> 
> 
> - kosaki
> 
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

<snip>

Kosaki-san:

Rik is currently out on holiday and I've been traveling.  Just getting
back to rebasing to 24-rc8-mm1.  Thank you for your efforts in testing
and tracking down the regressions.  I will add your fixes into my tree
and try them out and let you know.  Rik mentioned to me that he has a
fix for the "get_scan_ratio()" calculation that is causing us to OOM
kill prematurely--i.e., when we still have lots of swap space to evict
swappable anon.  I don't know if it's similar to what you have posted.
Have to wait and see what he says.  Meantime, we'll try your patches.

Again, thank you.

Regards,
Lee




^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-30 14:29       ` Lee Schermerhorn
@ 2008-01-31  1:17         ` KOSAKI Motohiro
  2008-01-31 10:48           ` Rik van Riel
  0 siblings, 1 reply; 88+ messages in thread
From: KOSAKI Motohiro @ 2008-01-31  1:17 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: kosaki.motohiro, Rik van Riel, linux-kernel, linux-mm

Hi Lee-san

> Rik is currently out on holiday and I've been traveling.  Just getting
> back to rebasing to 24-rc8-mm1.  Thank you for your efforts in testing
> and tracking down the regressions.  I will add your fixes into my tree
> and try them out and let you know.  Rik mentioned to me that he has a
> fix for the "get_scan_ratio()" calculation that is causing us to OOM
> kill prematurely--i.e., when we still have lots of swap space to evict
> swappable anon.  I don't know if it's similar to what you have posted.
> Have to wait and see what he says.  Meantime, we'll try your patches.

thank you for your quick response.

on my test environment, my patch solve incorrect OOM.
because, too small reclaim cause OOM.

Please confirm.


- kosaki



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-31  1:17         ` KOSAKI Motohiro
@ 2008-01-31 10:48           ` Rik van Riel
  2008-01-31 10:59             ` KOSAKI Motohiro
  0 siblings, 1 reply; 88+ messages in thread
From: Rik van Riel @ 2008-01-31 10:48 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Lee Schermerhorn, kosaki.motohiro, linux-kernel, linux-mm

On Thu, 31 Jan 2008 10:17:48 +0900
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> on my test environment, my patch solve incorrect OOM.
> because, too small reclaim cause OOM.

That makes sense.

The version you two are looking at can return
"percentages" way larger than 100 in get_scan_ratio.

A fixed version of get_scan_ratio, where the
percentages always add up to 100%, makes the
system go OOM before it seriously starts
swapping.

I will integrate your fixes with my code when I
get back from holidays.  Then things should work :)

Thank you for your analysis of the problem.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-31 10:48           ` Rik van Riel
@ 2008-01-31 10:59             ` KOSAKI Motohiro
  0 siblings, 0 replies; 88+ messages in thread
From: KOSAKI Motohiro @ 2008-01-31 10:59 UTC (permalink / raw)
  To: Rik van Riel; +Cc: kosaki.motohiro, Lee Schermerhorn, linux-kernel, linux-mm

> I will integrate your fixes with my code when I
> get back from holidays.  Then things should work :)
> 
> Thank you for your analysis of the problem.

Thank you.
enjoy good vacation :)

-
kosaki




^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-01-30  8:57     ` KOSAKI Motohiro
  2008-01-30 14:29       ` Lee Schermerhorn
@ 2008-02-07  0:35       ` Rik van Riel
  2008-02-07  1:20         ` KOSAKI Motohiro
  1 sibling, 1 reply; 88+ messages in thread
From: Rik van Riel @ 2008-02-07  0:35 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Lee Schermerhorn, kosaki.motohiro, linux-kernel, linux-mm

On Wed, 30 Jan 2008 17:57:54 +0900
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> I found number of scan pages calculation bug.

My latest version of get_scan_ratio() works differently, with the
percentages always adding up to 100.  However, your patch gave me
the inspiration to (hopefully) find the bug in my version of the
code.
 
> 1. wrong calculation order

I do not believe my new code has this problem.  Of course, this
is purely due to luck :)

> 2. wrong fraction omission
> 
> 	nr[l] = zone->nr_scan[l] * percent[file] / 100;
> 
> 	when percent is very small,
> 	nr[l] become 0.

This is probably where the problem is.  Kind of.

I believe that the problem is that we scale nr[l] by the percentage,
instead of scaling the amount we add to zone->nr_scan[l] by the
percentage!

> @@ -1409,7 +1410,11 @@ static unsigned long shrink_zone(int pri
>  			 */
>  			zone->nr_scan[l] += (zone_page_state(zone,
>  				NR_INACTIVE_ANON + l) >> priority) + 1;
> -			nr[l] = zone->nr_scan[l] * percent[file] / 100;
> +			nr[l] = (zone->nr_scan[l] * percent[file] / 100) + 1;
> +			nr_max_scan = zone_page_state(zone, NR_INACTIVE_ANON+l);
> +			if (nr[l] > nr_max_scan)
> +				nr[l] = nr_max_scan;
> +
>  			if (nr[l] >= sc->swap_cluster_max)
>  				zone->nr_scan[l] = 0;
>  			else

With the fix below (against my latest tree), we always add at least one
to zone->nr_scan[l] and always make that increment count later on!

I am still recovering from my trip home (thanks to the airline companies
I spent 25 hours travelling, from door to door), so I may not get around
to actually testing this today:

Index: linux-2.6.24-rc6-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/mm/vmscan.c	2008-02-06 19:23:16.000000000 -0500
+++ linux-2.6.24-rc6-mm1/mm/vmscan.c	2008-02-06 19:22:55.000000000 -0500
@@ -1275,13 +1275,17 @@ static unsigned long shrink_zone(int pri
 	for_each_lru(l) {
 		if (scan_global_lru(sc)) {
 			int file = is_file_lru(l);
+			int scan;
 			/*
 			 * Add one to nr_to_scan just to make sure that the
-			 * kernel will slowly sift through the active list.
+			 * kernel will slowly sift through each list.
 			 */
-			zone->nr_scan[l] += (zone_page_state(zone,
-				NR_INACTIVE_ANON + l) >> priority) + 1;
-			nr[l] = zone->nr_scan[l] * percent[file] / 100;
+			scan = zone_page_state(zone, NR_INACTIVE_ANON + l);
+			scan >>= priority;
+			scan = (scan * percent[file]) / 100;
+
+			zone->nr_scan[l] += scan + 1;
+			nr[l] = zone->nr_scan[l];
 			if (nr[l] >= sc->swap_cluster_max)
 				zone->nr_scan[l] = 0;
 			else


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-02-07  0:35       ` Rik van Riel
@ 2008-02-07  1:20         ` KOSAKI Motohiro
  2008-02-07  1:36           ` Rik van Riel
  0 siblings, 1 reply; 88+ messages in thread
From: KOSAKI Motohiro @ 2008-02-07  1:20 UTC (permalink / raw)
  To: Rik van Riel; +Cc: kosaki.motohiro, Lee Schermerhorn, linux-kernel, linux-mm

Hi Rik

Welcome back :)

> > I found number of scan pages calculation bug.
> 
> My latest version of get_scan_ratio() works differently, with the
> percentages always adding up to 100.  However, your patch gave me
> the inspiration to (hopefully) find the bug in my version of the
> code.

OK.


> > 2. wrong fraction omission
> > 
> > 	nr[l] = zone->nr_scan[l] * percent[file] / 100;
> > 
> > 	when percent is very small,
> > 	nr[l] become 0.
> 
> This is probably where the problem is.  Kind of.
> 
> I believe that the problem is that we scale nr[l] by the percentage,
> instead of scaling the amount we add to zone->nr_scan[l] by the
> percentage!

Aahh,
you are right.


> Index: linux-2.6.24-rc6-mm1/mm/vmscan.c
> ===================================================================
> --- linux-2.6.24-rc6-mm1.orig/mm/vmscan.c	2008-02-06 19:23:16.000000000 -0500
> +++ linux-2.6.24-rc6-mm1/mm/vmscan.c	2008-02-06 19:22:55.000000000 -0500
> @@ -1275,13 +1275,17 @@ static unsigned long shrink_zone(int pri
>  	for_each_lru(l) {
>  		if (scan_global_lru(sc)) {
>  			int file = is_file_lru(l);
> +			int scan;
>  			/*
>  			 * Add one to nr_to_scan just to make sure that the
> -			 * kernel will slowly sift through the active list.
> +			 * kernel will slowly sift through each list.
>  			 */
> -			zone->nr_scan[l] += (zone_page_state(zone,
> -				NR_INACTIVE_ANON + l) >> priority) + 1;
> -			nr[l] = zone->nr_scan[l] * percent[file] / 100;
> +			scan = zone_page_state(zone, NR_INACTIVE_ANON + l);
> +			scan >>= priority;
> +			scan = (scan * percent[file]) / 100;
> +
> +			zone->nr_scan[l] += scan + 1;
> +			nr[l] = zone->nr_scan[l];
>  			if (nr[l] >= sc->swap_cluster_max)
>  				zone->nr_scan[l] = 0;
>  			else

looks good.
thank you clean up code.


- kosaki



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 05/19] split LRU lists into anon & file sets
  2008-02-07  1:20         ` KOSAKI Motohiro
@ 2008-02-07  1:36           ` Rik van Riel
  0 siblings, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2008-02-07  1:36 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: kosaki.motohiro, Lee Schermerhorn, linux-kernel, linux-mm

On Thu, 07 Feb 2008 10:20:39 +0900
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> looks good.
> thank you clean up code.

Yeah, it looks good.

Too bad it still does not work :)

Oh well, I'll look at that tomorrow.  Jet lag is catching up
with me, so I should get some rest first...

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 00/19] VM pageout scalability improvements
  2008-01-07 19:07               ` Christoph Lameter
@ 2008-01-07 19:32                 ` Rik van Riel
  0 siblings, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2008-01-07 19:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Lee Schermerhorn, Andi Kleen, linux-kernel, linux-mm,
	Eric Whitney, Nick Dokos

On Mon, 7 Jan 2008 11:07:54 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:
> On Fri, 4 Jan 2008, Lee Schermerhorn wrote:
> 
> > We see this on both NUMA and non-NUMA. x86_64 and ia64.  The basic
> > criteria to reproduce is to be able to run thousands [or low 10s of
> > thousands] of tasks, continually increasing the number until the system
> > just goes into reclaim.  Instead of swapping, the system seems to
> > hang--unresponsive from the console, but with "soft lockup" messages
> > spitting out every few seconds...
> 
> Ditto here.

I have some suspicions on what could be causing this.

The most obvious suspect is get_scan_ratio() continuing to return
100 file reclaim, 0 anon reclaim when the file LRUs have already
been reduced to something very small, because reclaiming up to that
point was easy.

I plan to add some code to automatically set the anon reclaim to
100% if (free + file_active + file_inactive <= zone->pages_high),
meaning that reclaiming just file pages will not be able to free
enough pages.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 00/19] VM pageout scalability improvements
  2008-01-04 17:06             ` Lee Schermerhorn
@ 2008-01-07 19:07               ` Christoph Lameter
  2008-01-07 19:32                 ` Rik van Riel
  0 siblings, 1 reply; 88+ messages in thread
From: Christoph Lameter @ 2008-01-07 19:07 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Andi Kleen, Rik van Riel, linux-kernel, linux-mm, Eric Whitney,
	Nick Dokos

On Fri, 4 Jan 2008, Lee Schermerhorn wrote:

> We see this on both NUMA and non-NUMA. x86_64 and ia64.  The basic
> criteria to reproduce is to be able to run thousands [or low 10s of
> thousands] of tasks, continually increasing the number until the system
> just goes into reclaim.  Instead of swapping, the system seems to
> hang--unresponsive from the console, but with "soft lockup" messages
> spitting out every few seconds...

Ditto here.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 00/19] VM pageout scalability improvements
  2008-01-07 10:06     ` KAMEZAWA Hiroyuki
@ 2008-01-07 15:18       ` Rik van Riel
  0 siblings, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2008-01-07 15:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Lee Schermerhorn, linux-kernel, linux-mm, Eric Whitney

On Mon, 7 Jan 2008 19:06:10 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 3 Jan 2008 12:00:00 -0500
> Rik van Riel <riel@redhat.com> wrote:

> > If there is no swap space, my VM code will not bother scanning
> > any anon pages.  This has the same effect as moving the pages
> > to the no-reclaim list, with the extra benefit of being able to
> > resume scanning the anon lists once swap space is freed.
> > 
> Is this 'avoiding scanning anon if no swap' feature  in this set ?

I seem to have lost that code in a forward merge :(

Dunno if I started the forward merge from an older series that
Lee had or if I lost the code myself...

I'll put it back in ASAP.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 00/19] VM pageout scalability improvements
  2008-01-03 17:00   ` Rik van Riel
  2008-01-03 17:13     ` Lee Schermerhorn
@ 2008-01-07 10:06     ` KAMEZAWA Hiroyuki
  2008-01-07 15:18       ` Rik van Riel
  1 sibling, 1 reply; 88+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-01-07 10:06 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Lee Schermerhorn, linux-kernel, linux-mm, Eric Whitney

On Thu, 3 Jan 2008 12:00:00 -0500
Rik van Riel <riel@redhat.com> wrote:

> On Thu, 03 Jan 2008 11:52:08 -0500
> Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> 
> > Also, I should point out that the full noreclaim series includes a
> > couple of other patches NOT posted here by Rik:
> > 
> > 1) treat swap backed pages as nonreclaimable when no swap space is
> > available.  This addresses a problem we've seen in real life, with
> > vmscan spending a lot of time trying to reclaim anon/shmem/tmpfs/...
> > pages only to find that there is no swap space--add_to_swap() fails.
> > Maybe not a problem with Rik's new anon page handling.
> 
> If there is no swap space, my VM code will not bother scanning
> any anon pages.  This has the same effect as moving the pages
> to the no-reclaim list, with the extra benefit of being able to
> resume scanning the anon lists once swap space is freed.
> 
Is this 'avoiding scanning anon if no swap' feature  in this set ?

Thanks
-Kame


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 00/19] VM pageout scalability improvements
  2008-01-04 16:55             ` Rik van Riel
@ 2008-01-04 18:07               ` Larry Woodman
  0 siblings, 0 replies; 88+ messages in thread
From: Larry Woodman @ 2008-01-04 18:07 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andi Kleen, Lee Schermerhorn, linux-kernel, linux-mm,
	Eric Whitney, Nick Dokos

Rik van Riel wrote:

>On Fri, 04 Jan 2008 17:34:00 +0100
>Andi Kleen <andi@firstfloor.org> wrote:
>  
>
>>Lee Schermerhorn <Lee.Schermerhorn@hp.com> writes:
>>
>>    
>>
>>>We can easily [he says, glibly] reproduce the hang on the anon_vma lock
>>>      
>>>
>>Is that a NUMA platform? On non x86? Perhaps you just need queued spinlocks?
>>    
>>
>
>I really think that the anon_vma and i_mmap_lock spinlock hangs are
>due to the lack of queued spinlocks.  Not because I have seen your
>system hang, but because I've seen one of Larry's test systems here
>hang in scary/amusing ways :)
>
Changing the anon_vma->lock into a rwlock_t helps because 
page_lock_anon_vma()
can take it for read and thats where the contention is.  However its the 
fact that under
some tests, most of the pages are in vmas queued to one anon_vma that 
causes so much
lock contention.


>
>With queued spinlocks the system should just slow down, not hang.
>
>  
>



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 00/19] VM pageout scalability improvements
  2008-01-04 16:34           ` Andi Kleen
  2008-01-04 16:55             ` Rik van Riel
@ 2008-01-04 17:06             ` Lee Schermerhorn
  2008-01-07 19:07               ` Christoph Lameter
  1 sibling, 1 reply; 88+ messages in thread
From: Lee Schermerhorn @ 2008-01-04 17:06 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Rik van Riel, linux-kernel, linux-mm, Eric Whitney, Nick Dokos

On Fri, 2008-01-04 at 17:34 +0100, Andi Kleen wrote:
> Lee Schermerhorn <Lee.Schermerhorn@hp.com> writes:
> 
> > We can easily [he says, glibly] reproduce the hang on the anon_vma lock
> 
> Is that a NUMA platform? On non x86? Perhaps you just need queued spinlocks?

We see this on both NUMA and non-NUMA. x86_64 and ia64.  The basic
criteria to reproduce is to be able to run thousands [or low 10s of
thousands] of tasks, continually increasing the number until the system
just goes into reclaim.  Instead of swapping, the system seems to
hang--unresponsive from the console, but with "soft lockup" messages
spitting out every few seconds...


Lee 


> 
> -Andi


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 00/19] VM pageout scalability improvements
  2008-01-04 16:34           ` Andi Kleen
@ 2008-01-04 16:55             ` Rik van Riel
  2008-01-04 18:07               ` Larry Woodman
  2008-01-04 17:06             ` Lee Schermerhorn
  1 sibling, 1 reply; 88+ messages in thread
From: Rik van Riel @ 2008-01-04 16:55 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Lee Schermerhorn, linux-kernel, linux-mm, Eric Whitney, Nick Dokos

On Fri, 04 Jan 2008 17:34:00 +0100
Andi Kleen <andi@firstfloor.org> wrote:
> Lee Schermerhorn <Lee.Schermerhorn@hp.com> writes:
> 
> > We can easily [he says, glibly] reproduce the hang on the anon_vma lock
> 
> Is that a NUMA platform? On non x86? Perhaps you just need queued spinlocks?

I really think that the anon_vma and i_mmap_lock spinlock hangs are
due to the lack of queued spinlocks.  Not because I have seen your
system hang, but because I've seen one of Larry's test systems here
hang in scary/amusing ways :)

With queued spinlocks the system should just slow down, not hang.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 00/19] VM pageout scalability improvements
  2008-01-04 16:25         ` Lee Schermerhorn
@ 2008-01-04 16:34           ` Andi Kleen
  2008-01-04 16:55             ` Rik van Riel
  2008-01-04 17:06             ` Lee Schermerhorn
  0 siblings, 2 replies; 88+ messages in thread
From: Andi Kleen @ 2008-01-04 16:34 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Rik van Riel, linux-kernel, linux-mm, Eric Whitney, Nick Dokos

Lee Schermerhorn <Lee.Schermerhorn@hp.com> writes:

> We can easily [he says, glibly] reproduce the hang on the anon_vma lock

Is that a NUMA platform? On non x86? Perhaps you just need queued spinlocks?

-Andi

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 00/19] VM pageout scalability improvements
  2008-01-03 22:00       ` Rik van Riel
@ 2008-01-04 16:25         ` Lee Schermerhorn
  2008-01-04 16:34           ` Andi Kleen
  0 siblings, 1 reply; 88+ messages in thread
From: Lee Schermerhorn @ 2008-01-04 16:25 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm, Eric Whitney, Nick Dokos

On Thu, 2008-01-03 at 17:00 -0500, Rik van Riel wrote:
> On Thu, 03 Jan 2008 12:13:32 -0500
> Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> 
> > Yes, but the problem, when it occurs, is very awkward.  The system just
> > hangs for hours/days spinning on the reverse mapping locks--in both
> > page_referenced() and try_to_unmap().  No pages get reclaimed and NO OOM
> > kill occurs because we never get that far.  So, I'm not sure I'd call
> > any OOM kills resulting from this patch as "false".  The memory is
> > effectively nonreclaimable.   Now, I think that your anon pages SEQ
> > patch will eliminate the contention in page_referenced[_anon](), but we
> > could still hang in try_to_unmap().
> 
> I am hoping that Nick's ticket spinlocks will fix this problem.
> 
> Would you happen to have any test cases for the above problem that
> I could use to reproduce the problem and look for an automatic fix?

We can easily [he says, glibly] reproduce the hang on the anon_vma lock
with AIM7 loads on our test platforms.  Perhaps we can come up with an
AIM workload to reproduce the phenomenon on one of your test platforms.
I've seen the hang with 15K-20K tasks on a 4 socket x86_64 with 16-32G
of memory and quite a bit of storage.

I've also seen related hangs on both anon_vma and i_mmap_lock during a
heavy usex stress load on the splitlru+noreclaim patches.  [This, by the
way, without and WITH my rw_lock patches for both anon_vma and
i_mmap_lock.]  I can try to package up the workload to run on your
system.

> 
> Any fix that requires the sysadmin to tune things _just_ right seems
> too dangerous to me - especially if a change in the workload can
> result in the system doing exactly the wrong thing...
> 
> The idea is valid, but it just has to work automagically.
> 
> Btw, if page_referenced() is called less, the locks that try_to_unmap()
> also takes should get less contention.

Makes sense.  we'll have to see.

Lee
> 


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 00/19] VM pageout scalability improvements
  2008-01-03 17:13     ` Lee Schermerhorn
@ 2008-01-03 22:00       ` Rik van Riel
  2008-01-04 16:25         ` Lee Schermerhorn
  0 siblings, 1 reply; 88+ messages in thread
From: Rik van Riel @ 2008-01-03 22:00 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: linux-kernel, linux-mm, Eric Whitney

On Thu, 03 Jan 2008 12:13:32 -0500
Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:

> Yes, but the problem, when it occurs, is very awkward.  The system just
> hangs for hours/days spinning on the reverse mapping locks--in both
> page_referenced() and try_to_unmap().  No pages get reclaimed and NO OOM
> kill occurs because we never get that far.  So, I'm not sure I'd call
> any OOM kills resulting from this patch as "false".  The memory is
> effectively nonreclaimable.   Now, I think that your anon pages SEQ
> patch will eliminate the contention in page_referenced[_anon](), but we
> could still hang in try_to_unmap().

I am hoping that Nick's ticket spinlocks will fix this problem.

Would you happen to have any test cases for the above problem that
I could use to reproduce the problem and look for an automatic fix?

Any fix that requires the sysadmin to tune things _just_ right seems
too dangerous to me - especially if a change in the workload can
result in the system doing exactly the wrong thing...

The idea is valid, but it just has to work automagically.

Btw, if page_referenced() is called less, the locks that try_to_unmap()
also takes should get less contention.

-- 
All Rights Reversed

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 00/19] VM pageout scalability improvements
  2008-01-03 17:00   ` Rik van Riel
@ 2008-01-03 17:13     ` Lee Schermerhorn
  2008-01-03 22:00       ` Rik van Riel
  2008-01-07 10:06     ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 88+ messages in thread
From: Lee Schermerhorn @ 2008-01-03 17:13 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm, Eric Whitney

On Thu, 2008-01-03 at 12:00 -0500, Rik van Riel wrote:
> On Thu, 03 Jan 2008 11:52:08 -0500
> Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> 
> > Also, I should point out that the full noreclaim series includes a
> > couple of other patches NOT posted here by Rik:
> > 
> > 1) treat swap backed pages as nonreclaimable when no swap space is
> > available.  This addresses a problem we've seen in real life, with
> > vmscan spending a lot of time trying to reclaim anon/shmem/tmpfs/...
> > pages only to find that there is no swap space--add_to_swap() fails.
> > Maybe not a problem with Rik's new anon page handling.
> 
> If there is no swap space, my VM code will not bother scanning
> any anon pages.  This has the same effect as moving the pages
> to the no-reclaim list, with the extra benefit of being able to
> resume scanning the anon lists once swap space is freed.
> 
> > 2) treat anon pages with "excessively long" anon_vma lists as
> > nonreclaimable.   "excessively long" here is a sysctl tunable parameter.
> > This also addresses problems we've seen with benchmarks and stress
> > tests--all cpus spinning on some anon_vma lock.  In "real life", we've
> > seen this behavior with file backed pages--spinning on the
> > i_mmap_lock--running Oracle workloads with user counts in the few
> > thousands.  Again, something we may not need with Rik's vmscan rework.
> > If we did want to do this, we'd probably want to address file backed
> > pages and add support to bring the pages back from the noreclaim list
> > when the number of "mappers" drops below the threshold.  My current
> > patch leaves anon pages as non-reclaimable until they're freed, or
> > manually scanned via the mechanism introduced by patch 12.
> 
> I can see some issues with that patch.  Specifically, if the threshold
> is set too high no pages will be affected, and if the threshold is too
> low all pages will become non-reclaimable, leading to a false OOM kill.
> 
> Not only is it a very big hammer, it's also a rather awkward one...

Yes, but the problem, when it occurs, is very awkward.  The system just
hangs for hours/days spinning on the reverse mapping locks--in both
page_referenced() and try_to_unmap().  No pages get reclaimed and NO OOM
kill occurs because we never get that far.  So, I'm not sure I'd call
any OOM kills resulting from this patch as "false".  The memory is
effectively nonreclaimable.   Now, I think that your anon pages SEQ
patch will eliminate the contention in page_referenced[_anon](), but we
could still hang in try_to_unmap().  And we have the issue with file
back pages and the i_mmap_lock.  I'll see if this issue comes up in
testings with the current series.  If not, cool!  If so, we just have
more work to do.

Later,
Lee
> 


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 00/19] VM pageout scalability improvements
  2008-01-03 16:52 ` Lee Schermerhorn
@ 2008-01-03 17:00   ` Rik van Riel
  2008-01-03 17:13     ` Lee Schermerhorn
  2008-01-07 10:06     ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 88+ messages in thread
From: Rik van Riel @ 2008-01-03 17:00 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: linux-kernel, linux-mm, Eric Whitney

On Thu, 03 Jan 2008 11:52:08 -0500
Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:

> Also, I should point out that the full noreclaim series includes a
> couple of other patches NOT posted here by Rik:
> 
> 1) treat swap backed pages as nonreclaimable when no swap space is
> available.  This addresses a problem we've seen in real life, with
> vmscan spending a lot of time trying to reclaim anon/shmem/tmpfs/...
> pages only to find that there is no swap space--add_to_swap() fails.
> Maybe not a problem with Rik's new anon page handling.

If there is no swap space, my VM code will not bother scanning
any anon pages.  This has the same effect as moving the pages
to the no-reclaim list, with the extra benefit of being able to
resume scanning the anon lists once swap space is freed.

> 2) treat anon pages with "excessively long" anon_vma lists as
> nonreclaimable.   "excessively long" here is a sysctl tunable parameter.
> This also addresses problems we've seen with benchmarks and stress
> tests--all cpus spinning on some anon_vma lock.  In "real life", we've
> seen this behavior with file backed pages--spinning on the
> i_mmap_lock--running Oracle workloads with user counts in the few
> thousands.  Again, something we may not need with Rik's vmscan rework.
> If we did want to do this, we'd probably want to address file backed
> pages and add support to bring the pages back from the noreclaim list
> when the number of "mappers" drops below the threshold.  My current
> patch leaves anon pages as non-reclaimable until they're freed, or
> manually scanned via the mechanism introduced by patch 12.

I can see some issues with that patch.  Specifically, if the threshold
is set too high no pages will be affected, and if the threshold is too
low all pages will become non-reclaimable, leading to a false OOM kill.

Not only is it a very big hammer, it's also a rather awkward one...

-- 
All Rights Reversed

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch 00/19] VM pageout scalability improvements
  2008-01-02 22:41 [patch 00/19] VM pageout scalability improvements linux-kernel
@ 2008-01-03 16:52 ` Lee Schermerhorn
  2008-01-03 17:00   ` Rik van Riel
  0 siblings, 1 reply; 88+ messages in thread
From: Lee Schermerhorn @ 2008-01-03 16:52 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, Rik van Riel, Eric Whitney

On Wed, 2008-01-02 at 17:41 -0500, linux-kernel@vger.kernelporg wrote:
> On large memory systems, the VM can spend way too much time scanning
> through pages that it cannot (or should not) evict from memory. Not
> only does it use up CPU time, but it also provokes lock contention
> and can leave large systems under memory presure in a catatonic state.
> 
> Against 2.6.24-rc6-mm1
> 
> This patch series improves VM scalability by:
> 
> 1) making the locking a little more scalable
> 
> 2) putting filesystem backed, swap backed and non-reclaimable pages
>    onto their own LRUs, so the system only scans the pages that it
>    can/should evict from memory
> 
> 3) switching to SEQ replacement for the anonymous LRUs, so the
>    number of pages that need to be scanned when the system
>    starts swapping is bound to a reasonable number
> 
> The noreclaim patches come verbatim from Lee Schermerhorn and
> Nick Piggin.  I have made a few small fixes to them and left out
> the bits that are no longer needed with split file/anon lists.
> 
> The exception is "Scan noreclaim list for reclaimable pages",
> which should not be needed but could be a useful debugging tool.

Note that patch 14/19 [SHM_LOCK/UNLOCK handling] depends on the
infrastructure introduced by the "Scan noreclaim list for reclaimable
pages" patch.  When SHM_UNLOCKing a shm segment, we call a new
scan_mapping_noreclaim_page() function to check all of the pages in the
segment for reclaimability.  There might be other reasons for the pages
to be non-reclaimable...

So, we can't merge 14/19 as is w/o some of patch 12.  We can probably
eliminate the sysctl and per node sysfs attributes to force a scan.
But, as Rik says, this has been useful for debugging--e.g., periodically
forcing a full rescan while running a stress load.

Also, I should point out that the full noreclaim series includes a
couple of other patches NOT posted here by Rik:

1) treat swap backed pages as nonreclaimable when no swap space is
available.  This addresses a problem we've seen in real life, with
vmscan spending a lot of time trying to reclaim anon/shmem/tmpfs/...
pages only to find that there is no swap space--add_to_swap() fails.
Maybe not a problem with Rik's new anon page handling.  We'll see.  If
we did want to add this filter, we'll need a way to bring back pages
from the noreclaim list that are there only for lack of swap space when
space is added or becomes available.

2) treat anon pages with "excessively long" anon_vma lists as
nonreclaimable.   "excessively long" here is a sysctl tunable parameter.
This also addresses problems we've seen with benchmarks and stress
tests--all cpus spinning on some anon_vma lock.  In "real life", we've
seen this behavior with file backed pages--spinning on the
i_mmap_lock--running Oracle workloads with user counts in the few
thousands.  Again, something we may not need with Rik's vmscan rework.
If we did want to do this, we'd probably want to address file backed
pages and add support to bring the pages back from the noreclaim list
when the number of "mappers" drops below the threshold.  My current
patch leaves anon pages as non-reclaimable until they're freed, or
manually scanned via the mechanism introduced by patch 12.

Lee
> 


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [patch 00/19] VM pageout scalability improvements
@ 2008-01-02 22:41 linux-kernel
  2008-01-03 16:52 ` Lee Schermerhorn
  0 siblings, 1 reply; 88+ messages in thread
From: linux-kernel @ 2008-01-02 22:41 UTC (permalink / raw)
  Cc: linux-mm, lee.schermerhorn

On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory. Not
only does it use up CPU time, but it also provokes lock contention
and can leave large systems under memory presure in a catatonic state.

Against 2.6.24-rc6-mm1

This patch series improves VM scalability by:

1) making the locking a little more scalable

2) putting filesystem backed, swap backed and non-reclaimable pages
   onto their own LRUs, so the system only scans the pages that it
   can/should evict from memory

3) switching to SEQ replacement for the anonymous LRUs, so the
   number of pages that need to be scanned when the system
   starts swapping is bound to a reasonable number

The noreclaim patches come verbatim from Lee Schermerhorn and
Nick Piggin.  I have made a few small fixes to them and left out
the bits that are no longer needed with split file/anon lists.

The exception is "Scan noreclaim list for reclaimable pages",
which should not be needed but could be a useful debugging tool.

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 88+ messages in thread

end of thread, other threads:[~2008-02-07  1:36 UTC | newest]

Thread overview: 88+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-01-08 20:59 [patch 00/19] VM pageout scalability improvements Rik van Riel
2008-01-08 20:59 ` [patch 01/19] move isolate_lru_page() to vmscan.c Rik van Riel
2008-01-08 22:03   ` Christoph Lameter
2008-01-08 20:59 ` [patch 02/19] free swap space on swap-in/activation Rik van Riel
2008-01-08 22:10   ` Christoph Lameter
2008-01-08 20:59 ` [patch 03/19] define page_file_cache() function Rik van Riel
2008-01-08 22:18   ` Christoph Lameter
2008-01-08 22:28     ` Rik van Riel
2008-01-09  4:26       ` KAMEZAWA Hiroyuki
2008-01-08 20:59 ` [patch 04/19] Use an indexed array for LRU variables Rik van Riel
2008-01-08 20:59 ` [patch 05/19] split LRU lists into anon & file sets Rik van Riel
2008-01-08 22:22   ` Christoph Lameter
2008-01-08 22:36     ` Rik van Riel
2008-01-08 22:42       ` Christoph Lameter
2008-01-09  2:45         ` Rik van Riel
2008-01-09  4:41   ` KAMEZAWA Hiroyuki
2008-01-10  2:21     ` Balbir Singh
2008-01-10  2:36       ` KAMEZAWA Hiroyuki
2008-01-10  3:26         ` Balbir Singh
2008-01-10  4:23           ` KAMEZAWA Hiroyuki
2008-01-10  2:28   ` KAMEZAWA Hiroyuki
2008-01-10  2:37     ` Rik van Riel
2008-01-11  3:59   ` KOSAKI Motohiro
2008-01-11 15:37     ` Rik van Riel
2008-01-11  6:24   ` KOSAKI Motohiro
2008-01-11 15:42     ` Rik van Riel
2008-01-11 15:59       ` Lee Schermerhorn
2008-01-11 16:15         ` Rik van Riel
2008-01-11 19:51           ` Lee Schermerhorn
2008-01-11 15:50     ` Lee Schermerhorn
2008-01-11 16:06       ` Rik van Riel
2008-01-11  7:35   ` KOSAKI Motohiro
2008-01-11 15:46     ` Rik van Riel
2008-01-14 23:57       ` KOSAKI Motohiro
2008-01-30  3:25   ` KOSAKI Motohiro
2008-01-30  8:57     ` KOSAKI Motohiro
2008-01-30 14:29       ` Lee Schermerhorn
2008-01-31  1:17         ` KOSAKI Motohiro
2008-01-31 10:48           ` Rik van Riel
2008-01-31 10:59             ` KOSAKI Motohiro
2008-02-07  0:35       ` Rik van Riel
2008-02-07  1:20         ` KOSAKI Motohiro
2008-02-07  1:36           ` Rik van Riel
2008-01-08 20:59 ` [patch 06/19] SEQ replacement for anonymous pages Rik van Riel
2008-01-08 20:59 ` [patch 07/19] (NEW) add some sanity checks to get_scan_ratio Rik van Riel
2008-01-09  4:16   ` KAMEZAWA Hiroyuki
2008-01-09 12:53     ` Rik van Riel
2008-01-08 20:59 ` [patch 08/19] add newly swapped in pages to the inactive list Rik van Riel
2008-01-08 22:28   ` Christoph Lameter
2008-01-08 20:59 ` [patch 09/19] (NEW) more aggressively use lumpy reclaim Rik van Riel
2008-01-08 22:30   ` Christoph Lameter
2008-01-14 15:28     ` Mel Gorman
2008-01-08 20:59 ` [patch 10/19] No Reclaim LRU Infrastructure Rik van Riel
2008-01-11  4:36   ` KOSAKI Motohiro
2008-01-11 15:43     ` Lee Schermerhorn
2008-01-15  0:06       ` KOSAKI Motohiro
2008-01-08 20:59 ` [patch 11/19] Non-reclaimable page statistics Rik van Riel
2008-01-08 20:59 ` [patch 12/19] scan noreclaim list for reclaimable pages Rik van Riel
2008-01-08 20:59 ` [patch 13/19] ramfs pages are non-reclaimable Rik van Riel
2008-01-08 20:59 ` [patch 14/19] SHM_LOCKED pages are nonreclaimable Rik van Riel
2008-01-08 20:59 ` [patch 15/19] non-reclaimable mlocked pages Rik van Riel
2008-01-08 20:59 ` [patch 16/19] mlock vma pages under mmap_sem held for read Rik van Riel
2008-01-08 20:59 ` [patch 17/19] handle mlocked pages during map/unmap and truncate Rik van Riel
2008-01-08 20:59 ` [patch 18/19] account mlocked pages Rik van Riel
2008-01-11 12:51   ` Balbir Singh
2008-01-13  5:18     ` Rik van Riel
2008-01-08 20:59 ` [patch 19/19] cull non-reclaimable anon pages from the LRU at fault time Rik van Riel
2008-01-10  4:39 ` [patch 00/19] VM pageout scalability improvements Mike Snitzer
2008-01-10 15:41   ` Rik van Riel
2008-01-10 16:08     ` Mike Snitzer
2008-01-11 10:41 ` Balbir Singh
2008-01-11 15:38   ` Rik van Riel
2008-01-11 11:47 ` Balbir Singh
2008-01-16  6:17 ` rvr split LRU minor regression ? KOSAKI Motohiro
  -- strict thread matches above, loose matches on Subject: below --
2008-01-02 22:41 [patch 00/19] VM pageout scalability improvements linux-kernel
2008-01-03 16:52 ` Lee Schermerhorn
2008-01-03 17:00   ` Rik van Riel
2008-01-03 17:13     ` Lee Schermerhorn
2008-01-03 22:00       ` Rik van Riel
2008-01-04 16:25         ` Lee Schermerhorn
2008-01-04 16:34           ` Andi Kleen
2008-01-04 16:55             ` Rik van Riel
2008-01-04 18:07               ` Larry Woodman
2008-01-04 17:06             ` Lee Schermerhorn
2008-01-07 19:07               ` Christoph Lameter
2008-01-07 19:32                 ` Rik van Riel
2008-01-07 10:06     ` KAMEZAWA Hiroyuki
2008-01-07 15:18       ` Rik van Riel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).