LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH 00/29] swap over networked storage -v11
@ 2007-02-21 14:43 Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 01/29] mm: page allocation rank Peter Zijlstra
                   ` (28 more replies)
  0 siblings, 29 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

(patches against 2.6.20-mm1)

There is a fundamental deadlock associated with paging; when writing out a page
to free memory requires free memory to complete. The usually solution is to
keep a small amount of memory available at all times so we can overcome this
problem. This however assumes the amount of memory needed for writeout is
(constant and) smaller than the provided reserve.

It is this latter assumption that breaks when doing writeout over network.
Network can take up an unspecified amount of memory while waiting for a reply
to our write request. This re-introduces the deadlock; we might never complete
the writeout, for we might not have enough memory to receive the completion
message.

The proposed solution is simple, only allow traffic servicing the VM to make
use of the reserves. Since the VM is always present to service, this limited
amount of memory can sustain a full connection; after a packet has been
processed its memory can be re-used for the next packet.

This however implies you know what packets are for whom, which generally
speaking you don't. Hence we need to receive all packets but discard them as
soon as we encounter a non VM bound packet allocated from the reserves.

Also knowing it is headed towards the VM needs a little help, hence we
introduce the socket flag SOCK_VMIO to mark sockets with.

Of course, since we are paging all this has to happen in kernel-space, since
user-space might just not be there.

Since packet processing might also require memory, this all also implies that
those auxiliary allocations may use the reserves when an emergency packet is
processed. This is accomplished by using PF_MEMALLOC.

How much memory is to be reserved is also an issue, enough memory to saturate
both the route cache and IP fragment reassembly, along with various constants.

This patch-set comes in 5 parts:

1) introduce the memory reserve and make the SLAB allocator play nice with it.
   patches 01-09

2) add some needed infrastructure to the network code
   patches 10-12

3) implement the idea outlined above
   patches 13-19

4) teach the swap machinery to use generic address_spaces
   patches 20-23

5) implement swap over NFS using all the new stuff
   patches 24-29
-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 01/29] mm: page allocation rank
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 02/29] mm: slab allocation fairness Peter Zijlstra
                   ` (27 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: mm-page_alloc-rank.patch --]
[-- Type: text/plain, Size: 8187 bytes --]

Introduce page allocation rank.

This allocation rank is an measure of the 'hardness' of the page allocation.
Where hardness refers to how deep we have to reach (and thereby if reclaim 
was activated) to obtain the page.

It basically is a mapping from the ALLOC_/gfp flags into a scalar quantity,
which allows for comparisons of the kind: 
  'would this allocation have succeeded using these gfp flags'.

For the gfp -> alloc_flags mapping we use the 'hardest' possible, those
used by __alloc_pages() right before going into direct reclaim.

The alloc_flags -> rank mapping is given by: 2*2^wmark - harder - 2*high
where wmark = { min = 1, low, high } and harder, high are booleans.
This gives:
  0 is the hardest possible allocation - ALLOC_NO_WATERMARK,
  1 is ALLOC_WMARK_MIN|ALLOC_HARDER|ALLOC_HIGH,
  ...
  15 is ALLOC_WMARK_HIGH|ALLOC_HARDER,
  16 is the softest allocation - ALLOC_WMARK_HIGH.

Rank <= 4 will have woke up kswapd and when also > 0 might have ran into
direct reclaim.

Rank > 8 rarely happens and means lots of memory free (due to parallel oom kill).

The allocation rank is stored in page->index for successful allocations.

'offline' testing of the rank is made impossible by direct reclaim and
fragmentation issues. That is, it is impossible to tell if a given allocation
will succeed without actually doing it.

The purpose of this measure is to introduce some fairness into the slab
allocator.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/internal.h   |   89 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c |   58 ++++++++++--------------------------
 2 files changed, 106 insertions(+), 41 deletions(-)

Index: linux-2.6-git/mm/internal.h
===================================================================
--- linux-2.6-git.orig/mm/internal.h	2007-01-08 11:53:13.000000000 +0100
+++ linux-2.6-git/mm/internal.h	2007-01-09 11:29:18.000000000 +0100
@@ -12,6 +12,7 @@
 #define __MM_INTERNAL_H
 
 #include <linux/mm.h>
+#include <linux/hardirq.h>
 
 static inline void set_page_count(struct page *page, int v)
 {
@@ -37,4 +38,92 @@ static inline void __put_page(struct pag
 extern void fastcall __init __free_pages_bootmem(struct page *page,
 						unsigned int order);
 
+#define ALLOC_HARDER		0x01 /* try to alloc harder */
+#define ALLOC_HIGH		0x02 /* __GFP_HIGH set */
+#define ALLOC_WMARK_MIN		0x04 /* use pages_min watermark */
+#define ALLOC_WMARK_LOW		0x08 /* use pages_low watermark */
+#define ALLOC_WMARK_HIGH	0x10 /* use pages_high watermark */
+#define ALLOC_NO_WATERMARKS	0x20 /* don't check watermarks at all */
+#define ALLOC_CPUSET		0x40 /* check for correct cpuset */
+
+/*
+ * get the deepest reaching allocation flags for the given gfp_mask
+ */
+static int inline gfp_to_alloc_flags(gfp_t gfp_mask)
+{
+	struct task_struct *p = current;
+	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
+	const gfp_t wait = gfp_mask & __GFP_WAIT;
+
+	/*
+	 * The caller may dip into page reserves a bit more if the caller
+	 * cannot run direct reclaim, or if the caller has realtime scheduling
+	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
+	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+	 */
+	if (gfp_mask & __GFP_HIGH)
+		alloc_flags |= ALLOC_HIGH;
+
+	if (!wait) {
+		alloc_flags |= ALLOC_HARDER;
+		/*
+		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
+		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+		 */
+		alloc_flags &= ~ALLOC_CPUSET;
+	} else if (unlikely(rt_task(p)) && !in_interrupt())
+		alloc_flags |= ALLOC_HARDER;
+
+	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+		if (!in_interrupt() &&
+		    ((p->flags & PF_MEMALLOC) ||
+		     unlikely(test_thread_flag(TIF_MEMDIE))))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+	}
+
+	return alloc_flags;
+}
+
+#define MAX_ALLOC_RANK	16
+
+/*
+ * classify the allocation: 0 is hardest, 16 is easiest.
+ */
+static inline int alloc_flags_to_rank(int alloc_flags)
+{
+	int rank;
+
+	if (alloc_flags & ALLOC_NO_WATERMARKS)
+		return 0;
+
+	rank = alloc_flags & (ALLOC_WMARK_MIN|ALLOC_WMARK_LOW|ALLOC_WMARK_HIGH);
+	rank -= alloc_flags & (ALLOC_HARDER|ALLOC_HIGH);
+
+	return rank;
+}
+
+static inline int gfp_to_rank(gfp_t gfp_mask)
+{
+	/*
+	 * Although correct this full version takes a ~3% performance
+	 * hit on the network tests in aim9.
+	 *
+
+	return alloc_flags_to_rank(gfp_to_alloc_flags(gfp_mask));
+
+	 *
+	 * Just check the bare essential ALLOC_NO_WATERMARKS case this keeps
+	 * the aim9 results within the error margin.
+	 */
+
+	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+		if (!in_interrupt() &&
+		    ((current->flags & PF_MEMALLOC) ||
+		     unlikely(test_thread_flag(TIF_MEMDIE))))
+			return 0;
+	}
+
+	return 1;
+}
+
 #endif
Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c	2007-01-08 11:53:13.000000000 +0100
+++ linux-2.6-git/mm/page_alloc.c	2007-01-09 11:29:18.000000000 +0100
@@ -888,14 +888,6 @@ failed:
 	return NULL;
 }
 
-#define ALLOC_NO_WATERMARKS	0x01 /* don't check watermarks at all */
-#define ALLOC_WMARK_MIN		0x02 /* use pages_min watermark */
-#define ALLOC_WMARK_LOW		0x04 /* use pages_low watermark */
-#define ALLOC_WMARK_HIGH	0x08 /* use pages_high watermark */
-#define ALLOC_HARDER		0x10 /* try to alloc harder */
-#define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
-#define ALLOC_CPUSET		0x40 /* check for correct cpuset */
-
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 
 static struct fail_page_alloc_attr {
@@ -1186,6 +1178,7 @@ zonelist_scan:
 
 		page = buffered_rmqueue(zonelist, zone, order, gfp_mask);
 		if (page)
+			page->index = alloc_flags_to_rank(alloc_flags);
 			break;
 this_zone_full:
 		if (NUMA_BUILD)
@@ -1259,48 +1252,27 @@ restart:
 	 * OK, we're below the kswapd watermark and have kicked background
 	 * reclaim. Now things get more complex, so set up alloc_flags according
 	 * to how we want to proceed.
-	 *
-	 * The caller may dip into page reserves a bit more if the caller
-	 * cannot run direct reclaim, or if the caller has realtime scheduling
-	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
-	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
 	 */
-	alloc_flags = ALLOC_WMARK_MIN;
-	if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
-		alloc_flags |= ALLOC_HARDER;
-	if (gfp_mask & __GFP_HIGH)
-		alloc_flags |= ALLOC_HIGH;
-	if (wait)
-		alloc_flags |= ALLOC_CPUSET;
+	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
-	/*
-	 * Go through the zonelist again. Let __GFP_HIGH and allocations
-	 * coming from realtime tasks go deeper into reserves.
-	 *
-	 * This is the last chance, in general, before the goto nopage.
-	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
-	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
-	 */
-	page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
+	/* This is the last chance, in general, before the goto nopage. */
+	page = get_page_from_freelist(gfp_mask, order, zonelist,
+			alloc_flags & ~ALLOC_NO_WATERMARKS);
 	if (page)
 		goto got_pg;
 
 	/* This allocation should allow future memory freeing. */
-
 rebalance:
-	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-			&& !in_interrupt()) {
-		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
+	if (alloc_flags & ALLOC_NO_WATERMARKS) {
 nofail_alloc:
-			/* go through the zonelist yet again, ignoring mins */
-			page = get_page_from_freelist(gfp_mask, order,
+		/* go through the zonelist yet again, ignoring mins */
+		page = get_page_from_freelist(gfp_mask, order,
 				zonelist, ALLOC_NO_WATERMARKS);
-			if (page)
-				goto got_pg;
-			if (gfp_mask & __GFP_NOFAIL) {
-				congestion_wait(WRITE, HZ/50);
-				goto nofail_alloc;
-			}
+		if (page)
+			goto got_pg;
+		if (wait && (gfp_mask & __GFP_NOFAIL)) {
+			congestion_wait(WRITE, HZ/50);
+			goto nofail_alloc;
 		}
 		goto nopage;
 	}
@@ -1309,6 +1281,10 @@ nofail_alloc:
 	if (!wait)
 		goto nopage;
 
+	/* Avoid recursion of direct reclaim */
+	if (p->flags & PF_MEMALLOC)
+		goto nopage;
+
 	cond_resched();
 
 	/* We now go into synchronous reclaim */

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 02/29] mm: slab allocation fairness
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 01/29] mm: page allocation rank Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 15:33   ` Pekka Enberg
  2007-02-21 14:43 ` [PATCH 03/29] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
                   ` (26 subsequent siblings)
  28 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: mm-slab-ranking.patch --]
[-- Type: text/plain, Size: 11585 bytes --]

The slab allocator has some unfairness wrt gfp flags; when the slab cache is
grown the gfp flags are used to allocate more memory, however when there is 
slab cache available (in partial or free slabs, per cpu caches or otherwise)
gfp flags are ignored.

Thus it is possible for less critical slab allocations to succeed and gobble
up precious memory when under memory pressure.

This patch solves that by using the newly introduced page allocation rank.

Page allocation rank is a scalar quantity connecting ALLOC_ and gfp flags which
represents how deep we had to reach into our reserves when allocating a page. 
Rank 0 is the deepest we can reach (ALLOC_NO_WATERMARK) and 16 is the most 
shallow allocation possible (ALLOC_WMARK_HIGH).

When the slab space is grown the rank of the page allocation is stored. For
each slab allocation we test the given gfp flags against this rank. Thereby
asking the question: would these flags have allowed the slab to grow.

If not so, we need to test the current situation. This is done by forcing the
growth of the slab space. (Just testing the free page limits will not work due
to direct reclaim) Failing this we need to fail the slab allocation.

Thus if we grew the slab under great duress while PF_MEMALLOC was set and we 
really did access the memalloc reserve the rank would be set to 0. If the next
allocation to that slab would be GFP_NOFS|__GFP_NOMEMALLOC (which ordinarily
maps to rank 4 and always > 0) we'd want to make sure that memory pressure has
decreased enough to allow an allocation with the given gfp flags.

So in this case we try to force grow the slab cache and on failure we fail the
slab allocation. Thus preserving the available slab cache for more pressing
allocations.

If this newly allocated slab will be trimmed on the next kmem_cache_free
(not unlikely) this is no problem, since 1) it will free memory and 2) the
sole purpose of the allocation was to probe the allocation rank, we didn't
need the space itself.

[AIM9 results go here]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/Kconfig |    3 ++
 mm/slab.c  |   81 ++++++++++++++++++++++++++++++++++++++++---------------------
 2 files changed, 57 insertions(+), 27 deletions(-)

Index: linux-2.6/mm/slab.c
===================================================================
--- linux-2.6.orig/mm/slab.c
+++ linux-2.6/mm/slab.c
@@ -114,6 +114,7 @@
 #include	<asm/cacheflush.h>
 #include	<asm/tlbflush.h>
 #include	<asm/page.h>
+#include	"internal.h"
 
 /*
  * DEBUG	- 1 for kmem_cache_create() to honour; SLAB_DEBUG_INITIAL,
@@ -380,6 +381,7 @@ static void kmem_list3_init(struct kmem_
 
 struct kmem_cache {
 /* 1) per-cpu data, touched during every alloc/free */
+	int rank;
 	struct array_cache *array[NR_CPUS];
 /* 2) Cache tunables. Protected by cache_chain_mutex */
 	unsigned int batchcount;
@@ -1023,21 +1025,21 @@ static inline int cache_free_alien(struc
 }
 
 static inline void *alternate_node_alloc(struct kmem_cache *cachep,
-		gfp_t flags)
+		gfp_t flags, int rank)
 {
 	return NULL;
 }
 
 static inline void *____cache_alloc_node(struct kmem_cache *cachep,
-		 gfp_t flags, int nodeid)
+		 gfp_t flags, int nodeid, int rank)
 {
 	return NULL;
 }
 
 #else	/* CONFIG_NUMA */
 
-static void *____cache_alloc_node(struct kmem_cache *, gfp_t, int);
-static void *alternate_node_alloc(struct kmem_cache *, gfp_t);
+static void *____cache_alloc_node(struct kmem_cache *, gfp_t, int, int);
+static void *alternate_node_alloc(struct kmem_cache *, gfp_t, int);
 
 static struct array_cache **alloc_alien_cache(int node, int limit)
 {
@@ -1639,6 +1641,7 @@ static void *kmem_getpages(struct kmem_c
 	if (!page)
 		return NULL;
 
+	cachep->rank = page->index;
 	nr_pages = (1 << cachep->gfporder);
 	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
 		add_zone_page_state(page_zone(page),
@@ -2287,6 +2290,7 @@ kmem_cache_create (const char *name, siz
 	}
 #endif
 #endif
+	cachep->rank = MAX_ALLOC_RANK;
 
 	/*
 	 * Determine if the slab management is 'on' or 'off' slab.
@@ -2953,7 +2957,7 @@ bad:
 #define check_slabp(x,y) do { } while(0)
 #endif
 
-static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
+static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags, int rank)
 {
 	int batchcount;
 	struct kmem_list3 *l3;
@@ -2965,6 +2969,8 @@ static void *cache_alloc_refill(struct k
 	check_irq_off();
 	ac = cpu_cache_get(cachep);
 retry:
+	if (unlikely(rank > cachep->rank))
+		goto force_grow;
 	batchcount = ac->batchcount;
 	if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
 		/*
@@ -3020,14 +3026,16 @@ must_grow:
 	l3->free_objects -= ac->avail;
 alloc_done:
 	spin_unlock(&l3->list_lock);
-
 	if (unlikely(!ac->avail)) {
 		int x;
+force_grow:
 		x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL);
 
 		/* cache_grow can reenable interrupts, then ac could change. */
 		ac = cpu_cache_get(cachep);
-		if (!x && ac->avail == 0)	/* no objects in sight? abort */
+
+		/* no objects in sight? abort */
+		if (!x && (ac->avail == 0 || rank > cachep->rank))
 			return NULL;
 
 		if (!ac->avail)		/* objects refilled by interrupt? */
@@ -3184,7 +3192,8 @@ static inline int should_failslab(struct
 
 #endif /* CONFIG_FAILSLAB */
 
-static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
+static inline void *____cache_alloc(struct kmem_cache *cachep,
+		gfp_t flags, int rank)
 {
 	void *objp;
 	struct array_cache *ac;
@@ -3195,17 +3204,29 @@ static inline void *____cache_alloc(stru
 		return NULL;
 
 	ac = cpu_cache_get(cachep);
-	if (likely(ac->avail)) {
+	if (likely(ac->avail && rank <= cachep->rank)) {
 		STATS_INC_ALLOCHIT(cachep);
 		ac->touched = 1;
 		objp = ac->entry[--ac->avail];
 	} else {
 		STATS_INC_ALLOCMISS(cachep);
-		objp = cache_alloc_refill(cachep, flags);
+		objp = cache_alloc_refill(cachep, flags, rank);
 	}
 	return objp;
 }
 
+#ifdef CONFIG_SLAB_FAIR
+static inline int slab_alloc_rank(gfp_t flags)
+{
+	return gfp_to_rank(flags);
+}
+#else
+static inline int slab_alloc_rank(gfp_t flags)
+{
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_NUMA
 /*
  * Try allocating on another node if PF_SPREAD_SLAB|PF_MEMPOLICY.
@@ -3213,7 +3234,8 @@ static inline void *____cache_alloc(stru
  * If we are in_interrupt, then process context, including cpusets and
  * mempolicy, may not apply and should not be used for allocation policy.
  */
-static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags)
+static void *alternate_node_alloc(struct kmem_cache *cachep,
+		gfp_t flags, int rank)
 {
 	int nid_alloc, nid_here;
 
@@ -3225,7 +3247,7 @@ static void *alternate_node_alloc(struct
 	else if (current->mempolicy)
 		nid_alloc = slab_node(current->mempolicy);
 	if (nid_alloc != nid_here)
-		return ____cache_alloc_node(cachep, flags, nid_alloc);
+		return ____cache_alloc_node(cachep, flags, nid_alloc, rank);
 	return NULL;
 }
 
@@ -3237,7 +3259,7 @@ static void *alternate_node_alloc(struct
  * allocator to do its reclaim / fallback magic. We then insert the
  * slab into the proper nodelist and then allocate from it.
  */
-static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
+static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags, int rank)
 {
 	struct zonelist *zonelist;
 	gfp_t local_flags;
@@ -3264,7 +3286,7 @@ retry:
 			cache->nodelists[nid] &&
 			cache->nodelists[nid]->free_objects)
 				obj = ____cache_alloc_node(cache,
-					flags | GFP_THISNODE, nid);
+					flags | GFP_THISNODE, nid, rank);
 	}
 
 	if (!obj && !(flags & __GFP_NO_GROW)) {
@@ -3287,7 +3309,7 @@ retry:
 			nid = page_to_nid(virt_to_page(obj));
 			if (cache_grow(cache, flags, nid, obj)) {
 				obj = ____cache_alloc_node(cache,
-					flags | GFP_THISNODE, nid);
+					flags | GFP_THISNODE, nid, rank);
 				if (!obj)
 					/*
 					 * Another processor may allocate the
@@ -3308,7 +3330,7 @@ retry:
  * A interface to enable slab creation on nodeid
  */
 static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
-				int nodeid)
+				int nodeid, int rank)
 {
 	struct list_head *entry;
 	struct slab *slabp;
@@ -3321,6 +3343,8 @@ static void *____cache_alloc_node(struct
 
 retry:
 	check_irq_off();
+	if (unlikely(rank > cachep->rank))
+		goto force_grow;
 	spin_lock(&l3->list_lock);
 	entry = l3->slabs_partial.next;
 	if (entry == &l3->slabs_partial) {
@@ -3356,11 +3380,12 @@ retry:
 
 must_grow:
 	spin_unlock(&l3->list_lock);
+force_grow:
 	x = cache_grow(cachep, flags | GFP_THISNODE, nodeid, NULL);
 	if (x)
 		goto retry;
 
-	return fallback_alloc(cachep, flags);
+	return fallback_alloc(cachep, flags, rank);
 
 done:
 	return obj;
@@ -3384,6 +3409,7 @@ __cache_alloc_node(struct kmem_cache *ca
 {
 	unsigned long save_flags;
 	void *ptr;
+	int rank = slab_alloc_rank(flags);
 
 	cache_alloc_debugcheck_before(cachep, flags);
 	local_irq_save(save_flags);
@@ -3393,7 +3419,7 @@ __cache_alloc_node(struct kmem_cache *ca
 
 	if (unlikely(!cachep->nodelists[nodeid])) {
 		/* Node not bootstrapped yet */
-		ptr = fallback_alloc(cachep, flags);
+		ptr = fallback_alloc(cachep, flags, rank);
 		goto out;
 	}
 
@@ -3404,12 +3430,12 @@ __cache_alloc_node(struct kmem_cache *ca
 		 * to other nodes. It may fail while we still have
 		 * objects on other nodes available.
 		 */
-		ptr = ____cache_alloc(cachep, flags);
+		ptr = ____cache_alloc(cachep, flags, rank);
 		if (ptr)
 			goto out;
 	}
 	/* ___cache_alloc_node can fall back to other nodes */
-	ptr = ____cache_alloc_node(cachep, flags, nodeid);
+	ptr = ____cache_alloc_node(cachep, flags, nodeid, rank);
   out:
 	local_irq_restore(save_flags);
 	ptr = cache_alloc_debugcheck_after(cachep, flags, ptr, caller);
@@ -3418,23 +3444,23 @@ __cache_alloc_node(struct kmem_cache *ca
 }
 
 static __always_inline void *
-__do_cache_alloc(struct kmem_cache *cache, gfp_t flags)
+__do_cache_alloc(struct kmem_cache *cache, gfp_t flags, int rank)
 {
 	void *objp;
 
 	if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY))) {
-		objp = alternate_node_alloc(cache, flags);
+		objp = alternate_node_alloc(cache, flags, rank);
 		if (objp)
 			goto out;
 	}
-	objp = ____cache_alloc(cache, flags);
+	objp = ____cache_alloc(cache, flags, rank);
 
 	/*
 	 * We may just have run out of memory on the local node.
 	 * ____cache_alloc_node() knows how to locate memory on other nodes
 	 */
  	if (!objp)
- 		objp = ____cache_alloc_node(cache, flags, numa_node_id());
+ 		objp = ____cache_alloc_node(cache, flags, numa_node_id(), rank);
 
   out:
 	return objp;
@@ -3442,9 +3468,9 @@ __do_cache_alloc(struct kmem_cache *cach
 #else
 
 static __always_inline void *
-__do_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
+__do_cache_alloc(struct kmem_cache *cachep, gfp_t flags, int rank)
 {
-	return ____cache_alloc(cachep, flags);
+	return ____cache_alloc(cachep, flags, rank);
 }
 
 #endif /* CONFIG_NUMA */
@@ -3454,10 +3480,11 @@ __cache_alloc(struct kmem_cache *cachep,
 {
 	unsigned long save_flags;
 	void *objp;
+	int rank = slab_alloc_rank(flags);
 
 	cache_alloc_debugcheck_before(cachep, flags);
 	local_irq_save(save_flags);
-	objp = __do_cache_alloc(cachep, flags);
+	objp = __do_cache_alloc(cachep, flags, rank);
 	local_irq_restore(save_flags);
 	objp = cache_alloc_debugcheck_after(cachep, flags, objp, caller);
 	prefetchw(objp);
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig
+++ linux-2.6/mm/Kconfig
@@ -163,6 +163,8 @@ config ZONE_DMA_FLAG
 	default "0" if !ZONE_DMA
 	default "1"
 
+config SLAB_FAIR
+	def_bool n
 #
 # Adaptive file readahead
 #

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 03/29] mm: allow PF_MEMALLOC from softirq context
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 01/29] mm: page allocation rank Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 02/29] mm: slab allocation fairness Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 15:53   ` Arjan van de Ven
  2007-02-21 14:43 ` [PATCH 04/29] mm: serialize access to min_free_kbytes Peter Zijlstra
                   ` (25 subsequent siblings)
  28 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: mm-PF_MEMALLOC-softirq.patch --]
[-- Type: text/plain, Size: 2119 bytes --]

Allow PF_MEMALLOC to be set in softirq context. When running softirqs from
a borrowed context save current->flags, ksoftirqd will have its own 
task_struct.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/softirq.c |    3 +++
 mm/internal.h    |   14 ++++++++------
 2 files changed, 11 insertions(+), 6 deletions(-)

Index: linux-2.6-git/mm/internal.h
===================================================================
--- linux-2.6-git.orig/mm/internal.h	2006-12-14 10:02:52.000000000 +0100
+++ linux-2.6-git/mm/internal.h	2006-12-14 10:10:09.000000000 +0100
@@ -75,9 +75,10 @@ static int inline gfp_to_alloc_flags(gfp
 		alloc_flags |= ALLOC_HARDER;
 
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_interrupt() &&
-		    ((p->flags & PF_MEMALLOC) ||
-		     unlikely(test_thread_flag(TIF_MEMDIE))))
+		if (!in_irq() && (p->flags & PF_MEMALLOC))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (!in_interrupt() &&
+				unlikely(test_thread_flag(TIF_MEMDIE)))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 
@@ -117,9 +118,10 @@ static inline int gfp_to_rank(gfp_t gfp_
 	 */
 
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_interrupt() &&
-		    ((current->flags & PF_MEMALLOC) ||
-		     unlikely(test_thread_flag(TIF_MEMDIE))))
+		if (!in_irq() && (current->flags & PF_MEMALLOC))
+			return 0;
+		else if (!in_interrupt() &&
+				unlikely(test_thread_flag(TIF_MEMDIE)))
 			return 0;
 	}
 
Index: linux-2.6-git/kernel/softirq.c
===================================================================
--- linux-2.6-git.orig/kernel/softirq.c	2006-12-14 10:02:18.000000000 +0100
+++ linux-2.6-git/kernel/softirq.c	2006-12-14 10:02:52.000000000 +0100
@@ -209,6 +209,8 @@ asmlinkage void __do_softirq(void)
 	__u32 pending;
 	int max_restart = MAX_SOFTIRQ_RESTART;
 	int cpu;
+	unsigned long pflags = current->flags;
+	current->flags &= ~PF_MEMALLOC;
 
 	pending = local_softirq_pending();
 	account_system_vtime(current);
@@ -247,6 +249,7 @@ restart:
 
 	account_system_vtime(current);
 	_local_bh_enable();
+	current->flags = pflags;
 }
 
 #ifndef __ARCH_HAS_DO_SOFTIRQ

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 04/29] mm: serialize access to min_free_kbytes
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
                   ` (2 preceding siblings ...)
  2007-02-21 14:43 ` [PATCH 03/29] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 05/29] mm: emergency pool Peter Zijlstra
                   ` (24 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: mm-setup_per_zone_pages_min.patch --]
[-- Type: text/plain, Size: 1913 bytes --]

There is a small race between the procfs caller and the memory hotplug caller
of setup_per_zone_pages_min(). Not a big deal, but the next patch will add yet
another caller. Time to close the gap.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/page_alloc.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c	2007-01-15 09:58:49.000000000 +0100
+++ linux-2.6-git/mm/page_alloc.c	2007-01-15 09:58:51.000000000 +0100
@@ -95,6 +95,7 @@ static char * const zone_names[MAX_NR_ZO
 #endif
 };
 
+static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
 
 unsigned long __meminitdata nr_kernel_pages;
@@ -3074,12 +3075,12 @@ static void setup_per_zone_lowmem_reserv
 }
 
 /**
- * setup_per_zone_pages_min - called when min_free_kbytes changes.
+ * __setup_per_zone_pages_min - called when min_free_kbytes changes.
  *
  * Ensures that the pages_{min,low,high} values for each zone are set correctly
  * with respect to min_free_kbytes.
  */
-void setup_per_zone_pages_min(void)
+static void __setup_per_zone_pages_min(void)
 {
 	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
@@ -3133,6 +3134,15 @@ void setup_per_zone_pages_min(void)
 	calculate_totalreserve_pages();
 }
 
+void setup_per_zone_pages_min(void)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&min_free_lock, flags);
+	__setup_per_zone_pages_min();
+	spin_unlock_irqrestore(&min_free_lock, flags);
+}
+
 /*
  * Initialise min_free_kbytes.
  *
@@ -3168,7 +3178,7 @@ static int __init init_per_zone_pages_mi
 		min_free_kbytes = 128;
 	if (min_free_kbytes > 65536)
 		min_free_kbytes = 65536;
-	setup_per_zone_pages_min();
+	__setup_per_zone_pages_min();
 	setup_per_zone_lowmem_reserve();
 	return 0;
 }

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 05/29] mm: emergency pool
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
                   ` (3 preceding siblings ...)
  2007-02-21 14:43 ` [PATCH 04/29] mm: serialize access to min_free_kbytes Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 06/29] mm: __GFP_EMERGENCY Peter Zijlstra
                   ` (23 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: mm-page_alloc-emerg.patch --]
[-- Type: text/plain, Size: 6284 bytes --]

Provide means to reserve a specific amount pages.

The emergency pool is separated from the min watermark because ALLOC_HARDER
and ALLOC_HIGH modify the watermark in a relative way and thus do not ensure
a strict minimum.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mmzone.h |    3 +-
 mm/page_alloc.c        |   52 ++++++++++++++++++++++++++++++++++++++++---------
 mm/vmstat.c            |    6 ++---
 3 files changed, 48 insertions(+), 13 deletions(-)

Index: linux-2.6-git/include/linux/mmzone.h
===================================================================
--- linux-2.6-git.orig/include/linux/mmzone.h	2007-02-12 09:40:51.000000000 +0100
+++ linux-2.6-git/include/linux/mmzone.h	2007-02-12 11:13:58.000000000 +0100
@@ -178,7 +178,7 @@ enum zone_type {
 
 struct zone {
 	/* Fields commonly accessed by the page allocator */
-	unsigned long		pages_min, pages_low, pages_high;
+	unsigned long		pages_emerg, pages_min, pages_low, pages_high;
 	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
@@ -562,6 +562,7 @@ int sysctl_min_unmapped_ratio_sysctl_han
 			struct file *, void __user *, size_t *, loff_t *);
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 			struct file *, void __user *, size_t *, loff_t *);
+void adjust_memalloc_reserve(int pages);
 
 #include <linux/topology.h>
 /* Returns the number of the current Node. */
Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c	2007-02-12 11:13:35.000000000 +0100
+++ linux-2.6-git/mm/page_alloc.c	2007-02-12 11:14:16.000000000 +0100
@@ -101,6 +101,7 @@ static char * const zone_names[MAX_NR_ZO
 
 static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
+int var_free_kbytes;
 
 unsigned long __meminitdata nr_kernel_pages;
 unsigned long __meminitdata nr_all_pages;
@@ -995,7 +996,8 @@ int zone_watermark_ok(struct zone *z, in
 	if (alloc_flags & ALLOC_HARDER)
 		min -= min / 4;
 
-	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
+	if (free_pages <= min + z->lowmem_reserve[classzone_idx] +
+			z->pages_emerg)
 		return 0;
 	for (o = 0; o < order; o++) {
 		/* At the next order, this order's pages become unavailable */
@@ -1348,8 +1350,8 @@ nofail_alloc:
 nopage:
 	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
 		printk(KERN_WARNING "%s: page allocation failure."
-			" order:%d, mode:0x%x\n",
-			p->comm, order, gfp_mask);
+			" order:%d, mode:0x%x, alloc_flags:0x%x, pflags:0x%lx\n",
+			p->comm, order, gfp_mask, alloc_flags, p->flags);
 		dump_stack();
 		show_mem();
 	}
@@ -1562,9 +1564,9 @@ void show_free_areas(void)
 			"\n",
 			zone->name,
 			K(zone_page_state(zone, NR_FREE_PAGES)),
-			K(zone->pages_min),
-			K(zone->pages_low),
-			K(zone->pages_high),
+			K(zone->pages_emerg + zone->pages_min),
+			K(zone->pages_emerg + zone->pages_low),
+			K(zone->pages_emerg + zone->pages_high),
 			K(zone_page_state(zone, NR_ACTIVE)),
 			K(zone_page_state(zone, NR_INACTIVE)),
 			K(zone->present_pages),
@@ -3000,7 +3002,7 @@ static void calculate_totalreserve_pages
 			}
 
 			/* we treat pages_high as reserved pages. */
-			max += zone->pages_high;
+			max += zone->pages_high + zone->pages_emerg;
 
 			if (max > zone->present_pages)
 				max = zone->present_pages;
@@ -3057,7 +3059,8 @@ static void setup_per_zone_lowmem_reserv
  */
 static void __setup_per_zone_pages_min(void)
 {
-	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+	unsigned pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+	unsigned pages_emerg = var_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
 	struct zone *zone;
 	unsigned long flags;
@@ -3069,11 +3072,13 @@ static void __setup_per_zone_pages_min(v
 	}
 
 	for_each_zone(zone) {
-		u64 tmp;
+		u64 tmp, tmp_emerg;
 
 		spin_lock_irqsave(&zone->lru_lock, flags);
 		tmp = (u64)pages_min * zone->present_pages;
 		do_div(tmp, lowmem_pages);
+		tmp_emerg = (u64)pages_emerg * zone->present_pages;
+		do_div(tmp_emerg, lowmem_pages);
 		if (is_highmem(zone)) {
 			/*
 			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
@@ -3092,12 +3097,14 @@ static void __setup_per_zone_pages_min(v
 			if (min_pages > 128)
 				min_pages = 128;
 			zone->pages_min = min_pages;
+			zone->pages_emerg = min_pages;
 		} else {
 			/*
 			 * If it's a lowmem zone, reserve a number of pages
 			 * proportionate to the zone's size.
 			 */
 			zone->pages_min = tmp;
+			zone->pages_emerg = tmp_emerg;
 		}
 
 		zone->pages_low   = zone->pages_min + (tmp >> 2);
@@ -3118,6 +3125,33 @@ void setup_per_zone_pages_min(void)
 	spin_unlock_irqrestore(&min_free_lock, flags);
 }
 
+/**
+ *	adjust_memalloc_reserve - adjust the memalloc reserve
+ *	@pages: number of pages to add
+ *
+ *	It adds a number of pages to the memalloc reserve; if
+ *	the number was positive it kicks kswapd into action to
+ *	satisfy the higher watermarks.
+ *
+ *	NOTE: there is only a single caller, hence no locking.
+ */
+void adjust_memalloc_reserve(int pages)
+{
+	var_free_kbytes += pages << (PAGE_SHIFT - 10);
+	BUG_ON(var_free_kbytes < 0);
+	setup_per_zone_pages_min();
+	if (pages > 0) {
+		struct zone *zone;
+		for_each_zone(zone)
+			wakeup_kswapd(zone, 0);
+	}
+	if (pages)
+		printk(KERN_DEBUG "Emergency reserve: %d\n",
+				var_free_kbytes);
+}
+
+EXPORT_SYMBOL_GPL(adjust_memalloc_reserve);
+
 /*
  * Initialise min_free_kbytes.
  *
Index: linux-2.6-git/mm/vmstat.c
===================================================================
--- linux-2.6-git.orig/mm/vmstat.c	2007-02-12 09:40:51.000000000 +0100
+++ linux-2.6-git/mm/vmstat.c	2007-02-12 11:14:28.000000000 +0100
@@ -513,9 +513,9 @@ static int zoneinfo_show(struct seq_file
 			   "\n        spanned  %lu"
 			   "\n        present  %lu",
 			   zone_page_state(zone, NR_FREE_PAGES),
-			   zone->pages_min,
-			   zone->pages_low,
-			   zone->pages_high,
+			   zone->pages_emerg + zone->pages_min,
+			   zone->pages_emerg + zone->pages_low,
+			   zone->pages_emerg + zone->pages_high,
 			   zone->pages_scanned,
 			   zone->nr_scan_active, zone->nr_scan_inactive,
 			   zone->spanned_pages,

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 06/29] mm: __GFP_EMERGENCY
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
                   ` (4 preceding siblings ...)
  2007-02-21 14:43 ` [PATCH 05/29] mm: emergency pool Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 07/29] mm: allow mempool to fall back to memalloc reserves Peter Zijlstra
                   ` (22 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: mm-page_alloc-GFP_EMERGENCY.patch --]
[-- Type: text/plain, Size: 3698 bytes --]

__GFP_EMERGENCY will allow the allocation to disregard the watermarks, 
much like PF_MEMALLOC.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/gfp.h |    7 ++++++-
 mm/internal.h       |   10 +++++++---
 2 files changed, 13 insertions(+), 4 deletions(-)

Index: linux-2.6-git/include/linux/gfp.h
===================================================================
--- linux-2.6-git.orig/include/linux/gfp.h	2006-12-14 10:02:18.000000000 +0100
+++ linux-2.6-git/include/linux/gfp.h	2006-12-14 10:02:52.000000000 +0100
@@ -35,17 +35,21 @@ struct vm_area_struct;
 #define __GFP_HIGH	((__force gfp_t)0x20u)	/* Should access emergency pools? */
 #define __GFP_IO	((__force gfp_t)0x40u)	/* Can start physical IO? */
 #define __GFP_FS	((__force gfp_t)0x80u)	/* Can call down to low-level FS? */
+
 #define __GFP_COLD	((__force gfp_t)0x100u)	/* Cache-cold page required */
 #define __GFP_NOWARN	((__force gfp_t)0x200u)	/* Suppress page allocation failure warning */
 #define __GFP_REPEAT	((__force gfp_t)0x400u)	/* Retry the allocation.  Might fail */
 #define __GFP_NOFAIL	((__force gfp_t)0x800u)	/* Retry for ever.  Cannot fail */
+
 #define __GFP_NORETRY	((__force gfp_t)0x1000u)/* Do not retry.  Might fail */
 #define __GFP_NO_GROW	((__force gfp_t)0x2000u)/* Slab internal usage */
 #define __GFP_COMP	((__force gfp_t)0x4000u)/* Add compound page metadata */
 #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
+
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
 #define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
 #define __GFP_THISNODE	((__force gfp_t)0x40000u)/* No fallback, no policies */
+#define __GFP_EMERGENCY  ((__force gfp_t)0x80000u) /* Use emergency reserves */
 
 #define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -54,7 +58,8 @@ struct vm_area_struct;
 #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
 			__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
 			__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
-			__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE)
+			__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE| \
+			__GFP_EMERGENCY)
 
 /* This equals 0, but use constants in case they ever change */
 #define GFP_NOWAIT	(GFP_ATOMIC & ~__GFP_HIGH)
Index: linux-2.6-git/mm/internal.h
===================================================================
--- linux-2.6-git.orig/mm/internal.h	2006-12-14 10:02:52.000000000 +0100
+++ linux-2.6-git/mm/internal.h	2006-12-14 10:02:52.000000000 +0100
@@ -75,7 +75,9 @@ static int inline gfp_to_alloc_flags(gfp
 		alloc_flags |= ALLOC_HARDER;
 
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_irq() && (p->flags & PF_MEMALLOC))
+		if (gfp_mask & __GFP_EMERGENCY)
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (!in_irq() && (p->flags & PF_MEMALLOC))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 		else if (!in_interrupt() &&
 				unlikely(test_thread_flag(TIF_MEMDIE)))
@@ -103,7 +105,7 @@ static inline int alloc_flags_to_rank(in
 	return rank;
 }
 
-static inline int gfp_to_rank(gfp_t gfp_mask)
+static __always_inline int gfp_to_rank(gfp_t gfp_mask)
 {
 	/*
 	 * Although correct this full version takes a ~3% performance
@@ -118,7 +120,9 @@ static inline int gfp_to_rank(gfp_t gfp_
 	 */
 
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_irq() && (current->flags & PF_MEMALLOC))
+		if (gfp_mask & __GFP_EMERGENCY)
+			return 0;
+		else if (!in_irq() && (current->flags & PF_MEMALLOC))
 			return 0;
 		else if (!in_interrupt() &&
 				unlikely(test_thread_flag(TIF_MEMDIE)))

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 07/29] mm: allow mempool to fall back to memalloc reserves
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
                   ` (5 preceding siblings ...)
  2007-02-21 14:43 ` [PATCH 06/29] mm: __GFP_EMERGENCY Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 08/29] mm: kmem_cache_objs_to_pages() Peter Zijlstra
                   ` (21 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: mm-mempool_fixup.patch --]
[-- Type: text/plain, Size: 1167 bytes --]

Allow the mempool to use the memalloc reserves when all else fails and
the allocation context would otherwise allow it.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/mempool.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

Index: linux-2.6-git/mm/mempool.c
===================================================================
--- linux-2.6-git.orig/mm/mempool.c	2007-01-12 08:03:44.000000000 +0100
+++ linux-2.6-git/mm/mempool.c	2007-01-12 10:38:57.000000000 +0100
@@ -14,6 +14,7 @@
 #include <linux/mempool.h>
 #include <linux/blkdev.h>
 #include <linux/writeback.h>
+#include "internal.h"
 
 static void add_element(mempool_t *pool, void *element)
 {
@@ -229,6 +230,15 @@ repeat_alloc:
 	}
 	spin_unlock_irqrestore(&pool->lock, flags);
 
+	/* if we really had right to the emergency reserves try those */
+	if (gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS) {
+		if (gfp_temp & __GFP_NOMEMALLOC) {
+			gfp_temp &= ~(__GFP_NOMEMALLOC|__GFP_NOWARN);
+			goto repeat_alloc;
+		} else
+			gfp_temp |= __GFP_NOMEMALLOC|__GFP_NOWARN;
+	}
+
 	/* We must not sleep in the GFP_ATOMIC case */
 	if (!(gfp_mask & __GFP_WAIT))
 		return NULL;

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 08/29] mm: kmem_cache_objs_to_pages()
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
                   ` (6 preceding siblings ...)
  2007-02-21 14:43 ` [PATCH 07/29] mm: allow mempool to fall back to memalloc reserves Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 15:47   ` Pekka Enberg
  2007-02-21 14:43 ` [PATCH 09/29] selinux: tag avc cache alloc as non-critical Peter Zijlstra
                   ` (20 subsequent siblings)
  28 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: mm-kmem_cache_objs_to_pages.patch --]
[-- Type: text/plain, Size: 1426 bytes --]

Provide a method to calculate the number of pages needed to store a given
number of slab objects (upper bound when considering possible partial and
free slabs).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/slab.h |    1 +
 mm/slab.c            |    6 ++++++
 2 files changed, 7 insertions(+)

Index: linux-2.6-git/include/linux/slab.h
===================================================================
--- linux-2.6-git.orig/include/linux/slab.h	2007-01-09 11:28:32.000000000 +0100
+++ linux-2.6-git/include/linux/slab.h	2007-01-09 11:30:16.000000000 +0100
@@ -43,6 +43,7 @@ typedef struct kmem_cache kmem_cache_t _
  */
 void __init kmem_cache_init(void);
 extern int slab_is_available(void);
+extern unsigned int kmem_cache_objs_to_pages(struct kmem_cache *, int);
 
 struct kmem_cache *kmem_cache_create(const char *, size_t, size_t,
 			unsigned long,
Index: linux-2.6-git/mm/slab.c
===================================================================
--- linux-2.6-git.orig/mm/slab.c	2007-01-09 11:30:00.000000000 +0100
+++ linux-2.6-git/mm/slab.c	2007-01-09 11:30:16.000000000 +0100
@@ -4482,3 +4482,9 @@ unsigned int ksize(const void *objp)
 
 	return obj_size(virt_to_cache(objp));
 }
+
+unsigned int kmem_cache_objs_to_pages(struct kmem_cache *cachep, int nr)
+{
+	return ((nr + cachep->num - 1) / cachep->num) << cachep->gfporder;
+}
+EXPORT_SYMBOL_GPL(kmem_cache_objs_to_pages);

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 09/29] selinux: tag avc cache alloc as non-critical
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
                   ` (7 preceding siblings ...)
  2007-02-21 14:43 ` [PATCH 08/29] mm: kmem_cache_objs_to_pages() Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 15:22   ` James Morris
  2007-02-21 14:43 ` [PATCH 10/29] net: wrap sk->sk_backlog_rcv() Peter Zijlstra
                   ` (19 subsequent siblings)
  28 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: mm-selinux-emergency.patch --]
[-- Type: text/plain, Size: 731 bytes --]

Failing to allocate a cache entry will only harm performance.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 security/selinux/avc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6-git/security/selinux/avc.c
===================================================================
--- linux-2.6-git.orig/security/selinux/avc.c	2007-02-14 08:31:13.000000000 +0100
+++ linux-2.6-git/security/selinux/avc.c	2007-02-14 10:10:47.000000000 +0100
@@ -332,7 +332,7 @@ static struct avc_node *avc_alloc_node(v
 {
 	struct avc_node *node;
 
-	node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC);
+	node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC|__GFP_NOMEMALLOC);
 	if (!node)
 		goto out;
 

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 10/29] net: wrap sk->sk_backlog_rcv()
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
                   ` (8 preceding siblings ...)
  2007-02-21 14:43 ` [PATCH 09/29] selinux: tag avc cache alloc as non-critical Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 11/29] net: packet split receive api Peter Zijlstra
                   ` (18 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: net-backlog.patch --]
[-- Type: text/plain, Size: 2745 bytes --]

Wrap calling sk->sk_backlog_rcv() in a function. This will allow extending the
generic sk_backlog_rcv behaviour.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/net/sock.h   |    5 +++++
 net/core/sock.c      |    4 ++--
 net/ipv4/tcp.c       |    2 +-
 net/ipv4/tcp_timer.c |    2 +-
 4 files changed, 9 insertions(+), 4 deletions(-)

Index: linux-2.6-git/include/net/sock.h
===================================================================
--- linux-2.6-git.orig/include/net/sock.h	2007-02-14 11:29:55.000000000 +0100
+++ linux-2.6-git/include/net/sock.h	2007-02-14 11:42:00.000000000 +0100
@@ -480,6 +480,11 @@ static inline void sk_add_backlog(struct
 	skb->next = NULL;
 }
 
+static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+	return sk->sk_backlog_rcv(sk, skb);
+}
+
 #define sk_wait_event(__sk, __timeo, __condition)		\
 ({	int rc;							\
 	release_sock(__sk);					\
Index: linux-2.6-git/net/core/sock.c
===================================================================
--- linux-2.6-git.orig/net/core/sock.c	2007-02-14 11:29:55.000000000 +0100
+++ linux-2.6-git/net/core/sock.c	2007-02-14 11:42:00.000000000 +0100
@@ -290,7 +290,7 @@ int sk_receive_skb(struct sock *sk, stru
 		 */
 		mutex_acquire(&sk->sk_lock.dep_map, 0, 1, _RET_IP_);
 
-		rc = sk->sk_backlog_rcv(sk, skb);
+		rc = sk_backlog_rcv(sk, skb);
 
 		mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);
 	} else
@@ -1244,7 +1244,7 @@ static void __release_sock(struct sock *
 			struct sk_buff *next = skb->next;
 
 			skb->next = NULL;
-			sk->sk_backlog_rcv(sk, skb);
+			sk_backlog_rcv(sk, skb);
 
 			/*
 			 * We are in process context here with softirqs
Index: linux-2.6-git/net/ipv4/tcp.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/tcp.c	2007-02-14 11:29:35.000000000 +0100
+++ linux-2.6-git/net/ipv4/tcp.c	2007-02-14 11:42:00.000000000 +0100
@@ -1002,7 +1002,7 @@ static void tcp_prequeue_process(struct 
 	 * necessary */
 	local_bh_disable();
 	while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
-		sk->sk_backlog_rcv(sk, skb);
+		sk_backlog_rcv(sk, skb);
 	local_bh_enable();
 
 	/* Clear memory counter. */
Index: linux-2.6-git/net/ipv4/tcp_timer.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/tcp_timer.c	2007-02-14 11:29:36.000000000 +0100
+++ linux-2.6-git/net/ipv4/tcp_timer.c	2007-02-14 11:42:00.000000000 +0100
@@ -198,7 +198,7 @@ static void tcp_delack_timer(unsigned lo
 		NET_INC_STATS_BH(LINUX_MIB_TCPSCHEDULERFAILED);
 
 		while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
-			sk->sk_backlog_rcv(sk, skb);
+			sk_backlog_rcv(sk, skb);
 
 		tp->ucopy.memory = 0;
 	}

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 11/29] net: packet split receive api
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
                   ` (9 preceding siblings ...)
  2007-02-21 14:43 ` [PATCH 10/29] net: wrap sk->sk_backlog_rcv() Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 12/29] net: remove alloc_skb_from_cache Peter Zijlstra
                   ` (17 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: net-ps_rx.patch --]
[-- Type: text/plain, Size: 6019 bytes --]

Add some packet-split receive hooks.

For one this allows to do NUMA node affine page allocs.  Later on these hooks
will be extended to do emergency reserve allocations for fragments.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 drivers/net/e1000/e1000_main.c |    8 ++------
 drivers/net/sky2.c             |   16 ++++++----------
 include/linux/skbuff.h         |   23 +++++++++++++++++++++++
 net/core/skbuff.c              |   20 ++++++++++++++++++++
 4 files changed, 51 insertions(+), 16 deletions(-)

Index: linux-2.6-git/drivers/net/e1000/e1000_main.c
===================================================================
--- linux-2.6-git.orig/drivers/net/e1000/e1000_main.c	2007-02-14 08:31:12.000000000 +0100
+++ linux-2.6-git/drivers/net/e1000/e1000_main.c	2007-02-14 11:42:07.000000000 +0100
@@ -4412,12 +4412,8 @@ e1000_clean_rx_irq_ps(struct e1000_adapt
 			pci_unmap_page(pdev, ps_page_dma->ps_page_dma[j],
 					PAGE_SIZE, PCI_DMA_FROMDEVICE);
 			ps_page_dma->ps_page_dma[j] = 0;
-			skb_fill_page_desc(skb, j, ps_page->ps_page[j], 0,
-			                   length);
+			skb_add_rx_frag(skb, j, ps_page->ps_page[j], 0, length);
 			ps_page->ps_page[j] = NULL;
-			skb->len += length;
-			skb->data_len += length;
-			skb->truesize += length;
 		}
 
 		/* strip the ethernet crc, problem is we're using pages now so
@@ -4623,7 +4619,7 @@ e1000_alloc_rx_buffers_ps(struct e1000_a
 			if (j < adapter->rx_ps_pages) {
 				if (likely(!ps_page->ps_page[j])) {
 					ps_page->ps_page[j] =
-						alloc_page(GFP_ATOMIC);
+						netdev_alloc_page(netdev);
 					if (unlikely(!ps_page->ps_page[j])) {
 						adapter->alloc_rx_buff_failed++;
 						goto no_buffers;
Index: linux-2.6-git/include/linux/skbuff.h
===================================================================
--- linux-2.6-git.orig/include/linux/skbuff.h	2007-02-14 11:29:54.000000000 +0100
+++ linux-2.6-git/include/linux/skbuff.h	2007-02-14 11:59:04.000000000 +0100
@@ -813,6 +813,9 @@ static inline void skb_fill_page_desc(st
 	skb_shinfo(skb)->nr_frags = i + 1;
 }
 
+extern void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page,
+			    int off, int size);
+
 #define SKB_PAGE_ASSERT(skb) 	BUG_ON(skb_shinfo(skb)->nr_frags)
 #define SKB_FRAG_ASSERT(skb) 	BUG_ON(skb_shinfo(skb)->frag_list)
 #define SKB_LINEAR_ASSERT(skb)  BUG_ON(skb_is_nonlinear(skb))
@@ -1148,6 +1151,26 @@ static inline struct sk_buff *netdev_all
 	return __netdev_alloc_skb(dev, length, GFP_ATOMIC);
 }
 
+extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask);
+
+/**
+ *	netdev_alloc_page - allocate a page for ps-rx on a specific device
+ *	@dev: network device to receive on
+ *
+ * 	Allocate a new page node local to the specified device.
+ *
+ * 	%NULL is returned if there is no free memory.
+ */
+static inline struct page *netdev_alloc_page(struct net_device *dev)
+{
+	return __netdev_alloc_page(dev, GFP_ATOMIC);
+}
+
+static inline void netdev_free_page(struct net_device *dev, struct page *page)
+{
+	__free_page(page);
+}
+
 /**
  *	skb_cow - copy header of skb when it is required
  *	@skb: buffer to cow
Index: linux-2.6-git/net/core/skbuff.c
===================================================================
--- linux-2.6-git.orig/net/core/skbuff.c	2007-02-14 11:29:54.000000000 +0100
+++ linux-2.6-git/net/core/skbuff.c	2007-02-14 12:01:40.000000000 +0100
@@ -279,6 +279,24 @@ struct sk_buff *__netdev_alloc_skb(struc
 	return skb;
 }
 
+struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
+{
+	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
+	struct page *page;
+
+	page = alloc_pages_node(node, gfp_mask, 0);
+	return page;
+}
+
+void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
+		int size)
+{
+	skb_fill_page_desc(skb, i, page, off, size);
+	skb->len += size;
+	skb->data_len += size;
+	skb->truesize += size;
+}
+
 static void skb_drop_list(struct sk_buff **listp)
 {
 	struct sk_buff *list = *listp;
@@ -2066,6 +2084,8 @@ EXPORT_SYMBOL(kfree_skb);
 EXPORT_SYMBOL(__pskb_pull_tail);
 EXPORT_SYMBOL(__alloc_skb);
 EXPORT_SYMBOL(__netdev_alloc_skb);
+EXPORT_SYMBOL(__netdev_alloc_page);
+EXPORT_SYMBOL(skb_add_rx_frag);
 EXPORT_SYMBOL(pskb_copy);
 EXPORT_SYMBOL(pskb_expand_head);
 EXPORT_SYMBOL(skb_checksum);
Index: linux-2.6-git/drivers/net/sky2.c
===================================================================
--- linux-2.6-git.orig/drivers/net/sky2.c	2007-02-14 08:31:12.000000000 +0100
+++ linux-2.6-git/drivers/net/sky2.c	2007-02-14 12:00:22.000000000 +0100
@@ -1083,7 +1083,7 @@ static struct sk_buff *sky2_rx_alloc(str
 	skb_reserve(skb, ALIGN(p, RX_SKB_ALIGN) - p);
 
 	for (i = 0; i < sky2->rx_nfrags; i++) {
-		struct page *page = alloc_page(GFP_ATOMIC);
+		struct page *page = netdev_alloc_page(sky2->netdev);
 
 		if (!page)
 			goto free_partial;
@@ -1972,8 +1972,8 @@ static struct sk_buff *receive_copy(stru
 }
 
 /* Adjust length of skb with fragments to match received data */
-static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space,
-			  unsigned int length)
+static void skb_put_frags(struct sky2_port *sky2, struct sk_buff *skb,
+			  unsigned int hdr_space, unsigned int length)
 {
 	int i, num_frags;
 	unsigned int size;
@@ -1990,15 +1990,11 @@ static void skb_put_frags(struct sk_buff
 
 		if (length == 0) {
 			/* don't need this page */
-			__free_page(frag->page);
+			netdev_free_page(sky2->netdev, frag->page);
 			--skb_shinfo(skb)->nr_frags;
 		} else {
 			size = min(length, (unsigned) PAGE_SIZE);
-
-			frag->size = size;
-			skb->data_len += size;
-			skb->truesize += size;
-			skb->len += size;
+			skb_add_rx_frag(skb, i, frag->page, 0, size);
 			length -= size;
 		}
 	}
@@ -2027,7 +2023,7 @@ static struct sk_buff *receive_new(struc
 	sky2_rx_map_skb(sky2->hw->pdev, re, hdr_space);
 
 	if (skb_shinfo(skb)->nr_frags)
-		skb_put_frags(skb, hdr_space, length);
+		skb_put_frags(sky2, skb, hdr_space, length);
 	else
 		skb_put(skb, length);
 	return skb;

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 12/29] net: remove alloc_skb_from_cache
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
                   ` (10 preceding siblings ...)
  2007-02-21 14:43 ` [PATCH 11/29] net: packet split receive api Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 13/29] netvm: link network to vm layer Peter Zijlstra
                   ` (16 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: net-skbuff-cleanup.patch --]
[-- Type: text/plain, Size: 2906 bytes --]

Lets get rid of the unused alloc_skb_from_cache() thing.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/skbuff.h |    3 --
 net/core/dev.c         |    1 
 net/core/skbuff.c      |   71 ++++++-------------------------------------------
 3 files changed, 11 insertions(+), 64 deletions(-)

Index: linux-2.6-git/include/linux/skbuff.h
===================================================================
--- linux-2.6-git.orig/include/linux/skbuff.h	2007-02-14 08:31:13.000000000 +0100
+++ linux-2.6-git/include/linux/skbuff.h	2007-02-14 10:11:36.000000000 +0100
@@ -345,9 +345,6 @@ static inline struct sk_buff *alloc_skb_
 	return __alloc_skb(size, priority, 1, -1);
 }
 
-extern struct sk_buff *alloc_skb_from_cache(struct kmem_cache *cp,
-					    unsigned int size,
-					    gfp_t priority);
 extern void	       kfree_skbmem(struct sk_buff *skb);
 extern struct sk_buff *skb_clone(struct sk_buff *skb,
 				 gfp_t priority);
Index: linux-2.6-git/net/core/skbuff.c
===================================================================
--- linux-2.6-git.orig/net/core/skbuff.c	2007-02-14 08:31:12.000000000 +0100
+++ linux-2.6-git/net/core/skbuff.c	2007-02-14 10:11:16.000000000 +0100
@@ -198,61 +198,6 @@ nodata:
 }
 
 /**
- *	alloc_skb_from_cache	-	allocate a network buffer
- *	@cp: kmem_cache from which to allocate the data area
- *           (object size must be big enough for @size bytes + skb overheads)
- *	@size: size to allocate
- *	@gfp_mask: allocation mask
- *
- *	Allocate a new &sk_buff. The returned buffer has no headroom and
- *	tail room of size bytes. The object has a reference count of one.
- *	The return is the buffer. On a failure the return is %NULL.
- *
- *	Buffers may only be allocated from interrupts using a @gfp_mask of
- *	%GFP_ATOMIC.
- */
-struct sk_buff *alloc_skb_from_cache(struct kmem_cache *cp,
-				     unsigned int size,
-				     gfp_t gfp_mask)
-{
-	struct sk_buff *skb;
-	u8 *data;
-
-	/* Get the HEAD */
-	skb = kmem_cache_alloc(skbuff_head_cache,
-			       gfp_mask & ~__GFP_DMA);
-	if (!skb)
-		goto out;
-
-	/* Get the DATA. */
-	size = SKB_DATA_ALIGN(size);
-	data = kmem_cache_alloc(cp, gfp_mask);
-	if (!data)
-		goto nodata;
-
-	memset(skb, 0, offsetof(struct sk_buff, truesize));
-	skb->truesize = size + sizeof(struct sk_buff);
-	atomic_set(&skb->users, 1);
-	skb->head = data;
-	skb->data = data;
-	skb->tail = data;
-	skb->end  = data + size;
-
-	atomic_set(&(skb_shinfo(skb)->dataref), 1);
-	skb_shinfo(skb)->nr_frags  = 0;
-	skb_shinfo(skb)->gso_size = 0;
-	skb_shinfo(skb)->gso_segs = 0;
-	skb_shinfo(skb)->gso_type = 0;
-	skb_shinfo(skb)->frag_list = NULL;
-out:
-	return skb;
-nodata:
-	kmem_cache_free(skbuff_head_cache, skb);
-	skb = NULL;
-	goto out;
-}
-
-/**
  *	__netdev_alloc_skb - allocate an skbuff for rx on a specific device
  *	@dev: network device to receive on
  *	@length: length to allocate

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 13/29] netvm: link network to vm layer
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
                   ` (11 preceding siblings ...)
  2007-02-21 14:43 ` [PATCH 12/29] net: remove alloc_skb_from_cache Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 14/29] netvm: INET reserves Peter Zijlstra
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: netvm-reserve.patch --]
[-- Type: text/plain, Size: 6836 bytes --]

Hook up networking to the memory reserve.

There are two kinds of reserves: skb and aux. 
 - skb reserves are used for incomming packets,
 - aux reserves are used for processing these packets.

The consumers for these reserves are sockets marked with:
  SOCK_VMIO

Such sockets are to be used to service the VM (iow. to swap over). They
must be handled kernel side, exposing such a socket to user-space is a BUG.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/net/sock.h |   31 ++++++++++++
 net/Kconfig        |    3 +
 net/core/sock.c    |  134 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 168 insertions(+)

Index: linux-2.6-git/include/net/sock.h
===================================================================
--- linux-2.6-git.orig/include/net/sock.h	2007-02-20 15:06:17.000000000 +0100
+++ linux-2.6-git/include/net/sock.h	2007-02-20 15:07:45.000000000 +0100
@@ -392,6 +392,7 @@ enum sock_flags {
 	SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
 	SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
 	SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+	SOCK_VMIO, /* the VM depends on us - make sure we're serviced */
 };
 
 static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -414,6 +415,36 @@ static inline int sock_flag(struct sock 
 	return test_bit(flag, &sk->sk_flags);
 }
 
+static inline int sk_has_vmio(struct sock *sk)
+{
+	return sock_flag(sk, SOCK_VMIO);
+}
+
+#define MAX_PAGES_PER_SKB 3
+#define MAX_FRAGMENTS ((65536 + 1500 - 1) / 1500)
+/*
+ * Guestimate the per request queue TX upper bound.
+ */
+#define TX_RESERVE_PAGES \
+	(4 * MAX_FRAGMENTS * MAX_PAGES_PER_SKB)
+
+extern atomic_t vmio_socks;
+
+static inline int sk_vmio_socks(void)
+{
+	return atomic_read(&vmio_socks);
+}
+
+extern int rx_emergency_get(int bytes);
+extern int rx_emergency_get_overcommit(int bytes);
+extern void rx_emergency_put(int bytes);
+
+extern void sk_adjust_memalloc(int socks, int tx_reserve_pages);
+extern void skb_reserve_memory(int skb_reserve_bytes);
+extern void aux_reserve_memory(int aux_reserve_pages);
+extern int sk_set_vmio(struct sock *sk);
+extern int sk_clear_vmio(struct sock *sk);
+
 static inline void sk_acceptq_removed(struct sock *sk)
 {
 	sk->sk_ack_backlog--;
Index: linux-2.6-git/net/core/sock.c
===================================================================
--- linux-2.6-git.orig/net/core/sock.c	2007-02-20 15:06:17.000000000 +0100
+++ linux-2.6-git/net/core/sock.c	2007-02-20 15:18:48.000000000 +0100
@@ -112,6 +112,7 @@
 #include <linux/tcp.h>
 #include <linux/init.h>
 #include <linux/highmem.h>
+#include <linux/log2.h>
 
 #include <asm/uaccess.h>
 #include <asm/system.h>
@@ -196,6 +197,138 @@ __u32 sysctl_rmem_default __read_mostly 
 /* Maximal space eaten by iovec or ancilliary data plus some space */
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 
+static atomic_t rx_emergency_bytes;
+
+static int skb_reserve_bytes;
+static int aux_reserve_pages;
+
+static DEFINE_SPINLOCK(memalloc_lock);
+static int rx_net_reserve;
+atomic_t vmio_socks;
+EXPORT_SYMBOL_GPL(vmio_socks);
+
+/*
+ * is there room for another emergency packet?
+ * we account in power of two units to approx the slab allocator.
+ */
+static int __rx_emergency_get(int bytes, bool overcommit)
+{
+	int size = roundup_pow_of_two(bytes);
+	int nr = atomic_add_return(size, &rx_emergency_bytes);
+	int thresh = (3 * skb_reserve_bytes) / 2;
+	if (nr < thresh || overcommit)
+		return 1;
+
+	atomic_dec(&rx_emergency_bytes);
+	return 0;
+}
+
+int rx_emergency_get(int bytes)
+{
+	return __rx_emergency_get(bytes, false);
+}
+
+int rx_emergency_get_overcommit(int bytes)
+{
+	return __rx_emergency_get(bytes, true);
+}
+
+void rx_emergency_put(int bytes)
+{
+	int size = roundup_pow_of_two(bytes);
+	return atomic_sub(size, &rx_emergency_bytes);
+}
+
+/**
+ *	sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
+ *	@socks: number of new %SOCK_VMIO sockets
+ *	@tx_resserve_pages: number of pages to (un)reserve for TX
+ *
+ *	This function adjusts the memalloc reserve based on system demand.
+ *	The RX reserve is a limit, and only added once, not for each socket.
+ *
+ *	NOTE:
+ *	   @tx_reserve_pages is an upper-bound of memory used for TX hence
+ *	   we need not account the pages like we do for RX pages.
+ */
+void sk_adjust_memalloc(int socks, int tx_reserve_pages)
+{
+	unsigned long flags;
+	int reserve = tx_reserve_pages;
+	int nr_socks;
+
+	spin_lock_irqsave(&memalloc_lock, flags);
+	nr_socks = atomic_add_return(socks, &vmio_socks);
+	BUG_ON(nr_socks < 0);
+
+	if (nr_socks) {
+		int skb_reserve_pages = skb_reserve_bytes / PAGE_SIZE;
+		int rx_pages = 2 * skb_reserve_pages + aux_reserve_pages;
+		reserve += rx_pages - rx_net_reserve;
+		rx_net_reserve = rx_pages;
+	} else {
+		reserve -= rx_net_reserve;
+		rx_net_reserve = 0;
+	}
+
+	if (reserve)
+		adjust_memalloc_reserve(reserve);
+	spin_unlock_irqrestore(&memalloc_lock, flags);
+}
+EXPORT_SYMBOL_GPL(sk_adjust_memalloc);
+
+/*
+ * tiny helper functions to track the memory reserves
+ * needed because of modular ipv6
+ */
+void skb_reserve_memory(int bytes)
+{
+	skb_reserve_bytes += bytes;
+	sk_adjust_memalloc(0, 0);
+}
+EXPORT_SYMBOL_GPL(skb_reserve_memory);
+
+void aux_reserve_memory(int pages)
+{
+	aux_reserve_pages += pages;
+	sk_adjust_memalloc(0, 0);
+}
+EXPORT_SYMBOL_GPL(aux_reserve_memory);
+
+/**
+ *	sk_set_vmio - sets %SOCK_VMIO
+ *	@sk: socket to set it on
+ *
+ *	Set %SOCK_VMIO on a socket and increase the memalloc reserve
+ *	accordingly.
+ */
+int sk_set_vmio(struct sock *sk)
+{
+	int set = sock_flag(sk, SOCK_VMIO);
+#ifndef CONFIG_NETVM
+	BUG();
+#endif
+	if (!set) {
+		sk_adjust_memalloc(1, 0);
+		sock_set_flag(sk, SOCK_VMIO);
+		sk->sk_allocation |= __GFP_EMERGENCY;
+	}
+	return !set;
+}
+EXPORT_SYMBOL_GPL(sk_set_vmio);
+
+int sk_clear_vmio(struct sock *sk)
+{
+	int set = sock_flag(sk, SOCK_VMIO);
+	if (set) {
+		sk_adjust_memalloc(-1, 0);
+		sock_reset_flag(sk, SOCK_VMIO);
+		sk->sk_allocation &= ~__GFP_EMERGENCY;
+	}
+	return set;
+}
+EXPORT_SYMBOL_GPL(sk_clear_vmio);
+
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
 {
 	struct timeval tv;
@@ -868,6 +1001,7 @@ void sk_free(struct sock *sk)
 	struct sk_filter *filter;
 	struct module *owner = sk->sk_prot_creator->owner;
 
+	sk_clear_vmio(sk);
 	if (sk->sk_destruct)
 		sk->sk_destruct(sk);
 
Index: linux-2.6-git/net/Kconfig
===================================================================
--- linux-2.6-git.orig/net/Kconfig	2007-02-20 14:42:09.000000000 +0100
+++ linux-2.6-git/net/Kconfig	2007-02-20 15:06:17.000000000 +0100
@@ -227,6 +227,9 @@ config WIRELESS_EXT
 config FIB_RULES
 	bool
 
+config NETVM
+	def_bool n
+
 endif   # if NET
 endmenu # Networking
 

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 14/29] netvm: INET reserves.
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
                   ` (12 preceding siblings ...)
  2007-02-21 14:43 ` [PATCH 13/29] netvm: link network to vm layer Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 15/29] netvm: hook skb allocation to reserves Peter Zijlstra
                   ` (14 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: netvm-reserve-inet.patch --]
[-- Type: text/plain, Size: 6810 bytes --]

Add reserves for INET.

The two big users seem to be the route cache and ip-fragment cache.

Account the route cache to the auxillary reserve.
Account the fragments to the skb reserve so that one can at least
overflow the fragment cache (avoids fragment deadlocks).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 net/ipv4/ip_fragment.c     |    1 +
 net/ipv4/route.c           |   18 +++++++++++++++++-
 net/ipv4/sysctl_net_ipv4.c |   13 ++++++++++++-
 net/ipv6/reassembly.c      |    1 +
 net/ipv6/route.c           |   18 +++++++++++++++++-
 net/ipv6/sysctl_net_ipv6.c |   12 +++++++++++-
 6 files changed, 59 insertions(+), 4 deletions(-)

Index: linux-2.6-git/net/ipv4/sysctl_net_ipv4.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/sysctl_net_ipv4.c	2007-02-20 15:12:56.000000000 +0100
+++ linux-2.6-git/net/ipv4/sysctl_net_ipv4.c	2007-02-20 16:41:28.000000000 +0100
@@ -18,6 +18,7 @@
 #include <net/route.h>
 #include <net/tcp.h>
 #include <net/cipso_ipv4.h>
+#include <net/sock.h>
 
 /* From af_inet.c */
 extern int sysctl_ip_nonlocal_bind;
@@ -186,6 +187,16 @@ static int strategy_allowed_congestion_c
 
 }
 
+static int proc_dointvec_fragment(ctl_table *table, int write, struct file *filp,
+		     void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret;
+	int old_thresh = *(int *)table->data;
+	ret = proc_dointvec(table,write,filp,buffer,lenp,ppos);
+	skb_reserve_memory(*(int *)table->data - old_thresh);
+	return ret;
+}
+
 ctl_table ipv4_table[] = {
 	{
 		.ctl_name	= NET_IPV4_TCP_TIMESTAMPS,
@@ -291,7 +302,7 @@ ctl_table ipv4_table[] = {
 		.data		= &sysctl_ipfrag_high_thresh,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= &proc_dointvec
+		.proc_handler	= &proc_dointvec_fragment
 	},
 	{
 		.ctl_name	= NET_IPV4_IPFRAG_LOW_THRESH,
Index: linux-2.6-git/net/ipv6/sysctl_net_ipv6.c
===================================================================
--- linux-2.6-git.orig/net/ipv6/sysctl_net_ipv6.c	2007-02-20 15:12:56.000000000 +0100
+++ linux-2.6-git/net/ipv6/sysctl_net_ipv6.c	2007-02-20 16:41:28.000000000 +0100
@@ -15,6 +15,16 @@
 
 #ifdef CONFIG_SYSCTL
 
+static int proc_dointvec_fragment(ctl_table *table, int write, struct file *filp,
+		     void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret;
+	int old_thresh = *(int *)table->data;
+	ret = proc_dointvec(table,write,filp,buffer,lenp,ppos);
+	skb_reserve_memory(*(int *)table->data - old_thresh);
+	return ret;
+}
+
 static ctl_table ipv6_table[] = {
 	{
 		.ctl_name	= NET_IPV6_ROUTE,
@@ -44,7 +54,7 @@ static ctl_table ipv6_table[] = {
 		.data		= &sysctl_ip6frag_high_thresh,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= &proc_dointvec
+		.proc_handler	= &proc_dointvec_fragment
 	},
 	{
 		.ctl_name	= NET_IPV6_IP6FRAG_LOW_THRESH,
Index: linux-2.6-git/net/ipv4/ip_fragment.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/ip_fragment.c	2007-02-20 15:12:56.000000000 +0100
+++ linux-2.6-git/net/ipv4/ip_fragment.c	2007-02-20 16:41:28.000000000 +0100
@@ -743,6 +743,7 @@ void ipfrag_init(void)
 	ipfrag_secret_timer.function = ipfrag_secret_rebuild;
 	ipfrag_secret_timer.expires = jiffies + sysctl_ipfrag_secret_interval;
 	add_timer(&ipfrag_secret_timer);
+	skb_reserve_memory(sysctl_ipfrag_high_thresh);
 }
 
 EXPORT_SYMBOL(ip_defrag);
Index: linux-2.6-git/net/ipv6/reassembly.c
===================================================================
--- linux-2.6-git.orig/net/ipv6/reassembly.c	2007-02-20 15:12:56.000000000 +0100
+++ linux-2.6-git/net/ipv6/reassembly.c	2007-02-20 16:41:28.000000000 +0100
@@ -772,4 +772,5 @@ void __init ipv6_frag_init(void)
 	ip6_frag_secret_timer.function = ip6_frag_secret_rebuild;
 	ip6_frag_secret_timer.expires = jiffies + sysctl_ip6frag_secret_interval;
 	add_timer(&ip6_frag_secret_timer);
+	skb_reserve_memory(sysctl_ip6frag_high_thresh);
 }
Index: linux-2.6-git/net/ipv4/route.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/route.c	2007-02-20 15:12:56.000000000 +0100
+++ linux-2.6-git/net/ipv4/route.c	2007-02-20 16:41:28.000000000 +0100
@@ -2884,6 +2884,20 @@ static int ipv4_sysctl_rtcache_flush_str
 	return 0;
 }
 
+static int proc_dointvec_rt_size(ctl_table *table, int write, struct file *filp,
+		     void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret;
+	int new_pages;
+	int old_pages = kmem_cache_objs_to_pages(ipv4_dst_ops.kmem_cachep,
+			*(int *)table->data);
+	ret = proc_dointvec(table,write,filp,buffer,lenp,ppos);
+	new_pages = kmem_cache_objs_to_pages(ipv4_dst_ops.kmem_cachep,
+			*(int *)table->data);
+	aux_reserve_memory(new_pages - old_pages);
+	return ret;
+}
+
 ctl_table ipv4_route_table[] = {
 	{
 		.ctl_name 	= NET_IPV4_ROUTE_FLUSH,
@@ -2926,7 +2940,7 @@ ctl_table ipv4_route_table[] = {
 		.data		= &ip_rt_max_size,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= &proc_dointvec,
+		.proc_handler	= &proc_dointvec_rt_size,
 	},
 	{
 		/*  Deprecated. Use gc_min_interval_ms */
@@ -3153,6 +3167,8 @@ int __init ip_rt_init(void)
 
 	ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1);
 	ip_rt_max_size = (rt_hash_mask + 1) * 16;
+	aux_reserve_memory(kmem_cache_objs_to_pages(ipv4_dst_ops.kmem_cachep,
+				ip_rt_max_size));
 
 	devinet_init();
 	ip_fib_init();
Index: linux-2.6-git/net/ipv6/route.c
===================================================================
--- linux-2.6-git.orig/net/ipv6/route.c	2007-02-20 15:12:56.000000000 +0100
+++ linux-2.6-git/net/ipv6/route.c	2007-02-20 17:46:13.000000000 +0100
@@ -2370,6 +2370,20 @@ int ipv6_sysctl_rtcache_flush(ctl_table 
 		return -EINVAL;
 }
 
+static int proc_dointvec_rt_size(ctl_table *table, int write, struct file *filp,
+		     void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret;
+	int new_pages;
+	int old_pages = kmem_cache_objs_to_pages(ip6_dst_ops.kmem_cachep,
+			*(int *)table->data);
+	ret = proc_dointvec(table,write,filp,buffer,lenp,ppos);
+	new_pages = kmem_cache_objs_to_pages(ip6_dst_ops.kmem_cachep,
+			*(int *)table->data);
+	aux_reserve_memory(new_pages - old_pages);
+	return ret;
+}
+
 ctl_table ipv6_route_table[] = {
 	{
 		.ctl_name	=	NET_IPV6_ROUTE_FLUSH,
@@ -2393,7 +2407,7 @@ ctl_table ipv6_route_table[] = {
 		.data		=	&ip6_rt_max_size,
 		.maxlen		=	sizeof(int),
 		.mode		=	0644,
-		.proc_handler	=	&proc_dointvec,
+         	.proc_handler	=	&proc_dointvec_rt_size,
 	},
 	{
 		.ctl_name	=	NET_IPV6_ROUTE_GC_MIN_INTERVAL,
@@ -2478,6 +2492,8 @@ void __init ip6_route_init(void)
 
 	proc_net_fops_create("rt6_stats", S_IRUGO, &rt6_stats_seq_fops);
 #endif
+	aux_reserve_memory(kmem_cache_objs_to_pages(ip6_dst_ops.kmem_cachep,
+				ip6_rt_max_size));
 #ifdef CONFIG_XFRM
 	xfrm6_init();
 #endif

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 15/29] netvm: hook skb allocation to reserves
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
                   ` (13 preceding siblings ...)
  2007-02-21 14:43 ` [PATCH 14/29] netvm: INET reserves Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 16/29] netvm: filter emergency skbs Peter Zijlstra
                   ` (13 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: netvm-skbuff-reserve.patch --]
[-- Type: text/plain, Size: 14479 bytes --]

Change the skb allocation api to indicate RX usage and use this to fall back to
the reserve when needed. Skbs allocated from the reserve are tagged in
skb->emergency.

Teach all other skb ops about emergency skbs and the reserve accounting.

Use the (new) packet split API to allocate and track fragment pages from the
emergency reserve. Do this using an atomic counter in page->index. This is
needed because the fragments have a different sharing semantic than that
indicated by skb_shinfo()->dataref. 

(NOTE the extra atomic overhead is only for those pages allocated from the
reserves - it does not affect the normal fast path.)

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/skbuff.h |   22 ++++--
 net/core/skbuff.c      |  170 ++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 165 insertions(+), 27 deletions(-)

Index: linux-2.6-git/include/linux/skbuff.h
===================================================================
--- linux-2.6-git.orig/include/linux/skbuff.h	2007-02-15 12:31:05.000000000 +0100
+++ linux-2.6-git/include/linux/skbuff.h	2007-02-15 12:31:05.000000000 +0100
@@ -284,7 +284,8 @@ struct sk_buff {
 				nfctinfo:3;
 	__u8			pkt_type:3,
 				fclone:2,
-				ipvs_property:1;
+				ipvs_property:1,
+				emergency:1;
 	__be16			protocol;
 
 	void			(*destructor)(struct sk_buff *skb);
@@ -329,10 +330,19 @@ struct sk_buff {
 
 #include <asm/system.h>
 
+#define SKB_ALLOC_FCLONE	0x01
+#define SKB_ALLOC_RX		0x02
+
+#ifdef CONFIG_NETVM
+#define skb_emergency(skb)	unlikely((skb)->emergency)
+#else
+#define skb_emergency(skb)	false
+#endif
+
 extern void kfree_skb(struct sk_buff *skb);
 extern void	       __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
-				   gfp_t priority, int fclone, int node);
+				   gfp_t priority, int flags, int node);
 static inline struct sk_buff *alloc_skb(unsigned int size,
 					gfp_t priority)
 {
@@ -342,7 +352,7 @@ static inline struct sk_buff *alloc_skb(
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 					       gfp_t priority)
 {
-	return __alloc_skb(size, priority, 1, -1);
+	return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, -1);
 }
 
 extern void	       kfree_skbmem(struct sk_buff *skb);
@@ -1103,7 +1113,8 @@ static inline void __skb_queue_purge(str
 static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
 					      gfp_t gfp_mask)
 {
-	struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+	struct sk_buff *skb =
+		__alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX, -1);
 	if (likely(skb))
 		skb_reserve(skb, NET_SKB_PAD);
 	return skb;
@@ -1149,6 +1160,7 @@ static inline struct sk_buff *netdev_all
 }
 
 extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask);
+extern void __netdev_free_page(struct net_device *dev, struct page *page);
 
 /**
  *	netdev_alloc_page - allocate a page for ps-rx on a specific device
@@ -1165,7 +1177,7 @@ static inline struct page *netdev_alloc_
 
 static inline void netdev_free_page(struct net_device *dev, struct page *page)
 {
-	__free_page(page);
+	__netdev_free_page(dev, page);
 }
 
 /**
Index: linux-2.6-git/net/core/skbuff.c
===================================================================
--- linux-2.6-git.orig/net/core/skbuff.c	2007-02-15 12:31:05.000000000 +0100
+++ linux-2.6-git/net/core/skbuff.c	2007-02-15 12:45:50.000000000 +0100
@@ -142,28 +142,36 @@ EXPORT_SYMBOL(skb_truesize_bug);
  *	%GFP_ATOMIC.
  */
 struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
-			    int fclone, int node)
+			    int flags, int node)
 {
 	struct kmem_cache *cache;
 	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
 	u8 *data;
+	int emergency = 0;
 
-	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
+	size = SKB_DATA_ALIGN(size);
+	cache = (flags & SKB_ALLOC_FCLONE)
+		? skbuff_fclone_cache : skbuff_head_cache;
+#ifdef CONFIG_NETVM
+	if (flags & SKB_ALLOC_RX)
+		gfp_mask |= __GFP_NOMEMALLOC|__GFP_NOWARN;
+#endif
 
+retry_alloc:
 	/* Get the HEAD */
 	skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
 	if (!skb)
-		goto out;
+		goto noskb;
 
 	/* Get the DATA. Size must match skb_add_mtu(). */
-	size = SKB_DATA_ALIGN(size);
 	data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
 			gfp_mask, node);
 	if (!data)
 		goto nodata;
 
 	memset(skb, 0, offsetof(struct sk_buff, truesize));
+	skb->emergency = emergency;
 	skb->truesize = size + sizeof(struct sk_buff);
 	atomic_set(&skb->users, 1);
 	skb->head = data;
@@ -180,7 +188,7 @@ struct sk_buff *__alloc_skb(unsigned int
 	shinfo->ip6_frag_id = 0;
 	shinfo->frag_list = NULL;
 
-	if (fclone) {
+	if (flags & SKB_ALLOC_FCLONE) {
 		struct sk_buff *child = skb + 1;
 		atomic_t *fclone_ref = (atomic_t *) (child + 1);
 
@@ -188,12 +196,31 @@ struct sk_buff *__alloc_skb(unsigned int
 		atomic_set(fclone_ref, 1);
 
 		child->fclone = SKB_FCLONE_UNAVAILABLE;
+		child->emergency = skb->emergency;
 	}
 out:
 	return skb;
+
 nodata:
 	kmem_cache_free(cache, skb);
 	skb = NULL;
+noskb:
+#ifdef CONFIG_NETVM
+	/* Attempt emergency allocation when RX skb. */
+	if (likely(!(flags & SKB_ALLOC_RX) || !sk_vmio_socks()))
+		goto out;
+
+	if (!emergency) {
+		if (rx_emergency_get(size)) {
+			gfp_mask &= ~(__GFP_NOMEMALLOC|__GFP_NOWARN);
+			gfp_mask |= __GFP_EMERGENCY;
+			emergency = 1;
+			goto retry_alloc;
+		}
+	} else
+		rx_emergency_put(size);
+#endif
+
 	goto out;
 }
 
@@ -216,7 +243,7 @@ struct sk_buff *__netdev_alloc_skb(struc
 	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
 	struct sk_buff *skb;
 
-	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
+ 	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX, node);
 	if (likely(skb)) {
 		skb_reserve(skb, NET_SKB_PAD);
 		skb->dev = dev;
@@ -229,10 +256,34 @@ struct page *__netdev_alloc_page(struct 
 	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
 	struct page *page;
 
+#ifdef CONFIG_NETVM
+	gfp_mask |= __GFP_NOMEMALLOC | __GFP_NOWARN;
+#endif
+
 	page = alloc_pages_node(node, gfp_mask, 0);
+
+#ifdef CONFIG_NETVM
+	if (!page && rx_emergency_get(PAGE_SIZE)) {
+		gfp_mask &= ~(__GFP_NOMEMALLOC | __GFP_NOWARN);
+		gfp_mask |= __GFP_EMERGENCY;
+		page = alloc_pages_node(node, gfp_mask, 0);
+		if (!page)
+			rx_emergency_put(PAGE_SIZE);
+	}
+#endif
+
 	return page;
 }
 
+void __netdev_free_page(struct net_device *dev, struct page *page)
+{
+#ifdef CONFIG_NETVM
+	if (unlikely(page->index == 0))
+		rx_emergency_put(PAGE_SIZE);
+#endif
+	__free_page(page);
+}
+
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
 		int size)
 {
@@ -240,6 +291,33 @@ void skb_add_rx_frag(struct sk_buff *skb
 	skb->len += size;
 	skb->data_len += size;
 	skb->truesize += size;
+
+#ifdef CONFIG_NETVM
+	/*
+	 * Fix-up the emergency accounting; make sure all pages match
+	 * skb->emergency.
+	 *
+	 * This relies on the page rank (page->index) to be preserved between
+	 * the call to __netdev_alloc_page() and this call.
+	 */
+	if (skb_emergency(skb)) {
+		/*
+		 * If the page rank wasn't 0 (ALLOC_NO_WATERMARK) we can use
+		 * overcommit accounting, since we already have the memory.
+		 */
+		if (page->index != 0)
+			rx_emergency_get_overcommit(PAGE_SIZE);
+		atomic_set((atomic_t *)&page->index, 1);
+	} else if (unlikely(page->index == 0)) {
+		/*
+		 * Rare case; the skb wasn't allocated under pressure but
+		 * the page was. We need to return the page. This can offset
+		 * the accounting a little, but its a constant shift, it does
+		 * not accumulate.
+		 */
+		rx_emergency_put(PAGE_SIZE);
+	}
+#endif
 }
 
 static void skb_drop_list(struct sk_buff **listp)
@@ -273,16 +351,25 @@ static void skb_release_data(struct sk_b
 	if (!skb->cloned ||
 	    !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
 			       &skb_shinfo(skb)->dataref)) {
+		int size = skb->end - skb->head;
+
 		if (skb_shinfo(skb)->nr_frags) {
 			int i;
-			for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
-				put_page(skb_shinfo(skb)->frags[i].page);
+			for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+				struct page *page = skb_shinfo(skb)->frags[i].page;
+				put_page(page);
+				if (skb_emergency(skb) &&
+				    atomic_dec_and_test((atomic_t *)&page->index))
+					rx_emergency_put(PAGE_SIZE);
+			}
 		}
 
 		if (skb_shinfo(skb)->frag_list)
 			skb_drop_fraglist(skb);
 
 		kfree(skb->head);
+		if (skb_emergency(skb))
+			rx_emergency_put(size);
 	}
 }
 
@@ -403,6 +490,9 @@ struct sk_buff *skb_clone(struct sk_buff
 		n->fclone = SKB_FCLONE_CLONE;
 		atomic_inc(fclone_ref);
 	} else {
+		if (skb_emergency(skb))
+			gfp_mask |= __GFP_EMERGENCY;
+
 		n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
 		if (!n)
 			return NULL;
@@ -437,6 +527,7 @@ struct sk_buff *skb_clone(struct sk_buff
 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
 	C(ipvs_property);
 #endif
+	C(emergency);
 	C(protocol);
 	n->destructor = NULL;
 	C(mark);
@@ -530,6 +621,8 @@ static void copy_skb_header(struct sk_bu
 	skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
 }
 
+#define skb_alloc_rx(skb) (skb_emergency(skb) ? SKB_ALLOC_RX : 0)
+
 /**
  *	skb_copy	-	create private copy of an sk_buff
  *	@skb: buffer to copy
@@ -553,8 +646,8 @@ struct sk_buff *skb_copy(const struct sk
 	/*
 	 *	Allocate the copy buffer
 	 */
-	struct sk_buff *n = alloc_skb(skb->end - skb->head + skb->data_len,
-				      gfp_mask);
+	struct sk_buff *n = __alloc_skb(skb->end - skb->head + skb->data_len,
+					gfp_mask, skb_alloc_rx(skb), -1);
 	if (!n)
 		return NULL;
 
@@ -591,7 +684,8 @@ struct sk_buff *pskb_copy(struct sk_buff
 	/*
 	 *	Allocate the copy buffer
 	 */
-	struct sk_buff *n = alloc_skb(skb->end - skb->head, gfp_mask);
+	struct sk_buff *n = __alloc_skb(skb->end - skb->head, gfp_mask,
+					skb_alloc_rx(skb), -1);
 
 	if (!n)
 		goto out;
@@ -613,8 +707,11 @@ struct sk_buff *pskb_copy(struct sk_buff
 		int i;
 
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
-			skb_shinfo(n)->frags[i] = skb_shinfo(skb)->frags[i];
-			get_page(skb_shinfo(n)->frags[i].page);
+			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+			skb_shinfo(n)->frags[i] = *frag;
+			get_page(frag->page);
+			if (skb_emergency(n))
+				atomic_inc((atomic_t *)&frag->page->index);
 		}
 		skb_shinfo(n)->nr_frags = i;
 	}
@@ -652,12 +749,19 @@ int pskb_expand_head(struct sk_buff *skb
 	u8 *data;
 	int size = nhead + (skb->end - skb->head) + ntail;
 	long off;
+	int emergency = 0;
 
 	if (skb_shared(skb))
 		BUG();
 
 	size = SKB_DATA_ALIGN(size);
 
+	if (skb_emergency(skb) && rx_emergency_get(size)) {
+		gfp_mask |= __GFP_EMERGENCY;
+		emergency = 1;
+	} else
+		gfp_mask |= __GFP_NOMEMALLOC;
+
 	data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
 	if (!data)
 		goto nodata;
@@ -667,8 +771,12 @@ int pskb_expand_head(struct sk_buff *skb
 	memcpy(data + nhead, skb->head, skb->tail - skb->head);
 	memcpy(data + size, skb->end, sizeof(struct skb_shared_info));
 
-	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
-		get_page(skb_shinfo(skb)->frags[i].page);
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+		struct page *page = skb_shinfo(skb)->frags[i].page;
+		get_page(page);
+		if (emergency)
+			atomic_inc((atomic_t *)&page->index);
+	}
 
 	if (skb_shinfo(skb)->frag_list)
 		skb_clone_fraglist(skb);
@@ -690,6 +798,8 @@ int pskb_expand_head(struct sk_buff *skb
 	return 0;
 
 nodata:
+	if (unlikely(emergency))
+		rx_emergency_put(size);
 	return -ENOMEM;
 }
 
@@ -742,8 +852,8 @@ struct sk_buff *skb_copy_expand(const st
 	/*
 	 *	Allocate the copy buffer
 	 */
-	struct sk_buff *n = alloc_skb(newheadroom + skb->len + newtailroom,
-				      gfp_mask);
+	struct sk_buff *n = __alloc_skb(newheadroom + skb->len + newtailroom,
+					gfp_mask, skb_alloc_rx(skb), -1);
 	int head_copy_len, head_copy_off;
 
 	if (!n)
@@ -849,8 +959,13 @@ int ___pskb_trim(struct sk_buff *skb, un
 drop_pages:
 		skb_shinfo(skb)->nr_frags = i;
 
-		for (; i < nfrags; i++)
-			put_page(skb_shinfo(skb)->frags[i].page);
+		for (; i < nfrags; i++) {
+			struct page *page = skb_shinfo(skb)->frags[i].page;
+			put_page(page);
+			if (skb_emergency(skb) &&
+			    atomic_dec_and_test((atomic_t *)&page->index))
+				rx_emergency_put(PAGE_SIZE);
+		}
 
 		if (skb_shinfo(skb)->frag_list)
 			skb_drop_fraglist(skb);
@@ -1019,7 +1134,11 @@ pull_pages:
 	k = 0;
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		if (skb_shinfo(skb)->frags[i].size <= eat) {
-			put_page(skb_shinfo(skb)->frags[i].page);
+			struct page *page = skb_shinfo(skb)->frags[i].page;
+			put_page(page);
+			if (skb_emergency(skb) &&
+			    atomic_dec_and_test((atomic_t *)&page->index))
+				rx_emergency_put(PAGE_SIZE);
 			eat -= skb_shinfo(skb)->frags[i].size;
 		} else {
 			skb_shinfo(skb)->frags[k] = skb_shinfo(skb)->frags[i];
@@ -1593,6 +1712,7 @@ static inline void skb_split_no_header(s
 			skb_shinfo(skb1)->frags[k] = skb_shinfo(skb)->frags[i];
 
 			if (pos < len) {
+				struct page *page = skb_shinfo(skb)->frags[i].page;
 				/* Split frag.
 				 * We have two variants in this case:
 				 * 1. Move all the frag to the second
@@ -1601,7 +1721,9 @@ static inline void skb_split_no_header(s
 				 *    where splitting is expensive.
 				 * 2. Split is accurately. We make this.
 				 */
-				get_page(skb_shinfo(skb)->frags[i].page);
+				get_page(page);
+				if (skb_emergency(skb1))
+					atomic_inc((atomic_t *)&page->index);
 				skb_shinfo(skb1)->frags[0].page_offset += len - pos;
 				skb_shinfo(skb1)->frags[0].size -= len - pos;
 				skb_shinfo(skb)->frags[i].size	= len - pos;
@@ -1927,7 +2049,8 @@ struct sk_buff *skb_segment(struct sk_bu
 		if (hsize > len || !sg)
 			hsize = len;
 
-		nskb = alloc_skb(hsize + doffset + headroom, GFP_ATOMIC);
+		nskb = __alloc_skb(hsize + doffset + headroom, GFP_ATOMIC,
+				   skb_alloc_rx(skb), -1);
 		if (unlikely(!nskb))
 			goto err;
 
@@ -1970,6 +2093,8 @@ struct sk_buff *skb_segment(struct sk_bu
 
 			*frag = skb_shinfo(skb)->frags[i];
 			get_page(frag->page);
+			if (skb_emergency(nskb))
+				atomic_inc((atomic_t *)&frag->page->index);
 			size = frag->size;
 
 			if (pos < offset) {
@@ -2030,6 +2155,7 @@ EXPORT_SYMBOL(__pskb_pull_tail);
 EXPORT_SYMBOL(__alloc_skb);
 EXPORT_SYMBOL(__netdev_alloc_skb);
 EXPORT_SYMBOL(__netdev_alloc_page);
+EXPORT_SYMBOL(__netdev_free_page);
 EXPORT_SYMBOL(skb_add_rx_frag);
 EXPORT_SYMBOL(pskb_copy);
 EXPORT_SYMBOL(pskb_expand_head);

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 16/29] netvm: filter emergency skbs.
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
                   ` (14 preceding siblings ...)
  2007-02-21 14:43 ` [PATCH 15/29] netvm: hook skb allocation to reserves Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 17/29] netvm: prevent a TCP specific deadlock Peter Zijlstra
                   ` (12 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: netvm-sk_filter.patch --]
[-- Type: text/plain, Size: 751 bytes --]

Toss all emergency packets not for a SOCK_VMIO socket. This ensures our
precious memory reserve doesn't get stuck waiting for user-space.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/net/sock.h |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6-git/include/net/sock.h
===================================================================
--- linux-2.6-git.orig/include/net/sock.h	2007-02-14 16:15:49.000000000 +0100
+++ linux-2.6-git/include/net/sock.h	2007-02-14 16:16:27.000000000 +0100
@@ -926,6 +926,9 @@ static inline int sk_filter(struct sock 
 {
 	int err;
 	struct sk_filter *filter;
+
+	if (skb_emergency(skb) && !sk_has_vmio(sk))
+		return -EPERM;
 	
 	err = security_sock_rcv_skb(sk, skb);
 	if (err)

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 17/29] netvm: prevent a TCP specific deadlock
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
                   ` (15 preceding siblings ...)
  2007-02-21 14:43 ` [PATCH 16/29] netvm: filter emergency skbs Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 18/29] netfilter: notify about NF_QUEUE vs emergency skbs Peter Zijlstra
                   ` (11 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: netvm-tcp-deadlock.patch --]
[-- Type: text/plain, Size: 2597 bytes --]

It could happen that all !SOCK_VMIO sockets have buffered so much data
that we're over the global rmem limit. This will prevent SOCK_VMIO buffers
from receiving data, which will prevent userspace from running, which is needed
to reduce the buffered data.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/net/sock.h  |    7 ++++---
 net/core/stream.c   |    5 +++--
 net/ipv4/tcp_ipv4.c |    8 ++++++++
 net/ipv6/tcp_ipv6.c |    8 ++++++++
 4 files changed, 23 insertions(+), 5 deletions(-)

Index: linux-2.6-git/include/net/sock.h
===================================================================
--- linux-2.6-git.orig/include/net/sock.h	2007-02-14 12:09:05.000000000 +0100
+++ linux-2.6-git/include/net/sock.h	2007-02-14 12:09:21.000000000 +0100
@@ -730,7 +730,8 @@ static inline struct inode *SOCK_INODE(s
 }
 
 extern void __sk_stream_mem_reclaim(struct sock *sk);
-extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);
+extern int sk_stream_mem_schedule(struct sock *sk, struct sk_buff *skb,
+		int size, int kind);
 
 #define SK_STREAM_MEM_QUANTUM ((int)PAGE_SIZE)
 
@@ -757,13 +758,13 @@ static inline void sk_stream_writequeue_
 static inline int sk_stream_rmem_schedule(struct sock *sk, struct sk_buff *skb)
 {
 	return (int)skb->truesize <= sk->sk_forward_alloc ||
-		sk_stream_mem_schedule(sk, skb->truesize, 1);
+		sk_stream_mem_schedule(sk, skb, skb->truesize, 1);
 }
 
 static inline int sk_stream_wmem_schedule(struct sock *sk, int size)
 {
 	return size <= sk->sk_forward_alloc ||
-	       sk_stream_mem_schedule(sk, size, 0);
+	       sk_stream_mem_schedule(sk, NULL, size, 0);
 }
 
 /* Used by processes to "lock" a socket state, so that
Index: linux-2.6-git/net/core/stream.c
===================================================================
--- linux-2.6-git.orig/net/core/stream.c	2007-02-14 12:09:05.000000000 +0100
+++ linux-2.6-git/net/core/stream.c	2007-02-14 12:09:21.000000000 +0100
@@ -207,7 +207,7 @@ void __sk_stream_mem_reclaim(struct sock
 
 EXPORT_SYMBOL(__sk_stream_mem_reclaim);
 
-int sk_stream_mem_schedule(struct sock *sk, int size, int kind)
+int sk_stream_mem_schedule(struct sock *sk, struct sk_buff *skb, int size, int kind)
 {
 	int amt = sk_stream_pages(size);
 
@@ -224,7 +224,8 @@ int sk_stream_mem_schedule(struct sock *
 	/* Over hard limit. */
 	if (atomic_read(sk->sk_prot->memory_allocated) > sk->sk_prot->sysctl_mem[2]) {
 		sk->sk_prot->enter_memory_pressure();
-		goto suppress_allocation;
+		if (!skb || (skb && !skb_emergency(skb)))
+			goto suppress_allocation;
 	}
 
 	/* Under pressure. */

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 18/29] netfilter: notify about NF_QUEUE vs emergency skbs
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
                   ` (16 preceding siblings ...)
  2007-02-21 14:43 ` [PATCH 17/29] netvm: prevent a TCP specific deadlock Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-24 15:27   ` Patrick McHardy
  2007-02-21 14:43 ` [PATCH 19/29] netvm: skb processing Peter Zijlstra
                   ` (10 subsequent siblings)
  28 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: emergency-nf_queue.patch --]
[-- Type: text/plain, Size: 977 bytes --]

Emergency skbs should never touch user-space, however NF_QUEUE is fully user
configurable. Notify the user of his mistake and try to continue.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 net/netfilter/core.c |    5 +++++
 1 file changed, 5 insertions(+)

Index: linux-2.6-git/net/netfilter/core.c
===================================================================
--- linux-2.6-git.orig/net/netfilter/core.c	2007-02-14 12:09:07.000000000 +0100
+++ linux-2.6-git/net/netfilter/core.c	2007-02-14 12:09:18.000000000 +0100
@@ -187,6 +187,11 @@ next_hook:
 		kfree_skb(*pskb);
 		ret = -EPERM;
 	} else if ((verdict & NF_VERDICT_MASK)  == NF_QUEUE) {
+		if (unlikely((*pskb)->emergency)) {
+			printk(KERN_ERR "nf_hook: NF_QUEUE encountered for "
+					"emergency skb - skipping rule.\n");
+			goto next_hook;
+		}
 		NFDEBUG("nf_hook: Verdict = QUEUE.\n");
 		if (!nf_queue(*pskb, elem, pf, hook, indev, outdev, okfn,
 			      verdict >> NF_VERDICT_BITS))

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 19/29] netvm: skb processing
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
                   ` (17 preceding siblings ...)
  2007-02-21 14:43 ` [PATCH 18/29] netfilter: notify about NF_QUEUE vs emergency skbs Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 20/29] uml: rename arch/um remove_mapping() Peter Zijlstra
                   ` (9 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: netvm.patch --]
[-- Type: text/plain, Size: 4730 bytes --]

In order to make sure emergency packets receive all memory needed to proceed
ensure processing of emergency skbs happens under PF_MEMALLOC.

Use the (new) sk_backlog_rcv() wrapper to ensure this for backlog processing.

Skip taps, since those are user-space again.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/net/sock.h |    4 ++++
 net/core/dev.c     |   42 +++++++++++++++++++++++++++++++++++++-----
 net/core/sock.c    |   19 +++++++++++++++++++
 3 files changed, 60 insertions(+), 5 deletions(-)

Index: linux-2.6-git/net/core/dev.c
===================================================================
--- linux-2.6-git.orig/net/core/dev.c	2007-02-14 12:16:03.000000000 +0100
+++ linux-2.6-git/net/core/dev.c	2007-02-14 12:28:33.000000000 +0100
@@ -1767,10 +1767,23 @@ int netif_receive_skb(struct sk_buff *sk
 	struct net_device *orig_dev;
 	int ret = NET_RX_DROP;
 	__be16 type;
+	unsigned long pflags = current->flags;
+
+	/* Emergency skb are special, they should
+	 *  - be delivered to SOCK_VMIO sockets only
+	 *  - stay away from userspace
+	 *  - have bounded memory usage
+	 *
+	 * Use PF_MEMALLOC as a poor mans memory pool - the grouping kind.
+	 * This saves us from propagating the allocation context down to all
+	 * allocation sites.
+	 */
+	if (skb_emergency(skb))
+		current->flags |= PF_MEMALLOC;
 
 	/* if we've gotten here through NAPI, check netpoll */
 	if (skb->dev->poll && netpoll_rx(skb))
-		return NET_RX_DROP;
+		goto out;
 
 	if (!skb->tstamp.off_sec)
 		net_timestamp(skb);
@@ -1781,7 +1794,7 @@ int netif_receive_skb(struct sk_buff *sk
 	orig_dev = skb_bond(skb);
 
 	if (!orig_dev)
-		return NET_RX_DROP;
+		goto out;
 
 	__get_cpu_var(netdev_rx_stat).total++;
 
@@ -1799,6 +1812,9 @@ int netif_receive_skb(struct sk_buff *sk
 	}
 #endif
 
+	if (skb_emergency(skb))
+		goto skip_taps;
+
 	list_for_each_entry_rcu(ptype, &ptype_all, list) {
 		if (!ptype->dev || ptype->dev == skb->dev) {
 			if (pt_prev)
@@ -1807,6 +1823,7 @@ int netif_receive_skb(struct sk_buff *sk
 		}
 	}
 
+skip_taps:
 #ifdef CONFIG_NET_CLS_ACT
 	if (pt_prev) {
 		ret = deliver_skb(skb, pt_prev, orig_dev);
@@ -1819,15 +1836,27 @@ int netif_receive_skb(struct sk_buff *sk
 
 	if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
 		kfree_skb(skb);
-		goto out;
+		goto unlock;
 	}
 
 	skb->tc_verd = 0;
 ncls:
 #endif
 
+	if (skb_emergency(skb))
+		switch(skb->protocol) {
+			case __constant_htons(ETH_P_ARP):
+			case __constant_htons(ETH_P_IP):
+			case __constant_htons(ETH_P_IPV6):
+			case __constant_htons(ETH_P_8021Q):
+				break;
+
+			default:
+				goto drop;
+		}
+
 	if (handle_bridge(&skb, &pt_prev, &ret, orig_dev))
-		goto out;
+		goto unlock;
 
 	type = skb->protocol;
 	list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) {
@@ -1842,6 +1871,7 @@ ncls:
 	if (pt_prev) {
 		ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
 	} else {
+drop:
 		kfree_skb(skb);
 		/* Jamal, now you will not able to escape explaining
 		 * me how you were going to use this. :-)
@@ -1849,8 +1879,10 @@ ncls:
 		ret = NET_RX_DROP;
 	}
 
-out:
+unlock:
 	rcu_read_unlock();
+out:
+	current->flags = pflags;
 	return ret;
 }
 
Index: linux-2.6-git/include/net/sock.h
===================================================================
--- linux-2.6-git.orig/include/net/sock.h	2007-02-14 12:32:03.000000000 +0100
+++ linux-2.6-git/include/net/sock.h	2007-02-14 12:32:37.000000000 +0100
@@ -510,10 +510,14 @@ static inline void sk_add_backlog(struct
 	skb->next = NULL;
 }
 
+#ifndef CONFIG_NETVM
 static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
 {
 	return sk->sk_backlog_rcv(sk, skb);
 }
+#else
+extern int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb);
+#endif
 
 #define sk_wait_event(__sk, __timeo, __condition)		\
 ({	int rc;							\
Index: linux-2.6-git/net/core/sock.c
===================================================================
--- linux-2.6-git.orig/net/core/sock.c	2007-02-14 12:32:07.000000000 +0100
+++ linux-2.6-git/net/core/sock.c	2007-02-14 12:37:11.000000000 +0100
@@ -332,6 +332,25 @@ int sk_clear_vmio(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(sk_clear_vmio);
 
+#ifdef CONFIG_NETVM
+int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+	if (skb_emergency(skb)) {
+		int ret;
+		unsigned long pflags = current->flags;
+	       	/* these should have been dropped before queueing */
+		BUG_ON(!sk_has_vmio(sk));
+		current->flags |= PF_MEMALLOC;
+		ret = sk->sk_backlog_rcv(sk, skb);
+		current->flags = pflags;
+		return ret;
+	}
+
+	return sk->sk_backlog_rcv(sk, skb);
+}
+EXPORT_SYMBOL(sk_backlog_rcv);
+#endif
+
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
 {
 	struct timeval tv;

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 20/29] uml: rename arch/um remove_mapping()
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
                   ` (18 preceding siblings ...)
  2007-02-21 14:43 ` [PATCH 19/29] netvm: skb processing Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 21/29] mm: prepare swap entry methods for use in page methods Peter Zijlstra
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: uml_remove_mapping.patch --]
[-- Type: text/plain, Size: 1342 bytes --]

When 'include/linux/mm.h' includes 'include/linux/swap.h', the global
remove_mapping() definition clashes with the arch/um one.

Rename the arch/um one.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jeff Dike <jdike@addtoit.com>
---
 arch/um/kernel/physmem.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6-git/arch/um/kernel/physmem.c
===================================================================
--- linux-2.6-git.orig/arch/um/kernel/physmem.c	2007-02-12 09:40:47.000000000 +0100
+++ linux-2.6-git/arch/um/kernel/physmem.c	2007-02-12 11:17:47.000000000 +0100
@@ -160,7 +160,7 @@ int physmem_subst_mapping(void *virt, in
 
 static int physmem_fd = -1;
 
-static void remove_mapping(struct phys_desc *desc)
+static void um_remove_mapping(struct phys_desc *desc)
 {
 	void *virt = desc->virt;
 	int err;
@@ -184,7 +184,7 @@ int physmem_remove_mapping(void *virt)
 	if(desc == NULL)
 		return 0;
 
-	remove_mapping(desc);
+	um_remove_mapping(desc);
 	return 1;
 }
 
@@ -205,7 +205,7 @@ void physmem_forget_descriptor(int fd)
 		page = list_entry(ele, struct phys_desc, list);
 		offset = page->offset;
 		addr = page->virt;
-		remove_mapping(page);
+		um_remove_mapping(page);
 		err = os_seek_file(fd, offset);
 		if(err)
 			panic("physmem_forget_descriptor - failed to seek "

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 21/29] mm: prepare swap entry methods for use in page methods
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
                   ` (19 preceding siblings ...)
  2007-02-21 14:43 ` [PATCH 20/29] uml: rename arch/um remove_mapping() Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 22/29] mm: add support for non block device backed swap files Peter Zijlstra
                   ` (7 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: mm-swap_entry_methods.patch --]
[-- Type: text/plain, Size: 5570 bytes --]

Move around the swap entry methods in preparation for use from
page methods.

Also provide a function to obtain the swap_info_struct backing
a swap cache page.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Trond Myklebust <trond.myklebust@fys.uio.no>
---
 include/linux/mm.h      |    8 ++++++++
 include/linux/swap.h    |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/swapops.h |   44 --------------------------------------------
 mm/swapfile.c           |    1 +
 4 files changed, 57 insertions(+), 44 deletions(-)

Index: linux-2.6-git/include/linux/mm.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm.h	2007-02-21 12:15:00.000000000 +0100
+++ linux-2.6-git/include/linux/mm.h	2007-02-21 12:15:01.000000000 +0100
@@ -17,6 +17,7 @@
 #include <linux/debug_locks.h>
 #include <linux/backing-dev.h>
 #include <linux/mm_types.h>
+#include <linux/swap.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -586,6 +587,13 @@ static inline struct address_space *page
 	return mapping;
 }
 
+static inline struct swap_info_struct *page_swap_info(struct page *page)
+{
+	swp_entry_t swap = { .val = page_private(page) };
+	BUG_ON(!PageSwapCache(page));
+	return get_swap_info_struct(swp_type(swap));
+}
+
 static inline int PageAnon(struct page *page)
 {
 	return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
Index: linux-2.6-git/include/linux/swap.h
===================================================================
--- linux-2.6-git.orig/include/linux/swap.h	2007-02-21 12:15:00.000000000 +0100
+++ linux-2.6-git/include/linux/swap.h	2007-02-21 12:15:01.000000000 +0100
@@ -79,6 +79,50 @@ typedef struct {
 } swp_entry_t;
 
 /*
+ * swapcache pages are stored in the swapper_space radix tree.  We want to
+ * get good packing density in that tree, so the index should be dense in
+ * the low-order bits.
+ *
+ * We arrange the `type' and `offset' fields so that `type' is at the five
+ * high-order bits of the swp_entry_t and `offset' is right-aligned in the
+ * remaining bits.
+ *
+ * swp_entry_t's are *never* stored anywhere in their arch-dependent format.
+ */
+#define SWP_TYPE_SHIFT(e)	(sizeof(e.val) * 8 - MAX_SWAPFILES_SHIFT)
+#define SWP_OFFSET_MASK(e)	((1UL << SWP_TYPE_SHIFT(e)) - 1)
+
+/*
+ * Store a type+offset into a swp_entry_t in an arch-independent format
+ */
+static inline swp_entry_t swp_entry(unsigned long type, pgoff_t offset)
+{
+	swp_entry_t ret;
+
+	ret.val = (type << SWP_TYPE_SHIFT(ret)) |
+			(offset & SWP_OFFSET_MASK(ret));
+	return ret;
+}
+
+/*
+ * Extract the `type' field from a swp_entry_t.  The swp_entry_t is in
+ * arch-independent format
+ */
+static inline unsigned swp_type(swp_entry_t entry)
+{
+	return (entry.val >> SWP_TYPE_SHIFT(entry));
+}
+
+/*
+ * Extract the `offset' field from a swp_entry_t.  The swp_entry_t is in
+ * arch-independent format
+ */
+static inline pgoff_t swp_offset(swp_entry_t entry)
+{
+	return entry.val & SWP_OFFSET_MASK(entry);
+}
+
+/*
  * current->reclaim_state points to one of these when a task is running
  * memory reclaim
  */
@@ -326,6 +370,10 @@ static inline int valid_swaphandles(swp_
 	return 0;
 }
 
+static inline struct swap_info_struct *get_swap_info_struct(unsigned type)
+{
+	return NULL;
+}
 #define can_share_swap_page(p)			(page_mapcount(p) == 1)
 
 static inline int move_to_swap_cache(struct page *page, swp_entry_t entry)
Index: linux-2.6-git/include/linux/swapops.h
===================================================================
--- linux-2.6-git.orig/include/linux/swapops.h	2007-02-21 12:15:00.000000000 +0100
+++ linux-2.6-git/include/linux/swapops.h	2007-02-21 12:15:01.000000000 +0100
@@ -1,48 +1,4 @@
 /*
- * swapcache pages are stored in the swapper_space radix tree.  We want to
- * get good packing density in that tree, so the index should be dense in
- * the low-order bits.
- *
- * We arrange the `type' and `offset' fields so that `type' is at the five
- * high-order bits of the swp_entry_t and `offset' is right-aligned in the
- * remaining bits.
- *
- * swp_entry_t's are *never* stored anywhere in their arch-dependent format.
- */
-#define SWP_TYPE_SHIFT(e)	(sizeof(e.val) * 8 - MAX_SWAPFILES_SHIFT)
-#define SWP_OFFSET_MASK(e)	((1UL << SWP_TYPE_SHIFT(e)) - 1)
-
-/*
- * Store a type+offset into a swp_entry_t in an arch-independent format
- */
-static inline swp_entry_t swp_entry(unsigned long type, pgoff_t offset)
-{
-	swp_entry_t ret;
-
-	ret.val = (type << SWP_TYPE_SHIFT(ret)) |
-			(offset & SWP_OFFSET_MASK(ret));
-	return ret;
-}
-
-/*
- * Extract the `type' field from a swp_entry_t.  The swp_entry_t is in
- * arch-independent format
- */
-static inline unsigned swp_type(swp_entry_t entry)
-{
-	return (entry.val >> SWP_TYPE_SHIFT(entry));
-}
-
-/*
- * Extract the `offset' field from a swp_entry_t.  The swp_entry_t is in
- * arch-independent format
- */
-static inline pgoff_t swp_offset(swp_entry_t entry)
-{
-	return entry.val & SWP_OFFSET_MASK(entry);
-}
-
-/*
  * Convert the arch-dependent pte representation of a swp_entry_t into an
  * arch-independent swp_entry_t.
  */
Index: linux-2.6-git/mm/swapfile.c
===================================================================
--- linux-2.6-git.orig/mm/swapfile.c	2007-02-21 12:15:00.000000000 +0100
+++ linux-2.6-git/mm/swapfile.c	2007-02-21 12:15:01.000000000 +0100
@@ -1764,6 +1764,7 @@ get_swap_info_struct(unsigned type)
 {
 	return &swap_info[type];
 }
+EXPORT_SYMBOL_GPL(get_swap_info_struct);
 
 /*
  * swap_lock prevents swap_map being freed. Don't grab an extra

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 22/29] mm: add support for non block device backed swap files
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
                   ` (20 preceding siblings ...)
  2007-02-21 14:43 ` [PATCH 21/29] mm: prepare swap entry methods for use in page methods Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 23/29] mm: methods for teaching filesystems about PG_swapcache pages Peter Zijlstra
                   ` (6 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: mm-swapfile.patch --]
[-- Type: text/plain, Size: 7664 bytes --]

A new addres_space_operations method is added:
  int swapfile(struct address_space *, int)

When during sys_swapon() this method is found and returns no error the 
swapper_space.a_ops will proxy to sis->swap_file->f_mapping->a_ops.

The swapfile method will be used to communicate to the address_space that the
VM relies on it, and the address_space should take adequate measures (like 
reserving memory for mempools or the like).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Trond Myklebust <trond.myklebust@fys.uio.no>
---
 Documentation/filesystems/Locking |    9 ++++++++
 include/linux/fs.h                |    1 
 include/linux/swap.h              |    3 ++
 mm/Kconfig                        |    4 +++
 mm/page_io.c                      |   42 ++++++++++++++++++++++++++++++++++++++
 mm/swap_state.c                   |    4 +++
 mm/swapfile.c                     |   22 +++++++++++++++++++
 7 files changed, 84 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/swap.h
===================================================================
--- linux-2.6.orig/include/linux/swap.h
+++ linux-2.6/include/linux/swap.h
@@ -163,6 +163,7 @@ enum {
 	SWP_USED	= (1 << 0),	/* is slot in swap_info[] used? */
 	SWP_WRITEOK	= (1 << 1),	/* ok to write to this swap?	*/
 	SWP_ACTIVE	= (SWP_USED | SWP_WRITEOK),
+	SWP_FILE	= (1 << 2),	/* file swap area */
 					/* add others here before... */
 	SWP_SCANNING	= (1 << 8),	/* refcount in scan_swap_map */
 };
@@ -265,6 +266,8 @@ extern int shmem_unuse(swp_entry_t entry
 /* linux/mm/page_io.c */
 extern int swap_readpage(struct file *, struct page *);
 extern int swap_writepage(struct page *page, struct writeback_control *wbc);
+extern void swap_sync_page(struct page *page);
+extern int swap_set_page_dirty(struct page *page);
 extern int end_swap_bio_read(struct bio *bio, unsigned int bytes_done, int err);
 
 /* linux/mm/swap_state.c */
Index: linux-2.6/mm/page_io.c
===================================================================
--- linux-2.6.orig/mm/page_io.c
+++ linux-2.6/mm/page_io.c
@@ -17,6 +17,7 @@
 #include <linux/bio.h>
 #include <linux/swapops.h>
 #include <linux/writeback.h>
+#include <linux/buffer_head.h>
 #include <asm/pgtable.h>
 
 static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
@@ -110,6 +111,18 @@ int swap_writepage(struct page *page, st
 		unlock_page(page);
 		goto out;
 	}
+#ifdef CONFIG_SWAP_FILE
+	{
+		struct swap_info_struct *sis = page_swap_info(page);
+		if (sis->flags & SWP_FILE) {
+			ret = sis->swap_file->f_mapping->
+				a_ops->writepage(page, wbc);
+			if (!ret)
+				count_vm_event(PSWPOUT);
+			return ret;
+		}
+	}
+#endif
 	bio = get_swap_bio(GFP_NOIO, page_private(page), page,
 				end_swap_bio_write);
 	if (bio == NULL) {
@@ -128,6 +141,23 @@ out:
 	return ret;
 }
 
+#ifdef CONFIG_SWAP_FILE
+int swap_set_page_dirty(struct page *page)
+{
+	struct swap_info_struct *sis = page_swap_info(page);
+
+	if (sis->flags & SWP_FILE) {
+		const struct address_space_operations * a_ops =
+			sis->swap_file->f_mapping->a_ops;
+		if (a_ops->set_page_dirty)
+			return a_ops->set_page_dirty(page);
+		return __set_page_dirty_buffers(page);
+	}
+
+	return __set_page_dirty_nobuffers(page);
+}
+#endif
+
 int swap_readpage(struct file *file, struct page *page)
 {
 	struct bio *bio;
@@ -135,6 +165,18 @@ int swap_readpage(struct file *file, str
 
 	BUG_ON(!PageLocked(page));
 	ClearPageUptodate(page);
+#ifdef CONFIG_SWAP_FILE
+	{
+		struct swap_info_struct *sis = page_swap_info(page);
+		if (sis->flags & SWP_FILE) {
+			ret = sis->swap_file->f_mapping->
+				a_ops->readpage(sis->swap_file, page);
+			if (!ret)
+				count_vm_event(PSWPIN);
+			return ret;
+		}
+	}
+#endif
 	bio = get_swap_bio(GFP_KERNEL, page_private(page), page,
 				end_swap_bio_read);
 	if (bio == NULL) {
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -26,7 +26,11 @@
  */
 static const struct address_space_operations swap_aops = {
 	.writepage	= swap_writepage,
+#ifdef CONFIG_SWAP_FILE
+	.set_page_dirty	= swap_set_page_dirty,
+#else
 	.set_page_dirty	= __set_page_dirty_nobuffers,
+#endif
 	.migratepage	= migrate_page,
 };
 
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -948,6 +948,13 @@ static void destroy_swap_extents(struct 
 		list_del(&se->list);
 		kfree(se);
 	}
+#ifdef CONFIG_SWAP_FILE
+	if (sis->flags & SWP_FILE) {
+		sis->flags &= ~SWP_FILE;
+		sis->swap_file->f_mapping->a_ops->
+			swapfile(sis->swap_file->f_mapping, 0);
+	}
+#endif
 }
 
 /*
@@ -1040,6 +1047,19 @@ static int setup_swap_extents(struct swa
 		goto done;
 	}
 
+#ifdef CONFIG_SWAP_FILE
+	if (sis->swap_file->f_mapping->a_ops->swapfile) {
+		ret = sis->swap_file->f_mapping->a_ops->
+			swapfile(sis->swap_file->f_mapping, 1);
+		if (!ret) {
+			sis->flags |= SWP_FILE;
+			ret = add_swap_extent(sis, 0, sis->max, 0);
+			*span = sis->pages;
+		}
+		goto done;
+	}
+#endif
+
 	blkbits = inode->i_blkbits;
 	blocks_per_page = PAGE_SIZE >> blkbits;
 
@@ -1603,7 +1623,7 @@ asmlinkage long sys_swapon(const char __
 
 	mutex_lock(&swapon_mutex);
 	spin_lock(&swap_lock);
-	p->flags = SWP_ACTIVE;
+	p->flags |= SWP_WRITEOK;
 	nr_swap_pages += nr_good_pages;
 	total_swap_pages += nr_good_pages;
 
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -428,6 +428,7 @@ struct address_space_operations {
 	int (*migratepage) (struct address_space *,
 			struct page *, struct page *);
 	int (*launder_page) (struct page *);
+	int (*swapfile)(struct address_space *, int);
 };
 
 struct backing_dev_info;
Index: linux-2.6/Documentation/filesystems/Locking
===================================================================
--- linux-2.6.orig/Documentation/filesystems/Locking
+++ linux-2.6/Documentation/filesystems/Locking
@@ -172,6 +172,7 @@ prototypes:
 	int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
 			loff_t offset, unsigned long nr_segs);
 	int (*launder_page) (struct page *);
+	int (*swapfile) (struct address_space *, int);
 
 locking rules:
 	All except set_page_dirty may block
@@ -190,6 +191,7 @@ invalidatepage:		no	yes
 releasepage:		no	yes
 direct_IO:		no
 launder_page:		no	yes
+swapfile		no
 
 	->prepare_write(), ->commit_write(), ->sync_page() and ->readpage()
 may be called from the request handler (/dev/loop).
@@ -289,6 +291,13 @@ cleaned, or an error value if not. Note 
 getting mapped back in and redirtied, it needs to be kept locked
 across the entire operation.
 
+	->swapfile() will be called with a non zero argument on address spaces
+backing non block device backed swapfiles. A return value of zero indicates
+success. In which case this address space can be used for backing swapspace.
+The swapspace operations will be proxied to the address space operations.
+Swapoff will call this method with a zero argument to release the address
+space.
+
 	Note: currently almost all instances of address_space methods are
 using BKL for internal serialization and that's one of the worst sources
 of contention. Normally they are calling library functions (in fs/buffer.c)
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig
+++ linux-2.6/mm/Kconfig
@@ -165,6 +165,9 @@ config ZONE_DMA_FLAG
 
 config SLAB_FAIR
 	def_bool n
+
+config SWAP_FILE
+	def_bool n
 #
 # Adaptive file readahead
 #

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 23/29] mm: methods for teaching filesystems about PG_swapcache pages
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
                   ` (21 preceding siblings ...)
  2007-02-21 14:43 ` [PATCH 22/29] mm: add support for non block device backed swap files Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 24/29] nfs: remove mempools Peter Zijlstra
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: mm-page_file_methods.patch --]
[-- Type: text/plain, Size: 2843 bytes --]

In order to teach filesystems to handle swap cache pages, two new page
functions are introduced:

  pgoff_t page_file_index(struct page *);
  struct address_space *page_file_mapping(struct page *);

page_file_index - gives the offset of this page in the file in PAGE_CACHE_SIZE
blocks. Like page->index is for mapped pages, this function also gives the
correct index for PG_swapcache pages.

page_file_mapping - gives the mapping backing the actual page; that is for
swap cache pages it will give swap_file->f_mapping.

page_offset() is modified to use page_file_index(), so that it will give the
expected result, even for PG_swapcache pages.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Trond Myklebust <trond.myklebust@fys.uio.no>
---
 include/linux/mm.h      |   25 +++++++++++++++++++++++++
 include/linux/pagemap.h |    2 +-
 2 files changed, 26 insertions(+), 1 deletion(-)

Index: linux-2.6-git/include/linux/mm.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm.h	2007-02-21 12:15:01.000000000 +0100
+++ linux-2.6-git/include/linux/mm.h	2007-02-21 12:15:07.000000000 +0100
@@ -594,6 +594,16 @@ static inline struct swap_info_struct *p
 	return get_swap_info_struct(swp_type(swap));
 }
 
+static inline
+struct address_space *page_file_mapping(struct page *page)
+{
+#ifdef CONFIG_SWAP_FILE
+	if (unlikely(PageSwapCache(page)))
+		return page_swap_info(page)->swap_file->f_mapping;
+#endif
+	return page->mapping;
+}
+
 static inline int PageAnon(struct page *page)
 {
 	return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
@@ -611,6 +621,21 @@ static inline pgoff_t page_index(struct 
 }
 
 /*
+ * Return the file index of the page. Regular pagecache pages use ->index
+ * whereas swapcache pages use swp_offset(->private)
+ */
+static inline pgoff_t page_file_index(struct page *page)
+{
+#ifdef CONFIG_SWAP_FILE
+	if (unlikely(PageSwapCache(page))) {
+		swp_entry_t swap = { .val = page_private(page) };
+		return swp_offset(swap);
+	}
+#endif
+	return page->index;
+}
+
+/*
  * The atomic page->_mapcount, like _count, starts from -1:
  * so that transitions both from it and to it can be tracked,
  * using atomic_inc_and_test and atomic_add_negative(-1).
Index: linux-2.6-git/include/linux/pagemap.h
===================================================================
--- linux-2.6-git.orig/include/linux/pagemap.h	2007-02-21 12:14:54.000000000 +0100
+++ linux-2.6-git/include/linux/pagemap.h	2007-02-21 12:15:07.000000000 +0100
@@ -120,7 +120,7 @@ extern void __remove_from_page_cache(str
  */
 static inline loff_t page_offset(struct page *page)
 {
-	return ((loff_t)page->index) << PAGE_CACHE_SHIFT;
+	return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
 }
 
 static inline pgoff_t linear_page_index(struct vm_area_struct *vma,

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 24/29] nfs: remove mempools
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
                   ` (22 preceding siblings ...)
  2007-02-21 14:43 ` [PATCH 23/29] mm: methods for teaching filesystems about PG_swapcache pages Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 25/29] nfs: only use stable storage for swap Peter Zijlstra
                   ` (4 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: nfs-no-mempool.patch --]
[-- Type: text/plain, Size: 5239 bytes --]

With the introduction of the shared dirty page accounting in .19, NFS should
not be able to surpise the VM with all dirty pages. Thus it should always be
able to free some memory. Hence no more need for mempools.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
---
 fs/nfs/read.c  |   15 +++------------
 fs/nfs/write.c |   27 +++++----------------------
 2 files changed, 8 insertions(+), 34 deletions(-)

Index: linux-2.6-git/fs/nfs/read.c
===================================================================
--- linux-2.6-git.orig/fs/nfs/read.c	2007-02-21 12:14:54.000000000 +0100
+++ linux-2.6-git/fs/nfs/read.c	2007-02-21 12:15:10.000000000 +0100
@@ -32,14 +32,11 @@ static const struct rpc_call_ops nfs_rea
 static const struct rpc_call_ops nfs_read_full_ops;
 
 static struct kmem_cache *nfs_rdata_cachep;
-static mempool_t *nfs_rdata_mempool;
-
-#define MIN_POOL_READ	(32)
 
 struct nfs_read_data *nfs_readdata_alloc(size_t len)
 {
 	unsigned int pagecount = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	struct nfs_read_data *p = mempool_alloc(nfs_rdata_mempool, GFP_NOFS);
+	struct nfs_read_data *p = kmem_cache_alloc(nfs_rdata_cachep, GFP_NOFS);
 
 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -50,7 +47,7 @@ struct nfs_read_data *nfs_readdata_alloc
 		else {
 			p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
 			if (!p->pagevec) {
-				mempool_free(p, nfs_rdata_mempool);
+				kmem_cache_free(nfs_rdata_cachep, p);
 				p = NULL;
 			}
 		}
@@ -63,7 +60,7 @@ static void nfs_readdata_rcu_free(struct
 	struct nfs_read_data *p = container_of(head, struct nfs_read_data, task.u.tk_rcu);
 	if (p && (p->pagevec != &p->page_array[0]))
 		kfree(p->pagevec);
-	mempool_free(p, nfs_rdata_mempool);
+	kmem_cache_free(nfs_rdata_cachep, p);
 }
 
 static void nfs_readdata_free(struct nfs_read_data *rdata)
@@ -614,16 +611,10 @@ int __init nfs_init_readpagecache(void)
 	if (nfs_rdata_cachep == NULL)
 		return -ENOMEM;
 
-	nfs_rdata_mempool = mempool_create_slab_pool(MIN_POOL_READ,
-						     nfs_rdata_cachep);
-	if (nfs_rdata_mempool == NULL)
-		return -ENOMEM;
-
 	return 0;
 }
 
 void nfs_destroy_readpagecache(void)
 {
-	mempool_destroy(nfs_rdata_mempool);
 	kmem_cache_destroy(nfs_rdata_cachep);
 }
Index: linux-2.6-git/fs/nfs/write.c
===================================================================
--- linux-2.6-git.orig/fs/nfs/write.c	2007-02-21 12:14:54.000000000 +0100
+++ linux-2.6-git/fs/nfs/write.c	2007-02-21 12:15:10.000000000 +0100
@@ -29,9 +29,6 @@
 
 #define NFSDBG_FACILITY		NFSDBG_PAGECACHE
 
-#define MIN_POOL_WRITE		(32)
-#define MIN_POOL_COMMIT		(4)
-
 /*
  * Local function declarations
  */
@@ -45,12 +42,10 @@ static const struct rpc_call_ops nfs_wri
 static const struct rpc_call_ops nfs_commit_ops;
 
 static struct kmem_cache *nfs_wdata_cachep;
-static mempool_t *nfs_wdata_mempool;
-static mempool_t *nfs_commit_mempool;
 
 struct nfs_write_data *nfs_commit_alloc(void)
 {
-	struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOFS);
+	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
 
 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -64,7 +59,7 @@ void nfs_commit_rcu_free(struct rcu_head
 	struct nfs_write_data *p = container_of(head, struct nfs_write_data, task.u.tk_rcu);
 	if (p && (p->pagevec != &p->page_array[0]))
 		kfree(p->pagevec);
-	mempool_free(p, nfs_commit_mempool);
+	kmem_cache_free(nfs_wdata_cachep, p);
 }
 
 void nfs_commit_free(struct nfs_write_data *wdata)
@@ -75,7 +70,7 @@ void nfs_commit_free(struct nfs_write_da
 struct nfs_write_data *nfs_writedata_alloc(size_t len)
 {
 	unsigned int pagecount = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	struct nfs_write_data *p = mempool_alloc(nfs_wdata_mempool, GFP_NOFS);
+	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
 
 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -86,7 +81,7 @@ struct nfs_write_data *nfs_writedata_all
 		else {
 			p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
 			if (!p->pagevec) {
-				mempool_free(p, nfs_wdata_mempool);
+				kmem_cache_free(nfs_wdata_cachep, p);
 				p = NULL;
 			}
 		}
@@ -99,7 +94,7 @@ static void nfs_writedata_rcu_free(struc
 	struct nfs_write_data *p = container_of(head, struct nfs_write_data, task.u.tk_rcu);
 	if (p && (p->pagevec != &p->page_array[0]))
 		kfree(p->pagevec);
-	mempool_free(p, nfs_wdata_mempool);
+	kmem_cache_free(nfs_wdata_cachep, p);
 }
 
 static void nfs_writedata_free(struct nfs_write_data *wdata)
@@ -1517,16 +1512,6 @@ int __init nfs_init_writepagecache(void)
 	if (nfs_wdata_cachep == NULL)
 		return -ENOMEM;
 
-	nfs_wdata_mempool = mempool_create_slab_pool(MIN_POOL_WRITE,
-						     nfs_wdata_cachep);
-	if (nfs_wdata_mempool == NULL)
-		return -ENOMEM;
-
-	nfs_commit_mempool = mempool_create_slab_pool(MIN_POOL_COMMIT,
-						      nfs_wdata_cachep);
-	if (nfs_commit_mempool == NULL)
-		return -ENOMEM;
-
 	/*
 	 * NFS congestion size, scale with available memory.
 	 *
@@ -1552,8 +1537,6 @@ int __init nfs_init_writepagecache(void)
 
 void nfs_destroy_writepagecache(void)
 {
-	mempool_destroy(nfs_commit_mempool);
-	mempool_destroy(nfs_wdata_mempool);
 	kmem_cache_destroy(nfs_wdata_cachep);
 }
 

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 25/29] nfs: only use stable storage for swap
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
                   ` (23 preceding siblings ...)
  2007-02-21 14:43 ` [PATCH 24/29] nfs: remove mempools Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 26/29] nfs: teach the NFS client how to treat PG_swapcache pages Peter Zijlstra
                   ` (3 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: nfs-wb_priority.patch --]
[-- Type: text/plain, Size: 751 bytes --]

unstable writes don't make sense for swap pages.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
---
 fs/nfs/write.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6-git/fs/nfs/write.c
===================================================================
--- linux-2.6-git.orig/fs/nfs/write.c	2007-02-21 12:15:10.000000000 +0100
+++ linux-2.6-git/fs/nfs/write.c	2007-02-21 12:15:13.000000000 +0100
@@ -197,7 +197,7 @@ static int nfs_writepage_setup(struct nf
 static int wb_priority(struct writeback_control *wbc)
 {
 	if (wbc->for_reclaim)
-		return FLUSH_HIGHPRI;
+		return FLUSH_HIGHPRI|FLUSH_STABLE;
 	if (wbc->for_kupdate)
 		return FLUSH_LOWPRI;
 	return 0;

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 26/29] nfs: teach the NFS client how to treat PG_swapcache pages
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
                   ` (24 preceding siblings ...)
  2007-02-21 14:43 ` [PATCH 25/29] nfs: only use stable storage for swap Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 27/29] nfs: disable data cache revalidation for swapfiles Peter Zijlstra
                   ` (2 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: nfs-swapcache.patch --]
[-- Type: text/plain, Size: 9610 bytes --]

Replace all relevant occurences of page->index and page->mapping in the NFS
client with the new page_file_index() and page_file_mapping() functions.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
---
 fs/nfs/file.c     |    4 ++--
 fs/nfs/internal.h |    7 ++++---
 fs/nfs/pagelist.c |    6 +++---
 fs/nfs/read.c     |    6 +++---
 fs/nfs/write.c    |   35 ++++++++++++++++++-----------------
 5 files changed, 30 insertions(+), 28 deletions(-)

Index: linux-2.6/fs/nfs/file.c
===================================================================
--- linux-2.6.orig/fs/nfs/file.c
+++ linux-2.6/fs/nfs/file.c
@@ -310,7 +310,7 @@ static void nfs_invalidate_page(struct p
 	if (offset != 0)
 		return;
 	/* Cancel any unstarted writes on this page */
-	nfs_wb_page_priority(page->mapping->host, page, FLUSH_INVALIDATE);
+	nfs_wb_page_priority(page_file_mapping(page)->host, page, FLUSH_INVALIDATE);
 }
 
 static int nfs_release_page(struct page *page, gfp_t gfp)
@@ -321,7 +321,7 @@ static int nfs_release_page(struct page 
 
 static int nfs_launder_page(struct page *page)
 {
-	return nfs_wb_page(page->mapping->host, page);
+	return nfs_wb_page(page_file_mapping(page)->host, page);
 }
 
 const struct address_space_operations nfs_file_aops = {
Index: linux-2.6/fs/nfs/pagelist.c
===================================================================
--- linux-2.6.orig/fs/nfs/pagelist.c
+++ linux-2.6/fs/nfs/pagelist.c
@@ -81,11 +81,11 @@ nfs_create_request(struct nfs_open_conte
 	 * update_nfs_request below if the region is not locked. */
 	req->wb_page    = page;
 	atomic_set(&req->wb_complete, 0);
-	req->wb_index	= page->index;
+	req->wb_index	= page_file_index(page);
 	page_cache_get(page);
 	BUG_ON(PagePrivate(page));
 	BUG_ON(!PageLocked(page));
-	BUG_ON(page->mapping->host != inode);
+	BUG_ON(page_file_mapping(page)->host != inode);
 	req->wb_offset  = offset;
 	req->wb_pgbase	= offset;
 	req->wb_bytes   = count;
@@ -338,7 +338,7 @@ out:
  * @nfsi: NFS inode
  * @head: One of the NFS inode request lists
  * @dst: Destination list
- * @idx_start: lower bound of page->index to scan
+ * @idx_start: lower bound of page_file_index(page) to scan
  * @npages: idx_start + npages sets the upper bound to scan.
  *
  * Moves elements from one of the inode request lists.
Index: linux-2.6/fs/nfs/read.c
===================================================================
--- linux-2.6.orig/fs/nfs/read.c
+++ linux-2.6/fs/nfs/read.c
@@ -492,11 +492,11 @@ static const struct rpc_call_ops nfs_rea
 int nfs_readpage(struct file *file, struct page *page)
 {
 	struct nfs_open_context *ctx;
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	int		error;
 
 	dprintk("NFS: nfs_readpage (%p %ld@%lu)\n",
-		page, PAGE_CACHE_SIZE, page->index);
+		page, PAGE_CACHE_SIZE, page_file_index(page));
 	nfs_inc_stats(inode, NFSIOS_VFSREADPAGE);
 	nfs_add_stats(inode, NFSIOS_READPAGES, 1);
 
@@ -543,7 +543,7 @@ static int
 readpage_async_filler(void *data, struct page *page)
 {
 	struct nfs_readdesc *desc = (struct nfs_readdesc *)data;
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	struct nfs_page *new;
 	unsigned int len;
 
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -122,7 +122,7 @@ static struct nfs_page *nfs_page_find_re
 static struct nfs_page *nfs_page_find_request(struct page *page)
 {
 	struct nfs_page *req = NULL;
-	spinlock_t *req_lock = &NFS_I(page->mapping->host)->req_lock;
+	spinlock_t *req_lock = &NFS_I(page_file_mapping(page)->host)->req_lock;
 
 	spin_lock(req_lock);
 	req = nfs_page_find_request_locked(page);
@@ -133,13 +133,13 @@ static struct nfs_page *nfs_page_find_re
 /* Adjust the file length if we're writing beyond the end */
 static void nfs_grow_file(struct page *page, unsigned int offset, unsigned int count)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	loff_t end, i_size = i_size_read(inode);
 	unsigned long end_index = (i_size - 1) >> PAGE_CACHE_SHIFT;
 
-	if (i_size > 0 && page->index < end_index)
+	if (i_size > 0 && page_file_index(page) < end_index)
 		return;
-	end = ((loff_t)page->index << PAGE_CACHE_SHIFT) + ((loff_t)offset+count);
+	end = page_offset(page) + ((loff_t)offset+count);
 	if (i_size >= end)
 		return;
 	nfs_inc_stats(inode, NFSIOS_EXTENDWRITE);
@@ -150,7 +150,7 @@ static void nfs_grow_file(struct page *p
 static void nfs_set_pageerror(struct page *page)
 {
 	SetPageError(page);
-	nfs_zap_mapping(page->mapping->host, page->mapping);
+	nfs_zap_mapping(page_file_mapping(page)->host, page_file_mapping(page));
 }
 
 /* We can set the PG_uptodate flag if we see that a write request
@@ -182,7 +182,7 @@ static int nfs_writepage_setup(struct nf
 		ret = PTR_ERR(req);
 		if (ret != -EBUSY)
 			return ret;
-		ret = nfs_wb_page(page->mapping->host, page);
+		ret = nfs_wb_page(page_file_mapping(page)->host, page);
 		if (ret != 0)
 			return ret;
 	}
@@ -216,7 +216,7 @@ int nfs_congestion_kb;
 static void nfs_set_page_writeback(struct page *page)
 {
 	if (!test_set_page_writeback(page)) {
-		struct inode *inode = page->mapping->host;
+		struct inode *inode = page_file_mapping(page)->host;
 		struct nfs_server *nfss = NFS_SERVER(inode);
 
 		if (atomic_inc_return(&nfss->writeback) >
@@ -227,7 +227,7 @@ static void nfs_set_page_writeback(struc
 
 static void nfs_end_page_writeback(struct page *page)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	struct nfs_server *nfss = NFS_SERVER(inode);
 
 	end_page_writeback(page);
@@ -247,7 +247,7 @@ static void nfs_end_page_writeback(struc
 static int nfs_page_mark_flush(struct page *page)
 {
 	struct nfs_page *req;
-	spinlock_t *req_lock = &NFS_I(page->mapping->host)->req_lock;
+	spinlock_t *req_lock = &NFS_I(page_file_mapping(page)->host)->req_lock;
 	int ret;
 
 	spin_lock(req_lock);
@@ -287,7 +287,7 @@ static int nfs_page_mark_flush(struct pa
 static int nfs_writepage_locked(struct page *page, struct writeback_control *wbc)
 {
 	struct nfs_open_context *ctx;
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	unsigned offset;
 	int err;
 
@@ -316,7 +316,8 @@ static int nfs_writepage_locked(struct p
 		err = 0;
 out:
 	if (!wbc->for_writepages)
-		nfs_flush_mapping(page->mapping, wbc, FLUSH_STABLE|wb_priority(wbc));
+		nfs_flush_mapping(page_file_mapping(page), wbc,
+				  FLUSH_STABLE|wb_priority(wbc));
 	return err;
 }
 
@@ -518,7 +519,7 @@ static void nfs_cancel_commit_list(struc
  * nfs_scan_commit - Scan an inode for commit requests
  * @inode: NFS inode to scan
  * @dst: destination list
- * @idx_start: lower bound of page->index to scan.
+ * @idx_start: lower bound of page_file_index(page) to scan.
  * @npages: idx_start + npages sets the upper bound to scan.
  *
  * Moves requests from the inode's 'commit' request list.
@@ -583,7 +584,7 @@ static int nfs_wait_on_write_congestion(
 static struct nfs_page * nfs_update_request(struct nfs_open_context* ctx,
 		struct page *page, unsigned int offset, unsigned int bytes)
 {
-	struct address_space *mapping = page->mapping;
+	struct address_space *mapping = page_file_mapping(page);
 	struct inode *inode = mapping->host;
 	struct nfs_inode *nfsi = NFS_I(inode);
 	struct nfs_page		*req, *new = NULL;
@@ -688,7 +689,7 @@ int nfs_flush_incompatible(struct file *
 		nfs_release_request(req);
 		if (!do_flush)
 			return 0;
-		status = nfs_wb_page(page->mapping->host, page);
+		status = nfs_wb_page(page_file_mapping(page)->host, page);
 	} while (status == 0);
 	return status;
 }
@@ -703,7 +704,7 @@ int nfs_updatepage(struct file *file, st
 		unsigned int offset, unsigned int count)
 {
 	struct nfs_open_context *ctx = (struct nfs_open_context *)file->private_data;
-	struct inode	*inode = page->mapping->host;
+	struct inode	*inode = page_file_mapping(page)->host;
 	int		status = 0;
 
 	nfs_inc_stats(inode, NFSIOS_VFSUPDATEPAGE);
@@ -1456,7 +1457,7 @@ int nfs_wb_page_priority(struct inode *i
 	loff_t range_start = page_offset(page);
 	loff_t range_end = range_start + (loff_t)(PAGE_CACHE_SIZE - 1);
 	struct writeback_control wbc = {
-		.bdi = page->mapping->backing_dev_info,
+		.bdi = page_file_mapping(page)->backing_dev_info,
 		.sync_mode = WB_SYNC_ALL,
 		.nr_to_write = LONG_MAX,
 		.range_start = range_start,
@@ -1472,7 +1473,7 @@ int nfs_wb_page_priority(struct inode *i
 	}
 	if (!PagePrivate(page))
 		return 0;
-	ret = nfs_sync_mapping_wait(page->mapping, &wbc, how);
+	ret = nfs_sync_mapping_wait(page_file_mapping(page), &wbc, how);
 	if (ret >= 0)
 		return 0;
 out:
Index: linux-2.6/fs/nfs/internal.h
===================================================================
--- linux-2.6.orig/fs/nfs/internal.h
+++ linux-2.6/fs/nfs/internal.h
@@ -220,13 +220,14 @@ void nfs_super_set_maxbytes(struct super
 static inline
 unsigned int nfs_page_length(struct page *page)
 {
-	loff_t i_size = i_size_read(page->mapping->host);
+	loff_t i_size = i_size_read(page_file_mapping(page)->host);
 
 	if (i_size > 0) {
+		pgoff_t page_index = page_file_index(page);
 		pgoff_t end_index = (i_size - 1) >> PAGE_CACHE_SHIFT;
-		if (page->index < end_index)
+		if (page_index < end_index)
 			return PAGE_CACHE_SIZE;
-		if (page->index == end_index)
+		if (page_index == end_index)
 			return ((i_size - 1) & ~PAGE_CACHE_MASK) + 1;
 	}
 	return 0;

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 27/29] nfs: disable data cache revalidation for swapfiles
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
                   ` (25 preceding siblings ...)
  2007-02-21 14:43 ` [PATCH 26/29] nfs: teach the NFS client how to treat PG_swapcache pages Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 28/29] nfs: enable swap on NFS Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 29/29] balance_dirty_pages() vs throttle_vm_writeout() deadlock Peter Zijlstra
  28 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: nfs-swapper.patch --]
[-- Type: text/plain, Size: 4874 bytes --]

Do as Trond suggested:
  http://lkml.org/lkml/2006/8/25/348

Disable NFS data cache revalidation on swap files since it doesn't really 
make sense to have other clients change the file while you are using it.

Thereby we can stop setting PG_private on swap pages, since there ought to
be no further races with invalidate_inode_pages2() to deal with.

And since we cannot set PG_private we cannot use page->private (which is
already used by PG_swapcache pages anyway) to store the nfs_page. Thus
augment the new nfs_page_find_request logic.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
---
 fs/nfs/inode.c |    6 ++++++
 fs/nfs/write.c |   35 +++++++++++++++++++++++------------
 2 files changed, 29 insertions(+), 12 deletions(-)

Index: linux-2.6-git/fs/nfs/inode.c
===================================================================
--- linux-2.6-git.orig/fs/nfs/inode.c	2007-02-21 11:04:08.000000000 +0100
+++ linux-2.6-git/fs/nfs/inode.c	2007-02-21 11:52:21.000000000 +0100
@@ -719,6 +719,12 @@ int nfs_revalidate_mapping_nolock(struct
 	struct nfs_inode *nfsi = NFS_I(inode);
 	int ret = 0;
 
+	/*
+	 * swapfiles are not supposed to be shared.
+	 */
+	if (IS_SWAPFILE(inode))
+		goto out;
+
 	if ((nfsi->cache_validity & NFS_INO_REVAL_PAGECACHE)
 			|| nfs_attribute_timeout(inode) || NFS_STALE(inode)) {
 		ret = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
Index: linux-2.6-git/fs/nfs/write.c
===================================================================
--- linux-2.6-git.orig/fs/nfs/write.c	2007-02-21 11:52:17.000000000 +0100
+++ linux-2.6-git/fs/nfs/write.c	2007-02-21 11:53:18.000000000 +0100
@@ -107,7 +107,7 @@ void nfs_writedata_release(void *wdata)
 	nfs_writedata_free(wdata);
 }
 
-static struct nfs_page *nfs_page_find_request_locked(struct page *page)
+static struct nfs_page *nfs_page_find_request_locked(struct nfs_inode *nfsi, struct page *page)
 {
 	struct nfs_page *req = NULL;
 
@@ -115,6 +115,10 @@ static struct nfs_page *nfs_page_find_re
 		req = (struct nfs_page *)page_private(page);
 		if (req != NULL)
 			atomic_inc(&req->wb_count);
+	} else if (unlikely(PageSwapCache(page))) {
+		req = radix_tree_lookup(&nfsi->nfs_page_tree, page_file_index(page));
+		if (req != NULL)
+			atomic_inc(&req->wb_count);
 	}
 	return req;
 }
@@ -122,10 +126,11 @@ static struct nfs_page *nfs_page_find_re
 static struct nfs_page *nfs_page_find_request(struct page *page)
 {
 	struct nfs_page *req = NULL;
-	spinlock_t *req_lock = &NFS_I(page_file_mapping(page)->host)->req_lock;
+	struct nfs_inode *nfsi = NFS_I(page_file_mapping(page)->host);
+	spinlock_t *req_lock = &nfsi->req_lock;
 
 	spin_lock(req_lock);
-	req = nfs_page_find_request_locked(page);
+	req = nfs_page_find_request_locked(nfsi, page);
 	spin_unlock(req_lock);
 	return req;
 }
@@ -248,12 +253,13 @@ static void nfs_end_page_writeback(struc
 static int nfs_page_mark_flush(struct page *page)
 {
 	struct nfs_page *req;
-	spinlock_t *req_lock = &NFS_I(page_file_mapping(page)->host)->req_lock;
+	struct nfs_inode *nfsi = NFS_I(page_file_mapping(page)->host);
+	spinlock_t *req_lock = &nfsi->req_lock;
 	int ret;
 
 	spin_lock(req_lock);
 	for(;;) {
-		req = nfs_page_find_request_locked(page);
+		req = nfs_page_find_request_locked(nfsi, page);
 		if (req == NULL) {
 			spin_unlock(req_lock);
 			return 1;
@@ -368,8 +374,14 @@ static int nfs_inode_add_request(struct 
 		if (nfs_have_delegation(inode, FMODE_WRITE))
 			nfsi->change_attr++;
 	}
-	SetPagePrivate(req->wb_page);
-	set_page_private(req->wb_page, (unsigned long)req);
+	/*
+	 * Swap-space should not get truncated. Hence no need to plug the race
+	 * with invalidate/truncate.
+	 */
+	if (likely(!PageSwapCache(req->wb_page))) {
+		SetPagePrivate(req->wb_page);
+		set_page_private(req->wb_page, (unsigned long)req);
+	}
 	nfsi->npages++;
 	atomic_inc(&req->wb_count);
 	return 0;
@@ -386,8 +398,10 @@ static void nfs_inode_remove_request(str
 	BUG_ON (!NFS_WBACK_BUSY(req));
 
 	spin_lock(&nfsi->req_lock);
-	set_page_private(req->wb_page, 0);
-	ClearPagePrivate(req->wb_page);
+	if (likely(!PageSwapCache(req->wb_page))) {
+		set_page_private(req->wb_page, 0);
+		ClearPagePrivate(req->wb_page);
+	}
 	radix_tree_delete(&nfsi->nfs_page_tree, req->wb_index);
 	nfsi->npages--;
 	if (!nfsi->npages) {
@@ -600,7 +614,7 @@ static struct nfs_page * nfs_update_requ
 		 * A request for the page we wish to update
 		 */
 		spin_lock(&nfsi->req_lock);
-		req = nfs_page_find_request_locked(page);
+		req = nfs_page_find_request_locked(nfsi, page);
 		if (req) {
 			if (!nfs_lock_request_dontget(req)) {
 				int error;
@@ -1472,8 +1486,6 @@ int nfs_wb_page_priority(struct inode *i
 		if (ret < 0)
 			goto out;
 	}
-	if (!PagePrivate(page))
-		return 0;
 	ret = nfs_sync_mapping_wait(page_file_mapping(page), &wbc, how);
 	if (ret >= 0)
 		return 0;

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 28/29] nfs: enable swap on NFS
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
                   ` (26 preceding siblings ...)
  2007-02-21 14:43 ` [PATCH 27/29] nfs: disable data cache revalidation for swapfiles Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  2007-02-21 14:43 ` [PATCH 29/29] balance_dirty_pages() vs throttle_vm_writeout() deadlock Peter Zijlstra
  28 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: nfs-swapfile.patch --]
[-- Type: text/plain, Size: 7663 bytes --]

Provide an ops->swapfile() implementation for NFS. This will set the
NFS socket to SOCK_VMIO and run socket reconnect under PF_MEMALLOC as well
as reset SOCK_VMIO before engaging the protocol ->connect() method.

PF_MEMALLOC should allow the allocation of struct socket and related objects
and the early (re)setting of SOCK_VMIO should allow us to receive the packets
required for the TCP connection buildup.

(swapping continues over a server reset during heavy network traffic)

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
---
 fs/Kconfig                  |   14 ++++++++++++
 fs/nfs/file.c               |    6 +++++
 include/linux/sunrpc/xprt.h |    5 +++-
 net/sunrpc/sched.c          |   13 +++++++----
 net/sunrpc/xprtsock.c       |   49 ++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 81 insertions(+), 6 deletions(-)

Index: linux-2.6-git/fs/nfs/file.c
===================================================================
--- linux-2.6-git.orig/fs/nfs/file.c	2007-02-21 12:15:16.000000000 +0100
+++ linux-2.6-git/fs/nfs/file.c	2007-02-21 12:15:19.000000000 +0100
@@ -324,6 +324,11 @@ static int nfs_launder_page(struct page 
 	return nfs_wb_page(page_file_mapping(page)->host, page);
 }
 
+static int nfs_swapfile(struct address_space *mapping, int enable)
+{
+	return xs_swapper(NFS_CLIENT(mapping->host)->cl_xprt, enable);
+}
+
 const struct address_space_operations nfs_file_aops = {
 	.readpage = nfs_readpage,
 	.readpages = nfs_readpages,
@@ -338,6 +343,7 @@ const struct address_space_operations nf
 	.direct_IO = nfs_direct_IO,
 #endif
 	.launder_page = nfs_launder_page,
+	.swapfile = nfs_swapfile,
 };
 
 static ssize_t nfs_file_write(struct kiocb *iocb, const struct iovec *iov,
Index: linux-2.6-git/include/linux/sunrpc/xprt.h
===================================================================
--- linux-2.6-git.orig/include/linux/sunrpc/xprt.h	2007-02-21 11:04:08.000000000 +0100
+++ linux-2.6-git/include/linux/sunrpc/xprt.h	2007-02-21 12:15:19.000000000 +0100
@@ -149,7 +149,9 @@ struct rpc_xprt {
 	unsigned int		max_reqs;	/* total slots */
 	unsigned long		state;		/* transport state */
 	unsigned char		shutdown   : 1,	/* being shut down */
-				resvport   : 1; /* use a reserved port */
+				resvport   : 1, /* use a reserved port */
+				swapper    : 1; /* we're swapping over this
+						   transport */
 
 	/*
 	 * Connection of transports
@@ -241,6 +243,7 @@ void			xprt_disconnect(struct rpc_xprt *
  */
 struct rpc_xprt *	xs_setup_udp(struct sockaddr *addr, size_t addrlen, struct rpc_timeout *to);
 struct rpc_xprt *	xs_setup_tcp(struct sockaddr *addr, size_t addrlen, struct rpc_timeout *to);
+int			xs_swapper(struct rpc_xprt *xprt, int enable);
 
 /*
  * Reserved bit positions in xprt->state
Index: linux-2.6-git/net/sunrpc/sched.c
===================================================================
--- linux-2.6-git.orig/net/sunrpc/sched.c	2007-02-21 11:04:08.000000000 +0100
+++ linux-2.6-git/net/sunrpc/sched.c	2007-02-21 12:15:19.000000000 +0100
@@ -751,10 +751,13 @@ void * rpc_malloc(struct rpc_task *task,
 	struct rpc_rqst *req = task->tk_rqstp;
 	gfp_t	gfp;
 
-	if (task->tk_flags & RPC_TASK_SWAPPER)
-		gfp = GFP_ATOMIC;
-	else
-		gfp = GFP_NOFS;
+	/*
+	 * this rcpio thread might be needed by reclaim, hence we cannot
+	 * wait on a regular alloc to succeed.
+	 */
+	gfp = GFP_ATOMIC;
+	if (RPC_IS_SWAPPER(task))
+		gfp |= __GFP_EMERGENCY;
 
 	if (size > RPC_BUFFER_MAXSIZE) {
 		req->rq_buffer = kmalloc(size, gfp);
@@ -834,7 +837,7 @@ void rpc_init_task(struct rpc_task *task
 static struct rpc_task *
 rpc_alloc_task(void)
 {
-	return (struct rpc_task *)mempool_alloc(rpc_task_mempool, GFP_NOFS);
+	return (struct rpc_task *)mempool_alloc(rpc_task_mempool, GFP_NOIO);
 }
 
 static void rpc_free_task(struct rcu_head *rcu)
Index: linux-2.6-git/net/sunrpc/xprtsock.c
===================================================================
--- linux-2.6-git.orig/net/sunrpc/xprtsock.c	2007-02-21 11:04:08.000000000 +0100
+++ linux-2.6-git/net/sunrpc/xprtsock.c	2007-02-21 12:15:19.000000000 +0100
@@ -1215,11 +1215,15 @@ static void xs_udp_connect_worker(struct
 		container_of(work, struct sock_xprt, connect_worker.work);
 	struct rpc_xprt *xprt = &transport->xprt;
 	struct socket *sock = transport->sock;
+	unsigned long pflags = current->flags;
 	int err, status = -EIO;
 
 	if (xprt->shutdown || !xprt_bound(xprt))
 		goto out;
 
+	if (xprt->swapper)
+		current->flags |= PF_MEMALLOC;
+
 	/* Start by resetting any existing state */
 	xs_close(xprt);
 
@@ -1257,6 +1261,9 @@ static void xs_udp_connect_worker(struct
 		transport->sock = sock;
 		transport->inet = sk;
 
+		if (xprt->swapper)
+			sk_set_vmio(sk);
+
 		write_unlock_bh(&sk->sk_callback_lock);
 	}
 	xs_udp_do_set_buffer_size(xprt);
@@ -1264,6 +1271,7 @@ static void xs_udp_connect_worker(struct
 out:
 	xprt_wake_pending_tasks(xprt, status);
 	xprt_clear_connecting(xprt);
+	current->flags = pflags;
 }
 
 /*
@@ -1302,11 +1310,15 @@ static void xs_tcp_connect_worker(struct
 		container_of(work, struct sock_xprt, connect_worker.work);
 	struct rpc_xprt *xprt = &transport->xprt;
 	struct socket *sock = transport->sock;
+	unsigned long pflags = current->flags;
 	int err, status = -EIO;
 
 	if (xprt->shutdown || !xprt_bound(xprt))
 		goto out;
 
+	if (xprt->swapper)
+		current->flags |= PF_MEMALLOC;
+
 	if (!sock) {
 		/* start from scratch */
 		if ((err = sock_create_kern(PF_INET, SOCK_STREAM, IPPROTO_TCP, &sock)) < 0) {
@@ -1356,6 +1368,10 @@ static void xs_tcp_connect_worker(struct
 		write_unlock_bh(&sk->sk_callback_lock);
 	}
 
+
+	if (xprt->swapper)
+		sk_set_vmio(transport->inet);
+
 	/* Tell the socket layer to start connecting... */
 	xprt->stat.connect_count++;
 	xprt->stat.connect_start = jiffies;
@@ -1383,6 +1399,7 @@ out:
 	xprt_wake_pending_tasks(xprt, status);
 out_clear:
 	xprt_clear_connecting(xprt);
+	current->flags = pflags;
 }
 
 /**
@@ -1642,6 +1659,38 @@ int init_socket_xprt(void)
 	return 0;
 }
 
+#define RPC_BUF_RESERVE_PAGES	(RPC_MAX_SLOT_TABLE)
+#define RPC_RESERVE_PAGES	(RPC_BUF_RESERVE_PAGES + TX_RESERVE_PAGES)
+
+/**
+ * xs_swapper - Tag this transport as being used for swap.
+ * @xprt: transport to tag
+ * @enable: enable/disable
+ *
+ */
+int xs_swapper(struct rpc_xprt *xprt, int enable)
+{
+	struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
+	int err = 0;
+
+	if (enable) {
+		/*
+		 * keep one extra sock reference so the reserve won't dip
+		 * when the socket gets reconnected.
+		 */
+		sk_adjust_memalloc(1, RPC_RESERVE_PAGES);
+		sk_set_vmio(transport->inet);
+		xprt->swapper = 1;
+	} else if (xprt->swapper) {
+		xprt->swapper = 0;
+		sk_clear_vmio(transport->inet);
+		sk_adjust_memalloc(-1, -RPC_RESERVE_PAGES);
+	}
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(xs_swapper);
+
 /**
  * cleanup_socket_xprt - remove xprtsock's sysctls
  *
Index: linux-2.6-git/fs/Kconfig
===================================================================
--- linux-2.6-git.orig/fs/Kconfig	2007-02-21 11:04:08.000000000 +0100
+++ linux-2.6-git/fs/Kconfig	2007-02-21 12:15:19.000000000 +0100
@@ -1621,6 +1621,20 @@ config NFS_DIRECTIO
 	  causes open() to return EINVAL if a file residing in NFS is
 	  opened with the O_DIRECT flag.
 
+config NFS_SWAP
+	bool "Provide swap over NFS support"
+	default n
+	depends on NFS_FS
+	select SLAB_FAIR
+	select NETVM
+	select SWAP_FILE
+	help
+	  This option enables swapon to work on files located on NFS mounts.
+
+	  For more details, see Documentation/vm_deadlock.txt
+
+	  If unsure, say N.
+
 config NFSD
 	tristate "NFS server support"
 	depends on INET

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 29/29] balance_dirty_pages() vs throttle_vm_writeout() deadlock
  2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
                   ` (27 preceding siblings ...)
  2007-02-21 14:43 ` [PATCH 28/29] nfs: enable swap on NFS Peter Zijlstra
@ 2007-02-21 14:43 ` Peter Zijlstra
  28 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-21 14:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller

[-- Attachment #1: nfs_mm-throttle_vm_writeout.patch --]
[-- Type: text/plain, Size: 1441 bytes --]

If we have a lot of dirty memory and hit the throttle in balance_dirty_pages()
we (potentially) generate a lot of writeback and unstable pages, if however
during this writeback we need to reclaim a bit, we might hit
throttle_vm_writeout(), which might delay us until the combined total of
NR_UNSTABLE_NFS + NR_WRITEBACK falls below the dirty limit.

However unstable pages don't go away automagickally, they need a push. While
balance_dirty_pages() does this push, throttle_vm_writeout() doesn't. So we can
sit here ad infintum.

Hence I propose to remove the NR_UNSTABLE_NFS count from throttle_vm_writeout().

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/page-writeback.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Index: linux-2.6-git/mm/page-writeback.c
===================================================================
--- linux-2.6-git.orig/mm/page-writeback.c	2007-02-20 15:07:43.000000000 +0100
+++ linux-2.6-git/mm/page-writeback.c	2007-02-20 16:42:45.000000000 +0100
@@ -310,8 +310,7 @@ void throttle_vm_writeout(void)
                  */
                 dirty_thresh += dirty_thresh / 10;      /* wheeee... */
 
-                if (global_page_state(NR_UNSTABLE_NFS) +
-			global_page_state(NR_WRITEBACK) <= dirty_thresh)
+                if (global_page_state(NR_WRITEBACK) <= dirty_thresh)
                         	break;
                 congestion_wait(WRITE, HZ/10);
         }

-- 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 09/29] selinux: tag avc cache alloc as non-critical
  2007-02-21 14:43 ` [PATCH 09/29] selinux: tag avc cache alloc as non-critical Peter Zijlstra
@ 2007-02-21 15:22   ` James Morris
  0 siblings, 0 replies; 46+ messages in thread
From: James Morris @ 2007-02-21 15:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller, Stephen Smalley

On Wed, 21 Feb 2007, Peter Zijlstra wrote:

> Failing to allocate a cache entry will only harm performance.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  security/selinux/avc.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Acked-by: James Morris <jmorris@namei.org>

> 
> Index: linux-2.6-git/security/selinux/avc.c
> ===================================================================
> --- linux-2.6-git.orig/security/selinux/avc.c	2007-02-14 08:31:13.000000000 +0100
> +++ linux-2.6-git/security/selinux/avc.c	2007-02-14 10:10:47.000000000 +0100
> @@ -332,7 +332,7 @@ static struct avc_node *avc_alloc_node(v
>  {
>  	struct avc_node *node;
>  
> -	node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC);
> +	node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC|__GFP_NOMEMALLOC);
>  	if (!node)
>  		goto out;
>  
> 
> 

-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/29] mm: slab allocation fairness
  2007-02-21 14:43 ` [PATCH 02/29] mm: slab allocation fairness Peter Zijlstra
@ 2007-02-21 15:33   ` Pekka Enberg
  0 siblings, 0 replies; 46+ messages in thread
From: Pekka Enberg @ 2007-02-21 15:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller

On 2/21/07, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> [AIM9 results go here]

Yes please. I would really like to know what we gain by making the
slab even more complex.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 08/29] mm: kmem_cache_objs_to_pages()
  2007-02-21 14:43 ` [PATCH 08/29] mm: kmem_cache_objs_to_pages() Peter Zijlstra
@ 2007-02-21 15:47   ` Pekka Enberg
  2007-02-22  9:28     ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Pekka Enberg @ 2007-02-21 15:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller

Hi Peter,

On 2/21/07, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> Provide a method to calculate the number of pages needed to store a given
> number of slab objects (upper bound when considering possible partial and
> free slabs).

So how does this work? You ask the slab allocator how many pages you
need for a given number of objects and then those pages are available
to it via the page allocator? Can other users also dip into those
reserves?

I would prefer we simply have an API for telling the slab allocator to
keep certain number of pages in a reserve for a cache rather than
exposing internals such as object size to rest of the world.

                                 Pekka

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 03/29] mm: allow PF_MEMALLOC from softirq context
  2007-02-21 14:43 ` [PATCH 03/29] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
@ 2007-02-21 15:53   ` Arjan van de Ven
  2007-02-22  9:16     ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Arjan van de Ven @ 2007-02-21 15:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller


> Index: linux-2.6-git/kernel/softirq.c
> ===================================================================
> --- linux-2.6-git.orig/kernel/softirq.c	2006-12-14 10:02:18.000000000 +0100
> +++ linux-2.6-git/kernel/softirq.c	2006-12-14 10:02:52.000000000 +0100
> @@ -209,6 +209,8 @@ asmlinkage void __do_softirq(void)
>  	__u32 pending;
>  	int max_restart = MAX_SOFTIRQ_RESTART;
>  	int cpu;
> +	unsigned long pflags = current->flags;
> +	current->flags &= ~PF_MEMALLOC;
>  
>  	pending = local_softirq_pending();
>  	account_system_vtime(current);
> @@ -247,6 +249,7 @@ restart:
>  
>  	account_system_vtime(current);
>  	_local_bh_enable();
> +	current->flags = pflags;

this wipes out all the flags in one go.... evil.
What if something just selected this process for OOM killing? you nuke
that flag here again. Would be nicer if only the PF_MEMALLOC bit got
inherited in the restore path..





^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 03/29] mm: allow PF_MEMALLOC from softirq context
  2007-02-21 15:53   ` Arjan van de Ven
@ 2007-02-22  9:16     ` Peter Zijlstra
  2007-02-22  9:48       ` Arjan van de Ven
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-22  9:16 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller

On Wed, 2007-02-21 at 16:53 +0100, Arjan van de Ven wrote:
> > Index: linux-2.6-git/kernel/softirq.c
> > ===================================================================
> > --- linux-2.6-git.orig/kernel/softirq.c	2006-12-14 10:02:18.000000000 +0100
> > +++ linux-2.6-git/kernel/softirq.c	2006-12-14 10:02:52.000000000 +0100
> > @@ -209,6 +209,8 @@ asmlinkage void __do_softirq(void)
> >  	__u32 pending;
> >  	int max_restart = MAX_SOFTIRQ_RESTART;
> >  	int cpu;
> > +	unsigned long pflags = current->flags;
> > +	current->flags &= ~PF_MEMALLOC;
> >  
> >  	pending = local_softirq_pending();
> >  	account_system_vtime(current);
> > @@ -247,6 +249,7 @@ restart:
> >  
> >  	account_system_vtime(current);
> >  	_local_bh_enable();
> > +	current->flags = pflags;
> 
> this wipes out all the flags in one go.... evil.
> What if something just selected this process for OOM killing? you nuke
> that flag here again. Would be nicer if only the PF_MEMALLOC bit got
> inherited in the restore path..

would something like this:

#define PF_PUSH(tsk, pflags, mask)		\
do {						\
	(pflags) = ((tsk)->flags) & (mask);	\
} while (0)


#define PF_POP(tsk, pflags, mask)		\
do {						\
	((tsk)->flags &= ~(mask);		\
	((tsk)->flags |= (pflags);		\
} while (0)

be useful, or shall I just open code it in various places?

(I made this same mistake; ignorant of the problem; all over this patch series)


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 08/29] mm: kmem_cache_objs_to_pages()
  2007-02-21 15:47   ` Pekka Enberg
@ 2007-02-22  9:28     ` Peter Zijlstra
  2007-02-22  9:45       ` Pekka Enberg
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-22  9:28 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller

On Wed, 2007-02-21 at 17:47 +0200, Pekka Enberg wrote:
> Hi Peter,
> 
> On 2/21/07, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > Provide a method to calculate the number of pages needed to store a given
> > number of slab objects (upper bound when considering possible partial and
> > free slabs).
> 
> So how does this work? You ask the slab allocator how many pages you
> need for a given number of objects and then those pages are available
> to it via the page allocator? Can other users also dip into those
> reserves?

Everybody (ab)using PF_MEMALLOC or the new __GFP_EMERGENCY.

> I would prefer we simply have an API for telling the slab allocator to
> keep certain number of pages in a reserve for a cache rather than
> exposing internals such as object size to rest of the world.

Keeping the free pages in the page allocator is good for the buddy
system. Although you could probably implement a reserve interface
without actually claiming the pages.

However, doing it like so separates the making of the reserve from the
actual kmem_cache object, I can just carry a sum of pages around instead
of a list of kmem_cache pointers.

I calculate a potential reserve, I might never actually commit to making
(and using) the reserve.

Also, I don't see what internals are exposed, kmem_cache is still
private to slab.c.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 08/29] mm: kmem_cache_objs_to_pages()
  2007-02-22  9:28     ` Peter Zijlstra
@ 2007-02-22  9:45       ` Pekka Enberg
  2007-02-22  9:49         ` Pekka Enberg
  0 siblings, 1 reply; 46+ messages in thread
From: Pekka Enberg @ 2007-02-22  9:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller

Hi Peter,

On Wed, 2007-02-21 at 17:47 +0200, Pekka Enberg wrote:
> > So how does this work? You ask the slab allocator how many pages you
> > need for a given number of objects and then those pages are available
> > to it via the page allocator? Can other users also dip into those
> > reserves?

On 2/22/07, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> Everybody (ab)using PF_MEMALLOC or the new __GFP_EMERGENCY.

So you are only interested in rough estimation of how much many pages
you need for a given amount of objects? Why not use ksize() for that
then?

                                    Pekka

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 03/29] mm: allow PF_MEMALLOC from softirq context
  2007-02-22  9:16     ` Peter Zijlstra
@ 2007-02-22  9:48       ` Arjan van de Ven
  0 siblings, 0 replies; 46+ messages in thread
From: Arjan van de Ven @ 2007-02-22  9:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller

On Thu, 2007-02-22 at 10:16 +0100, Peter Zijlstra wrote:
> On Wed, 2007-02-21 at 16:53 +0100, Arjan van de Ven wrote:
> > > Index: linux-2.6-git/kernel/softirq.c
> > > ===================================================================
> > > --- linux-2.6-git.orig/kernel/softirq.c	2006-12-14 10:02:18.000000000 +0100
> > > +++ linux-2.6-git/kernel/softirq.c	2006-12-14 10:02:52.000000000 +0100
> > > @@ -209,6 +209,8 @@ asmlinkage void __do_softirq(void)
> > >  	__u32 pending;
> > >  	int max_restart = MAX_SOFTIRQ_RESTART;
> > >  	int cpu;
> > > +	unsigned long pflags = current->flags;
> > > +	current->flags &= ~PF_MEMALLOC;
> > >  
> > >  	pending = local_softirq_pending();
> > >  	account_system_vtime(current);
> > > @@ -247,6 +249,7 @@ restart:
> > >  
> > >  	account_system_vtime(current);
> > >  	_local_bh_enable();
> > > +	current->flags = pflags;
> > 
> > this wipes out all the flags in one go.... evil.
> > What if something just selected this process for OOM killing? you nuke
> > that flag here again. Would be nicer if only the PF_MEMALLOC bit got
> > inherited in the restore path..
> 
> would something like this:
> 
> #define PF_PUSH(tsk, pflags, mask)		\
> do {						\
> 	(pflags) = ((tsk)->flags) & (mask);	\
> } while (0)
> 
> 
> #define PF_POP(tsk, pflags, mask)		\
> do {						\
> 	((tsk)->flags &= ~(mask);		\
> 	((tsk)->flags |= (pflags);		\
> } while (0)
> 
> be useful, or shall I just open code it in various places?

technically all you need is __get_bit and __set_bit() right?
(well a set_bit which sets to a value, not to always-1)

more generic name at least ;)

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 08/29] mm: kmem_cache_objs_to_pages()
  2007-02-22  9:45       ` Pekka Enberg
@ 2007-02-22  9:49         ` Pekka Enberg
  0 siblings, 0 replies; 46+ messages in thread
From: Pekka Enberg @ 2007-02-22  9:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller

On 2/22/07, Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> So you are only interested in rough estimation of how much many pages
> you need for a given amount of objects? Why not use ksize() for that
> then?

Uhm, I obviously meant, why not expose obj_size() instead.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 18/29] netfilter: notify about NF_QUEUE vs emergency skbs
  2007-02-21 14:43 ` [PATCH 18/29] netfilter: notify about NF_QUEUE vs emergency skbs Peter Zijlstra
@ 2007-02-24 15:27   ` Patrick McHardy
  2007-02-24 15:46     ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Patrick McHardy @ 2007-02-24 15:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller

Peter Zijlstra wrote:
> Emergency skbs should never touch user-space, however NF_QUEUE is fully user
> configurable. Notify the user of his mistake and try to continue.
>
> --- linux-2.6-git.orig/net/netfilter/core.c	2007-02-14 12:09:07.000000000 +0100
> +++ linux-2.6-git/net/netfilter/core.c	2007-02-14 12:09:18.000000000 +0100
> @@ -187,6 +187,11 @@ next_hook:
>  		kfree_skb(*pskb);
>  		ret = -EPERM;
>  	} else if ((verdict & NF_VERDICT_MASK)  == NF_QUEUE) {
> +		if (unlikely((*pskb)->emergency)) {
> +			printk(KERN_ERR "nf_hook: NF_QUEUE encountered for "
> +					"emergency skb - skipping rule.\n");
> +			goto next_hook;
> +		}

If I'm not mistaken any skb on the receive side might get
allocated from the reserve. I don't see how the user could
avoid this except by not using queueing at all.

I also didn't see a patch dropping packets allocated from
the reserve that are forwarded or processed directly without
getting queued to a socket, so this would allow them to
bypass userspace queueing and still go through.

I think the user should just exclude packets necessary for
swapping from queueing manually, based on IP addresses,
port numbers or something like that.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 18/29] netfilter: notify about NF_QUEUE vs emergency skbs
  2007-02-24 15:27   ` Patrick McHardy
@ 2007-02-24 15:46     ` Peter Zijlstra
  2007-02-24 16:17       ` Patrick McHardy
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-24 15:46 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller

On Sat, 2007-02-24 at 16:27 +0100, Patrick McHardy wrote:
> Peter Zijlstra wrote:
> > Emergency skbs should never touch user-space, however NF_QUEUE is fully user
> > configurable. Notify the user of his mistake and try to continue.
> >
> > --- linux-2.6-git.orig/net/netfilter/core.c	2007-02-14 12:09:07.000000000 +0100
> > +++ linux-2.6-git/net/netfilter/core.c	2007-02-14 12:09:18.000000000 +0100
> > @@ -187,6 +187,11 @@ next_hook:
> >  		kfree_skb(*pskb);
> >  		ret = -EPERM;
> >  	} else if ((verdict & NF_VERDICT_MASK)  == NF_QUEUE) {
> > +		if (unlikely((*pskb)->emergency)) {
> > +			printk(KERN_ERR "nf_hook: NF_QUEUE encountered for "
> > +					"emergency skb - skipping rule.\n");
> > +			goto next_hook;
> > +		}
> 
> If I'm not mistaken any skb on the receive side might get
> allocated from the reserve. I don't see how the user could
> avoid this except by not using queueing at all.

Well, the rules could be setup so that the storage path will never hit
the queue.

> I also didn't see a patch dropping packets allocated from
> the reserve that are forwarded or processed directly without
> getting queued to a socket, so this would allow them to
> bypass userspace queueing and still go through.
> 
> I think the user should just exclude packets necessary for
> swapping from queueing manually, based on IP addresses,
> port numbers or something like that.

Indeed, this patch will just warn the user that he did something very
wrong and should avoid this situation.

Perhaps skipping is not the proper action, but dropping them will most
certainly freeze the box. Either way seems unlucky. Might as well stick
BUG() in there :-(.

Any ideas on how to resolve this are most welcome, detecting the
situation on either rule insert or swapon and failing the respective
action would be most ideal, but I have no idea if that is feasible.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 18/29] netfilter: notify about NF_QUEUE vs emergency skbs
  2007-02-24 15:46     ` Peter Zijlstra
@ 2007-02-24 16:17       ` Patrick McHardy
  2007-02-24 16:18         ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Patrick McHardy @ 2007-02-24 16:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller

Peter Zijlstra wrote:
> On Sat, 2007-02-24 at 16:27 +0100, Patrick McHardy wrote:
> 
>>> 	} else if ((verdict & NF_VERDICT_MASK)  == NF_QUEUE) {
>>>+		if (unlikely((*pskb)->emergency)) {
>>>+			printk(KERN_ERR "nf_hook: NF_QUEUE encountered for "
>>>+					"emergency skb - skipping rule.\n");
>>>+			goto next_hook;
>>>+		}
>>
>>If I'm not mistaken any skb on the receive side might get
>>allocated from the reserve. I don't see how the user could
>>avoid this except by not using queueing at all.
> 
> 
> Well, the rules could be setup so that the storage path will never hit
> the queue.


Sure, but other packets might still get allocated from the
reserve and trigger this.

>>I think the user should just exclude packets necessary for
>>swapping from queueing manually, based on IP addresses,
>>port numbers or something like that.
> 
> 
> Indeed, this patch will just warn the user that he did something very
> wrong and should avoid this situation.
> 
> Perhaps skipping is not the proper action, but dropping them will most
> certainly freeze the box. Either way seems unlucky. Might as well stick
> BUG() in there :-(.


At this point we don't know whether the packet is destined for
a SOCK_VMIO socket or not. The only thing we know is that is
was allocated from the reserve, but it could be anything.
There is really nothing you can do at this point.

> Any ideas on how to resolve this are most welcome, detecting the
> situation on either rule insert or swapon and failing the respective
> action would be most ideal, but I have no idea if that is feasible.


Unfortunately this is not possible either. I don't really see why
queueing is special though, dropping the packets in the ruleset
will break things just as well, as will routing them to a blackhole.
I guess the user just needs to be smart enough not to do this.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 18/29] netfilter: notify about NF_QUEUE vs emergency skbs
  2007-02-24 16:17       ` Patrick McHardy
@ 2007-02-24 16:18         ` Peter Zijlstra
  2007-02-24 16:40           ` Patrick McHardy
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-24 16:18 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller

On Sat, 2007-02-24 at 17:17 +0100, Patrick McHardy wrote:

> I don't really see why
> queueing is special though, dropping the packets in the ruleset
> will break things just as well, as will routing them to a blackhole.
> I guess the user just needs to be smart enough not to do this.

Its user-space and no emergency packet may rely on user-space because it
most likely is needed to maintain user-space.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 18/29] netfilter: notify about NF_QUEUE vs emergency skbs
  2007-02-24 16:18         ` Peter Zijlstra
@ 2007-02-24 16:40           ` Patrick McHardy
  2007-02-24 16:55             ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Patrick McHardy @ 2007-02-24 16:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller

Peter Zijlstra wrote:
> On Sat, 2007-02-24 at 17:17 +0100, Patrick McHardy wrote:
> 
> 
>>I don't really see why
>>queueing is special though, dropping the packets in the ruleset
>>will break things just as well, as will routing them to a blackhole.
>>I guess the user just needs to be smart enough not to do this.
> 
> 
> Its user-space and no emergency packet may rely on user-space because it
> most likely is needed to maintain user-space.

I believe I might have misunderstood the intention of this patch.

Assuming the user is smart enough not to queue packets destined
to a SOCK_VMIO socket, are you worried about unrelated packets
allocated from the emergency reserve not getting freed fast
enough because they're sitting in a queue? In that case simply
dropping the packets would be fine I guess.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 18/29] netfilter: notify about NF_QUEUE vs emergency skbs
  2007-02-24 16:40           ` Patrick McHardy
@ 2007-02-24 16:55             ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-02-24 16:55 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller

On Sat, 2007-02-24 at 17:40 +0100, Patrick McHardy wrote:
> Peter Zijlstra wrote:
> > On Sat, 2007-02-24 at 17:17 +0100, Patrick McHardy wrote:
> > 
> > 
> >>I don't really see why
> >>queueing is special though, dropping the packets in the ruleset
> >>will break things just as well, as will routing them to a blackhole.
> >>I guess the user just needs to be smart enough not to do this.
> > 
> > 
> > Its user-space and no emergency packet may rely on user-space because it
> > most likely is needed to maintain user-space.
> 
> I believe I might have misunderstood the intention of this patch.
> 
> Assuming the user is smart enough not to queue packets destined
> to a SOCK_VMIO socket, are you worried about unrelated packets
> allocated from the emergency reserve not getting freed fast
> enough because they're sitting in a queue? In that case simply
> dropping the packets would be fine I guess.

OK, that sounds good. I shall make NF_QUEUE a black hole for emergency
packets.

Alas, that leaves no way to warn a user about a SOCK_VMIO bound packet
treated this way, since, as you said, that is unknown at this point in
the chain.

Thanks,
Peter


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 27/29] nfs: disable data cache revalidation for swapfiles
  2007-12-14 15:39 [PATCH 00/29] Swap over NFS -v15 Peter Zijlstra
@ 2007-12-14 15:39 ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-12-14 15:39 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: nfs-swapper.patch --]
[-- Type: text/plain, Size: 5230 bytes --]

Do as Trond suggested:
  http://lkml.org/lkml/2006/8/25/348

Disable NFS data cache revalidation on swap files since it doesn't really 
make sense to have other clients change the file while you are using it.

Thereby we can stop setting PG_private on swap pages, since there ought to
be no further races with invalidate_inode_pages2() to deal with.

And since we cannot set PG_private we cannot use page->private (which is
already used by PG_swapcache pages anyway) to store the nfs_page. Thus
augment the new nfs_page_find_request logic.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/nfs/inode.c |    6 ++++
 fs/nfs/write.c |   73 ++++++++++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 65 insertions(+), 14 deletions(-)

Index: linux-2.6/fs/nfs/inode.c
===================================================================
--- linux-2.6.orig/fs/nfs/inode.c
+++ linux-2.6/fs/nfs/inode.c
@@ -758,6 +758,12 @@ int nfs_revalidate_mapping_nolock(struct
 	struct nfs_inode *nfsi = NFS_I(inode);
 	int ret = 0;
 
+	/*
+	 * swapfiles are not supposed to be shared.
+	 */
+	if (IS_SWAPFILE(inode))
+		goto out;
+
 	if ((nfsi->cache_validity & NFS_INO_REVAL_PAGECACHE)
 			|| nfs_attribute_timeout(inode) || NFS_STALE(inode)) {
 		ret = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -112,25 +112,62 @@ static void nfs_context_set_write_error(
 	set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
 }
 
-static struct nfs_page *nfs_page_find_request_locked(struct page *page)
+static struct nfs_page *
+__nfs_page_find_request_locked(struct nfs_inode *nfsi, struct page *page, int get)
 {
 	struct nfs_page *req = NULL;
 
-	if (PagePrivate(page)) {
+	if (PagePrivate(page))
 		req = (struct nfs_page *)page_private(page);
-		if (req != NULL)
-			kref_get(&req->wb_kref);
-	}
+	else if (unlikely(PageSwapCache(page)))
+		req = radix_tree_lookup(&nfsi->nfs_page_tree, page_file_index(page));
+
+	if (get && req)
+		kref_get(&req->wb_kref);
+
 	return req;
 }
 
+static inline struct nfs_page *
+nfs_page_find_request_locked(struct nfs_inode *nfsi, struct page *page)
+{
+	return __nfs_page_find_request_locked(nfsi, page, 1);
+}
+
+static int __nfs_page_has_request(struct page *page)
+{
+	struct inode *inode = page_file_mapping(page)->host;
+	struct nfs_page *req = NULL;
+
+	spin_lock(&inode->i_lock);
+	req = __nfs_page_find_request_locked(NFS_I(inode), page, 0);
+	spin_unlock(&inode->i_lock);
+
+	/*
+	 * hole here plugged by the caller holding onto PG_locked
+	 */
+
+	return req != NULL;
+}
+
+static inline int nfs_page_has_request(struct page *page)
+{
+	if (PagePrivate(page))
+		return 1;
+
+	if (unlikely(PageSwapCache(page)))
+		return __nfs_page_has_request(page);
+
+	return 0;
+}
+
 static struct nfs_page *nfs_page_find_request(struct page *page)
 {
 	struct inode *inode = page_file_mapping(page)->host;
 	struct nfs_page *req = NULL;
 
 	spin_lock(&inode->i_lock);
-	req = nfs_page_find_request_locked(page);
+	req = nfs_page_find_request_locked(NFS_I(inode), page);
 	spin_unlock(&inode->i_lock);
 	return req;
 }
@@ -253,7 +290,7 @@ static int nfs_page_async_flush(struct n
 
 	spin_lock(&inode->i_lock);
 	for(;;) {
-		req = nfs_page_find_request_locked(page);
+		req = nfs_page_find_request_locked(nfsi, page);
 		if (req == NULL) {
 			spin_unlock(&inode->i_lock);
 			return 0;
@@ -370,8 +407,14 @@ static void nfs_inode_add_request(struct
 		if (nfs_have_delegation(inode, FMODE_WRITE))
 			nfsi->change_attr++;
 	}
-	SetPagePrivate(req->wb_page);
-	set_page_private(req->wb_page, (unsigned long)req);
+	/*
+	 * Swap-space should not get truncated. Hence no need to plug the race
+	 * with invalidate/truncate.
+	 */
+	if (likely(!PageSwapCache(req->wb_page))) {
+		SetPagePrivate(req->wb_page);
+		set_page_private(req->wb_page, (unsigned long)req);
+	}
 	nfsi->npages++;
 	kref_get(&req->wb_kref);
 }
@@ -387,8 +430,10 @@ static void nfs_inode_remove_request(str
 	BUG_ON (!NFS_WBACK_BUSY(req));
 
 	spin_lock(&inode->i_lock);
-	set_page_private(req->wb_page, 0);
-	ClearPagePrivate(req->wb_page);
+	if (likely(!PageSwapCache(req->wb_page))) {
+		set_page_private(req->wb_page, 0);
+		ClearPagePrivate(req->wb_page);
+	}
 	radix_tree_delete(&nfsi->nfs_page_tree, req->wb_index);
 	nfsi->npages--;
 	if (!nfsi->npages) {
@@ -594,7 +639,7 @@ static struct nfs_page * nfs_update_requ
 		}
 
 		spin_lock(&inode->i_lock);
-		req = nfs_page_find_request_locked(page);
+		req = nfs_page_find_request_locked(NFS_I(inode), page);
 		if (req) {
 			if (!nfs_lock_request_dontget(req)) {
 				int error;
@@ -1426,7 +1471,7 @@ int nfs_wb_page_cancel(struct inode *ino
 		if (ret < 0)
 			goto out;
 	}
-	if (!PagePrivate(page))
+	if (!nfs_page_has_request(page))
 		return 0;
 	ret = nfs_sync_mapping_wait(page_file_mapping(page), &wbc, FLUSH_INVALIDATE);
 out:
@@ -1453,7 +1498,7 @@ static int nfs_wb_page_priority(struct i
 		if (ret < 0)
 			goto out;
 	}
-	if (!PagePrivate(page))
+	if (!nfs_page_has_request(page))
 		return 0;
 	ret = nfs_sync_mapping_wait(page_file_mapping(page), &wbc, how);
 	if (ret >= 0)

--


^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2007-12-14 19:33 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-02-21 14:43 [PATCH 00/29] swap over networked storage -v11 Peter Zijlstra
2007-02-21 14:43 ` [PATCH 01/29] mm: page allocation rank Peter Zijlstra
2007-02-21 14:43 ` [PATCH 02/29] mm: slab allocation fairness Peter Zijlstra
2007-02-21 15:33   ` Pekka Enberg
2007-02-21 14:43 ` [PATCH 03/29] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
2007-02-21 15:53   ` Arjan van de Ven
2007-02-22  9:16     ` Peter Zijlstra
2007-02-22  9:48       ` Arjan van de Ven
2007-02-21 14:43 ` [PATCH 04/29] mm: serialize access to min_free_kbytes Peter Zijlstra
2007-02-21 14:43 ` [PATCH 05/29] mm: emergency pool Peter Zijlstra
2007-02-21 14:43 ` [PATCH 06/29] mm: __GFP_EMERGENCY Peter Zijlstra
2007-02-21 14:43 ` [PATCH 07/29] mm: allow mempool to fall back to memalloc reserves Peter Zijlstra
2007-02-21 14:43 ` [PATCH 08/29] mm: kmem_cache_objs_to_pages() Peter Zijlstra
2007-02-21 15:47   ` Pekka Enberg
2007-02-22  9:28     ` Peter Zijlstra
2007-02-22  9:45       ` Pekka Enberg
2007-02-22  9:49         ` Pekka Enberg
2007-02-21 14:43 ` [PATCH 09/29] selinux: tag avc cache alloc as non-critical Peter Zijlstra
2007-02-21 15:22   ` James Morris
2007-02-21 14:43 ` [PATCH 10/29] net: wrap sk->sk_backlog_rcv() Peter Zijlstra
2007-02-21 14:43 ` [PATCH 11/29] net: packet split receive api Peter Zijlstra
2007-02-21 14:43 ` [PATCH 12/29] net: remove alloc_skb_from_cache Peter Zijlstra
2007-02-21 14:43 ` [PATCH 13/29] netvm: link network to vm layer Peter Zijlstra
2007-02-21 14:43 ` [PATCH 14/29] netvm: INET reserves Peter Zijlstra
2007-02-21 14:43 ` [PATCH 15/29] netvm: hook skb allocation to reserves Peter Zijlstra
2007-02-21 14:43 ` [PATCH 16/29] netvm: filter emergency skbs Peter Zijlstra
2007-02-21 14:43 ` [PATCH 17/29] netvm: prevent a TCP specific deadlock Peter Zijlstra
2007-02-21 14:43 ` [PATCH 18/29] netfilter: notify about NF_QUEUE vs emergency skbs Peter Zijlstra
2007-02-24 15:27   ` Patrick McHardy
2007-02-24 15:46     ` Peter Zijlstra
2007-02-24 16:17       ` Patrick McHardy
2007-02-24 16:18         ` Peter Zijlstra
2007-02-24 16:40           ` Patrick McHardy
2007-02-24 16:55             ` Peter Zijlstra
2007-02-21 14:43 ` [PATCH 19/29] netvm: skb processing Peter Zijlstra
2007-02-21 14:43 ` [PATCH 20/29] uml: rename arch/um remove_mapping() Peter Zijlstra
2007-02-21 14:43 ` [PATCH 21/29] mm: prepare swap entry methods for use in page methods Peter Zijlstra
2007-02-21 14:43 ` [PATCH 22/29] mm: add support for non block device backed swap files Peter Zijlstra
2007-02-21 14:43 ` [PATCH 23/29] mm: methods for teaching filesystems about PG_swapcache pages Peter Zijlstra
2007-02-21 14:43 ` [PATCH 24/29] nfs: remove mempools Peter Zijlstra
2007-02-21 14:43 ` [PATCH 25/29] nfs: only use stable storage for swap Peter Zijlstra
2007-02-21 14:43 ` [PATCH 26/29] nfs: teach the NFS client how to treat PG_swapcache pages Peter Zijlstra
2007-02-21 14:43 ` [PATCH 27/29] nfs: disable data cache revalidation for swapfiles Peter Zijlstra
2007-02-21 14:43 ` [PATCH 28/29] nfs: enable swap on NFS Peter Zijlstra
2007-02-21 14:43 ` [PATCH 29/29] balance_dirty_pages() vs throttle_vm_writeout() deadlock Peter Zijlstra
2007-12-14 15:39 [PATCH 00/29] Swap over NFS -v15 Peter Zijlstra
2007-12-14 15:39 ` [PATCH 27/29] nfs: disable data cache revalidation for swapfiles Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).