LKML Archive on lore.kernel.org
* [PATCH 0/9] VM deadlock avoidance -v10
@ 2007-01-16 9:45 Peter Zijlstra
2007-01-16 9:45 ` [PATCH 1/9] mm: page allocation rank Peter Zijlstra
` (9 more replies)
0 siblings, 10 replies; 32+ messages in thread
From: Peter Zijlstra @ 2007-01-16 9:45 UTC (permalink / raw)
To: linux-kernel, netdev, linux-mm; +Cc: David Miller, Peter Zijlstra
These patches implement the basic infrastructure to allow swap over networked
storage.
The basic idea is to reserve some memory up front to use when regular memory
runs out.
To bound network behaviour we accept only a limited number of concurrent
packets and drop those packets that are not aimed at the connection(s) servicing
the VM. Also, all network paths that interact with userspace must be avoided -
e.g. taps and NF_QUEUE.
PF_MEMALLOC is set when processing emergency skbs. This makes sense in that we
are indeed working on behalf of the swapper/VM. This allows us to use the
regular memory allocators for processing, but requires that said processing
have bounded memory usage and have that usage accounted for in the reserve.
I am particularly looking for comments on the design; is this acceptable?
Kind regards,
Peter
--
* [PATCH 1/9] mm: page allocation rank
2007-01-16 9:45 [PATCH 0/9] VM deadlock avoidance -v10 Peter Zijlstra
@ 2007-01-16 9:45 ` Peter Zijlstra
2007-01-16 9:45 ` [PATCH 2/9] mm: slab allocation fairness Peter Zijlstra
` (8 subsequent siblings)
9 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2007-01-16 9:45 UTC (permalink / raw)
To: linux-kernel, netdev, linux-mm; +Cc: David Miller, Peter Zijlstra
[-- Attachment #1: page_alloc-rank.patch --]
[-- Type: text/plain, Size: 8187 bytes --]
Introduce page allocation rank.
This allocation rank is a measure of the 'hardness' of the page allocation,
where hardness refers to how deep we have to reach (and thereby whether reclaim
was activated) to obtain the page.
It is basically a mapping from the ALLOC_/gfp flags onto a scalar quantity,
which allows comparisons of the kind:
'would this allocation have succeeded using these gfp flags'.
For the gfp -> alloc_flags mapping we use the 'hardest' possible, those
used by __alloc_pages() right before going into direct reclaim.
The alloc_flags -> rank mapping is given by: rank = 2*2^wmark - harder - 2*high,
where wmark = { min = 1, low = 2, high = 3 } and harder, high are booleans (0 or 1).
This gives:
0 is the hardest possible allocation - ALLOC_NO_WATERMARKS,
1 is ALLOC_WMARK_MIN|ALLOC_HARDER|ALLOC_HIGH,
...
15 is ALLOC_WMARK_HIGH|ALLOC_HARDER,
16 is the softest allocation - ALLOC_WMARK_HIGH.
Rank <= 4 will have woken up kswapd and, when also > 0, might have run into
direct reclaim.
Rank > 8 rarely happens and means lots of free memory (due to a parallel oom kill).
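For illustration, a stand-alone user-space model of the alloc_flags -> rank
mapping (the flag values match those introduced below; this is only the
arithmetic, not kernel code):

#include <stdio.h>

#define ALLOC_HARDER            0x01
#define ALLOC_HIGH              0x02
#define ALLOC_WMARK_MIN         0x04
#define ALLOC_WMARK_LOW         0x08
#define ALLOC_WMARK_HIGH        0x10

static int rank(int alloc_flags)
{
        int r = alloc_flags &
                (ALLOC_WMARK_MIN|ALLOC_WMARK_LOW|ALLOC_WMARK_HIGH);
        r -= alloc_flags & (ALLOC_HARDER|ALLOC_HIGH);
        return r;
}

int main(void)
{
        /* GFP_ATOMIC maps to ALLOC_WMARK_MIN|ALLOC_HIGH|ALLOC_HARDER */
        printf("GFP_ATOMIC -> %d\n",
               rank(ALLOC_WMARK_MIN|ALLOC_HIGH|ALLOC_HARDER));  /* 1 */
        /* GFP_KERNEL maps to ALLOC_WMARK_MIN (ALLOC_CPUSET does not count) */
        printf("GFP_KERNEL -> %d\n", rank(ALLOC_WMARK_MIN));    /* 4 */
        /* the softest possible allocation */
        printf("WMARK_HIGH -> %d\n", rank(ALLOC_WMARK_HIGH));   /* 16 */
        return 0;
}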
The allocation rank is stored in page->index for successful allocations.
'offline' testing of the rank is made impossible by direct reclaim and
fragmentation issues. That is, it is impossible to tell if a given allocation
will succeed without actually doing it.
The purpose of this measure is to introduce some fairness into the slab
allocator.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
mm/internal.h | 89 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 58 ++++++++++--------------------------
2 files changed, 106 insertions(+), 41 deletions(-)
Index: linux-2.6-git/mm/internal.h
===================================================================
--- linux-2.6-git.orig/mm/internal.h 2007-01-08 11:53:13.000000000 +0100
+++ linux-2.6-git/mm/internal.h 2007-01-09 11:29:18.000000000 +0100
@@ -12,6 +12,7 @@
#define __MM_INTERNAL_H
#include <linux/mm.h>
+#include <linux/hardirq.h>
static inline void set_page_count(struct page *page, int v)
{
@@ -37,4 +38,92 @@ static inline void __put_page(struct pag
extern void fastcall __init __free_pages_bootmem(struct page *page,
unsigned int order);
+#define ALLOC_HARDER 0x01 /* try to alloc harder */
+#define ALLOC_HIGH 0x02 /* __GFP_HIGH set */
+#define ALLOC_WMARK_MIN 0x04 /* use pages_min watermark */
+#define ALLOC_WMARK_LOW 0x08 /* use pages_low watermark */
+#define ALLOC_WMARK_HIGH 0x10 /* use pages_high watermark */
+#define ALLOC_NO_WATERMARKS 0x20 /* don't check watermarks at all */
+#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
+
+/*
+ * get the deepest reaching allocation flags for the given gfp_mask
+ */
+static int inline gfp_to_alloc_flags(gfp_t gfp_mask)
+{
+ struct task_struct *p = current;
+ int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
+ const gfp_t wait = gfp_mask & __GFP_WAIT;
+
+ /*
+ * The caller may dip into page reserves a bit more if the caller
+ * cannot run direct reclaim, or if the caller has realtime scheduling
+ * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
+ * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+ */
+ if (gfp_mask & __GFP_HIGH)
+ alloc_flags |= ALLOC_HIGH;
+
+ if (!wait) {
+ alloc_flags |= ALLOC_HARDER;
+ /*
+ * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
+ * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+ */
+ alloc_flags &= ~ALLOC_CPUSET;
+ } else if (unlikely(rt_task(p)) && !in_interrupt())
+ alloc_flags |= ALLOC_HARDER;
+
+ if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+ if (!in_interrupt() &&
+ ((p->flags & PF_MEMALLOC) ||
+ unlikely(test_thread_flag(TIF_MEMDIE))))
+ alloc_flags |= ALLOC_NO_WATERMARKS;
+ }
+
+ return alloc_flags;
+}
+
+#define MAX_ALLOC_RANK 16
+
+/*
+ * classify the allocation: 0 is hardest, 16 is easiest.
+ */
+static inline int alloc_flags_to_rank(int alloc_flags)
+{
+ int rank;
+
+ if (alloc_flags & ALLOC_NO_WATERMARKS)
+ return 0;
+
+ rank = alloc_flags & (ALLOC_WMARK_MIN|ALLOC_WMARK_LOW|ALLOC_WMARK_HIGH);
+ rank -= alloc_flags & (ALLOC_HARDER|ALLOC_HIGH);
+
+ return rank;
+}
+
+static inline int gfp_to_rank(gfp_t gfp_mask)
+{
+ /*
+ * Although correct this full version takes a ~3% performance
+ * hit on the network tests in aim9.
+ *
+
+ return alloc_flags_to_rank(gfp_to_alloc_flags(gfp_mask));
+
+ *
+ * Just check the bare essential ALLOC_NO_WATERMARKS case this keeps
+ * the aim9 results within the error margin.
+ */
+
+ if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+ if (!in_interrupt() &&
+ ((current->flags & PF_MEMALLOC) ||
+ unlikely(test_thread_flag(TIF_MEMDIE))))
+ return 0;
+ }
+
+ return 1;
+}
+
#endif
Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c 2007-01-08 11:53:13.000000000 +0100
+++ linux-2.6-git/mm/page_alloc.c 2007-01-09 11:29:18.000000000 +0100
@@ -888,14 +888,6 @@ failed:
return NULL;
}
-#define ALLOC_NO_WATERMARKS 0x01 /* don't check watermarks at all */
-#define ALLOC_WMARK_MIN 0x02 /* use pages_min watermark */
-#define ALLOC_WMARK_LOW 0x04 /* use pages_low watermark */
-#define ALLOC_WMARK_HIGH 0x08 /* use pages_high watermark */
-#define ALLOC_HARDER 0x10 /* try to alloc harder */
-#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
-#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
-
#ifdef CONFIG_FAIL_PAGE_ALLOC
static struct fail_page_alloc_attr {
@@ -1186,6 +1178,7 @@ zonelist_scan:
page = buffered_rmqueue(zonelist, zone, order, gfp_mask);
if (page)
+ page->index = alloc_flags_to_rank(alloc_flags);
break;
this_zone_full:
if (NUMA_BUILD)
@@ -1259,48 +1252,27 @@ restart:
* OK, we're below the kswapd watermark and have kicked background
* reclaim. Now things get more complex, so set up alloc_flags according
* to how we want to proceed.
- *
- * The caller may dip into page reserves a bit more if the caller
- * cannot run direct reclaim, or if the caller has realtime scheduling
- * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
- * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
*/
- alloc_flags = ALLOC_WMARK_MIN;
- if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
- alloc_flags |= ALLOC_HARDER;
- if (gfp_mask & __GFP_HIGH)
- alloc_flags |= ALLOC_HIGH;
- if (wait)
- alloc_flags |= ALLOC_CPUSET;
+ alloc_flags = gfp_to_alloc_flags(gfp_mask);
- /*
- * Go through the zonelist again. Let __GFP_HIGH and allocations
- * coming from realtime tasks go deeper into reserves.
- *
- * This is the last chance, in general, before the goto nopage.
- * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
- * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
- */
- page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
+ /* This is the last chance, in general, before the goto nopage. */
+ page = get_page_from_freelist(gfp_mask, order, zonelist,
+ alloc_flags & ~ALLOC_NO_WATERMARKS);
if (page)
goto got_pg;
/* This allocation should allow future memory freeing. */
-
rebalance:
- if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
- && !in_interrupt()) {
- if (!(gfp_mask & __GFP_NOMEMALLOC)) {
+ if (alloc_flags & ALLOC_NO_WATERMARKS) {
nofail_alloc:
- /* go through the zonelist yet again, ignoring mins */
- page = get_page_from_freelist(gfp_mask, order,
+ /* go through the zonelist yet again, ignoring mins */
+ page = get_page_from_freelist(gfp_mask, order,
zonelist, ALLOC_NO_WATERMARKS);
- if (page)
- goto got_pg;
- if (gfp_mask & __GFP_NOFAIL) {
- congestion_wait(WRITE, HZ/50);
- goto nofail_alloc;
- }
+ if (page)
+ goto got_pg;
+ if (wait && (gfp_mask & __GFP_NOFAIL)) {
+ congestion_wait(WRITE, HZ/50);
+ goto nofail_alloc;
}
goto nopage;
}
@@ -1309,6 +1281,10 @@ nofail_alloc:
if (!wait)
goto nopage;
+ /* Avoid recursion of direct reclaim */
+ if (p->flags & PF_MEMALLOC)
+ goto nopage;
+
cond_resched();
/* We now go into synchronous reclaim */
--
* [PATCH 2/9] mm: slab allocation fairness
2007-01-16 9:45 [PATCH 0/9] VM deadlock avoidance -v10 Peter Zijlstra
2007-01-16 9:45 ` [PATCH 1/9] mm: page allocation rank Peter Zijlstra
@ 2007-01-16 9:45 ` Peter Zijlstra
2007-01-16 9:46 ` [PATCH 3/9] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
` (7 subsequent siblings)
9 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2007-01-16 9:45 UTC (permalink / raw)
To: linux-kernel, netdev, linux-mm; +Cc: David Miller, Peter Zijlstra
[-- Attachment #1: slab-ranking.patch --]
[-- Type: text/plain, Size: 10402 bytes --]
The slab allocator has some unfairness wrt gfp flags; when the slab cache is
grown the gfp flags are used to allocate more memory, but when there is
slab cache available (in partial or free slabs, per-cpu caches or otherwise)
the gfp flags are ignored.
Thus it is possible for less critical slab allocations to succeed and gobble
up precious memory when under memory pressure.
This patch solves that by using the newly introduced page allocation rank.
Page allocation rank is a scalar quantity connecting ALLOC_ and gfp flags which
represents how deep we had to reach into our reserves when allocating a page.
Rank 0 is the deepest we can reach (ALLOC_NO_WATERMARK) and 16 is the most
shallow allocation possible (ALLOC_WMARK_HIGH).
When the slab space is grown the rank of the page allocation is stored. For
each slab allocation we test the given gfp flags against this rank, thereby
asking the question: would these flags have allowed the slab to grow?
If not, we need to probe the current situation. This is done by forcing the
slab space to grow (just testing the free page limits will not work due
to direct reclaim). Failing that, we fail the slab allocation.
Thus if we grew the slab under great duress while PF_MEMALLOC was set and we
really did access the memalloc reserve, the rank would be set to 0. If the next
allocation to that slab were GFP_NOFS|__GFP_NOMEMALLOC (which ordinarily
maps to rank 4 and is always > 0) we'd want to make sure that memory pressure
has decreased enough to allow an allocation with the given gfp flags.
So in this case we try to force grow the slab cache and on failure we fail the
slab allocation. Thus preserving the available slab cache for more pressing
allocations.
If this newly allocated slab gets trimmed on the next kmem_cache_free()
(not unlikely) this is no problem, since 1) it will free memory and 2) the
sole purpose of the allocation was to probe the allocation rank; we didn't
need the space itself.
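For illustration, a stand-alone model of the per-allocation decision (not the
kernel code, just the rule it implements):

#include <stdbool.h>
#include <stdio.h>

struct cache_model {
        int rank;       /* rank of the page that last grew the cache */
        int avail;      /* number of cached free objects */
};

/* a cached object is only handed out to an equally hard (or harder) caller */
static bool may_use_cached(const struct cache_model *c, int alloc_rank)
{
        return c->avail > 0 && alloc_rank <= c->rank;
}

int main(void)
{
        /* the cache last grew under PF_MEMALLOC, i.e. rank 0 */
        struct cache_model c = { .rank = 0, .avail = 10 };

        /* a rank 4 (GFP_KERNEL-like) request must force grow instead */
        printf("rank 4 served from cache: %d\n", may_use_cached(&c, 4)); /* 0 */
        /* another emergency (rank 0) request may use the cached objects */
        printf("rank 0 served from cache: %d\n", may_use_cached(&c, 0)); /* 1 */
        return 0;
}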
[AIM9 results go here]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
mm/slab.c | 61 ++++++++++++++++++++++++++++++++++++++-----------------------
1 file changed, 38 insertions(+), 23 deletions(-)
Index: linux-2.6-git/mm/slab.c
===================================================================
--- linux-2.6-git.orig/mm/slab.c 2007-01-08 11:53:13.000000000 +0100
+++ linux-2.6-git/mm/slab.c 2007-01-09 11:30:00.000000000 +0100
@@ -114,6 +114,7 @@
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
#include <asm/page.h>
+#include "internal.h"
/*
* DEBUG - 1 for kmem_cache_create() to honour; SLAB_DEBUG_INITIAL,
@@ -380,6 +381,7 @@ static void kmem_list3_init(struct kmem_
struct kmem_cache {
/* 1) per-cpu data, touched during every alloc/free */
+ int rank;
struct array_cache *array[NR_CPUS];
/* 2) Cache tunables. Protected by cache_chain_mutex */
unsigned int batchcount;
@@ -1021,21 +1023,21 @@ static inline int cache_free_alien(struc
}
static inline void *alternate_node_alloc(struct kmem_cache *cachep,
- gfp_t flags)
+ gfp_t flags, int rank)
{
return NULL;
}
static inline void *____cache_alloc_node(struct kmem_cache *cachep,
- gfp_t flags, int nodeid)
+ gfp_t flags, int nodeid, int rank)
{
return NULL;
}
#else /* CONFIG_NUMA */
-static void *____cache_alloc_node(struct kmem_cache *, gfp_t, int);
-static void *alternate_node_alloc(struct kmem_cache *, gfp_t);
+static void *____cache_alloc_node(struct kmem_cache *, gfp_t, int, int);
+static void *alternate_node_alloc(struct kmem_cache *, gfp_t, int);
static struct array_cache **alloc_alien_cache(int node, int limit)
{
@@ -1624,6 +1626,7 @@ static void *kmem_getpages(struct kmem_c
if (!page)
return NULL;
+ cachep->rank = page->index;
nr_pages = (1 << cachep->gfporder);
if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
add_zone_page_state(page_zone(page),
@@ -2272,6 +2275,7 @@ kmem_cache_create (const char *name, siz
}
#endif
#endif
+ cachep->rank = MAX_ALLOC_RANK;
/*
* Determine if the slab management is 'on' or 'off' slab.
@@ -2944,7 +2948,7 @@ bad:
#define check_slabp(x,y) do { } while(0)
#endif
-static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
+static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags, int rank)
{
int batchcount;
struct kmem_list3 *l3;
@@ -2956,6 +2960,8 @@ static void *cache_alloc_refill(struct k
check_irq_off();
ac = cpu_cache_get(cachep);
retry:
+ if (unlikely(rank > cachep->rank))
+ goto force_grow;
batchcount = ac->batchcount;
if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
/*
@@ -3011,14 +3017,16 @@ must_grow:
l3->free_objects -= ac->avail;
alloc_done:
spin_unlock(&l3->list_lock);
-
if (unlikely(!ac->avail)) {
int x;
+force_grow:
x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL);
/* cache_grow can reenable interrupts, then ac could change. */
ac = cpu_cache_get(cachep);
- if (!x && ac->avail == 0) /* no objects in sight? abort */
+
+ /* no objects in sight? abort */
+ if (!x && (ac->avail == 0 || rank > cachep->rank))
return NULL;
if (!ac->avail) /* objects refilled by interrupt? */
@@ -3175,7 +3183,8 @@ static inline int should_failslab(struct
#endif /* CONFIG_FAILSLAB */
-static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
+static inline void *____cache_alloc(struct kmem_cache *cachep,
+ gfp_t flags, int rank)
{
void *objp;
struct array_cache *ac;
@@ -3186,13 +3195,13 @@ static inline void *____cache_alloc(stru
return NULL;
ac = cpu_cache_get(cachep);
- if (likely(ac->avail)) {
+ if (likely(ac->avail && rank <= cachep->rank)) {
STATS_INC_ALLOCHIT(cachep);
ac->touched = 1;
objp = ac->entry[--ac->avail];
} else {
STATS_INC_ALLOCMISS(cachep);
- objp = cache_alloc_refill(cachep, flags);
+ objp = cache_alloc_refill(cachep, flags, rank);
}
return objp;
}
@@ -3202,6 +3211,7 @@ static __always_inline void *__cache_all
{
unsigned long save_flags;
void *objp = NULL;
+ int rank = gfp_to_rank(flags);
cache_alloc_debugcheck_before(cachep, flags);
@@ -3209,16 +3219,16 @@ static __always_inline void *__cache_all
if (unlikely(NUMA_BUILD &&
current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
- objp = alternate_node_alloc(cachep, flags);
+ objp = alternate_node_alloc(cachep, flags, rank);
if (!objp)
- objp = ____cache_alloc(cachep, flags);
+ objp = ____cache_alloc(cachep, flags, rank);
/*
* We may just have run out of memory on the local node.
* ____cache_alloc_node() knows how to locate memory on other nodes
*/
if (NUMA_BUILD && !objp)
- objp = ____cache_alloc_node(cachep, flags, numa_node_id());
+ objp = ____cache_alloc_node(cachep, flags, numa_node_id(), rank);
local_irq_restore(save_flags);
objp = cache_alloc_debugcheck_after(cachep, flags, objp,
caller);
@@ -3233,7 +3243,8 @@ static __always_inline void *__cache_all
* If we are in_interrupt, then process context, including cpusets and
* mempolicy, may not apply and should not be used for allocation policy.
*/
-static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags)
+static void *alternate_node_alloc(struct kmem_cache *cachep,
+ gfp_t flags, int rank)
{
int nid_alloc, nid_here;
@@ -3245,7 +3256,7 @@ static void *alternate_node_alloc(struct
else if (current->mempolicy)
nid_alloc = slab_node(current->mempolicy);
if (nid_alloc != nid_here)
- return ____cache_alloc_node(cachep, flags, nid_alloc);
+ return ____cache_alloc_node(cachep, flags, nid_alloc, rank);
return NULL;
}
@@ -3257,7 +3268,7 @@ static void *alternate_node_alloc(struct
* allocator to do its reclaim / fallback magic. We then insert the
* slab into the proper nodelist and then allocate from it.
*/
-void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
+void *fallback_alloc(struct kmem_cache *cache, gfp_t flags, int rank)
{
struct zonelist *zonelist = &NODE_DATA(slab_node(current->mempolicy))
->node_zonelists[gfp_zone(flags)];
@@ -3278,7 +3289,7 @@ retry:
cache->nodelists[nid] &&
cache->nodelists[nid]->free_objects)
obj = ____cache_alloc_node(cache,
- flags | GFP_THISNODE, nid);
+ flags | GFP_THISNODE, nid, rank);
}
if (!obj && !(flags & __GFP_NO_GROW)) {
@@ -3301,7 +3312,7 @@ retry:
nid = page_to_nid(virt_to_page(obj));
if (cache_grow(cache, flags, nid, obj)) {
obj = ____cache_alloc_node(cache,
- flags | GFP_THISNODE, nid);
+ flags | GFP_THISNODE, nid, rank);
if (!obj)
/*
* Another processor may allocate the
@@ -3322,7 +3333,7 @@ retry:
* A interface to enable slab creation on nodeid
*/
static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
- int nodeid)
+ int nodeid, int rank)
{
struct list_head *entry;
struct slab *slabp;
@@ -3335,6 +3346,8 @@ static void *____cache_alloc_node(struct
retry:
check_irq_off();
+ if (unlikely(rank > cachep->rank))
+ goto force_grow;
spin_lock(&l3->list_lock);
entry = l3->slabs_partial.next;
if (entry == &l3->slabs_partial) {
@@ -3370,13 +3383,14 @@ retry:
must_grow:
spin_unlock(&l3->list_lock);
+force_grow:
x = cache_grow(cachep, flags | GFP_THISNODE, nodeid, NULL);
if (x)
goto retry;
if (!(flags & __GFP_THISNODE))
/* Unable to grow the cache. Fall back to other nodes. */
- return fallback_alloc(cachep, flags);
+ return fallback_alloc(cachep, flags, rank);
return NULL;
@@ -3600,6 +3614,7 @@ __cache_alloc_node(struct kmem_cache *ca
{
unsigned long save_flags;
void *ptr = NULL;
+ int rank = gfp_to_rank(flags);
cache_alloc_debugcheck_before(cachep, flags);
local_irq_save(save_flags);
@@ -3615,16 +3630,16 @@ __cache_alloc_node(struct kmem_cache *ca
* to other nodes. It may fail while we still have
* objects on other nodes available.
*/
- ptr = ____cache_alloc(cachep, flags);
+ ptr = ____cache_alloc(cachep, flags, rank);
}
if (!ptr) {
/* ___cache_alloc_node can fall back to other nodes */
- ptr = ____cache_alloc_node(cachep, flags, nodeid);
+ ptr = ____cache_alloc_node(cachep, flags, nodeid, rank);
}
} else {
/* Node not bootstrapped yet */
if (!(flags & __GFP_THISNODE))
- ptr = fallback_alloc(cachep, flags);
+ ptr = fallback_alloc(cachep, flags, rank);
}
local_irq_restore(save_flags);
--
* [PATCH 3/9] mm: allow PF_MEMALLOC from softirq context
2007-01-16 9:45 [PATCH 0/9] VM deadlock avoidance -v10 Peter Zijlstra
2007-01-16 9:45 ` [PATCH 1/9] mm: page allocation rank Peter Zijlstra
2007-01-16 9:45 ` [PATCH 2/9] mm: slab allocation fairness Peter Zijlstra
@ 2007-01-16 9:46 ` Peter Zijlstra
2007-01-16 9:46 ` [PATCH 4/9] mm: serialize access to min_free_kbytes Peter Zijlstra
` (6 subsequent siblings)
9 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2007-01-16 9:46 UTC (permalink / raw)
To: linux-kernel, netdev, linux-mm; +Cc: David Miller, Peter Zijlstra
[-- Attachment #1: PF_MEMALLOC-softirq.patch --]
[-- Type: text/plain, Size: 2119 bytes --]
Allow PF_MEMALLOC to be set in softirq context. When running softirqs from
a borrowed context, save (and afterwards restore) current->flags; ksoftirqd
will have its own task_struct.
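The pattern added to __do_softirq() is a plain save/clear/restore of the flag;
as a stand-alone model (not kernel code, the PF_MEMALLOC value matches the
kernel's):

#include <stdio.h>

#define PF_MEMALLOC 0x00000800

struct task { unsigned long flags; };

static void do_softirq_model(struct task *current_task)
{
        unsigned long pflags = current_task->flags;

        /* don't let softirq work inherit the interrupted task's flag */
        current_task->flags &= ~PF_MEMALLOC;

        /* ... softirq handlers would run here ... */

        /* restore whatever the interrupted task had set */
        current_task->flags = pflags;
}

int main(void)
{
        struct task t = { .flags = PF_MEMALLOC };
        do_softirq_model(&t);
        printf("flags after softirq: %#lx\n", t.flags); /* PF_MEMALLOC again */
        return 0;
}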
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
kernel/softirq.c | 3 +++
mm/internal.h | 14 ++++++++------
2 files changed, 11 insertions(+), 6 deletions(-)
Index: linux-2.6-git/mm/internal.h
===================================================================
--- linux-2.6-git.orig/mm/internal.h 2006-12-14 10:02:52.000000000 +0100
+++ linux-2.6-git/mm/internal.h 2006-12-14 10:10:09.000000000 +0100
@@ -75,9 +75,10 @@ static int inline gfp_to_alloc_flags(gfp
alloc_flags |= ALLOC_HARDER;
if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
- if (!in_interrupt() &&
- ((p->flags & PF_MEMALLOC) ||
- unlikely(test_thread_flag(TIF_MEMDIE))))
+ if (!in_irq() && (p->flags & PF_MEMALLOC))
+ alloc_flags |= ALLOC_NO_WATERMARKS;
+ else if (!in_interrupt() &&
+ unlikely(test_thread_flag(TIF_MEMDIE)))
alloc_flags |= ALLOC_NO_WATERMARKS;
}
@@ -117,9 +118,10 @@ static inline int gfp_to_rank(gfp_t gfp_
*/
if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
- if (!in_interrupt() &&
- ((current->flags & PF_MEMALLOC) ||
- unlikely(test_thread_flag(TIF_MEMDIE))))
+ if (!in_irq() && (current->flags & PF_MEMALLOC))
+ return 0;
+ else if (!in_interrupt() &&
+ unlikely(test_thread_flag(TIF_MEMDIE)))
return 0;
}
Index: linux-2.6-git/kernel/softirq.c
===================================================================
--- linux-2.6-git.orig/kernel/softirq.c 2006-12-14 10:02:18.000000000 +0100
+++ linux-2.6-git/kernel/softirq.c 2006-12-14 10:02:52.000000000 +0100
@@ -209,6 +209,8 @@ asmlinkage void __do_softirq(void)
__u32 pending;
int max_restart = MAX_SOFTIRQ_RESTART;
int cpu;
+ unsigned long pflags = current->flags;
+ current->flags &= ~PF_MEMALLOC;
pending = local_softirq_pending();
account_system_vtime(current);
@@ -247,6 +249,7 @@ restart:
account_system_vtime(current);
_local_bh_enable();
+ current->flags = pflags;
}
#ifndef __ARCH_HAS_DO_SOFTIRQ
--
* [PATCH 4/9] mm: serialize access to min_free_kbytes
2007-01-16 9:45 [PATCH 0/9] VM deadlock avoidance -v10 Peter Zijlstra
` (2 preceding siblings ...)
2007-01-16 9:46 ` [PATCH 3/9] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
@ 2007-01-16 9:46 ` Peter Zijlstra
2007-01-16 9:46 ` [PATCH 5/9] mm: emergency pool Peter Zijlstra
` (5 subsequent siblings)
9 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2007-01-16 9:46 UTC (permalink / raw)
To: linux-kernel, netdev, linux-mm; +Cc: David Miller, Peter Zijlstra
[-- Attachment #1: setup_per_zone_pages_min.patch --]
[-- Type: text/plain, Size: 1913 bytes --]
There is a small race between the procfs caller and the memory hotplug caller
of setup_per_zone_pages_min(). Not a big deal, but the next patch will add yet
another caller. Time to close the gap.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
mm/page_alloc.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c 2007-01-15 09:58:49.000000000 +0100
+++ linux-2.6-git/mm/page_alloc.c 2007-01-15 09:58:51.000000000 +0100
@@ -95,6 +95,7 @@ static char * const zone_names[MAX_NR_ZO
#endif
};
+static DEFINE_SPINLOCK(min_free_lock);
int min_free_kbytes = 1024;
unsigned long __meminitdata nr_kernel_pages;
@@ -3074,12 +3075,12 @@ static void setup_per_zone_lowmem_reserv
}
/**
- * setup_per_zone_pages_min - called when min_free_kbytes changes.
+ * __setup_per_zone_pages_min - called when min_free_kbytes changes.
*
* Ensures that the pages_{min,low,high} values for each zone are set correctly
* with respect to min_free_kbytes.
*/
-void setup_per_zone_pages_min(void)
+static void __setup_per_zone_pages_min(void)
{
unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
unsigned long lowmem_pages = 0;
@@ -3133,6 +3134,15 @@ void setup_per_zone_pages_min(void)
calculate_totalreserve_pages();
}
+void setup_per_zone_pages_min(void)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&min_free_lock, flags);
+ __setup_per_zone_pages_min();
+ spin_unlock_irqrestore(&min_free_lock, flags);
+}
+
/*
* Initialise min_free_kbytes.
*
@@ -3168,7 +3178,7 @@ static int __init init_per_zone_pages_mi
min_free_kbytes = 128;
if (min_free_kbytes > 65536)
min_free_kbytes = 65536;
- setup_per_zone_pages_min();
+ __setup_per_zone_pages_min();
setup_per_zone_lowmem_reserve();
return 0;
}
--
* [PATCH 5/9] mm: emergency pool
2007-01-16 9:45 [PATCH 0/9] VM deadlock avoidance -v10 Peter Zijlstra
` (3 preceding siblings ...)
2007-01-16 9:46 ` [PATCH 4/9] mm: serialize access to min_free_kbytes Peter Zijlstra
@ 2007-01-16 9:46 ` Peter Zijlstra
2007-01-16 9:46 ` [PATCH 6/9] mm: __GFP_EMERGENCY Peter Zijlstra
` (4 subsequent siblings)
9 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2007-01-16 9:46 UTC (permalink / raw)
To: linux-kernel, netdev, linux-mm; +Cc: David Miller, Peter Zijlstra
[-- Attachment #1: page_alloc-emerg.patch --]
[-- Type: text/plain, Size: 6204 bytes --]
Provide the means to reserve a specific number of pages.
The emergency pool is separated from the min watermark because ALLOC_HARDER
and ALLOC_HIGH modify the watermark in a relative way and thus do not ensure
a strict minimum.
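A later user of the reserve (e.g. a networked block device; not part of this
patch) would call the new hook roughly as follows - a sketch only, the call
sites and the MY_RESERVE_PAGES constant are hypothetical:

/* sketch: reserve worst-case pages for in-flight writeout at setup ... */
adjust_memalloc_reserve(MY_RESERVE_PAGES);      /* MY_RESERVE_PAGES is made up */

/* ... and give them back on teardown */
adjust_memalloc_reserve(-MY_RESERVE_PAGES);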
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/mmzone.h | 3 +-
mm/page_alloc.c | 52 ++++++++++++++++++++++++++++++++++++++++---------
mm/vmstat.c | 6 ++---
3 files changed, 48 insertions(+), 13 deletions(-)
Index: linux-2.6-git/include/linux/mmzone.h
===================================================================
--- linux-2.6-git.orig/include/linux/mmzone.h 2007-01-15 09:58:44.000000000 +0100
+++ linux-2.6-git/include/linux/mmzone.h 2007-01-15 09:58:54.000000000 +0100
@@ -156,7 +156,7 @@ enum zone_type {
struct zone {
/* Fields commonly accessed by the page allocator */
unsigned long free_pages;
- unsigned long pages_min, pages_low, pages_high;
+ unsigned long pages_emerg, pages_min, pages_low, pages_high;
/*
* We don't know if the memory that we're going to allocate will be freeable
* or/and it will be released eventually, so to avoid totally wasting several
@@ -540,6 +540,7 @@ int sysctl_min_unmapped_ratio_sysctl_han
struct file *, void __user *, size_t *, loff_t *);
int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
struct file *, void __user *, size_t *, loff_t *);
+void adjust_memalloc_reserve(int pages);
#include <linux/topology.h>
/* Returns the number of the current Node. */
Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c 2007-01-15 09:58:51.000000000 +0100
+++ linux-2.6-git/mm/page_alloc.c 2007-01-15 09:58:54.000000000 +0100
@@ -97,6 +97,7 @@ static char * const zone_names[MAX_NR_ZO
static DEFINE_SPINLOCK(min_free_lock);
int min_free_kbytes = 1024;
+int var_free_kbytes;
unsigned long __meminitdata nr_kernel_pages;
unsigned long __meminitdata nr_all_pages;
@@ -991,7 +992,8 @@ int zone_watermark_ok(struct zone *z, in
if (alloc_flags & ALLOC_HARDER)
min -= min / 4;
- if (free_pages <= min + z->lowmem_reserve[classzone_idx])
+ if (free_pages <= min + z->lowmem_reserve[classzone_idx] +
+ z->pages_emerg)
return 0;
for (o = 0; o < order; o++) {
/* At the next order, this order's pages become unavailable */
@@ -1344,8 +1346,8 @@ nofail_alloc:
nopage:
if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
printk(KERN_WARNING "%s: page allocation failure."
- " order:%d, mode:0x%x\n",
- p->comm, order, gfp_mask);
+ " order:%d, mode:0x%x, alloc_flags:0x%x, pflags:0x%lx\n",
+ p->comm, order, gfp_mask, alloc_flags, p->flags);
dump_stack();
show_mem();
}
@@ -1590,9 +1592,9 @@ void show_free_areas(void)
"\n",
zone->name,
K(zone->free_pages),
- K(zone->pages_min),
- K(zone->pages_low),
- K(zone->pages_high),
+ K(zone->pages_emerg + zone->pages_min),
+ K(zone->pages_emerg + zone->pages_low),
+ K(zone->pages_emerg + zone->pages_high),
K(zone->nr_active),
K(zone->nr_inactive),
K(zone->present_pages),
@@ -3025,7 +3027,7 @@ static void calculate_totalreserve_pages
}
/* we treat pages_high as reserved pages. */
- max += zone->pages_high;
+ max += zone->pages_high + zone->pages_emerg;
if (max > zone->present_pages)
max = zone->present_pages;
@@ -3082,7 +3084,8 @@ static void setup_per_zone_lowmem_reserv
*/
static void __setup_per_zone_pages_min(void)
{
- unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+ unsigned pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+ unsigned pages_emerg = var_free_kbytes >> (PAGE_SHIFT - 10);
unsigned long lowmem_pages = 0;
struct zone *zone;
unsigned long flags;
@@ -3094,11 +3097,13 @@ static void __setup_per_zone_pages_min(v
}
for_each_zone(zone) {
- u64 tmp;
+ u64 tmp, tmp_emerg;
spin_lock_irqsave(&zone->lru_lock, flags);
tmp = (u64)pages_min * zone->present_pages;
do_div(tmp, lowmem_pages);
+ tmp_emerg = (u64)pages_emerg * zone->present_pages;
+ do_div(tmp_emerg, lowmem_pages);
if (is_highmem(zone)) {
/*
* __GFP_HIGH and PF_MEMALLOC allocations usually don't
@@ -3117,12 +3122,14 @@ static void __setup_per_zone_pages_min(v
if (min_pages > 128)
min_pages = 128;
zone->pages_min = min_pages;
+ zone->pages_emerg = min_pages;
} else {
/*
* If it's a lowmem zone, reserve a number of pages
* proportionate to the zone's size.
*/
zone->pages_min = tmp;
+ zone->pages_emerg = tmp_emerg;
}
zone->pages_low = zone->pages_min + (tmp >> 2);
@@ -3143,6 +3150,33 @@ void setup_per_zone_pages_min(void)
spin_unlock_irqrestore(&min_free_lock, flags);
}
+/**
+ * adjust_memalloc_reserve - adjust the memalloc reserve
+ * @pages: number of pages to add
+ *
+ * It adds a number of pages to the memalloc reserve; if
+ * the number was positive it kicks kswapd into action to
+ * satisfy the higher watermarks.
+ *
+ * NOTE: there is only a single caller, hence no locking.
+ */
+void adjust_memalloc_reserve(int pages)
+{
+ var_free_kbytes += pages << (PAGE_SHIFT - 10);
+ BUG_ON(var_free_kbytes < 0);
+ setup_per_zone_pages_min();
+ if (pages > 0) {
+ struct zone *zone;
+ for_each_zone(zone)
+ wakeup_kswapd(zone, 0);
+ }
+ if (pages)
+ printk(KERN_DEBUG "Emergency reserve: %d\n",
+ var_free_kbytes);
+}
+
+EXPORT_SYMBOL_GPL(adjust_memalloc_reserve);
+
/*
* Initialise min_free_kbytes.
*
Index: linux-2.6-git/mm/vmstat.c
===================================================================
--- linux-2.6-git.orig/mm/vmstat.c 2007-01-15 09:58:44.000000000 +0100
+++ linux-2.6-git/mm/vmstat.c 2007-01-15 09:58:54.000000000 +0100
@@ -535,9 +535,9 @@ static int zoneinfo_show(struct seq_file
"\n spanned %lu"
"\n present %lu",
zone->free_pages,
- zone->pages_min,
- zone->pages_low,
- zone->pages_high,
+ zone->pages_emerg + zone->pages_min,
+ zone->pages_emerg + zone->pages_low,
+ zone->pages_emerg + zone->pages_high,
zone->nr_active,
zone->nr_inactive,
zone->pages_scanned,
--
* [PATCH 6/9] mm: __GFP_EMERGENCY
2007-01-16 9:45 [PATCH 0/9] VM deadlock avoidance -v10 Peter Zijlstra
` (4 preceding siblings ...)
2007-01-16 9:46 ` [PATCH 5/9] mm: emergency pool Peter Zijlstra
@ 2007-01-16 9:46 ` Peter Zijlstra
2007-01-16 9:46 ` [PATCH 7/9] mm: allow mempool to fall back to memalloc reserves Peter Zijlstra
` (3 subsequent siblings)
9 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2007-01-16 9:46 UTC (permalink / raw)
To: linux-kernel, netdev, linux-mm; +Cc: David Miller, Peter Zijlstra
[-- Attachment #1: page_alloc-GFP_EMERGENCY.patch --]
[-- Type: text/plain, Size: 3698 bytes --]
__GFP_EMERGENCY will allow the allocation to disregard the watermarks,
much like PF_MEMALLOC.
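A (hypothetical) caller that is entitled to the reserve would simply OR the
flag into its allocation mask, e.g.:

/* hypothetical caller that may consume the emergency reserve */
struct page *page = alloc_page(GFP_ATOMIC | __GFP_EMERGENCY);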
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/gfp.h | 7 ++++++-
mm/internal.h | 10 +++++++---
2 files changed, 13 insertions(+), 4 deletions(-)
Index: linux-2.6-git/include/linux/gfp.h
===================================================================
--- linux-2.6-git.orig/include/linux/gfp.h 2006-12-14 10:02:18.000000000 +0100
+++ linux-2.6-git/include/linux/gfp.h 2006-12-14 10:02:52.000000000 +0100
@@ -35,17 +35,21 @@ struct vm_area_struct;
#define __GFP_HIGH ((__force gfp_t)0x20u) /* Should access emergency pools? */
#define __GFP_IO ((__force gfp_t)0x40u) /* Can start physical IO? */
#define __GFP_FS ((__force gfp_t)0x80u) /* Can call down to low-level FS? */
+
#define __GFP_COLD ((__force gfp_t)0x100u) /* Cache-cold page required */
#define __GFP_NOWARN ((__force gfp_t)0x200u) /* Suppress page allocation failure warning */
#define __GFP_REPEAT ((__force gfp_t)0x400u) /* Retry the allocation. Might fail */
#define __GFP_NOFAIL ((__force gfp_t)0x800u) /* Retry for ever. Cannot fail */
+
#define __GFP_NORETRY ((__force gfp_t)0x1000u)/* Do not retry. Might fail */
#define __GFP_NO_GROW ((__force gfp_t)0x2000u)/* Slab internal usage */
#define __GFP_COMP ((__force gfp_t)0x4000u)/* Add compound page metadata */
#define __GFP_ZERO ((__force gfp_t)0x8000u)/* Return zeroed page on success */
+
#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
#define __GFP_THISNODE ((__force gfp_t)0x40000u)/* No fallback, no policies */
+#define __GFP_EMERGENCY ((__force gfp_t)0x80000u) /* Use emergency reserves */
#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -54,7 +58,8 @@ struct vm_area_struct;
#define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
- __GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE)
+ __GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE| \
+ __GFP_EMERGENCY)
/* This equals 0, but use constants in case they ever change */
#define GFP_NOWAIT (GFP_ATOMIC & ~__GFP_HIGH)
Index: linux-2.6-git/mm/internal.h
===================================================================
--- linux-2.6-git.orig/mm/internal.h 2006-12-14 10:02:52.000000000 +0100
+++ linux-2.6-git/mm/internal.h 2006-12-14 10:02:52.000000000 +0100
@@ -75,7 +75,9 @@ static int inline gfp_to_alloc_flags(gfp
alloc_flags |= ALLOC_HARDER;
if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
- if (!in_irq() && (p->flags & PF_MEMALLOC))
+ if (gfp_mask & __GFP_EMERGENCY)
+ alloc_flags |= ALLOC_NO_WATERMARKS;
+ else if (!in_irq() && (p->flags & PF_MEMALLOC))
alloc_flags |= ALLOC_NO_WATERMARKS;
else if (!in_interrupt() &&
unlikely(test_thread_flag(TIF_MEMDIE)))
@@ -103,7 +105,7 @@ static inline int alloc_flags_to_rank(in
return rank;
}
-static inline int gfp_to_rank(gfp_t gfp_mask)
+static __always_inline int gfp_to_rank(gfp_t gfp_mask)
{
/*
* Although correct this full version takes a ~3% performance
@@ -118,7 +120,9 @@ static inline int gfp_to_rank(gfp_t gfp_
*/
if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
- if (!in_irq() && (current->flags & PF_MEMALLOC))
+ if (gfp_mask & __GFP_EMERGENCY)
+ return 0;
+ else if (!in_irq() && (current->flags & PF_MEMALLOC))
return 0;
else if (!in_interrupt() &&
unlikely(test_thread_flag(TIF_MEMDIE)))
--
* [PATCH 7/9] mm: allow mempool to fall back to memalloc reserves
2007-01-16 9:45 [PATCH 0/9] VM deadlock avoidance -v10 Peter Zijlstra
` (5 preceding siblings ...)
2007-01-16 9:46 ` [PATCH 6/9] mm: __GFP_EMERGENCY Peter Zijlstra
@ 2007-01-16 9:46 ` Peter Zijlstra
2007-01-16 9:46 ` [PATCH 8/9] slab: kmem_cache_objs_to_pages() Peter Zijlstra
` (2 subsequent siblings)
9 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2007-01-16 9:46 UTC (permalink / raw)
To: linux-kernel, netdev, linux-mm; +Cc: David Miller, Peter Zijlstra
[-- Attachment #1: mempool_fixup.patch --]
[-- Type: text/plain, Size: 1167 bytes --]
Allow the mempool to use the memalloc reserves when all else fails and
the allocation context would otherwise allow it.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
mm/mempool.c | 10 ++++++++++
1 file changed, 10 insertions(+)
Index: linux-2.6-git/mm/mempool.c
===================================================================
--- linux-2.6-git.orig/mm/mempool.c 2007-01-12 08:03:44.000000000 +0100
+++ linux-2.6-git/mm/mempool.c 2007-01-12 10:38:57.000000000 +0100
@@ -14,6 +14,7 @@
#include <linux/mempool.h>
#include <linux/blkdev.h>
#include <linux/writeback.h>
+#include "internal.h"
static void add_element(mempool_t *pool, void *element)
{
@@ -229,6 +230,15 @@ repeat_alloc:
}
spin_unlock_irqrestore(&pool->lock, flags);
+ /* if we really had right to the emergency reserves try those */
+ if (gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS) {
+ if (gfp_temp & __GFP_NOMEMALLOC) {
+ gfp_temp &= ~(__GFP_NOMEMALLOC|__GFP_NOWARN);
+ goto repeat_alloc;
+ } else
+ gfp_temp |= __GFP_NOMEMALLOC|__GFP_NOWARN;
+ }
+
/* We must not sleep in the GFP_ATOMIC case */
if (!(gfp_mask & __GFP_WAIT))
return NULL;
--
* [PATCH 8/9] slab: kmem_cache_objs_to_pages()
2007-01-16 9:45 [PATCH 0/9] VM deadlock avoidance -v10 Peter Zijlstra
` (6 preceding siblings ...)
2007-01-16 9:46 ` [PATCH 7/9] mm: allow mempool to fall back to memalloc reserves Peter Zijlstra
@ 2007-01-16 9:46 ` Peter Zijlstra
2007-01-16 9:46 ` [PATCH 9/9] net: vm deadlock avoidance core Peter Zijlstra
2007-01-17 9:12 ` [PATCH 0/9] VM deadlock avoidance -v10 Pavel Machek
9 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2007-01-16 9:46 UTC (permalink / raw)
To: linux-kernel, netdev, linux-mm; +Cc: David Miller, Peter Zijlstra
[-- Attachment #1: kmem_cache_objs_to_pages.patch --]
[-- Type: text/plain, Size: 1426 bytes --]
Provide a method to calculate the number of pages needed to store a given
number of slab objects (upper bound when considering possible partial and
free slabs).
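As a worked example of the bound (values picked arbitrarily, not kernel code):

#include <stdio.h>

int main(void)
{
        /* 8 objects per slab, order-1 (two page) slabs, 20 objects wanted */
        unsigned int num = 8, gfporder = 1, nr = 20;
        unsigned int pages = ((nr + num - 1) / num) << gfporder;

        printf("%u objects -> %u pages\n", nr, pages); /* 20 objects -> 6 pages */
        return 0;
}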
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/slab.h | 1 +
mm/slab.c | 6 ++++++
2 files changed, 7 insertions(+)
Index: linux-2.6-git/include/linux/slab.h
===================================================================
--- linux-2.6-git.orig/include/linux/slab.h 2007-01-09 11:28:32.000000000 +0100
+++ linux-2.6-git/include/linux/slab.h 2007-01-09 11:30:16.000000000 +0100
@@ -43,6 +43,7 @@ typedef struct kmem_cache kmem_cache_t _
*/
void __init kmem_cache_init(void);
extern int slab_is_available(void);
+extern unsigned int kmem_cache_objs_to_pages(struct kmem_cache *, int);
struct kmem_cache *kmem_cache_create(const char *, size_t, size_t,
unsigned long,
Index: linux-2.6-git/mm/slab.c
===================================================================
--- linux-2.6-git.orig/mm/slab.c 2007-01-09 11:30:00.000000000 +0100
+++ linux-2.6-git/mm/slab.c 2007-01-09 11:30:16.000000000 +0100
@@ -4482,3 +4482,9 @@ unsigned int ksize(const void *objp)
return obj_size(virt_to_cache(objp));
}
+
+unsigned int kmem_cache_objs_to_pages(struct kmem_cache *cachep, int nr)
+{
+ return ((nr + cachep->num - 1) / cachep->num) << cachep->gfporder;
+}
+EXPORT_SYMBOL_GPL(kmem_cache_objs_to_pages);
--
* [PATCH 9/9] net: vm deadlock avoidance core
2007-01-16 9:45 [PATCH 0/9] VM deadlock avoidance -v10 Peter Zijlstra
` (7 preceding siblings ...)
2007-01-16 9:46 ` [PATCH 8/9] slab: kmem_cache_objs_to_pages() Peter Zijlstra
@ 2007-01-16 9:46 ` Peter Zijlstra
2007-01-16 13:25 ` Evgeniy Polyakov
2007-01-17 9:12 ` [PATCH 0/9] VM deadlock avoidance -v10 Pavel Machek
9 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2007-01-16 9:46 UTC (permalink / raw)
To: linux-kernel, netdev, linux-mm; +Cc: David Miller, Peter Zijlstra
[-- Attachment #1: vm_deadlock_core.patch --]
[-- Type: text/plain, Size: 28660 bytes --]
In order to provide robust networked storage there must be a guarantee
of progress. That is, the storage device must never stall because of (physical)
OOM, because the device itself might be needed to get out of it (reclaim).
This means that the device must always find enough memory to build/send packets
over the network _and_ receive (level 7) ACKs for those packets.
The network stack has a huge capacity for buffering packets while waiting for
user-space to read them. There is a practical limit imposed to avoid DoS
scenarios. These two things make for a deadlock: what if the receive limit is
reached and all packets are buffered in non-critical sockets (those not serving
the network storage device waiting for an ACK to free a page)?
Memory pressure will add to that: what if there is simply no memory left to
receive packets in?
This patch provides a service to register sockets as critical; SOCK_VMIO
is a promise the socket will never block on receive. Along with a memory
reserve that will service a limited number of packets, this can guarantee a
limited service to these critical sockets.
When we make sure that packets allocated from the reserve will only service
critical sockets, we will not lose the memory and can guarantee progress.
The reserve is calculated to exceed the IP fragment caches and match the route
cache.
(Note on the name SOCK_VMIO; the basic problem is a circular dependency between
the network and virtual memory subsystems which needs to be broken. This does
make VM network IO - and only VM network IO - special, it does not generalize)
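A networked-storage user of this service (not part of this patch) would roughly
do the following with the helpers introduced here - a sketch, the call sites
are hypothetical:

/* sketch: hypothetical call sites, the helpers are the ones added below */
sk_set_vmio(sock->sk);                          /* socket now services the VM */
sk_adjust_memalloc(0, TX_RESERVE_PAGES);        /* account worst-case TX memory */

/* ... do the networked I/O needed for reclaim ... */

sk_adjust_memalloc(0, -TX_RESERVE_PAGES);
sk_clear_vmio(sock->sk);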
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/skbuff.h | 13 +++-
include/net/sock.h | 42 ++++++++++++++-
net/core/dev.c | 40 +++++++++++++-
net/core/skbuff.c | 50 ++++++++++++++++--
net/core/sock.c | 121 +++++++++++++++++++++++++++++++++++++++++++++
net/core/stream.c | 5 +
net/ipv4/ip_fragment.c | 1
net/ipv4/ipmr.c | 4 +
net/ipv4/route.c | 15 +++++
net/ipv4/sysctl_net_ipv4.c | 14 ++++-
net/ipv4/tcp_ipv4.c | 27 +++++++++-
net/ipv6/reassembly.c | 1
net/ipv6/route.c | 15 +++++
net/ipv6/sysctl_net_ipv6.c | 6 +-
net/ipv6/tcp_ipv6.c | 27 +++++++++-
net/netfilter/core.c | 5 +
security/selinux/avc.c | 2
17 files changed, 361 insertions(+), 27 deletions(-)
Index: linux-2.6-git/include/linux/skbuff.h
===================================================================
--- linux-2.6-git.orig/include/linux/skbuff.h 2007-01-12 12:20:08.000000000 +0100
+++ linux-2.6-git/include/linux/skbuff.h 2007-01-12 12:21:14.000000000 +0100
@@ -284,7 +284,8 @@ struct sk_buff {
nfctinfo:3;
__u8 pkt_type:3,
fclone:2,
- ipvs_property:1;
+ ipvs_property:1,
+ emergency:1;
__be16 protocol;
void (*destructor)(struct sk_buff *skb);
@@ -329,10 +330,13 @@ struct sk_buff {
#include <asm/system.h>
+#define SKB_ALLOC_FCLONE 0x01
+#define SKB_ALLOC_RX 0x02
+
extern void kfree_skb(struct sk_buff *skb);
extern void __kfree_skb(struct sk_buff *skb);
extern struct sk_buff *__alloc_skb(unsigned int size,
- gfp_t priority, int fclone, int node);
+ gfp_t priority, int flags, int node);
static inline struct sk_buff *alloc_skb(unsigned int size,
gfp_t priority)
{
@@ -342,7 +346,7 @@ static inline struct sk_buff *alloc_skb(
static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
gfp_t priority)
{
- return __alloc_skb(size, priority, 1, -1);
+ return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, -1);
}
extern struct sk_buff *alloc_skb_from_cache(struct kmem_cache *cp,
@@ -1103,7 +1107,8 @@ static inline void __skb_queue_purge(str
static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
gfp_t gfp_mask)
{
- struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+ struct sk_buff *skb =
+ __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX, -1);
if (likely(skb))
skb_reserve(skb, NET_SKB_PAD);
return skb;
Index: linux-2.6-git/include/net/sock.h
===================================================================
--- linux-2.6-git.orig/include/net/sock.h 2007-01-12 12:20:08.000000000 +0100
+++ linux-2.6-git/include/net/sock.h 2007-01-12 13:17:45.000000000 +0100
@@ -392,6 +392,7 @@ enum sock_flags {
SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+ SOCK_VMIO, /* the VM depends on us - make sure we're serviced */
};
static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -414,6 +415,40 @@ static inline int sock_flag(struct sock
return test_bit(flag, &sk->sk_flags);
}
+static inline int sk_has_vmio(struct sock *sk)
+{
+ return sock_flag(sk, SOCK_VMIO);
+}
+
+#define MAX_PAGES_PER_SKB 3
+#define MAX_FRAGMENTS ((65536 + 1500 - 1) / 1500)
+/*
+ * Guestimate the per request queue TX upper bound.
+ */
+#define TX_RESERVE_PAGES \
+ (4 * MAX_FRAGMENTS * MAX_PAGES_PER_SKB)
+
+extern atomic_t vmio_socks;
+extern atomic_t emergency_rx_skbs;
+
+static inline int sk_vmio_socks(void)
+{
+ return atomic_read(&vmio_socks);
+}
+
+extern int sk_emergency_skb_get(void);
+
+static inline void sk_emergency_skb_put(void)
+{
+ return atomic_dec(&emergency_rx_skbs);
+}
+
+extern void sk_adjust_memalloc(int socks, int tx_reserve_pages);
+extern void ipfrag_reserve_memory(int ipfrag_reserve);
+extern void iprt_reserve_memory(int rt_reserve);
+extern int sk_set_vmio(struct sock *sk);
+extern int sk_clear_vmio(struct sock *sk);
+
static inline void sk_acceptq_removed(struct sock *sk)
{
sk->sk_ack_backlog--;
@@ -695,7 +730,8 @@ static inline struct inode *SOCK_INODE(s
}
extern void __sk_stream_mem_reclaim(struct sock *sk);
-extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);
+extern int sk_stream_mem_schedule(struct sock *sk, struct sk_buff *skb,
+ int size, int kind);
#define SK_STREAM_MEM_QUANTUM ((int)PAGE_SIZE)
@@ -722,13 +758,13 @@ static inline void sk_stream_writequeue_
static inline int sk_stream_rmem_schedule(struct sock *sk, struct sk_buff *skb)
{
return (int)skb->truesize <= sk->sk_forward_alloc ||
- sk_stream_mem_schedule(sk, skb->truesize, 1);
+ sk_stream_mem_schedule(sk, skb, skb->truesize, 1);
}
static inline int sk_stream_wmem_schedule(struct sock *sk, int size)
{
return size <= sk->sk_forward_alloc ||
- sk_stream_mem_schedule(sk, size, 0);
+ sk_stream_mem_schedule(sk, NULL, size, 0);
}
/* Used by processes to "lock" a socket state, so that
Index: linux-2.6-git/net/core/dev.c
===================================================================
--- linux-2.6-git.orig/net/core/dev.c 2007-01-12 12:20:07.000000000 +0100
+++ linux-2.6-git/net/core/dev.c 2007-01-12 12:21:55.000000000 +0100
@@ -1767,10 +1767,23 @@ int netif_receive_skb(struct sk_buff *sk
struct net_device *orig_dev;
int ret = NET_RX_DROP;
__be16 type;
+ unsigned long pflags = current->flags;
+
+ /* Emergency skb are special, they should
+ * - be delivered to SOCK_VMIO sockets only
+ * - stay away from userspace
+ * - have bounded memory usage
+ *
+ * Use PF_MEMALLOC as a poor mans memory pool - the grouping kind.
+ * This saves us from propagating the allocation context down to all
+ * allocation sites.
+ */
+ if (unlikely(skb->emergency))
+ current->flags |= PF_MEMALLOC;
/* if we've gotten here through NAPI, check netpoll */
if (skb->dev->poll && netpoll_rx(skb))
- return NET_RX_DROP;
+ goto out;
if (!skb->tstamp.off_sec)
net_timestamp(skb);
@@ -1781,7 +1794,7 @@ int netif_receive_skb(struct sk_buff *sk
orig_dev = skb_bond(skb);
if (!orig_dev)
- return NET_RX_DROP;
+ goto out;
__get_cpu_var(netdev_rx_stat).total++;
@@ -1798,6 +1811,8 @@ int netif_receive_skb(struct sk_buff *sk
goto ncls;
}
#endif
+ if (unlikely(skb->emergency))
+ goto skip_taps;
list_for_each_entry_rcu(ptype, &ptype_all, list) {
if (!ptype->dev || ptype->dev == skb->dev) {
@@ -1807,6 +1822,7 @@ int netif_receive_skb(struct sk_buff *sk
}
}
+skip_taps:
#ifdef CONFIG_NET_CLS_ACT
if (pt_prev) {
ret = deliver_skb(skb, pt_prev, orig_dev);
@@ -1819,15 +1835,26 @@ int netif_receive_skb(struct sk_buff *sk
if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
kfree_skb(skb);
- goto out;
+ goto unlock;
}
skb->tc_verd = 0;
ncls:
#endif
+ if (unlikely(skb->emergency))
+ switch(skb->protocol) {
+ case __constant_htons(ETH_P_ARP):
+ case __constant_htons(ETH_P_IP):
+ case __constant_htons(ETH_P_IPV6):
+ break;
+
+ default:
+ goto drop;
+ }
+
if (handle_bridge(&skb, &pt_prev, &ret, orig_dev))
- goto out;
+ goto unlock;
type = skb->protocol;
list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) {
@@ -1842,6 +1869,7 @@ ncls:
if (pt_prev) {
ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
} else {
+drop:
kfree_skb(skb);
/* Jamal, now you will not able to escape explaining
* me how you were going to use this. :-)
@@ -1849,8 +1877,10 @@ ncls:
ret = NET_RX_DROP;
}
-out:
+unlock:
rcu_read_unlock();
+out:
+ current->flags = pflags;
return ret;
}
Index: linux-2.6-git/net/core/skbuff.c
===================================================================
--- linux-2.6-git.orig/net/core/skbuff.c 2007-01-12 12:20:07.000000000 +0100
+++ linux-2.6-git/net/core/skbuff.c 2007-01-12 13:29:51.000000000 +0100
@@ -142,28 +142,34 @@ EXPORT_SYMBOL(skb_truesize_bug);
* %GFP_ATOMIC.
*/
struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
- int fclone, int node)
+ int flags, int node)
{
struct kmem_cache *cache;
struct skb_shared_info *shinfo;
struct sk_buff *skb;
u8 *data;
+ int emergency = 0;
- cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
+ size = SKB_DATA_ALIGN(size);
+ cache = (flags & SKB_ALLOC_FCLONE)
+ ? skbuff_fclone_cache : skbuff_head_cache;
+ if (flags & SKB_ALLOC_RX)
+ gfp_mask |= __GFP_NOMEMALLOC|__GFP_NOWARN;
+retry_alloc:
/* Get the HEAD */
skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
if (!skb)
- goto out;
+ goto noskb;
/* Get the DATA. Size must match skb_add_mtu(). */
- size = SKB_DATA_ALIGN(size);
data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
gfp_mask, node);
if (!data)
goto nodata;
memset(skb, 0, offsetof(struct sk_buff, truesize));
+ skb->emergency = emergency;
skb->truesize = size + sizeof(struct sk_buff);
atomic_set(&skb->users, 1);
skb->head = data;
@@ -180,7 +186,7 @@ struct sk_buff *__alloc_skb(unsigned int
shinfo->ip6_frag_id = 0;
shinfo->frag_list = NULL;
- if (fclone) {
+ if (flags & SKB_ALLOC_FCLONE) {
struct sk_buff *child = skb + 1;
atomic_t *fclone_ref = (atomic_t *) (child + 1);
@@ -188,12 +194,29 @@ struct sk_buff *__alloc_skb(unsigned int
atomic_set(fclone_ref, 1);
child->fclone = SKB_FCLONE_UNAVAILABLE;
+ child->emergency = skb->emergency;
}
out:
return skb;
+
nodata:
kmem_cache_free(cache, skb);
skb = NULL;
+noskb:
+ /* Attempt emergency allocation when RX skb. */
+ if (likely(!(flags & SKB_ALLOC_RX) || !sk_vmio_socks()))
+ goto out;
+
+ if (!emergency) {
+ if (sk_emergency_skb_get()) {
+ gfp_mask &= ~(__GFP_NOMEMALLOC|__GFP_NOWARN);
+ gfp_mask |= __GFP_EMERGENCY;
+ emergency = 1;
+ goto retry_alloc;
+ }
+ } else
+ sk_emergency_skb_put();
+
goto out;
}
@@ -271,7 +294,7 @@ struct sk_buff *__netdev_alloc_skb(struc
int node = dev->class_dev.dev ? dev_to_node(dev->class_dev.dev) : -1;
struct sk_buff *skb;
- skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
+ skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX, node);
if (likely(skb)) {
skb_reserve(skb, NET_SKB_PAD);
skb->dev = dev;
@@ -320,6 +343,8 @@ static void skb_release_data(struct sk_b
skb_drop_fraglist(skb);
kfree(skb->head);
+ if (unlikely(skb->emergency))
+ sk_emergency_skb_put();
}
}
@@ -440,6 +465,9 @@ struct sk_buff *skb_clone(struct sk_buff
n->fclone = SKB_FCLONE_CLONE;
atomic_inc(fclone_ref);
} else {
+ if (unlikely(skb->emergency))
+ gfp_mask |= __GFP_EMERGENCY;
+
n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
if (!n)
return NULL;
@@ -474,6 +502,7 @@ struct sk_buff *skb_clone(struct sk_buff
#if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
C(ipvs_property);
#endif
+ C(emergency);
C(protocol);
n->destructor = NULL;
C(mark);
@@ -689,12 +718,19 @@ int pskb_expand_head(struct sk_buff *skb
u8 *data;
int size = nhead + (skb->end - skb->head) + ntail;
long off;
+ int emergency = 0;
if (skb_shared(skb))
BUG();
size = SKB_DATA_ALIGN(size);
+ if (unlikely(skb->emergency) && sk_emergency_skb_get()) {
+ gfp_mask |= __GFP_EMERGENCY;
+ emergency = 1;
+ } else
+ gfp_mask |= __GFP_NOMEMALLOC;
+
data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
if (!data)
goto nodata;
@@ -727,6 +763,8 @@ int pskb_expand_head(struct sk_buff *skb
return 0;
nodata:
+ if (unlikely(emergency))
+ sk_emergency_skb_put();
return -ENOMEM;
}
Index: linux-2.6-git/net/core/sock.c
===================================================================
--- linux-2.6-git.orig/net/core/sock.c 2007-01-12 12:20:07.000000000 +0100
+++ linux-2.6-git/net/core/sock.c 2007-01-12 12:21:14.000000000 +0100
@@ -196,6 +196,120 @@ __u32 sysctl_rmem_default __read_mostly
/* Maximal space eaten by iovec or ancilliary data plus some space */
int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
+static DEFINE_SPINLOCK(memalloc_lock);
+static int rx_net_reserve;
+
+atomic_t vmio_socks;
+atomic_t emergency_rx_skbs;
+
+static int ipfrag_threshold;
+
+#define ipfrag_mtu() (1500) /* XXX: should be smallest mtu system wide */
+#define ipfrag_skbs() (ipfrag_threshold / ipfrag_mtu())
+#define ipfrag_pages() (ipfrag_threshold / (ipfrag_mtu() * (PAGE_SIZE / ipfrag_mtu())))
+
+static int iprt_pages;
+
+/*
+ * is there room for another emergency skb.
+ */
+int sk_emergency_skb_get(void)
+{
+ int nr = atomic_add_return(1, &emergency_rx_skbs);
+ int thresh = (3 * ipfrag_skbs()) / 2;
+ if (nr < thresh)
+ return 1;
+
+ atomic_dec(&emergency_rx_skbs);
+ return 0;
+}
+
+/**
+ * sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
+ * @socks: number of new %SOCK_VMIO sockets
+ * @tx_resserve_pages: number of pages to (un)reserve for TX
+ *
+ * This function adjusts the memalloc reserve based on system demand.
+ * The RX reserve is a limit, and only added once, not for each socket.
+ *
+ * NOTE:
+ * @tx_reserve_pages is an upper-bound of memory used for TX hence
+ * we need not account the pages like we do for RX pages.
+ */
+void sk_adjust_memalloc(int socks, int tx_reserve_pages)
+{
+ unsigned long flags;
+ int reserve = tx_reserve_pages;
+ int nr_socks;
+
+ spin_lock_irqsave(&memalloc_lock, flags);
+ nr_socks = atomic_add_return(socks, &vmio_socks);
+ BUG_ON(nr_socks < 0);
+
+ if (nr_socks) {
+ int rx_pages = 2 * ipfrag_pages() + iprt_pages;
+ reserve += rx_pages - rx_net_reserve;
+ rx_net_reserve = rx_pages;
+ } else {
+ reserve -= rx_net_reserve;
+ rx_net_reserve = 0;
+ }
+
+ if (reserve)
+ adjust_memalloc_reserve(reserve);
+ spin_unlock_irqrestore(&memalloc_lock, flags);
+}
+EXPORT_SYMBOL_GPL(sk_adjust_memalloc);
+
+/*
+ * tiny helper function to track the total ipfragment memory
+ * needed because of modular ipv6
+ */
+void ipfrag_reserve_memory(int frags)
+{
+ ipfrag_threshold += frags;
+ sk_adjust_memalloc(0, 0);
+}
+EXPORT_SYMBOL_GPL(ipfrag_reserve_memory);
+
+void iprt_reserve_memory(int pages)
+{
+ iprt_pages += pages;
+ sk_adjust_memalloc(0, 0);
+}
+EXPORT_SYMBOL_GPL(iprt_reserve_memory);
+
+/**
+ * sk_set_vmio - sets %SOCK_VMIO
+ * @sk: socket to set it on
+ *
+ * Set %SOCK_VMIO on a socket and increase the memalloc reserve
+ * accordingly.
+ */
+int sk_set_vmio(struct sock *sk)
+{
+ int set = sock_flag(sk, SOCK_VMIO);
+ if (!set) {
+ sk_adjust_memalloc(1, 0);
+ sock_set_flag(sk, SOCK_VMIO);
+ sk->sk_allocation |= __GFP_EMERGENCY;
+ }
+ return !set;
+}
+EXPORT_SYMBOL_GPL(sk_set_vmio);
+
+int sk_clear_vmio(struct sock *sk)
+{
+ int set = sock_flag(sk, SOCK_VMIO);
+ if (set) {
+ sk_adjust_memalloc(-1, 0);
+ sock_reset_flag(sk, SOCK_VMIO);
+ sk->sk_allocation &= ~__GFP_EMERGENCY;
+ }
+ return set;
+}
+EXPORT_SYMBOL_GPL(sk_clear_vmio);
+
static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
{
struct timeval tv;
@@ -239,6 +353,12 @@ int sock_queue_rcv_skb(struct sock *sk,
int err = 0;
int skb_len;
+ if (unlikely(skb->emergency)) {
+ if (!sk_has_vmio(sk)) {
+ err = -ENOMEM;
+ goto out;
+ }
+ } else
/* Cast skb->rcvbuf to unsigned... It's pointless, but reduces
number of warnings when compiling with -W --ANK
*/
@@ -868,6 +988,7 @@ void sk_free(struct sock *sk)
struct sk_filter *filter;
struct module *owner = sk->sk_prot_creator->owner;
+ sk_clear_vmio(sk);
if (sk->sk_destruct)
sk->sk_destruct(sk);
Index: linux-2.6-git/net/ipv4/ipmr.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/ipmr.c 2007-01-12 12:20:08.000000000 +0100
+++ linux-2.6-git/net/ipv4/ipmr.c 2007-01-12 12:21:14.000000000 +0100
@@ -1340,6 +1340,9 @@ int ip_mr_input(struct sk_buff *skb)
struct mfc_cache *cache;
int local = ((struct rtable*)skb->dst)->rt_flags&RTCF_LOCAL;
+ if (unlikely(skb->emergency))
+ goto drop;
+
/* Packet is looped back after forward, it should not be
forwarded second time, but still can be delivered locally.
*/
@@ -1411,6 +1414,7 @@ int ip_mr_input(struct sk_buff *skb)
dont_forward:
if (local)
return ip_local_deliver(skb);
+drop:
kfree_skb(skb);
return 0;
}
Index: linux-2.6-git/net/ipv4/sysctl_net_ipv4.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/sysctl_net_ipv4.c 2007-01-12 12:20:08.000000000 +0100
+++ linux-2.6-git/net/ipv4/sysctl_net_ipv4.c 2007-01-12 12:21:14.000000000 +0100
@@ -18,6 +18,7 @@
#include <net/route.h>
#include <net/tcp.h>
#include <net/cipso_ipv4.h>
+#include <net/sock.h>
/* From af_inet.c */
extern int sysctl_ip_nonlocal_bind;
@@ -186,6 +187,17 @@ static int strategy_allowed_congestion_c
}
+int proc_dointvec_fragment(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int ret;
+ int old_thresh = *(int *)table->data;
+ ret = proc_dointvec(table,write,filp,buffer,lenp,ppos);
+ ipfrag_reserve_memory(*(int *)table->data - old_thresh);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(proc_dointvec_fragment);
+
ctl_table ipv4_table[] = {
{
.ctl_name = NET_IPV4_TCP_TIMESTAMPS,
@@ -291,7 +303,7 @@ ctl_table ipv4_table[] = {
.data = &sysctl_ipfrag_high_thresh,
.maxlen = sizeof(int),
.mode = 0644,
- .proc_handler = &proc_dointvec
+ .proc_handler = &proc_dointvec_fragment
},
{
.ctl_name = NET_IPV4_IPFRAG_LOW_THRESH,
Index: linux-2.6-git/net/ipv4/tcp_ipv4.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/tcp_ipv4.c 2007-01-12 12:20:07.000000000 +0100
+++ linux-2.6-git/net/ipv4/tcp_ipv4.c 2007-01-12 12:21:14.000000000 +0100
@@ -1604,6 +1604,22 @@ csum_err:
goto discard;
}
+static int tcp_v4_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+ int ret;
+ unsigned long pflags = current->flags;
+ if (unlikely(skb->emergency)) {
+ BUG_ON(!sk_has_vmio(sk)); /* we dropped those before queueing */
+ if (!(pflags & PF_MEMALLOC))
+ current->flags |= PF_MEMALLOC;
+ }
+
+ ret = tcp_v4_do_rcv(sk, skb);
+
+ current->flags = pflags;
+ return ret;
+}
+
/*
* From tcp_input.c
*/
@@ -1654,6 +1670,15 @@ int tcp_v4_rcv(struct sk_buff *skb)
if (!sk)
goto no_tcp_socket;
+ if (unlikely(skb->emergency)) {
+ if (!sk_has_vmio(sk))
+ goto discard_and_relse;
+ /*
+ decrease window size..
+ tcp_enter_quickack_mode(sk);
+ */
+ }
+
process:
if (sk->sk_state == TCP_TIME_WAIT)
goto do_time_wait;
@@ -2429,7 +2454,7 @@ struct proto tcp_prot = {
.getsockopt = tcp_getsockopt,
.sendmsg = tcp_sendmsg,
.recvmsg = tcp_recvmsg,
- .backlog_rcv = tcp_v4_do_rcv,
+ .backlog_rcv = tcp_v4_backlog_rcv,
.hash = tcp_v4_hash,
.unhash = tcp_unhash,
.get_port = tcp_v4_get_port,
Index: linux-2.6-git/net/ipv6/sysctl_net_ipv6.c
===================================================================
--- linux-2.6-git.orig/net/ipv6/sysctl_net_ipv6.c 2007-01-12 12:20:08.000000000 +0100
+++ linux-2.6-git/net/ipv6/sysctl_net_ipv6.c 2007-01-12 12:21:14.000000000 +0100
@@ -15,6 +15,10 @@
#ifdef CONFIG_SYSCTL
+extern int proc_dointvec_fragment(ctl_table *table, int write,
+ struct file *filp, void __user *buffer, size_t *lenp,
+ loff_t *ppos);
+
static ctl_table ipv6_table[] = {
{
.ctl_name = NET_IPV6_ROUTE,
@@ -44,7 +48,7 @@ static ctl_table ipv6_table[] = {
.data = &sysctl_ip6frag_high_thresh,
.maxlen = sizeof(int),
.mode = 0644,
- .proc_handler = &proc_dointvec
+ .proc_handler = &proc_dointvec_fragment
},
{
.ctl_name = NET_IPV6_IP6FRAG_LOW_THRESH,
Index: linux-2.6-git/net/netfilter/core.c
===================================================================
--- linux-2.6-git.orig/net/netfilter/core.c 2007-01-12 12:20:08.000000000 +0100
+++ linux-2.6-git/net/netfilter/core.c 2007-01-12 12:21:14.000000000 +0100
@@ -181,6 +181,11 @@ next_hook:
kfree_skb(*pskb);
ret = -EPERM;
} else if ((verdict & NF_VERDICT_MASK) == NF_QUEUE) {
+ if (unlikely((*pskb)->emergency)) {
+ printk(KERN_ERR "nf_hook: NF_QUEUE encountered for "
+ "emergency skb - skipping rule.\n");
+ goto next_hook;
+ }
NFDEBUG("nf_hook: Verdict = QUEUE.\n");
if (!nf_queue(*pskb, elem, pf, hook, indev, outdev, okfn,
verdict >> NF_VERDICT_BITS))
Index: linux-2.6-git/security/selinux/avc.c
===================================================================
--- linux-2.6-git.orig/security/selinux/avc.c 2007-01-12 12:20:08.000000000 +0100
+++ linux-2.6-git/security/selinux/avc.c 2007-01-12 12:21:14.000000000 +0100
@@ -332,7 +332,7 @@ static struct avc_node *avc_alloc_node(v
{
struct avc_node *node;
- node = kmem_cache_alloc(avc_node_cachep, GFP_ATOMIC);
+ node = kmem_cache_alloc(avc_node_cachep, GFP_ATOMIC|__GFP_NOMEMALLOC);
if (!node)
goto out;
Index: linux-2.6-git/net/ipv4/ip_fragment.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/ip_fragment.c 2007-01-12 12:20:07.000000000 +0100
+++ linux-2.6-git/net/ipv4/ip_fragment.c 2007-01-12 12:21:14.000000000 +0100
@@ -743,6 +743,7 @@ void ipfrag_init(void)
ipfrag_secret_timer.function = ipfrag_secret_rebuild;
ipfrag_secret_timer.expires = jiffies + sysctl_ipfrag_secret_interval;
add_timer(&ipfrag_secret_timer);
+ ipfrag_reserve_memory(sysctl_ipfrag_high_thresh);
}
EXPORT_SYMBOL(ip_defrag);
Index: linux-2.6-git/net/ipv6/reassembly.c
===================================================================
--- linux-2.6-git.orig/net/ipv6/reassembly.c 2007-01-12 12:20:08.000000000 +0100
+++ linux-2.6-git/net/ipv6/reassembly.c 2007-01-12 12:21:14.000000000 +0100
@@ -772,4 +772,5 @@ void __init ipv6_frag_init(void)
ip6_frag_secret_timer.function = ip6_frag_secret_rebuild;
ip6_frag_secret_timer.expires = jiffies + sysctl_ip6frag_secret_interval;
add_timer(&ip6_frag_secret_timer);
+ ipfrag_reserve_memory(sysctl_ip6frag_high_thresh);
}
Index: linux-2.6-git/net/ipv4/route.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/route.c 2007-01-12 12:20:08.000000000 +0100
+++ linux-2.6-git/net/ipv4/route.c 2007-01-12 12:21:14.000000000 +0100
@@ -2884,6 +2884,17 @@ static int ipv4_sysctl_rtcache_flush_str
return 0;
}
+static int proc_dointvec_rt_size(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int ret;
+ int old = *(int *)table->data;
+ ret = proc_dointvec(table,write,filp,buffer,lenp,ppos);
+ iprt_reserve_memory(kmem_cache_objs_to_pages(ipv4_dst_ops.kmem_cachep,
+ *(int *)table->data - old));
+ return ret;
+}
+
ctl_table ipv4_route_table[] = {
{
.ctl_name = NET_IPV4_ROUTE_FLUSH,
@@ -2926,7 +2937,7 @@ ctl_table ipv4_route_table[] = {
.data = &ip_rt_max_size,
.maxlen = sizeof(int),
.mode = 0644,
- .proc_handler = &proc_dointvec,
+ .proc_handler = &proc_dointvec_rt_size,
},
{
/* Deprecated. Use gc_min_interval_ms */
@@ -3153,6 +3164,8 @@ int __init ip_rt_init(void)
ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1);
ip_rt_max_size = (rt_hash_mask + 1) * 16;
+ iprt_reserve_memory(kmem_cache_objs_to_pages(ipv4_dst_ops.kmem_cachep,
+ ip_rt_max_size));
devinet_init();
ip_fib_init();
Index: linux-2.6-git/net/ipv6/route.c
===================================================================
--- linux-2.6-git.orig/net/ipv6/route.c 2007-01-12 12:20:08.000000000 +0100
+++ linux-2.6-git/net/ipv6/route.c 2007-01-12 12:21:14.000000000 +0100
@@ -2356,6 +2356,17 @@ int ipv6_sysctl_rtcache_flush(ctl_table
return -EINVAL;
}
+static int proc_dointvec_rt_size(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int ret;
+ int old = *(int *)table->data;
+ ret = proc_dointvec(table,write,filp,buffer,lenp,ppos);
+ iprt_reserve_memory(kmem_cache_objs_to_pages(ip6_dst_ops.kmem_cachep,
+ *(int *)table->data - old));
+ return ret;
+}
+
ctl_table ipv6_route_table[] = {
{
.ctl_name = NET_IPV6_ROUTE_FLUSH,
@@ -2379,7 +2390,7 @@ ctl_table ipv6_route_table[] = {
.data = &ip6_rt_max_size,
.maxlen = sizeof(int),
.mode = 0644,
- .proc_handler = &proc_dointvec,
+ .proc_handler = &proc_dointvec_rt_size,
},
{
.ctl_name = NET_IPV6_ROUTE_GC_MIN_INTERVAL,
@@ -2464,6 +2475,8 @@ void __init ip6_route_init(void)
proc_net_fops_create("rt6_stats", S_IRUGO, &rt6_stats_seq_fops);
#endif
+ iprt_reserve_memory(kmem_cache_objs_to_pages(ip6_dst_ops.kmem_cachep,
+ ip6_rt_max_size));
#ifdef CONFIG_XFRM
xfrm6_init();
#endif
Index: linux-2.6-git/net/ipv6/tcp_ipv6.c
===================================================================
--- linux-2.6-git.orig/net/ipv6/tcp_ipv6.c 2007-01-12 12:20:08.000000000 +0100
+++ linux-2.6-git/net/ipv6/tcp_ipv6.c 2007-01-12 12:21:14.000000000 +0100
@@ -1678,6 +1678,22 @@ ipv6_pktoptions:
return 0;
}
+static int tcp_v6_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+ int ret;
+ unsigned long pflags = current->flags;
+ if (unlikely(skb->emergency)) {
+ BUG_ON(!sk_has_vmio(sk)); /* we dropped those before queueing */
+ if (!(pflags & PF_MEMALLOC))
+ current->flags |= PF_MEMALLOC;
+ }
+
+ ret = tcp_v6_do_rcv(sk, skb);
+
+ current->flags = pflags;
+ return ret;
+}
+
static int tcp_v6_rcv(struct sk_buff **pskb)
{
struct sk_buff *skb = *pskb;
@@ -1723,6 +1739,15 @@ static int tcp_v6_rcv(struct sk_buff **p
if (!sk)
goto no_tcp_socket;
+ if (unlikely(skb->emergency)) {
+ if (!sk_has_vmio(sk))
+ goto discard_and_relse;
+ /*
+ decrease window size..
+ tcp_enter_quickack_mode(sk);
+ */
+ }
+
process:
if (sk->sk_state == TCP_TIME_WAIT)
goto do_time_wait;
@@ -2127,7 +2152,7 @@ struct proto tcpv6_prot = {
.getsockopt = tcp_getsockopt,
.sendmsg = tcp_sendmsg,
.recvmsg = tcp_recvmsg,
- .backlog_rcv = tcp_v6_do_rcv,
+ .backlog_rcv = tcp_v6_backlog_rcv,
.hash = tcp_v6_hash,
.unhash = tcp_unhash,
.get_port = tcp_v6_get_port,
Index: linux-2.6-git/net/core/stream.c
===================================================================
--- linux-2.6-git.orig/net/core/stream.c 2007-01-12 12:20:07.000000000 +0100
+++ linux-2.6-git/net/core/stream.c 2007-01-12 13:17:08.000000000 +0100
@@ -207,7 +207,7 @@ void __sk_stream_mem_reclaim(struct sock
EXPORT_SYMBOL(__sk_stream_mem_reclaim);
-int sk_stream_mem_schedule(struct sock *sk, int size, int kind)
+int sk_stream_mem_schedule(struct sock *sk, struct sk_buff *skb, int size, int kind)
{
int amt = sk_stream_pages(size);
@@ -224,7 +224,8 @@ int sk_stream_mem_schedule(struct sock *
/* Over hard limit. */
if (atomic_read(sk->sk_prot->memory_allocated) > sk->sk_prot->sysctl_mem[2]) {
sk->sk_prot->enter_memory_pressure();
- goto suppress_allocation;
+ if (likely(!skb || !skb->emergency))
+ goto suppress_allocation;
}
/* Under pressure. */
--
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 9/9] net: vm deadlock avoidance core
2007-01-16 9:46 ` [PATCH 9/9] net: vm deadlock avoidance core Peter Zijlstra
@ 2007-01-16 13:25 ` Evgeniy Polyakov
2007-01-16 13:47 ` Peter Zijlstra
0 siblings, 1 reply; 32+ messages in thread
From: Evgeniy Polyakov @ 2007-01-16 13:25 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, netdev, linux-mm, David Miller
On Tue, Jan 16, 2007 at 10:46:06AM +0100, Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> In order to provide robust networked storage there must be a guarantee
> of progress. That is, the storage device must never stall because of (physical)
> OOM, because the device itself might be needed to get out of it (reclaim).
> /* Used by processes to "lock" a socket state, so that
> Index: linux-2.6-git/net/core/dev.c
> ===================================================================
> --- linux-2.6-git.orig/net/core/dev.c 2007-01-12 12:20:07.000000000 +0100
> +++ linux-2.6-git/net/core/dev.c 2007-01-12 12:21:55.000000000 +0100
> @@ -1767,10 +1767,23 @@ int netif_receive_skb(struct sk_buff *sk
> struct net_device *orig_dev;
> int ret = NET_RX_DROP;
> __be16 type;
> + unsigned long pflags = current->flags;
> +
> + /* Emergency skb are special, they should
> + * - be delivered to SOCK_VMIO sockets only
> + * - stay away from userspace
> + * - have bounded memory usage
> + *
> + * Use PF_MEMALLOC as a poor mans memory pool - the grouping kind.
> + * This saves us from propagating the allocation context down to all
> + * allocation sites.
> + */
> + if (unlikely(skb->emergency))
> + current->flags |= PF_MEMALLOC;
Access to 'current' in netif_receive_skb()???
Why do you want to work with, for example keventd?
> /* if we've gotten here through NAPI, check netpoll */
> if (skb->dev->poll && netpoll_rx(skb))
> - return NET_RX_DROP;
> + goto out;
>
> if (!skb->tstamp.off_sec)
> net_timestamp(skb);
> @@ -1781,7 +1794,7 @@ int netif_receive_skb(struct sk_buff *sk
> orig_dev = skb_bond(skb);
>
> if (!orig_dev)
> - return NET_RX_DROP;
> + goto out;
>
> __get_cpu_var(netdev_rx_stat).total++;
>
> @@ -1798,6 +1811,8 @@ int netif_receive_skb(struct sk_buff *sk
> goto ncls;
> }
> #endif
> + if (unlikely(skb->emergency))
> + goto skip_taps;
>
> list_for_each_entry_rcu(ptype, &ptype_all, list) {
> if (!ptype->dev || ptype->dev == skb->dev) {
> @@ -1807,6 +1822,7 @@ int netif_receive_skb(struct sk_buff *sk
> }
> }
>
> +skip_taps:
It is still a 'tap'.
> #ifdef CONFIG_NET_CLS_ACT
> if (pt_prev) {
> ret = deliver_skb(skb, pt_prev, orig_dev);
> @@ -1819,15 +1835,26 @@ int netif_receive_skb(struct sk_buff *sk
>
> if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
> kfree_skb(skb);
> - goto out;
> + goto unlock;
> }
>
> skb->tc_verd = 0;
> ncls:
> #endif
>
> + if (unlikely(skb->emergency))
> + switch(skb->protocol) {
> + case __constant_htons(ETH_P_ARP):
> + case __constant_htons(ETH_P_IP):
> + case __constant_htons(ETH_P_IPV6):
> + break;
Poor vlans and appletalk.
> + default:
> + goto drop;
> + }
> +
> if (handle_bridge(&skb, &pt_prev, &ret, orig_dev))
> - goto out;
> + goto unlock;
>
> type = skb->protocol;
> list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) {
> @@ -1842,6 +1869,7 @@ ncls:
> if (pt_prev) {
> ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
> } else {
> +drop:
> kfree_skb(skb);
> /* Jamal, now you will not able to escape explaining
> * me how you were going to use this. :-)
> @@ -1849,8 +1877,10 @@ ncls:
> ret = NET_RX_DROP;
> }
>
> -out:
> +unlock:
> rcu_read_unlock();
> +out:
> + current->flags = pflags;
> return ret;
> }
>
> Index: linux-2.6-git/net/core/skbuff.c
> ===================================================================
> --- linux-2.6-git.orig/net/core/skbuff.c 2007-01-12 12:20:07.000000000 +0100
> +++ linux-2.6-git/net/core/skbuff.c 2007-01-12 13:29:51.000000000 +0100
> @@ -142,28 +142,34 @@ EXPORT_SYMBOL(skb_truesize_bug);
> * %GFP_ATOMIC.
> */
> struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> - int fclone, int node)
> + int flags, int node)
> {
> struct kmem_cache *cache;
> struct skb_shared_info *shinfo;
> struct sk_buff *skb;
> u8 *data;
> + int emergency = 0;
>
> - cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
> + size = SKB_DATA_ALIGN(size);
> + cache = (flags & SKB_ALLOC_FCLONE)
> + ? skbuff_fclone_cache : skbuff_head_cache;
> + if (flags & SKB_ALLOC_RX)
> + gfp_mask |= __GFP_NOMEMALLOC|__GFP_NOWARN;
>
> +retry_alloc:
> /* Get the HEAD */
> skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
> if (!skb)
> - goto out;
> + goto noskb;
>
> /* Get the DATA. Size must match skb_add_mtu(). */
> - size = SKB_DATA_ALIGN(size);
> data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
> gfp_mask, node);
> if (!data)
> goto nodata;
>
> memset(skb, 0, offsetof(struct sk_buff, truesize));
> + skb->emergency = emergency;
> skb->truesize = size + sizeof(struct sk_buff);
> atomic_set(&skb->users, 1);
> skb->head = data;
> @@ -180,7 +186,7 @@ struct sk_buff *__alloc_skb(unsigned int
> shinfo->ip6_frag_id = 0;
> shinfo->frag_list = NULL;
>
> - if (fclone) {
> + if (flags & SKB_ALLOC_FCLONE) {
> struct sk_buff *child = skb + 1;
> atomic_t *fclone_ref = (atomic_t *) (child + 1);
>
> @@ -188,12 +194,29 @@ struct sk_buff *__alloc_skb(unsigned int
> atomic_set(fclone_ref, 1);
>
> child->fclone = SKB_FCLONE_UNAVAILABLE;
> + child->emergency = skb->emergency;
> }
> out:
> return skb;
> +
> nodata:
> kmem_cache_free(cache, skb);
> skb = NULL;
> +noskb:
> + /* Attempt emergency allocation when RX skb. */
> + if (likely(!(flags & SKB_ALLOC_RX) || !sk_vmio_socks()))
> + goto out;
> +
> + if (!emergency) {
> + if (sk_emergency_skb_get()) {
> + gfp_mask &= ~(__GFP_NOMEMALLOC|__GFP_NOWARN);
> + gfp_mask |= __GFP_EMERGENCY;
> + emergency = 1;
> + goto retry_alloc;
> + }
> + } else
> + sk_emergency_skb_put();
> +
> goto out;
> }
>
> @@ -271,7 +294,7 @@ struct sk_buff *__netdev_alloc_skb(struc
> int node = dev->class_dev.dev ? dev_to_node(dev->class_dev.dev) : -1;
> struct sk_buff *skb;
>
> - skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
> + skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX, node);
> if (likely(skb)) {
> skb_reserve(skb, NET_SKB_PAD);
> skb->dev = dev;
> @@ -320,6 +343,8 @@ static void skb_release_data(struct sk_b
> skb_drop_fraglist(skb);
>
> kfree(skb->head);
> + if (unlikely(skb->emergency))
> + sk_emergency_skb_put();
> }
> }
>
> @@ -440,6 +465,9 @@ struct sk_buff *skb_clone(struct sk_buff
> n->fclone = SKB_FCLONE_CLONE;
> atomic_inc(fclone_ref);
> } else {
> + if (unlikely(skb->emergency))
> + gfp_mask |= __GFP_EMERGENCY;
> +
> n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
> if (!n)
> return NULL;
> @@ -474,6 +502,7 @@ struct sk_buff *skb_clone(struct sk_buff
> #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
> C(ipvs_property);
> #endif
> + C(emergency);
> C(protocol);
> n->destructor = NULL;
> C(mark);
> @@ -689,12 +718,19 @@ int pskb_expand_head(struct sk_buff *skb
> u8 *data;
> int size = nhead + (skb->end - skb->head) + ntail;
> long off;
> + int emergency = 0;
>
> if (skb_shared(skb))
> BUG();
>
> size = SKB_DATA_ALIGN(size);
>
> + if (unlikely(skb->emergency) && sk_emergency_skb_get()) {
> + gfp_mask |= __GFP_EMERGENCY;
> + emergency = 1;
> + } else
> + gfp_mask |= __GFP_NOMEMALLOC;
> +
> data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
> if (!data)
> goto nodata;
> @@ -727,6 +763,8 @@ int pskb_expand_head(struct sk_buff *skb
> return 0;
>
> nodata:
> + if (unlikely(emergency))
> + sk_emergency_skb_put();
> return -ENOMEM;
> }
>
> Index: linux-2.6-git/net/core/sock.c
> ===================================================================
> --- linux-2.6-git.orig/net/core/sock.c 2007-01-12 12:20:07.000000000 +0100
> +++ linux-2.6-git/net/core/sock.c 2007-01-12 12:21:14.000000000 +0100
> @@ -196,6 +196,120 @@ __u32 sysctl_rmem_default __read_mostly
> /* Maximal space eaten by iovec or ancilliary data plus some space */
> int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
>
> +static DEFINE_SPINLOCK(memalloc_lock);
> +static int rx_net_reserve;
> +
> +atomic_t vmio_socks;
> +atomic_t emergency_rx_skbs;
> +
> +static int ipfrag_threshold;
> +
> +#define ipfrag_mtu() (1500) /* XXX: should be smallest mtu system wide */
> +#define ipfrag_skbs() (ipfrag_threshold / ipfrag_mtu())
> +#define ipfrag_pages() (ipfrag_threshold / (ipfrag_mtu() * (PAGE_SIZE / ipfrag_mtu())))
> +
> +static int iprt_pages;
> +
> +/*
> + * is there room for another emergency skb.
> + */
> +int sk_emergency_skb_get(void)
> +{
> + int nr = atomic_add_return(1, &emergency_rx_skbs);
> + int thresh = (3 * ipfrag_skbs()) / 2;
> + if (nr < thresh)
> + return 1;
> +
> + atomic_dec(&emergency_rx_skbs);
> + return 0;
> +}
> +
> +/**
> + * sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
> + * @socks: number of new %SOCK_VMIO sockets
> + * @tx_resserve_pages: number of pages to (un)reserve for TX
> + *
> + * This function adjusts the memalloc reserve based on system demand.
> + * The RX reserve is a limit, and only added once, not for each socket.
> + *
> + * NOTE:
> + * @tx_reserve_pages is an upper-bound of memory used for TX hence
> + * we need not account the pages like we do for RX pages.
> + */
> +void sk_adjust_memalloc(int socks, int tx_reserve_pages)
> +{
> + unsigned long flags;
> + int reserve = tx_reserve_pages;
> + int nr_socks;
> +
> + spin_lock_irqsave(&memalloc_lock, flags);
> + nr_socks = atomic_add_return(socks, &vmio_socks);
> + BUG_ON(nr_socks < 0);
> +
> + if (nr_socks) {
> + int rx_pages = 2 * ipfrag_pages() + iprt_pages;
> + reserve += rx_pages - rx_net_reserve;
> + rx_net_reserve = rx_pages;
> + } else {
> + reserve -= rx_net_reserve;
> + rx_net_reserve = 0;
> + }
> +
> + if (reserve)
> + adjust_memalloc_reserve(reserve);
> + spin_unlock_irqrestore(&memalloc_lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(sk_adjust_memalloc);
> +
> +/*
> + * tiny helper function to track the total ipfragment memory
> + * needed because of modular ipv6
> + */
> +void ipfrag_reserve_memory(int frags)
> +{
> + ipfrag_threshold += frags;
> + sk_adjust_memalloc(0, 0);
> +}
> +EXPORT_SYMBOL_GPL(ipfrag_reserve_memory);
> +
> +void iprt_reserve_memory(int pages)
> +{
> + iprt_pages += pages;
> + sk_adjust_memalloc(0, 0);
> +}
> +EXPORT_SYMBOL_GPL(iprt_reserve_memory);
> +
> +/**
> + * sk_set_vmio - sets %SOCK_VMIO
> + * @sk: socket to set it on
> + *
> + * Set %SOCK_VMIO on a socket and increase the memalloc reserve
> + * accordingly.
> + */
> +int sk_set_vmio(struct sock *sk)
> +{
> + int set = sock_flag(sk, SOCK_VMIO);
> + if (!set) {
> + sk_adjust_memalloc(1, 0);
> + sock_set_flag(sk, SOCK_VMIO);
> + sk->sk_allocation |= __GFP_EMERGENCY;
> + }
> + return !set;
> +}
> +EXPORT_SYMBOL_GPL(sk_set_vmio);
> +
> +int sk_clear_vmio(struct sock *sk)
> +{
> + int set = sock_flag(sk, SOCK_VMIO);
> + if (set) {
> + sk_adjust_memalloc(-1, 0);
> + sock_reset_flag(sk, SOCK_VMIO);
> + sk->sk_allocation &= ~__GFP_EMERGENCY;
> + }
> + return set;
> +}
> +EXPORT_SYMBOL_GPL(sk_clear_vmio);
> +
> static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
> {
> struct timeval tv;
> @@ -239,6 +353,12 @@ int sock_queue_rcv_skb(struct sock *sk,
> int err = 0;
> int skb_len;
>
> + if (unlikely(skb->emergency)) {
> + if (!sk_has_vmio(sk)) {
> + err = -ENOMEM;
> + goto out;
> + }
> + } else
> /* Cast skb->rcvbuf to unsigned... It's pointless, but reduces
> number of warnings when compiling with -W --ANK
> */
> @@ -868,6 +988,7 @@ void sk_free(struct sock *sk)
> struct sk_filter *filter;
> struct module *owner = sk->sk_prot_creator->owner;
>
> + sk_clear_vmio(sk);
> if (sk->sk_destruct)
> sk->sk_destruct(sk);
>
> Index: linux-2.6-git/net/ipv4/ipmr.c
> ===================================================================
> --- linux-2.6-git.orig/net/ipv4/ipmr.c 2007-01-12 12:20:08.000000000 +0100
> +++ linux-2.6-git/net/ipv4/ipmr.c 2007-01-12 12:21:14.000000000 +0100
> @@ -1340,6 +1340,9 @@ int ip_mr_input(struct sk_buff *skb)
> struct mfc_cache *cache;
> int local = ((struct rtable*)skb->dst)->rt_flags&RTCF_LOCAL;
>
> + if (unlikely(skb->emergency))
> + goto drop;
> +
> /* Packet is looped back after forward, it should not be
> forwarded second time, but still can be delivered locally.
> */
> @@ -1411,6 +1414,7 @@ int ip_mr_input(struct sk_buff *skb)
> dont_forward:
> if (local)
> return ip_local_deliver(skb);
> +drop:
> kfree_skb(skb);
> return 0;
> }
> Index: linux-2.6-git/net/ipv4/sysctl_net_ipv4.c
> ===================================================================
> --- linux-2.6-git.orig/net/ipv4/sysctl_net_ipv4.c 2007-01-12 12:20:08.000000000 +0100
> +++ linux-2.6-git/net/ipv4/sysctl_net_ipv4.c 2007-01-12 12:21:14.000000000 +0100
> @@ -18,6 +18,7 @@
> #include <net/route.h>
> #include <net/tcp.h>
> #include <net/cipso_ipv4.h>
> +#include <net/sock.h>
>
> /* From af_inet.c */
> extern int sysctl_ip_nonlocal_bind;
> @@ -186,6 +187,17 @@ static int strategy_allowed_congestion_c
>
> }
>
> +int proc_dointvec_fragment(ctl_table *table, int write, struct file *filp,
> + void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> + int ret;
> + int old_thresh = *(int *)table->data;
> + ret = proc_dointvec(table,write,filp,buffer,lenp,ppos);
> + ipfrag_reserve_memory(*(int *)table->data - old_thresh);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(proc_dointvec_fragment);
> +
> ctl_table ipv4_table[] = {
> {
> .ctl_name = NET_IPV4_TCP_TIMESTAMPS,
> @@ -291,7 +303,7 @@ ctl_table ipv4_table[] = {
> .data = &sysctl_ipfrag_high_thresh,
> .maxlen = sizeof(int),
> .mode = 0644,
> - .proc_handler = &proc_dointvec
> + .proc_handler = &proc_dointvec_fragment
> },
> {
> .ctl_name = NET_IPV4_IPFRAG_LOW_THRESH,
> Index: linux-2.6-git/net/ipv4/tcp_ipv4.c
> ===================================================================
> --- linux-2.6-git.orig/net/ipv4/tcp_ipv4.c 2007-01-12 12:20:07.000000000 +0100
> +++ linux-2.6-git/net/ipv4/tcp_ipv4.c 2007-01-12 12:21:14.000000000 +0100
> @@ -1604,6 +1604,22 @@ csum_err:
> goto discard;
> }
>
> +static int tcp_v4_backlog_rcv(struct sock *sk, struct sk_buff *skb)
> +{
> + int ret;
> + unsigned long pflags = current->flags;
> + if (unlikely(skb->emergency)) {
> + BUG_ON(!sk_has_vmio(sk)); /* we dropped those before queueing */
> + if (!(pflags & PF_MEMALLOC))
> + current->flags |= PF_MEMALLOC;
> + }
> +
> + ret = tcp_v4_do_rcv(sk, skb);
> +
> + current->flags = pflags;
> + return ret;
Why don't you want to just setup PF_MEMALLOC for the socket and all
related processes?
> +}
> +
> /*
> * From tcp_input.c
> */
> @@ -1654,6 +1670,15 @@ int tcp_v4_rcv(struct sk_buff *skb)
> if (!sk)
> goto no_tcp_socket;
>
> + if (unlikely(skb->emergency)) {
> + if (!sk_has_vmio(sk))
> + goto discard_and_relse;
> + /*
> + decrease window size..
> + tcp_enter_quickack_mode(sk);
> + */
How does this decrease window size?
Maybe ack scheduling would be better handled by inet_csk_schedule_ack()
or just directly send an ack, which in turn requires allocation, which
can be bound to this received frame processing...
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 9/9] net: vm deadlock avoidance core
2007-01-16 13:25 ` Evgeniy Polyakov
@ 2007-01-16 13:47 ` Peter Zijlstra
2007-01-16 15:33 ` Evgeniy Polyakov
0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2007-01-16 13:47 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: linux-kernel, netdev, linux-mm, David Miller
On Tue, 2007-01-16 at 16:25 +0300, Evgeniy Polyakov wrote:
> On Tue, Jan 16, 2007 at 10:46:06AM +0100, Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> > @@ -1767,10 +1767,23 @@ int netif_receive_skb(struct sk_buff *sk
> > struct net_device *orig_dev;
> > int ret = NET_RX_DROP;
> > __be16 type;
> > + unsigned long pflags = current->flags;
> > +
> > + /* Emergency skb are special, they should
> > + * - be delivered to SOCK_VMIO sockets only
> > + * - stay away from userspace
> > + * - have bounded memory usage
> > + *
> > + * Use PF_MEMALLOC as a poor mans memory pool - the grouping kind.
> > + * This saves us from propagating the allocation context down to all
> > + * allocation sites.
> > + */
> > + if (unlikely(skb->emergency))
> > + current->flags |= PF_MEMALLOC;
>
> Access to 'current' in netif_receive_skb()???
> Why do you want to work with, for example keventd?
Can this run in keventd?
I thought this was softirq context and thus this would either run in a
borrowed context or in ksoftirqd. See patch 3/9.
> > @@ -1798,6 +1811,8 @@ int netif_receive_skb(struct sk_buff *sk
> > goto ncls;
> > }
> > #endif
> > + if (unlikely(skb->emergency))
> > + goto skip_taps;
> >
> > list_for_each_entry_rcu(ptype, &ptype_all, list) {
> > if (!ptype->dev || ptype->dev == skb->dev) {
> > @@ -1807,6 +1822,7 @@ int netif_receive_skb(struct sk_buff *sk
> > }
> > }
> >
> > +skip_taps:
>
> It is still a 'tap'.
Not sure what you are saying, I thought this should stop delivery of
skbs to taps?
> > #ifdef CONFIG_NET_CLS_ACT
> > if (pt_prev) {
> > ret = deliver_skb(skb, pt_prev, orig_dev);
> > @@ -1819,15 +1835,26 @@ int netif_receive_skb(struct sk_buff *sk
> >
> > if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
> > kfree_skb(skb);
> > - goto out;
> > + goto unlock;
> > }
> >
> > skb->tc_verd = 0;
> > ncls:
> > #endif
> >
> > + if (unlikely(skb->emergency))
> > + switch(skb->protocol) {
> > + case __constant_htons(ETH_P_ARP):
> > + case __constant_htons(ETH_P_IP):
> > + case __constant_htons(ETH_P_IPV6):
> > + break;
>
> Poor vlans and appletalk.
Yeah and all those other too, maybe some day.
> > Index: linux-2.6-git/net/ipv4/tcp_ipv4.c
> > ===================================================================
> > --- linux-2.6-git.orig/net/ipv4/tcp_ipv4.c 2007-01-12 12:20:07.000000000 +0100
> > +++ linux-2.6-git/net/ipv4/tcp_ipv4.c 2007-01-12 12:21:14.000000000 +0100
> > @@ -1604,6 +1604,22 @@ csum_err:
> > goto discard;
> > }
> >
> > +static int tcp_v4_backlog_rcv(struct sock *sk, struct sk_buff *skb)
> > +{
> > + int ret;
> > + unsigned long pflags = current->flags;
> > + if (unlikely(skb->emergency)) {
> > + BUG_ON(!sk_has_vmio(sk)); /* we dropped those before queueing */
> > + if (!(pflags & PF_MEMALLOC))
> > + current->flags |= PF_MEMALLOC;
> > + }
> > +
> > + ret = tcp_v4_do_rcv(sk, skb);
> > +
> > + current->flags = pflags;
> > + return ret;
>
> Why don't you want to just setup PF_MEMALLOC for the socket and all
> related processes?
I'm not understanding what you're saying here.
I want to grant the processing of skb->emergency packets access to the
memory reserves.
How would I set PF_MEMALLOC on a socket? It's a process flag. And which
related processes?
> > +}
> > +
> > /*
> > * From tcp_input.c
> > */
> > @@ -1654,6 +1670,15 @@ int tcp_v4_rcv(struct sk_buff *skb)
> > if (!sk)
> > goto no_tcp_socket;
> >
> > + if (unlikely(skb->emergency)) {
> > + if (!sk_has_vmio(sk))
> > + goto discard_and_relse;
> > + /*
> > + decrease window size..
> > + tcp_enter_quickack_mode(sk);
> > + */
>
> How does this decrease window size?
> Maybe ack scheduling would be better handled by inet_csk_schedule_ack()
> or just directly send an ack, which in turn requires allocation, which
> can be bound to this received frame processing...
It doesn't, I thought that it might be a good idea doing that, but never
got around to actually figuring out how to do it.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 9/9] net: vm deadlock avoidance core
2007-01-16 13:47 ` Peter Zijlstra
@ 2007-01-16 15:33 ` Evgeniy Polyakov
2007-01-16 16:08 ` Peter Zijlstra
0 siblings, 1 reply; 32+ messages in thread
From: Evgeniy Polyakov @ 2007-01-16 15:33 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, netdev, linux-mm, David Miller
On Tue, Jan 16, 2007 at 02:47:54PM +0100, Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> > > + if (unlikely(skb->emergency))
> > > + current->flags |= PF_MEMALLOC;
> >
> > Access to 'current' in netif_receive_skb()???
> > Why do you want to work with, for example keventd?
>
> Can this run in keventd?
Initial netchannel implementation by Kelly Daly (IBM) worked in keventd
(or dedicated kernel thread, I do not recall).
> I thought this was softirq context and thus this would either run in a
> borrowed context or in ksoftirqd. See patch 3/9.
And how are you going to access 'current' in softirq?
netif_receive_skb() can also be called from a lot of other places
including keventd and/or different context - it is permitted to call it
everywhere to process packet.
I meant that you break the rule accessing 'current' in that context.
> > > @@ -1798,6 +1811,8 @@ int netif_receive_skb(struct sk_buff *sk
> > > goto ncls;
> > > }
> > > #endif
> > > + if (unlikely(skb->emergency))
> > > + goto skip_taps;
> > >
> > > list_for_each_entry_rcu(ptype, &ptype_all, list) {
> > > if (!ptype->dev || ptype->dev == skb->dev) {
> > > @@ -1807,6 +1822,7 @@ int netif_receive_skb(struct sk_buff *sk
> > > }
> > > }
> > >
> > > +skip_taps:
> >
> > It is still a 'tap'.
>
> Not sure what you are saying, I thought this should stop delivery of
> skbs to taps?
Ingress filters can do whatever they want with the skb at that point; likely
you want to skip that hunk too.
> > > #ifdef CONFIG_NET_CLS_ACT
> > > if (pt_prev) {
> > > ret = deliver_skb(skb, pt_prev, orig_dev);
> > > @@ -1819,15 +1835,26 @@ int netif_receive_skb(struct sk_buff *sk
> > >
> > > if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
> > > kfree_skb(skb);
> > > - goto out;
> > > + goto unlock;
> > > }
> > >
> > > skb->tc_verd = 0;
> > > ncls:
> > > #endif
> > >
> > > + if (unlikely(skb->emergency))
> > > + switch(skb->protocol) {
> > > + case __constant_htons(ETH_P_ARP):
> > > + case __constant_htons(ETH_P_IP):
> > > + case __constant_htons(ETH_P_IPV6):
> > > + break;
> >
> > Poor vlans and appletalk.
>
> Yeah and all those other too, maybe some day.
>
> > > Index: linux-2.6-git/net/ipv4/tcp_ipv4.c
> > > ===================================================================
> > > --- linux-2.6-git.orig/net/ipv4/tcp_ipv4.c 2007-01-12 12:20:07.000000000 +0100
> > > +++ linux-2.6-git/net/ipv4/tcp_ipv4.c 2007-01-12 12:21:14.000000000 +0100
> > > @@ -1604,6 +1604,22 @@ csum_err:
> > > goto discard;
> > > }
> > >
> > > +static int tcp_v4_backlog_rcv(struct sock *sk, struct sk_buff *skb)
> > > +{
> > > + int ret;
> > > + unsigned long pflags = current->flags;
> > > + if (unlikely(skb->emergency)) {
> > > + BUG_ON(!sk_has_vmio(sk)); /* we dropped those before queueing */
> > > + if (!(pflags & PF_MEMALLOC))
> > > + current->flags |= PF_MEMALLOC;
> > > + }
> > > +
> > > + ret = tcp_v4_do_rcv(sk, skb);
> > > +
> > > + current->flags = pflags;
> > > + return ret;
> >
> > Why don't you want to just setup PF_MEMALLOC for the socket and all
> > related processes?
>
> I'm not understanding what you're saying here.
>
> I want grant the processing of skb->emergency packets access to the
> memory reserves.
>
> How would I set PF_MEMALLOC on a socket, its a process flag? And which
> related processes?
You use a special flag for sockets to mark them as capable of
'reserve-eating'; too many flags are a bit confusing.
I meant that you can just mark the process which created such a socket as
PF_MEMALLOC, and clone that flag on forks and other related calls, without
all those checks for 'current' in different places.
> > > +}
> > > +
> > > /*
> > > * From tcp_input.c
> > > */
> > > @@ -1654,6 +1670,15 @@ int tcp_v4_rcv(struct sk_buff *skb)
> > > if (!sk)
> > > goto no_tcp_socket;
> > >
> > > + if (unlikely(skb->emergency)) {
> > > + if (!sk_has_vmio(sk))
> > > + goto discard_and_relse;
> > > + /*
> > > + decrease window size..
> > > + tcp_enter_quickack_mode(sk);
> > > + */
> >
> > How does this decrease window size?
> > Maybe ack scheduling would be better handled by inet_csk_schedule_ack()
> > or just directly send an ack, which in turn requires allocation, which
> > can be bound to this received frame processing...
>
> It doesn't, I thought that it might be a good idea doing that, but never
> got around to actually figuring out how to do it.
tcp_send_ack()?
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 9/9] net: vm deadlock avoidance core
2007-01-16 15:33 ` Evgeniy Polyakov
@ 2007-01-16 16:08 ` Peter Zijlstra
2007-01-17 4:54 ` Evgeniy Polyakov
0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2007-01-16 16:08 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: linux-kernel, netdev, linux-mm, David Miller
On Tue, 2007-01-16 at 18:33 +0300, Evgeniy Polyakov wrote:
> On Tue, Jan 16, 2007 at 02:47:54PM +0100, Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> > > > + if (unlikely(skb->emergency))
> > > > + current->flags |= PF_MEMALLOC;
> > >
> > > Access to 'current' in netif_receive_skb()???
> > > Why do you want to work with, for example keventd?
> >
> > Can this run in keventd?
>
> Initial netchannel implementation by Kelly Daly (IBM) worked in keventd
> (or dedicated kernel thread, I do not recall).
>
> > I thought this was softirq context and thus this would either run in a
> > borrowed context or in ksoftirqd. See patch 3/9.
>
> And how are you going to access 'current' in softirq?
>
> netif_receive_skb() can also be called from a lot of other places
> including keventd and/or different context - it is permitted to call it
> everywhere to process packet.
>
> I meant that you break the rule accessing 'current' in that context.
Yeah, I know, but as long as we're not actually in hard irq context
current does point to the task_struct in charge of current execution and
as long as we restore whatever was in the flags field before we started
poking, nothing can go wrong.
So, yes this is unconventional, but it does work as expected.
As for breaking, 3/9 makes it legal.
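For clarity, the pattern being defended is just this bracket around packet
processing (a condensed sketch of what the patch already does, not new code):

	unsigned long pflags = current->flags;

	if (unlikely(skb->emergency))
		current->flags |= PF_MEMALLOC;	/* lend the reserve */

	/* ... process the packet with the regular allocators ... */

	current->flags = pflags;		/* restore, nothing leaks */

Whatever task happens to be current (ksoftirqd, or a borrowed context) only
carries PF_MEMALLOC for the duration of this one packet.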
> > > > @@ -1798,6 +1811,8 @@ int netif_receive_skb(struct sk_buff *sk
> > > > goto ncls;
> > > > }
> > > > #endif
> > > > + if (unlikely(skb->emergency))
> > > > + goto skip_taps;
> > > >
> > > > list_for_each_entry_rcu(ptype, &ptype_all, list) {
> > > > if (!ptype->dev || ptype->dev == skb->dev) {
> > > > @@ -1807,6 +1822,7 @@ int netif_receive_skb(struct sk_buff *sk
> > > > }
> > > > }
> > > >
> > > > +skip_taps:
> > >
> > > It is still a 'tap'.
> >
> > Not sure what you are saying, I thought this should stop delivery of
> > skbs to taps?
>
> Ingress filters can do whatever they want with the skb at that point; likely
> you want to skip that hunk too.
Will look into ingress filters, thanks for the pointer.
> > > Why don't you want to just setup PF_MEMALLOC for the socket and all
> > > related processes?
> >
> > I'm not understanding what you're saying here.
> >
> > I want grant the processing of skb->emergency packets access to the
> > memory reserves.
> >
> > How would I set PF_MEMALLOC on a socket, its a process flag? And which
> > related processes?
>
> You use special flag for sockets to mark them as capable of
> 'reserve-eating', too many flags are a bit confusing.
Right, and I use PF_MEMALLOC to implement that reserve-eating. There
must be a link between SOCK_VMIO and all allocations associated with
that socket.
> I meant that you can just mark process which created such socket as
> PF_MEMALLOC, and clone that flag on forks and other relatest calls without
> all that checks for 'current' in different places.
Ah, that's the wrong level to think here; these processes never reach
user-space - nor should these sockets.
Also, I only want the processing of the actual network packet to be able
to eat the reserves, not any other thing that might happen in that
context.
And since network processing is mostly done in softirq context I must
mark these sections like I did.
> > > > + /*
> > > > + decrease window size..
> > > > + tcp_enter_quickack_mode(sk);
> > > > + */
> > >
> > > How does this decrease window size?
> > > Maybe ack scheduling would be better handled by inet_csk_schedule_ack()
> > > or just directly send an ack, which in turn requires allocation, which
> > > can be bound to this received frame processing...
> >
> > It doesn't, I thought that it might be a good idea doing that, but never
> > got around to actually figuring out how to do it.
>
> tcp_send_ack()?
>
does that shrink the window automagically?
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 9/9] net: vm deadlock avoidance core
2007-01-16 16:08 ` Peter Zijlstra
@ 2007-01-17 4:54 ` Evgeniy Polyakov
2007-01-17 9:07 ` Peter Zijlstra
0 siblings, 1 reply; 32+ messages in thread
From: Evgeniy Polyakov @ 2007-01-17 4:54 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, netdev, linux-mm, David Miller
On Tue, Jan 16, 2007 at 05:08:15PM +0100, Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> On Tue, 2007-01-16 at 18:33 +0300, Evgeniy Polyakov wrote:
> > On Tue, Jan 16, 2007 at 02:47:54PM +0100, Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> > > > > + if (unlikely(skb->emergency))
> > > > > + current->flags |= PF_MEMALLOC;
> > > >
> > > > Access to 'current' in netif_receive_skb()???
> > > > Why do you want to work with, for example keventd?
> > >
> > > Can this run in keventd?
> >
> > Initial netchannel implementation by Kelly Daly (IBM) worked in keventd
> > (or dedicated kernel thread, I do not recall).
> >
> > > I thought this was softirq context and thus this would either run in a
> > > borrowed context or in ksoftirqd. See patch 3/9.
> >
> > And how are you going to access 'current' in softirq?
> >
> > netif_receive_skb() can also be called from a lot of other places
> > including keventd and/or different context - it is permitted to call it
> > everywhere to process packet.
> >
> > I meant that you break the rule accessing 'current' in that context.
>
> Yeah, I know, but as long as we're not actually in hard irq context
> current does point to the task_struct in charge of current execution and
> as long as we restore whatever was in the flags field before we started
> poking, nothing can go wrong.
>
> So, yes this is unconventional, but it does work as expected.
>
> As for breaking, 3/9 makes it legal.
You operate with 'current' in different contexts without any locks which
looks racy and even is not allowed. What will be 'current' for
netif_rx() case, which schedules softirq from hard irq context -
ksoftirqd, why do you want to set its flags?
> > I meant that you can just mark process which created such socket as
> > PF_MEMALLOC, and clone that flag on forks and other relatest calls without
> > all that checks for 'current' in different places.
>
> Ah, thats the wrong level to think here, these processes never reach
> user-space - nor should these sockets.
You limit this just to send an ack?
What about 'level-7' ack as you described in introduction?
> Also, I only want the processing of the actual network packet to be able
> to eat the reserves, not any other thing that might happen in that
> context.
>
> And since network processing is mostly done in softirq context I must
> mark these sections like I did.
You artificially limit the system to just adding a reserve to generate one ack.
For that purpose you do not need to have all those flags - just reserve
some data in the network core and use it when the system is in OOM (or reclaim)
for critical data paths.
> > > > > + /*
> > > > > + decrease window size..
> > > > > + tcp_enter_quickack_mode(sk);
> > > > > + */
> > > >
> > > > How does this decrease window size?
> > > > Maybe ack scheduling would be better handled by inet_csk_schedule_ack()
> > > > or just directly send an ack, which in turn requires allocation, which
> > > > can be bound to this received frame processing...
> > >
> > > It doesn't, I thought that it might be a good idea doing that, but never
> > > got around to actually figuring out how to do it.
> >
> > tcp_send_ack()?
> >
>
> does that shrink the window automagically?
Yes, it updates the window, but having an ack generated in that place is
actually very wrong. At that point the system has not processed the incoming
packet yet, so it cannot generate a correct ACK for the received frame at
all. And it seems that the only purpose of the whole patchset is to
generate that poor ack - reserve 2007 ack packets (MAX_TCP_HEADER)
at system startup and reuse them when you are under memory pressure.
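Something as simple as the sketch below would cover that - the names and the
pool size here are only illustrative, nothing like this exists in the patchset:

	#define ACK_POOL_SIZE	2007		/* arbitrary, see above */

	static struct sk_buff_head ack_pool;

	static int __init ack_pool_init(void)
	{
		int i;

		skb_queue_head_init(&ack_pool);
		for (i = 0; i < ACK_POOL_SIZE; i++) {
			struct sk_buff *skb = alloc_skb(MAX_TCP_HEADER, GFP_KERNEL);

			if (!skb)
				return -ENOMEM;
			skb_queue_tail(&ack_pool, skb);
		}
		return 0;
	}

	/* Fall back to the preallocated pool only when the normal path fails. */
	static struct sk_buff *ack_skb_get(gfp_t gfp)
	{
		struct sk_buff *skb = alloc_skb(MAX_TCP_HEADER, gfp);

		if (!skb)
			skb = skb_dequeue(&ack_pool);
		return skb;
	}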
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 9/9] net: vm deadlock avoidance core
2007-01-17 4:54 ` Evgeniy Polyakov
@ 2007-01-17 9:07 ` Peter Zijlstra
2007-01-18 10:41 ` Evgeniy Polyakov
0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2007-01-17 9:07 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: linux-kernel, netdev, linux-mm, David Miller
On Wed, 2007-01-17 at 07:54 +0300, Evgeniy Polyakov wrote:
> On Tue, Jan 16, 2007 at 05:08:15PM +0100, Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> > On Tue, 2007-01-16 at 18:33 +0300, Evgeniy Polyakov wrote:
> > > On Tue, Jan 16, 2007 at 02:47:54PM +0100, Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> > > > > > + if (unlikely(skb->emergency))
> > > > > > + current->flags |= PF_MEMALLOC;
> > > > >
> > > > > Access to 'current' in netif_receive_skb()???
> > > > > Why do you want to work with, for example keventd?
> > > >
> > > > Can this run in keventd?
> > >
> > > Initial netchannel implementation by Kelly Daly (IBM) worked in keventd
> > > (or dedicated kernel thread, I do not recall).
> > >
> > > > I thought this was softirq context and thus this would either run in a
> > > > borrowed context or in ksoftirqd. See patch 3/9.
> > >
> > > And how are you going to access 'current' in softirq?
> > >
> > > netif_receive_skb() can also be called from a lot of other places
> > > including keventd and/or different context - it is permitted to call it
> > > everywhere to process packet.
> > >
> > > I meant that you break the rule accessing 'current' in that context.
> >
> > Yeah, I know, but as long as we're not actually in hard irq context
> > current does point to the task_struct in charge of current execution and
> > as long as we restore whatever was in the flags field before we started
> > poking, nothing can go wrong.
> >
> > So, yes this is unconventional, but it does work as expected.
> >
> > As for breaking, 3/9 makes it legal.
>
> You operate with 'current' in different contexts without any locks which
> looks racy and even is not allowed. What will be 'current' for
> netif_rx() case, which schedules softirq from hard irq context -
> ksoftirqd, why do you want to set its flags?
I don't touch current in hardirq context, do I (if I did, that is indeed
a mistake)?
In all other contexts, current is valid.
> > > I meant that you can just mark process which created such socket as
> > > PF_MEMALLOC, and clone that flag on forks and other relatest calls without
> > > all that checks for 'current' in different places.
> >
> > Ah, thats the wrong level to think here, these processes never reach
> > user-space - nor should these sockets.
>
> You limit this just to send an ack?
> What about 'level-7' ack as you described in introduction?
Take NFS, it does full data traffic in kernel.
> > Also, I only want the processing of the actual network packet to be able
> > to eat the reserves, not any other thing that might happen in that
> > context.
> >
> > And since network processing is mostly done in softirq context I must
> > mark these sections like I did.
>
> You artificially limit system to just add a reserve to generate one ack.
> For that purpose you do not need to have all those flags - just reseve
> some data in network core and use it when system is in OOM (or reclaim)
> for critical data pathes.
How would that end up being different, I would have to replace all
allocations done in the full network processing path.
This seems a much less invasive method, all the (allocation) code can
stay the way it is and use the normal allocation functions.
> > > > > > + /*
> > > > > > + decrease window size..
> > > > > > + tcp_enter_quickack_mode(sk);
> > > > > > + */
> > > > >
> > > > > How does this decrease window size?
> > > > > Maybe ack scheduling would be better handled by inet_csk_schedule_ack()
> > > > > or just directly send an ack, which in turn requires allocation, which
> > > > > can be bound to this received frame processing...
> > > >
> > > > It doesn't, I thought that it might be a good idea doing that, but never
> > > > got around to actually figuring out how to do it.
> > >
> > > tcp_send_ack()?
> > >
> >
> > does that shrink the window automagically?
>
> Yes, it updates window, but having ack generated in that place is
> actually very wrong. In that place system has not processed incoming
> packet yet, so it can not generate correct ACK for received frame at
> all. And it seems that the only purpose of the whole patchset is to
> generate that poor ack - reseve 2007 ack packets (MAX_TCP_HEADER)
> in system startup and reuse them when you are under memory pressure.
Right, I suspected something like that; hence I wanted to just shrink
the window. Anyway, this is not a very important issue.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 0/9] VM deadlock avoidance -v10
2007-01-16 9:45 [PATCH 0/9] VM deadlock avoidance -v10 Peter Zijlstra
` (8 preceding siblings ...)
2007-01-16 9:46 ` [PATCH 9/9] net: vm deadlock avoidance core Peter Zijlstra
@ 2007-01-17 9:12 ` Pavel Machek
2007-01-17 9:20 ` Peter Zijlstra
9 siblings, 1 reply; 32+ messages in thread
From: Pavel Machek @ 2007-01-17 9:12 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, netdev, linux-mm, David Miller
Hi!
> These patches implement the basic infrastructure to allow swap over networked
> storage.
>
> The basic idea is to reserve some memory up front to use when regular memory
> runs out.
>
> To bound network behaviour we accept only a limited number of concurrent
> packets and drop those packets that are not aimed at the connection(s) servicing
> the VM. Also all network paths that interact with userspace are to be avoided -
> e.g. taps and NF_QUEUE.
>
> PF_MEMALLOC is set when processing emergency skbs. This makes sense in that we
> are indeed working on behalf of the swapper/VM. This allows us to use the
> regular memory allocators for processing but requires that said processing have
> bounded memory usage and has that accounted in the reserve.
How does it work with ARP, for example? You still need to reply to ARP
if you want to keep your ethernet connections.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 0/9] VM deadlock avoidance -v10
2007-01-17 9:12 ` [PATCH 0/9] VM deadlock avoidance -v10 Pavel Machek
@ 2007-01-17 9:20 ` Peter Zijlstra
0 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2007-01-17 9:20 UTC (permalink / raw)
To: Pavel Machek; +Cc: linux-kernel, netdev, linux-mm, David Miller
On Wed, 2007-01-17 at 10:12 +0100, Pavel Machek wrote:
> Hi!
>
> > These patches implement the basic infrastructure to allow swap over networked
> > storage.
> >
> > The basic idea is to reserve some memory up front to use when regular memory
> > runs out.
> >
> > To bound network behaviour we accept only a limited number of concurrent
> > packets and drop those packets that are not aimed at the connection(s) servicing
> > the VM. Also all network paths that interact with userspace are to be avoided -
> > e.g. taps and NF_QUEUE.
> >
> > PF_MEMALLOC is set when processing emergency skbs. This makes sense in that we
> > are indeed working on behalf of the swapper/VM. This allows us to use the
> > regular memory allocators for processing but requires that said processing have
> > bounded memory usage and has that accounted in the reserve.
>
> How does it work with ARP, for example? You still need to reply to ARP
> if you want to keep your ethernet connections.
ETH_P_ARP is fully processed (under PF_MEMALLOC).
ETH_P_IP{,V6} starts to drop packets not for selected sockets
(SOCK_VMIO) and processes the rest (under PF_MEMALLOC) with limitations;
the packet may never depend on user-space to complete processing.
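Condensed from the patch, the two checks look like this (sketch only):

	/* netif_receive_skb(): only a few protocols survive for emergency skbs */
	if (unlikely(skb->emergency))
		switch (skb->protocol) {
		case __constant_htons(ETH_P_ARP):
		case __constant_htons(ETH_P_IP):
		case __constant_htons(ETH_P_IPV6):
			break;			/* processed under PF_MEMALLOC */
		default:
			goto drop;
		}

	/* tcp_v4_rcv()/tcp_v6_rcv(): only SOCK_VMIO sockets may consume them */
	if (unlikely(skb->emergency) && !sk_has_vmio(sk))
		goto discard_and_relse;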
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 9/9] net: vm deadlock avoidance core
2007-01-17 9:07 ` Peter Zijlstra
@ 2007-01-18 10:41 ` Evgeniy Polyakov
2007-01-18 12:18 ` Peter Zijlstra
0 siblings, 1 reply; 32+ messages in thread
From: Evgeniy Polyakov @ 2007-01-18 10:41 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, netdev, linux-mm, David Miller
On Wed, Jan 17, 2007 at 10:07:28AM +0100, Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> > You operate with 'current' in different contexts without any locks which
> > looks racy and even is not allowed. What will be 'current' for
> > netif_rx() case, which schedules softirq from hard irq context -
> > ksoftirqd, why do you want to set its flags?
>
> I don't touch current in hardirq context, do I (if I did, that is indeed
> a mistake)?
>
> In all other contexts, current is valid.
Well, if you think that setting PF_MEMALLOC flag for keventd and
ksoftirqd is valid, then probably yes...
> > > > I meant that you can just mark process which created such socket as
> > > > PF_MEMALLOC, and clone that flag on forks and other relatest calls without
> > > > all that checks for 'current' in different places.
> > >
> > > Ah, thats the wrong level to think here, these processes never reach
> > > user-space - nor should these sockets.
> >
> > You limit this just to send an ack?
> > What about 'level-7' ack as you described in introduction?
>
> Take NFS, it does full data traffic in kernel.
NFS case is exactly the situation, when you only need to generate an ACK.
> > > Also, I only want the processing of the actual network packet to be able
> > > to eat the reserves, not any other thing that might happen in that
> > > context.
> > >
> > > And since network processing is mostly done in softirq context I must
> > > mark these sections like I did.
> >
> > You artificially limit system to just add a reserve to generate one ack.
> > For that purpose you do not need to have all those flags - just reseve
> > some data in network core and use it when system is in OOM (or reclaim)
> > for critical data pathes.
>
> How would that end up being different, I would have to replace all
> allocations done in the full network processing path.
>
> This seems a much less invasive method, all the (allocation) code can
> stay the way it is and use the normal allocation functions.
An ack is only generated in one place in TCP.
And actually we are starting to talk about a different approach - having
a separate allocator for the network, which will be turned on at OOM (reclaim
or at any other time). If you do not mind, I would like to refresh the
discussion about the network tree allocator, which utilizes its own pool of
pages, performs self-defragmentation of the memory, and is very SMP
friendly in the sense that it is per-cpu like slab and never frees
objects on different CPUs, so they always stay in the same cache.
Among other goodies it allows full zero-copy sending and receiving.
Here is a link:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=nta
> > > > > > > + /*
> > > > > > > + decrease window size..
> > > > > > > + tcp_enter_quickack_mode(sk);
> > > > > > > + */
> > > > > >
> > > > > > How does this decrease window size?
> > > > > > Maybe ack scheduling would be better handled by inet_csk_schedule_ack()
> > > > > > or just directly send an ack, which in turn requires allocation, which
> > > > > > can be bound to this received frame processing...
> > > > >
> > > > > It doesn't, I thought that it might be a good idea doing that, but never
> > > > > got around to actually figuring out how to do it.
> > > >
> > > > tcp_send_ack()?
> > > >
> > >
> > > does that shrink the window automagically?
> >
> > Yes, it updates window, but having ack generated in that place is
> > actually very wrong. In that place system has not processed incoming
> > packet yet, so it can not generate correct ACK for received frame at
> > all. And it seems that the only purpose of the whole patchset is to
> > generate that poor ack - reseve 2007 ack packets (MAX_TCP_HEADER)
> > in system startup and reuse them when you are under memory pressure.
>
> Right, I suspected something like that; hence I wanted to just shrink
> the window. Anyway, this is not a very important issue.
tcp_enter_quickack_mode() does not update the window; it allows an ack to be
sent immediately after the packet has been processed. The window can be changed
in any way the TCP state machine and congestion control want.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 9/9] net: vm deadlock avoidance core
2007-01-18 10:41 ` Evgeniy Polyakov
@ 2007-01-18 12:18 ` Peter Zijlstra
2007-01-18 13:58 ` Possible ways of dealing with OOM conditions Evgeniy Polyakov
0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2007-01-18 12:18 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: linux-kernel, netdev, linux-mm, David Miller
On Thu, 2007-01-18 at 13:41 +0300, Evgeniy Polyakov wrote:
> > > What about 'level-7' ack as you described in introduction?
> >
> > Take NFS, it does full data traffic in kernel.
>
> The NFS case is exactly the situation where you only need to generate an ACK.
No, it is not; it needs the full RPC response.
> > > You artificially limit the system to just adding a reserve to generate one ACK.
> > > For that purpose you do not need to have all those flags - just reserve
> > > some data in the network core and use it when the system is in OOM (or reclaim)
> > > for critical data paths.
> >
> > How would that end up being different, I would have to replace all
> > allocations done in the full network processing path.
> >
> > This seems a much less invasive method, all the (allocation) code can
> > stay the way it is and use the normal allocation functions.
> And actually we are starting to talk about a different approach - having a
> separate allocator for the network, which will be turned on at OOM (reclaim
> or at any other time).
I think we might be, I'm more talking about requirements on the
allocator, while you seem to talk about implementations.
Replacing the allocator, or splitting it in two based on a condition are
all fine as long as they observe the requirements.
The requirement I add is that there is a reserve nobody touches unless
given express permission.
You could implement this by modifying each reachable allocator call site,
sticking a branch in, and using an alternate allocator when the normal
route fails and we do have permission; much like:
foo = kmalloc(size, gfp_mask);
+ if (!foo && special)
+ foo = my_alloc(size)
And earlier versions of this work did something like that. But it
litters the code quite badly and it's quite easy to miss spots. There can
be quite a few allocations in processing network data.
Hence my work on integrating this into the regular memory allocators.
FYI; 'special' evaluates to something like:
!(gfp_mask & __GFP_NOMEMALLOC) &&
((gfp_mask & __GFP_EMERGENCY) ||
(!in_irq() && (current->flags & PF_MEMALLOC)))
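For illustration, that predicate could be wrapped in a small helper along
these lines (a sketch based on the expression above, not the actual patch;
the helper name is made up):

static inline int emergency_alloc_allowed(gfp_t gfp_mask)
{
        /* Callers that explicitly opt out never touch the reserve. */
        if (gfp_mask & __GFP_NOMEMALLOC)
                return 0;
        /* Explicitly flagged emergency allocations may use the reserve. */
        if (gfp_mask & __GFP_EMERGENCY)
                return 1;
        /* Otherwise only process context with PF_MEMALLOC set qualifies. */
        return !in_irq() && (current->flags & PF_MEMALLOC);
}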
> If you do not mind, I would like to refresh the
> discussion about the network tree allocator,
> which uses its own pool of
> pages,
very high order pages, no?
This means that you have to either allocate at boot time and cannot
resize/add pools; which means you waste all that memory if the network
load never comes near using the reserved amount.
Or, you get into all the same trouble the hugepages folks are trying so
very hard to solve.
> performs self-defragmentation of the memory,
Does it move memory about?
All it does is try to avoid fragmentation by policy - a problem
impossible to solve in general, though it can achieve good results given the
practical limitations on program behaviour.
Does your policy work for the given workload? We'll see.
Also, on what level? Each level has both internal and external
fragmentation. I can argue that having large immovable objects in memory
adds to the fragmentation issues on the page-allocator level.
> is very SMP
> friendly in that regard that it is per-cpu like slab and never free
> objects on different CPUs, so they always stay in the same cache.
This makes it very hard to guarantee a reserve limit. (Not impossible,
just more difficult)
> Among other goodies it allows to have full sending/receiving zero-copy.
That won't ever work unless you have page-aligned objects; otherwise you
cannot map them into user-space. Which seems to be at odds with your
tight-packing/reduced-internal-fragmentation goals.
Zero-copy entails mapping the page the hardware writes the packet into
into user-space, right?
Since it's impossible to predict to whom the next packet is addressed,
the packets must be written (by hardware) to different pages.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Possible ways of dealing with OOM conditions.
2007-01-18 12:18 ` Peter Zijlstra
@ 2007-01-18 13:58 ` Evgeniy Polyakov
2007-01-18 15:10 ` Peter Zijlstra
0 siblings, 1 reply; 32+ messages in thread
From: Evgeniy Polyakov @ 2007-01-18 13:58 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, netdev, linux-mm, David Miller
On Thu, Jan 18, 2007 at 01:18:44PM +0100, Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> > > How would that end up being different, I would have to replace all
> > > allocations done in the full network processing path.
> > >
> > > This seems a much less invasive method, all the (allocation) code can
> > > stay the way it is and use the normal allocation functions.
>
> > And actually we are starting to talk about a different approach - having a
> > separate allocator for the network, which will be turned on at OOM (reclaim
> > or at any other time).
>
> I think we might be, I'm more talking about requirements on the
> allocator, while you seem to talk about implementations.
>
> Replacing the allocator, or splitting it in two based on a condition are
> all fine as long as they observe the requirements.
>
> The requirement I add is that there is a reserve nobody touches unless
> given express permission.
>
> You could implement this by modifying each reachable allocator call site
> and stick a branch in and use an alternate allocator when the normal
> route fails and we do have permission; much like:
>
> foo = kmalloc(size, gfp_mask);
> + if (!foo && special)
> + foo = my_alloc(size)
The network is special in this regard, since it only has one allocation path
(actually it has one cache for skbs, and the usual kmalloc, but they are
called from only two functions).
So it would become
ptr = network_alloc();
and network_alloc() would be the usual kmalloc, or a call to its own allocator
in case of deadlock.
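As an illustration of that wrapper idea (a sketch only - network_alloc() as
spelled here and the fallback pool are hypothetical, standing in for whatever
separate allocator is chosen):

/* Sketch of the wrapper described above; names are hypothetical. */
static void *network_alloc(size_t size, gfp_t gfp_mask)
{
        void *ptr = kmalloc(size, gfp_mask);

        /* Normal path failed: fall back to the dedicated network pool. */
        if (!ptr)
                ptr = network_pool_alloc(size);
        return ptr;
}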
> And earlier versions of this work did something like that. But it
> litters the code quite badly and it's quite easy to miss spots. There can
> be quite a few allocations in processing network data.
>
> Hence my work on integrating this into the regular memory allocators.
>
> FYI; 'special' evaluates to something like:
> !(gfp_mask & __GFP_NOMEMALLOC) &&
> ((gfp_mask & __GFP_EMERGENCY) ||
> (!in_irq() && (current->flags & PF_MEMALLOC)))
>
>
> > If you do not mind, I would like to refresh the
> > discussion about the network tree allocator,
>
> > which uses its own pool of
> > pages,
>
> very high order pages, no?
>
> This means that you have to either allocate at boot time and cannot
> resize/add pools; which means you waste all that memory if the network
> load never comes near using the reserved amount.
>
> Or, you get into all the same trouble the hugepages folks are trying so
> very hard to solve.
It is configurable - by default it takes a pool of 32k pages for
jumbo-frame allocations (e1000 unfortunately requires such allocations for
9k frames); without jumbo-frame support it works with a pool of 0-order
pages, which grows dynamically when needed.
> > performs self-defragmentation of the memory,
>
> Does it move memory about?
It works within a page, not across pages - when neighbouring regions are freed,
they are combined into a single one with a bigger size. It could be
extended to move pages around to combine them into bigger ones too,
but the network stack requires high-order allocations only in extremely rare
cases of broken design (Intel folks, sorry, but your hardware sucks in
that regard - a jumbo frame of 9k should not require 16k of memory plus
network overhead).
NTA also does not align buffers to a power of two - the extremely significant
win of that approach can be seen on the project's homepage, with graphs of
failed allocations and the state of the memory for different sizes of
allocations. The power-of-two overhead of SLAB is extremely high.
> All it does is try to avoid fragmentation by policy - a problem
> impossible to solve in general; but can achieve good results in view of
> practical limitations on program behaviour.
>
> Does your policy work for the given workload? we'll see.
>
> Also, on what level, each level has both internal and external
> fragmentation. I can argue that having large immovable objects in memory
> adds to the fragmentation issues on the page-allocator level.
NTA works with pages, not with contiguous memory; it reduces
fragmentation inside pages, which cannot be solved in SLAB, where
objects from the same page can live in different caches and thus can
_never_ be combined. Thus, the only solution for SLAB is copying, which is
not a good one for big sizes and is just wrong for big pages.
It is not about page moving and VM tricks, which are generally described
as fragmentation avoidance techniques, but about how the fragmentation
problem is solved within one page.
> > is very SMP
> > friendly in that regard that it is per-cpu like slab and never free
> > objects on different CPUs, so they always stay in the same cache.
>
> This makes it very hard to guarantee a reserve limit. (Not impossible,
> just more difficult)
The whole pool of pages becomes the reserve, since no one - and mainly the
VFS - can consume that reserve.
> > Among other goodies it allows to have full sending/receiving zero-copy.
>
> That won't ever work unless you have page aligned objects, otherwise you
> cannot map them into user-space. Which seems to be at odds with your
> tight packing/reduce internal fragmentation goals.
>
> Zero-copy entails mapping the page the hardware writes the packet in
> into user-space, right?
>
> > Since it's impossible to predict to whom the next packet is addressed,
> > the packets must be written (by hardware) to different pages.
Yes, receiving zero-copy without appropriate hardware assist is
impossible, so you either have no such facility at all, or you pay special
overhead which forces objects to lie in different pages. With hardware assist
it would be possible to select a flow in advance, so data would be packed
into the same page.
Sending zero-copy from userspace memory does not suffer from any such
problem.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Possible ways of dealing with OOM conditions.
2007-01-18 13:58 ` Possible ways of dealing with OOM conditions Evgeniy Polyakov
@ 2007-01-18 15:10 ` Peter Zijlstra
2007-01-18 15:50 ` Evgeniy Polyakov
0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2007-01-18 15:10 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: linux-kernel, netdev, linux-mm, David Miller
On Thu, 2007-01-18 at 16:58 +0300, Evgeniy Polyakov wrote:
> Network is special in this regard, since it only has one allocation path
> (actually it has one cache for skb, and usual kmalloc, but they are
> called from only two functions).
>
> So it would become
> ptr = network_alloc();
> and network_alloc() would be usual kmalloc or call for own allocator in
> case of deadlock.
There is more to networking than skbs only; what about the route cache?
There are quite a lot of allocs in this fib_* stuff, IGMP etc...
> > very high order pages, no?
> >
> > This means that you have to either allocate at boot time and cannot
> > resize/add pools; which means you waste all that memory if the network
> > load never comes near using the reserved amount.
> >
> > Or, you get into all the same trouble the hugepages folks are trying so
> > very hard to solve.
>
> It is configurable - by default it takes pool of 32k pages for allocations for
> jumbo-frames (e1000 requires such allocations for 9k frames
> unfortunately), without jumbo-frame support it works with pool of 0-order
> pages, which grows dynamically when needed.
With 0-order pages, you can only fit two 1500-byte packets in there; you
could perhaps stick some small skb heads in there as well, but why
bother, the waste isn't _that_ high.
Esp. if you would make a slab for 1500-MTU packets (5*1638 < 2*4096; and
1638 should be enough, right?).
It would make sense to pack related objects into a page so you could
free them all together.
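(Checking the arithmetic: 5 * 1638 = 8190, which just fits in 2 * 4096 = 8192
bytes - five such buffers per pair of pages, versus two 2048-byte
power-of-two buffers per page.)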
> > > performs self-defragmentation of the memory,
> >
> > Does it move memory about?
>
> It works in a page, not as pages - when neighbour regions are freed,
> they are combined into single one with bigger size
Yeah, that is not defragmentation; defragmentation is moving active
regions about to create contiguous free space. What you do is free-space
coalescence.
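To make the distinction concrete, such in-page coalescing amounts to roughly
the following - a generic sketch only, not NTA's actual code or data
structures (bookkeeping allocation simplified, error handling omitted):

#include <stdlib.h>

struct free_range {
        struct free_range *next;
        unsigned int off, len;
};

/* Insert the freed range [off, off+len) into a per-page free list kept
 * sorted by offset, merging it with adjacent free neighbours. */
static void free_and_coalesce(struct free_range **head,
                              unsigned int off, unsigned int len)
{
        struct free_range *prev = NULL, *cur = *head, *r;

        while (cur && cur->off < off) {         /* find position by offset */
                prev = cur;
                cur = cur->next;
        }

        if (prev && prev->off + prev->len == off) {
                prev->len += len;               /* merge with the left neighbour */
                r = prev;
        } else {
                r = malloc(sizeof(*r));         /* new free-range record */
                r->off = off;
                r->len = len;
                r->next = cur;
                if (prev)
                        prev->next = r;
                else
                        *head = r;
        }

        if (cur && r->off + r->len == cur->off) {
                r->len += cur->len;             /* merge with the right neighbour */
                r->next = cur->next;
                free(cur);
        }
}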
> > but the network stack requires high-order allocations only in extremely rare
> > cases of broken design (Intel folks, sorry, but your hardware sucks in
> > that regard - a jumbo frame of 9k should not require 16k of memory plus
> > network overhead).
Well, if you have such hardware it's not rare at all. But yeah, that
sucks.
> > NTA also does not align buffers to a power of two - the extremely significant
> > win of that approach can be seen on the project's homepage, with graphs of
> > failed allocations and the state of the memory for different sizes of
> > allocations. The power-of-two overhead of SLAB is extremely high.
Sure you can pack the page a little better(*), but I thought the main
advantage was a speed increase.
(*) memory is generally cheaper than engineering efforts, esp on this
scale. The only advantage in the manual packing is that (with the fancy
hardware stream engine mentioned below) you could ensure they are
grouped together (then again, the hardware stream engine would, together
with a SG-DMA engine, take care of that).
>
> > All it does is try to avoid fragmentation by policy - a problem
> > impossible to solve in general; but can achieve good results in view of
> > practical limitations on program behaviour.
> >
> > Does your policy work for the given workload? we'll see.
> >
> > Also, on what level, each level has both internal and external
> > fragmentation. I can argue that having large immovable objects in memory
> > adds to the fragmentation issues on the page-allocator level.
>
> NTA works with pages, not with contiguous memory; it reduces
> fragmentation inside pages, which cannot be solved in SLAB, where
> objects from the same page can live in different caches and thus can
> _never_ be combined. Thus, the only solution for SLAB is copying, which is
> not a good one for big sizes and is just wrong for big pages.
By allocating a page and never returning it to the page allocator you've
increased the fragmentation at the page-allocator level significantly.
It will prevent a super page from ever forming around that page.
> It is not about page moving and VM tricks, which are generally described
> as fragmentation avoidance technique, but about how fragmentation
> problem is solved in one page.
Short of defragmentation (moving active regions about), fragmentation is an
unsolved problem. For any heuristic there is a pattern that will defeat
it.
Luckily program allocation behaviour is usually very regular (or
decomposable into well-behaved groups).
> > > is very SMP
> > > friendly in that regard that it is per-cpu like slab and never free
> > > objects on different CPUs, so they always stay in the same cache.
> >
> > This makes it very hard to guarantee a reserve limit. (Not impossible,
> > just more difficult)
>
> The whole pool of pages becomes reserve, since no one (and mainly VFS)
> can consume that reserve.
Ah, but there you violate my requirement: any network allocation can
claim the last bit of memory. The whole idea was that the reserve is
explicitly managed.
It not only needs protection from other users but also from itself.
> > > Among other goodies it allows to have full sending/receiving zero-copy.
> >
> > That won't ever work unless you have page aligned objects, otherwise you
> > cannot map them into user-space. Which seems to be at odds with your
> > tight packing/reduce internal fragmentation goals.
> >
> > Zero-copy entails mapping the page the hardware writes the packet in
> > into user-space, right?
> >
> > Since it's impossible to predict to whom the next packet is addressed,
> > the packets must be written (by hardware) to different pages.
>
> Yes, receiving zero-copy without appropriate hardware assist is
> impossible, so you either have no such facility at all, or you pay special
> overhead which forces objects to lie in different pages. With hardware
> assist it would be possible to select a flow in advance, so data would be
> packed into the same page.
I was not aware that hardware could order the packets in such a fashion.
Yes, if it can do that it becomes doable.
> Sending zero-copy from userspace memory does not suffer with any such
> problem.
True, that is properly ordered. But for that I'm not sure how NTA (you
really should change that name, there is no Tree anymore) helps here.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Possible ways of dealing with OOM conditions.
2007-01-18 15:10 ` Peter Zijlstra
@ 2007-01-18 15:50 ` Evgeniy Polyakov
2007-01-18 17:31 ` Peter Zijlstra
0 siblings, 1 reply; 32+ messages in thread
From: Evgeniy Polyakov @ 2007-01-18 15:50 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, netdev, linux-mm, David Miller
On Thu, Jan 18, 2007 at 04:10:52PM +0100, Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> On Thu, 2007-01-18 at 16:58 +0300, Evgeniy Polyakov wrote:
>
> > Network is special in this regard, since it only has one allocation path
> > (actually it has one cache for skb, and usual kmalloc, but they are
> > called from only two functions).
> >
> > So it would become
> > ptr = network_alloc();
> > and network_alloc() would be usual kmalloc or call for own allocator in
> > case of deadlock.
>
> There is more to networking that skbs only, what about route cache,
> there is quite a lot of allocs in this fib_* stuff, IGMP etc...
skbs are the most extensively used path.
Actually the same applies to routes - dst_entries and rtables are
allocated through their own wrappers.
> > > very high order pages, no?
> > >
> > > This means that you have to either allocate at boot time and cannot
> > > resize/add pools; which means you waste all that memory if the network
> > > load never comes near using the reserved amount.
> > >
> > > Or, you get into all the same trouble the hugepages folks are trying so
> > > very hard to solve.
> >
> > It is configurable - by default it takes pool of 32k pages for allocations for
> > jumbo-frames (e1000 requires such allocations for 9k frames
> > unfortunately), without jumbo-frame support it works with pool of 0-order
> > pages, which grows dynamically when needed.
>
> With 0-order pages, you can only fit 2 1500 byte packets in there, you
> could perhaps stick some small skb heads in there as well, but why
> bother, the waste isn't _that_ high.
>
> Esp if you would make a slab for 1500 mtu packets (5*1638 < 2*4096; and
> 1638 should be enough, right?)
>
> It would make sense to pack related objects into a page so you could
> free all together.
With power-of-two allocation SLAB wastes roughly 500 bytes for each 1500-MTU
packet - that is actually one ACK packet - and I hear this from the
person who is developing a system aimed at guaranteeing ACK
allocation under OOM :)
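(A rough back-of-the-envelope illustration of where that overhead comes from,
not a measurement: the data area for a 1500-byte frame plus headroom and
struct skb_shared_info comes to a bit over 1.5 KB, which kmalloc rounds up to
the next power of two, 2048 bytes, so several hundred bytes per packet are
lost to rounding alone.)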
SLAB overhead is _very_ expensive for the network - what if jumbo frames are
used? It becomes incredible in that case, although modern NICs allow
scatter-gather, which is aimed at fixing the problem.
Cache misses for small-packet flows, due to the fact that the same data
is allocated, freed and accessed on different CPUs, will become an
issue soon - not right now, since two-to-four-core CPUs are not yet
very popular and the price for a cache miss is not _that_ high.
> > > > performs self-defragmentation of the memeory,
> > >
> > > Does it move memory about?
> >
> > It works in a page, not as pages - when neighbour regions are freed,
> > they are combined into single one with bigger size
>
> Yeah, that is not defragmentation, defragmentation is moving active
> regions about to create contiguous free space. What you do is free space
> coalescence.
That is the wrong definition, just because no one has developed a different
system. Defragmentation is a result of a broken system.
The existing design _does_not_ allow the situation where a whole page
belongs to the same cache after it has been actively used; the same applies
to the situation where several pages forming a contiguous region are used by
different users, so people start developing VM tricks
to move pages around so that they end up near each other in the address space.
Do not fix the result, fix the reason.
> > but the network stack requires high-order allocations only in extremely rare
> > cases of broken design (Intel folks, sorry, but your hardware sucks in
> > that regard - a jumbo frame of 9k should not require 16k of memory plus
> > network overhead).
>
> Well, if you have such hardware it's not rare at all. But yeah, that
> sucks.
They do a good job developing different approaches to work around that
hardware 'feature', but it is still a wrong situation.
> > NTA also does not align buffers to a power of two - the extremely significant
> > win of that approach can be seen on the project's homepage, with graphs of
> > failed allocations and the state of the memory for different sizes of
> > allocations. The power-of-two overhead of SLAB is extremely high.
>
> Sure you can pack the page a little better(*), but I thought the main
> advantage was a speed increase.
>
> (*) memory is generally cheaper than engineering efforts, esp on this
> scale. The only advantage in the manual packing is that (with the fancy
> hardware stream engine mentioned below) you could ensure they are
> grouped together (then again, the hardware stream engine would, together
> with a SG-DMA engine, take care of that).
That is the extensive way of doing things.
That is wrong.
> > > All it does is try to avoid fragmentation by policy - a problem
> > > impossible to solve in general; but can achieve good results in view of
> > > practical limitations on program behaviour.
> > >
> > > Does your policy work for the given workload? we'll see.
> > >
> > > Also, on what level, each level has both internal and external
> > > fragmentation. I can argue that having large immovable objects in memory
> > > adds to the fragmentation issues on the page-allocator level.
> >
> > NTA works with pages, not with contiguous memory; it reduces
> > fragmentation inside pages, which cannot be solved in SLAB, where
> > objects from the same page can live in different caches and thus can
> > _never_ be combined. Thus, the only solution for SLAB is copying, which is
> > not a good one for big sizes and is just wrong for big pages.
>
> By allocating, and never returning the page to the page-allocator you've
> increased the fragmentation on the page-allocator level significantly.
> It will avoid a super page ever forming around that page.
Not at all - SLAB fragmentation is so high that stealing pages from its
highly fragmented pool does not result in any loss or win for SLAB
users. And it is possible to allocate at boot time.
The NTA cache grows in _very_ rare cases, and it can be preallocated at
startup.
> > It is not about page moving and VM tricks, which are generally described
> > as fragmentation avoidance technique, but about how fragmentation
> > problem is solved in one page.
>
> Short of defragmentation (move active regions about) fragmentation is an
> unsolved problem. For any heuristic there is a pattern that will defeat
> it.
>
> Luckily program allocation behaviour is usually very regular (or
> decomposable in well behaved groups).
We are talking about different approaches here.
Per-page defragmentation by playing games with memory management is one
approach. Run-time defragmentation by grouping neighbouring regions is
another one.
The main issue is the fact that with the second one the requirement for the
first becomes MUCH smaller, since when an application, no matter how strange
its allocation pattern is, frees an object, it will be grouped with its
neighbours. In SLAB that will almost never happen, hence the situation with
memory tricks.
> > > > is very SMP
> > > > friendly in that regard that it is per-cpu like slab and never free
> > > > objects on different CPUs, so they always stay in the same cache.
> > >
> > > This makes it very hard to guarantee a reserve limit. (Not impossible,
> > > just more difficult)
> >
> > The whole pool of pages becomes reserve, since no one (and mainly VFS)
> > can consume that reserve.
>
> Ah, but there you violate my requirement, any network allocation can
> claim the last bit of memory. The whole idea was that the reserve is
> explicitly managed.
>
> It not only needs protection from other users but also from itself.
Specifying some users as good and others as bad generally tends to lead to
very bad behaviour. Your approach only covers some users; mine does not
differentiate between users, but prevents the system from such a situation at all.
> > > > Among other goodies it allows to have full sending/receiving zero-copy.
> > >
> > > That won't ever work unless you have page aligned objects, otherwise you
> > > cannot map them into user-space. Which seems to be at odds with your
> > > tight packing/reduce internal fragmentation goals.
> > >
> > > Zero-copy entails mapping the page the hardware writes the packet in
> > > into user-space, right?
> > >
> > > Since it's impossible to predict to whom the next packet is addressed,
> > > the packets must be written (by hardware) to different pages.
> >
> > Yes, receiving zero-copy without appropriate hardware assist is
> > impossible, so you either have no such facility at all, or you pay special
> > overhead which forces objects to lie in different pages. With hardware
> > assist it would be possible to select a flow in advance, so data would be
> > packed into the same page.
>
> I was not aware that hardware could order the packets in such a fashion.
> Yes, if it can do that it becomes doable.
Not the hardware, but the allocator, which can provide data with special
requirements like alignment and offset for a given flow id.
The hardware just provides the needed info and does the DMA transfer into the
specified area.
You can find more on receiving zero-copy with _emulation_ of such
hardware (MMIO copy of the header), for example, here:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=recv_zero_copy
where the system performed receiving of 1500-MTU-sized frames directly
into the VFS cache.
> > Sending zero-copy from userspace memory does not suffer with any such
> > problem.
>
> True, that is properly ordered. But for that I'm not sure how NTA (you
> really should change that name, there is no Tree anymore) helps here.
Because the user has access to the memory which will be used directly by the
hardware, it does not need to care about preallocation, although there are
problems with notification about completion of the operation, which can be
postponed if fancy egress filters are used.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Possible ways of dealing with OOM conditions.
2007-01-18 15:50 ` Evgeniy Polyakov
@ 2007-01-18 17:31 ` Peter Zijlstra
2007-01-18 18:34 ` Evgeniy Polyakov
2007-01-19 17:54 ` Christoph Lameter
0 siblings, 2 replies; 32+ messages in thread
From: Peter Zijlstra @ 2007-01-18 17:31 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: linux-kernel, netdev, linux-mm, David Miller
On Thu, 2007-01-18 at 18:50 +0300, Evgeniy Polyakov wrote:
> On Thu, Jan 18, 2007 at 04:10:52PM +0100, Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> > On Thu, 2007-01-18 at 16:58 +0300, Evgeniy Polyakov wrote:
> >
> > > Network is special in this regard, since it only has one allocation path
> > > (actually it has one cache for skb, and usual kmalloc, but they are
> > > called from only two functions).
> > >
> > > So it would become
> > > ptr = network_alloc();
> > > and network_alloc() would be usual kmalloc or call for own allocator in
> > > case of deadlock.
> >
> > There is more to networking that skbs only, what about route cache,
> > there is quite a lot of allocs in this fib_* stuff, IGMP etc...
>
> skbs are the most extensively used path.
> Actually the same is applied to route - dst_entries and rtable are
> allocated through own wrappers.
Still, either edit all the places - and perhaps forget one - and make sure all
new code doesn't forget about it, or pick a solution that covers everything.
> > With power-of-two allocation SLAB wastes roughly 500 bytes for each 1500-MTU
> > packet - that is actually one ACK packet - and I hear this from the
> > person who is developing a system aimed at guaranteeing ACK
> > allocation under OOM :)
I need full data traffic during OOM, not just a single ACK.
> SLAB overhead is _very_ expensive for network - what if jumbo frame is
> used? It becomes incredible in that case, although modern NICs allows
> scatter-gather, which is aimed to fix the problem.
Jumbo frames are fine if the hardware can do SG-DMA..
> Cache misses for small packet flow due to the fact, that the same data
> is allocated and freed and accessed on different CPUs will become an
> issue soon, not right now, since two-four core CPUs are not yet to be
> very popular and price for the cache miss is not _that_ high.
SGI does networking too, right?
> > > > > performs self-defragmentation of the memeory,
> > > >
> > > > Does it move memory about?
> > >
> > > It works in a page, not as pages - when neighbour regions are freed,
> > > they are combined into single one with bigger size
> >
> > Yeah, that is not defragmentation, defragmentation is moving active
> > regions about to create contiguous free space. What you do is free space
> > coalescence.
>
> That is wrong definition just because no one developed different system.
> Defragmentation is a result of broken system.
>
> Existing design _does_not_ allow to have the situation when whole page
> belongs to the same cache after it was actively used, the same is
> applied to the situation when several pages, which create contiguous
> region, are used by different users, so people start develop VM tricks
> to move pages around so they would be placed near in address space.
>
> Do not fix the result, fix the reason.
*plonk* 30+yrs of research ignored.
> > > The whole pool of pages becomes reserve, since no one (and mainly VFS)
> > > can consume that reserve.
> >
> > Ah, but there you violate my requirement, any network allocation can
> > claim the last bit of memory. The whole idea was that the reserve is
> > explicitly managed.
> >
> > It not only needs protection from other users but also from itself.
>
> Specifying some users as good and others as bad generally tends to very
> bad behaviour. Your appwoach only covers some users, mine does not
> differentiate between users,
The kernel is special, right? It has priority over whatever user-land
does.
> but prevents system from such situation at all.
I'm not seeing that; with your approach nobody stops the kernel from
filling up the memory with user-space network traffic.
Swapping is not some random user process, it's a fundamental kernel task;
if this fails the machine is history.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Possible ways of dealing with OOM conditions.
2007-01-18 17:31 ` Peter Zijlstra
@ 2007-01-18 18:34 ` Evgeniy Polyakov
2007-01-19 12:53 ` Peter Zijlstra
2007-01-19 17:54 ` Christoph Lameter
1 sibling, 1 reply; 32+ messages in thread
From: Evgeniy Polyakov @ 2007-01-18 18:34 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, netdev, linux-mm, David Miller
On Thu, Jan 18, 2007 at 06:31:53PM +0100, Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> > skbs are the most extensively used path.
> > Actually the same is applied to route - dst_entries and rtable are
> > allocated through own wrappers.
>
> Still, edit all places and perhaps forget one and make sure all new code
> doesn't forget about it, or pick a solution that covers everything.
There is _one_ place for allocation of any kind of object.
skb path has two places.
> > With power-of-two allocation SLAB wastes roughly 500 bytes for each 1500-MTU
> > packet - that is actually one ACK packet - and I hear this from the
> > person who is developing a system aimed at guaranteeing ACK
> > allocation under OOM :)
>
> I need full data traffic during OOM, not just a single ACK.
But your code limits the codepath to exactly several allocations, which
must be ACKs. You do not have enough reserve to support the whole traffic.
So the right solution, IMO, is to _prevent_ such a situation, which means
that allocation is not allowed to depend on external conditions like the
VFS.
Actually my sentences above were about the case where, only by having a
different allocator, it is possible to dramatically change the memory usage
model, which suffers greatly from power-of-two allocations. The OOM
condition is one of the results which has big SLAB overhead among its other
roots. Actually all paths which work with a kmem_cache are safe against
it, since the kernel cache packs objects, but those who use raw kmalloc have
problems.
> > SLAB overhead is _very_ expensive for network - what if jumbo frame is
> > used? It becomes incredible in that case, although modern NICs allows
> > scatter-gather, which is aimed to fix the problem.
>
> Jumbo frames are fine if the hardware can do SG-DMA..
Notice the word _IF_ in your sentence. e1000 for example cannot (or it can,
but the driver is not developed for such a scenario).
> > Cache misses for small packet flow due to the fact, that the same data
> > is allocated and freed and accessed on different CPUs will become an
> > issue soon, not right now, since two-four core CPUs are not yet to be
> > very popular and price for the cache miss is not _that_ high.
>
> SGI does networking too, right?
Yep, Christoph Lameter developed his own allocator too.
I agree with you that if that price is already too high, then it is an
additional sign to look into the network tree allocator (yep, the name is bad)
again.
> > That is wrong definition just because no one developed different system.
> > Defragmentation is a result of broken system.
> >
> > Existing design _does_not_ allow to have the situation when whole page
> > belongs to the same cache after it was actively used, the same is
> > applied to the situation when several pages, which create contiguous
> > region, are used by different users, so people start develop VM tricks
> > to move pages around so they would be placed near in address space.
> >
> > Do not fix the result, fix the reason.
>
> *plonk* 30+yrs of research ignored.
30 years to develop the SLAB allocator? What universe is that all about?
> > > > The whole pool of pages becomes reserve, since no one (and mainly VFS)
> > > > can consume that reserve.
> > >
> > > Ah, but there you violate my requirement, any network allocation can
> > > claim the last bit of memory. The whole idea was that the reserve is
> > > explicitly managed.
> > >
> > > It not only needs protection from other users but also from itself.
> >
> > Specifying some users as good and others as bad generally tends to very
> > bad behaviour. Your appwoach only covers some users, mine does not
> > differentiate between users,
>
> The kernel is special, right? It has priority over whatever user-land
> does.
The kernel only does ACK generation and allocation for userspace.
The kernel does not know whether some of the users are potentially good or
bad, and if you export this socket option to userspace, everyone will
think that his application is good enough to use the reserve.
So, for the kernel-only side you just need to preallocate a pool of packets
and use them when the system is in OOM (reclaim). In the long run, a
new approach to memory allocation should be developed, and there are
different works in that direction - NTA is one of them and not the only
one; for the best results it must be combined with vm-tricks
defragmentation too.
> > but prevents system from such situation at all.
>
> I'm not seeing that, with your approach nobody stops the kernel from
> filling up the memory with user-space network traffic.
>
> Swapping is not some random user process, it's a fundamental kernel task;
> if this fails the machine is history.
You completely miss the point. The main goals are to
1. reduce fragmentation and/or enable self-defragmentation (which is
done in NTA); this also reduces memory usage.
2. perform correct recovery steps in OOM - reduce memory usage, use a
different allocator and/or reserve (which is the case where NTA can be
used).
3. not allow the OOM condition at all - unfortunately this is not always
possible, but having separated allocation means not depending on external
conditions such as VFS memory usage, and thus this approach reduces the
conditions under which a memory deadlock related to the network path can happen.
Let me briefly describe your approach and possible drawbacks in it.
You start reserving some memory when the system is under memory pressure.
When the system is in real trouble, you start using that reserve for special
tasks, mainly for the network path, to allocate packets and process them in
order to get some memory swapping committed.
So, the problems I see here are the following:
1. It is possible that when you are starting to create a reserve, there
will not be enough memory at all. So the solution is to reserve in
advance.
2. You differentiate by hand between critical and non-critical
allocations by specifying some kernel users as potentially allowed to
allocate from the reserve. This does not prevent the NVIDIA module from
allocating from that reserve too, does it? And you artificially limit the
system to processing only tiny bits of what it must do, thus potentially
leaking paths which must use the reserve too.
So, the solution is to have a reserve in advance, and manage it using a
special path when the system is in OOM. So you will have a network memory
reserve, which will be used when the system is in trouble. It is very
similar to what you had.
But the whole reserve can never be used at all, so it should be used -
but not by those who can create the OOM condition; thus it should be
exported, for example, to the network only, and when the system is in trouble
the network would still be functional (although only the critical paths).
Even further development of such an idea is to prevent such an OOM condition
at all - by starting swapping early (but wisely) and reducing memory
usage.
The network tree allocator covers exactly the above cases.
Here the advertisement is over.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Possible ways of dealing with OOM conditions.
2007-01-18 18:34 ` Evgeniy Polyakov
@ 2007-01-19 12:53 ` Peter Zijlstra
2007-01-19 22:56 ` Evgeniy Polyakov
0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2007-01-19 12:53 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: linux-kernel, netdev, linux-mm, David Miller
> Let me briefly describe your approach and possible drawbacks in it.
> You start reserving some memory when the system is under memory pressure.
> When the system is in real trouble, you start using that reserve for special
> tasks, mainly for the network path, to allocate packets and process them in
> order to get some memory swapping committed.
>
> So, the problems I see here are the following:
> 1. It is possible that when you are starting to create a reserve, there
> will not be enough memory at all. So the solution is to reserve in
> advance.
Swap is usually enabled at startup, but sure, if you want you can mess
this up.
> 2. You differentiate by hand between critical and non-critical
> allocations by specifying some kernel users as potentially possible to
> allocate from reserve.
True, all sockets that are needed for swap, no-one else.
> This does not prevent from NVIDIA module to
> allocate from that reserve too, does it?
All users of the NVidiot crap deserve all the pain they get.
If it breaks they get to keep both pieces.
> And you artificially limit the
> system to processing only tiny bits of what it must do, thus potentially
> leaking paths which must use the reserve too.
How so? I cover pretty much every allocation needed to process an skb by
setting PF_MEMALLOC - the only drawback there is that the reserve might
not actually be large enough because it covers more allocations than
were considered. (That's one of the TODO items: validate the reserve
function's parameters.)
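A rough illustration of what 'setting PF_MEMALLOC' around packet processing
amounts to - a sketch of the pattern only, not the actual patch:

        /* Mark this context as working on behalf of the VM so that its
         * allocations may dip into the reserve, then restore the old state. */
        unsigned long pflags = current->flags;

        current->flags |= PF_MEMALLOC;
        /* ... process the emergency skb here ... */
        current->flags = (current->flags & ~PF_MEMALLOC) | (pflags & PF_MEMALLOC);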
> So, solution is to have a reserve in advance, and manage it using
> special path when system is in OOM. So you will have network memory
> reserve, which will be used when system is in trouble. It is very
> similar to what you had.
>
> > But the whole reserve can never be used at all, so it should be used -
> > but not by those who can create the OOM condition; thus it should be
> > exported, for example, to the network only, and when the system is in
> > trouble the network would still be functional (although only the critical paths).
But the network can create OOM conditions for itself just fine.
Consider the remote storage disappearing for a while (it got rebooted,
someone tripped over the wire etc..). Now the rest of the network
traffic keeps coming and will queue up - because user-space is stalled,
waiting for more memory - and we run out of memory.
There must be a point where we start dropping packets that are not
critical to the survival of the machine.
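To make that concrete, the test meant here is roughly the following - a
sketch only; skb_emergency() and sk_has_memalloc() are illustrative names in
the spirit of this patchset, not settled API:

        /* Early in receive processing: drop packets that came out of the
         * emergency reserve but are not for a socket servicing the VM. */
        if (skb_emergency(skb) && !sk_has_memalloc(sk)) {
                kfree_skb(skb);
                return 0;       /* dropped: not critical to the machine's survival */
        }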
> Even further development of such idea is to prevent such OOM condition
> at all - by starting swapping early (but wisely) and reduce memory
> usage.
These just postpone execution but will not avoid it.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Possible ways of dealing with OOM conditions.
2007-01-18 17:31 ` Peter Zijlstra
2007-01-18 18:34 ` Evgeniy Polyakov
@ 2007-01-19 17:54 ` Christoph Lameter
1 sibling, 0 replies; 32+ messages in thread
From: Christoph Lameter @ 2007-01-19 17:54 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Evgeniy Polyakov, linux-kernel, netdev, linux-mm, David Miller
On Thu, 18 Jan 2007, Peter Zijlstra wrote:
>
> > Cache misses for small packet flow due to the fact, that the same data
> > is allocated and freed and accessed on different CPUs will become an
> > issue soon, not right now, since two-four core CPUs are not yet to be
> > very popular and price for the cache miss is not _that_ high.
>
> SGI does networking too, right?
Sslab deals with those issues the right way. We have per-processor
queues that attempt to keep the cache-hot state. A special shared queue
exists between neighboring processors to facilitate the exchange of objects
between them.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Possible ways of dealing with OOM conditions.
2007-01-19 12:53 ` Peter Zijlstra
@ 2007-01-19 22:56 ` Evgeniy Polyakov
2007-01-20 22:36 ` Rik van Riel
0 siblings, 1 reply; 32+ messages in thread
From: Evgeniy Polyakov @ 2007-01-19 22:56 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, netdev, linux-mm, David Miller
On Fri, Jan 19, 2007 at 01:53:15PM +0100, Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> > 2. You differentiate by hand between critical and non-critical
> > allocations by specifying some kernel users as potentially possible to
> > allocate from reserve.
>
> True, all sockets that are needed for swap, no-one else.
>
> > This does not prevent from NVIDIA module to
> > allocate from that reserve too, does it?
>
> All users of the NVidiot crap deserve all the pain they get.
> If it breaks they get to keep both pieces.
I meant that pretty much anyone can be one of those users, by just adding a
bit to their own gfp_flags which are used for allocation.
> > And you artificially limit
> > system to process only tiny bits of what it must do, thus potentially
> > leaking pathes which must use reserve too.
>
> How so? I cover pretty much every allocation needed to process an skb by
> setting PF_MEMALLOC - the only drawback there is that the reserve might
> not actually be large enough because it covers more allocations than
> were considered. (That's one of the TODO items: validate the reserve
> function's parameters.)
You only covered ipv4/v6 and ARP, and maybe some route updates.
But it is very possible that some allocations are missed, like
multicast/broadcast. Selecting only special paths out of all the
possible network allocations tends to create a situation where something
is missed or cross-dependent on other paths.
> > So, solution is to have a reserve in advance, and manage it using
> > special path when system is in OOM. So you will have network memory
> > reserve, which will be used when system is in trouble. It is very
> > similar to what you had.
> >
> > But the whole reserve can never be used at all, so it should be used,
> > but not by those who can create OOM condition, thus it should be
> > exported to, for example, network only, and when system is in trouble,
> > network would be still functional (although only critical pathes).
>
> But the network can create OOM conditions for itself just fine.
>
> Consider the remote storage disappearing for a while (it got rebooted,
> someone tripped over the wire etc..). Now the rest of the network
> traffic keeps coming and will queue up - because user-space is stalled,
> waiting for more memory - and we run out of memory.
Hmm... Neither UDP nor TCP actually works that way.
> There must be a point where we start dropping packets that are not
> critical to the survival of the machine.
You can still drop them; the main point is that network allocations do
not depend on other allocations.
> > Even further development of such idea is to prevent such OOM condition
> > at all - by starting swapping early (but wisely) and reduce memory
> > usage.
>
> These just postpone execution but will not avoid it.
No. If the system allows such a condition to arise, then
something is broken. It must be prevented, instead of creating special
hacks to recover from it.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Possible ways of dealing with OOM conditions.
2007-01-19 22:56 ` Evgeniy Polyakov
@ 2007-01-20 22:36 ` Rik van Riel
2007-01-21 1:46 ` Evgeniy Polyakov
0 siblings, 1 reply; 32+ messages in thread
From: Rik van Riel @ 2007-01-20 22:36 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Peter Zijlstra, linux-kernel, netdev, linux-mm, David Miller
Evgeniy Polyakov wrote:
> On Fri, Jan 19, 2007 at 01:53:15PM +0100, Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
>>> Even further development of such idea is to prevent such OOM condition
>>> at all - by starting swapping early (but wisely) and reduce memory
>>> usage.
>> These just postpone execution but will not avoid it.
>
> No. If system allows to have such a condition, then
> something is broken. It must be prevented, instead of creating special
> hacks to recover from it.
Evgeniy, you may want to learn something about the VM before
stating that reality should not occur.
Due to the way everything in the kernel works, you cannot
prevent the memory allocator from allocating everything and
running out, except maybe by setting aside reserves to deal
with special subsystems.
As for your "swapping early and reduce memory usage", that is
just not possible in a system where a memory writeout may need
one or more memory allocations to succeed and other I/O paths
(eg. file writes) can take memory from the same pools.
With something like iscsi it may be _necessary_ for file writes
and swap to take memory from the same pools, because they can
share the same block device.
Please get out of your fantasy world and accept the constraints
the VM has to operate under. Maybe then you and Peter can agree
on something.
--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Possible ways of dealing with OOM conditions.
2007-01-20 22:36 ` Rik van Riel
@ 2007-01-21 1:46 ` Evgeniy Polyakov
2007-01-21 2:14 ` Evgeniy Polyakov
2007-01-21 16:30 ` Rik van Riel
0 siblings, 2 replies; 32+ messages in thread
From: Evgeniy Polyakov @ 2007-01-21 1:46 UTC (permalink / raw)
To: Rik van Riel; +Cc: Peter Zijlstra, linux-kernel, netdev, linux-mm, David Miller
On Sat, Jan 20, 2007 at 05:36:03PM -0500, Rik van Riel (riel@surriel.com) wrote:
> Evgeniy Polyakov wrote:
> >On Fri, Jan 19, 2007 at 01:53:15PM +0100, Peter Zijlstra
> >(a.p.zijlstra@chello.nl) wrote:
>
> >>>Even further development of such idea is to prevent such OOM condition
> >>>at all - by starting swapping early (but wisely) and reduce memory
> >>>usage.
> >>These just postpone execution but will not avoid it.
> >
> >No. If system allows to have such a condition, then
> >something is broken. It must be prevented, instead of creating special
> >hacks to recover from it.
>
> Evgeniy, you may want to learn something about the VM before
> stating that reality should not occur.
I.e. I should start believing that OOM cannot be prevented, bugs cannot
be fixed and things cannot be changed, just because that is how it happens
right now? That is why I'm not subscribed to lkml :)
> Due to the way everything in the kernel works, you cannot
> prevent the memory allocator from allocating everything and
> running out, except maybe by setting aside reserves to deal
> with special subsystems.
>
> As for your "swapping early and reduce memory usage", that is
> just not possible in a system where a memory writeout may need
> one or more memory allocations to succeed and other I/O paths
> (eg. file writes) can take memory from the same pools.
When a system starts swapping only when it cannot allocate a new page,
then it is a broken system. I bet you get warm clothing way before your
hands are frostbitten, and you do not keep a liter of alcohol in your
pocket for such an emergency. And to get warm clothing you still need to
cross the cold street into the shop, but you will do it before the weather
becomes arctic.
> With something like iscsi it may be _necessary_ for file writes
> and swap to take memory from the same pools, because they can
> share the same block device.
Of course swapping can require additional allocations; when it happens
over the network that is quite obvious.
The main problem is the fact that if the system was put into a state
where its life depends on the last possible allocation, then it is
broken.
There is a light connected to a car's fuel tank which starts blinking
when the amount of fuel is less than a predefined level. The car does not
just stop suddenly and start taking fuel from a reserve (well, eventually it
stops, but it tells you about the problem long before it dies).
> Please get out of your fantasy world and accept the constraints
> the VM has to operate under. Maybe then you and Peter can agree
> on something.
I cannot accept a situation where the problem is not fixed, but instead
a recovery path is added. There must be both ways of dealing with it -
emergency force majeure recovery and preventive steps.
What we are talking about (apart from pointing at obvious things and sending
people back to school), at least how I see this, is ways of dealing with
a possible OOM condition. If OOM has happened, then there must be a recovery
path, but OOM must also be prevented, and ways to do this were described too.
> --
> Politics is the struggle between those who want to make their country
> the best in the world, and those who believe it already is. Each group
> calls the other unpatriotic.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Possible ways of dealing with OOM conditions.
2007-01-21 1:46 ` Evgeniy Polyakov
@ 2007-01-21 2:14 ` Evgeniy Polyakov
2007-01-21 16:30 ` Rik van Riel
1 sibling, 0 replies; 32+ messages in thread
From: Evgeniy Polyakov @ 2007-01-21 2:14 UTC (permalink / raw)
To: Rik van Riel; +Cc: Peter Zijlstra, linux-kernel, netdev, linux-mm, David Miller
> On Sat, Jan 20, 2007 at 05:36:03PM -0500, Rik van Riel (riel@surriel.com) wrote:
> > Due to the way everything in the kernel works, you cannot
> > prevent the memory allocator from allocating everything and
> > running out, except maybe by setting aside reserves to deal
> > with special subsystems.
As far as the technical side goes, this is exactly the way I proposed -
there is a special dedicated pool which does not depend on the main system
allocator, so if the latter is empty, the former still _can_ work,
although it is possible that it will be empty too.
Separation.
It removes the avalanche effect where one problem produces several different
ones.
I do not say that some particular allocator is the best for dealing with such
a situation; I just pointed out that the critical paths were separated in NTA,
so they do not depend on each other's failure.
Actually that separation was introduced long ago with memory pools; this is
some kind of continuation, which adds a lot of additional extremely useful
features.
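For reference, the classic mempool pattern being alluded to looks roughly like
this - a minimal usage sketch with arbitrary sizes, not tied to NTA or to this
patchset:

#include <linux/mempool.h>
#include <linux/slab.h>

static mempool_t *reserve_pool;

/* Set aside 16 elements of 256 bytes up front; when a normal allocation
 * fails, mempool_alloc() falls back to this pre-allocated reserve. */
static int __init reserve_init(void)
{
        reserve_pool = mempool_create_kmalloc_pool(16, 256);
        return reserve_pool ? 0 : -ENOMEM;
}

static void critical_path(void)
{
        void *obj = mempool_alloc(reserve_pool, GFP_ATOMIC);

        if (!obj)
                return;         /* even the reserve can run dry */
        /* ... do the critical work with obj ... */
        mempool_free(obj, reserve_pool);
}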
NTA, used for network allocations, is that kind of pool, since in real life
packets cannot be allocated in advance without memory overhead. For
simple situations like ACK generation only it is possible, which I
suggested first, but the long-term solution is a special allocator.
I selected NTA for this task because it has _additional_ features like
self-defragmentation, which is a very useful part for networking, but if
only the OOM recovery condition is concerned, then actually any other
allocator can be used, of course.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Possible ways of dealing with OOM conditions.
2007-01-21 1:46 ` Evgeniy Polyakov
2007-01-21 2:14 ` Evgeniy Polyakov
@ 2007-01-21 16:30 ` Rik van Riel
1 sibling, 0 replies; 32+ messages in thread
From: Rik van Riel @ 2007-01-21 16:30 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Peter Zijlstra, linux-kernel, netdev, linux-mm, David Miller
Evgeniy Polyakov wrote:
> On Sat, Jan 20, 2007 at 05:36:03PM -0500, Rik van Riel (riel@surriel.com) wrote:
>> Evgeniy Polyakov wrote:
>>> On Fri, Jan 19, 2007 at 01:53:15PM +0100, Peter Zijlstra
>>> (a.p.zijlstra@chello.nl) wrote:
>>>>> Even further development of such idea is to prevent such OOM condition
>>>>> at all - by starting swapping early (but wisely) and reduce memory
>>>>> usage.
>>>> These just postpone execution but will not avoid it.
>>> No. If system allows to have such a condition, then
>>> something is broken. It must be prevented, instead of creating special
>>> hacks to recover from it.
>> Evgeniy, you may want to learn something about the VM before
>> stating that reality should not occur.
>
> I.e. I should start believing that OOM can not be prevented, bugs can
> not be fixed and things can not be changed just because it happens right
> now? That is why I'm not subscribed to lkml :)
The reasons for this are often not inside the VM itself,
but are due to the constraints imposed on the VM.
For example, with many of the journaled filesystems there
is no way to know in advance how much IO needs to be done
to complete a writeout of one dirty page (and consequently,
how much memory needs to be allocated to complete this one
writeout).
Parts of the VM could be changed to reduce the pressure
somewhat, eg. limiting the number of IOs in flight, but
that will probably have performance consequences that may
not be acceptable to Andrew and Linus and never get merged.
--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
^ permalink raw reply [flat|nested] 32+ messages in thread