LKML Archive on lore.kernel.org
* [patch 0/7] cpu alloc stage 2
@ 2008-11-05 23:16 Christoph Lameter
  2008-11-05 23:16 ` [patch 1/7] Increase default reserve percpu area Christoph Lameter
                   ` (8 more replies)
  0 siblings, 9 replies; 29+ messages in thread
From: Christoph Lameter @ 2008-11-05 23:16 UTC (permalink / raw)
  To: akpm
  Cc: Pekka Enberg, linux-kernel, linux-mm, travis, Stephen Rothwell,
	Vegard Nossum

The second stage of the cpu_alloc patchset can be pulled from

git.kernel.org/pub/scm/linux/kernel/git/christoph/work.git cpu_alloc_stage2

Stage 2 includes the conversion of the page allocator and the SLUB allocator
to the cpu allocator.

It also includes the core of the atomic vs. interrupt-safe cpu ops and uses
them for the VM statistics.

-- 


* [patch 1/7] Increase default reserve percpu area
  2008-11-05 23:16 [patch 0/7] cpu alloc stage 2 Christoph Lameter
@ 2008-11-05 23:16 ` Christoph Lameter
  2008-11-05 23:16 ` [patch 2/7] cpu alloc: Use in slub Christoph Lameter
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Christoph Lameter @ 2008-11-05 23:16 UTC (permalink / raw)
  To: akpm
  Cc: Pekka Enberg, linux-kernel, linux-mm, travis, Stephen Rothwell,
	Vegard Nossum

[-- Attachment #1: cpu_alloc_increase_percpu_default --]
[-- Type: text/plain, Size: 1418 bytes --]

SLUB now requires a portion of the per cpu reserve. There are on average
about 70 real slabs on a system (aliases do not count) and each needs 12 bytes
of per cpu space. That's 840 bytes. In debug mode all slabs become real slabs,
which takes us to roughly 150 slabs, i.e. about 1800 bytes.

Things work fine without this patch, but then SLUB reduces the percpu reserve
available for modules.

Percpu reserve space must be available regardless of whether modules are in
use. So get rid of the #ifdef CONFIG_MODULES.

Make the size of the percpu reserve dependent on the size of a machine word,
so that 64 bit machines get a larger reserve. 64 bit machines need more percpu
memory since pointers and counters may be twice the size, and plenty of memory
is available on 64 bit anyway.
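
For illustration (plain arithmetic, not part of the patch): the reserve used
to be a fixed 8192 bytes; with this change it becomes

	PERCPU_RESERVE_SIZE = sizeof(unsigned long) * 2500
	                    = 10000 bytes on 32 bit
	                    = 20000 bytes on 64 bit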

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2008-11-05 12:05:46.000000000 -0600
+++ linux-2.6/include/linux/percpu.h	2008-11-05 14:29:15.000000000 -0600
@@ -44,7 +44,7 @@
 extern unsigned int percpu_reserve;
 /* Enough to cover all DEFINE_PER_CPUs in kernel, including modules. */
 #ifndef PERCPU_AREA_SIZE
-#define PERCPU_RESERVE_SIZE	8192
+#define PERCPU_RESERVE_SIZE   (sizeof(unsigned long) * 2500)
 
 #define PERCPU_AREA_SIZE						\
 	(__per_cpu_end - __per_cpu_start + percpu_reserve)

-- 


* [patch 2/7] cpu alloc: Use in slub
  2008-11-05 23:16 [patch 0/7] cpu alloc stage 2 Christoph Lameter
  2008-11-05 23:16 ` [patch 1/7] Increase default reserve percpu area Christoph Lameter
@ 2008-11-05 23:16 ` Christoph Lameter
  2008-11-05 23:16 ` [patch 3/7] cpu alloc: Remove slub fields Christoph Lameter
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Christoph Lameter @ 2008-11-05 23:16 UTC (permalink / raw)
  To: akpm
  Cc: Pekka Enberg, linux-kernel, linux-mm, travis, Stephen Rothwell,
	Vegard Nossum

[-- Attachment #1: cpu_alloc_slub_conversion --]
[-- Type: text/plain, Size: 10371 bytes --]

Using cpu alloc removes the need for the per cpu arrays in the kmem_cache
struct. These could get quite big if we have to support systems with up to
thousands of cpus. The use of cpu_alloc means that (a short usage sketch
follows the list):

1. The size of kmem_cache for SMP configurations shrinks since we only need
   one pointer instead of NR_CPUS pointers. The same pointer can be used by
   all processors, which reduces the cache footprint of the allocator.

2. We can dynamically size kmem_cache according to the nodes actually present
   in the system, meaning less memory overhead for configurations that may
   potentially support up to 1k NUMA nodes / 4k cpus.

3. We can remove the fiddling with allocating and releasing kmem_cache_cpu
   structures when bringing up and shutting down cpus. The cpu alloc logic
   does it all for us. This removes some portions of the cpu hotplug
   functionality.

4. Fastpath performance increases.
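
As a rough sketch, built only from the interfaces used in this patch
(CPU_ALLOC, THIS_CPU, CPU_PTR, CPU_FREE) and not a complete code path:

	/* One allocation provides an instance for every possible cpu */
	s->cpu_slab = CPU_ALLOC(struct kmem_cache_cpu, GFP_KERNEL);

	/* Fast path: the instance of the executing processor */
	struct kmem_cache_cpu *c = THIS_CPU(s->cpu_slab);

	/* Slow paths and statistics: the instance of a particular cpu */
	struct kmem_cache_cpu *d = CPU_PTR(s->cpu_slab, cpu);

	/* Teardown when the cache is destroyed */
	CPU_FREE(s->cpu_slab);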

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2008-10-08 11:09:12.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2008-10-23 15:30:26.000000000 -0500
@@ -68,6 +68,7 @@
  * Slab cache management.
  */
 struct kmem_cache {
+	struct kmem_cache_cpu *cpu_slab;
 	/* Used for retriving partial slabs etc */
 	unsigned long flags;
 	int size;		/* The size of an object including meta data */
@@ -102,11 +103,6 @@
 	int remote_node_defrag_ratio;
 	struct kmem_cache_node *node[MAX_NUMNODES];
 #endif
-#ifdef CONFIG_SMP
-	struct kmem_cache_cpu *cpu_slab[NR_CPUS];
-#else
-	struct kmem_cache_cpu cpu_slab;
-#endif
 };
 
 /*
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2008-10-23 15:21:50.000000000 -0500
+++ linux-2.6/mm/slub.c	2008-10-23 15:30:26.000000000 -0500
@@ -227,15 +227,6 @@
 #endif
 }
 
-static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
-{
-#ifdef CONFIG_SMP
-	return s->cpu_slab[cpu];
-#else
-	return &s->cpu_slab;
-#endif
-}
-
 /* Verify that a pointer has an address that is valid within a slab page */
 static inline int check_valid_pointer(struct kmem_cache *s,
 				struct page *page, const void *object)
@@ -1088,7 +1079,7 @@
 		if (!page)
 			return NULL;
 
-		stat(get_cpu_slab(s, raw_smp_processor_id()), ORDER_FALLBACK);
+		stat(THIS_CPU(s->cpu_slab), ORDER_FALLBACK);
 	}
 	page->objects = oo_objects(oo);
 	mod_zone_page_state(page_zone(page),
@@ -1365,7 +1356,7 @@
 static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
 {
 	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
-	struct kmem_cache_cpu *c = get_cpu_slab(s, smp_processor_id());
+	struct kmem_cache_cpu *c = THIS_CPU(s->cpu_slab);
 
 	__ClearPageSlubFrozen(page);
 	if (page->inuse) {
@@ -1397,7 +1388,7 @@
 			slab_unlock(page);
 		} else {
 			slab_unlock(page);
-			stat(get_cpu_slab(s, raw_smp_processor_id()), FREE_SLAB);
+			stat(__THIS_CPU(s->cpu_slab), FREE_SLAB);
 			discard_slab(s, page);
 		}
 	}
@@ -1450,7 +1441,7 @@
  */
 static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 {
-	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+	struct kmem_cache_cpu *c = CPU_PTR(s->cpu_slab, cpu);
 
 	if (likely(c && c->page))
 		flush_slab(s, c);
@@ -1553,7 +1544,7 @@
 		local_irq_disable();
 
 	if (new) {
-		c = get_cpu_slab(s, smp_processor_id());
+		c = __THIS_CPU(s->cpu_slab);
 		stat(c, ALLOC_SLAB);
 		if (c->page)
 			flush_slab(s, c);
@@ -1589,24 +1580,22 @@
 	void **object;
 	struct kmem_cache_cpu *c;
 	unsigned long flags;
-	unsigned int objsize;
 
 	local_irq_save(flags);
-	c = get_cpu_slab(s, smp_processor_id());
-	objsize = c->objsize;
-	if (unlikely(!c->freelist || !node_match(c, node)))
+	c = __THIS_CPU(s->cpu_slab);
+	object = c->freelist;
+	if (unlikely(!object || !node_match(c, node)))
 
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-		object = c->freelist;
 		c->freelist = object[c->offset];
 		stat(c, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
 
 	if (unlikely((gfpflags & __GFP_ZERO) && object))
-		memset(object, 0, objsize);
+		memset(object, 0, s->objsize);
 
 	return object;
 }
@@ -1640,7 +1629,7 @@
 	void **object = (void *)x;
 	struct kmem_cache_cpu *c;
 
-	c = get_cpu_slab(s, raw_smp_processor_id());
+	c = __THIS_CPU(s->cpu_slab);
 	stat(c, FREE_SLOWPATH);
 	slab_lock(page);
 
@@ -1711,7 +1700,7 @@
 	unsigned long flags;
 
 	local_irq_save(flags);
-	c = get_cpu_slab(s, smp_processor_id());
+	c = __THIS_CPU(s->cpu_slab);
 	debug_check_no_locks_freed(object, c->objsize);
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
 		debug_check_no_obj_freed(object, s->objsize);
@@ -1938,130 +1927,19 @@
 #endif
 }
 
-#ifdef CONFIG_SMP
-/*
- * Per cpu array for per cpu structures.
- *
- * The per cpu array places all kmem_cache_cpu structures from one processor
- * close together meaning that it becomes possible that multiple per cpu
- * structures are contained in one cacheline. This may be particularly
- * beneficial for the kmalloc caches.
- *
- * A desktop system typically has around 60-80 slabs. With 100 here we are
- * likely able to get per cpu structures for all caches from the array defined
- * here. We must be able to cover all kmalloc caches during bootstrap.
- *
- * If the per cpu array is exhausted then fall back to kmalloc
- * of individual cachelines. No sharing is possible then.
- */
-#define NR_KMEM_CACHE_CPU 100
-
-static DEFINE_PER_CPU(struct kmem_cache_cpu,
-				kmem_cache_cpu)[NR_KMEM_CACHE_CPU];
-
-static DEFINE_PER_CPU(struct kmem_cache_cpu *, kmem_cache_cpu_free);
-static cpumask_t kmem_cach_cpu_free_init_once = CPU_MASK_NONE;
-
-static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s,
-							int cpu, gfp_t flags)
-{
-	struct kmem_cache_cpu *c = per_cpu(kmem_cache_cpu_free, cpu);
-
-	if (c)
-		per_cpu(kmem_cache_cpu_free, cpu) =
-				(void *)c->freelist;
-	else {
-		/* Table overflow: So allocate ourselves */
-		c = kmalloc_node(
-			ALIGN(sizeof(struct kmem_cache_cpu), cache_line_size()),
-			flags, cpu_to_node(cpu));
-		if (!c)
-			return NULL;
-	}
-
-	init_kmem_cache_cpu(s, c);
-	return c;
-}
-
-static void free_kmem_cache_cpu(struct kmem_cache_cpu *c, int cpu)
-{
-	if (c < per_cpu(kmem_cache_cpu, cpu) ||
-			c > per_cpu(kmem_cache_cpu, cpu) + NR_KMEM_CACHE_CPU) {
-		kfree(c);
-		return;
-	}
-	c->freelist = (void *)per_cpu(kmem_cache_cpu_free, cpu);
-	per_cpu(kmem_cache_cpu_free, cpu) = c;
-}
-
-static void free_kmem_cache_cpus(struct kmem_cache *s)
-{
-	int cpu;
-
-	for_each_online_cpu(cpu) {
-		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
-		if (c) {
-			s->cpu_slab[cpu] = NULL;
-			free_kmem_cache_cpu(c, cpu);
-		}
-	}
-}
-
 static int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
 {
 	int cpu;
 
-	for_each_online_cpu(cpu) {
-		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
-		if (c)
-			continue;
-
-		c = alloc_kmem_cache_cpu(s, cpu, flags);
-		if (!c) {
-			free_kmem_cache_cpus(s);
-			return 0;
-		}
-		s->cpu_slab[cpu] = c;
-	}
-	return 1;
-}
-
-/*
- * Initialize the per cpu array.
- */
-static void init_alloc_cpu_cpu(int cpu)
-{
-	int i;
-
-	if (cpu_isset(cpu, kmem_cach_cpu_free_init_once))
-		return;
+	s->cpu_slab = CPU_ALLOC(struct kmem_cache_cpu, flags);
 
-	for (i = NR_KMEM_CACHE_CPU - 1; i >= 0; i--)
-		free_kmem_cache_cpu(&per_cpu(kmem_cache_cpu, cpu)[i], cpu);
-
-	cpu_set(cpu, kmem_cach_cpu_free_init_once);
-}
-
-static void __init init_alloc_cpu(void)
-{
-	int cpu;
-
-	for_each_online_cpu(cpu)
-		init_alloc_cpu_cpu(cpu);
-  }
-
-#else
-static inline void free_kmem_cache_cpus(struct kmem_cache *s) {}
-static inline void init_alloc_cpu(void) {}
+	if (!s->cpu_slab)
+		return 0;
 
-static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
-{
-	init_kmem_cache_cpu(s, &s->cpu_slab);
+	for_each_possible_cpu(cpu)
+		init_kmem_cache_cpu(s, CPU_PTR(s->cpu_slab, cpu));
 	return 1;
 }
-#endif
 
 #ifdef CONFIG_NUMA
 /*
@@ -2428,9 +2306,8 @@
 	int node;
 
 	flush_all(s);
-
+	CPU_FREE(s->cpu_slab);
 	/* Attempt to free all objects */
-	free_kmem_cache_cpus(s);
 	for_each_node_state(node, N_NORMAL_MEMORY) {
 		struct kmem_cache_node *n = get_node(s, node);
 
@@ -2947,8 +2824,6 @@
 	int i;
 	int caches = 0;
 
-	init_alloc_cpu();
-
 #ifdef CONFIG_NUMA
 	/*
 	 * Must first have the slab cache available for the allocations of the
@@ -3016,11 +2891,12 @@
 	for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++)
 		kmalloc_caches[i]. name =
 			kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
-
 #ifdef CONFIG_SMP
 	register_cpu_notifier(&slab_notifier);
-	kmem_size = offsetof(struct kmem_cache, cpu_slab) +
-				nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
+#endif
+#ifdef CONFIG_NUMA
+	kmem_size = offsetof(struct kmem_cache, node) +
+				nr_node_ids * sizeof(struct kmem_cache_node *);
 #else
 	kmem_size = sizeof(struct kmem_cache);
 #endif
@@ -3116,7 +2992,7 @@
 		 * per cpu structures
 		 */
 		for_each_online_cpu(cpu)
-			get_cpu_slab(s, cpu)->objsize = s->objsize;
+			CPU_PTR(s->cpu_slab, cpu)->objsize = s->objsize;
 
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
 		up_write(&slub_lock);
@@ -3164,11 +3040,9 @@
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
-		init_alloc_cpu_cpu(cpu);
 		down_read(&slub_lock);
 		list_for_each_entry(s, &slab_caches, list)
-			s->cpu_slab[cpu] = alloc_kmem_cache_cpu(s, cpu,
-							GFP_KERNEL);
+			init_kmem_cache_cpu(s, CPU_PTR(s->cpu_slab, cpu));
 		up_read(&slub_lock);
 		break;
 
@@ -3178,13 +3052,9 @@
 	case CPU_DEAD_FROZEN:
 		down_read(&slub_lock);
 		list_for_each_entry(s, &slab_caches, list) {
-			struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
 			local_irq_save(flags);
 			__flush_cpu_slab(s, cpu);
 			local_irq_restore(flags);
-			free_kmem_cache_cpu(c, cpu);
-			s->cpu_slab[cpu] = NULL;
 		}
 		up_read(&slub_lock);
 		break;
@@ -3675,7 +3545,7 @@
 		int cpu;
 
 		for_each_possible_cpu(cpu) {
-			struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+			struct kmem_cache_cpu *c = CPU_PTR(s->cpu_slab, cpu);
 
 			if (!c || c->node < 0)
 				continue;
@@ -4080,7 +3950,7 @@
 		return -ENOMEM;
 
 	for_each_online_cpu(cpu) {
-		unsigned x = get_cpu_slab(s, cpu)->stat[si];
+		unsigned x = CPU_PTR(s->cpu_slab, cpu)->stat[si];
 
 		data[cpu] = x;
 		sum += x;

-- 


* [patch 3/7] cpu alloc: Remove slub fields
  2008-11-05 23:16 [patch 0/7] cpu alloc stage 2 Christoph Lameter
  2008-11-05 23:16 ` [patch 1/7] Increase default reserve percpu area Christoph Lameter
  2008-11-05 23:16 ` [patch 2/7] cpu alloc: Use in slub Christoph Lameter
@ 2008-11-05 23:16 ` Christoph Lameter
  2008-11-05 23:16 ` [patch 4/7] cpu ops: Core piece for generic atomic per cpu operations Christoph Lameter
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Christoph Lameter @ 2008-11-05 23:16 UTC (permalink / raw)
  To: akpm
  Cc: Pekka Enberg, linux-kernel, linux-mm, travis, Stephen Rothwell,
	Vegard Nossum

[-- Attachment #1: cpu_alloc_remove_slub_fields --]
[-- Type: text/plain, Size: 6515 bytes --]

Remove the fields in kmem_cache_cpu that were used to cache data from
kmem_cache while the two structures lived in different cachelines. The
cacheline that holds the per cpu pointer now also holds these values, so the
size of struct kmem_cache_cpu can be cut almost in half.

The get_freepointer() and set_freepointer() functions, which used to be
intended only for the slow path, are now also useful in the hot path since
accessing the field no longer touches an additional cacheline. This results
in consistent handling of the free pointer for objects throughout SLUB.

Also, all possible kmem_cache_cpu structures are initialized when a slab
cache is created, so there is no need to initialize them when a processor or
node comes online. All fields start out as zero, so the cpu alloc can simply
be done with __GFP_ZERO.
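
For reference, get_freepointer() appears in the hunk below; its counterpart
set_freepointer() follows the same pattern (sketched here to show why both
are now cheap enough for the hot path: the offset comes from kmem_cache,
which shares a cacheline with the cpu_slab pointer):

	static inline void *get_freepointer(struct kmem_cache *s, void *object)
	{
		return *(void **)(object + s->offset);
	}

	static inline void set_freepointer(struct kmem_cache *s, void *object,
								void *fp)
	{
		*(void **)(object + s->offset) = fp;
	}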

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
 include/linux/slub_def.h |    2 --
 mm/slub.c                |   39 +++++++++++----------------------------
 2 files changed, 11 insertions(+), 30 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2008-10-23 15:30:26.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2008-10-23 15:30:30.000000000 -0500
@@ -36,8 +36,6 @@
 	void **freelist;	/* Pointer to first free per cpu object */
 	struct page *page;	/* The slab from which we are allocating */
 	int node;		/* The node of the page (or -1 for debug) */
-	unsigned int offset;	/* Freepointer offset (in word units) */
-	unsigned int objsize;	/* Size of an object (from kmem_cache) */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2008-10-23 15:30:26.000000000 -0500
+++ linux-2.6/mm/slub.c	2008-10-23 15:30:30.000000000 -0500
@@ -245,13 +245,6 @@
 	return 1;
 }
 
-/*
- * Slow version of get and set free pointer.
- *
- * This version requires touching the cache lines of kmem_cache which
- * we avoid to do in the fast alloc free paths. There we obtain the offset
- * from the page struct.
- */
 static inline void *get_freepointer(struct kmem_cache *s, void *object)
 {
 	return *(void **)(object + s->offset);
@@ -1416,10 +1409,10 @@
 
 		/* Retrieve object from cpu_freelist */
 		object = c->freelist;
-		c->freelist = c->freelist[c->offset];
+		c->freelist = get_freepointer(s, c->freelist);
 
 		/* And put onto the regular freelist */
-		object[c->offset] = page->freelist;
+		set_freepointer(s, object, page->freelist);
 		page->freelist = object;
 		page->inuse--;
 	}
@@ -1515,7 +1508,7 @@
 	if (unlikely(SLABDEBUG && PageSlubDebug(c->page)))
 		goto debug;
 
-	c->freelist = object[c->offset];
+	c->freelist = get_freepointer(s, object);
 	c->page->inuse = c->page->objects;
 	c->page->freelist = NULL;
 	c->node = page_to_nid(c->page);
@@ -1559,7 +1552,7 @@
 		goto another_slab;
 
 	c->page->inuse++;
-	c->page->freelist = object[c->offset];
+	c->page->freelist = get_freepointer(s, object);
 	c->node = -1;
 	goto unlock_out;
 }
@@ -1589,7 +1582,7 @@
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-		c->freelist = object[c->offset];
+		c->freelist = get_freepointer(s, object);
 		stat(c, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
@@ -1623,7 +1616,7 @@
  * handling required then we can return immediately.
  */
 static void __slab_free(struct kmem_cache *s, struct page *page,
-				void *x, void *addr, unsigned int offset)
+				void *x, void *addr)
 {
 	void *prior;
 	void **object = (void *)x;
@@ -1637,7 +1630,8 @@
 		goto debug;
 
 checks_ok:
-	prior = object[offset] = page->freelist;
+	prior = page->freelist;
+	set_freepointer(s, object, prior);
 	page->freelist = object;
 	page->inuse--;
 
@@ -1701,15 +1695,15 @@
 
 	local_irq_save(flags);
 	c = __THIS_CPU(s->cpu_slab);
-	debug_check_no_locks_freed(object, c->objsize);
+	debug_check_no_locks_freed(object, s->objsize);
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
 		debug_check_no_obj_freed(object, s->objsize);
 	if (likely(page == c->page && c->node >= 0)) {
-		object[c->offset] = c->freelist;
+		set_freepointer(s, object, c->freelist);
 		c->freelist = object;
 		stat(c, FREE_FASTPATH);
 	} else
-		__slab_free(s, page, x, addr, c->offset);
+		__slab_free(s, page, x, addr);
 
 	local_irq_restore(flags);
 }
@@ -1890,19 +1884,6 @@
 	return ALIGN(align, sizeof(void *));
 }
 
-static void init_kmem_cache_cpu(struct kmem_cache *s,
-			struct kmem_cache_cpu *c)
-{
-	c->page = NULL;
-	c->freelist = NULL;
-	c->node = 0;
-	c->offset = s->offset / sizeof(void *);
-	c->objsize = s->objsize;
-#ifdef CONFIG_SLUB_STATS
-	memset(c->stat, 0, NR_SLUB_STAT_ITEMS * sizeof(unsigned));
-#endif
-}
-
 static void
 init_kmem_cache_node(struct kmem_cache_node *n, struct kmem_cache *s)
 {
@@ -1927,20 +1908,6 @@
 #endif
 }
 
-static int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
-{
-	int cpu;
-
-	s->cpu_slab = CPU_ALLOC(struct kmem_cache_cpu, flags);
-
-	if (!s->cpu_slab)
-		return 0;
-
-	for_each_possible_cpu(cpu)
-		init_kmem_cache_cpu(s, CPU_PTR(s->cpu_slab, cpu));
-	return 1;
-}
-
 #ifdef CONFIG_NUMA
 /*
  * No kmalloc_node yet so do it by hand. We know that this is the first
@@ -2197,8 +2164,11 @@
 	if (!init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
 		goto error;
 
-	if (alloc_kmem_cache_cpus(s, gfpflags & ~SLUB_DMA))
+	s->cpu_slab = CPU_ALLOC(struct kmem_cache_cpu,
+				(flags & ~SLUB_DMA) | __GFP_ZERO);
+	if (s->cpu_slab)
 		return 1;
+
 	free_kmem_cache_nodes(s);
 error:
 	if (flags & SLAB_PANIC)
@@ -2978,8 +2948,6 @@
 	down_write(&slub_lock);
 	s = find_mergeable(size, align, flags, name, ctor);
 	if (s) {
-		int cpu;
-
 		s->refcount++;
 		/*
 		 * Adjust the object sizes so that we clear
@@ -2987,13 +2955,6 @@
 		 */
 		s->objsize = max(s->objsize, (int)size);
 
-		/*
-		 * And then we need to update the object size in the
-		 * per cpu structures
-		 */
-		for_each_online_cpu(cpu)
-			CPU_PTR(s->cpu_slab, cpu)->objsize = s->objsize;
-
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
 		up_write(&slub_lock);
 
@@ -3038,14 +2999,6 @@
 	unsigned long flags;
 
 	switch (action) {
-	case CPU_UP_PREPARE:
-	case CPU_UP_PREPARE_FROZEN:
-		down_read(&slub_lock);
-		list_for_each_entry(s, &slab_caches, list)
-			init_kmem_cache_cpu(s, CPU_PTR(s->cpu_slab, cpu));
-		up_read(&slub_lock);
-		break;
-
 	case CPU_UP_CANCELED:
 	case CPU_UP_CANCELED_FROZEN:
 	case CPU_DEAD:

-- 


* [patch 4/7] cpu ops: Core piece for generic atomic per cpu operations
  2008-11-05 23:16 [patch 0/7] cpu alloc stage 2 Christoph Lameter
                   ` (2 preceding siblings ...)
  2008-11-05 23:16 ` [patch 3/7] cpu alloc: Remove slub fields Christoph Lameter
@ 2008-11-05 23:16 ` Christoph Lameter
  2008-11-06  3:58   ` Dave Chinner
  2008-11-05 23:16 ` [patch 5/7] x86_64: Support for cpu ops Christoph Lameter
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2008-11-05 23:16 UTC (permalink / raw)
  To: akpm
  Cc: Pekka Enberg, linux-kernel, linux-mm, travis, Stephen Rothwell,
	Vegard Nossum

[-- Attachment #1: cpu_alloc_ops_base --]
[-- Type: text/plain, Size: 7826 bytes --]

Currently the per cpu subsystem is not able to use the atomic capabilities
that are provided by many of the available processors.

This patch adds new functionality that allows optimizing per cpu variable
handling. In particular it provides a simple way to exploit atomic operations
in order to avoid having to disable interrupts or perform address
calculations when accessing per cpu data.

For example, using our current methods we may do:

	unsigned long flags;
	struct stat_struct *p;

	local_irq_save(flags);
	/* Calculate address of per processor area */
	p = CPU_PTR(stat, smp_processor_id());
	p->counter++;
	local_irq_restore(flags);

This whole sequence can be replaced by a single atomic CPU operation:

	CPU_INC(stat->counter);

Most processors can perform the increment with a single atomic instruction.
Processors may have segment registers, global registers or per cpu mappings
of the per cpu areas that can be used to generate atomic instructions that
combine the following in a single operation:

1. Adding of an offset / register to a base address
2. Read modify write operation on the address calculated by
   the instruction.

If 1 and 2 are combined in one instruction then the instruction is atomic
vs. interrupts. This means that percpu atomic operations do not need to
disable interrupts to increment counters etc.

The existing methods in use in the kernel cannot utilize the power of these
atomic instructions. local_t does not really address the issue since the
offset calculation is performed before the atomic operation; the combined
operation is therefore not atomic. Disabling interrupts or preemption is
required in order to use local_t.

local_t is also very specific to the x86 processor. The solution here can
utilize methods beyond those provided by the x86 instruction set.
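
To make the local_t limitation concrete (a sketch using the existing local_t
API, not code from this patch; the counter name is made up):

	DEFINE_PER_CPU(local_t, my_counter);

	/*
	 * The per cpu address is calculated first and the atomic operation
	 * is applied to it afterwards. That is only safe if we cannot
	 * migrate to another cpu in between, hence:
	 */
	preempt_disable();
	local_inc(&__get_cpu_var(my_counter));
	preempt_enable();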



On x86 the above CPU_INC translates into a single instruction:

	inc %%gs:(&stat->counter)

This instruction is interrupt safe since it either completes fully or does
not execute at all. Both the addition of the offset and the read modify write
are combined in one instruction.

The determination of the correct per cpu area for the current processor
does not require access to smp_processor_id() (expensive...). The gs
register is used to provide a processor specific offset to the respective
per cpu area where the per cpu variable resides.

Note that the counter offset into the struct is added *before* the segment
selector is applied. This is necessary to avoid extra calculations. In the
past we first determined the address of the stats structure on the respective
processor and then added the field offset. However, the field offset may just
as well be added earlier. The addition of the per cpu offset (here through
the gs register) must be done by the instruction used for the atomic per cpu
access.



If "stat" was declared via DECLARE_PER_CPU then this patchset is capable of
convincing the linker to provide the proper base address. In that case
no calculations are necessary.

Should the stat structure be reachable via a register then the address
calculation capabilities can be leveraged to avoid calculations.

On IA64 we can get the same combination of operations in a single instruction
by using the virtual address that always maps to the local per cpu area:

	fetchadd &stat->counter + (VCPU_BASE - __per_cpu_start)

The access is forced into the per cpu address reachable via the virtualized
address. IA64 allows the embedding of an offset into the instruction. So the
fetchadd can perform both the relocation of the pointer into the per cpu
area as well as the atomic read modify write cycle.



In order to be able to exploit the atomicity of these instructions we
introduce a series of new functions that take either:

1. A per cpu pointer as returned by cpu_alloc() or CPU_ALLOC().

2. A per cpu variable address as returned by per_cpu_var(<percpuvarname>).

CPU_READ()
CPU_WRITE()
CPU_INC()
CPU_DEC()
CPU_ADD()
CPU_SUB()
CPU_XCHG()
CPU_CMPXCHG()
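
A short usage sketch of the three groups for the per cpu variable case (the
counter is made up; per_cpu_var() as described above):

	DEFINE_PER_CPU(unsigned long, nr_widgets);

	__CPU_INC(per_cpu_var(nr_widgets));	/* caller disabled preemption or irqs */
	_CPU_INC(per_cpu_var(nr_widgets));	/* never updated from interrupt context */
	CPU_INC(per_cpu_var(nr_widgets));	/* fully interrupt safe */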

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/percpu.h |  135 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 135 insertions(+)

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2008-11-03 13:27:57.000000000 -0600
+++ linux-2.6/include/linux/percpu.h	2008-11-03 13:28:00.000000000 -0600
@@ -162,4 +162,139 @@
 #define CPU_FREE(pointer)	cpu_free((pointer), sizeof(*(pointer)))
 
 
+/*
+ * Fast atomic per cpu operations.
+ *
+ * The following operations can be overridden by arches to implement fast
+ * and efficient operations. The operations are atomic meaning that the
+ * determination of the processor, the calculation of the address and the
+ * operation on the data is an atomic operation.
+ *
+ * The parameter passed to the atomic per cpu operations is an lvalue not a
+ * pointer to the object.
+ */
+#ifndef CONFIG_HAVE_CPU_OPS
+
+/*
+ * Fallback in case the arch does not provide for atomic per cpu operations.
+ *
+ * The first group of macros is used when it is safe to update the per
+ * cpu variable because preemption is off (per cpu variables that are not
+ * updated from interrupt context) or because interrupts are already off.
+ */
+#define __CPU_READ(var)				\
+({						\
+	(*THIS_CPU(&(var)));			\
+})
+
+#define __CPU_WRITE(var, value)			\
+({						\
+	*THIS_CPU(&(var)) = (value);		\
+})
+
+#define __CPU_ADD(var, value)			\
+({						\
+	*THIS_CPU(&(var)) += (value);		\
+})
+
+#define __CPU_INC(var) __CPU_ADD((var), 1)
+#define __CPU_DEC(var) __CPU_ADD((var), -1)
+#define __CPU_SUB(var, value) __CPU_ADD((var), -(value))
+
+#define __CPU_CMPXCHG(var, old, new)		\
+({						\
+	typeof(obj) x;				\
+	typeof(obj) *p = THIS_CPU(&(obj));	\
+	x = *p;					\
+	if (x == (old))				\
+		*p = (new);			\
+	(x);					\
+})
+
+#define __CPU_XCHG(obj, new)			\
+({						\
+	typeof(obj) x;				\
+	typeof(obj) *p = THIS_CPU(&(obj));	\
+	x = *p;					\
+	*p = (new);				\
+	(x);					\
+})
+
+/*
+ * Second group used for per cpu variables that are not updated from an
+ * interrupt context. In that case we can simply disable preemption which
+ * may be free if the kernel is compiled without support for preemption.
+ */
+#define _CPU_READ __CPU_READ
+#define _CPU_WRITE __CPU_WRITE
+
+#define _CPU_ADD(var, value)			\
+({						\
+	preempt_disable();			\
+	__CPU_ADD((var), (value));		\
+	preempt_enable();			\
+})
+
+#define _CPU_INC(var) _CPU_ADD((var), 1)
+#define _CPU_DEC(var) _CPU_ADD((var), -1)
+#define _CPU_SUB(var, value) _CPU_ADD((var), -(value))
+
+#define _CPU_CMPXCHG(var, old, new)		\
+({						\
+	typeof(var) x;				\
+	preempt_disable();			\
+	x = __CPU_CMPXCHG((var), (old), (new));	\
+	preempt_enable();			\
+	(x);					\
+})
+
+#define _CPU_XCHG(var, new)			\
+({						\
+	typeof(var) x;				\
+	preempt_disable();			\
+	x = __CPU_XCHG((var), (new));		\
+	preempt_enable();			\
+	(x);					\
+})
+
+/*
+ * Third group: Interrupt safe CPU functions
+ */
+#define CPU_READ __CPU_READ
+#define CPU_WRITE __CPU_WRITE
+
+#define CPU_ADD(var, value)			\
+({						\
+	unsigned long flags;			\
+	local_irq_save(flags);			\
+	__CPU_ADD((var), (value));		\
+	local_irq_restore(flags);		\
+})
+
+#define CPU_INC(var) CPU_ADD((var), 1)
+#define CPU_DEC(var) CPU_ADD((var), -1)
+#define CPU_SUB(var, value) CPU_ADD((var), -(value))
+
+#define CPU_CMPXCHG(var, old, new)		\
+({						\
+	unsigned long flags;			\
+	typeof(var) x;				\
+	local_irq_save(flags);			\
+	x = __CPU_CMPXCHG((var), (old), (new));	\
+	local_irq_restore(flags);		\
+	(x);					\
+})
+
+#define CPU_XCHG(var, new)			\
+({						\
+	unsigned long flags;			\
+	typeof(var) x;				\
+	local_irq_save(flags);			\
+	x = __CPU_XCHG((var), (new));		\
+	local_irq_restore(flags);		\
+	(x);					\
+})
+
+#endif /* CONFIG_HAVE_CPU_OPS */
+
 #endif /* __LINUX_PERCPU_H */

-- 


* [patch 5/7] x86_64: Support for cpu ops
  2008-11-05 23:16 [patch 0/7] cpu alloc stage 2 Christoph Lameter
                   ` (3 preceding siblings ...)
  2008-11-05 23:16 ` [patch 4/7] cpu ops: Core piece for generic atomic per cpu operations Christoph Lameter
@ 2008-11-05 23:16 ` Christoph Lameter
  2008-11-06  7:12   ` Ingo Molnar
  2008-11-05 23:16 ` [patch 6/7] VM statistics: Use CPU ops Christoph Lameter
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2008-11-05 23:16 UTC (permalink / raw)
  To: akpm
  Cc: Pekka Enberg, linux-kernel, linux-mm, travis, Stephen Rothwell,
	Vegard Nossum

[-- Attachment #1: cpu_alloc_ops_x86 --]
[-- Type: text/plain, Size: 4120 bytes --]

Support fast cpu ops in x86_64 by providing a series of functions that
generate the proper instructions.

Define CONFIG_HAVE_CPU_OPS so that core code
can exploit the availability of fast per cpu operations.
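
As an illustration of the intended effect on configurations where these ops
are enabled (a sketch; nr_foo is a made up counter and the exact segment
prefix is whatever __percpu_seg expands to):

	DEFINE_PER_CPU(int, nr_foo);

	CPU_ADD(per_cpu_var(nr_foo), 1);

	/*
	 * The line above is meant to compile down to a single segment
	 * prefixed instruction, roughly "addl $1, %seg:per_cpu__nr_foo",
	 * with no smp_processor_id() lookup and no interrupt disabling.
	 */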

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 arch/x86/Kconfig         |    9 +++++++++
 include/asm-x86/percpu.h |   40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 49 insertions(+)

Index: linux-2.6/arch/x86/Kconfig
===================================================================
--- linux-2.6.orig/arch/x86/Kconfig	2008-10-23 15:21:50.000000000 -0500
+++ linux-2.6/arch/x86/Kconfig	2008-10-23 15:32:18.000000000 -0500
@@ -164,6 +164,15 @@
 	depends on GENERIC_HARDIRQS && SMP
 	default y
 
+#
+# X86_64's spare segment register points to the PDA instead of the per
+# cpu area. Therefore x86_64 is not able to generate atomic vs. interrupt
+# per cpu instructions.
+#
+config HAVE_CPU_OPS
+	def_bool y
+	depends on X86_32
+
 config X86_SMP
 	bool
 	depends on SMP && ((X86_32 && !X86_VOYAGER) || X86_64)
Index: linux-2.6/arch/x86/include/asm/percpu.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/percpu.h	2008-10-23 15:21:50.000000000 -0500
+++ linux-2.6/arch/x86/include/asm/percpu.h	2008-10-23 15:33:55.000000000 -0500
@@ -162,6 +162,53 @@
 	ret__;						\
 })
 
+#define percpu_addr_op(op, var)				\
+({							\
+	switch (sizeof(var)) {				\
+	case 1:						\
+		asm(op "b "__percpu_seg"%0"		\
+				: : "m"(var));		\
+		break;					\
+	case 2:						\
+		asm(op "w "__percpu_seg"%0"		\
+				: : "m"(var));		\
+		break;					\
+	case 4:						\
+		asm(op "l "__percpu_seg"%0"		\
+				: : "m"(var));		\
+		break;					\
+	default: __bad_percpu_size();			\
+	}						\
+})
+
+#define percpu_cmpxchg_op(var, old, new)				\
+({									\
+	typeof(var) prev;						\
+	switch (sizeof(var)) {						\
+	case 1:								\
+		asm("cmpxchgb %b1, "__percpu_seg"%2"			\
+				     : "=a"(prev)			\
+				     : "q"(new), "m"(var), "0"(old)	\
+				     : "memory");			\
+		break;							\
+	case 2:								\
+		asm("cmpxchgw %w1, "__percpu_seg"%2"			\
+				     : "=a"(prev)			\
+				     : "r"(new), "m"(var), "0"(old)	\
+				     : "memory");			\
+		break;							\
+	case 4:								\
+		asm("cmpxchgl %k1, "__percpu_seg"%2"			\
+				     : "=a"(prev)			\
+				     : "r"(new), "m"(var), "0"(old)	\
+				     : "memory");			\
+		break;							\
+	default:							\
+		__bad_percpu_size();					\
+	}								\
+	prev;								\
+})
+
 #define x86_read_percpu(var) percpu_from_op("mov", per_cpu__##var)
 #define x86_write_percpu(var, val) percpu_to_op("mov", per_cpu__##var, val)
 #define x86_add_percpu(var, val) percpu_to_op("add", per_cpu__##var, val)
@@ -215,4 +262,44 @@
 
 #endif	/* !CONFIG_SMP */
 
+/*
+ * x86_64 uses available segment register for pda instead of per cpu access.
+ * Therefore we cannot generate these atomic vs. interrupt instructions
+ * on x86_64.
+ */
+#ifdef CONFIG_X86_32
+
+#define CPU_READ(obj)		percpu_from_op("mov", obj)
+#define CPU_WRITE(obj,val)	percpu_to_op("mov", obj, val)
+#define CPU_ADD(obj,val)	percpu_to_op("add", obj, val)
+#define CPU_SUB(obj,val)	percpu_to_op("sub", obj, val)
+#define CPU_INC(obj)		percpu_addr_op("inc", obj)
+#define CPU_DEC(obj)		percpu_addr_op("dec", obj)
+#define CPU_XCHG(obj,val)	percpu_to_op("xchg", obj, val)
+#define CPU_CMPXCHG(obj, old, new) percpu_cmpxchg_op(obj, old, new)
+
+/*
+ * All cpu operations are interrupt safe and do not need to disable
+ * preempt. So the other variants all reduce to the same instruction.
+ */
+#define _CPU_READ CPU_READ
+#define _CPU_WRITE CPU_WRITE
+#define _CPU_ADD CPU_ADD
+#define _CPU_SUB CPU_SUB
+#define _CPU_INC CPU_INC
+#define _CPU_DEC CPU_DEC
+#define _CPU_XCHG CPU_XCHG
+#define _CPU_CMPXCHG CPU_CMPXCHG
+
+#define __CPU_READ CPU_READ
+#define __CPU_WRITE CPU_WRITE
+#define __CPU_ADD CPU_ADD
+#define __CPU_SUB CPU_SUB
+#define __CPU_INC CPU_INC
+#define __CPU_DEC CPU_DEC
+#define __CPU_XCHG CPU_XCHG
+#define __CPU_CMPXCHG CPU_CMPXCHG
+
+#endif /* CONFIG_X86_32 */
+
 #endif /* _ASM_X86_PERCPU_H */

-- 


* [patch 6/7] VM statistics: Use CPU ops
  2008-11-05 23:16 [patch 0/7] cpu alloc stage 2 Christoph Lameter
                   ` (4 preceding siblings ...)
  2008-11-05 23:16 ` [patch 5/7] x86_64: Support for cpu ops Christoph Lameter
@ 2008-11-05 23:16 ` Christoph Lameter
  2008-11-05 23:16 ` [patch 7/7] cpu alloc: page allocator conversion Christoph Lameter
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Christoph Lameter @ 2008-11-05 23:16 UTC (permalink / raw)
  To: akpm
  Cc: Pekka Enberg, linux-kernel, linux-mm, travis, Stephen Rothwell,
	Vegard Nossum

[-- Attachment #1: cpu_alloc_ops_vmstat --]
[-- Type: text/plain, Size: 1662 bytes --]

The use of CPU ops here avoids the offset calculations that we previously
had to do for per cpu operations. The result of this patch is that event
counters are updated with a single instruction, as follows:

	incq   %gs:offset(%rip)

Without these patches this was:

	mov    %gs:0x8,%rdx
	mov    %eax,0x38(%rsp)
	mov    xxx(%rip),%eax
	mov    %eax,0x48(%rsp)
	mov    varoffset,%rax
	incq   0x110(%rax,%rdx,1)
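
To make the mapping explicit (a sketch; the declaration is the existing one
from include/linux/vmstat.h and PGFREE is just a sample event):

	DECLARE_PER_CPU(struct vm_event_state, vm_event_states);

	/* A hot path caller, e.g. in the page allocator, does: */
	__count_vm_event(PGFREE);

	/* which with this patch expands to */
	__CPU_INC(per_cpu_var(vm_event_states).event[PGFREE]);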

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/vmstat.h |   10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h	2008-10-23 15:21:52.000000000 -0500
+++ linux-2.6/include/linux/vmstat.h	2008-10-23 15:34:02.000000000 -0500
@@ -75,24 +75,22 @@
 
 static inline void __count_vm_event(enum vm_event_item item)
 {
-	__get_cpu_var(vm_event_states).event[item]++;
+	__CPU_INC(per_cpu_var(vm_event_states).event[item]);
 }
 
 static inline void count_vm_event(enum vm_event_item item)
 {
-	get_cpu_var(vm_event_states).event[item]++;
-	put_cpu();
+	_CPU_INC(per_cpu_var(vm_event_states).event[item]);
 }
 
 static inline void __count_vm_events(enum vm_event_item item, long delta)
 {
-	__get_cpu_var(vm_event_states).event[item] += delta;
+	__CPU_ADD(per_cpu_var(vm_event_states).event[item], delta);
 }
 
 static inline void count_vm_events(enum vm_event_item item, long delta)
 {
-	get_cpu_var(vm_event_states).event[item] += delta;
-	put_cpu();
+	_CPU_ADD(per_cpu_var(vm_event_states).event[item], delta);
 }
 
 extern void all_vm_events(unsigned long *);

-- 


* [patch 7/7] cpu alloc: page allocator conversion
  2008-11-05 23:16 [patch 0/7] cpu alloc stage 2 Christoph Lameter
                   ` (5 preceding siblings ...)
  2008-11-05 23:16 ` [patch 6/7] VM statistics: Use CPU ops Christoph Lameter
@ 2008-11-05 23:16 ` Christoph Lameter
  2008-11-06  2:52   ` KOSAKI Motohiro
  2008-11-11 23:56 ` [patch 0/7] cpu alloc stage 2 Andrew Morton
  2008-11-12  6:57 ` Stephen Rothwell
  8 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2008-11-05 23:16 UTC (permalink / raw)
  To: akpm
  Cc: Pekka Enberg, Christoph Lameter, linux-kernel, linux-mm, travis,
	Stephen Rothwell, Vegard Nossum

[-- Attachment #1: cpu_alloc_page_allocator_conversion --]
[-- Type: text/plain, Size: 12340 bytes --]

Use the new cpu_alloc functionality to avoid per cpu arrays in struct zone.
This drastically reduces the size of struct zone for systems with large
numbers of processors and allows placement of critical variables of struct
zone in one cacheline even on very large systems.

Another effect is that the pagesets of one processor are placed near one
another. If multiple pagesets from different zones fit into one cacheline
then additional cacheline fetches can be avoided on the hot paths when
allocating memory from multiple zones.

Surprisingly this clears up much of the painful NUMA bringup. Bootstrap
becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs are
reduced and we can drop the zone_pcp macro.

Hotplug handling is also simplified since cpu alloc can bring up and
shut down cpu areas for a specific cpu as a whole. So there is no need to
allocate or free individual pagesets.
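
The hot path change boils down to the following pattern (lifted from the
hunks below, shown here in isolation):

	/* Before: per cpu array in struct zone, cpu id lookup, get_cpu()/put_cpu() */
	pcp = &zone_pcp(zone, get_cpu())->pcp;
	/* ... use pcp ... */
	put_cpu();

	/* After: one cpu_alloc'ed per_cpu_pageset per zone */
	pcp = &THIS_CPU(zone->pageset)->pcp;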

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/linux/mm.h     |    4 -
 include/linux/mmzone.h |   12 ---
 mm/page_alloc.c        |  162 +++++++++++++++++++------------------------------
 mm/vmstat.c            |   15 ++--
 4 files changed, 74 insertions(+), 119 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2008-11-04 14:39:18.000000000 -0600
+++ linux-2.6/include/linux/mm.h	2008-11-04 14:39:20.000000000 -0600
@@ -1042,11 +1042,7 @@
 extern void si_meminfo_node(struct sysinfo *val, int nid);
 extern int after_bootmem;
 
-#ifdef CONFIG_NUMA
 extern void setup_per_cpu_pageset(void);
-#else
-static inline void setup_per_cpu_pageset(void) {}
-#endif
 
 /* prio_tree.c */
 void vma_prio_tree_add(struct vm_area_struct *, struct vm_area_struct *old);
Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h	2008-11-04 14:39:18.000000000 -0600
+++ linux-2.6/include/linux/mmzone.h	2008-11-04 14:39:20.000000000 -0600
@@ -182,13 +182,7 @@
 	s8 stat_threshold;
 	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
 #endif
-} ____cacheline_aligned_in_smp;
-
-#ifdef CONFIG_NUMA
-#define zone_pcp(__z, __cpu) ((__z)->pageset[(__cpu)])
-#else
-#define zone_pcp(__z, __cpu) (&(__z)->pageset[(__cpu)])
-#endif
+};
 
 #endif /* !__GENERATING_BOUNDS.H */
 
@@ -283,10 +277,8 @@
 	 */
 	unsigned long		min_unmapped_pages;
 	unsigned long		min_slab_pages;
-	struct per_cpu_pageset	*pageset[NR_CPUS];
-#else
-	struct per_cpu_pageset	pageset[NR_CPUS];
 #endif
+	struct per_cpu_pageset	*pageset;
 	/*
 	 * free areas of different sizes
 	 */
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2008-11-04 14:39:18.000000000 -0600
+++ linux-2.6/mm/page_alloc.c	2008-11-04 15:32:17.000000000 -0600
@@ -903,7 +903,7 @@
 		if (!populated_zone(zone))
 			continue;
 
-		pset = zone_pcp(zone, cpu);
+		pset = CPU_PTR(zone->pageset, cpu);
 
 		pcp = &pset->pcp;
 		local_irq_save(flags);
@@ -986,7 +986,7 @@
 	arch_free_page(page, 0);
 	kernel_map_pages(page, 1, 0);
 
-	pcp = &zone_pcp(zone, get_cpu())->pcp;
+	pcp = &THIS_CPU(zone->pageset)->pcp;
 	local_irq_save(flags);
 	__count_vm_event(PGFREE);
 	if (cold)
@@ -1000,7 +1000,6 @@
 		pcp->count -= pcp->batch;
 	}
 	local_irq_restore(flags);
-	put_cpu();
 }
 
 void free_hot_page(struct page *page)
@@ -1042,15 +1041,13 @@
 	unsigned long flags;
 	struct page *page;
 	int cold = !!(gfp_flags & __GFP_COLD);
-	int cpu;
 	int migratetype = allocflags_to_migratetype(gfp_flags);
 
 again:
-	cpu  = get_cpu();
 	if (likely(order == 0)) {
 		struct per_cpu_pages *pcp;
 
-		pcp = &zone_pcp(zone, cpu)->pcp;
+		pcp = &THIS_CPU(zone->pageset)->pcp;
 		local_irq_save(flags);
 		if (!pcp->count) {
 			pcp->count = rmqueue_bulk(zone, 0,
@@ -1090,7 +1087,6 @@
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone);
 	local_irq_restore(flags);
-	put_cpu();
 
 	VM_BUG_ON(bad_range(zone, page));
 	if (prep_new_page(page, order, gfp_flags))
@@ -1099,7 +1095,6 @@
 
 failed:
 	local_irq_restore(flags);
-	put_cpu();
 	return NULL;
 }
 
@@ -1854,7 +1849,7 @@
 		for_each_online_cpu(cpu) {
 			struct per_cpu_pageset *pageset;
 
-			pageset = zone_pcp(zone, cpu);
+			pageset = CPU_PTR(zone->pageset, cpu);
 
 			printk("CPU %4d: hi:%5d, btch:%4d usd:%4d\n",
 			       cpu, pageset->pcp.high,
@@ -2714,82 +2709,33 @@
 		pcp->batch = PAGE_SHIFT * 8;
 }
 
-
-#ifdef CONFIG_NUMA
-/*
- * Boot pageset table. One per cpu which is going to be used for all
- * zones and all nodes. The parameters will be set in such a way
- * that an item put on a list will immediately be handed over to
- * the buddy list. This is safe since pageset manipulation is done
- * with interrupts disabled.
- *
- * Some NUMA counter updates may also be caught by the boot pagesets.
- *
- * The boot_pagesets must be kept even after bootup is complete for
- * unused processors and/or zones. They do play a role for bootstrapping
- * hotplugged processors.
- *
- * zoneinfo_show() and maybe other functions do
- * not check if the processor is online before following the pageset pointer.
- * Other parts of the kernel may not check if the zone is available.
- */
-static struct per_cpu_pageset boot_pageset[NR_CPUS];
-
 /*
- * Dynamically allocate memory for the
- * per cpu pageset array in struct zone.
+ * Configure pageset array in struct zone.
  */
-static int __cpuinit process_zones(int cpu)
+static void __cpuinit process_zones(int cpu)
 {
-	struct zone *zone, *dzone;
+	struct zone *zone;
 	int node = cpu_to_node(cpu);
 
 	node_set_state(node, N_CPU);	/* this node has a cpu */
 
 	for_each_zone(zone) {
+		struct per_cpu_pageset *pcp =
+				CPU_PTR(zone->pageset, cpu);
 
 		if (!populated_zone(zone))
 			continue;
 
-		zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
-					 GFP_KERNEL, node);
-		if (!zone_pcp(zone, cpu))
-			goto bad;
-
-		setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
+		setup_pageset(pcp, zone_batchsize(zone));
 
 		if (percpu_pagelist_fraction)
-			setup_pagelist_highmark(zone_pcp(zone, cpu),
-			 	(zone->present_pages / percpu_pagelist_fraction));
-	}
+			setup_pagelist_highmark(pcp, zone->present_pages /
+						percpu_pagelist_fraction);
 
-	return 0;
-bad:
-	for_each_zone(dzone) {
-		if (!populated_zone(dzone))
-			continue;
-		if (dzone == zone)
-			break;
-		kfree(zone_pcp(dzone, cpu));
-		zone_pcp(dzone, cpu) = NULL;
-	}
-	return -ENOMEM;
-}
-
-static inline void free_zone_pagesets(int cpu)
-{
-	struct zone *zone;
-
-	for_each_zone(zone) {
-		struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
-
-		/* Free per_cpu_pageset if it is slab allocated */
-		if (pset != &boot_pageset[cpu])
-			kfree(pset);
-		zone_pcp(zone, cpu) = NULL;
 	}
 }
 
+#ifdef CONFIG_SMP
 static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
 		unsigned long action,
 		void *hcpu)
@@ -2800,14 +2746,7 @@
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
-		if (process_zones(cpu))
-			ret = NOTIFY_BAD;
-		break;
-	case CPU_UP_CANCELED:
-	case CPU_UP_CANCELED_FROZEN:
-	case CPU_DEAD:
-	case CPU_DEAD_FROZEN:
-		free_zone_pagesets(cpu);
+		process_zones(cpu);
 		break;
 	default:
 		break;
@@ -2817,21 +2756,15 @@
 
 static struct notifier_block __cpuinitdata pageset_notifier =
 	{ &pageset_cpuup_callback, NULL, 0 };
+#endif
 
 void __init setup_per_cpu_pageset(void)
 {
-	int err;
-
-	/* Initialize per_cpu_pageset for cpu 0.
-	 * A cpuup callback will do this for every cpu
-	 * as it comes online
-	 */
-	err = process_zones(smp_processor_id());
-	BUG_ON(err);
+	process_zones(smp_processor_id());
+#ifdef CONFIG_SMP
 	register_cpu_notifier(&pageset_notifier);
-}
-
 #endif
+}
 
 static noinline __init_refok
 int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
@@ -2876,23 +2809,35 @@
 	return 0;
 }
 
-static __meminit void zone_pcp_init(struct zone *zone)
+static inline void alloc_pageset(struct zone *zone)
 {
-	int cpu;
 	unsigned long batch = zone_batchsize(zone);
 
-	for (cpu = 0; cpu < NR_CPUS; cpu++) {
-#ifdef CONFIG_NUMA
-		/* Early boot. Slab allocator not functional yet */
-		zone_pcp(zone, cpu) = &boot_pageset[cpu];
-		setup_pageset(&boot_pageset[cpu],0);
-#else
-		setup_pageset(zone_pcp(zone,cpu), batch);
-#endif
-	}
+	zone->pageset = CPU_ALLOC(struct per_cpu_pageset, GFP_KERNEL);
+	setup_pageset(THIS_CPU(zone->pageset), batch);
+}
+/*
+ * Allocate and initialize pcp structures
+ */
+static __meminit void zone_pcp_init(struct zone *zone)
+{
+	if (slab_is_available())
+		alloc_pageset(zone);
 	if (zone->present_pages)
-		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
-			zone->name, zone->present_pages, batch);
+		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%u\n",
+			zone->name, zone->present_pages,
+			zone_batchsize(zone));
+}
+
+/*
+ * Allocate pcp structures that we were unable to allocate during early boot.
+ */
+void __init allocate_pagesets(void)
+{
+	struct zone *zone;
+
+	for_each_zone(zone)
+		alloc_pageset(zone);
 }
 
 __meminit int init_currently_empty_zone(struct zone *zone,
@@ -3438,6 +3383,8 @@
 		unsigned long size, realsize, memmap_pages;
 		enum lru_list l;
 
+		printk("+++ Free area init core for zone %p\n", zone);
+
 		size = zone_spanned_pages_in_node(nid, j, zones_size);
 		realsize = size - zone_absent_pages_in_node(nid, j,
 								zholes_size);
@@ -4438,11 +4385,13 @@
 	ret = proc_dointvec_minmax(table, write, file, buffer, length, ppos);
 	if (!write || (ret == -EINVAL))
 		return ret;
-	for_each_zone(zone) {
-		for_each_online_cpu(cpu) {
+	for_each_online_cpu(cpu) {
+		for_each_zone(zone) {
 			unsigned long  high;
+
 			high = zone->present_pages / percpu_pagelist_fraction;
-			setup_pagelist_highmark(zone_pcp(zone, cpu), high);
+			setup_pagelist_highmark(CPU_PTR(zone->pageset, cpu),
+									high);
 		}
 	}
 	return 0;
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2008-11-04 14:39:18.000000000 -0600
+++ linux-2.6/mm/vmstat.c	2008-11-04 14:39:20.000000000 -0600
@@ -143,7 +143,8 @@
 		threshold = calculate_threshold(zone);
 
 		for_each_online_cpu(cpu)
-			zone_pcp(zone, cpu)->stat_threshold = threshold;
+			CPU_PTR(zone->pageset, cpu)->stat_threshold
+							= threshold;
 	}
 }
 
@@ -153,7 +154,8 @@
 void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 				int delta)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = THIS_CPU(zone->pageset);
+
 	s8 *p = pcp->vm_stat_diff + item;
 	long x;
 
@@ -206,7 +208,7 @@
  */
 void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = THIS_CPU(zone->pageset);
 	s8 *p = pcp->vm_stat_diff + item;
 
 	(*p)++;
@@ -227,7 +229,7 @@
 
 void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = THIS_CPU(zone->pageset);
 	s8 *p = pcp->vm_stat_diff + item;
 
 	(*p)--;
@@ -307,7 +309,7 @@
 		if (!populated_zone(zone))
 			continue;
 
-		p = zone_pcp(zone, cpu);
+		p = CPU_PTR(zone->pageset, cpu);
 
 		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
 			if (p->vm_stat_diff[i]) {
@@ -759,7 +761,7 @@
 	for_each_online_cpu(i) {
 		struct per_cpu_pageset *pageset;
 
-		pageset = zone_pcp(zone, i);
+		pageset = CPU_PTR(zone->pageset, i);
 		seq_printf(m,
 			   "\n    cpu: %i"
 			   "\n              count: %i"
Index: linux-2.6/mm/cpu_alloc.c
===================================================================
--- linux-2.6.orig/mm/cpu_alloc.c	2008-11-04 15:26:36.000000000 -0600
+++ linux-2.6/mm/cpu_alloc.c	2008-11-04 15:26:45.000000000 -0600
@@ -189,6 +189,8 @@
 
 void __init cpu_alloc_init(void)
 {
+	extern void allocate_pagesets(void);
+
 #ifdef CONFIG_SMP
 	base_percpu_in_units = (__per_cpu_end - __per_cpu_start
 					+ UNIT_SIZE - 1) / UNIT_SIZE;
@@ -199,5 +201,7 @@
 #ifndef CONFIG_SMP
 	cpu_alloc_start = alloc_bootmem(nr_units * UNIT_SIZE);
 #endif
+	/* Allocate pagesets whose allocation was deferred */
+	allocate_pagesets();
 }
 

-- 


* Re: [patch 7/7] cpu alloc: page allocator conversion
  2008-11-05 23:16 ` [patch 7/7] cpu alloc: page allocator conversion Christoph Lameter
@ 2008-11-06  2:52   ` KOSAKI Motohiro
  2008-11-06 15:04     ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: KOSAKI Motohiro @ 2008-11-06  2:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: kosaki.motohiro, akpm, Pekka Enberg, Christoph Lameter,
	linux-kernel, linux-mm, travis, Stephen Rothwell, Vegard Nossum

> +#ifdef CONFIG_SMP
>  static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
>  		unsigned long action,
>  		void *hcpu)
> @@ -2800,14 +2746,7 @@
>  	switch (action) {
>  	case CPU_UP_PREPARE:
>  	case CPU_UP_PREPARE_FROZEN:
> -		if (process_zones(cpu))
> -			ret = NOTIFY_BAD;
> -		break;
> -	case CPU_UP_CANCELED:
> -	case CPU_UP_CANCELED_FROZEN:
> -	case CPU_DEAD:
> -	case CPU_DEAD_FROZEN:
> -		free_zone_pagesets(cpu);
> +		process_zones(cpu);
>  		break;

Why do you drop cpu unplug code?





* Re: [patch 4/7] cpu ops: Core piece for generic atomic per cpu operations
  2008-11-05 23:16 ` [patch 4/7] cpu ops: Core piece for generic atomic per cpu operations Christoph Lameter
@ 2008-11-06  3:58   ` Dave Chinner
  2008-11-06 15:05     ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2008-11-06  3:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Pekka Enberg, linux-kernel, linux-mm, travis,
	Stephen Rothwell, Vegard Nossum

On Wed, Nov 05, 2008 at 05:16:38PM -0600, Christoph Lameter wrote:
> +
> +#define __CPU_CMPXCHG(var, old, new)		\
> +({						\
> +	typeof(obj) x;				\
> +	typeof(obj) *p = THIS_CPU(&(obj));	\
> +	x = *p;					\
> +	if (x == (old))				\
> +		*p = (new);			\
> +	(x);					\
> +})

I don't think that will compile - s/obj/var/ perhaps?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [patch 5/7] x86_64: Support for cpu ops
  2008-11-05 23:16 ` [patch 5/7] x86_64: Support for cpu ops Christoph Lameter
@ 2008-11-06  7:12   ` Ingo Molnar
  2008-11-06 15:08     ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2008-11-06  7:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Pekka Enberg, linux-kernel, linux-mm, travis,
	Stephen Rothwell, Vegard Nossum


* Christoph Lameter <cl@linux-foundation.org> wrote:

> +#
> +# X86_64's spare segment register points to the PDA instead of the per
> +# cpu area. Therefore x86_64 is not able to generate atomic vs. interrupt
> +# per cpu instructions.
> +#
> +config HAVE_CPU_OPS
> +	def_bool y
> +	depends on X86_32
> +

hm, what happened to the rebase-PDA-to-percpu-area optimization 
patches you guys were working on? I remember there was some binutils 
flakiness - weird crashes and things like that. Did you ever manage to 
stabilize it? It would be sad if only 32-bit could take advantage of 
the optimized ops.

	Ingo


* Re: [patch 7/7] cpu alloc: page allocator conversion
  2008-11-06  2:52   ` KOSAKI Motohiro
@ 2008-11-06 15:04     ` Christoph Lameter
  2008-11-07  0:37       ` KOSAKI Motohiro
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2008-11-06 15:04 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: akpm, Pekka Enberg, Christoph Lameter, linux-kernel, linux-mm,
	travis, Stephen Rothwell, Vegard Nossum

On Thu, 6 Nov 2008, KOSAKI Motohiro wrote:

> > -		free_zone_pagesets(cpu);
> > +		process_zones(cpu);
> >  		break;
>
> Why do you drop cpu unplug code?

Because it does not do anything. Percpu areas are traditionally allocated
for each possible cpu not for each online cpu.



* Re: [patch 4/7] cpu ops: Core piece for generic atomic per cpu operations
  2008-11-06  3:58   ` Dave Chinner
@ 2008-11-06 15:05     ` Christoph Lameter
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Lameter @ 2008-11-06 15:05 UTC (permalink / raw)
  To: Dave Chinner
  Cc: akpm, Pekka Enberg, linux-kernel, linux-mm, travis,
	Stephen Rothwell, Vegard Nossum

On Thu, 6 Nov 2008, Dave Chinner wrote:

> On Wed, Nov 05, 2008 at 05:16:38PM -0600, Christoph Lameter wrote:
> > +
> > +#define __CPU_CMPXCHG(var, old, new)		\
> > +({						\
> > +	typeof(obj) x;				\
> > +	typeof(obj) *p = THIS_CPU(&(obj));	\
> > +	x = *p;					\
> > +	if (x == (old))				\
> > +		*p = (new);			\
> > +	(x);					\
> > +})
>
> I don't think that will compile - s/obj/var/ perhaps?

Correct.



* Re: [patch 5/7] x86_64: Support for cpu ops
  2008-11-06  7:12   ` Ingo Molnar
@ 2008-11-06 15:08     ` Christoph Lameter
  2008-11-06 15:15       ` Ingo Molnar
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2008-11-06 15:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: akpm, Pekka Enberg, linux-kernel, linux-mm, travis,
	Stephen Rothwell, Vegard Nossum

On Thu, 6 Nov 2008, Ingo Molnar wrote:

> hm, what happened to the rebase-PDA-to-percpu-area optimization
> patches you guys were working on? I remember there was some binutils
> flakiness - weird crashes and things like that. Did you ever manage to
> stabilize it? It would be sad if only 32-bit could take advantage of
> the optimized ops.

I thought that was in your tree? I saw a conflict in -next with the zero
based stuff a couple of weeks ago. Mike is working on that AFAICT.



* Re: [patch 5/7] x86_64: Support for cpu ops
  2008-11-06 15:08     ` Christoph Lameter
@ 2008-11-06 15:15       ` Ingo Molnar
  2008-11-06 15:44         ` Mike Travis
  2008-11-06 16:11         ` Christoph Lameter
  0 siblings, 2 replies; 29+ messages in thread
From: Ingo Molnar @ 2008-11-06 15:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Pekka Enberg, linux-kernel, linux-mm, travis,
	Stephen Rothwell, Vegard Nossum


* Christoph Lameter <cl@linux-foundation.org> wrote:

> On Thu, 6 Nov 2008, Ingo Molnar wrote:
> 
> > hm, what happened to the rebase-PDA-to-percpu-area optimization 
> > patches you guys were working on? I remember there was some 
> > binutils flakiness - weird crashes and things like that. Did you 
> > ever manage to stabilize it? It would be sad if only 32-bit could 
> > take advantage of the optimized ops.
> 
> I thought that was in your tree? I saw a conflict in -next with the 
> zero based stuff a couple of weeks ago. Mike is working on that 
> AFAICT.

No, what's in tip/core/percpu is not the PDA patches:

 f8d90d9: percpu: zero based percpu build error on s390
 cfcfdff: Merge branch 'linus' into core/percpu
 d379497: Zero based percpu: infrastructure to rebase the per cpu area to zero
 b3a0cb4: x86: extend percpu ops to 64 bit

But it's not actually utilized on x86. AFAICS you guys never came back 
with working patches for that (tip/x86/percpu is empty currently), and 
now i see something related on lkml on a separate track not Cc:-ed to 
the x86 folks so i thought i'd ask whether more coordination is 
desired here.

So ... what's the merge plan here? I like your fundamental idea, it's 
a nice improvement in a couple of areas and i'd like to help make 
it happen. Also, the new per-cpu allocator would be nice for the 
sparseirq code.

	Ingo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 5/7] x86_64: Support for cpu ops
  2008-11-06 15:15       ` Ingo Molnar
@ 2008-11-06 15:44         ` Mike Travis
  2008-11-06 16:27           ` Christoph Lameter
  2008-11-06 16:11         ` Christoph Lameter
  1 sibling, 1 reply; 29+ messages in thread
From: Mike Travis @ 2008-11-06 15:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Lameter, akpm, Pekka Enberg, linux-kernel, linux-mm,
	Stephen Rothwell, Vegard Nossum

Ingo Molnar wrote:
> * Christoph Lameter <cl@linux-foundation.org> wrote:
> 
>> On Thu, 6 Nov 2008, Ingo Molnar wrote:
>>
>>> hm, what happened to the rebase-PDA-to-percpu-area optimization 
>>> patches you guys were working on? I remember there was some 
>>> binutils flakiness - weird crashes and things like that. Did you 
>>> ever manage to stabilize it? It would be sad if only 32-bit could 
>>> take advantage of the optimized ops.
>> I thought that was in your tree? I saw a conflict in -next with the 
>> zero based stuff a couple of weeks ago. Mike is working on that 
>> AFAICT.
> 
> No, what's in tip/core/percpu is not the PDA patches:
> 
>  f8d90d9: percpu: zero based percpu build error on s390
>  cfcfdff: Merge branch 'linus' into core/percpu
>  d379497: Zero based percpu: infrastructure to rebase the per cpu area to zero
>  b3a0cb4: x86: extend percpu ops to 64 bit
> 
> But it's not actually utilized on x86. AFAICS you guys never came back 
> with working patches for that (tip/x86/percpu is empty currently), and 
> now i see something related on lkml on a separate track not Cc:-ed to 
> the x86 folks so i thought i'd ask whether more coordination is 
> desired here.
> 
> So ... what's the merge plan here? I like your fundamental idea, it's 
> a nice improvement in a couple of areas and i'd like to help make 
> it happen. Also, the new per-cpu allocator would be nice for the 
> sparseirq code.
> 
> 	Ingo

Sorry, this was on my plate, but the 4096 cpus work is far more critical to get
released and available.  As soon as that's finally done, I can get back to
the pda/zero-based changes.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 5/7] x86_64: Support for cpu ops
  2008-11-06 15:15       ` Ingo Molnar
  2008-11-06 15:44         ` Mike Travis
@ 2008-11-06 16:11         ` Christoph Lameter
  1 sibling, 0 replies; 29+ messages in thread
From: Christoph Lameter @ 2008-11-06 16:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: akpm, Pekka Enberg, linux-kernel, linux-mm, travis,
	Stephen Rothwell, Vegard Nossum

On Thu, 6 Nov 2008, Ingo Molnar wrote:

> But it's not actually utilized on x86. AFAICS you guys never came back
> with working patches for that (tip/x86/percpu is empty currently), and
> now i see something related on lkml on a separate track not Cc:-ed to
> the x86 folks so i thought i'd ask whether more coordination is
> desired here.

We should definitely look into this, but my priorities have changed a bit.
32-bit is far more significant for me now.

Could you point me to a post that describes the currently open issues with
x86_64? Mike handled that before.

> So ... what's the merge plan here? I like your fundamental idea, it's
> a nice improvement in a couple of areas and i'd like to help out make
> it happen. Also, the new per-cpu allocator would be nice for the
> sparseirq code.

Right. It's good in many areas.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 5/7] x86_64: Support for cpu ops
  2008-11-06 15:44         ` Mike Travis
@ 2008-11-06 16:27           ` Christoph Lameter
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Lameter @ 2008-11-06 16:27 UTC (permalink / raw)
  To: Mike Travis
  Cc: Ingo Molnar, akpm, Pekka Enberg, linux-kernel, linux-mm,
	Stephen Rothwell, Vegard Nossum

On Thu, 6 Nov 2008, Mike Travis wrote:

> Sorry, this was on my plate, but the 4096 cpus work is far more critical to get
> released and available.  As soon as that's finally done, I can get back to
> the pda/zero-based changes.

You cannot solve your 4k issues without getting the zero-based stuff in,
because otherwise the large pointer arrays in the core (page allocator,
slab allocator, etc.) are not removable. Without getting zero-based sorted
out you will produce a lot of hacks around subsystems that create pointer
arrays that would go away easily if you had the percpu allocator.

I'd say fixing the remaining zero-based issues is a prerequisite for
further work on 4k making sense.
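
To make the pointer array point concrete, a rough before/after sketch
(the struct names are placeholders, and the real conversions are the
pageset and kmem_cache_cpu patches in this series):

/* Sketch only; struct foo_pcp stands in for a real per-cpu structure. */
struct foo_pcp {
	long count;
};

/* Before: each core object carries an NR_CPUS-sized pointer array.
 * With NR_CPUS=4096 that is 32KB of pointers per object, even on a
 * machine with only a handful of cpus. */
struct foo_before {
	struct foo_pcp *pcp[NR_CPUS];
};

/* After: one reference to storage handed out by the per-cpu allocator,
 * sized by the possible cpus only (e.g. obtained via CPU_ALLOC() from
 * this series, or the generic alloc_percpu()). */
struct foo_after {
	struct foo_pcp *pcp;
};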

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 7/7] cpu alloc: page allocator conversion
  2008-11-06 15:04     ` Christoph Lameter
@ 2008-11-07  0:37       ` KOSAKI Motohiro
  2008-11-07 18:43         ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: KOSAKI Motohiro @ 2008-11-07  0:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: kosaki.motohiro, akpm, Pekka Enberg, Christoph Lameter,
	linux-kernel, linux-mm, travis, Stephen Rothwell, Vegard Nossum

> On Thu, 6 Nov 2008, KOSAKI Motohiro wrote:
> 
> > > -		free_zone_pagesets(cpu);
> > > +		process_zones(cpu);
> > >  		break;
> >
> > Why do you drop cpu unplug code?
> 
> Because it does not do anything. Percpu areas are traditionally allocated
> for each possible cpu, not for each online cpu.

Yup, agreed.

However, if a cpu-unplug happens, any pages in the pcp lists should be flushed back to buddy (I think).




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 7/7] cpu alloc: page allocator conversion
  2008-11-07  0:37       ` KOSAKI Motohiro
@ 2008-11-07 18:43         ` Christoph Lameter
  2008-11-11  6:10           ` KOSAKI Motohiro
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2008-11-07 18:43 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: akpm, Pekka Enberg, Christoph Lameter, linux-kernel, linux-mm,
	travis, Stephen Rothwell, Vegard Nossum

On Fri, 7 Nov 2008, KOSAKI Motohiro wrote:

> However, if a cpu-unplug happens, any pages in the pcp lists should be flushed back to buddy (I think).

Right. Aren't they?


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 7/7] cpu alloc: page allocator conversion
  2008-11-07 18:43         ` Christoph Lameter
@ 2008-11-11  6:10           ` KOSAKI Motohiro
  2008-11-12  2:02             ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: KOSAKI Motohiro @ 2008-11-11  6:10 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: kosaki.motohiro, akpm, Pekka Enberg, Christoph Lameter,
	linux-kernel, linux-mm, travis, Stephen Rothwell, Vegard Nossum

> On Fri, 7 Nov 2008, KOSAKI Motohiro wrote:
> 
> > However, if a cpu-unplug happens, any pages in the pcp lists should be flushed back to buddy (I think).
> 
> Right. Aren't they?
> 

Doh, I was being really silly.
Yes, the pcp draining is handled by another function.
I missed it.

Very sorry.


In addition, I think cleanup is better.
I made the patch.



===========================================================
Now page_alloc_init() no longer does any real page allocator work, and the cpu-unplug
processing for the pcp lists is spread over two places (pageset_cpuup_callback() and
page_alloc_init()). That is neither reasonable nor easy to read.

Clean this up here.


Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 init/main.c     |    1 
 mm/page_alloc.c |   60 +++++++++++++++++++++++++-------------------------------
 2 files changed, 27 insertions(+), 34 deletions(-)

Index: b/mm/page_alloc.c
===================================================================
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2765,6 +2765,28 @@ static void __cpuinit process_zones(int 
 }
 
 #ifdef CONFIG_SMP
+static void drain_pages_and_fold_stats(int cpu)
+{
+	drain_pages(cpu);
+
+	/*
+	 * Spill the event counters of the dead processor
+	 * into the current processors event counters.
+	 * This artificially elevates the count of the current
+	 * processor.
+	 */
+	vm_events_fold_cpu(cpu);
+
+	/*
+	 * Zero the differential counters of the dead processor
+	 * so that the vm statistics are consistent.
+	 *
+	 * This is only okay since the processor is dead and cannot
+	 * race with what we are doing.
+	 */
+	refresh_cpu_vm_stats(cpu);
+}
+
 static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
 		unsigned long action,
 		void *hcpu)
@@ -2777,6 +2799,11 @@ static int __cpuinit pageset_cpuup_callb
 	case CPU_UP_PREPARE_FROZEN:
 		process_zones(cpu);
 		break;
+	case CPU_DEAD:
+	case CPU_DEAD_FROZEN:
+		drain_pages_and_fold_stats(cpu);
+		break;
+
 	default:
 		break;
 	}
@@ -4092,39 +4119,6 @@ void __init free_area_init(unsigned long
 			__pa(PAGE_OFFSET) >> PAGE_SHIFT, NULL);
 }
 
-static int page_alloc_cpu_notify(struct notifier_block *self,
-				 unsigned long action, void *hcpu)
-{
-	int cpu = (unsigned long)hcpu;
-
-	if (action == CPU_DEAD || action == CPU_DEAD_FROZEN) {
-		drain_pages(cpu);
-
-		/*
-		 * Spill the event counters of the dead processor
-		 * into the current processors event counters.
-		 * This artificially elevates the count of the current
-		 * processor.
-		 */
-		vm_events_fold_cpu(cpu);
-
-		/*
-		 * Zero the differential counters of the dead processor
-		 * so that the vm statistics are consistent.
-		 *
-		 * This is only okay since the processor is dead and cannot
-		 * race with what we are doing.
-		 */
-		refresh_cpu_vm_stats(cpu);
-	}
-	return NOTIFY_OK;
-}
-
-void __init page_alloc_init(void)
-{
-	hotcpu_notifier(page_alloc_cpu_notify, 0);
-}
-
 /*
  * calculate_totalreserve_pages - called when sysctl_lower_zone_reserve_ratio
  *	or min_free_kbytes changes.
Index: b/init/main.c
===================================================================
--- a/init/main.c
+++ b/init/main.c
@@ -619,7 +619,6 @@ asmlinkage void __init start_kernel(void
 	 */
 	preempt_disable();
 	build_all_zonelists();
-	page_alloc_init();
 	printk(KERN_NOTICE "Kernel command line: %s\n", boot_command_line);
 	parse_early_param();
 	parse_args("Booting kernel", static_command_line, __start___param,



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 0/7] cpu alloc stage 2
  2008-11-05 23:16 [patch 0/7] cpu alloc stage 2 Christoph Lameter
                   ` (6 preceding siblings ...)
  2008-11-05 23:16 ` [patch 7/7] cpu alloc: page allocator conversion Christoph Lameter
@ 2008-11-11 23:56 ` Andrew Morton
  2008-11-12  0:28   ` Christoph Lameter
  2008-11-12  6:57 ` Stephen Rothwell
  8 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2008-11-11 23:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: penberg, linux-kernel, linux-mm, travis, sfr, vegard.nossum

On Wed, 05 Nov 2008 17:16:34 -0600
Christoph Lameter <cl@linux-foundation.org> wrote:

> The second stage of the cpu_alloc patchset can be pulled from
> 
> git.kernel.org/pub/scm/linux/kernel/git/christoph/work.git cpu_alloc_stage2
> 
> Stage 2 includes the conversion of the page allocator
> and slub allocator to the use of the cpu allocator.
> 
> It also includes the core of the atomic vs. interrupt cpu ops and uses those
> for the vm statistics.

It all looks very nice to me.  It's a shame about the lack of any
commonality with local_t though.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 0/7] cpu alloc stage 2
  2008-11-11 23:56 ` [patch 0/7] cpu alloc stage 2 Andrew Morton
@ 2008-11-12  0:28   ` Christoph Lameter
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Lameter @ 2008-11-12  0:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: penberg, linux-kernel, linux-mm, travis, sfr, vegard.nossum

On Tue, 11 Nov 2008, Andrew Morton wrote:

> It all looks very nice to me.  It's a shame about the lack of any
> commonality with local_t though.

At the end of the full patchset local_t is no more because cpu ops can
completely replace all use cases for local_t.
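
For context, the kind of replacement meant here, sketched with made-up
counter names; CPU_INC() stands for the style of op introduced in patch
4/7 and is illustrative rather than a final name:

#include <asm/local.h>
#include <linux/percpu.h>

/* local_t style: take the address of this cpu's instance, then apply
 * a local_* operation to it. */
static DEFINE_PER_CPU(local_t, old_counter);

static inline void old_bump(void)
{
	local_inc(&__get_cpu_var(old_counter));
}

/* cpu ops style (illustrative): the operation locates the per-cpu
 * instance itself, so no explicit address calculation is needed. */
static DEFINE_PER_CPU(long, new_counter);

static inline void new_bump(void)
{
	CPU_INC(new_counter);
}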


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 7/7] cpu alloc: page allocator conversion
  2008-11-11  6:10           ` KOSAKI Motohiro
@ 2008-11-12  2:02             ` Christoph Lameter
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Lameter @ 2008-11-12  2:02 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: akpm, Pekka Enberg, linux-kernel, linux-mm, travis,
	Stephen Rothwell, Vegard Nossum

Thanks for the patch. I folded it into this one with your signoff (okay, right?)


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 0/7] cpu alloc stage 2
  2008-11-05 23:16 [patch 0/7] cpu alloc stage 2 Christoph Lameter
                   ` (7 preceding siblings ...)
  2008-11-11 23:56 ` [patch 0/7] cpu alloc stage 2 Andrew Morton
@ 2008-11-12  6:57 ` Stephen Rothwell
  2008-11-12 20:07   ` Christoph Lameter
  8 siblings, 1 reply; 29+ messages in thread
From: Stephen Rothwell @ 2008-11-12  6:57 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Pekka Enberg, linux-kernel, linux-mm, travis, Vegard Nossum

[-- Attachment #1: Type: text/plain, Size: 738 bytes --]

Hi Christoph,

On Wed, 05 Nov 2008 17:16:34 -0600 Christoph Lameter <cl@linux-foundation.org> wrote:
>
> The second stage of the cpu_alloc patchset can be pulled from
> 
> git.kernel.org/pub/scm/linux/kernel/git/christoph/work.git cpu_alloc_stage2
> 
> Stage 2 includes the conversion of the page allocator
> and slub allocator to the use of the cpu allocator.
> 
> It also includes the core of the atomic vs. interrupt cpu ops and uses those
> for the vm statistics.

I have seen some discussion of these patches (and some fixes for the
previous set).  Are they in a state that they should be in linux-next yet?

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 0/7] cpu alloc stage 2
  2008-11-12  6:57 ` Stephen Rothwell
@ 2008-11-12 20:07   ` Christoph Lameter
  2008-11-12 23:35     ` Stephen Rothwell
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2008-11-12 20:07 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: akpm, Pekka Enberg, linux-kernel, linux-mm, travis, Vegard Nossum

On Wed, 12 Nov 2008, Stephen Rothwell wrote:

> I have seen some discussion of these patches (and some fixes for the
> previous set).  Are they in a state that they should be in linux-next yet?

I will push out a new patchset and tree in the next hour or so for
you to merge into linux-next.



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 0/7] cpu alloc stage 2
  2008-11-12 20:07   ` Christoph Lameter
@ 2008-11-12 23:35     ` Stephen Rothwell
  2008-11-13 14:28       ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: Stephen Rothwell @ 2008-11-12 23:35 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Pekka Enberg, linux-kernel, linux-mm, travis, Vegard Nossum

[-- Attachment #1: Type: text/plain, Size: 599 bytes --]

Hi Christoph,

On Wed, 12 Nov 2008 14:07:51 -0600 (CST) Christoph Lameter <cl@linux-foundation.org> wrote:
>
> On Wed, 12 Nov 2008, Stephen Rothwell wrote:
> 
> > I have seen some discussion of these patches (and some fixes for the
> > previous set).  Are they in a state that they should be in linux-next yet?
> 
> I will push out a new patchset and tree in the next hour or so for
> you to merge into linux-next.

Why not just add these to the cpu_alloc tree I already have?

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 0/7] cpu alloc stage 2
  2008-11-12 23:35     ` Stephen Rothwell
@ 2008-11-13 14:28       ` Christoph Lameter
  2008-11-13 21:09         ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2008-11-13 14:28 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: akpm, Pekka Enberg, linux-kernel, linux-mm, travis, Vegard Nossum

On Thu, 13 Nov 2008, Stephen Rothwell wrote:

> > I will push out a new patchset and tree in the next hour or so for
> > you to merge into linux-next.
>
> Why not just add these to the cpu_alloc tree I already have?

What happens if there are problems with the next stage? I want to make
sure that at least the basis is merged.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 0/7] cpu alloc stage 2
  2008-11-13 14:28       ` Christoph Lameter
@ 2008-11-13 21:09         ` Christoph Lameter
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Lameter @ 2008-11-13 21:09 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: akpm, Pekka Enberg, linux-kernel, linux-mm, travis, Vegard Nossum

I put a cpu_alloc_stage2 branch into the git tree:

git.kernel.org/pub/scm/linux/kernel/git/christoph/work.git cpu_alloc_stage2

Not sure if I should dare to merge it into the cpu_alloc branch. It's
pretty touchy work with some of the core components.



^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2008-11-13 21:10 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-11-05 23:16 [patch 0/7] cpu alloc stage 2 Christoph Lameter
2008-11-05 23:16 ` [patch 1/7] Increase default reserve percpu area Christoph Lameter
2008-11-05 23:16 ` [patch 2/7] cpu alloc: Use in slub Christoph Lameter
2008-11-05 23:16 ` [patch 3/7] cpu alloc: Remove slub fields Christoph Lameter
2008-11-05 23:16 ` [patch 4/7] cpu ops: Core piece for generic atomic per cpu operations Christoph Lameter
2008-11-06  3:58   ` Dave Chinner
2008-11-06 15:05     ` Christoph Lameter
2008-11-05 23:16 ` [patch 5/7] x86_64: Support for cpu ops Christoph Lameter
2008-11-06  7:12   ` Ingo Molnar
2008-11-06 15:08     ` Christoph Lameter
2008-11-06 15:15       ` Ingo Molnar
2008-11-06 15:44         ` Mike Travis
2008-11-06 16:27           ` Christoph Lameter
2008-11-06 16:11         ` Christoph Lameter
2008-11-05 23:16 ` [patch 6/7] VM statistics: Use CPU ops Christoph Lameter
2008-11-05 23:16 ` [patch 7/7] cpu alloc: page allocator conversion Christoph Lameter
2008-11-06  2:52   ` KOSAKI Motohiro
2008-11-06 15:04     ` Christoph Lameter
2008-11-07  0:37       ` KOSAKI Motohiro
2008-11-07 18:43         ` Christoph Lameter
2008-11-11  6:10           ` KOSAKI Motohiro
2008-11-12  2:02             ` Christoph Lameter
2008-11-11 23:56 ` [patch 0/7] cpu alloc stage 2 Andrew Morton
2008-11-12  0:28   ` Christoph Lameter
2008-11-12  6:57 ` Stephen Rothwell
2008-11-12 20:07   ` Christoph Lameter
2008-11-12 23:35     ` Stephen Rothwell
2008-11-13 14:28       ` Christoph Lameter
2008-11-13 21:09         ` Christoph Lameter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).