LKML Archive on lore.kernel.org
* [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;
@ 2007-03-20 17:12 Eric Dumazet
  2007-03-20 19:54 ` Christoph Lameter
  0 siblings, 1 reply; 31+ messages in thread
From: Eric Dumazet @ 2007-03-20 17:12 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter, Andi Kleen; +Cc: linux kernel

Hi

I noticed on a small x86_64 NUMA setup (2 nodes) that cache_free_alien() is very expensive.
This is because of a cache miss on struct slab.
By the time an object is freed (by a call to kmem_cache_free(), for example), the underlying 'struct slab' is no longer cache-hot.

struct slab *slabp = virt_to_slab(objp);
nodeid = slabp->nodeid; // cache miss

So we currently need the slab only to look up nodeid, which decides whether the object may go into the cachep per-cpu cache or not.

Couldn't we use something less expensive, like pfn_to_nid() ?
On x86_64, pfn_to_nid() usually shares one cache line (struct memnode) across all objects.

Is it possible for virt_to_slab(objp)->nodeid to differ from pfn_to_nid(objp)?
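
As an illustration, the two lookups being compared can be modeled in plain userspace C (all structure and table names here are simplified stand-ins for illustration, not the kernel's real definitions):

```c
#include <assert.h>

/* Illustration only: simplified stand-ins for the kernel structures. */
struct slab { int nodeid; };
struct page { struct slab *slab; };  /* the kernel hides this in page->lru.prev */

/* Current path: two dependent loads, each a likely cache miss because
 * neither structure is hot at kmem_cache_free() time. */
static int node_via_slab(const struct page *pg)
{
    const struct slab *slabp = pg->slab;  /* miss 1: struct page */
    return slabp->nodeid;                 /* miss 2: struct slab */
}

/* Proposed path: a single load from a small table shared by all objects,
 * which therefore usually stays cache-hot (modeled on struct memnode). */
static const signed char memnode_map[16] = { 0, 1 };  /* toy layout */

static int node_via_pfn(unsigned long pfn, int shift)
{
    return memnode_map[pfn >> shift];
}
```

The point of the comparison: node_via_slab() chases two cold pointers per free, while node_via_pfn() reads one byte from a table shared by every free.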

Eric

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;
  2007-03-20 17:12 [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid; Eric Dumazet
@ 2007-03-20 19:54 ` Christoph Lameter
  2007-03-20 21:32   ` Andi Kleen
  0 siblings, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2007-03-20 19:54 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andrew Morton, Andi Kleen, linux kernel

On Tue, 20 Mar 2007, Eric Dumazet wrote:

> 
> I noticed on a small x86_64 NUMA setup (2 nodes) that cache_free_alien() is very expensive.
> This is because of a cache miss on struct slab.
> By the time an object is freed (by a call to kmem_cache_free(), for example), the underlying 'struct slab' is no longer cache-hot.
> 
> struct slab *slabp = virt_to_slab(objp);
> nodeid = slabp->nodeid; // cache miss
> 
> So we currently need the slab only to look up nodeid, which decides whether the object may go into the cachep per-cpu cache or not.
> 
> Couldn't we use something less expensive, like pfn_to_nid() ?

Nodeid describes the node that the slab was allocated for, which may 
not be equal to the node that the page came from. But if GFP_THISNODE is 
used then the two should always be the same. The two nodeids differing 
was certainly frequent before the GFP_THISNODE work went in; now it may 
occur only in corner cases, perhaps during bootup on some machines that 
boot with empty nodes. I vaguely recall a powerpc issue.

> Is it possible for virt_to_slab(objp)->nodeid to differ from pfn_to_nid(objp)?

It is possible the page allocator falls back to another node than 
requested. We would need to check that this never occurs.

If we are sure then we could drop the nodeid field completely.



* Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;
  2007-03-20 19:54 ` Christoph Lameter
@ 2007-03-20 21:32   ` Andi Kleen
  2007-03-20 22:09     ` Eric Dumazet
  2007-03-21  0:18     ` Christoph Lameter
  0 siblings, 2 replies; 31+ messages in thread
From: Andi Kleen @ 2007-03-20 21:32 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Eric Dumazet, Andrew Morton, Andi Kleen, linux kernel

> > Is it possible for virt_to_slab(objp)->nodeid to differ from pfn_to_nid(objp)?
> 
> It is possible the page allocator falls back to another node than 
> requested. We would need to check that this never occurs.

The only way to ensure that would be to set a strict mempolicy.
But I'm not sure that's a good idea -- after all you don't want
to fail an allocation in this case.

But pfn_to_nid() on the object, as proposed by Eric, should work anyway.
I'm just not sure the tables used for that will be cache-hot more often
than the slab.

-Andi


* Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;
  2007-03-20 21:32   ` Andi Kleen
@ 2007-03-20 22:09     ` Eric Dumazet
  2007-03-21  0:16       ` Christoph Lameter
  2007-03-21  0:18     ` Christoph Lameter
  1 sibling, 1 reply; 31+ messages in thread
From: Eric Dumazet @ 2007-03-20 22:09 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Christoph Lameter, Andrew Morton, linux kernel

Andi Kleen wrote:
>>> Is it possible for virt_to_slab(objp)->nodeid to differ from pfn_to_nid(objp)?
>> It is possible the page allocator falls back to another node than 
>> requested. We would need to check that this never occurs.
> 
> The only way to ensure that would be to set a strict mempolicy.
> But I'm not sure that's a good idea -- after all you don't want
> to fail an allocation in this case.
> 
> But pfn_to_nid() on the object, as proposed by Eric, should work anyway.
> I'm just not sure the tables used for that will be cache-hot more often
> than the slab.

pfn_to_nid() on most x86_64 machines accesses one cache line (struct memnode).

Node 0 MemBase 0000000000000000 Limit 0000000280000000
Node 1 MemBase 0000000280000000 Limit 0000000480000000
NUMA: Using 31 for the hash shift.

In this example, we use only 8 bytes of memnode.embedded_map[] to find the nid 
for all 16 GB of RAM. In the profiles I have, memnode is always hot (no cache misses on it).

While virt_to_slab() has to access:

1) struct page -> page_get_slab() (page->lru.prev) (one cache miss)
2) struct slab -> nodeid (another cache miss)


So using pfn_to_nid() would avoid these two cache misses.
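
Going by the boot log above (hash shift 31, so one map entry per 2 GB chunk), the lookup can be sketched as follows; names are simplified for illustration, the real kernel uses memnode.embedded_map[]:

```c
#include <assert.h>

/* Each entry covers 1 << 31 bytes (2 GB):
 *   Node 0: [0x000000000, 0x280000000) -> entries 0..4
 *   Node 1: [0x280000000, 0x480000000) -> entries 5..8
 * so the whole machine's node layout fits in a handful of bytes
 * that stay cache-hot. */
#define MEMNODE_SHIFT 31

static const signed char memnode_map[9] = { 0, 0, 0, 0, 0, 1, 1, 1, 1 };

static int phys_to_nid(unsigned long long addr)
{
    return memnode_map[addr >> MEMNODE_SHIFT];
}
```

One load from this tiny shared table replaces the two cold dereferences through struct page and struct slab.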

I understand we want to do special things (fallback and such tricks) at 
allocation time, but I believe that we can just trust the real nid of memory 
at free time.




* Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;
  2007-03-20 22:09     ` Eric Dumazet
@ 2007-03-21  0:16       ` Christoph Lameter
  2007-03-21  6:27         ` Eric Dumazet
  0 siblings, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2007-03-21  0:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andi Kleen, Andrew Morton, linux kernel

On Tue, 20 Mar 2007, Eric Dumazet wrote:

> I understand we want to do special things (fallback and such tricks) at
> allocation time, but I believe that we can just trust the real nid of memory
> at free time.

Sorry no. The node at allocation time determines which node specific 
structure tracks the slab. If we fall back then the node is allocated 
from one node but entered in the node structure of another. Thus you 
cannot free the slab without knowing the node at allocation time.



* Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;
  2007-03-20 21:32   ` Andi Kleen
  2007-03-20 22:09     ` Eric Dumazet
@ 2007-03-21  0:18     ` Christoph Lameter
  2007-03-21  2:44       ` Andi Kleen
  1 sibling, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2007-03-21  0:18 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Eric Dumazet, Andrew Morton, linux kernel

On Tue, 20 Mar 2007, Andi Kleen wrote:

> > > Is it possible for virt_to_slab(objp)->nodeid to differ from pfn_to_nid(objp)?
> > 
> > It is possible the page allocator falls back to another node than 
> > requested. We would need to check that this never occurs.
> 
> The only way to ensure that would be to set a strict mempolicy.
> But I'm not sure that's a good idea -- after all you don't want
> to fail an allocation in this case.
> 
> But pfn_to_nid() on the object, as proposed by Eric, should work anyway.
> I'm just not sure the tables used for that will be cache-hot more often
> than the slab.

We usually use page_to_nid(). Sure this will determine the node the object 
resides on. But this may not be the node on which the slab is tracked 
since there may have been a fallback at alloc time.



* Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;
  2007-03-21  0:18     ` Christoph Lameter
@ 2007-03-21  2:44       ` Andi Kleen
  2007-03-21  3:10         ` Christoph Lameter
  0 siblings, 1 reply; 31+ messages in thread
From: Andi Kleen @ 2007-03-21  2:44 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, Eric Dumazet, Andrew Morton, linux kernel

> We usually use page_to_nid(). Sure this will determine the node the object 
> resides on. But this may not be the node on which the slab is tracked 
> since there may have been a fallback at alloc time.

How about your slab rewrite?  I assume it would make more sense to fix
such problems in that code instead of the old one, which is going to be
replaced at some point.

-Andi


* Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;
  2007-03-21  2:44       ` Andi Kleen
@ 2007-03-21  3:10         ` Christoph Lameter
  2007-03-22 21:28           ` non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;) Siddha, Suresh B
  0 siblings, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2007-03-21  3:10 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Eric Dumazet, Andrew Morton, linux kernel

On Wed, 21 Mar 2007, Andi Kleen wrote:

> > We usually use page_to_nid(). Sure this will determine the node the object 
> > resides on. But this may not be the node on which the slab is tracked 
> > since there may have been a fallback at alloc time.
> 
> How about your slab rewrite?  I assume it would make more sense to fix
> such problems in that code instead of the old which is going to be replaced
> at some point.

The slab rewrite first allocates a page and then determines where it 
came from, instead of requiring the page allocator to allocate from a 
certain node. Also, SLUB does not keep per-cpu or per-node object queues, 
so the problem does not occur in the same way. The per-cpu slab in SLUB 
can contain objects from another node, whereas SLAB can only put node-local 
objects on its queues.




* Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;
  2007-03-21  0:16       ` Christoph Lameter
@ 2007-03-21  6:27         ` Eric Dumazet
  2007-03-21  6:57           ` [PATCH] SLAB : Use num_possible_cpus() in enable_cpucache() Eric Dumazet
  2007-03-21  7:03           ` [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid; Christoph Lameter
  0 siblings, 2 replies; 31+ messages in thread
From: Eric Dumazet @ 2007-03-21  6:27 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, Andrew Morton, linux kernel

Christoph Lameter wrote:
> On Tue, 20 Mar 2007, Eric Dumazet wrote:
> 
>> I understand we want to do special things (fallback and such tricks) at
>> allocation time, but I believe that we can just trust the real nid of memory
>> at free time.
> 
> Sorry no. The node at allocation time determines which node specific 
> structure tracks the slab. If we fall back then the node is allocated 
> from one node but entered in the node structure of another. Thus you 
> cannot free the slab without knowing the node at allocation time.

I think you don't understand my point.

When we enter kmem_cache_free(), we are not freeing a slab but an object, 
given a pointer to this object.

The fast path is to put the pointer into the cpu array cache. This object 
might be handed back some cycles later by a kmem_cache_alloc(): no need 
to access the two cache lines (struct page, struct slab).

This fast path could be entered by checking the node of the page, which is 
faster, though possibly different from virt_to_slab(obj)->nodeid. Do we 
care? Definitely not: the node of the page is guaranteed to be correct.

Then, if we must flush the cpu array cache because it is full, we *may* access 
the slabs of the objects we are flushing, and then check 
virt_to_slab(obj)->nodeid to do the correct thing.

Fortunately, flushing the array cache is not a frequent event, and the cost of 
accessing those cache lines (struct page, struct slab) can be amortized because 
several 'transferred or freed' objects might share them at that time.


Actually, I had to disable NUMA on my platforms because it is just overkill and 
slower. Maybe it is OK for very big machines, but not for dual-node 
Opterons? Let me know so that I don't waste your time (and mine).


Thank you


* [PATCH] SLAB : Use num_possible_cpus() in enable_cpucache()
  2007-03-21  6:27         ` Eric Dumazet
@ 2007-03-21  6:57           ` Eric Dumazet
  2007-03-21  7:21             ` [PATCH] SLAB : Dont allocate empty shared caches Eric Dumazet
                               ` (2 more replies)
  2007-03-21  7:03           ` [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid; Christoph Lameter
  1 sibling, 3 replies; 31+ messages in thread
From: Eric Dumazet @ 2007-03-21  6:57 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Lameter, Andi Kleen, linux kernel

[-- Attachment #1: Type: text/plain, Size: 986 bytes --]

The existing comment in mm/slab.c is *perfect*, so I reproduce it here:

         /*
          * CPU bound tasks (e.g. network routing) can exhibit cpu bound
          * allocation behaviour: Most allocs on one cpu, most free operations
          * on another cpu. For these cases, an efficient object passing between
          * cpus is necessary. This is provided by a shared array. The array
          * replaces Bonwick's magazine layer.
          * On uniprocessor, it's functionally equivalent (but less efficient)
          * to a larger limit. Thus disabled by default.
          */

As most shipped Linux kernels are now compiled with CONFIG_SMP, a 
preprocessor #if cannot detect whether the machine is UP or SMP. Better to use 
num_possible_cpus().

This means on UP we allocate a 'size=0 shared array', to be more efficient.

A later patch can avoid allocating these empty shared arrays, to save some 
memory.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

[-- Attachment #2: slab_use_num_possible_cpus.patch --]
[-- Type: text/plain, Size: 398 bytes --]

diff --git a/mm/slab.c b/mm/slab.c
index 57f7aa4..a69d0a5 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3975,10 +3975,8 @@ static int enable_cpucache(struct kmem_c
 	 * to a larger limit. Thus disabled by default.
 	 */
 	shared = 0;
-#ifdef CONFIG_SMP
-	if (cachep->buffer_size <= PAGE_SIZE)
+	if (cachep->buffer_size <= PAGE_SIZE && num_possible_cpus() > 1)
 		shared = 8;
-#endif
 
 #if DEBUG
 	/*


* Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;
  2007-03-21  6:27         ` Eric Dumazet
  2007-03-21  6:57           ` [PATCH] SLAB : Use num_possible_cpus() in enable_cpucache() Eric Dumazet
@ 2007-03-21  7:03           ` Christoph Lameter
  2007-03-21  7:14             ` Eric Dumazet
  1 sibling, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2007-03-21  7:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andi Kleen, Andrew Morton, linux kernel

On Wed, 21 Mar 2007, Eric Dumazet wrote:

> The fast path is to put the pointer into the cpu array cache. This object
> might be handed back some cycles later by a kmem_cache_alloc(): no need
> to access the two cache lines (struct page, struct slab).

If you do that then the slab will no longer return objects from the 
desired nodes. The assumption is that cpu array objects are from the local 
node.


* Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;
  2007-03-21  7:03           ` [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid; Christoph Lameter
@ 2007-03-21  7:14             ` Eric Dumazet
  2007-03-21 14:35               ` Christoph Lameter
  0 siblings, 1 reply; 31+ messages in thread
From: Eric Dumazet @ 2007-03-21  7:14 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, Andrew Morton, linux kernel

Christoph Lameter wrote:
> On Wed, 21 Mar 2007, Eric Dumazet wrote:
> 
>> The fast path is to put the pointer into the cpu array cache. This object
>> might be handed back some cycles later by a kmem_cache_alloc(): no need
>> to access the two cache lines (struct page, struct slab).
> 
> If you do that then the slab will no longer return objects from the 
> desired nodes. The assumption is that cpu array objects are from the local 
> node.

Me confused.

How could the following be wrong?

static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
{
int mynode = numa_node_id();
int objnode = virt_to_nid(objp); // or whatever

if (mynode == objnode)
	return 0;
...
}

If numa_node_id() is equal to the node of the page containing the first byte 
of the object, then the object is on the local node. Or what?

Thank you


* [PATCH] SLAB : Dont allocate empty shared caches
  2007-03-21  6:57           ` [PATCH] SLAB : Use num_possible_cpus() in enable_cpucache() Eric Dumazet
@ 2007-03-21  7:21             ` Eric Dumazet
  2007-03-21 13:13               ` Pekka Enberg
  2007-03-21 13:02             ` [PATCH] SLAB : Use num_possible_cpus() in enable_cpucache() Pekka Enberg
  2007-03-21 18:45             ` Christoph Lameter
  2 siblings, 1 reply; 31+ messages in thread
From: Eric Dumazet @ 2007-03-21  7:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Lameter, Andi Kleen, linux kernel

[-- Attachment #1: Type: text/plain, Size: 306 bytes --]

We can avoid allocating empty shared caches and an unnecessary check of 
cache->limit. We save some memory, and we avoid bringing unnecessary cache 
lines into the CPU cache.

All accesses to l3->shared already check for NULL pointers, so this patch is 
safe.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

[-- Attachment #2: dont_allocate_empty_arraycache.patch --]
[-- Type: text/plain, Size: 1612 bytes --]

diff --git a/mm/slab.c b/mm/slab.c
index a69d0a5..abf46ae 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1223,19 +1223,20 @@ static int __cpuinit cpuup_callback(stru
 		 */
 		list_for_each_entry(cachep, &cache_chain, next) {
 			struct array_cache *nc;
-			struct array_cache *shared;
+			struct array_cache *shared = NULL;
 			struct array_cache **alien = NULL;
 
 			nc = alloc_arraycache(node, cachep->limit,
 						cachep->batchcount);
 			if (!nc)
 				goto bad;
-			shared = alloc_arraycache(node,
+			if (cachep->shared) {
+				shared = alloc_arraycache(node,
 					cachep->shared * cachep->batchcount,
 					0xbaadf00d);
-			if (!shared)
-				goto bad;
-
+				if (!shared)
+					goto bad;
+			}
 			if (use_alien_caches) {
                                 alien = alloc_alien_cache(node, cachep->limit);
                                 if (!alien)
@@ -1317,8 +1318,8 @@ #endif
 
 			shared = l3->shared;
 			if (shared) {
-				free_block(cachep, l3->shared->entry,
-					   l3->shared->avail, node);
+				free_block(cachep, shared->entry,
+					   shared->avail, node);
 				l3->shared = NULL;
 			}
 
@@ -3812,12 +3813,15 @@ static int alloc_kmemlist(struct kmem_ca
                                 goto fail;
                 }
 
-		new_shared = alloc_arraycache(node,
+		new_shared = NULL;
+		if (cachep->shared) {
+			new_shared = alloc_arraycache(node,
 				cachep->shared*cachep->batchcount,
 					0xbaadf00d);
-		if (!new_shared) {
-			free_alien_cache(new_alien);
-			goto fail;
+			if (!new_shared) {
+				free_alien_cache(new_alien);
+				goto fail;
+			}
 		}
 
 		l3 = cachep->nodelists[node];


* Re: [PATCH] SLAB : Use num_possible_cpus() in enable_cpucache()
  2007-03-21  6:57           ` [PATCH] SLAB : Use num_possible_cpus() in enable_cpucache() Eric Dumazet
  2007-03-21  7:21             ` [PATCH] SLAB : Dont allocate empty shared caches Eric Dumazet
@ 2007-03-21 13:02             ` Pekka Enberg
  2007-03-21 18:45             ` Christoph Lameter
  2 siblings, 0 replies; 31+ messages in thread
From: Pekka Enberg @ 2007-03-21 13:02 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andrew Morton, Christoph Lameter, Andi Kleen, linux kernel

On 3/21/07, Eric Dumazet <dada1@cosmosbay.com> wrote:
> As most shipped Linux kernels are now compiled with CONFIG_SMP, a
> preprocessor #if cannot detect whether the machine is UP or SMP. Better
> to use num_possible_cpus().

Looks good to me.

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>


* Re: [PATCH] SLAB : Dont allocate empty shared caches
  2007-03-21  7:21             ` [PATCH] SLAB : Dont allocate empty shared caches Eric Dumazet
@ 2007-03-21 13:13               ` Pekka Enberg
  0 siblings, 0 replies; 31+ messages in thread
From: Pekka Enberg @ 2007-03-21 13:13 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andrew Morton, Christoph Lameter, Andi Kleen, linux kernel

On 3/21/07, Eric Dumazet <dada1@cosmosbay.com> wrote:
> We can avoid allocating empty shared caches and an unnecessary check of
> cache->limit. We save some memory, and we avoid bringing unnecessary
> cache lines into the CPU cache.
>
> All accesses to l3->shared already check for NULL pointers, so this patch
> is safe.

Looks good to me.

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>


* Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;
  2007-03-21  7:14             ` Eric Dumazet
@ 2007-03-21 14:35               ` Christoph Lameter
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Lameter @ 2007-03-21 14:35 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andi Kleen, Andrew Morton, linux kernel

On Wed, 21 Mar 2007, Eric Dumazet wrote:

> If numa_node_id() is equal to the node of the page containing the first byte
> of the object, then the object is on the local node. Or what?

No. The slab (the page you are referring to) may have been allocated for 
another node and been tracked via the node structs of that other node. We 
were just falling back to the node that now appears to be local.
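
This objection can be made concrete with a toy model (type and field names here are illustrative only): after an allocator fallback, the node a slab is tracked under and the node its page physically lives on disagree, and only the recorded nodeid finds the right tracking structure.

```c
#include <assert.h>

/* Toy model of the fallback case: a slab allocated FOR node 0 whose
 * backing page actually came from node 1. */
struct toy_slab {
    int nodeid;     /* node whose per-node structure tracks this slab */
    int page_node;  /* node the backing page physically lives on */
};

/* Simulate GFP fallback: we asked for node 0 but got node 1 memory. */
static struct toy_slab fallback_slab(void)
{
    return (struct toy_slab){ .nodeid = 0, .page_node = 1 };
}

/* Freeing must consult nodeid; the page's own node would point at the
 * wrong node's tracking lists. */
static int tracking_node(const struct toy_slab *s)
{
    return s->nodeid;
}
```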




* Re: [PATCH] SLAB : Use num_possible_cpus() in enable_cpucache()
  2007-03-21  6:57           ` [PATCH] SLAB : Use num_possible_cpus() in enable_cpucache() Eric Dumazet
  2007-03-21  7:21             ` [PATCH] SLAB : Dont allocate empty shared caches Eric Dumazet
  2007-03-21 13:02             ` [PATCH] SLAB : Use num_possible_cpus() in enable_cpucache() Pekka Enberg
@ 2007-03-21 18:45             ` Christoph Lameter
  2 siblings, 0 replies; 31+ messages in thread
From: Christoph Lameter @ 2007-03-21 18:45 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andrew Morton, Andi Kleen, linux kernel

On Wed, 21 Mar 2007, Eric Dumazet wrote:

> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

Acked-by: Christoph Lameter <clameter@sgi.com>
 


* non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)
  2007-03-21  3:10         ` Christoph Lameter
@ 2007-03-22 21:28           ` Siddha, Suresh B
  2007-03-22 22:10             ` Christoph Lameter
  2007-03-22 22:12             ` Eric Dumazet
  0 siblings, 2 replies; 31+ messages in thread
From: Siddha, Suresh B @ 2007-03-22 21:28 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, Eric Dumazet, Andrew Morton, linux kernel

Christoph,

While we are on this topic: recently I had reports that
cache_free_alien() is costly on non-NUMA platforms too (similar
to the cache-miss issues that Eric was referring to on NUMA),
and the appended patch seems to fix it, for non-NUMA at least.

The appended patch gives a nice 1% performance improvement on a
non-NUMA platform with a database workload.

Please comment or Ack for mainline :)

thanks,
suresh
---

Subject: [patch] slab: skip cache_free_alien() on non NUMA
From: Suresh Siddha <suresh.b.siddha@intel.com>

Set use_alien_caches to 0 on non-NUMA platforms, and avoid calling
cache_free_alien() when use_alien_caches is not set. This avoids the
cache miss that happens while dereferencing slabp to get the nodeid.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---

diff --git a/mm/slab.c b/mm/slab.c
index 8fdaffa..146082d 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1146,7 +1146,7 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
 	 * Make sure we are not freeing a object from another node to the array
 	 * cache on this cpu.
 	 */
-	if (likely(slabp->nodeid == node) || unlikely(!use_alien_caches))
+	if (likely(slabp->nodeid == node))
 		return 0;
 
 	l3 = cachep->nodelists[node];
@@ -1394,6 +1394,9 @@ void __init kmem_cache_init(void)
 	int order;
 	int node;
 
+	if (num_online_nodes() == 1)
+		use_alien_caches = 0;
+
 	for (i = 0; i < NUM_INIT_LISTS; i++) {
 		kmem_list3_init(&initkmem_list3[i]);
 		if (i < MAX_NUMNODES)
@@ -3563,7 +3566,7 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp)
 	check_irq_off();
 	objp = cache_free_debugcheck(cachep, objp, __builtin_return_address(0));
 
-	if (cache_free_alien(cachep, objp))
+	if (use_alien_caches && cache_free_alien(cachep, objp))
 		return;
 
 	if (likely(ac->avail < ac->limit)) {


* Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)
  2007-03-22 21:28           ` non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;) Siddha, Suresh B
@ 2007-03-22 22:10             ` Christoph Lameter
  2007-03-22 22:12             ` Eric Dumazet
  1 sibling, 0 replies; 31+ messages in thread
From: Christoph Lameter @ 2007-03-22 22:10 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: Andi Kleen, Eric Dumazet, Andrew Morton, linux kernel

On Thu, 22 Mar 2007, Siddha, Suresh B wrote:

> @@ -1394,6 +1394,9 @@ void __init kmem_cache_init(void)
>  	int order;
>  	int node;
>  
> +	if (num_online_nodes() == 1)
> +		use_alien_caches = 0;
> +

What happens if you bring up a second node?



* Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)
  2007-03-22 21:28           ` non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;) Siddha, Suresh B
  2007-03-22 22:10             ` Christoph Lameter
@ 2007-03-22 22:12             ` Eric Dumazet
  2007-03-22 22:40               ` Siddha, Suresh B
  1 sibling, 1 reply; 31+ messages in thread
From: Eric Dumazet @ 2007-03-22 22:12 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Christoph Lameter, Andi Kleen, Andrew Morton, linux kernel

Siddha, Suresh B wrote:
> Christoph,
> 
> While we are on this topic: recently I had reports that
> cache_free_alien() is costly on non-NUMA platforms too (similar
> to the cache-miss issues that Eric was referring to on NUMA),
> and the appended patch seems to fix it, for non-NUMA at least.
> 
> The appended patch gives a nice 1% performance improvement on a
> non-NUMA platform with a database workload.
> 
> Please comment or Ack for mainline :)

I have one comment :)

> @@ -1394,6 +1394,9 @@ void __init kmem_cache_init(void)
>  	int order;
>  	int node;
>  
> +	if (num_online_nodes() == 1)
> +		use_alien_caches = 0;
> +

Unfortunately, this part is wrong.

You should check num_possible_nodes(), or nr_node_ids (the latter is cheaper: 
it's a variable instead of a function call).

I wonder if we could add a new SLAB_NUMA_BYPASS flag, so that we can declare 
some kmem_caches as non-NUMA-aware (for example, I feel network skbs don't 
need the NUMA overhead).




* Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)
  2007-03-22 22:12             ` Eric Dumazet
@ 2007-03-22 22:40               ` Siddha, Suresh B
  2007-03-22 22:56                 ` Eric Dumazet
                                   ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Siddha, Suresh B @ 2007-03-22 22:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Siddha, Suresh B, Christoph Lameter, Andi Kleen, Andrew Morton,
	linux kernel

On Thu, Mar 22, 2007 at 11:12:39PM +0100, Eric Dumazet wrote:
> Siddha, Suresh B wrote:
> >+	if (num_online_nodes() == 1)
> >+		use_alien_caches = 0;
> >+
> 
> Unfortunately, this part is wrong.

oops.

> 
> You should check num_possible_nodes(), or nr_node_ids (the latter is cheaper: 
> it's a variable instead of a function call)

But that is based on a compile-time option, isn't it? Perhaps I need
to use some other mechanism to find out whether the platform is NUMA-capable...

thanks,
suresh


* Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)
  2007-03-22 22:40               ` Siddha, Suresh B
@ 2007-03-22 22:56                 ` Eric Dumazet
  2007-03-23  1:25                 ` Christoph Lameter
  2007-03-23 14:12                 ` Andi Kleen
  2 siblings, 0 replies; 31+ messages in thread
From: Eric Dumazet @ 2007-03-22 22:56 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Christoph Lameter, Andi Kleen, Andrew Morton, linux kernel

Siddha, Suresh B wrote:
> On Thu, Mar 22, 2007 at 11:12:39PM +0100, Eric Dumazet wrote:
>> Siddha, Suresh B wrote:
>>> +	if (num_online_nodes() == 1)
>>> +		use_alien_caches = 0;
>>> +
>> Unfortunatly this part is wrong.
> 
> oops.
> 
>> You should check num_possible_nodes(), or nr_node_ids (the latter is cheaper: 
>> it's a variable instead of a function call)
> 
> But that is based on a compile-time option, isn't it? Perhaps I need
> to use some other mechanism to find out whether the platform is NUMA-capable...

nr_node_ids is defined to 1 if you compile a non-NUMA kernel.

If CONFIG_NUMA is on, then nr_node_ids is a variable filled with the 
maximum possible node id (+1). If your machine is not CPU-hotplug 
capable and you have, say, one node (one dual-core processor for example), 
then nr_node_ids will be set to 1 
(see setup_nr_node_ids() in mm/page_alloc.c).

So this is OK for your needs...
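
A simplified sketch of what setup_nr_node_ids() computes (the real code walks node_possible_map; this stand-in takes a plain bitmask of possible node ids):

```c
#include <assert.h>

/* Simplified model of setup_nr_node_ids(): nr_node_ids is the highest
 * possible node id plus one, derived from the possible-node mask.
 * A variable computed once at boot, so checking it later is just a
 * load, unlike counting online nodes at each call site. */
static int nr_node_ids_from(unsigned long possible_mask)
{
    int highest = 0;

    for (int bit = 0; bit < (int)(8 * sizeof possible_mask); bit++)
        if (possible_mask & (1UL << bit))
            highest = bit;
    return highest + 1;
}
```

Note that holes in the mask still count toward the result: with possible nodes {0, 3}, nr_node_ids is 4, since it bounds node-indexed arrays rather than counting nodes.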



* Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)
  2007-03-22 22:40               ` Siddha, Suresh B
  2007-03-22 22:56                 ` Eric Dumazet
@ 2007-03-23  1:25                 ` Christoph Lameter
  2007-03-23 14:14                   ` Andi Kleen
  2007-03-23 14:12                 ` Andi Kleen
  2 siblings, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2007-03-23  1:25 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: Eric Dumazet, Andi Kleen, Andrew Morton, linux kernel

On Thu, 22 Mar 2007, Siddha, Suresh B wrote:

> > You should check num_possible_nodes(), or nr_node_ids (this one is cheaper, 
> > it's a variable instead of a function call)
> 
> But that is based on a compile-time option, isn't it? Perhaps I need
> to use some other mechanism to find out the platform is not NUMA capable..

No, it's runtime.



* Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)
  2007-03-22 22:40               ` Siddha, Suresh B
  2007-03-22 22:56                 ` Eric Dumazet
  2007-03-23  1:25                 ` Christoph Lameter
@ 2007-03-23 14:12                 ` Andi Kleen
  2007-04-02 22:55                   ` Siddha, Suresh B
  2 siblings, 1 reply; 31+ messages in thread
From: Andi Kleen @ 2007-03-23 14:12 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Eric Dumazet, Christoph Lameter, Andi Kleen, Andrew Morton, linux kernel

> But that is based on a compile-time option, isn't it? Perhaps I need
> to use some other mechanism to find out the platform is not NUMA capable..

We can probably make it runtime on x86. That will be needed sooner or 
later for correct NUMA hotplug support anyway.

-Andi


* Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)
  2007-03-23  1:25                 ` Christoph Lameter
@ 2007-03-23 14:14                   ` Andi Kleen
  0 siblings, 0 replies; 31+ messages in thread
From: Andi Kleen @ 2007-03-23 14:14 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Siddha, Suresh B, Eric Dumazet, Andi Kleen, Andrew Morton, linux kernel

On Thu, Mar 22, 2007 at 06:25:16PM -0700, Christoph Lameter wrote:
> On Thu, 22 Mar 2007, Siddha, Suresh B wrote:
> 
> > > You should check num_possible_nodes(), or nr_node_ids (this one is cheaper, 
> > > it's a variable instead of a function call)
> > 
> > But that is based on a compile-time option, isn't it? Perhaps I need
> > to use some other mechanism to find out the platform is not NUMA capable..
> 
> No its runtime.

I don't see any code that would ever change the mask from the compile-time
default. But that is easy to fix.

-Andi



* Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)
  2007-03-23 14:12                 ` Andi Kleen
@ 2007-04-02 22:55                   ` Siddha, Suresh B
  2007-04-03  0:23                     ` Christoph Lameter
  0 siblings, 1 reply; 31+ messages in thread
From: Siddha, Suresh B @ 2007-04-02 22:55 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Siddha, Suresh B, Eric Dumazet, Christoph Lameter, Andrew Morton,
	linux kernel

On Fri, Mar 23, 2007 at 03:12:10PM +0100, Andi Kleen wrote:
> > But that is based on a compile-time option, isn't it? Perhaps I need
> > to use some other mechanism to find out the platform is not NUMA capable..
> 
> We can probably make it runtime on x86. That will be needed sooner or 
> later for correct NUMA hotplug support anyways.

How about this patch? Thanks.

---
From: Suresh Siddha <suresh.b.siddha@intel.com>
[patch] x86_64: set node_possible_map at runtime.

Set node_possible_map at runtime. On a non-NUMA system,
num_possible_nodes() will now return 1.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---

diff --git a/arch/x86_64/mm/k8topology.c b/arch/x86_64/mm/k8topology.c
index b5b8dba..d6f4447 100644
--- a/arch/x86_64/mm/k8topology.c
+++ b/arch/x86_64/mm/k8topology.c
@@ -49,11 +49,8 @@ int __init k8_scan_nodes(unsigned long start, unsigned long end)
 	int found = 0;
 	u32 reg;
 	unsigned numnodes;
-	nodemask_t nodes_parsed;
 	unsigned dualcore = 0;
 
-	nodes_clear(nodes_parsed);
-
 	if (!early_pci_allowed())
 		return -1;
 
@@ -102,7 +99,7 @@ int __init k8_scan_nodes(unsigned long start, unsigned long end)
 			       nodeid, (base>>8)&3, (limit>>8) & 3); 
 			return -1; 
 		}	
-		if (node_isset(nodeid, nodes_parsed)) { 
+		if (node_isset(nodeid, node_possible_map)) { 
 			printk(KERN_INFO "Node %d already present. Skipping\n", 
 			       nodeid);
 			continue;
@@ -155,7 +152,7 @@ int __init k8_scan_nodes(unsigned long start, unsigned long end)
 
 		prevbase = base;
 
-		node_set(nodeid, nodes_parsed);
+		node_set(nodeid, node_possible_map);
 	} 
 
 	if (!found)
diff --git a/arch/x86_64/mm/numa.c b/arch/x86_64/mm/numa.c
index 41b8fb0..5f7d4d8 100644
--- a/arch/x86_64/mm/numa.c
+++ b/arch/x86_64/mm/numa.c
@@ -383,6 +383,7 @@ static int __init numa_emulation(unsigned long start_pfn, unsigned long end_pfn)
  		       i,
  		       nodes[i].start, nodes[i].end,
  		       (nodes[i].end - nodes[i].start) >> 20);
+		node_set(i, node_possible_map);
 		node_set_online(i);
  	}
  	memnode_shift = compute_hash_shift(nodes, numa_fake);
@@ -405,6 +406,8 @@ void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
 { 
 	int i;
 
+	nodes_clear(node_possible_map);
+
 #ifdef CONFIG_NUMA_EMU
 	if (numa_fake && !numa_emulation(start_pfn, end_pfn))
  		return;
@@ -432,6 +435,7 @@ void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
 	memnodemap[0] = 0;
 	nodes_clear(node_online_map);
 	node_set_online(0);
+	node_set(0, node_possible_map);
 	for (i = 0; i < NR_CPUS; i++)
 		numa_set_node(i, 0);
 	node_to_cpumask[0] = cpumask_of_cpu(0);
diff --git a/arch/x86_64/mm/srat.c b/arch/x86_64/mm/srat.c
index 2efe215..9f26e2b 100644
--- a/arch/x86_64/mm/srat.c
+++ b/arch/x86_64/mm/srat.c
@@ -25,7 +25,6 @@ int acpi_numa __initdata;
 
 static struct acpi_table_slit *acpi_slit;
 
-static nodemask_t nodes_parsed __initdata;
 static struct bootnode nodes[MAX_NUMNODES] __initdata;
 static struct bootnode nodes_add[MAX_NUMNODES];
 static int found_add_area __initdata;
@@ -43,7 +42,7 @@ static __init int setup_node(int pxm)
 static __init int conflicting_nodes(unsigned long start, unsigned long end)
 {
 	int i;
-	for_each_node_mask(i, nodes_parsed) {
+	for_each_node_mask(i, node_possible_map) {
 		struct bootnode *nd = &nodes[i];
 		if (nd->start == nd->end)
 			continue;
@@ -321,7 +320,7 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
 	}
 	nd = &nodes[node];
 	oldnode = *nd;
-	if (!node_test_and_set(node, nodes_parsed)) {
+	if (!node_test_and_set(node, node_possible_map)) {
 		nd->start = start;
 		nd->end = end;
 	} else {
@@ -344,7 +343,7 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
 		printk(KERN_NOTICE "SRAT: Hotplug region ignored\n");
 		*nd = oldnode;
 		if ((nd->start | nd->end) == 0)
-			node_clear(node, nodes_parsed);
+			node_clear(node, node_possible_map);
 	}
 }
 
@@ -356,7 +355,7 @@ static int nodes_cover_memory(void)
 	unsigned long pxmram, e820ram;
 
 	pxmram = 0;
-	for_each_node_mask(i, nodes_parsed) {
+	for_each_node_mask(i, node_possible_map) {
 		unsigned long s = nodes[i].start >> PAGE_SHIFT;
 		unsigned long e = nodes[i].end >> PAGE_SHIFT;
 		pxmram += e - s;
@@ -380,7 +379,7 @@ static int nodes_cover_memory(void)
 static void unparse_node(int node)
 {
 	int i;
-	node_clear(node, nodes_parsed);
+	node_clear(node, node_possible_map);
 	for (i = 0; i < MAX_LOCAL_APIC; i++) {
 		if (apicid_to_node[i] == node)
 			apicid_to_node[i] = NUMA_NO_NODE;
@@ -420,18 +419,18 @@ int __init acpi_scan_nodes(unsigned long start, unsigned long end)
 	}
 
 	/* Finally register nodes */
-	for_each_node_mask(i, nodes_parsed)
+	for_each_node_mask(i, node_possible_map)
 		setup_node_bootmem(i, nodes[i].start, nodes[i].end);
 	/* Try again in case setup_node_bootmem missed one due
 	   to missing bootmem */
-	for_each_node_mask(i, nodes_parsed)
+	for_each_node_mask(i, node_possible_map)
 		if (!node_online(i))
 			setup_node_bootmem(i, nodes[i].start, nodes[i].end);
 
 	for (i = 0; i < NR_CPUS; i++) {
 		if (cpu_to_node[i] == NUMA_NO_NODE)
 			continue;
-		if (!node_isset(cpu_to_node[i], nodes_parsed))
+		if (!node_isset(cpu_to_node[i], node_possible_map))
 			numa_set_node(i, NUMA_NO_NODE);
 	}
 	numa_init_array();


* Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)
  2007-04-02 22:55                   ` Siddha, Suresh B
@ 2007-04-03  0:23                     ` Christoph Lameter
  2007-04-03  0:31                       ` Siddha, Suresh B
  2007-04-09 18:01                       ` [patch 1/2] x86_64: set node_possible_map at runtime Siddha, Suresh B
  0 siblings, 2 replies; 31+ messages in thread
From: Christoph Lameter @ 2007-04-03  0:23 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: Andi Kleen, Eric Dumazet, Andrew Morton, linux kernel

On Mon, 2 Apr 2007, Siddha, Suresh B wrote:

> Set node_possible_map at runtime. On a non-NUMA system,
> num_possible_nodes() will now return 1.

How does this relate to nr_node_ids?



* Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)
  2007-04-03  0:23                     ` Christoph Lameter
@ 2007-04-03  0:31                       ` Siddha, Suresh B
  2007-04-09 18:01                       ` [patch 1/2] x86_64: set node_possible_map at runtime Siddha, Suresh B
  1 sibling, 0 replies; 31+ messages in thread
From: Siddha, Suresh B @ 2007-04-03  0:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Siddha, Suresh B, Andi Kleen, Eric Dumazet, Andrew Morton, linux kernel

On Mon, Apr 02, 2007 at 05:23:20PM -0700, Christoph Lameter wrote:
> On Mon, 2 Apr 2007, Siddha, Suresh B wrote:
> 
> > Set node_possible_map at runtime. On a non-NUMA system,
> > num_possible_nodes() will now return 1.
> 
> How does this relate to nr_node_ids?

With this patch, nr_node_ids on non-NUMA will also be 1, and, as before,
nr_node_ids is the same as num_possible_nodes().

thanks,
suresh


* [patch 1/2] x86_64: set node_possible_map at runtime.
  2007-04-03  0:23                     ` Christoph Lameter
  2007-04-03  0:31                       ` Siddha, Suresh B
@ 2007-04-09 18:01                       ` Siddha, Suresh B
  2007-04-09 18:07                         ` [patch 2/2] slab, x86_64: skip cache_free_alien() on non NUMA Siddha, Suresh B
  2007-04-09 20:23                         ` [patch 1/2] x86_64: set node_possible_map at runtime Andrew Morton
  1 sibling, 2 replies; 31+ messages in thread
From: Siddha, Suresh B @ 2007-04-09 18:01 UTC (permalink / raw)
  To: akpm; +Cc: christoph, Andi Kleen, Eric Dumazet, linux kernel, suresh.b.siddha

Set node_possible_map at runtime on x86_64. On a non-NUMA system,
num_possible_nodes() will now return 1.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---

diff --git a/arch/x86_64/mm/k8topology.c b/arch/x86_64/mm/k8topology.c
index b5b8dba..d6f4447 100644
--- a/arch/x86_64/mm/k8topology.c
+++ b/arch/x86_64/mm/k8topology.c
@@ -49,11 +49,8 @@ int __init k8_scan_nodes(unsigned long start, unsigned long end)
 	int found = 0;
 	u32 reg;
 	unsigned numnodes;
-	nodemask_t nodes_parsed;
 	unsigned dualcore = 0;
 
-	nodes_clear(nodes_parsed);
-
 	if (!early_pci_allowed())
 		return -1;
 
@@ -102,7 +99,7 @@ int __init k8_scan_nodes(unsigned long start, unsigned long end)
 			       nodeid, (base>>8)&3, (limit>>8) & 3); 
 			return -1; 
 		}	
-		if (node_isset(nodeid, nodes_parsed)) { 
+		if (node_isset(nodeid, node_possible_map)) { 
 			printk(KERN_INFO "Node %d already present. Skipping\n", 
 			       nodeid);
 			continue;
@@ -155,7 +152,7 @@ int __init k8_scan_nodes(unsigned long start, unsigned long end)
 
 		prevbase = base;
 
-		node_set(nodeid, nodes_parsed);
+		node_set(nodeid, node_possible_map);
 	} 
 
 	if (!found)
diff --git a/arch/x86_64/mm/numa.c b/arch/x86_64/mm/numa.c
index 41b8fb0..5f7d4d8 100644
--- a/arch/x86_64/mm/numa.c
+++ b/arch/x86_64/mm/numa.c
@@ -383,6 +383,7 @@ static int __init numa_emulation(unsigned long start_pfn, unsigned long end_pfn)
  		       i,
  		       nodes[i].start, nodes[i].end,
  		       (nodes[i].end - nodes[i].start) >> 20);
+		node_set(i, node_possible_map);
 		node_set_online(i);
  	}
  	memnode_shift = compute_hash_shift(nodes, numa_fake);
@@ -405,6 +406,8 @@ void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
 { 
 	int i;
 
+	nodes_clear(node_possible_map);
+
 #ifdef CONFIG_NUMA_EMU
 	if (numa_fake && !numa_emulation(start_pfn, end_pfn))
  		return;
@@ -432,6 +435,7 @@ void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
 	memnodemap[0] = 0;
 	nodes_clear(node_online_map);
 	node_set_online(0);
+	node_set(0, node_possible_map);
 	for (i = 0; i < NR_CPUS; i++)
 		numa_set_node(i, 0);
 	node_to_cpumask[0] = cpumask_of_cpu(0);
diff --git a/arch/x86_64/mm/srat.c b/arch/x86_64/mm/srat.c
index 2efe215..9f26e2b 100644
--- a/arch/x86_64/mm/srat.c
+++ b/arch/x86_64/mm/srat.c
@@ -25,7 +25,6 @@ int acpi_numa __initdata;
 
 static struct acpi_table_slit *acpi_slit;
 
-static nodemask_t nodes_parsed __initdata;
 static struct bootnode nodes[MAX_NUMNODES] __initdata;
 static struct bootnode nodes_add[MAX_NUMNODES];
 static int found_add_area __initdata;
@@ -43,7 +42,7 @@ static __init int setup_node(int pxm)
 static __init int conflicting_nodes(unsigned long start, unsigned long end)
 {
 	int i;
-	for_each_node_mask(i, nodes_parsed) {
+	for_each_node_mask(i, node_possible_map) {
 		struct bootnode *nd = &nodes[i];
 		if (nd->start == nd->end)
 			continue;
@@ -321,7 +320,7 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
 	}
 	nd = &nodes[node];
 	oldnode = *nd;
-	if (!node_test_and_set(node, nodes_parsed)) {
+	if (!node_test_and_set(node, node_possible_map)) {
 		nd->start = start;
 		nd->end = end;
 	} else {
@@ -344,7 +343,7 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
 		printk(KERN_NOTICE "SRAT: Hotplug region ignored\n");
 		*nd = oldnode;
 		if ((nd->start | nd->end) == 0)
-			node_clear(node, nodes_parsed);
+			node_clear(node, node_possible_map);
 	}
 }
 
@@ -356,7 +355,7 @@ static int nodes_cover_memory(void)
 	unsigned long pxmram, e820ram;
 
 	pxmram = 0;
-	for_each_node_mask(i, nodes_parsed) {
+	for_each_node_mask(i, node_possible_map) {
 		unsigned long s = nodes[i].start >> PAGE_SHIFT;
 		unsigned long e = nodes[i].end >> PAGE_SHIFT;
 		pxmram += e - s;
@@ -380,7 +379,7 @@ static int nodes_cover_memory(void)
 static void unparse_node(int node)
 {
 	int i;
-	node_clear(node, nodes_parsed);
+	node_clear(node, node_possible_map);
 	for (i = 0; i < MAX_LOCAL_APIC; i++) {
 		if (apicid_to_node[i] == node)
 			apicid_to_node[i] = NUMA_NO_NODE;
@@ -420,18 +419,18 @@ int __init acpi_scan_nodes(unsigned long start, unsigned long end)
 	}
 
 	/* Finally register nodes */
-	for_each_node_mask(i, nodes_parsed)
+	for_each_node_mask(i, node_possible_map)
 		setup_node_bootmem(i, nodes[i].start, nodes[i].end);
 	/* Try again in case setup_node_bootmem missed one due
 	   to missing bootmem */
-	for_each_node_mask(i, nodes_parsed)
+	for_each_node_mask(i, node_possible_map)
 		if (!node_online(i))
 			setup_node_bootmem(i, nodes[i].start, nodes[i].end);
 
 	for (i = 0; i < NR_CPUS; i++) {
 		if (cpu_to_node[i] == NUMA_NO_NODE)
 			continue;
-		if (!node_isset(cpu_to_node[i], nodes_parsed))
+		if (!node_isset(cpu_to_node[i], node_possible_map))
 			numa_set_node(i, NUMA_NO_NODE);
 	}
 	numa_init_array();


* [patch 2/2] slab, x86_64: skip cache_free_alien() on non NUMA
  2007-04-09 18:01                       ` [patch 1/2] x86_64: set node_possible_map at runtime Siddha, Suresh B
@ 2007-04-09 18:07                         ` Siddha, Suresh B
  2007-04-09 20:23                         ` [patch 1/2] x86_64: set node_possible_map at runtime Andrew Morton
  1 sibling, 0 replies; 31+ messages in thread
From: Siddha, Suresh B @ 2007-04-09 18:07 UTC (permalink / raw)
  To: akpm; +Cc: christoph, Andi Kleen, Eric Dumazet, linux kernel, suresh.b.siddha

Set use_alien_caches to 0 on non-NUMA platforms, and avoid calling
cache_free_alien() when use_alien_caches is not set. This avoids the
cache miss that happens while dereferencing slabp to get the nodeid.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---

diff --git a/mm/slab.c b/mm/slab.c
index 8fdaffa..146082d 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1146,7 +1146,7 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
 	 * Make sure we are not freeing a object from another node to the array
 	 * cache on this cpu.
 	 */
-	if (likely(slabp->nodeid == node) || unlikely(!use_alien_caches))
+	if (likely(slabp->nodeid == node))
 		return 0;
 
 	l3 = cachep->nodelists[node];
@@ -1394,6 +1394,9 @@ void __init kmem_cache_init(void)
 	int order;
 	int node;
 
+	if (num_possible_nodes() == 1)
+		use_alien_caches = 0;
+
 	for (i = 0; i < NUM_INIT_LISTS; i++) {
 		kmem_list3_init(&initkmem_list3[i]);
 		if (i < MAX_NUMNODES)
@@ -3563,7 +3566,7 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp)
 	check_irq_off();
 	objp = cache_free_debugcheck(cachep, objp, __builtin_return_address(0));
 
-	if (cache_free_alien(cachep, objp))
+	if (use_alien_caches && cache_free_alien(cachep, objp))
 		return;
 
 	if (likely(ac->avail < ac->limit)) {


* Re: [patch 1/2] x86_64: set node_possible_map at runtime.
  2007-04-09 18:01                       ` [patch 1/2] x86_64: set node_possible_map at runtime Siddha, Suresh B
  2007-04-09 18:07                         ` [patch 2/2] slab, x86_64: skip cache_free_alien() on non NUMA Siddha, Suresh B
@ 2007-04-09 20:23                         ` Andrew Morton
  1 sibling, 0 replies; 31+ messages in thread
From: Andrew Morton @ 2007-04-09 20:23 UTC (permalink / raw)
  To: Siddha, Suresh B, David Rientjes
  Cc: christoph, Andi Kleen, Eric Dumazet, linux kernel

On Mon, 9 Apr 2007 11:01:45 -0700
"Siddha, Suresh B" <suresh.b.siddha@intel.com> wrote:

> Set node_possible_map at runtime on x86_64. On a non-NUMA system,
> num_possible_nodes() will now return 1.
> 

I had to rework this a bit in the numa-emulation case due to changes in
Andi's tree.  The change to numa_emulation() got moved into
setup_node_range().

We have 166 x86 patches pending at present - the chances of a patch against
mainline applying and working against the development tree are reduced.




end of thread, newest: ~2007-04-09 20:24 UTC

Thread overview: 31+ messages
2007-03-20 17:12 [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid; Eric Dumazet
2007-03-20 19:54 ` Christoph Lameter
2007-03-20 21:32   ` Andi Kleen
2007-03-20 22:09     ` Eric Dumazet
2007-03-21  0:16       ` Christoph Lameter
2007-03-21  6:27         ` Eric Dumazet
2007-03-21  6:57           ` [PATCH] SLAB : Use num_possible_cpus() in enable_cpucache() Eric Dumazet
2007-03-21  7:21             ` [PATCH] SLAB : Dont allocate empty shared caches Eric Dumazet
2007-03-21 13:13               ` Pekka Enberg
2007-03-21 13:02             ` [PATCH] SLAB : Use num_possible_cpus() in enable_cpucache() Pekka Enberg
2007-03-21 18:45             ` Christoph Lameter
2007-03-21  7:03           ` [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid; Christoph Lameter
2007-03-21  7:14             ` Eric Dumazet
2007-03-21 14:35               ` Christoph Lameter
2007-03-21  0:18     ` Christoph Lameter
2007-03-21  2:44       ` Andi Kleen
2007-03-21  3:10         ` Christoph Lameter
2007-03-22 21:28           ` non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;) Siddha, Suresh B
2007-03-22 22:10             ` Christoph Lameter
2007-03-22 22:12             ` Eric Dumazet
2007-03-22 22:40               ` Siddha, Suresh B
2007-03-22 22:56                 ` Eric Dumazet
2007-03-23  1:25                 ` Christoph Lameter
2007-03-23 14:14                   ` Andi Kleen
2007-03-23 14:12                 ` Andi Kleen
2007-04-02 22:55                   ` Siddha, Suresh B
2007-04-03  0:23                     ` Christoph Lameter
2007-04-03  0:31                       ` Siddha, Suresh B
2007-04-09 18:01                       ` [patch 1/2] x86_64: set node_possible_map at runtime Siddha, Suresh B
2007-04-09 18:07                         ` [patch 2/2] slab, x86_64: skip cache_free_alien() on non NUMA Siddha, Suresh B
2007-04-09 20:23                         ` [patch 1/2] x86_64: set node_possible_map at runtime Andrew Morton
