LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* hash table sizes
@ 2003-11-25 13:35 Jes Sorensen
  2003-11-25 13:42 ` William Lee Irwin III
  2003-11-25 20:48 ` Jack Steiner
  0 siblings, 2 replies; 32+ messages in thread
From: Jes Sorensen @ 2003-11-25 13:35 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Andrew Morton, William Lee Irwin, III, linux-kernel, jbarnes, steiner

Hi,

On NUMA systems with way too much memory, the current algorithms for
determining the size of the inode and dentry hash tables ends up trying
to allocate tables that are so big they may not fit within the physical
memory of a single node. Ie. on a 256 node system with 512GB of RAM with
16KB pages it basically ends up eating up all the memory on node before
completing a boot because of this. The inode and dentry hashes are 256MB
each and the IP routing table hash is 128MB.

I have tried changing the algorithm as below and it seems to produce
reasonable results and almost identical numbers for the smaller /
mid-sized configs I looked at.

This is not meant to be a final patch, any input/oppinion is welcome.

Thanks,
Jes

--- orig/linux-2.6.0-test10/fs/dcache.c	Sat Oct 25 11:42:58 2003
+++ linux-2.6.0-test10/fs/dcache.c	Tue Nov 25 05:33:04 2003
@@ -1549,9 +1549,8 @@
 static void __init dcache_init(unsigned long mempages)
 {
 	struct hlist_head *d;
-	unsigned long order;
 	unsigned int nr_hash;
-	int i;
+	int i, order;
 
 	/* 
 	 * A constructor could be added for stable state like the lists,
@@ -1571,12 +1570,17 @@
 	
 	set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);
 
+#if 0
 #if PAGE_SHIFT < 13
 	mempages >>= (13 - PAGE_SHIFT);
 #endif
 	mempages *= sizeof(struct hlist_head);
 	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
 		;
+#endif
+	mempages >>= (23 - (PAGE_SHIFT - 1));
+	order = max(2, fls(mempages));
+	order = min(12, order);
 
 	do {
 		unsigned long tmp;
@@ -1594,7 +1598,7 @@
 			__get_free_pages(GFP_ATOMIC, order);
 	} while (dentry_hashtable == NULL && --order >= 0);
 
-	printk(KERN_INFO "Dentry cache hash table entries: %d (order: %ld, %ld bytes)\n",
+	printk(KERN_INFO "Dentry cache hash table entries: %d (order: %d, %ld bytes)\n",
 			nr_hash, order, (PAGE_SIZE << order));
 
 	if (!dentry_hashtable)
--- orig/linux-2.6.0-test10/fs/inode.c	Sat Oct 25 11:44:53 2003
+++ linux-2.6.0-test10/fs/inode.c	Tue Nov 25 05:33:27 2003
@@ -1333,17 +1333,21 @@
 void __init inode_init(unsigned long mempages)
 {
 	struct hlist_head *head;
-	unsigned long order;
 	unsigned int nr_hash;
-	int i;
+	int i, order;
 
 	for (i = 0; i < ARRAY_SIZE(i_wait_queue_heads); i++)
 		init_waitqueue_head(&i_wait_queue_heads[i].wqh);
 
+#if 0
 	mempages >>= (14 - PAGE_SHIFT);
 	mempages *= sizeof(struct hlist_head);
 	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
 		;
+#endif
+	mempages >>= (23 - (PAGE_SHIFT - 1));
+	order = max(2, fls(mempages));
+	order = min(12, order);
 
 	do {
 		unsigned long tmp;

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-25 13:35 hash table sizes Jes Sorensen
@ 2003-11-25 13:42 ` William Lee Irwin III
  2003-11-25 13:54   ` Jes Sorensen
  2003-11-25 20:48 ` Jack Steiner
  1 sibling, 1 reply; 32+ messages in thread
From: William Lee Irwin III @ 2003-11-25 13:42 UTC (permalink / raw)
  To: Jes Sorensen
  Cc: Alexander Viro, Andrew Morton, linux-kernel, jbarnes, steiner

On Tue, Nov 25, 2003 at 08:35:49AM -0500, Jes Sorensen wrote:
> +	mempages >>= (23 - (PAGE_SHIFT - 1));
> +	order = max(2, fls(mempages));
> +	order = min(12, order);

This is curious; what's 23 - (PAGE_SHIFT - 1) supposed to represent?


-- wli

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-25 13:42 ` William Lee Irwin III
@ 2003-11-25 13:54   ` Jes Sorensen
  2003-11-25 16:25     ` Thomas Schlichter
  0 siblings, 1 reply; 32+ messages in thread
From: Jes Sorensen @ 2003-11-25 13:54 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Alexander Viro, Andrew Morton, linux-kernel, jbarnes, steiner

>>>>> "William" == William Lee Irwin <wli@holomorphy.com> writes:

William> On Tue, Nov 25, 2003 at 08:35:49AM -0500, Jes Sorensen wrote:
>> + mempages >>= (23 - (PAGE_SHIFT - 1)); + order = max(2,
>> fls(mempages)); + order = min(12, order);

William> This is curious; what's 23 - (PAGE_SHIFT - 1) supposed to
William> represent?

Well MB >> 2 or consider it trial and error ;-)

Cheers,
Jes

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-25 13:54   ` Jes Sorensen
@ 2003-11-25 16:25     ` Thomas Schlichter
  2003-11-25 17:52       ` Antonio Vargas
  0 siblings, 1 reply; 32+ messages in thread
From: Thomas Schlichter @ 2003-11-25 16:25 UTC (permalink / raw)
  To: Jes Sorensen, William Lee Irwin III
  Cc: Alexander Viro, Andrew Morton, linux-kernel, jbarnes, steiner


[-- Attachment #1.1: Type: text/plain, Size: 2490 bytes --]

Hi Jens,

I was wondering about you funny formula, too. (especially because it was 23 - 
(PAGE_SHIFT -1) and not 24 - PAGE_SHIFT ;-) So I wanted to find out why this:
1:	mempages >>= (23 - (PAGE_SHIFT - 1));
2:	order = max(2, fls(mempages));
3:	order = min(12, order);

should be similar to this original code:
4:	mempages >>= (14 - PAGE_SHIFT);
5:	mempages *= sizeof(struct hlist_head);
6:	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
7:		;

Well, first I saw that lines 2 and 3 make order to be between 2 and 12. OK 
this is new, and I like it ;-) Then I saw you tried to convert the ugly lines 
6 + 7 into a fls() call. Fine! (I like that, too) Line 6 + 7 can be converted 
to following (almost equivalent) lines:
6 + 7	-> 8a:	order = fls(mempages) - PAGE_SHIFT;
or
6 + 7	-> 8b:	order = fls(mempages >> PAGE_SHIFT);

Now the shift is not present in line 2, so I folded it from line 8b into line 
4 and got:
4	-> 9:	mempages >>= 14;
5:		mempages *= sizeof(struct hlist_head);
8b	-> 10:	order = fls(mempages);

But to make the math a bit more correct the multiplication from line 5 has to 
be done before the shift in line 9. So I got following simple two lines 
equivalent to lines 4 to 7:
5 + 9	-> 11:	mempages = (mempages * sizeof(struct hlist_head)) >> 14;
10:		order = fls(mempages);

So, now we can compare line 1 and 11. OK, let's assume PAGE_SHIFT = 12. So we 
get:
1	-> 12:	mempages >>= (23 - (12 - 1)); // right shift by 12

If sizeof(struct hlist_head) == 4, we get for line 10:
11	-> 13:	mempages = (mempages * 4) >> 14; // right shift by 12

And HURRAY, we've got it...!
But they are only equivalent if (PAGE_SHIFT == 12) && (sizeof(struct 
hlist_head) == 4)...

So I propose the attached patch which should be closer to the original and 
makes order not to be greater than 12. (Btw. this only changes anything for 
more than 2^24 pages of memory, so I think this is a good idea!)

Best reagrds
   Thomas Schlichter

On Tuesday 25 November 2003 14:54, Jes Sorensen wrote:
> >>>>> "William" == William Lee Irwin <wli@holomorphy.com> writes:
>
> William> On Tue, Nov 25, 2003 at 08:35:49AM -0500, Jes Sorensen wrote:
> >> + mempages >>= (23 - (PAGE_SHIFT - 1)); + order = max(2,
> >> fls(mempages)); + order = min(12, order);
>
> William> This is curious; what's 23 - (PAGE_SHIFT - 1) supposed to
> William> represent?
>
> Well MB >> 2 or consider it trial and error ;-)
>
> Cheers,
> Jes

[-- Attachment #1.2: hash_sizes.diff --]
[-- Type: text/x-diff, Size: 2265 bytes --]

--- linux-2.6.0-test9-mm5/fs/dcache.c.orig	Tue Nov 25 15:29:38 2003
+++ linux-2.6.0-test9-mm5/fs/dcache.c	Tue Nov 25 16:18:37 2003
@@ -1530,9 +1530,8 @@
 static void __init dcache_init(unsigned long mempages)
 {
 	struct hlist_head *d;
-	unsigned long order;
 	unsigned int nr_hash;
-	int i;
+	int i, order;
 
 	/* 
 	 * A constructor could be added for stable state like the lists,
@@ -1552,12 +1551,10 @@
 	
 	set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);
 
-#if PAGE_SHIFT < 13
-	mempages >>= (13 - PAGE_SHIFT);
-#endif
-	mempages *= sizeof(struct hlist_head);
-	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
-		;
+	mempages = (mempages * sizeof(struct hlist_head)) >> 13;
+	order = fls(mempages);
+	if (order > 12)
+		order = 12;
 
 	do {
 		unsigned long tmp;
@@ -1575,7 +1572,7 @@
 			__get_free_pages(GFP_ATOMIC, order);
 	} while (dentry_hashtable == NULL && --order >= 0);
 
-	printk(KERN_INFO "Dentry cache hash table entries: %d (order: %ld, %ld bytes)\n",
+	printk(KERN_INFO "Dentry cache hash table entries: %d (order: %d, %ld bytes)\n",
 			nr_hash, order, (PAGE_SIZE << order));
 
 	if (!dentry_hashtable)
--- linux-2.6.0-test9-mm5/fs/inode.c.orig	Tue Nov 25 15:33:22 2003
+++ linux-2.6.0-test9-mm5/fs/inode.c	Tue Nov 25 16:17:38 2003
@@ -1319,17 +1319,16 @@
 void __init inode_init(unsigned long mempages)
 {
 	struct hlist_head *head;
-	unsigned long order;
 	unsigned int nr_hash;
-	int i;
+	int i, order;
 
 	for (i = 0; i < ARRAY_SIZE(i_wait_queue_heads); i++)
 		init_waitqueue_head(&i_wait_queue_heads[i].wqh);
 
-	mempages >>= (14 - PAGE_SHIFT);
-	mempages *= sizeof(struct hlist_head);
-	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
-		;
+	mempages = (mempages * sizeof(struct hlist_head)) >> 14;
+	order = fls(mempages);
+	if (order > 12)
+		order = 12;
 
 	do {
 		unsigned long tmp;
@@ -1347,7 +1346,7 @@
 			__get_free_pages(GFP_ATOMIC, order);
 	} while (inode_hashtable == NULL && --order >= 0);
 
-	printk("Inode-cache hash table entries: %d (order: %ld, %ld bytes)\n",
+	printk("Inode-cache hash table entries: %d (order: %d, %ld bytes)\n",
 			nr_hash, order, (PAGE_SIZE << order));
 
 	if (!inode_hashtable)

[-- Attachment #2: signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-25 16:25     ` Thomas Schlichter
@ 2003-11-25 17:52       ` Antonio Vargas
  2003-11-25 17:54         ` William Lee Irwin III
  0 siblings, 1 reply; 32+ messages in thread
From: Antonio Vargas @ 2003-11-25 17:52 UTC (permalink / raw)
  To: Thomas Schlichter
  Cc: Jes Sorensen, William Lee Irwin III, Alexander Viro,
	Andrew Morton, linux-kernel, jbarnes, steiner

On Tue, Nov 25, 2003 at 05:25:07PM +0100, Thomas Schlichter wrote:
> Hi Jens,
> 
> I was wondering about you funny formula, too. (especially because it was 23 - 
> (PAGE_SHIFT -1) and not 24 - PAGE_SHIFT ;-) So I wanted to find out why this:
> 1:	mempages >>= (23 - (PAGE_SHIFT - 1));
> 2:	order = max(2, fls(mempages));
> 3:	order = min(12, order);
> 
> should be similar to this original code:
> 4:	mempages >>= (14 - PAGE_SHIFT);
> 5:	mempages *= sizeof(struct hlist_head);
> 6:	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
> 7:		;
> 
> Well, first I saw that lines 2 and 3 make order to be between 2 and 12. OK 
> this is new, and I like it ;-) Then I saw you tried to convert the ugly lines 
> 6 + 7 into a fls() call. Fine! (I like that, too) Line 6 + 7 can be converted 

is fls(x) sort-of log2(x) via some "find-highest-bit-set"?
I recall discussing something related with Jesse Barnes
last 5 november (search for "[DMESG] cpumask_t in action" in lkml).

[SNIP]

Greets, Antonio Vargas

1. Dado un programa, siempre tiene al menos un fallo.
2. Dadas varias lineas de codigo, siempre se pueden acortar a menos lineas.
3. Por induccion, todos los programas se pueden
   reducir a una linea que no funciona.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-25 17:52       ` Antonio Vargas
@ 2003-11-25 17:54         ` William Lee Irwin III
  0 siblings, 0 replies; 32+ messages in thread
From: William Lee Irwin III @ 2003-11-25 17:54 UTC (permalink / raw)
  To: Antonio Vargas
  Cc: Thomas Schlichter, Jes Sorensen, Alexander Viro, Andrew Morton,
	linux-kernel, jbarnes, steiner

On Tue, Nov 25, 2003 at 06:52:15PM +0100, Antonio Vargas wrote:
> is fls(x) sort-of log2(x) via some "find-highest-bit-set"?
> I recall discussing something related with Jesse Barnes
> last 5 november (search for "[DMESG] cpumask_t in action" in lkml).
> [SNIP]
> Greets, Antonio Vargas

fls() computes floor(lg(n)) via "find highest bit", yes. It stands
for "find last set".


-- wli

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-25 13:35 hash table sizes Jes Sorensen
  2003-11-25 13:42 ` William Lee Irwin III
@ 2003-11-25 20:48 ` Jack Steiner
  2003-11-25 21:07   ` Andrew Morton
  2003-11-25 21:16   ` Anton Blanchard
  1 sibling, 2 replies; 32+ messages in thread
From: Jack Steiner @ 2003-11-25 20:48 UTC (permalink / raw)
  To: Jes Sorensen
  Cc: Alexander Viro, Andrew Morton, William Lee Irwin, III,
	linux-kernel, jbarnes

On Tue, Nov 25, 2003 at 08:35:49AM -0500, Jes Sorensen wrote:
> Hi,
> 
> On NUMA systems with way too much memory, the current algorithms for
> determining the size of the inode and dentry hash tables ends up trying
> to allocate tables that are so big they may not fit within the physical
> memory of a single node. Ie. on a 256 node system with 512GB of RAM with
> 16KB pages it basically ends up eating up all the memory on node before
> completing a boot because of this. The inode and dentry hashes are 256MB
> each and the IP routing table hash is 128MB.
> 
> I have tried changing the algorithm as below and it seems to produce
> reasonable results and almost identical numbers for the smaller /
> mid-sized configs I looked at.
> 
> This is not meant to be a final patch, any input/oppinion is welcome.
> 
> Thanks,
> Jes
> 
...

> +	mempages >>= (23 - (PAGE_SHIFT - 1));
> +	order = max(2, fls(mempages));
> +	order = min(12, order);
>  

I dont think you want to constrain the allocation to a specific "order". Otherwise,
when the kernel is built with a 64k pagesize, the size of the caches will increase 4X.

Some architectures (IA64, for example) dont have severe limitations on usage of vmalloc
space. Would it make sense to use vmalloc on these architectures. Even if the
max size of the structures being allocated is limited to an acceptible size, it still
concentrates the memory on a single node. In general, we should try to
distribute the memory references across as many nodes as possible - at least in theory. 
(In practice, I wonder if we could actually measure the difference.....)


BTW, I think the network code (tcp_init()) has similar problems.

-- 
Thanks

Jack Steiner (steiner@sgi.com)          651-683-5302
Principal Engineer                      SGI - Silicon Graphics, Inc.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-25 20:48 ` Jack Steiner
@ 2003-11-25 21:07   ` Andrew Morton
  2003-11-25 21:14     ` Jesse Barnes
  2003-12-01 21:06     ` Anton Blanchard
  2003-11-25 21:16   ` Anton Blanchard
  1 sibling, 2 replies; 32+ messages in thread
From: Andrew Morton @ 2003-11-25 21:07 UTC (permalink / raw)
  To: Jack Steiner; +Cc: jes, viro, wli, linux-kernel, jbarnes

Jack Steiner <steiner@sgi.com> wrote:
>
> > +	mempages >>= (23 - (PAGE_SHIFT - 1));
> > +	order = max(2, fls(mempages));
> > +	order = min(12, order);
> >  
> 
> I dont think you want to constrain the allocation to a specific "order". Otherwise,
> when the kernel is built with a 64k pagesize, the size of the caches will increase 4X.
> 
> Some architectures (IA64, for example) dont have severe limitations on usage of vmalloc
> space. Would it make sense to use vmalloc on these architectures. Even if the
> max size of the structures being allocated is limited to an acceptible size, it still
> concentrates the memory on a single node. In general, we should try to
> distribute the memory references across as many nodes as possible - at least in theory. 

I don't think we want to use vmalloc, unless it really can be shown that
the use of an entire memory zone on some machine _still_ provides a
hashtable which is so small that it unacceptably impacts performance.

And that it can be shown that a mega-hashtable is still an appropriate data
structure...

> (In practice, I wonder if we could actually measure the difference.....)

Well yes.  I suspect the inode hashtable could be shrunk; I don't think we
do hash-based inode lookups very much?

Also the inode hashtable perhaps could become per superblock, perhaps with
sizing hints from the fs.  Or not.

This is hard.  We could change the sizing of these tables so that for their
setup arithmetic they use "the size of the zone from which the table is to
be allocated" rather than "the size of all memory".  But we do want to make
the size of these tables dependent upon the number of dentries/inodes/etc
which the system is likely to support.  And that does depend upon the
amount of direct-addressible memory.


So hum.  As a starting point, what happens if we do:

-	vfs_caches_init(num_physpages);
+	vfs_caches_init(min(num_physpages, pages_in_ZONE_NORMAL));

?


It would be very nice to have some confirmation that the size of these
tables is being appropriately chosen, too.  Maybe we should shrink 'em 32x
and see who complains...



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-25 21:07   ` Andrew Morton
@ 2003-11-25 21:14     ` Jesse Barnes
  2003-11-25 21:24       ` Andrew Morton
  2003-12-01 21:06     ` Anton Blanchard
  1 sibling, 1 reply; 32+ messages in thread
From: Jesse Barnes @ 2003-11-25 21:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jack Steiner, jes, viro, wli, linux-kernel

On Tue, Nov 25, 2003 at 01:07:41PM -0800, Andrew Morton wrote:
> the size of these tables dependent upon the number of dentries/inodes/etc
> which the system is likely to support.  And that does depend upon the
> amount of direct-addressible memory.
> 
> 
> So hum.  As a starting point, what happens if we do:
> 
> -	vfs_caches_init(num_physpages);
> +	vfs_caches_init(min(num_physpages, pages_in_ZONE_NORMAL));
> 
> ?

Something like that might be ok, but on our system, all memory is in
ZONE_DMA...

Jesse

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-25 20:48 ` Jack Steiner
  2003-11-25 21:07   ` Andrew Morton
@ 2003-11-25 21:16   ` Anton Blanchard
  2003-11-25 23:11     ` Jack Steiner
  1 sibling, 1 reply; 32+ messages in thread
From: Anton Blanchard @ 2003-11-25 21:16 UTC (permalink / raw)
  To: Jack Steiner
  Cc: Jes Sorensen, Alexander Viro, Andrew Morton, William Lee Irwin,
	III, linux-kernel, jbarnes

 
Hi,

> Some architectures (IA64, for example) dont have severe limitations on
> usage of vmalloc space. Would it make sense to use vmalloc on these
> architectures. Even if the max size of the structures being allocated
> is limited to an acceptible size, it still concentrates the memory on
> a single node. 

The kernel region on many architectures is mapped with large pages, we
would lose that by going to vmalloc.

Anton

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-25 21:14     ` Jesse Barnes
@ 2003-11-25 21:24       ` Andrew Morton
  2003-11-26  2:14         ` David S. Miller
                           ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Andrew Morton @ 2003-11-25 21:24 UTC (permalink / raw)
  To: Jesse Barnes; +Cc: steiner, jes, viro, wli, linux-kernel

jbarnes@sgi.com (Jesse Barnes) wrote:
>
> On Tue, Nov 25, 2003 at 01:07:41PM -0800, Andrew Morton wrote:
> > the size of these tables dependent upon the number of dentries/inodes/etc
> > which the system is likely to support.  And that does depend upon the
> > amount of direct-addressible memory.
> > 
> > 
> > So hum.  As a starting point, what happens if we do:
> > 
> > -	vfs_caches_init(num_physpages);
> > +	vfs_caches_init(min(num_physpages, pages_in_ZONE_NORMAL));
> > 
> > ?
> 
> Something like that might be ok, but on our system, all memory is in
> ZONE_DMA...
> 

Well yes, we'd want

	vfs_caches_init(min(num_physpages, some_platform_limit()));

which on ia32 would evaluate to nr_free_buffer_pages() and on ia64 would
evaluate to the size of one of those zones.

If the machine has zillions of small zones then perhaps this will result in
an undersized hashtable.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-25 21:16   ` Anton Blanchard
@ 2003-11-25 23:11     ` Jack Steiner
  2003-11-26  3:39       ` Rik van Riel
  2003-11-26  7:25       ` Anton Blanchard
  0 siblings, 2 replies; 32+ messages in thread
From: Jack Steiner @ 2003-11-25 23:11 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Jes Sorensen, Alexander Viro, Andrew Morton, William Lee Irwin,
	III, linux-kernel, jbarnes

On Wed, Nov 26, 2003 at 08:16:11AM +1100, Anton Blanchard wrote:
>  
> Hi,
> 
> > Some architectures (IA64, for example) dont have severe limitations on
> > usage of vmalloc space. Would it make sense to use vmalloc on these
> > architectures. Even if the max size of the structures being allocated
> > is limited to an acceptible size, it still concentrates the memory on
> > a single node. 
> 
> The kernel region on many architectures is mapped with large pages, we
> would lose that by going to vmalloc.
> 
> Anton


That was a concern to me too. However, on IA64, all page structs are
in the vmalloc region (it isnt allocated by vmalloc but is in the
same region as vmalloc'ed pages. They are mapped with 16k pages instead of the 64MB 
pages used for memory allocated by kmalloc).

Before switching to 16K pages for the page structs, we made numerous performance
measurements. As far as we could tell, there was no performce degradation
caused by the smaller pages. It seems to me that if page structs are ok 
being mapped by 16k pages, the hash tables would be ok too. 



-- 
Thanks

Jack Steiner (steiner@sgi.com)          651-683-5302
Principal Engineer                      SGI - Silicon Graphics, Inc.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-25 21:24       ` Andrew Morton
@ 2003-11-26  2:14         ` David S. Miller
  2003-11-26  5:27         ` Matt Mackall
  2003-11-28 14:15         ` Jes Sorensen
  2 siblings, 0 replies; 32+ messages in thread
From: David S. Miller @ 2003-11-26  2:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: jbarnes, steiner, jes, viro, wli, linux-kernel

On Tue, 25 Nov 2003 13:24:39 -0800
Andrew Morton <akpm@osdl.org> wrote:

> Well yes, we'd want
> 
> 	vfs_caches_init(min(num_physpages, some_platform_limit()));
> 
> which on ia32 would evaluate to nr_free_buffer_pages() and on ia64 would
> evaluate to the size of one of those zones.
> 
> If the machine has zillions of small zones then perhaps this will result in
> an undersized hashtable.

While a platform may have a hard limit, there is also a generic
"usefullness" limit.

If you make the hash tables too huge, you start trading cache misses
on the hash list traversal for misses on the hash array head accesses
themselves.  Big hashes can hurt you also if you don't actually use
them to capacity.

Luckily, now that we've moved the page cache over to the rbtree thing,
there are not many hash tables that strongly depend upon the amount
of RAM in the system.  Unfortunately, the dentry and inode cache ones
being discussed here are two of the remaining ones.

I also believe that vmalloc()'ing this thing is the wrong idea.

Dynamic growth of hash tables, while intellectually interesting to
consider as a solution to the "use to capacity" issue mentioned above,
is wholly impractical in my experience for two reasons: 1) the locking
is messy if not next to impossible 2) getting the higher order allocs
after the system has fully booted is a game of Russian Roulette.

Therefore, I think the safest "fix" for this problem right now is to
just put a hard fixed cap on the hash tables sizes for now instead of
coming up with all sorts of clever new formulas.  In particular, for
the purposes of 2.6.0, we can explore better ideas later.

Andrew's suggested ideas seem to come closest to this.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-25 23:11     ` Jack Steiner
@ 2003-11-26  3:39       ` Rik van Riel
  2003-11-26  3:59         ` William Lee Irwin III
  2003-11-26  7:25       ` Anton Blanchard
  1 sibling, 1 reply; 32+ messages in thread
From: Rik van Riel @ 2003-11-26  3:39 UTC (permalink / raw)
  To: Jack Steiner
  Cc: Anton Blanchard, Jes Sorensen, Alexander Viro, Andrew Morton,
	William Lee Irwin, III, linux-kernel, jbarnes

On Tue, 25 Nov 2003, Jack Steiner wrote:

> That was a concern to me too. However, on IA64, all page structs are in
> the vmalloc region

Which you'll probably want to fix eventually.  At least
PAGE_VALID() doesn't seem to work as advertised currently...

(occasionally leading to "false positives", with PAGE_VALID()
saying that a page exists while it really doesn't)

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-26  3:39       ` Rik van Riel
@ 2003-11-26  3:59         ` William Lee Irwin III
  2003-11-26  4:25           ` Andrew Morton
  2003-11-26  5:14           ` Martin J. Bligh
  0 siblings, 2 replies; 32+ messages in thread
From: William Lee Irwin III @ 2003-11-26  3:59 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Jack Steiner, Anton Blanchard, Jes Sorensen, Alexander Viro,
	Andrew Morton, linux-kernel, jbarnes

On Tue, 25 Nov 2003, Jack Steiner wrote:
>> That was a concern to me too. However, on IA64, all page structs are in
>> the vmalloc region

On Tue, Nov 25, 2003 at 10:39:10PM -0500, Rik van Riel wrote:
> Which you'll probably want to fix eventually.  At least
> PAGE_VALID() doesn't seem to work as advertised currently...
> (occasionally leading to "false positives", with PAGE_VALID()
> saying that a page exists while it really doesn't)

Speaking of which, no one's bothered fixing the X crashes on i386
discontigmem. Untested patch below.


-- wli


diff -prauN linux-2.6.0-test10/include/asm-i386/mmzone.h pfn_valid-2.6.0-test10/include/asm-i386/mmzone.h
--- linux-2.6.0-test10/include/asm-i386/mmzone.h	2003-11-23 17:31:56.000000000 -0800
+++ pfn_valid-2.6.0-test10/include/asm-i386/mmzone.h	2003-11-25 19:54:31.000000000 -0800
@@ -85,13 +85,19 @@ extern struct pglist_data *node_data[];
 })
 #define pmd_page(pmd)		(pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT))
 /*
- * pfn_valid should be made as fast as possible, and the current definition 
- * is valid for machines that are NUMA, but still contiguous, which is what
- * is currently supported. A more generalised, but slower definition would
- * be something like this - mbligh:
- * ( pfn_to_pgdat(pfn) && ((pfn) < node_end_pfn(pfn_to_nid(pfn))) ) 
+ * pfn_valid must absolutely be correct, regardless of speed concerns.
  */ 
-#define pfn_valid(pfn)          ((pfn) < num_physpages)
+#define pfn_valid(pfn)							\
+({									\
+	unsigned long __pfn__ = pfn;					\
+	u8 __nid__ = pfn_to_nid(__pfn__);				\
+	pg_data_t *__pgdat__;						\
+	__pgdat__ = __nid__ < MAX_NUMNODES ? NODE_DATA(__nid__) : NULL;	\
+	__pgdat__ &&							\
+		__pfn__ >= __pgdat__->node_start_pfn &&			\
+		__pfn__ - __pgdat__->node_start_pfn			\
+				< __pgdat__->node_spanned_pages;	\
+})
 
 /*
  * generic node memory support, the following assumptions apply:

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-26  4:25           ` Andrew Morton
@ 2003-11-26  4:23             ` William Lee Irwin III
  0 siblings, 0 replies; 32+ messages in thread
From: William Lee Irwin III @ 2003-11-26  4:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: riel, steiner, anton, jes, viro, linux-kernel, jbarnes

William Lee Irwin III <wli@holomorphy.com> wrote:
>> -#define pfn_valid(pfn)          ((pfn) < num_physpages)
>> +#define pfn_valid(pfn)						\
>> +({									\
>> +	unsigned long __pfn__ = pfn;					\
>> +	u8 __nid__ = pfn_to_nid(__pfn__);				\
>> +	pg_data_t *__pgdat__;						\
>> +	__pgdat__ = __nid__ < MAX_NUMNODES ? NODE_DATA(__nid__) : NULL;	\
>> +	__pgdat__ &&							\
>> +		__pfn__ >= __pgdat__->node_start_pfn &&			\
>> +		__pfn__ - __pgdat__->node_start_pfn			\
>> +				< __pgdat__->node_spanned_pages;	\
>> +})

On Tue, Nov 25, 2003 at 08:25:45PM -0800, Andrew Morton wrote:
> Boggle.
> Does this evaulate to the same thing on non-discontigmem? (surely no)
> Can we please arrange for it to?

The non-discontigmem case is unaltered, as it's handled in
include/asm-i386/page.h under a #ifdef. The semantics agree.

-- wli

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-26  3:59         ` William Lee Irwin III
@ 2003-11-26  4:25           ` Andrew Morton
  2003-11-26  4:23             ` William Lee Irwin III
  2003-11-26  5:14           ` Martin J. Bligh
  1 sibling, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2003-11-26  4:25 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: riel, steiner, anton, jes, viro, linux-kernel, jbarnes

William Lee Irwin III <wli@holomorphy.com> wrote:
>
> Speaking of which, no one's bothered fixing the X crashes on i386
> discontigmem. Untested patch below.
> 
> ...
> -#define pfn_valid(pfn)          ((pfn) < num_physpages)
> +#define pfn_valid(pfn)							\
> +({									\
> +	unsigned long __pfn__ = pfn;					\
> +	u8 __nid__ = pfn_to_nid(__pfn__);				\
> +	pg_data_t *__pgdat__;						\
> +	__pgdat__ = __nid__ < MAX_NUMNODES ? NODE_DATA(__nid__) : NULL;	\
> +	__pgdat__ &&							\
> +		__pfn__ >= __pgdat__->node_start_pfn &&			\
> +		__pfn__ - __pgdat__->node_start_pfn			\
> +				< __pgdat__->node_spanned_pages;	\
> +})

Boggle.

Does this evaulate to the same thing on non-discontigmem? (surely no)

Can we please arrange for it to?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-26  3:59         ` William Lee Irwin III
  2003-11-26  4:25           ` Andrew Morton
@ 2003-11-26  5:14           ` Martin J. Bligh
  2003-11-26  9:51             ` William Lee Irwin III
  1 sibling, 1 reply; 32+ messages in thread
From: Martin J. Bligh @ 2003-11-26  5:14 UTC (permalink / raw)
  To: William Lee Irwin III, Rik van Riel
  Cc: Jack Steiner, Anton Blanchard, Jes Sorensen, Alexander Viro,
	Andrew Morton, linux-kernel, jbarnes

> Speaking of which, no one's bothered fixing the X crashes on i386
> discontigmem. Untested patch below.
> 
> 
> -- wli
> 
> 
> diff -prauN linux-2.6.0-test10/include/asm-i386/mmzone.h pfn_valid-2.6.0-test10/include/asm-i386/mmzone.h
> --- linux-2.6.0-test10/include/asm-i386/mmzone.h	2003-11-23 17:31:56.000000000 -0800
> +++ pfn_valid-2.6.0-test10/include/asm-i386/mmzone.h	2003-11-25 19:54:31.000000000 -0800
> @@ -85,13 +85,19 @@ extern struct pglist_data *node_data[];
>  })
>  #define pmd_page(pmd)		(pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT))
>  /*
> - * pfn_valid should be made as fast as possible, and the current definition 
> - * is valid for machines that are NUMA, but still contiguous, which is what
> - * is currently supported. A more generalised, but slower definition would
> - * be something like this - mbligh:
> - * ( pfn_to_pgdat(pfn) && ((pfn) < node_end_pfn(pfn_to_nid(pfn))) ) 
> + * pfn_valid must absolutely be correct, regardless of speed concerns.
>   */ 
> -#define pfn_valid(pfn)          ((pfn) < num_physpages)
> +#define pfn_valid(pfn)							\
> +({									\
> +	unsigned long __pfn__ = pfn;					\
> +	u8 __nid__ = pfn_to_nid(__pfn__);				\
> +	pg_data_t *__pgdat__;						\
> +	__pgdat__ = __nid__ < MAX_NUMNODES ? NODE_DATA(__nid__) : NULL;	\
> +	__pgdat__ &&							\
> +		__pfn__ >= __pgdat__->node_start_pfn &&			\
> +		__pfn__ - __pgdat__->node_start_pfn			\
> +				< __pgdat__->node_spanned_pages;	\
> +})
>  
>  /*
>   * generic node memory support, the following assumptions apply:

Would it not be rather more readable as something along the lines of:

static inline int pfn_valid (int pfn) {
        int nid = pfn_to_nid(pfn);
        pg_data_t *pgdat;

        if (nid >= MAX_NUMNODES)
                return 0;		/* node invalid */
        pgdat = NODE_DATA(nid);
        if (!pgdat)
                return 0;		/* pgdat invalid */
        if (pfn < pgdat->node_start_pfn)
                return 0;		/* before start of node */
        if (pfn - pgdat->node_start_pfn >= pgdat->node_spanned_pages)
                return 0;		/* past end of node */
        return 1;
}

However, I'm curious as to why this crashes X, as I don't see how this
code change makes a difference in practice. I didn't think we had any i386
NUMA with memory holes between nodes at the moment, though perhaps the x440
does.

M.

PS. No, I haven't tested my rephrasing of your patch either.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-25 21:24       ` Andrew Morton
  2003-11-26  2:14         ` David S. Miller
@ 2003-11-26  5:27         ` Matt Mackall
  2003-11-28 14:15         ` Jes Sorensen
  2 siblings, 0 replies; 32+ messages in thread
From: Matt Mackall @ 2003-11-26  5:27 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jesse Barnes, steiner, jes, viro, wli, linux-kernel

On Tue, Nov 25, 2003 at 01:24:39PM -0800, Andrew Morton wrote:
> jbarnes@sgi.com (Jesse Barnes) wrote:
> >
> > On Tue, Nov 25, 2003 at 01:07:41PM -0800, Andrew Morton wrote:
> > > the size of these tables dependent upon the number of dentries/inodes/etc
> > > which the system is likely to support.  And that does depend upon the
> > > amount of direct-addressible memory.
> > > 
> > > 
> > > So hum.  As a starting point, what happens if we do:
> > > 
> > > -	vfs_caches_init(num_physpages);
> > > +	vfs_caches_init(min(num_physpages, pages_in_ZONE_NORMAL));
> > > 
> > > ?
> > 
> > Something like that might be ok, but on our system, all memory is in
> > ZONE_DMA...
> > 
> 
> Well yes, we'd want
> 
> 	vfs_caches_init(min(num_physpages, some_platform_limit()));
> 
> which on ia32 would evaluate to nr_free_buffer_pages() and on ia64 would
> evaluate to the size of one of those zones.

I actually just added this to the tree I'm working on:

+       vfs_caches_init(min(1000, num_physpages-16000));

Caches are too expensive on the low end of the scale as well, when the
kernel is taking up most of RAM.

-- 
Matt Mackall : http://www.selenic.com : Linux development and consulting

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-25 23:11     ` Jack Steiner
  2003-11-26  3:39       ` Rik van Riel
@ 2003-11-26  7:25       ` Anton Blanchard
  1 sibling, 0 replies; 32+ messages in thread
From: Anton Blanchard @ 2003-11-26  7:25 UTC (permalink / raw)
  To: Jack Steiner
  Cc: Jes Sorensen, Alexander Viro, Andrew Morton, William Lee Irwin,
	III, linux-kernel, jbarnes


> That was a concern to me too. However, on IA64, all page structs are
> in the vmalloc region (it isnt allocated by vmalloc but is in the same
> region as vmalloc'ed pages. They are mapped with 16k pages instead of
> the 64MB pages used for memory allocated by kmalloc).
> 
> Before switching to 16K pages for the page structs, we made numerous
> performance measurements. As far as we could tell, there was no
> performce degradation caused by the smaller pages. It seems to me that
> if page structs are ok being mapped by 16k pages, the hash tables
> would be ok too. 

OK, on ppc64 with a 4kB base pagesize Id be more worried about using the
vmalloc region.

Anton

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-26  5:14           ` Martin J. Bligh
@ 2003-11-26  9:51             ` William Lee Irwin III
  2003-11-26 16:17               ` Martin J. Bligh
  0 siblings, 1 reply; 32+ messages in thread
From: William Lee Irwin III @ 2003-11-26  9:51 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Rik van Riel, Jack Steiner, Anton Blanchard, Jes Sorensen,
	Alexander Viro, Andrew Morton, linux-kernel, jbarnes

On Tue, Nov 25, 2003 at 09:14:32PM -0800, Martin J. Bligh wrote:
> Would it not be rather more readable as something along the lines of:
> static inline int pfn_valid (int pfn) {
>         int nid = pfn_to_nid(pfn);
>         pg_data_t *pgdat;
> 
>         if (nid >= MAX_NUMNODES)
>                 return 0;		/* node invalid */
>         pgdat = NODE_DATA(nid);
>         if (!pgdat)
>                 return 0;		/* pgdat invalid */
>         if (pfn < pgdat->node_start_pfn)
>                 return 0;		/* before start of node */
>         if (pfn - pgdat->node_start_pfn >= pgdat->node_spanned_pages)
>                 return 0;		/* past end of node */
>         return 1;
> }

This header at least used to be included before the relevant structures
were declared (to all appearances, some macros have been converted to
inlines since last I looked).


On Tue, Nov 25, 2003 at 09:14:32PM -0800, Martin J. Bligh wrote:
> However, I'm curious as to why this crashes X, as I don't see how this
> code change makes a difference in practice. I didn't think we had any i386
> NUMA with memory holes between nodes at the moment, though perhaps the x440
> does.
> M.
> PS. No, I haven't tested my rephrasing of your patch either.

mmap() of framebuffers. It takes the box out, not just X. There are
holes just below 4GB regardless. This has actually been reported by
rml and some others.

False positives on pfn_valid() result in manipulations of purported page
structures beyond the bounds of actual allocated pgdat->node_mem_map[]'s,
potentially either corrupting memory or accessing areas outside memory's
limits (the case causing oopsen).

Insubstantially modified version of the above below (it looks fine, I
was just typing from memory). Compiletested only.


-- wli


diff -prauN linux-2.6.0-test10/include/asm-i386/mmzone.h pfn_valid-2.6.0-test10/include/asm-i386/mmzone.h
--- linux-2.6.0-test10/include/asm-i386/mmzone.h	2003-11-23 17:31:56.000000000 -0800
+++ pfn_valid-2.6.0-test10/include/asm-i386/mmzone.h	2003-11-26 01:40:36.000000000 -0800
@@ -84,14 +84,30 @@ extern struct pglist_data *node_data[];
 		+ __zone->zone_start_pfn;				\
 })
 #define pmd_page(pmd)		(pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT))
+
+static inline int pfn_to_nid(unsigned long);
 /*
- * pfn_valid should be made as fast as possible, and the current definition 
- * is valid for machines that are NUMA, but still contiguous, which is what
- * is currently supported. A more generalised, but slower definition would
- * be something like this - mbligh:
- * ( pfn_to_pgdat(pfn) && ((pfn) < node_end_pfn(pfn_to_nid(pfn))) ) 
+ * pfn_valid must absolutely be correct, regardless of speed concerns.
  */ 
-#define pfn_valid(pfn)          ((pfn) < num_physpages)
+static inline int pfn_valid(unsigned long pfn)
+{
+	u8 nid = pfn_to_nid(pfn);
+	pg_data_t *pgdat;
+
+	if (nid < MAX_NUMNODES)
+		pgdat = NODE_DATA(nid);
+	else
+		return 0;
+
+	if (!pgdat)
+		return 0;
+	else if (pfn < pgdat->node_start_pfn)
+		return 0;
+	else if (pfn - pgdat->node_start_pfn >= pgdat->node_spanned_pages)
+		return 0;
+	else
+		return 1;
+}
 
 /*
  * generic node memory support, the following assumptions apply:

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-26  9:51             ` William Lee Irwin III
@ 2003-11-26 16:17               ` Martin J. Bligh
  0 siblings, 0 replies; 32+ messages in thread
From: Martin J. Bligh @ 2003-11-26 16:17 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Rik van Riel, Jack Steiner, Anton Blanchard, Jes Sorensen,
	Alexander Viro, Andrew Morton, linux-kernel, jbarnes

>> However, I'm curious as to why this crashes X, as I don't see how this
>> code change makes a difference in practice. I didn't think we had any i386
>> NUMA with memory holes between nodes at the moment, though perhaps the x440
>> does.
>> M.
>> PS. No, I haven't tested my rephrasing of your patch either.
> 
> mmap() of framebuffers. It takes the box out, not just X. There are
> holes just below 4GB regardless. This has actually been reported by
> rml and some others.
>
> False positives on pfn_valid() result in manipulations of purported page
> structures beyond the bounds of actual allocated pgdat->node_mem_map[]'s,
> potentially either corrupting memory or accessing areas outside memory's
> limits (the case causing oopsen).

OK. But the hole from 3.75 - 4GB you're referring to doesn't seem to
fall under that definition. 

1) It still has a valid pfn, though the backing memory itself isn't there.
2) It's covered by node_start_pfn ... node_start_pfn + node_spanned_pages,
	which is what your patch tests for. Maybe not if you have = 4GB
	per node? Will be if you have more than that.

I agree that pfn_valid absolutely has to be correct. The current definition
was, I thought, correct unless we have holes *between* nodes. I was under
the impression that no box we had uses that setup, but I guess we might
do - were you seeing this on x440 or NUMA-Q?

NUMA-Q looks like this:

BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
BIOS-e820: 0000000000100000 - 00000000e0000000 (usable)
BIOS-e820: 00000000fec00000 - 00000000fec09000 (reserved)
BIOS-e820: 00000000ffe80000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000400000000 (usable)

which should create mem_map from 0 - 4GB contigious for node 0, AFAICS
(I believe struct pages are created for reserved areas still). Maybe
the srat stuff for x440 does something else.

Your patch is correct, I just don't see that it'll fix the X problem.

> diff -prauN linux-2.6.0-test10/include/asm-i386/mmzone.h pfn_valid-2.6.0-test10/include/asm-i386/mmzone.h
> --- linux-2.6.0-test10/include/asm-i386/mmzone.h	2003-11-23 17:31:56.000000000 -0800
> +++ pfn_valid-2.6.0-test10/include/asm-i386/mmzone.h	2003-11-26 01:40:36.000000000 -0800
> @@ -84,14 +84,30 @@ extern struct pglist_data *node_data[];
>  		+ __zone->zone_start_pfn;				\
>  })
>  #define pmd_page(pmd)		(pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT))
> +
> +static inline int pfn_to_nid(unsigned long);
>  /*
> - * pfn_valid should be made as fast as possible, and the current definition 
> - * is valid for machines that are NUMA, but still contiguous, which is what
> - * is currently supported. A more generalised, but slower definition would
> - * be something like this - mbligh:
> - * ( pfn_to_pgdat(pfn) && ((pfn) < node_end_pfn(pfn_to_nid(pfn))) ) 
> + * pfn_valid must absolutely be correct, regardless of speed concerns.
>   */ 
> -#define pfn_valid(pfn)          ((pfn) < num_physpages)
> +static inline int pfn_valid(unsigned long pfn)
> +{
> +	u8 nid = pfn_to_nid(pfn);
> +	pg_data_t *pgdat;
> +
> +	if (nid < MAX_NUMNODES)
> +		pgdat = NODE_DATA(nid);
> +	else
> +		return 0;
> +
> +	if (!pgdat)
> +		return 0;
> +	else if (pfn < pgdat->node_start_pfn)
> +		return 0;
> +	else if (pfn - pgdat->node_start_pfn >= pgdat->node_spanned_pages)
> +		return 0;
> +	else
> +		return 1;
> +}
>  
>  /*
>   * generic node memory support, the following assumptions apply:


Cool, thanks. I'll try runtesting it.

M.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-25 21:24       ` Andrew Morton
  2003-11-26  2:14         ` David S. Miller
  2003-11-26  5:27         ` Matt Mackall
@ 2003-11-28 14:15         ` Jes Sorensen
  2003-11-28 14:52           ` Jack Steiner
  2 siblings, 1 reply; 32+ messages in thread
From: Jes Sorensen @ 2003-11-28 14:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jesse Barnes, steiner, viro, wli, linux-kernel

>>>>> "Andrew" == Andrew Morton <akpm@osdl.org> writes:

Andrew> jbarnes@sgi.com (Jesse Barnes) wrote:
>> Something like that might be ok, but on our system, all memory is
>> in ZONE_DMA...

Andrew> Well yes, we'd want

Andrew> 	vfs_caches_init(min(num_physpages,
Andrew> some_platform_limit()));

Andrew> which on ia32 would evaluate to nr_free_buffer_pages() and on
Andrew> ia64 would evaluate to the size of one of those zones.

What about something like this? I believe node_present_pages should be
the same as nym_physpages on a non-NUMA machine. If not we can make it
min(num_physpages, NODE_DATA(0)->node_present_pages).

Of course this might not work perfectly if one has multiple nodes and
node 0 has no or very little memory. It would also be nice if one
could spread the various caches onto various nodes, but we can leave
that for stage 2 ;-)

Cheers,
Jes

--- orig/linux-2.6.0-test10/init/main.c	Sun Nov 23 17:31:14 2003
+++ linux-2.6.0-test10/init/main.c	Fri Nov 28 07:06:45 2003
@@ -447,7 +447,7 @@
 	proc_caches_init();
 	buffer_init();
 	security_scaffolding_startup();
-	vfs_caches_init(num_physpages);
+	vfs_caches_init(NODE_DATA(0)->node_present_pages);
 	radix_tree_init();
 	signals_init();
 	/* rootfs populating might need page-writeback */

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-28 14:15         ` Jes Sorensen
@ 2003-11-28 14:52           ` Jack Steiner
  2003-11-28 16:22             ` Jes Sorensen
  0 siblings, 1 reply; 32+ messages in thread
From: Jack Steiner @ 2003-11-28 14:52 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: Andrew Morton, Jesse Barnes, viro, wli, linux-kernel

On Fri, Nov 28, 2003 at 09:15:02AM -0500, Jes Sorensen wrote:
> >>>>> "Andrew" == Andrew Morton <akpm@osdl.org> writes:
> 
> Andrew> jbarnes@sgi.com (Jesse Barnes) wrote:
> >> Something like that might be ok, but on our system, all memory is
> >> in ZONE_DMA...
> 
> Andrew> Well yes, we'd want
> 
> Andrew> 	vfs_caches_init(min(num_physpages,
> Andrew> some_platform_limit()));
> 
> Andrew> which on ia32 would evaluate to nr_free_buffer_pages() and on
> Andrew> ia64 would evaluate to the size of one of those zones.
> 
> What about something like this? I believe node_present_pages should be
> the same as nym_physpages on a non-NUMA machine. If not we can make it
> min(num_physpages, NODE_DATA(0)->node_present_pages).


I may be missing something but I dont see how this fixes the original
problem that we started with.

The system has a large number of nodes. Physically, each node has the same 
amount of memory.  After boot, we observe that several nodes have 
substantially less memory than other nodes. Some of the inbalance is
due to the kernel data/text being on node 0, but by far, the major source 
of in the inbalance is the 3 (in 2.4.x) large hash tables that are being 
allocated. 

I suspect the size of the hash tables is a lot bigger than is needed. 
That is certainly the first problem to be fixed, but unless the 
required size is a very small percentage (5-10%) of the amount of memory 
on a node (2GB to 32GB per node & 256 nodes), we still have a problem.

We run large MPI applications that place threads onto each node. Each
thread needs the same amount of local memory. The maximum problem size 
that can be efficiently solved is limited by the amount of free memory 
on the smallest node. We need an allocation scheme that doesn't deplete
a significant amount of memory from any single node.



> 
> Of course this might not work perfectly if one has multiple nodes and
> node 0 has no or very little memory. It would also be nice if one
> could spread the various caches onto various nodes, but we can leave
> that for stage 2 ;-)
> 
> Cheers,
> Jes
> 
> --- orig/linux-2.6.0-test10/init/main.c	Sun Nov 23 17:31:14 2003
> +++ linux-2.6.0-test10/init/main.c	Fri Nov 28 07:06:45 2003
> @@ -447,7 +447,7 @@
>  	proc_caches_init();
>  	buffer_init();
>  	security_scaffolding_startup();
> -	vfs_caches_init(num_physpages);
> +	vfs_caches_init(NODE_DATA(0)->node_present_pages);
>  	radix_tree_init();
>  	signals_init();
>  	/* rootfs populating might need page-writeback */

-- 
Thanks

Jack Steiner (steiner@sgi.com)          651-683-5302
Principal Engineer                      SGI - Silicon Graphics, Inc.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-28 14:52           ` Jack Steiner
@ 2003-11-28 16:22             ` Jes Sorensen
  2003-11-28 19:35               ` Jack Steiner
  0 siblings, 1 reply; 32+ messages in thread
From: Jes Sorensen @ 2003-11-28 16:22 UTC (permalink / raw)
  To: Jack Steiner; +Cc: Andrew Morton, Jesse Barnes, viro, wli, linux-kernel

>>>>> "Jack" == Jack Steiner <steiner@sgi.com> writes:

Jack> On Fri, Nov 28, 2003 at 09:15:02AM -0500, Jes Sorensen wrote:
>>  What about something like this? I believe node_present_pages
>> should be the same as nym_physpages on a non-NUMA machine. If not
>> we can make it min(num_physpages,
>> NODE_DATA(0)->node_present_pages).

Jack> The system has a large number of nodes. Physically, each node
Jack> has the same amount of memory.  After boot, we observe that
Jack> several nodes have substantially less memory than other
Jack> nodes. Some of the inbalance is due to the kernel data/text
Jack> being on node 0, but by far, the major source of in the
Jack> inbalance is the 3 (in 2.4.x) large hash tables that are being
Jack> allocated.

Jack> I suspect the size of the hash tables is a lot bigger than is
Jack> needed.  That is certainly the first problem to be fixed, but
Jack> unless the required size is a very small percentage (5-10%) of
Jack> the amount of memory on a node (2GB to 32GB per node & 256
Jack> nodes), we still have a problem.

Jack,

I agree with you, however as you point out, there are two problems to
deal with, the excessive size of the hash tables on large systems and
the imbalance that everything goes on node zero. My patch only solves
the first problem, or rather works around it.

Solving the problem of allocating structures on multiple nodes is yet
to be solved.

Cheers,
Jes

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-28 16:22             ` Jes Sorensen
@ 2003-11-28 19:35               ` Jack Steiner
  2003-11-28 21:18                 ` Jörn Engel
  0 siblings, 1 reply; 32+ messages in thread
From: Jack Steiner @ 2003-11-28 19:35 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: Andrew Morton, Jesse Barnes, viro, wli, linux-kernel


On Fri, Nov 28, 2003 at 11:22:47AM -0500, Jes Sorensen wrote:
> >>>>> "Jack" == Jack Steiner <steiner@sgi.com> writes:
> 
> Jack> On Fri, Nov 28, 2003 at 09:15:02AM -0500, Jes Sorensen wrote:
> >>  What about something like this? I believe node_present_pages
> >> should be the same as nym_physpages on a non-NUMA machine. If not
> >> we can make it min(num_physpages,
> >> NODE_DATA(0)->node_present_pages).
> 
> Jack> The system has a large number of nodes. Physically, each node
> Jack> has the same amount of memory.  After boot, we observe that
> Jack> several nodes have substantially less memory than other
> Jack> nodes. Some of the inbalance is due to the kernel data/text
> Jack> being on node 0, but by far, the major source of in the
> Jack> inbalance is the 3 (in 2.4.x) large hash tables that are being
> Jack> allocated.
> 
> Jack> I suspect the size of the hash tables is a lot bigger than is
> Jack> needed.  That is certainly the first problem to be fixed, but
> Jack> unless the required size is a very small percentage (5-10%) of
> Jack> the amount of memory on a node (2GB to 32GB per node & 256
> Jack> nodes), we still have a problem.
> 
> Jack,
> 
> I agree with you, however as you point out, there are two problems to
> deal with, the excessive size of the hash tables on large systems and
> the imbalance that everything goes on node zero. My patch only solves
> the first problem, or rather works around it.
> 
> Solving the problem of allocating structures on multiple nodes is yet
> to be solved.

Jes 

Then I still dont understand your proposal. (I probably missed some piece
of the discussion).

You proposed above to limit the allocation to the amount of memory on a node.
I dont see that does anything on SN systems - allocation is already limited to 
that amount because memory between nodes is discontiguous. We need to limit 
the allocation to a small percentage of the memory on a node. I
dont see how we can do that without:
	
	- using vmalloc (on systems that dont have vmalloc issues)
		OR
	- changing algorithms so that a lrge hash table is not
	  needed. Either lots of smaller hash tables or ???.  I suspect
	  there are performance issues with this.
	  	OR
	- ????

I suppose I need to wait to see the proposal for allocating memory across nodes....


-- 
Thanks

Jack Steiner (steiner@sgi.com)          651-683-5302
Principal Engineer                      SGI - Silicon Graphics, Inc.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-28 19:35               ` Jack Steiner
@ 2003-11-28 21:18                 ` Jörn Engel
  2003-12-01  9:46                   ` Jes Sorensen
  0 siblings, 1 reply; 32+ messages in thread
From: Jörn Engel @ 2003-11-28 21:18 UTC (permalink / raw)
  To: Jack Steiner; +Cc: Jes Sorensen, linux-kernel

[pruned CC: list]

On Fri, 28 November 2003 13:35:36 -0600, Jack Steiner wrote:
> 
> Then I still dont understand your proposal. (I probably missed some piece
> of the discussion).
> 
> You proposed above to limit the allocation to the amount of memory on a node.

Jes didn't _limit_ the allocation to the memory on a node, he _based_
it on it, instead of total memory for all nodes.  Therefore a 1024
node NUMA machine with 2GB per node has no bigger hash tables, than a
single CPU machine with 2GB total memory, however big that may be.

Unless I didn't understand his patch, that is. :)

Jörn

-- 
"Error protection by error detection and correction."
-- from a university class

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-28 21:18                 ` Jörn Engel
@ 2003-12-01  9:46                   ` Jes Sorensen
  0 siblings, 0 replies; 32+ messages in thread
From: Jes Sorensen @ 2003-12-01  9:46 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Jack Steiner, linux-kernel

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=ascii, Size: 825 bytes --]

>>>>> "Jörn" == Jörn Engel <joern@wohnheim.fh-wedel.de> writes:

Jörn> [pruned CC: list]
Jörn> On Fri, 28 November 2003 13:35:36 -0600, Jack Steiner wrote:
>> You proposed above to limit the allocation to the amount of memory
>> on a node.

Jörn> Jes didn't _limit_ the allocation to the memory on a node, he
Jörn> _based_ it on it, instead of total memory for all nodes.
Jörn> Therefore a 1024 node NUMA machine with 2GB per node has no
Jörn> bigger hash tables, than a single CPU machine with 2GB total
Jörn> memory, however big that may be.

Jörn> Unless I didn't understand his patch, that is. :)

Yep, thats exactly what my patch did. There's a gotcha if node 0 has a
lot less memory than the remaining nodes, but I also suspect the size
of those hash tables was already out of whack as memory has gone up.

Thanks,
Jes

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-11-25 21:07   ` Andrew Morton
  2003-11-25 21:14     ` Jesse Barnes
@ 2003-12-01 21:06     ` Anton Blanchard
  2003-12-01 22:57       ` Martin J. Bligh
  1 sibling, 1 reply; 32+ messages in thread
From: Anton Blanchard @ 2003-12-01 21:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jack Steiner, jes, viro, wli, linux-kernel, jbarnes

[-- Attachment #1: Type: text/plain, Size: 1177 bytes --]

 
Hi,

> Well yes.  I suspect the inode hashtable could be shrunk; I don't think we
> do hash-based inode lookups very much?

Id be interested to know how often we hit the inode hash. Last time I
looked at it, there were some real deep chains due (Im guessing) to
the way ext2 inode allocation causes them to end up in groups.

The attached graph shows what is going on. The histogram below shows its
not as bad as I thought, most of the distribution is in the 0-1 bucket
range.

> It would be very nice to have some confirmation that the size of these
> tables is being appropriately chosen, too.  Maybe we should shrink 'em 32x
> and see who complains...

Why dont we just do node round robin allocations during boot? This
should mean the static boot time hashes would at least end up on
different nodes.

Anton

0 248652
1 7374
2 77
3 71
4 72
5 109
6 61
7 75
8 114
9 54
10 34
11 45
12 58
13 35
14 37
15 39
16 32
17 30
18 36
19 31
20 31
21 43
22 47
23 52
24 30
25 26
26 31
27 35
28 35
29 38
30 39
31 38
32 43
33 40
34 49
35 43
36 41
37 44
38 47
39 49
40 51
41 67
42 66
43 104
44 166
45 195
46 289
47 372
48 467
49 542
50 589
51 516
52 436
53 271
54 100
55 53
56 22
57 0
58 1

[-- Attachment #2: inode_cache.png --]
[-- Type: image/png, Size: 10241 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
  2003-12-01 21:06     ` Anton Blanchard
@ 2003-12-01 22:57       ` Martin J. Bligh
  0 siblings, 0 replies; 32+ messages in thread
From: Martin J. Bligh @ 2003-12-01 22:57 UTC (permalink / raw)
  To: Anton Blanchard, Andrew Morton
  Cc: Jack Steiner, jes, viro, wli, linux-kernel, jbarnes

>> It would be very nice to have some confirmation that the size of these
>> tables is being appropriately chosen, too.  Maybe we should shrink 'em 32x
>> and see who complains...
> 
> Why dont we just do node round robin allocations during boot? This
> should mean the static boot time hashes would at least end up on
> different nodes.

We could probably implement a generic striped allocate, which would
do a vmalloc or similar on 64 bit, and either the magic boottime
node-alloc hack, or just a straight node 0 alloc on 32 bit (ie use
vmalloc where needed, without crippling other platforms).

Someone had a patch to do round-robin already (Manfred?) - IMHO doing
it from the node with the most free mem each time would be better, if
we're not going to stripe. 
 
> 0 248652
> 1 7374

...

but yes, that does look utterly screwed ;-)

M.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: hash table sizes
@ 2003-11-29 10:39 Manfred Spraul
  0 siblings, 0 replies; 32+ messages in thread
From: Manfred Spraul @ 2003-11-29 10:39 UTC (permalink / raw)
  To: Jack Steiner
  Cc: Jes Sorensen, linux-kernel, anton, Martin J. Bligh,
	William Lee Irwin III, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 627 bytes --]

What about the attached patch?
- add command line overrides for the hash tables sizes. It's impossible 
to guess if the system is used as a HPC or as a file server. Doc update 
missing.
- limit the dcache hash to 8 mio entries, and the inode hash to 1 mio 
entries, regardless of the amount of memory in the system. The system 
admin can override the defaults at boot time if needed.
- distribute the memory allocations that happen during boot to all nodes 
- the memory will be touched by all cpus, binding all allocs to the boot 
node is wrong.

The patch compiles, but is untested due to lack of hardware.

--
    Manfred



[-- Attachment #2: patch-numa --]
[-- Type: text/plain, Size: 3591 bytes --]

// $Header$
// Kernel Version:
//  VERSION = 2
//  PATCHLEVEL = 6
//  SUBLEVEL = 0
//  EXTRAVERSION = -test11
--- 2.6/mm/page_alloc.c	2003-11-29 09:46:35.000000000 +0100
+++ build-2.6/mm/page_alloc.c	2003-11-29 11:34:04.000000000 +0100
@@ -681,6 +681,42 @@
 
 EXPORT_SYMBOL(__alloc_pages);
 
+#ifdef CONFIG_NUMA
+/* Early boot: Everything is done by one cpu, but the data structures will be
+ * used by all cpus - spread them on all nodes.
+ */
+static __init unsigned long get_boot_pages(unsigned int gfp_mask, unsigned int order)
+{
+static int nodenr;
+	int i = nodenr;
+	struct page *page;
+
+	for (;;) {
+		if (i > nodenr + numnodes)
+			return 0;
+		if (node_present_pages(i%numnodes)) {
+			struct zone **z;
+			/* The node contains memory. Check that there is 
+			 * memory in the intended zonelist.
+			 */
+			z = NODE_DATA(i%numnodes)->node_zonelists[gfp_mask & GFP_ZONEMASK].zones;
+			while (*z) {
+				if ( (*z)->free_pages > (1UL<<order))
+					goto found_node;
+				z++;
+			}
+		}
+		i++;
+	}
+found_node:
+	nodenr = i+1;
+	page = alloc_pages_node(i%numnodes, gfp_mask, order);
+	if (!page)
+		return 0;
+	return (unsigned long) page_address(page);
+}
+#endif
+
 /*
  * Common helper functions.
  */
@@ -688,6 +724,10 @@
 {
 	struct page * page;
 
+#ifdef CONFIG_NUMA
+	if (unlikely(!system_running))
+		return get_boot_pages(gfp_mask, order);
+#endif
 	page = alloc_pages(gfp_mask, order);
 	if (!page)
 		return 0;
--- 2.6/fs/inode.c	2003-11-29 09:46:34.000000000 +0100
+++ build-2.6/fs/inode.c	2003-11-29 10:19:21.000000000 +0100
@@ -1327,6 +1327,20 @@
 		wake_up_all(wq);
 }
 
+static __initdata int ihash_entries;
+
+static int __init set_ihash_entries(char *str)
+{
+	get_option(&str, &ihash_entries);
+	if (ihash_entries <= 0) {
+		ihash_entries = 0;
+		return 0;
+	}
+	return 1;
+}
+
+__setup("ihash_entries=", set_ihash_entries);
+
 /*
  * Initialize the waitqueues and inode hash table.
  */
@@ -1340,8 +1354,16 @@
 	for (i = 0; i < ARRAY_SIZE(i_wait_queue_heads); i++)
 		init_waitqueue_head(&i_wait_queue_heads[i].wqh);
 
-	mempages >>= (14 - PAGE_SHIFT);
-	mempages *= sizeof(struct hlist_head);
+	if (!ihash_entries) {
+		ihash_entries = mempages >> (14 - PAGE_SHIFT);
+		/* Limit inode hash size. Override for nfs servers
+		 * that handle lots of files.
+		 */
+		if (ihash_entries > 1024*1024)
+			ihash_entries = 1024*1024;
+	}
+
+	mempages = ihash_entries*sizeof(struct hlist_head);
 	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
 		;
 
--- 2.6/fs/dcache.c	2003-11-29 09:46:34.000000000 +0100
+++ build-2.6/fs/dcache.c	2003-11-29 10:53:15.000000000 +0100
@@ -1546,6 +1546,20 @@
 	return ino;
 }
 
+static __initdata int dhash_entries;
+
+static int __init set_dhash_entries(char *str)
+{
+	get_option(&str, &dhash_entries);
+	if (dhash_entries <= 0) {
+		dhash_entries = 0;
+		return 0;
+	}
+	return 1;
+}
+
+__setup("dhash_entries=", set_dhash_entries);
+
 static void __init dcache_init(unsigned long mempages)
 {
 	struct hlist_head *d;
@@ -1571,10 +1585,18 @@
 	
 	set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);
 
+	if (!dhash_entries) {
 #if PAGE_SHIFT < 13
-	mempages >>= (13 - PAGE_SHIFT);
+		mempages >>= (13 - PAGE_SHIFT);
 #endif
-	mempages *= sizeof(struct hlist_head);
+		dhash_entries = mempages;
+		/* 8 mio is enough for general purpose systems.
+		 * For file servers, override with "dhash_entries="
+		 */
+		if (dhash_entries > 8*1024*1024)
+			dhash_entries = 8*1024*1024;
+	}
+	mempages = dhash_entries*sizeof(struct hlist_head);
 	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
 		;
 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: hash table sizes
@ 2003-11-26  5:53 Zhang, Yanmin
  0 siblings, 0 replies; 32+ messages in thread
From: Zhang, Yanmin @ 2003-11-26  5:53 UTC (permalink / raw)
  To: linux-kernel

Tcp/ip have the same problem. See function tcp_init/ip_rt_init.

My colleague's testing on IA64 shows the same result like Jef's while
tcp_establish hash table applies for about 1G memory! Actually, it seems
it wants to apply for 4G memory at the beginning!

It's not a good idea to find a smart formula to fit into all scenarios.
We can add new command line parameters to setup the hash table size,
such as dentry_ht_size/inode_ht_size/tcp_estab_ht_size/ip_rt_ht_szie. If
users don't input such parameters, then kernel can use Jes's formula to
work out the size. 


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2003-12-01 22:59 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-11-25 13:35 hash table sizes Jes Sorensen
2003-11-25 13:42 ` William Lee Irwin III
2003-11-25 13:54   ` Jes Sorensen
2003-11-25 16:25     ` Thomas Schlichter
2003-11-25 17:52       ` Antonio Vargas
2003-11-25 17:54         ` William Lee Irwin III
2003-11-25 20:48 ` Jack Steiner
2003-11-25 21:07   ` Andrew Morton
2003-11-25 21:14     ` Jesse Barnes
2003-11-25 21:24       ` Andrew Morton
2003-11-26  2:14         ` David S. Miller
2003-11-26  5:27         ` Matt Mackall
2003-11-28 14:15         ` Jes Sorensen
2003-11-28 14:52           ` Jack Steiner
2003-11-28 16:22             ` Jes Sorensen
2003-11-28 19:35               ` Jack Steiner
2003-11-28 21:18                 ` Jörn Engel
2003-12-01  9:46                   ` Jes Sorensen
2003-12-01 21:06     ` Anton Blanchard
2003-12-01 22:57       ` Martin J. Bligh
2003-11-25 21:16   ` Anton Blanchard
2003-11-25 23:11     ` Jack Steiner
2003-11-26  3:39       ` Rik van Riel
2003-11-26  3:59         ` William Lee Irwin III
2003-11-26  4:25           ` Andrew Morton
2003-11-26  4:23             ` William Lee Irwin III
2003-11-26  5:14           ` Martin J. Bligh
2003-11-26  9:51             ` William Lee Irwin III
2003-11-26 16:17               ` Martin J. Bligh
2003-11-26  7:25       ` Anton Blanchard
2003-11-26  5:53 Zhang, Yanmin
2003-11-29 10:39 Manfred Spraul

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).