LKML Archive on lore.kernel.org
* shmget limited by SHMEM_MAX_BYTES to 0x4020010000 bytes.
@ 2011-01-22 15:30 Robin Holt
  2011-01-22 15:34 ` shmget limited by SHMEM_MAX_BYTES to 0x4020010000 bytes (Resend) Robin Holt
  0 siblings, 1 reply; 4+ messages in thread
From: Robin Holt @ 2011-01-22 15:30 UTC (permalink / raw)
  To: Linus Torvalds, Hugh Dickins, Andrew Morton; +Cc: linux-kernel

I have a customer system with 12 TB of memory.  The customer is trying
to do a shmget() call with a size of 4TB, and it fails due to the check in
shmem_file_setup() against SHMEM_MAX_BYTES, which is 0x4020010000.

I have considered a bunch of options and really do not know which
direction I should take this.

I could add a third and a fourth level in the same fashion, with a
quarter of the index staying at the current level of indirection and
the next quarter adding another level.  That would get me closer, but
not all the way there.

Given the complexity we would be introducing, I really lean towards
having a tree of tables like the page tables, instead of the current
scheme where half the index is one level of indirection and the other
half is two levels.  That split adds complexity which really does not
have much value that I can see.


As an alternative to the current halves being at different levels
of indirection, I considered reworking the info->next_index
increment/decrement to put it inside the same locking as the walk/fill
of the table.  With that, I could resize the table depth based
upon the next_index value.  For next_index from SHMEM_NR_DIRECT
to SHMEM_NR_DIRECT + ENTRIES_PER_PAGE (2MB), it could be direct.
From there to SHMEM_NR_DIRECT + ENTRIES_PER_PAGE ** 2 (1GB), it could
be one level of indirection.  Then from there to SHMEM_NR_DIRECT +
ENTRIES_PER_PAGE ** 3 (512GB), it could be two levels of indirection.
Finally, from there to SHMEM_NR_DIRECT + ENTRIES_PER_PAGE ** 4 (256TB),
it could be three levels.  That should be enough for a little while.
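
Roughly, the depth selection I have in mind looks like the untested
sketch below; shmem_index_levels() is a made-up helper name, nothing
that exists in mm/shmem.c (though SHMEM_NR_DIRECT and ENTRIES_PER_PAGE
do):

	/*
	 * Untested sketch only: pick the index depth from next_index,
	 * assuming 4K pages and 8-byte entries (ENTRIES_PER_PAGE == 512).
	 */
	static int shmem_index_levels(unsigned long next_index)
	{
		unsigned long one = ENTRIES_PER_PAGE;		/* ~2MB of data */
		unsigned long two = one * ENTRIES_PER_PAGE;	/* ~1GB */
		unsigned long three = two * ENTRIES_PER_PAGE;	/* ~512GB */

		if (next_index <= SHMEM_NR_DIRECT + one)
			return 0;	/* direct entries only */
		if (next_index <= SHMEM_NR_DIRECT + two)
			return 1;	/* one level of indirection */
		if (next_index <= SHMEM_NR_DIRECT + three)
			return 2;	/* two levels */
		return 3;		/* three levels, up to ~256TB */
	}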

I am unsure about the value of having the direct entries at the beginning.
Given they have been this way for this long, I would probably leave them
to minimize the chance of a performance impact.

Thanks,
Robin Holt


* shmget limited by SHMEM_MAX_BYTES to 0x4020010000 bytes (Resend).
  2011-01-22 15:30 shmget limited by SHMEM_MAX_BYTES to 0x4020010000 bytes Robin Holt
@ 2011-01-22 15:34 ` Robin Holt
  2011-01-22 15:59   ` Linus Torvalds
  0 siblings, 1 reply; 4+ messages in thread
From: Robin Holt @ 2011-01-22 15:34 UTC (permalink / raw)
  To: Linus Torvalds, Hugh Dickins, Andrew Morton; +Cc: linux-kernel

I have a customer system with 12 TB of memory.  The customer is trying
to do a shmget() call with a size of 4TB, and it fails due to the check in
shmem_file_setup() against SHMEM_MAX_BYTES, which is 0x4020010000.

I have considered a bunch of options and really do not know which
direction I should take this.

I could add a third and a fourth level in the same fashion, with a
quarter of the index staying at the current level of indirection and
the next quarter adding another level.  That would get me closer, but
not all the way there.

Given the complexity we would be introducing, I really lean towards
having a tree of tables like the page tables, instead of the current
scheme where half the index is one level of indirection and the other
half is two levels.  That split adds complexity which really does not
have much value that I can see.


As an alternative to the current halves being at different levels
of indirection, I considered reworking the info->next_index
increment/decrement to put it inside the same locking as the walk/fill
of the table.  With that, I could resize the table depth based
upon the next_index value.  For next_index from SHMEM_NR_DIRECT
to SHMEM_NR_DIRECT + ENTRIES_PER_PAGE (2MB), it could be direct.
From there to SHMEM_NR_DIRECT + ENTRIES_PER_PAGE ** 2 (1GB), it could
be one level of indirection.  Then from there to SHMEM_NR_DIRECT +
ENTRIES_PER_PAGE ** 3 (512GB), it could be two levels of indirection.
Finally, from there to SHMEM_NR_DIRECT + ENTRIES_PER_PAGE ** 4 (256TB),
it could be three levels.  That should be enough for a little while.

I am unsure about the value of having the direct entries at the beginning.
Given they have been this way for this long, I would probably leave them
to minimize the chance of a performance impact.

Thanks,
Robin Holt


* Re: shmget limited by SHMEM_MAX_BYTES to 0x4020010000 bytes (Resend).
  2011-01-22 15:34 ` shmget limited by SHMEM_MAX_BYTES to 0x4020010000 bytes (Resend) Robin Holt
@ 2011-01-22 15:59   ` Linus Torvalds
  2011-01-22 22:22     ` Hugh Dickins
  0 siblings, 1 reply; 4+ messages in thread
From: Linus Torvalds @ 2011-01-22 15:59 UTC (permalink / raw)
  To: Robin Holt; +Cc: Hugh Dickins, Andrew Morton, linux-kernel

On Sat, Jan 22, 2011 at 7:34 AM, Robin Holt <holt@sgi.com> wrote:
> I have a customer system with 12 TB of memory.  The customer is trying
> to do a shmget() call with a size of 4TB, and it fails due to the check in
> shmem_file_setup() against SHMEM_MAX_BYTES, which is 0x4020010000.
>
> I have considered a bunch of options and really do not know which
> direction I should take this.
>
> I could add a third and a fourth level in the same fashion, with a
> quarter of the index staying at the current level of indirection and
> the next quarter adding another level.  That would get me closer, but
> not all the way there.

Ugh.

How about just changing the indexing to use a bigger page allocation?
Right now it uses PAGE_CACHE_SIZE and ENTRIES_PER_PAGE, but as far as
I can tell, the indexing logic is entirely independent from PAGE_SIZE
and PAGE_CACHE_SIZE, and could just use its own SHM_INDEX_PAGE_SIZE or
something.

That would allow increasing the indexing capability fairly easily, no?
No actual change to the (messy) algorithm at all, just make the block
size for the index pages bigger.

Sure, it means that you now require multipage allocations in
shmem_dir_alloc(), but that doesn't sound all that hard. The code is
already set up to try to handle it (because we have that conceptual
difference between PAGE_SIZE and PAGE_CACHE_SIZE, even though the two
end up being the same).
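
Purely to illustrate the kind of decoupling I mean (SHMEM_INDEX_ORDER
and SHM_INDEX_PAGE_SIZE are invented names, and this assumes
PAGE_CACHE_SIZE == PAGE_SIZE):

	#define SHMEM_INDEX_ORDER	2	/* order-2: 16K index pages with 4K pages */
	#define SHM_INDEX_PAGE_SIZE	(PAGE_CACHE_SIZE << SHMEM_INDEX_ORDER)
	#define ENTRIES_PER_PAGE	(SHM_INDEX_PAGE_SIZE/sizeof(unsigned long))

	/* shmem_dir_alloc() then becomes a multipage allocation */
	static struct page *shmem_dir_alloc(gfp_t gfp_mask)
	{
		return alloc_pages(gfp_mask, SHMEM_INDEX_ORDER);
	}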

NOTE! I didn't look very closely at the details, there may be some
really basic reason why the above is a completely idiotic idea.

The alternative (and I think it might be a good alternative) is to get
rid of the shmem magic indexing entirely, and rip all the code out and
replace it with something that uses the generic radix tree functions.
Again, I didn't actually look at the code enough to judge whether that
would be the most painful effort ever or even possible at all.

                     Linus


* Re: shmget limited by SHMEM_MAX_BYTES to 0x4020010000 bytes (Resend).
  2011-01-22 15:59   ` Linus Torvalds
@ 2011-01-22 22:22     ` Hugh Dickins
  0 siblings, 0 replies; 4+ messages in thread
From: Hugh Dickins @ 2011-01-22 22:22 UTC (permalink / raw)
  To: Robin Holt; +Cc: Linus Torvalds, Andrew Morton, linux-kernel


On Sat, 22 Jan 2011, Linus Torvalds wrote:
> On Sat, Jan 22, 2011 at 7:34 AM, Robin Holt <holt@sgi.com> wrote:
> > I have a customer system with 12 TB of memory.  The customer is trying
> > to do a shmget() call with a size of 4TB, and it fails due to the check in
> > shmem_file_setup() against SHMEM_MAX_BYTES, which is 0x4020010000.
> >
> > I have considered a bunch of options and really do not know which
> > direction I should take this.

One question to ask first: does the customer use swap?  If not,
then that limit and all the indexing and swap-related stuff in
mm/shmem.c is just pointless overhead - it remains only because I
never got around to putting in the appropriate #ifdef CONFIG_SWAPs
(which would require some rearrangement to avoid too much ugliness).

If the customer does not use swap, then just work through it putting
in #ifdef CONFIG_SWAPs to remove the need for that limit, and go no
further.
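
As a very rough sketch of the shape of that change (not a real patch;
the check is paraphrased from shmem_file_setup()):

	/* in shmem_file_setup(), roughly: */
#ifdef CONFIG_SWAP
	if (size < 0 || size > SHMEM_MAX_BYTES)
		return ERR_PTR(-EINVAL);
#else
	/* no swap means no index pages, so the SHMEM_MAX_BYTES limit can go */
	if (size < 0)
		return ERR_PTR(-EINVAL);
#endif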

> >
> > I could add a third and a fourth level in the same fashion, with a
> > quarter of the index staying at the current level of indirection and
> > the next quarter adding another level.  That would get me closer, but
> > not all the way there.
> 
> Ugh.
> 
> How about just changing the indexing to use a bigger page allocation?
> Right now it uses PAGE_CACHE_SIZE and ENTRIES_PER_PAGE, but as far as
> I can tell, the indexing logic is entirely independent from PAGE_SIZE
> and PAGE_CACHE_SIZE, and could just use its own SHM_INDEX_PAGE_SIZE or
> something.

That's a very sensible suggestion, for a quick measure to satisfy the
customer.

> 
> That would allow increasing the indexing capability fairly easily, no?
> No actual change to the (messy) algorithm at all, just make the block
> size for the index pages bigger.

Yes, I've no inclination to revisit that messy algorithm myself.
(I admit it was me who made it a lot messier, in enabling highmem
on those index pages.)

> 
> Sure, it means that you now require multipage allocations in
> shmem_dir_alloc(), but that doesn't sound all that hard. The code is
> already set up to try to handle it (because we have that conceptual
> difference between PAGE_SIZE and PAGE_CACHE_SIZE, even though the two
> end up being the same).

I did once try to get PAGE_SIZE versus PAGE_CACHE_SIZE right in that
file; I think it even got some testing when ChristophL had a patch to
raise PAGE_CACHE_SIZE.  But I don't promise that it will provide a
reliable guide as is.

> 
> NOTE! I didn't look very closely at the details, there may be some
> really basic reason why the above is a completely idiotic idea.

It's not at all idiotic.  But I feel a lot more confident relying
upon order:1 pages being available than upon order:2 pages.  Order:2 pages
for the index will give you more than you need, 16TB I think; whereas
order:1 pages will only take you to 2TB.
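
For reference, a quick userspace back-of-the-envelope, plugging those
orders into the SHMEM_MAX_INDEX arithmetic (SHMEM_NR_DIRECT taken as
16, 4K pages, 8-byte index entries):

	#include <stdio.h>

	int main(void)
	{
		const unsigned long long page = 4096, nr_direct = 16;
		int order;

		for (order = 0; order <= 2; order++) {
			unsigned long long entries = (page << order) / 8;
			unsigned long long max = (nr_direct +
				entries * entries / 2 * (entries + 1)) * page;
			printf("order:%d index pages -> max %llu bytes (~%llu GB)\n",
			       order, max, max >> 30);
		}
		return 0;
	}

That comes out at the 0x4020010000 above for order:0, ~2TB for
order:1 and ~16TB for order:2.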

If the customer won't be needing larger than 4TB for a while, then
I'd be inclined to add in a hack to limit maximum swap size (see
maxpages in SYSCALL_DEFINE2(swapon,,,,) in mm/swapfile.c), so the
swp_entry_t is sure to fit in 32-bits even on 64-bit architecture -
then you can pack twice as many into the lowest level of the index.

And save yourself some time by doing this only for 64-bit, so you
don't have to bother about all those kmap_atomics, which would only
be covering a half or a quarter of your higher order page.

> 
> The alternative (and I think it might be a good alternative) is to get
> rid of the shmem magic indexing entirely, and rip all the code out and
> replace it with something that uses the generic radix tree functions.
> Again, I didn't actually look at the code enough to judge whether that
> would be the most painful effort ever or even possible at all.

Yes, that's what I've really wanted to do for a long time now,
so in part why I've never bothered to CONFIG_SWAPify mm/shmem.c.

What I want is to use the very same radix tree that find_lock_page()
is using: where a page is present, the usual struct page *; where it's
swapped out, the swp_entry_t (or perhaps a pointer into the swap_map).
It seems wasteful to have two trees when one should be enough.
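
One hypothetical way to tell the two kinds of entry apart in a single
tree is a tag bit, along these lines (not existing code, and a real
version would have to pick bits that the radix tree implementation
does not already reserve internally):

	/* Hypothetical: struct page pointers are at least 2-byte aligned,
	 * so the low bit could mark a swap entry. */
	static inline void *shmem_swp_to_entry(swp_entry_t swp)
	{
		return (void *)((swp.val << 1) | 1);
	}

	static inline int shmem_entry_is_swap(void *entry)
	{
		return (unsigned long)entry & 1;
	}

	static inline swp_entry_t shmem_entry_to_swp(void *entry)
	{
		swp_entry_t swp = { .val = (unsigned long)entry >> 1 };
		return swp;
	}

Lookup paths would then check the tag before treating a slot as a
struct page *.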

If that can be done without uglifying or duplicating too many fundamental
interfaces, then I think it's the way to go.  It's good for the people
with small memory, it's good for the people with large memory, it's
good for the people with no swap.

But not quite so good for the 32-bit highmem swap people: when a
file is entirely swapped out, the current shmem index resides in
highmem, and the pagecache radix tree gets freed up; whereas with
this proposed change, the lowmem radix tree would have to remain.

That has deterred me until recently, but I think we're now far
enough into 64-bit to afford that regression on 32-bit highmem.

However, I don't think your customer should wait for me!

Hugh

