LKML Archive
From: Hugh Dickins <>
To: Robin Holt <>
Cc: Linus Torvalds <>,
	Andrew Morton <>,
Subject: Re: shmget limited by SHMEM_MAX_BYTES to 0x4020010000 bytes (Resend).
Date: Sat, 22 Jan 2011 14:22:10 -0800 (PST)	[thread overview]
Message-ID: <alpine.LSU.2.00.1101221304240.1822@sister.anvils> (raw)
In-Reply-To: <>


On Sat, 22 Jan 2011, Linus Torvalds wrote:
> On Sat, Jan 22, 2011 at 7:34 AM, Robin Holt <> wrote:
> > I have a customer system with 12 TB of memory.  The customer is trying
> > to do a shmget() call with size of 4TB and it fails due to the check in
> > shmem_file_setup() against SHMEM_MAX_BYTES which is 0x4020010000.
> >
> > I have considered a bunch of options and really do not know which
> > direction I should take this.

One question to ask first: does the customer use swap?  If not,
then that limit and all the indexing and swap-related stuff in
mm/shmem.c is just pointless overhead - that remains because I
never got around to putting in the appropriate #ifdef CONFIG_SWAPs
(which would require some rearrangement to avoid too much ugliness).

If the customer does not use swap, then just work through it putting
in #ifdef CONFIG_SWAPs to remove the need for that limit, and go no
further.
> >
> > I could add a third level and fourth level with a similar 1/4 size being
> > the current level of indirection, and the next quarter being a next level.
> > That would get me closer, but not all the way there.
> Ugh.
> How about just changing the indexing to use a bigger page allocation?
> Right now it uses PAGE_CACHE_SIZE and ENTRIES_PER_PAGE, but as far as
> I can tell, the indexing logic is entirely independent from PAGE_SIZE
> and PAGE_CACHE_SIZE, and could just use its own SHM_INDEX_PAGE_SIZE or
> something.

That's a very sensible suggestion, for a quick measure to satisfy the
immediate need.

> That would allow increasing the indexing capability fairly easily, no?
> No actual change to the (messy) algorithm at all, just make the block
> size for the index pages bigger.

Yes, I've no inclination to revisit that messy algorithm myself.
(I admit it was me who made it a lot messier, in enabling highmem
on those index pages.)

> Sure, it means that you now require multipage allocations in
> shmem_dir_alloc(), but that doesn't sound all that hard. The code is
> already set up to try to handle it (because we have that conceptual
> difference between PAGE_SIZE and PAGE_CACHE_SIZE, even though the two
> end up being the same).

I did once try to get PAGE_SIZE versus PAGE_CACHE_SIZE right in that
file, I think it even got some testing when Christoph Lameter had a patch to
raise PAGE_CACHE_SIZE.  But I don't promise that it will provide a
reliable guide as is.

> NOTE! I didn't look very closely at the details, there may be some
> really basic reason why the above is a completely idiotic idea.

It's not at all idiotic.  But I feel a lot more confident to rely
upon order:1 pages being available than order:2 pages.  Order:2 pages
for the index will give you more than you need, 16TB I think; whereas
order:1 pages will only take you to 2TB.

If the customer won't be needing larger than 4TB for a while, then
I'd be inclined to add in a hack to limit maximum swap size (see
maxpages in SYSCALL_DEFINE2(swapon, ...) in mm/swapfile.c), so the
swp_entry_t is sure to fit in 32 bits even on a 64-bit architecture -
then you can pack twice as many into the lowest level of the index.

And save yourself some time by doing this only for 64-bit, so you
don't have to bother about all those kmap_atomics, which would only
be covering a half or a quarter of your higher order page.

> The alternative (and I think it might be a good alternative) is to get
> rid of the shmem magic indexing entirely, and rip all the code out and
> replace it with something that uses the generic radix tree functions.
> Again, I didn't actually look at the code enough to judge whether that
> would be the most painful effort ever or even possible at all.

Yes, that's what I've really wanted to do for a long time now,
so in part why I've never bothered to CONFIG_SWAPify mm/shmem.c.

What I want is to use the very same radix tree that find_lock_page()
is using: where a page is present, the usual struct page *; where it's
swapped out, the swp_entry_t (or perhaps a pointer into the swap_map).
It seems wasteful to have two trees when one should be enough.

If that can be done without uglifying or duplicating too many fundamental
interfaces, then I think it's the way to go.  It's good for the people
with small memory, it's good for the people with large memory, it's
good for the people with no swap.

But not quite so good for the 32-bit highmem swap people: when a
file is entirely swapped out, the current shmem index resides in
highmem, and the pagecache radix tree gets freed up; whereas with
this proposed change, the lowmem radix tree would have to remain.

That has deterred me until recently, but I think we're now far
enough into 64-bit to afford that regression on 32-bit highmem.

However, I don't think your customer should wait for me!


Thread overview: 4+ messages
2011-01-22 15:30 shmget limited by SHMEM_MAX_BYTES to 0x4020010000 bytes Robin Holt
2011-01-22 15:34 ` shmget limited by SHMEM_MAX_BYTES to 0x4020010000 bytes (Resend) Robin Holt
2011-01-22 15:59   ` Linus Torvalds
2011-01-22 22:22     ` Hugh Dickins [this message]
