From: David Hildenbrand <david@redhat.com>
To: Kent Overstreet <kent.overstreet@gmail.com>,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org
Cc: Johannes Weiner <hannes@cmpxchg.org>,
Matthew Wilcox <willy@infradead.org>,
Linus Torvalds <torvalds@linux-foundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
"Darrick J. Wong" <djwong@kernel.org>,
Christoph Hellwig <hch@infradead.org>,
David Howells <dhowells@redhat.com>
Subject: Re: Struct page proposal
Date: Thu, 23 Sep 2021 11:03:44 +0200
Message-ID: <e567ad16-0f2b-940b-a39b-a4d1505bfcb9@redhat.com>
In-Reply-To: <YUvWm6G16+ib+Wnb@moria.home.lan>
On 23.09.21 03:21, Kent Overstreet wrote:
> One thing that's come out of the folios discussions with both Matthew and
> Johannes is that we seem to be thinking along similar lines regarding our end
> goals for struct page.
>
> The fundamental reason for struct page is that we need memory to be self
> describing, without any context - we need to be able to go from a generic
> untyped struct page and figure out what it contains: handling physical memory
> failure is the most prominent example, but migration and compaction are more
> common. We need to be able to ask the thing that owns a page of memory "hey,
> stop using this and move your stuff here".
>
> Matthew's helpfully been coming up with a list of page types:
> https://kernelnewbies.org/MemoryTypes
>
> But struct page could be a lot smaller than it is now. I think we can get it
> down to two pointers, which means it'll take up 0.4% of system memory. Both
> Matthew and Johannes have ideas for getting it down even further - the main
> thing to note is that virt_to_page() _should_ be an uncommon operation (most of
> the places we're currently using it are completely unnecessary, look at all the
> places we're using it on the zero page). Johannes is thinking two layer radix
> tree, Matthew was thinking about using maple trees - personally, I think that
> 0.4% of system memory is plenty good enough.
>
>
> Ok, but what do we do with the stuff currently in struct page?
> -------------------------------------------------------------
>
> The main thing to note is that in normal operation most folios are going to be
> describing many pages, not just one, so we'll be using _less_ memory overall if
> we allocate folios separately from struct page. That's cool.
>
> Of course, for this to make sense, we'll have to get all the other stuff in
> struct page moved into their own types, but file & anon pages are the big one,
> and that's already being tackled.
>
> Why two ulongs/pointers, instead of just one?
> ---------------------------------------------
>
> Because one of the things we really want and don't have now is a clean division
> between allocator and allocatee state. Allocator means either the buddy
> allocator or slab; allocatee state would be the folio, the network pool state,
> or the state of whatever actually called kmalloc() or alloc_pages().
>
> Right now slab state sits in the same place in struct page where allocatee state
> does, and the reason this is bad is that slab/slub are a hell of a lot faster
> than the buddy allocator, and Johannes wants to move the boundary between slab
> allocations and buddy allocator allocations up to like 64k. If we fix where slab
> state lives, this will become completely trivial to do.
>
> So if we have this:
>
> struct page {
> 	unsigned long allocator;
> 	unsigned long allocatee;
> };
>
> The allocator field would be used for either a pointer to slab/slub's state, if
> it's a slab page, or if it's a buddy allocator page it'd encode the order of the
> allocation - like compound order today, and probably whether or not the
> (compound group of) pages is free.
>
> The allocatee field would hold a type-tagged pointer (tagged using the low bits
> of the pointer) to one of:
> - struct folio
> - struct anon_folio, if that becomes a thing
> - struct network_pool_page
> - struct pte_page
> - struct zone_device_page
>
> Then we can further refactor things until all the stuff that's currently crammed
> in struct page lives in types where each struct field means one and precisely
> one thing, and also where we can freely reshuffle and reorganize and add stuff
> to the various types where we couldn't before because it'd make struct page
> bigger.
>
> Other notes & potential issues:
> - page->compound_dtor needs to die
>
> - page->rcu_head moves into the types that actually need it, no issues there
>
> - page->refcount has question marks around it. I think we can also just move it
> into the types that need it; with RCU, dereferencing the pointer to the folio or
> whatever and grabbing a ref on folio->refcount can happen under an RCU read
> lock - there's no real question about whether it's technically possible to
> get it out of struct page, and I think it would be cleaner overall that way.
>
> However, depending on how it's used from code paths that go from generic
> untyped pages, I could see it turning into more of a hassle than it's worth.
> More investigation is needed.
>
> - page->memcg_data - I don't know whether that one more properly belongs in
> struct page or in the page subtypes - I'd love it if Johannes could talk
> about that one.
>
> - page->flags - dealing with this is going to be a huge hassle but also where
> we'll find some of the biggest gains in overall sanity and readability of the
> code. Right now, PG_locked is super special and ad hoc and I have run into
> situations multiple times (and Johannes was in vehement agreement on this
> one) where I simply could not figure out the behaviour of the current code re:
> who is responsible for locking pages without instrumenting the code with
> assertions.
>
> Meaning anything we do to create and enforce module boundaries between
> different chunks of code is going to suck, but the end result should be
> really worthwhile.
>
> Matthew Wilcox and David Howells have been having conversations on IRC about
> what to do about other page bits. It appears we should be able to kill a lot of
> filesystem usage of both PG_private and PG_private_2 - filesystems in general
> hang state off of page->private, soon to be folio->private, and PG_private in
> current use just indicates whether page->private is nonzero - meaning it's
> completely redundant.
>
Don't get me wrong, but until there are answers to some of the very
basic questions raised above (especially what happens to everything that
lives in page->flags, which holds more than just page flags, and to the
refcount, ...), this isn't very tempting to spend more time on from a
reviewer's perspective.
--
Thanks,
David / dhildenb
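
For reference, the two-word struct page quoted above, with a type-tagged
allocatee pointer, could look roughly like the user-space sketch below. This is
only an illustration under stated assumptions: the enum values, PAGE_TYPE_BITS,
the stand-in struct folio and every helper name are hypothetical and are not
existing kernel API, and the sketch assumes descriptors are at least 8-byte
aligned so that three low pointer bits are free for the tag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type tags; names and values are illustrative only. */
enum page_type {
	PAGE_TYPE_NONE = 0,
	PAGE_TYPE_FOLIO = 1,
	PAGE_TYPE_NETWORK_POOL = 2,
	PAGE_TYPE_PTE = 3,
	PAGE_TYPE_ZONE_DEVICE = 4,
};

/* Assumption: descriptors are 8-byte aligned, so 3 low bits are unused. */
#define PAGE_TYPE_BITS 3
#define PAGE_TYPE_MASK ((uintptr_t)((1u << PAGE_TYPE_BITS) - 1))

/* The proposed layout: one word for the allocator, one for the allocatee. */
struct page {
	uintptr_t allocator;	/* slab state pointer or buddy order/free info */
	uintptr_t allocatee;	/* descriptor pointer with the type in the low bits */
};

/* Stand-in for a separately allocated descriptor; not the real struct folio. */
struct folio {
	_Alignas(8) unsigned long flags;
	void *private;
};

/* Store a descriptor pointer together with its type tag. */
static inline void page_set_allocatee(struct page *page, void *desc,
				      enum page_type type)
{
	uintptr_t addr = (uintptr_t)desc;

	assert((addr & PAGE_TYPE_MASK) == 0);		/* low bits must be free */
	assert(((uintptr_t)type & ~PAGE_TYPE_MASK) == 0);	/* tag must fit */
	page->allocatee = addr | (uintptr_t)type;
}

/* Recover the type tag from the low bits. */
static inline enum page_type page_allocatee_type(const struct page *page)
{
	return (enum page_type)(page->allocatee & PAGE_TYPE_MASK);
}

/* Recover the untagged descriptor pointer. */
static inline void *page_allocatee(const struct page *page)
{
	return (void *)(page->allocatee & ~PAGE_TYPE_MASK);
}

/* Return the folio only if the page is actually tagged as one. */
static inline struct folio *page_folio_or_null(const struct page *page)
{
	if (page_allocatee_type(page) != PAGE_TYPE_FOLIO)
		return NULL;
	return page_allocatee(page);
}
```

On a 64-bit system this struct page is 16 bytes, and 16 / 4096 is about 0.39%
of memory with 4 KiB pages, which matches the 0.4% figure in the quoted
proposal. A caller starting from an untyped page would check the tag first and
only then dereference the descriptor.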