LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: Kent Overstreet <kent.overstreet@gmail.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Matthew Wilcox <willy@infradead.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Darrick J. Wong" <djwong@kernel.org>,
	Christoph Hellwig <hch@infradead.org>,
	David Howells <dhowells@redhat.com>
Subject: Re: Struct page proposal
Date: Thu, 23 Sep 2021 11:03:44 +0200	[thread overview]
Message-ID: <e567ad16-0f2b-940b-a39b-a4d1505bfcb9@redhat.com> (raw)
In-Reply-To: <YUvWm6G16+ib+Wnb@moria.home.lan>

On 23.09.21 03:21, Kent Overstreet wrote:
> One thing that's come out of the folios discussions with both Matthew and
> Johannes is that we seem to be thinking along similar lines regarding our end
> goals for struct page.
> 
> The fundamental reason for struct page is that we need memory to be self
> describing, without any context - we need to be able to go from a generic
> untyped struct page and figure out what it contains: handling physical memory
> failure is the most prominent example, but migration and compaction are more
> common. We need to be able to ask the thing that owns a page of memory "hey,
> stop using this and move your stuff here".
> 
> Matthew's helpfully been coming up with a list of page types:
> https://kernelnewbies.org/MemoryTypes
> 
> But struct page could be a lot smaller than it is now. I think we can get it
> down to two pointers, which means it'll take up 0.4% of system memory. Both
> Matthew and Johannes have ideas for getting it down even further - the main
> thing to note is that virt_to_page() _should_ be an uncommon operation (most of
> the places we're currently using it are completely unnecessary, look at all the
> places we're using it on the zero page). Johannes is thinking two layer radix
> tree, Matthew was thinking about using maple trees - personally, I think that
> 0.4% of system memory is plenty good enough.
> 
> 
> Ok, but what do we do with the stuff currently in struct page?
> -------------------------------------------------------------
> 
> The main thing to note is that since in normal operation most folios are going
> to be describing many pages, not just one - and we'll be using _less_ memory
> overall if we allocate them separately. That's cool.
> 
> Of course, for this to make sense, we'll have to get all the other stuff in
> struct page moved into their own types, but file & anon pages are the big one,
> and that's already being tackled.
> 
> Why two ulongs/pointers, instead of just one?
> ---------------------------------------------
> 
> Because one of the things we really want and don't have now is a clean division
> between allocator and allocatee state. Allocator meaning either the buddy
> allocator or slab, allocatee state would be the folio or the network pool state
> or whatever actually called kmalloc() or alloc_pages().
> 
> Right now slab state sits in the same place in struct page where allocatee state
> does, and the reason this is bad is that slab/slub are a hell of a lot faster
> than the buddy allocator, and Johannes wants to move the boundary between slab
> allocations and buddy allocator allocations up to like 64k. If we fix where slab
> state lives, this will become completely trivial to do.
> 
> So if we have this:
> 
> struct page {
> 	unsigned long	allocator;
> 	unsigned long	allocatee;
> };
> 
> The allocator field would be used for either a pointer to slab/slub's state, if
> it's a slab page, or if it's a buddy allocator page it'd encode the order of the
> allocation - like compound order today, and probably whether or not the
> (compound group of) pages is free.
> 
> The allocatee field would be used for a type tagged (using the low bits of the
> pointer) to one of:
>   - struct folio
>   - struct anon_folio, if that becomes a thing
>   - struct network_pool_page
>   - struct pte_page
>   - struct zone_device_page
> 
> Then we can further refactor things until all the stuff that's currently crammed
> in struct page lives in types where each struct field means one and precisely
> one thing, and also where we can freely reshuffle and reorganize and add stuff
> to the various types where we couldn't before because it'd make struct page
> bigger.
> 
> Other notes & potential issues:
>   - page->compound_dtor needs to die
> 
>   - page->rcu_head moves into the types that actually need it, no issues there
> 
>   - page->refcount has question marks around it. I think we can also just move it
>     into the types that need it; with RCU derefing the pointer to the folio or
>     whatever and grabing a ref on folio->refcount can happen under a RCU read
>     lock - there's no real question about whether it's technically possible to
>     get it out of struct page, and I think it would be cleaner overall that way.
> 
>     However, depending on how it's used from code paths that go from generic
>     untyped pages, I could see it turning into more of a hassle than it's worth.
>     More investigation is needed.
> 
>   - page->memcg_data - I don't know whether that one more properly belongs in
>     struct page or in the page subtypes - I'd love it if Johannes could talk
>     about that one.
> 
>   - page->flags - dealing with this is going to be a huge hassle but also where
>     we'll find some of the biggest gains in overall sanity and readability of the
>     code. Right now, PG_locked is super special and ad hoc and I have run into
>     situations multiple times (and Johannes was in vehement agreement on this
>     one) where I simply could not figure the behaviour of the current code re:
>     who is responsible for locking pages without instrumenting the code with
>     assertions.
> 
>     Meaning anything we do to create and enforce module boundaries between
>     different chunks of code is going to suck, but the end result should be
>     really worthwhile.
> 
> Matthew Wilcox and David Howells have been having conversations on IRC about
> what to do about other page bits. It appears we should be able to kill a lot of
> filesystem usage of both PG_private and PG_private_2 - filesystems in general
> hang state off of page->private, soon to be folio->private, and PG_private in
> current use just indicates whether page->private is nonzero - meaning it's
> completely redundant.
> 

Don't get me wrong, but before there are answers to some of the very 
basic questions raised above (especially everything that lives in 
page->flags, which are not only page flags, refcount, ...) this isn't 
very tempting to spend more time on, from a reviewer perspective.

-- 
Thanks,

David / dhildenb


  parent reply	other threads:[~2021-09-23  9:03 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-23  1:21 Struct page proposal Kent Overstreet
2021-09-23  3:23 ` Matthew Wilcox
2021-09-23  5:15   ` Kent Overstreet
2021-09-23 11:40     ` Mapcount of subpages Matthew Wilcox
2021-09-23 12:45       ` Kirill A. Shutemov
2021-09-23 21:10         ` Hugh Dickins
2021-09-23 21:54           ` Yang Shi
2021-09-23 22:23             ` Zi Yan
2021-09-23 23:48               ` Hugh Dickins
2021-09-24  0:25                 ` Zi Yan
2021-09-24  0:57                   ` Hugh Dickins
2021-09-24  1:11                 ` Yang Shi
2021-09-24  1:31                   ` Matthew Wilcox
2021-09-24  3:26                     ` Yang Shi
2021-09-24 23:05           ` Kirill A. Shutemov
2021-09-23 18:56       ` Mike Kravetz
2021-09-23  9:03 ` David Hildenbrand [this message]
2021-09-23 15:22   ` Struct page proposal Kent Overstreet
2021-09-23 15:34     ` David Hildenbrand
2021-09-27 17:48 ` Vlastimil Babka
2021-09-27 17:53   ` Kent Overstreet
2021-09-27 18:34     ` Linus Torvalds
2021-09-27 20:45       ` David Hildenbrand
2021-09-27 18:05   ` Matthew Wilcox
2021-09-27 18:09     ` Kent Overstreet
2021-09-27 18:12       ` Matthew Wilcox
2021-09-27 18:16         ` David Hildenbrand
2021-09-27 18:53           ` Vlastimil Babka
2021-09-27 19:04             ` Linus Torvalds
2021-09-27 18:16         ` Kent Overstreet
2021-09-28  3:19           ` Matthew Wilcox
2021-09-27 19:07       ` Vlastimil Babka
2021-09-27 20:14         ` Kent Overstreet
2021-09-28 11:21         ` David Laight
2021-09-27 18:33     ` Kirill A. Shutemov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e567ad16-0f2b-940b-a39b-a4d1505bfcb9@redhat.com \
    --to=david@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=dhowells@redhat.com \
    --cc=djwong@kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=hch@infradead.org \
    --cc=kent.overstreet@gmail.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=torvalds@linux-foundation.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).