Linux-Fsdevel Archive on lore.kernel.org
* The future of readahead
@ 2020-08-26 19:31 Matthew Wilcox
From: Matthew Wilcox @ 2020-08-26 19:31 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Kent Overstreet, David Howells, Mike Marshall

Both Kent and David have had conversations with me about improving the
readahead filesystem interface this last week, and as I don't have time
to write the code, here's the design.

1. Kent doesn't like it that we do an XArray lookup for each page.
The proposed solution adds a (small) array of page pointers (or a
pagevec) to the struct readahead_control.  It may make sense to move
__readahead_batch() and readahead_page() out of line at that point.
This should be backed up with performance numbers.
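A userspace sketch of the idea (struct layout, field names and the batch size are illustrative, not the kernel's): readahead_page() pulls from a small cached array that is refilled in batches, so the XArray (modelled here as a plain pointer array) is consulted once per batch rather than once per page.

```c
#include <assert.h>
#include <stddef.h>

#define RA_BATCH 15	/* pagevec-sized cache; the size is an assumption */

struct page { unsigned long index; };

/* Model of a readahead_control carrying a small batch of page pointers. */
struct readahead_control {
	struct page **cache;		/* stands in for the XArray */
	unsigned long nr_pages;		/* pages left in this readahead */
	unsigned long index;		/* next index to fetch */
	struct page *batch[RA_BATCH];
	unsigned int batch_nr, batch_idx;
};

/* One "lookup" fills the whole batch, amortizing the per-page cost. */
static void ra_refill(struct readahead_control *rac)
{
	rac->batch_nr = 0;
	rac->batch_idx = 0;
	while (rac->batch_nr < RA_BATCH && rac->nr_pages) {
		rac->batch[rac->batch_nr++] = rac->cache[rac->index++];
		rac->nr_pages--;
	}
}

/* readahead_page() analogue: hand out pages from the cached batch. */
static struct page *ra_next_page(struct readahead_control *rac)
{
	if (rac->batch_idx == rac->batch_nr)
		ra_refill(rac);
	if (rac->batch_idx == rac->batch_nr)
		return NULL;
	return rac->batch[rac->batch_idx++];
}
```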

2. David wants to be sure that readahead is aligned to a granule
size (eg 256kB) to support fscache.  When we last talked about it,
I suggested encoding the granule size in the struct address_space.
I no longer think this approach should be pursued, since ...

3. Kent wants to be able to expand readahead to encompass an entire fs
extent (if, eg, that extent is compressed or encrypted).  We don't know
that at the right point; the filesystem can't pass that information
through the generic_file_buffered_read() or filemap_fault() interface
to the readahead code.  So the right approach here is for the filesystem
to ask the readahead code to expand the readahead batch.

So solving #2 and #3 looks like a new interface for filesystems to call:

void readahead_expand(struct readahead_control *rac, loff_t start, u64 len);
or possibly
void readahead_expand(struct readahead_control *rac, pgoff_t start,
		unsigned int count);

It might not actually expand the readahead attempt at all -- for example,
if there's already a page in the page cache, or if it can't allocate
memory.  But this puts the responsibility for allocating pages in the VFS,
where it belongs.
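A sketch of how a filesystem might use that interface (userspace model; the always-succeeding stub stands in for the real helper, which may expand by less than asked, or not at all):

```c
#include <assert.h>

/* Minimal model: the rac tracks the page range [start, start + nr). */
struct readahead_control {
	unsigned long start;	/* first page index */
	unsigned long nr;	/* number of pages */
};

/* Stub for the proposed VFS helper.  The real thing is best-effort:
 * it may stop early at an existing page or an allocation failure.
 * Here it always succeeds, for illustration. */
static void readahead_expand(struct readahead_control *rac,
			     unsigned long start, unsigned long nr)
{
	unsigned long end = rac->start + rac->nr;
	unsigned long new_end = start + nr;

	if (start < rac->start)
		rac->start = start;
	if (new_end > end)
		end = new_end;
	rac->nr = end - rac->start;
}

/* A filesystem's ->readahead, widening the request so that a whole
 * compressed/encrypted extent can be read and decoded in one go. */
static void fs_readahead(struct readahead_control *rac,
			 unsigned long extent_start, unsigned long extent_len)
{
	readahead_expand(rac, extent_start, extent_len);
	/* ...then issue one I/O covering rac->start .. rac->start + rac->nr */
}
```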

4. Mike wants to be able to do 4MB I/Os [1].  That should be covered by
the solution above.  Mike, just to clarify: do you need 4MB pages, or can
you work with some mixture of page sizes going as far as 1024 x 4kB pages?

5. I'm allocating larger pages in the readahead code (part of the THP
patch set [2]).

[1] https://lore.kernel.org/linux-fsdevel/CAOg9mSSrJp2dqQTNDgucLoeQcE_E_aYPxnRe5xphhdSPYw7QtQ@mail.gmail.com/
[2] http://git.infradead.org/users/willy/pagecache.git/commitdiff/c00bd4082c7bc32a17b0baa29af6974286978e1f


* Re: The future of readahead
From: David Howells @ 2020-08-27 17:02 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: dhowells, linux-fsdevel, Kent Overstreet, Mike Marshall

Matthew Wilcox <willy@infradead.org> wrote:

> So solving #2 and #3 looks like a new interface for filesystems to call:
> 
> void readahead_expand(struct readahead_control *rac, loff_t start, u64 len);
> or possibly
> void readahead_expand(struct readahead_control *rac, pgoff_t start,
> 		unsigned int count);
> 
> It might not actually expand the readahead attempt at all -- for example,
> if there's already a page in the page cache, or if it can't allocate
> memory.  But this puts the responsibility for allocating pages in the VFS,
> where it belongs.

This is exactly what the fscache read helper in my fscache rewrite is doing,
except that I'm doing it in fs/fscache/read_helper.c.

Have a look here:

	https://lore.kernel.org/linux-fsdevel/159465810864.1376674.10267227421160756746.stgit@warthog.procyon.org.uk/

and look for the fscache_read_helper() function.

Note that it's slightly complicated because it handles ->readpage(),
->readpages() and ->write_begin()[*].

[*] I want to be able to bring the granule into the cache for modification.
    Ideally I'd be able to see that the entire granule is going to get written
    over and skip the read - kind of like write_begin for a whole granule
    rather than a page.

Shaping the readahead request has the following issues:

 (1) The request may span multiple granules.

 (2) Those granules may be a mixture of cached and uncached.

 (3) The granule size may vary.

 (4) Granules fall on power-of-2 boundaries (for example 256K boundaries)
     within the file, but the request may not start on a boundary and may not
     end on one.

To deal with this, fscache_read_helper() calls out to the cache backend
(fscache_shape_request()) and the netfs (req->ops->reshape()) to adjust the
read it's going to make.  Shaping the request may mean moving the start
earlier as well as expanding or contracting the size.  The only thing that's
guaranteed is that the first page of the request will be retained.
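The boundary arithmetic in (4) can be sketched as follows (userspace, in bytes rather than pages; the granule is assumed to be a power of two, as in the 256K example):

```c
#include <assert.h>

/* Round the byte range [*start, *start + *len) outwards to granule
 * boundaries.  The start may move earlier and the end later, so the
 * originally requested first page is always still covered. */
static void shape_to_granule(unsigned long long *start,
			     unsigned long long *len,
			     unsigned long long granule)
{
	unsigned long long end = *start + *len;

	*start &= ~(granule - 1);			/* start moves earlier */
	end = (end + granule - 1) & ~(granule - 1);	/* end moves later */
	*len = end - *start;
}
```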

I also don't let a request cross a cached/uncached boundary, but rather cut
the request off there and return.  The filesystem can then generate a new
request and call back in.  (Note that I have to be able to keep track of the
filesystem's metadata so that I can reissue the request to the netfs in the
event that the cache suffers some sort of error).

What I was originally envisioning for the new ->readahead() interface is to add
second aop that allows the shaping to be accessed by the VM, before it's
started pinning any pages.

The shaping parameters I think we need are:

	- The inode, for i_size and fscache cookie
	- The proposed page range

and what you would get back could be:

	- Shaped page range
	- Minimum I/O granularity[1]
	- Minimum preferred granularity[2]
	- Flag indicating if the pages can just be zero-filled[3]

[1] The filesystem doesn't want to read in smaller chunks than this.

[2] The cache doesn't want to read in smaller chunks than this, though in the
    cache's case, a partially read block is just abandoned for the moment.
    This number would allow the readahead algorithm to shorten the request if
    it can't allocate a page.

[3] If I know that the local i_size is much bigger than the i_size on the
    server, there's no need to download/read those pages and readahead can
    just clear them.  This is more applicable to write_begin() normally.

Now a chunk of this is in struct readahead_control, so it might be reasonable
to add the other bits there too.
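Collected into a struct, the in/out parameters might look like this (the struct and field names are hypothetical; only the zero-fill rule from [3] is implemented):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical shaping exchange, modelled on the parameters listed above. */
struct ra_shape {
	/* in */
	unsigned long long local_isize;		/* i_size known locally */
	unsigned long long server_isize;	/* i_size on the server */
	unsigned long long start, len;		/* proposed byte range */
	/* out */
	unsigned long long min_io;		/* fs granularity [1] */
	unsigned long long min_preferred;	/* cache granularity [2] */
	bool zero_fill;				/* flag [3] */
};

/* [3]: a range lying wholly beyond the server's i_size (but within the
 * local one) need not be fetched; readahead can simply clear the pages. */
static void shape_zero_fill(struct ra_shape *s)
{
	s->zero_fill = (s->start >= s->server_isize &&
			s->start + s->len <= s->local_isize);
}
```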

Note that one thing I really would like to avoid having to do is to expand a
request forward, particularly if the main page of interest is precreated and
locked by the VM before calling the filesystem.  I would much rather the VM
created the pages, starting from the lowest-numbered.

Anyway, that's my 2p.
David



* Re: The future of readahead
From: Matthew Wilcox @ 2020-08-27 17:21 UTC (permalink / raw)
  To: David Howells; +Cc: linux-fsdevel, Kent Overstreet, Mike Marshall

On Thu, Aug 27, 2020 at 06:02:18PM +0100, David Howells wrote:
> Matthew Wilcox <willy@infradead.org> wrote:
> > void readahead_expand(struct readahead_control *rac, loff_t start, u64 len);
> > or possibly
> > void readahead_expand(struct readahead_control *rac, pgoff_t start,
> > 		unsigned int count);
> > 
> > It might not actually expand the readahead attempt at all -- for example,
> > if there's already a page in the page cache, or if it can't allocate
> > memory.  But this puts the responsibility for allocating pages in the VFS,
> > where it belongs.
> 
> This is exactly what the fscache read helper in my fscache rewrite is doing,
> except that I'm doing it in fs/fscache/read_helper.c.
> 
> Have a look here:
> 
> 	https://lore.kernel.org/linux-fsdevel/159465810864.1376674.10267227421160756746.stgit@warthog.procyon.org.uk/
> 
> and look for the fscache_read_helper() function.
> 
> Note that it's slightly complicated because it handles ->readpage(),
> ->readpages() and ->write_begin()[*].
> 
> [*] I want to be able to bring the granule into the cache for modification.
>     Ideally I'd be able to see that the entire granule is going to get written
>     over and skip the read - kind of like write_begin for a whole granule
>     rather than a page.

I'm going to want something like that for THP too.  I may end up
changing the write_begin API.

> Shaping the readahead request has the following issues:
> 
>  (1) The request may span multiple granules.
> 
>  (2) Those granules may be a mixture of cached and uncached.
> 
>  (3) The granule size may vary.
> 
>  (4) Granules fall on power-of-2 boundaries (for example 256K boundaries)
>      within the file, but the request may not start on a boundary and may not
>      end on one.
> 
> To deal with this, fscache_read_helper() calls out to the cache backend
> (fscache_shape_request()) and the netfs (req->ops->reshape()) to adjust the
> read it's going to make.  Shaping the request may mean moving the start
> earlier as well as expanding or contracting the size.  The only thing that's
> guaranteed is that the first page of the request will be retained.

Thank you for illustrating why this shouldn't be handled in the
filesystem ;-)

mmap readaround starts n/2 pages below the faulting page and ends n/2
pages above the faulting page.  So if your adjustment fails to actually
bring in the page in the middle, it has failed mmap.
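For concreteness, the readaround window arithmetic (clamped at the start of the file; function and parameter names are illustrative):

```c
#include <assert.h>

/* mmap readaround: an n-page window centred on the faulting page,
 * clamped so it does not run below index 0. */
static void readaround_window(unsigned long fault, unsigned long n,
			      unsigned long *start, unsigned long *nr)
{
	*start = (fault >= n / 2) ? fault - n / 2 : 0;
	*nr = n;
}
```

Whatever reshaping the filesystem or cache then does, the resulting range must still contain `fault`, or the page fault has not been serviced.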

> What I was originally envisioning for the new ->readahead() interface is to add
> second aop that allows the shaping to be accessed by the VM, before it's
> started pinning any pages.
> 
> The shaping parameters I think we need are:
> 
> 	- The inode, for i_size and fscache cookie
> 	- The proposed page range
> 
> and what you would get back could be:
> 
> 	- Shaped page range
> 	- Minimum I/O granularity[1]
> 	- Minimum preferred granularity[2]
> 	- Flag indicating if the pages can just be zero-filled[3]
> 
> [1] The filesystem doesn't want to read in smaller chunks than this.
> 
> [2] The cache doesn't want to read in smaller chunks than this, though in the
>     cache's case, a partially read block is just abandoned for the moment.
>     This number would allow the readahead algorithm to shorten the request if
>     it can't allocate a page.
> 
> [3] If I know that the local i_size is much bigger than the i_size on the
>     server, there's no need to download/read those pages and readahead can
>     just clear them.  This is more applicable to write_begin() normally.
> 
> Now a chunk of this is in struct readahead_control, so it might be reasonable
> to add the other bits there too.
> 
> Note that one thing I really would like to avoid having to do is to expand a
> request forward, particularly if the main page of interest is precreated and
> locked by the VM before calling the filesystem.  I would much rather the VM
> created the pages, starting from the lowest-numbered.

A call to ->readahead is always a contiguous set of pages.  A call to
readahead_expand() which tried to expand both up and down would start
by allocating, locking and adding the pages to the page cache heading
downwards from the current start (it's usually not allowed to lock
pages out of order, but because they're locked before being added,
this is an exception).  Then we'd try to expand upwards.  We'll fail
to expand if we can't allocate a page or if there's already a page in
the cache that blocks our expansion in that direction.
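That ordering can be sketched in userspace (an occupied slot stands in for a page already present in the page cache; a real implementation would also stop on allocation failure):

```c
#include <assert.h>

#define CACHE_SLOTS 64

/* 1 = a page is already present in the page cache at that index. */
static int page_cache[CACHE_SLOTS];

/* Expand [*start, *start + *nr) towards [want_start, want_end): first
 * walk downwards adding pages, then upwards; stop in either direction
 * at an already-present page. */
static void expand(unsigned long *start, unsigned long *nr,
		   unsigned long want_start, unsigned long want_end)
{
	while (*start > want_start && !page_cache[*start - 1]) {
		(*start)--;
		(*nr)++;
		page_cache[*start] = 1;	/* allocate, lock, add - in order */
	}
	while (*start + *nr < want_end && !page_cache[*start + *nr]) {
		page_cache[*start + *nr] = 1;
		(*nr)++;
	}
}
```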

As far as the zeroing beyond i_size, that's the responsibility of the
filesystem.

