LKML Archive on
help / color / mirror / Atom feed
From: David Howells <>
To: Matthew Wilcox <>
	Linus Torvalds <>,
	Anna Schumaker <>,
	Trond Myklebust <>,
	Jeff Layton <>,
	Steve French <>,
	Dominique Martinet <>,
	Mike Marshall <>,
	Miklos Szeredi <>,
	Shyam Prasad N <>,,,
	"open list:NFS, SUNRPC, AND..." <>,
	CIFS <>,,,, Linux-MM <>,
	linux-fsdevel <>,
	Linux Kernel Mailing List <>
Subject: Re: Canvassing for network filesystem write size vs page size
Date: Fri, 06 Aug 2021 16:04:53 +0100	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <>

Matthew Wilcox <> wrote:

> No, that is very much not the same thing.  Look at what NFS does, like
> Linus said.  Consider this test program:
> 	fd = open();
> 	lseek(fd, 5, SEEK_SET);
> 	write(fd, buf, 3);
> 	write(fd, buf2, 10);
> 	write(fd, buf3, 2);
> 	close(fd);

Yes, I get that.  I can do that when there isn't a local cache or content

Note that, currently, if the pages (or cache blocks) being read/modified are
beyond the EOF at the point when the file is opened, truncated down or last
subject to 3rd-party invalidation, I don't go to the server at all.

> > But that kind of screws with local caching.  The local cache might need to
> > track the missing bits, and we are likely to be using blocks larger than a
> > page.
> There's nothing to cache.  Pages which are !Uptodate aren't going to get
> locally cached.

Eh?  Of course there is.  You've just written some data.  That need to get
copied to the cache as well as the server if that file is supposed to be being
cached (for filesystems that support local caching of files open for writing,
which AFS does).

> > Basically, there are a lot of scenarios where not having fully populated
> > pages sucks.  And for streaming writes, wouldn't it be better if you used
> > DIO writes?
> DIO can't do sub-512-byte writes.

Yes it can - and it works for my AFS client at least with the patches in my
fscache-iter-2 branch.  This is mainly a restriction for block storage devices
we're doing DMA to - but we're not doing direct DMA to block storage devices
typically when talking to a network filesystem.

For AFS, at least, I can just make one big FetchData/StoreData RPC that
reads/writes the entire DIO request in a single op; for other filesystems
(NFS, ceph for example), it needs breaking up into a sequence of RPCs, but
there's no particular reason that I know of that requires it to be 512-byte
aligned on any of these.

Things get more interesting if you're doing DIO to a content-encrypted file
because the block size may be 4096 or even a lot larger - in which case we
would have to do local RMW to handle misaligned writes, but it presents no
particular difficulty.

> You might not be trying to do anything for block filesystems, but we
> should think about what makes sense for block filesystems as well as
> network filesystems.

Whilst that's a good principle, they have very different characteristics that
might make that difficult.


  parent reply	other threads:[~2021-08-06 15:05 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-05 10:19 Could it be made possible to offer "supplementary" data to a DIO write ? David Howells
2021-08-05 12:37 ` Matthew Wilcox
2021-08-05 13:07 ` David Howells
2021-08-05 13:35   ` Matthew Wilcox
2021-08-05 14:38   ` David Howells
2021-08-05 15:06     ` Matthew Wilcox
2021-08-05 15:38     ` David Howells
2021-08-05 16:35     ` Canvassing for network filesystem write size vs page size David Howells
2021-08-05 17:27       ` Linus Torvalds
2021-08-05 17:43         ` Trond Myklebust
2021-08-05 22:11         ` Matthew Wilcox
2021-08-06 13:42         ` David Howells
2021-08-06 14:17           ` Matthew Wilcox
2021-08-06 15:04           ` David Howells [this message]
2021-08-05 17:52       ` Adam Borowski
2021-08-05 18:50       ` Jeff Layton
2021-08-05 23:47       ` Matthew Wilcox
2021-08-06 13:44       ` David Howells
2021-08-05 17:45     ` Could it be made possible to offer "supplementary" data to a DIO write ? Adam Borowski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).