From: Matthew Wilcox <>
To: Linus Torvalds <>
Cc: David Howells <>,
	Anna Schumaker <>,
	Trond Myklebust <>,
	Jeff Layton <>,
	Steve French <>,
	Dominique Martinet <>,
	Mike Marshall <>,
	Miklos Szeredi <>,
	Shyam Prasad N <>,
	"open list:NFS, SUNRPC, AND..." <>,
	CIFS <>,
	Linux-MM <>,
	linux-fsdevel <>,
	Linux Kernel Mailing List <>
Subject: Re: Canvassing for network filesystem write size vs page size
Date: Thu, 5 Aug 2021 23:11:08 +0100
Message-ID: <YQxh/>
In-Reply-To: <>

On Thu, Aug 05, 2021 at 10:27:05AM -0700, Linus Torvalds wrote:
> On Thu, Aug 5, 2021 at 9:36 AM David Howells <> wrote:
> > Some network filesystems, however, currently keep track of which byte ranges
> > are modified within a dirty page (AFS does; NFS seems to also) and only write
> > out the modified data.
> NFS definitely does. I haven't used NFS in two decades, but I worked
> on some of the code (read: I made nfs use the page cache both for
> reading and writing) back in my Transmeta days, because NFSv2 was the
> default filesystem setup back then.
> See fs/nfs/write.c, although I have to admit that I don't recognize
> that code any more.
> It's fairly important to be able to do streaming writes without having
> to read the old contents for some loads. And read-modify-write cycles
> are death for performance, so you really want to coalesce writes until
> you have the whole page.

I completely agree with you.  The context you're missing is that Dave
wants to do RMW twice.  He doesn't do the delaying SetPageUptodate dance.
If the write is less than the whole page, AFS, Ceph and anybody else
using netfs_write_begin() will first read the entire page in and mark
it Uptodate.

Then he wants to track which parts of the page are dirty (at byte
granularity) and send only those bytes to the server in a write request.
So it's the worst of both worlds: first the client does an RMW, then the
server does an RMW (assuming the client's data is no longer in the
server's cache).
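
To put rough numbers on that, here is a user-space sketch (not kernel
code) comparing the two schemes for a single write into one page.  The
4096-byte page size, the struct and both function names are assumptions
made up for illustration; the server_rmw flag only records that the
server may have to do its own read-modify-write, not that it will.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SZ 4096u	/* assumed page size */

/* Hypothetical tally of the I/O caused by one write into one page. */
struct io_cost {
	size_t client_read;	/* bytes the client fetches from the server */
	size_t client_write;	/* bytes the client sends to the server */
	int server_rmw;		/* 1 if the server may need its own RMW */
};

/* Read the whole page in first, then send only the dirty bytes. */
static struct io_cost cost_read_then_partial_write(size_t len)
{
	struct io_cost c = { 0, 0, 0 };

	assert(len > 0 && len <= PAGE_SZ);
	if (len < PAGE_SZ) {
		c.client_read = PAGE_SZ;	/* client-side RMW */
		c.server_rmw = 1;		/* server-side RMW */
	}
	c.client_write = len;
	return c;
}

/* Skip the read and send only the dirty bytes. */
static struct io_cost cost_partial_write_only(size_t len)
{
	struct io_cost c = { 0, 0, 0 };

	assert(len > 0 && len <= PAGE_SZ);
	c.client_write = len;
	c.server_rmw = (len < PAGE_SZ);
	return c;
}
```

For a 100-byte write, the first scheme reads 4096 bytes and still leaves
the server with a possible RMW; the second reads nothing on the client.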

The NFS code moves the RMW from the client to the server, and that makes
a load of sense.

> That said, I suspect it's also *very* filesystem-specific, to the
> point where it might not be worth trying to do in some generic manner.

It certainly doesn't make sense for block filesystems.  Since they
can only do I/O on block boundaries, a sub-block write has to read in
the surrounding block, and once you're doing that, you might as well
read in the whole page.
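
The block-boundary constraint can be sketched in user-space C as below.
This is only an illustration: struct range, block_span() and
needs_read() are hypothetical names, and the block size is assumed to be
a power of two.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open byte range [start, end). */
struct range {
	size_t start;
	size_t end;
};

/* Round a write of len bytes at pos outwards to block boundaries. */
static struct range block_span(size_t pos, size_t len, size_t block_size)
{
	struct range r;

	/* block_size must be a non-zero power of two */
	assert(block_size > 0 && (block_size & (block_size - 1)) == 0);
	assert(len > 0);

	r.start = pos & ~(block_size - 1);
	r.end = (pos + len + block_size - 1) & ~(block_size - 1);
	return r;
}

/* Return 1 if the write does not cover whole blocks, so old data must be read. */
static int needs_read(size_t pos, size_t len, size_t block_size)
{
	struct range r = block_span(pos, len, block_size);

	return r.start != pos || r.end != pos + len;
}
```

A 50-byte write at offset 100 with 512-byte blocks expands to the span
[0, 512) and needs a read; a 1024-byte write at offset 512 does not.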

Tracking sub-page dirty bits still makes sense.  It's on my to-do
list for iomap.
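
One way to picture sub-page dirty tracking is a per-page bitmap with one
bit per block.  The sketch below is user-space C and is not the iomap
implementation; the 4096-byte page, the 512-byte block and all names are
assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SZ		4096u	/* assumed page size */
#define BLOCK_SZ	512u	/* assumed block size */
#define BLOCKS_PER_PAGE	(PAGE_SZ / BLOCK_SZ)

/* Hypothetical per-page state: bit i set means block i is dirty. */
struct page_state {
	uint8_t dirty;
};

/* Mark every block touched by [off, off + len) as dirty. */
static void mark_dirty(struct page_state *ps, size_t off, size_t len)
{
	size_t first, last, i;

	assert(len > 0 && off < PAGE_SZ && len <= PAGE_SZ - off);
	first = off / BLOCK_SZ;
	last = (off + len - 1) / BLOCK_SZ;
	for (i = first; i <= last; i++)
		ps->dirty |= (uint8_t)(1u << i);
}

/* Number of bytes writeback would have to send for this page. */
static size_t dirty_bytes(const struct page_state *ps)
{
	size_t i, n = 0;

	for (i = 0; i < BLOCKS_PER_PAGE; i++)
		if (ps->dirty & (1u << i))
			n += BLOCK_SZ;
	return n;
}
```

A 1000-byte write at offset 600 dirties blocks 1 to 3, so writeback
would send 1536 bytes rather than the full 4096.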

> [ goes off and looks. See "nfs_write_begin()" and friends in
> fs/nfs/file.c for some of the examples of these things, although it
> looks like the code is less aggressive about avoiding the
> read-modify-write case than I thought I remembered, and only does it
> for write-only opens ]

NFS is missing one trick; it could implement aops->is_partially_uptodate
and then it would be able to read back bytes that have already been
written by this client without writing back the dirty ranges and fetching
the page from the server.
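
The kind of check such a hook would perform can be sketched in
user-space C as follows.  This is not the kernel's
is_partially_uptodate signature and not NFS code: struct page_track and
range_is_uptodate() are hypothetical, and a single written range per
page is assumed for simplicity.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SZ 4096u	/* assumed page size */

/* Hypothetical per-page tracking: one written byte range, half-open. */
struct page_track {
	int uptodate;		/* non-zero if the whole page is valid */
	size_t wr_start;	/* first byte written by this client */
	size_t wr_end;		/* one past the last byte written */
};

/*
 * Return 1 if [from, from + count) can be served from the page without
 * going to the server: either the whole page is valid, or the request
 * lies entirely inside the range this client has already written.
 */
static int range_is_uptodate(const struct page_track *pt,
			     size_t from, size_t count)
{
	assert(from <= PAGE_SZ && count <= PAGE_SZ - from);

	if (pt->uptodate || count == 0)
		return 1;
	return from >= pt->wr_start && from + count <= pt->wr_end;
}
```

With bytes [100, 300) written and the page not yet Uptodate, a read of
200 bytes at offset 100 can be satisfied locally, while a read of 100
bytes at offset 50 cannot.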

Maybe this isn't an important optimisation.
