From: David Howells <>
To: Matthew Wilcox <>
Cc: Christoph Hellwig <>, Linus Torvalds <>
Subject: Re: Could it be made possible to offer "supplementary" data to a DIO write ?
Date: Thu, 05 Aug 2021 14:07:03 +0100

Matthew Wilcox <> wrote:

> > Say, for example, I need to write a 3-byte change from a page, where that
> > page is part of a 256K sequence in the pagecache.  Currently, I have to
> > round the 3-bytes out to DIO size/alignment, but I could say to the API,
> > for example, "here's a 256K iterator - I need bytes 225-227 written, but
> > you can write more if you want to"?
>
> I think you're optimising the wrong thing.  No actual storage lets you
> write three bytes.  You're just pushing the read/modify/write cycle to
> the remote end.  So you shouldn't even be tracking that three bytes have
> been dirtied; you should be working in multiples of i_blocksize().

I'm dealing with network filesystems that don't necessarily let you know what
i_blocksize is.  Assume it to be 1.
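To make the rounding cost concrete, here is a minimal userspace sketch of
expanding a dirty byte range outward to a DIO alignment.  It is not kernel
code: struct byte_range and round_out() are hypothetical names used only for
illustration, and the alignment is assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type: a byte range starting at 'start' and 'len' bytes long. */
struct byte_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Expand a range outward so that both ends fall on 'align' boundaries.
 * 'align' is assumed to be a non-zero power of two.
 */
static struct byte_range round_out(struct byte_range r, uint64_t align)
{
	uint64_t first, end;
	struct byte_range out;

	assert(align != 0 && (align & (align - 1)) == 0);

	first = r.start & ~(align - 1);                      /* round start down */
	end = (r.start + r.len + align - 1) & ~(align - 1);  /* round end up */

	out.start = first;
	out.len = end - first;
	return out;
}
```

With bytes 225-227 (start 225, len 3) and an alignment of 512, this yields
start 0 and len 512; with an alignment of 1, the range stays at 3 bytes.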

Further, only sending, say, 3 bytes and pushing RMW to the remote end is not
necessarily wrong for a network filesystem for at least two reasons: it
reduces the network loading and it reduces the effects of third-party write

> I don't know of any storage which lets you ask "can I optimise this
> further for you by using a larger size".  Maybe we have some (software)
> compressed storage which could do a better job if given a whole 256kB
> block to recompress.

It would offer an extent-based filesystem the possibility of adjusting its
extent list.  And if you were mad enough to put your cache on a shingled
drive...  (though you'd probably need a much bigger block than 256K to make
that useful).  Also, jffs2 (if someone used that as a cache) can compress its

> So it feels like you're both tracking dirty data at too fine a granularity,
> and getting ahead of actual hardware capabilities by trying to introduce a
> too-flexible API.

We might not know what the h/w caps are and there may be multiple destination
servers with different h/w caps involved.  Note that NFS and AFS in the kernel
both currently track at byte granularity and only send the bytes that changed.
The expense of setting up the write op on the server might actually outweigh
the RMW cycle.  With something like ceph, the server might actually have a
whole-object RMW/COW, say 4M.
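As an illustration of what tracking at byte granularity can look like, here
is a rough sketch of a per-page dirty region that only grows when a new write
touches or overlaps it.  This is not the NFS or AFS code; struct dirty_region
and extend_dirty() are hypothetical, and the policy of refusing a disjoint
write (so the caller can flush first) is an assumption made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-page record of the dirtied bytes, half-open [from, to). */
struct dirty_region {
	uint32_t from;
	uint32_t to;
	bool valid;	/* false until the first write is recorded */
};

/*
 * Record a write of [from, to).  Returns true if the region now covers it.
 * Returns false, leaving the region unchanged, if the write is separated
 * from the existing region by a gap of untouched bytes.
 */
static bool extend_dirty(struct dirty_region *d, uint32_t from, uint32_t to)
{
	if (!d->valid) {
		d->from = from;
		d->to = to;
		d->valid = true;
		return true;
	}

	if (from > d->to || to < d->from)
		return false;	/* disjoint: merging would cover unmodified bytes */

	if (from < d->from)
		d->from = from;
	if (to > d->to)
		d->to = to;
	return true;
}
```

Only the bytes in [from, to) would then be sent, rather than the whole page.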

Yet further, if your network fs has byte-range locks/leases and you have a
write lock/lease that ends part way into a page, when you drop that lock/lease
you shouldn't flush any data outside of that range lest you overwrite a range
that someone else has a lock/lease on.
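The constraint amounts to intersecting the dirty range with the range covered
by the lock/lease before flushing.  A minimal sketch, with hypothetical names
(struct span, clamp_to_lock()) and half-open byte ranges assumed:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical half-open byte range [start, end). */
struct span {
	uint64_t start;
	uint64_t end;
};

/*
 * Compute the part of 'dirty' that lies inside 'lock'.  Returns false if
 * there is no overlap, in which case nothing may be flushed under this lock.
 */
static bool clamp_to_lock(struct span dirty, struct span lock, struct span *out)
{
	uint64_t s = dirty.start > lock.start ? dirty.start : lock.start;
	uint64_t e = dirty.end < lock.end ? dirty.end : lock.end;

	if (s >= e)
		return false;

	out->start = s;
	out->end = e;
	return true;
}
```

For a dirty page covering bytes 0-4095 and a lock covering bytes 1000-2999,
only bytes 1000-2999 would be written back when the lock is dropped.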

