LKML Archive on lore.kernel.org
From: Matthew Wilcox <willy@infradead.org>
To: David Howells <dhowells@redhat.com>
Cc: linux-fsdevel@vger.kernel.org, jlayton@kernel.org,
	Christoph Hellwig <hch@infradead.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	dchinner@redhat.com, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: Could it be made possible to offer "supplementary" data to a DIO write ?
Date: Thu, 5 Aug 2021 16:06:51 +0100
Message-ID: <YQv+iwmhhZJ+/ndc@casper.infradead.org>
In-Reply-To: <1186271.1628174281@warthog.procyon.org.uk>

On Thu, Aug 05, 2021 at 03:38:01PM +0100, David Howells wrote:
> > If you want to take leases at byte granularity, and then not writeback
> > parts of a page that are outside that lease, feel free.  It shouldn't
> > affect how you track dirtiness or how you writethrough the page cache
> > to the disk cache.
> 
> Indeed.  Handling writes to the local disk cache is different from handling
> writes to the server(s).  The cache has a larger block size but I don't have
> to worry about third-party conflicts on it, whereas the server can be taken as
> having no minimum block size, but my write can clash with someone else's.
> 
> Generally, I prefer to write back the minimum I can get away with (as does the
> Linux NFS client AFAICT).
> 
> However, if everyone agrees that we should only ever write back a multiple of
> a certain block size, even to network filesystems, what block size should that
> be?

If your network protocol doesn't give you a way to ask the server what
size it is, assume 512 bytes and allow it to be overridden by a mount
option.
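
A minimal sketch of that fallback order, assuming a made-up mount
context (the struct, field and function names below are illustrative
only and don't come from any in-tree filesystem):

	/* Illustrative only: a hypothetical netfs mount context. */
	struct my_fs_context {
		unsigned int server_blocksize;	/* 0 if the protocol can't report one */
		unsigned int opt_wsize;		/* 0 if no wsize= mount option was given */
	};

	/* Pick the granularity for writing dirty data back to the server. */
	static unsigned int my_fs_write_blocksize(const struct my_fs_context *ctx)
	{
		if (ctx->server_blocksize)	/* the protocol told us */
			return ctx->server_blocksize;
		if (ctx->opt_wsize)		/* mount-option override */
			return ctx->opt_wsize;
		return 512;			/* default assumption, as above */
	}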

> Note that PAGE_SIZE varies across arches and folios are going to
> exacerbate this.  What I don't want to happen is that you read from a file, it
> creates, say, a 4M (or larger) folio; you change three bytes and then you're
> forced to write back the entire 4M folio.

Actually, you do.  Two situations:

1. Application uses MADV_HUGEPAGE.  In response, we create a 2MB
page and mmap it aligned.  We use a PMD sized TLB entry and then the
CPU dirties a few bytes with a store.  There's no sub-TLB-entry tracking
of dirtiness.  It's just the whole 2MB.  (A small userspace sketch of
this case follows after the list.)

2. The bigger the folio, the more writes it will absorb before being
written back.  So when you're writing back that 4MB folio, you're not
just servicing this 3 byte write, you're servicing every other write
which hit this 4MB chunk of the file.
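
To make situation 1 above concrete, here is a small userspace sketch.
The file name is a placeholder and the file is assumed to be at least
2MB long; madvise() is only a hint, and whether the kernel actually
installs a PMD-sized mapping for a file depends on configuration and
filesystem support:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define HPAGE_SIZE	(2UL * 1024 * 1024)

	int main(void)
	{
		char *p;
		int fd = open("bigfile", O_RDWR);	/* placeholder path */

		if (fd < 0)
			return 1;
		/* File offset 0, so a PMD mapping can be naturally aligned. */
		p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return 1;
		madvise(p, HPAGE_SIZE, MADV_HUGEPAGE);	/* hint: use a huge page */
		p[3] = 'x';	/* a one-byte store; the dirty bit covers all 2MB */
		munmap(p, HPAGE_SIZE);
		close(fd);
		return 0;
	}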

There is one exception I've found, and that's O_SYNC writes.  These are
pretty rare, and I think I have a solution to it which essentially treats
the page cache as writethrough (for sync writes).  We skip marking
the page (folio) as dirty and go straight to marking it as writeback.
We have all the information we need about which bytes to write and we're
actually using the existing page cache infrastructure to do it.

I'm working on implementing that in iomap; there are some SMOP-type
problems to solve, but it looks doable.
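
For illustration only, a very rough kernel-context sketch of that
O_SYNC writethrough idea -- not the actual iomap code.  The helpers
marked "hypothetical" stand in for the real buffered-write machinery;
folio_start_writeback() and folio_unlock() are the folio equivalents
of set_page_writeback() and unlock_page():

	/* Sketch, not real code: copy into the page cache, but skip marking
	 * the folio dirty and go straight to writeback, then write out only
	 * the byte range this write touched. */
	static ssize_t o_sync_writethrough(struct file *file, const char *buf,
					   size_t len, loff_t pos)
	{
		struct folio *folio;

		folio = get_locked_folio_for_write(file, pos, len);	/* hypothetical */
		copy_into_folio(folio, buf, len, pos);			/* hypothetical */

		folio_start_writeback(folio);	/* no folio_mark_dirty() first */
		folio_unlock(folio);

		/* Submit I/O for just [pos, pos + len), not the whole folio. */
		return write_folio_range_to_store(folio, pos, len);	/* hypothetical */
	}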


Thread overview: 19+ messages
2021-08-05 10:19 Could it be made possible to offer "supplementary" data to a DIO write ? David Howells
2021-08-05 12:37 ` Matthew Wilcox
2021-08-05 13:07 ` David Howells
2021-08-05 13:35   ` Matthew Wilcox
2021-08-05 14:38   ` David Howells
2021-08-05 15:06     ` Matthew Wilcox [this message]
2021-08-05 15:38     ` David Howells
2021-08-05 16:35     ` Canvassing for network filesystem write size vs page size David Howells
2021-08-05 17:27       ` Linus Torvalds
2021-08-05 17:43         ` Trond Myklebust
2021-08-05 22:11         ` Matthew Wilcox
2021-08-06 13:42         ` David Howells
2021-08-06 14:17           ` Matthew Wilcox
2021-08-06 15:04           ` David Howells
2021-08-05 17:52       ` Adam Borowski
2021-08-05 18:50       ` Jeff Layton
2021-08-05 23:47       ` Matthew Wilcox
2021-08-06 13:44       ` David Howells
2021-08-05 17:45     ` Could it be made possible to offer "supplementary" data to a DIO write ? Adam Borowski
