LKML Archive on lore.kernel.org
* Proposal for "proper" durable fsync() and fdatasync()
@ 2008-02-26  7:26 Jamie Lokier
  2008-02-26  7:43 ` Andrew Morton
  2008-02-26  7:43 ` Jeff Garzik
  0 siblings, 2 replies; 22+ messages in thread
From: Jamie Lokier @ 2008-02-26  7:26 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: Chris Wedgwood

Dear kernel,

This is a proposal to add "proper" durable fsync() and fdatasync() to Linux.

First the problem, then a proposed solution "with benefits", so to speak.

I need feedback on the details, before implementing anything.  Or
(hopefully) someone else thinks it's very important and does it
themselves :-)

By durable, I mean that fsync() should actually commit writes to
physical stable storage, not just the disk write cache when that is
enabled.  Databases and guest VMs need this, or an equivalent
feature, if they aren't to face occasional corruption after power
failure and perhaps some crashes.

The alternative is to disable the disk write cache.  But that isn't
modern practice or recommendation, since I/O write barriers were
implemented and they are much faster.

I was surprised that fsync() doesn't do this already.  There was a lot
of effort put into block I/O write barriers during 2.5, so that
journalling filesystems can force correct write ordering, using disk
flush cache commands.

After all that effort, I was very surprised to notice that Linux 2.6.x
doesn't use that capability to ensure fsync() flushes the disk cache
onto stable storage.

I noticed this following up discussions on the Qemu mailing list,
about guest VMs and how their IDE flush cache command should translate
to fsync() to avoid data loss.  (For guest VMs, fsync() isn't
necessary if the host machine is fine, and it isn't enough (on Linux
host) if the host machine loses power or the hard disk crashes another
way.)

Then I noticed it again, when I was designing a database engine with
filesystem characteristics.  I thought "how do I ensure ordered
journal writes; can I use fdatasync()?" and was surprised to find the
answer is no, I have to use hacks like calling hdparm, and the authors
of major SQL databases seem to brush the problem under a carpet.

(Interestingly, in the Linux 2.4 patches for write barriers, fsync()
seems to be fine, if a bit slow.)

It isn't the first time this topic has come up:

    http://groups.google.com.br/group/linux.kernel/browse_thread/thread/d343e51655b4ac7c/7ee9bca80977c2d1?#7ee9bca80977c2d1
    ("True fsync() in Linux (on IDE)")

In that thread, it was implied that this would be fixed in 2.6.  So I bet
some people are under the illusion that it's fixed in 2.6...


For a while, I've been meaning to bring it up on linux-kernel...


The fsync problem
-----------------

Chris Wedgwood wrote:
> On Mon, Feb 25, 2008 at 08:50:40PM +0000, Jamie Lokier wrote:
> 
> > On Linux (and other host OSes), fdatasync() and fsync() don't always
> > commit data to hard storage; they sometimes only commit it to the hard
> > drive cache.
> 
> That's a filesystem bug IMO.  People should be able to use f[data]sync
> with some level of confidence or else it's basically pointless.

I agree, I consider it a serious bug, and I would be pleased if
someone paid it some love and attention.

Right now, if you want a reliable database on Linux, you _cannot_
properly depend on fsync() or fdatasync().  Considering how much Linux
is used for critical databases, using these functions, this amazes me.

Also, if you have a guest VM, then the guest's filesystem journalling
is not reliable.  Not only can it lose data on power loss, it can
corrupt the guest filesystem too, due to reordering.  This is contrary
to what people expect, I think.

I'm not sure if a system reset can cause similar loss; I don't know
how disks react to that.

Also, for the person porting ZFS to run on FUSE, same applies...

Linux fsync is faulty in two ways:

   1. Database commits aren't _durable_ against power failure, because
      fsync doesn't flush the disk's cache.  This means data reported
      as committed is not guaranteed to have reached stable storage.

   2. It's unsafe for write-ahead logging, because it doesn't really
      guarantee any _ordering_ for the writes at the hard storage
      level.  So aside from losing committed data, it can also corrupt
      structural metadata.

With ext3 it's quite easy to verify that fsync/fdatasync don't always
write a journal entry.  (Apart from looking at the kernel code :-)

Just write some data, fsync(), and observe the number of writes in
/proc/diskstats.  If the current mtime second _hasn't_ changed, the
inode isn't written.  If you write data, say, 10 times a second to the
same place followed by fsync(), you'll see a little more than 10 write
I/Os, and less than 20.
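
One way to run this experiment is sketched below.  It assumes a Linux
/proc/diskstats in which the eighth whitespace-separated field of a
device's line is its count of completed writes.  The function names,
the device name you pass in and the 100 ms interval are illustrative
choices, not part of any existing tool.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Parse one /proc/diskstats line.  Returns 0 and stores the
 * "writes completed" counter if the line describes `dev`, else -1. */
int diskstats_writes_completed(const char *line, const char *dev,
                               unsigned long long *writes)
{
    unsigned int major, minor;
    char name[64];
    unsigned long long rd_ios, rd_merges, rd_sectors, rd_ticks, wr_ios;

    if (sscanf(line, "%u %u %63s %llu %llu %llu %llu %llu",
               &major, &minor, name,
               &rd_ios, &rd_merges, &rd_sectors, &rd_ticks, &wr_ios) != 8)
        return -1;
    if (strcmp(name, dev) != 0)
        return -1;
    *writes = wr_ios;
    return 0;
}

/* Look up `dev` (for example "sda"; an assumption) in /proc/diskstats. */
int read_writes_completed(const char *dev, unsigned long long *writes)
{
    char line[512];
    int found = -1;
    FILE *f = fopen("/proc/diskstats", "r");

    if (!f)
        return -1;
    while (found != 0 && fgets(line, sizeof line, f))
        found = diskstats_writes_completed(line, dev, writes);
    fclose(f);
    return found;
}

/* Overwrite the same 512 bytes `count` times, roughly 10 times a
 * second, calling fsync() after each write. */
int write_fsync_loop(int fd, int count)
{
    static const char buf[512] = "x";

    for (int i = 0; i < count; i++) {
        if (pwrite(fd, buf, sizeof buf, 0) != (ssize_t)sizeof buf)
            return -1;
        if (fsync(fd) != 0)
            return -1;
        usleep(100 * 1000);
    }
    return 0;
}
```

Call read_writes_completed() before and after write_fsync_loop() and
compare the difference with the number of iterations.  Other activity
on the same disk will inflate the count, so an otherwise idle disk
gives the clearest picture.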

By the way, this shows a trick for fixing #2 (ordering): use fchmod()
to toggle the file attributes, and that will force the next fsync() to
write a journal entry, which _does_ issue a write barrier.  If you do
that with each write as above (write, fchmod change, fsync 10 times a
second), you will clearly see more write I/Os, and you'll hear the
disk behaving differently: it's seeking more.
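
As a rough sketch of that workaround (not a recommendation), the
sequence could look like the following.  The function name and the
choice of S_IXOTH as the bit to flip are assumptions made for
illustration; note that it really does change the file's permissions
on every call.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Write, dirty the inode by flipping one permission bit, then fsync().
 * Returns 0 on success, -1 on error (errno is set by the failing call). */
int write_with_attr_toggle(int fd, const void *buf, size_t len, off_t off)
{
    struct stat st;

    if (pwrite(fd, buf, len, off) != (ssize_t)len)
        return -1;
    if (fstat(fd, &st) != 0)
        return -1;
    /* Flip S_IXOTH (an arbitrary choice) so the inode has changed and
     * the following fsync() has metadata to write. */
    if (fchmod(fd, (st.st_mode & 07777) ^ S_IXOTH) != 0)
        return -1;
    return fsync(fd);
}
```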

However, even this ugly trick has problems:

  3. Using the fchmod() trick or good fortune, fsync() issues a write
     barrier.  Right now, this does commit data (if the device can).
     But, if the SCSI mid-layer is fixed to use tag ordering, this
     won't commit data!  Therefore, the fchmod() trick with fsync() is
     good enough for ordering writes for, e.g. a database journal, but
     not for reporting that data is committed to hard storage,
     i.e. it's not durable.

  4. Again using the trick or good fortune, now you have two writes at
     different parts of the disk, with a great big seek.  This is a
     disaster for database-style journalling.  One of the writes is
     technically unnecessary, and the seeks add hugely to the commit
     time and disk wear, and break any attempt to optimise journal
     placement.

Linux has not only fsync(), but fdatasync() and sync_file_range().

Someone clearly put thought into a reasonably performant API for
database-like applications.  (It would be nicer if sync_file_range()
took a vector of ranges for better elevator scheduling, but let's
ignore that :-)

Yet, it isn't safe for the simplest of journalling applications.

If you think this isn't a problem, I can tell you: it is.  Power
failures happen, sometimes by design.  I've seen filesystem corruption
in ext3 filesystems before journalling barriers were added; it wasn't
pretty, and it was enough of a problem that a lot of work was done to
add them cleanly.

The same corruption can happen to databases and guest VM filesystems
with current kernels.


Implementation proposal - block layer
-------------------------------------

Solving this, i.e. implementing fsync() and friends properly, isn't
trivial, but it isn't huge either.

Firstly, we have to look at the elevator and block driver APIs.  It's
worth reading Documentation/block/barrier.txt.  You can queue a
request with HARDBARRIER.  On devices which use ordering tags
(i.e. none because of SCSI driver limitations at present, according to
that doc), it uses ordering tags.  On other devices, if possible, it
uses cache flush commands and/or sets the FUA ("force unit access")
bit on the request.

Now imagine a database (guest VM, etc.) issues some writes.  Time
passes.  The writes are written to the disk's cache.  Then the
database calls fsync().  What kind of request shall we send to the
block device?  We have _no_ outstanding read or write requests to
attach HARDBARRIER to.

So, that's the first thing: the block API needs a way to send that
fsync flush _without_ an associated read or write, and for the fsync()
system call to return when that flush indicates completion.  Let's
call this request HARDFLUSH (similar to HARDBARRIER).

The second thing is that the flush cannot be equivalent to a
HARDBARRIER attached to a NOP request, because HARDBARRIER provides
ordering only, at least in principle.  It must be a real flush.

Sometimes, there _are_ writes pending.  If there's only one since the
last flush, it could be optimised into a HARDBARRIER-FUA request,
which (assuming FUA is ever useful) is good for databases which have
exactly this pattern for their journal writes.

So, that's the third thing: we'd like to coalesce an fsync flush
request with a preceding undispatched write request if there is only
one write pending since the last flush.  Note: it must use
HARDBARRIER-FLUSH or HARDBARRIER-FUA, not HARDBARRIER-TAG alone.  If
tag ordering is used, follow it with HARDFLUSH.  Tag ordering before
the write is fine, but not enough after.


I/O request queue optimisations
-------------------------------

If there's only one write since the last flush, it may be possible to
set the FUA bit on that write instead of flushing after it.

There's no need to send a HARDFLUSH request if there have been no
write requests since the last flush (FUA or explicit), but non-flush
ordering tags don't count.

"Only one write pending" and "no write requests" can actually count
writes which originated from the file being synced; they don't need to
consider writes for other files.
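
The rules above can be modelled as a small decision function.  This is
only a sketch of the logic as I've described it, not block layer code:
the enum, the function and its parameters are all hypothetical names.

```c
#include <stdbool.h>

/* What to do when fsync() asks for a hard flush (hypothetical names). */
enum flush_action {
    FLUSH_NOT_NEEDED,     /* nothing written since the last flush */
    FLUSH_VIA_FUA_WRITE,  /* set FUA on the single pending write  */
    FLUSH_VIA_HARDFLUSH   /* issue an explicit HARDFLUSH request  */
};

/* writes_since_flush: writes from the file being synced since the last
 *     flush (FUA or explicit); ordering tags alone do not reset it.
 * last_write_undispatched: the single write is still in the queue, so
 *     it can still be turned into a FUA write.
 * device_has_fua: the device supports the FUA bit. */
enum flush_action choose_flush_action(unsigned int writes_since_flush,
                                      bool last_write_undispatched,
                                      bool device_has_fua)
{
    if (writes_since_flush == 0)
        return FLUSH_NOT_NEEDED;
    if (writes_since_flush == 1 && last_write_undispatched && device_has_fua)
        return FLUSH_VIA_FUA_WRITE;
    return FLUSH_VIA_HARDFLUSH;
}
```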

When fsync() issues HARDFLUSH, the POSTFLUSH which is _currently_
issued with HARDBARRIER filesystem requests won't be required any
longer.  It could be deferred, safely and maybe profitably, until
before the next write.  This doesn't compromise filesystem integrity
(it's equivalent behaviour to tagged ordering), and it doesn't
compromise fsync() when fsync() does force the flushing.


Ordering of HARDFLUSH and HARDBARRIER
-------------------------------------

At first it may seem that HARDFLUSH is always stronger than
HARDBARRIER; i.e. that one includes the effect of the other.  This is
not true: writes can be moved before a HARDFLUSH, if the elevator
wants, but writes cannot be moved before a HARDBARRIER.  Another point
of view is that a HARDFLUSH can be safely delayed while other writes
proceed, perhaps to coalesce it with something.

Therefore, when queuing a request, both flags must be used together if
that's intended.  There are scenarios where either flag alone is
useful, or both together.

When a request has both HARDFLUSH and HARDBARRIER flags, it is
permitted to split it into two requests, to move later writes before
the HARDFLUSH but not before the HARDBARRIER.  This might be
advantageous in some scenarios using tagged ordering: delaying
flushes, perhaps to coalesce them, can be useful.  It is obviously
useless when barriers are implemented using flush.


Block drivers
-------------

These need the ability to receive a HARDFLUSH request by itself or
combined with a write (after it).  HARDFLUSH must have the option of
being combined with the HARDBARRIER flag, just like other requests.
When HARDBARRIER is itself implemented using a flush or FUA, they
simply combine.  But when HARDBARRIER is using ordered tags, then this
ordering still must apply to the flush command.


Software RAID (etc.) drivers
----------------------------

HARDFLUSH can optionally be confined to a subset of the underlying
devices.  Thus it is reasonable for HARDFLUSH to be associated with a
sector range, which these drivers can use to select which devices to
flush.

HARDBARRIER can optionally be associated with a sector range too.  For
certain purposes, that means to wait for writes before the barrier
only in the corresponding range.  But be careful: it still orders
_all_ writes after the barrier, regardless of which underlying device
they reach.  Thus there are cross-device barriers.

To implement cross-device barriers, HARDBARRIERs must convert to
flushes, when followed by writes to other underlying devices, but can
use tagged ordering when followed only by writes to the same
underlying device, if there is only one.  Here be dragons, take care.

The easy way out, albeit not quite optimal, is to always convert
barriers to flushes on all underlying devices, which I think the
existing implementation does.


Filesystems
-----------

The fsync() methods should issue a HARDFLUSH after/with the journal
write, in addition to HARDBARRIER as is used now.  This may involve
adding a flag to the journalling code of each filesystem.

The proposed sync_file_range() enhancements might have interesting
consequences for how and when filesystem metadata is written, when new
blocks are allocated.


Userspace API enhancements
--------------------------

It is questionable whether fsync() and fdatasync() should always
implement hard flushes.  Immediately, there will be complaints that
Linux got much slower with some databases.

I read rumours that Mac OS X encountered this, and because it looks
bad, decided to keep hard flushes separate, using fcntl(F_FULLFSYNC).
I don't think there is a hard flush equivalent to fdatasync().

I'm thinking it should be a per-filesystem flag (and/or a system-wide
default, and/or a per-file-descriptor flag) whether fsync() and
fdatasync() implement hard flushes.

For proper application control, we have the flags in
sync_file_range().  I propose that additional flags be added.

Just to be a bit cheeky and versatile, I propose that the additional
flags indicate when hard flushing is required, when it's explicitly
not required (overriding a system default for fsync), and orthogonally
(since it is orthogonal) do the same for hard barriers.  I'm sure some
databases and userspace filesystems would appreciate the various options.

To add to the cheekiness, I propose that the API _allow_ but not
require that individual pages (actually bytes) keep track of whether
they have been followed by a hard barrier and/or hard flush.  The
implementation doesn't have to do that: it can be much coarser.  It's
nice if the API allows the possibility to refine the implementation
later.

Finally, support for flushes and/or barriers between O_DIRECT writes
is essential for some applications.


Proposal for sync_file_range()
------------------------------

Logically, associate with each page (or byte, block, file...) some flags:

     hardbarrier = { needed, pending, clean }
     hardflush = { needed, pending, clean }

These flags are maintained at whatever granularity is convenient.

In addition, flags are maintained at whatever granularity is
convenient with O_DIRECT too.  This might be the file or file
descriptor, and/or the flags may be associated with each underlying
device in a software RAID.

Note: this is not as invasive as it sounds.  A simple implementation
can maintain those two flags for the file as a whole (not per page),
or even just the block device as a whole; that's easy.  We describe it
with fine granularity conceptually, to allow it in principle, as it
appears in the new API description of sync_file_range().

When a dirty page is scheduled for write-out (by any mechanism), and
the write-out completes, it is marked as clean.  When this occurs,
mark the page as "hardbarrier-needed" and "hardflush-needed", to
indicate it is written to the block device, but not committed to hard
storage.

When a HARDBARRIER or HARDFLUSH request is enqueued to a device (not
when it's issued), for all pages backed by the device, change the
flags to "hardbarrier-pending" and/or "hardflush-pending" if they were
"-needed".  When such a request completes (successfully?), set the
appropriate flags to "hardbarrier-clean" and/or "hardflush-clean".
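
To make the transitions concrete, here is a minimal model of the two
flags at whatever granularity is tracked (page, file or device).  All
names are hypothetical.  Going back to "-needed" when a request fails
is my assumption for the sketch; the text above leaves that case open.

```c
#include <stdbool.h>

enum hard_state { HARD_CLEAN, HARD_NEEDED, HARD_PENDING };

struct hard_flags {
    enum hard_state hardbarrier;
    enum hard_state hardflush;
};

/* Write-out finished: data is on the device, not yet on hard storage. */
void on_writeout_complete(struct hard_flags *f)
{
    f->hardbarrier = HARD_NEEDED;
    f->hardflush = HARD_NEEDED;
}

/* A HARDBARRIER and/or HARDFLUSH request was enqueued to the device. */
void on_request_enqueued(struct hard_flags *f, bool barrier, bool flush)
{
    if (barrier && f->hardbarrier == HARD_NEEDED)
        f->hardbarrier = HARD_PENDING;
    if (flush && f->hardflush == HARD_NEEDED)
        f->hardflush = HARD_PENDING;
}

/* The request completed.  On failure, fall back to "-needed"
 * (an assumption, see above). */
void on_request_completed(struct hard_flags *f, bool barrier, bool flush,
                          bool ok)
{
    if (barrier && f->hardbarrier == HARD_PENDING)
        f->hardbarrier = ok ? HARD_CLEAN : HARD_NEEDED;
    if (flush && f->hardflush == HARD_PENDING)
        f->hardflush = ok ? HARD_CLEAN : HARD_NEEDED;
}
```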

New flags:

    SYNC_FILE_RANGE_HARD_FLUSH
        If SYNC_FILE_RANGE_WRITE is set, if any dirty page write-outs
        are initiated, queue a hard flush following the last one.  If
        there are no dirty pages, check the "hardflush" flags
        corresponding to all pages in the range, and corresponding to
        O_DIRECT for this file descriptor.  If any are
        "hardflush-needed", or the page range is empty, queue a hard
        flush soon.  In the empty page range case, set
        "hardflush-needed" in the flags corresponding to O_DIRECT,
        so that waiting for an empty page range will wait for it.

        If SYNC_FILE_RANGE_WAIT_BEFORE and/or
        SYNC_FILE_RANGE_WAIT_AFTER are set, after waiting for all
        write-outs to complete, check the "hardflush" flags
        corresponding to all pages in the range, and corresponding to
        O_DIRECT for this file descriptor.  If any are set to
        "hardflush-needed", queue a hard flush, then wait until they
        are all "hardflush-clean".

    SYNC_FILE_RANGE_HARD_BARRIER
        Same as SYNC_FILE_RANGE_HARD_FLUSH, except that "hardbarrier"
        is used instead of "hardflush", and hard barrier requests are
        queued instead of hard flushes.

        Important: SYNC_FILE_RANGE_HARD_BARRIER is a barrier only for
        writes in the specified range _before_ the barrier, but it
        controls _all_ writes to any offset after the barrier.  This
        is because there's no point in the barrier controlling offsets
        other than those where write-outs have been explicitly
        requested, and this has the practical benefit of reducing
        flushes in multi-device configurations, but acting as a
        barrier against later writes for other offsets is very useful.

        Note that this flag is not normally used if
        SYNC_FILE_RANGE_HARD_FLUSH is used in conjunction with
        SYNC_FILE_RANGE_WAIT_AFTER or SYNC_FILE_RANGE_FSYNC.  Those
        combinations wait until data is written and hard flushed
        before returning, so there is no way for the caller to issue
        more requests logically after the barrier, until the data is
        flushed anyway.  In these cases, using a barrier only
        penalises other processes for no gain.  However, you can do
        so; it is not forbidden.

    SYNC_FILE_RANGE_NO_FLUSH
        If the system is administratively set to issue hard flushes
        for fsync(), fdatasync() and sync_file_range(), which means it
        implicitly sets SYNC_FILE_RANGE_HARD_FLUSH, this flag
        _disables_ the implicit setting of that flag.  This does not
        guarantee no hard flush occurs; it merely disables asking for
        it.  This has no effect on SYNC_FILE_RANGE_HARD_BARRIER.

    SYNC_FILE_RANGE_NO_BARRIER
        Same as SYNC_FILE_RANGE_NO_FLUSH, except it affects implicit
        SYNC_FILE_RANGE_HARD_BARRIER instead.  This has no effect on
        SYNC_FILE_RANGE_HARD_FLUSH.

    SYNC_FILE_RANGE_FSYNC
        Write any additional metadata that fsync() would include over
        fdatasync(), and wait for those writes to complete.  It might,
        potentially, do everything that fsync() does, including
        writing all data and waiting for it, even without setting any
        other flags.  Or it might just write the metadata.

        This flag allows you to combine SYNC_FILE_RANGE_FSYNC with
        SYNC_FILE_RANGE_{,NO_}HARD_{FLUSH,BARRIER}, to have more
        fine-grained control over the behaviour of fsync().

    SYNC_FILE_RANGE_HARD_FSYNC
        This forces a hard flushing fsync().  You should set the page
        range to cover all possible offsets, to get the full effect of
        fsync().

        It is an alias for SYNC_FILE_RANGE_FSYNC |
        SYNC_FILE_RANGE_HARD_FLUSH | SYNC_FILE_RANGE_WAIT_BEFORE |
        SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER.

        SYNC_FILE_RANGE_HARD_BARRIER is omitted, because this waits
        for the flush to complete before returning, so there is
        nothing gained by a hard barrier and it can penalise other
        processes.


Usage notes for journalling filesystem in userspace
---------------------------------------------------

For something like ext3, the pattern for a non-flushing metadata
journal update is: write to journal, write barrier, write journal
commit record, write barrier, write metadata elsewhere.

In this API, you could write (whether using O_DIRECT or not):

    pwrite(fd, journal_data, journal_length, journal_offset);
    sync_file_range(fd, journal_offset, journal_length,
                    (SYNC_FILE_RANGE_WRITE
                     | SYNC_FILE_RANGE_WAIT_AFTER
                     | SYNC_FILE_RANGE_HARD_BARRIER));
    pwrite(fd, commit_data, commit_length, commit_offset);
    sync_file_range(fd, commit_offset, commit_length,
                    (SYNC_FILE_RANGE_WRITE
                     | SYNC_FILE_RANGE_WAIT_AFTER
                     | SYNC_FILE_RANGE_HARD_BARRIER));
    pwrite(fd, metadata, metadata_length, metadata_offset);

If you wanted to request a durable commit (i.e. hard flush, fsync()
from filesystem user's perspective), then you could add
SYNC_FILE_RANGE_HARD_FLUSH to the second sync_file_range() call.  The
barrier from the first call ensures the journal entry is implicitly
flushed before the commit record, making the whole commit durable.

Alternatively, you could use a third sync_file_range() just for the
flush, after the data write.  Probably the first method is better: if
there is an advantage to reordering the requests to move the flush
later, the elevator is free to do that.
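
A small sketch of how the flag words for the two sync_file_range()
calls might be built, with durability as an option.  The PROPOSED_*
flags do not exist in any kernel; their numeric values here are
placeholders I picked so they don't collide with the three existing
flag bits, and a current kernel would most likely reject them with
EINVAL.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdbool.h>

/* HYPOTHETICAL: proposed flags only, placeholder values. */
#define PROPOSED_SYNC_FILE_RANGE_HARD_FLUSH   0x100u
#define PROPOSED_SYNC_FILE_RANGE_HARD_BARRIER 0x200u

/* Flags for the sync after the journal data write. */
unsigned int journal_sync_flags(void)
{
    return SYNC_FILE_RANGE_WRITE
         | SYNC_FILE_RANGE_WAIT_AFTER
         | PROPOSED_SYNC_FILE_RANGE_HARD_BARRIER;
}

/* Flags for the sync after the commit record write.  With `durable`
 * set, the commit is also hard flushed before the call returns. */
unsigned int commit_sync_flags(bool durable)
{
    unsigned int flags = journal_sync_flags();

    if (durable)
        flags |= PROPOSED_SYNC_FILE_RANGE_HARD_FLUSH;
    return flags;
}
```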

(By the way, if the commit record is a single device sector and
O_DIRECT is used, and everything is aligned just so, you may feel it
doesn't require a checksum, such is your confidence in a disk's
ability to write whole sectors or not.  If the commit record is any
other size, or O_DIRECT isn't used (which makes it a page size at
least), a checksum should be used.  Also, without O_DIRECT, be careful
of writing partial pages or misaligned pages as they are converted to
full page writes, and power failure may corrupt data that you didn't
explicitly write to.  There are many issues besides barriers and
flushing to get right when journalling for data integrity.)
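
For illustration, a checksummed commit record could be as simple as
the sketch below.  The record layout and function names are made up
for this example, and CRC-32 is just one reasonable choice of
checksum; a real journal format would pick its own.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical commit record: 16 bytes, no padding. */
struct commit_record {
    uint64_t sequence;        /* transaction sequence number        */
    uint32_t journal_length;  /* bytes of journal data it commits   */
    uint32_t crc;             /* CRC-32 of the fields above         */
};

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320). */
uint32_t crc32_bytes(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Fill in the checksum just before the record is written. */
void commit_record_seal(struct commit_record *rec)
{
    rec->crc = crc32_bytes(rec, offsetof(struct commit_record, crc));
}

/* On recovery: non-zero if the record was written completely. */
int commit_record_valid(const struct commit_record *rec)
{
    return rec->crc == crc32_bytes(rec, offsetof(struct commit_record, crc));
}
```

On recovery, a record that fails commit_record_valid() is treated as
never having been committed.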
    

Request for comments
--------------------

I'm not 100% sure of this API, but on the face of it, it seems it
could be quite versatile while being not too hard to implement, and
with performance improvements in future.

I expect the call should work with block devices, as well as files.
Does it provide sufficiently full access to the elevator barrier
capabilities in a tidy package?

Is this sufficient for correct and efficient behaviour over software
RAID and similar things?

Database, virtual machine and filesystem implementors,
please take a look at the API and see if it makes sense.

If one or two other people are interested to help, even if it's only
testing (and you're not in a rush...) I am willing to help implement
this.

-- Jamie

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Proposal for "proper" durable fsync() and fdatasync()
  2008-02-26  7:26 Proposal for "proper" durable fsync() and fdatasync() Jamie Lokier
@ 2008-02-26  7:43 ` Andrew Morton
  2008-02-26  7:59   ` Jamie Lokier
  2008-02-26  7:43 ` Jeff Garzik
  1 sibling, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2008-02-26  7:43 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel, linux-fsdevel, Chris Wedgwood

On Tue, 26 Feb 2008 07:26:50 +0000 Jamie Lokier <jamie@shareable.org> wrote:

> (It would be nicer if sync_file_range()
> took a vector of ranges for better elevator scheduling, but let's
> ignore that :-)

Two passes:

Pass 1: shove each of the segments into the queue with
        SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE

Pass 2: wait for them all to complete and return accumulated result
        with SYNC_FILE_RANGE_WAIT_AFTER
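
A sketch of those two passes with the existing sync_file_range() call
follows.  The struct and function names are illustrative, and
reporting the first error seen is one possible reading of "accumulated
result".  As discussed in this thread, this only waits for write-out
to the device; it does not by itself flush the drive's write cache.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>

struct file_range {
    off_t offset;
    off_t nbytes;
};

/* Returns 0 if every call succeeded, else -1 with errno set to the
 * first error encountered. */
int sync_ranges(int fd, const struct file_range *r, size_t n)
{
    int first_errno = 0;

    /* Pass 1: queue write-out for every range. */
    for (size_t i = 0; i < n; i++)
        if (sync_file_range(fd, r[i].offset, r[i].nbytes,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE) != 0 && first_errno == 0)
            first_errno = errno;

    /* Pass 2: wait for all of them to complete. */
    for (size_t i = 0; i < n; i++)
        if (sync_file_range(fd, r[i].offset, r[i].nbytes,
                            SYNC_FILE_RANGE_WAIT_AFTER) != 0 && first_errno == 0)
            first_errno = errno;

    if (first_errno != 0) {
        errno = first_errno;
        return -1;
    }
    return 0;
}
```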




* Re: Proposal for "proper" durable fsync() and fdatasync()
  2008-02-26  7:26 Proposal for "proper" durable fsync() and fdatasync() Jamie Lokier
  2008-02-26  7:43 ` Andrew Morton
@ 2008-02-26  7:43 ` Jeff Garzik
  2008-02-26  7:55   ` Jamie Lokier
                     ` (2 more replies)
  1 sibling, 3 replies; 22+ messages in thread
From: Jeff Garzik @ 2008-02-26  7:43 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel, linux-fsdevel, Chris Wedgwood

Jamie Lokier wrote:
> By durable, I mean that fsync() should actually commit writes to
> physical stable storage,

Yes, it should.


> I was surprised that fsync() doesn't do this already.  There was a lot
> of effort put into block I/O write barriers during 2.5, so that
> journalling filesystems can force correct write ordering, using disk
> flush cache commands.
> 
> After all that effort, I was very surprised to notice that Linux 2.6.x
> doesn't use that capability to ensure fsync() flushes the disk cache
> onto stable storage.

It's surprising you are surprised, given that this [lame] fsync behavior 
has remained consistently lame throughout Linux's history.

[snip huge long proposal]

Rather than invent new APIs, we should fix the existing ones to _really_ 
flush data to physical media.

Linux should default to SAFE data storage, and permit users to retain 
the older unsafe behavior via an option.  It's completely ridiculous 
that we default to an unsafe fsync.

And [anticipating a common response from others] it is completely 
irrelevant that POSIX fsync(2) permits Linux's current behavior.  The 
current behavior is unsafe.

Safety before performance -- ESPECIALLY when it comes to storing user data.

Regards,

	Jeff (Linux ATA driver dude)




* Re: Proposal for "proper" durable fsync() and fdatasync()
  2008-02-26  7:43 ` Jeff Garzik
@ 2008-02-26  7:55   ` Jamie Lokier
  2008-02-26  9:25   ` Jamie Lokier
  2008-02-26 12:13   ` Ric Wheeler
  2 siblings, 0 replies; 22+ messages in thread
From: Jamie Lokier @ 2008-02-26  7:55 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: linux-kernel, linux-fsdevel, Chris Wedgwood

Jeff Garzik wrote:
> Jamie Lokier wrote:
> >By durable, I mean that fsync() should actually commit writes to
> >physical stable storage,
> 
> Yes, it should.

Glad we agree :-)

> >I was surprised that fsync() doesn't do this already.  There was a lot
> >of effort put into block I/O write barriers during 2.5, so that
> >journalling filesystems can force correct write ordering, using disk
> >flush cache commands.
> >
> >After all that effort, I was very surprised to notice that Linux 2.6.x
> >doesn't use that capability to ensure fsync() flushes the disk cache
> >onto stable storage.
> 
> It's surprising you are surprised, given that this [lame] fsync behavior 
> has remaining consistently lame throughout Linux's history.

I was surprised because of the effort put into IDE write barriers to
get it right for in-kernel filesystems, and the messages in 2004
telling concerned users that fsync would use barriers in 2.6, which it
does sometimes but not always.

> [snip huge long proposal]
> 
> Rather than invent new APIs, we should fix the existing ones to _really_ 
> flush data to physical media.
>
> Linux should default to SAFE data storage, and permit users to retain 
> the older unsafe behavior via an option.  It's completely ridiculous 
> that we default to an unsafe fsync.

Well, I agree with you.  Which is why the "new API" I suggested, being
really just an extension of an existing one, allows fsync() to be SAFE
if that's what people want.

To be fair, fsync() is rather overkill for some apps.
sync_file_range() is obviously the right place for fine tuning "less
safe" variations.

> And [anticipating a common response from others] it is completely 
> irrelevant that POSIX fsync(2) permits Linux's current behavior.  The 
> current behavior is unsafe.
> 
> Safety before performance -- ESPECIALLY when it comes to storing user data.

Especially now that people work a lot in guest VMs, where the IDE
barrier stuff doesn't work if the host fdatasync() doesn't work.

Since it happened with Mac OS X, I wouldn't be surprised if changing
fsync() and just that wasn't popular.  Heck, you already get people
asking "how to turn off fsync in PostgreSQL"...  (Haven't those people
heard of transactions...?)

But with changes to sync_file_range() [or whatever... I don't care] to
support database's finely tuned commit needs, and then adoption of
that by database vendors, perhaps nobody will mind fsync() becoming
safe then.

Nobody seems bothered by its performance for other things.

-- Jamie


* Re: Proposal for "proper" durable fsync() and fdatasync()
  2008-02-26  7:43 ` Andrew Morton
@ 2008-02-26  7:59   ` Jamie Lokier
  2008-02-26  9:16     ` Nick Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Jamie Lokier @ 2008-02-26  7:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-fsdevel, Chris Wedgwood

Andrew Morton wrote:
> On Tue, 26 Feb 2008 07:26:50 +0000 Jamie Lokier <jamie@shareable.org> wrote:
> 
> > (It would be nicer if sync_file_range()
> > took a vector of ranges for better elevator scheduling, but let's
> > ignore that :-)
> 
> Two passes:
> 
> Pass 1: shove each of the segments into the queue with
>         SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE
> 
> Pass 2: wait for them all to complete and return accumulated result
>         with SYNC_FILE_RANGE_WAIT_AFTER

Thanks.

Seems ok, though being able to cork the I/O until the last one would
be a bonus (like TCP_CORK or MSG_MORE...  SYNC_FILE_RANGE_MORE?)

I'm imagining I'd omit the SYNC_FILE_RANGE_WAIT_BEFORE.  Is there a
reason why you have it there?  The man page isn't very enlightening.

-- Jamie


* Re: Proposal for "proper" durable fsync() and fdatasync()
  2008-02-26  7:59   ` Jamie Lokier
@ 2008-02-26  9:16     ` Nick Piggin
  2008-02-26 14:09       ` Jörn Engel
  2008-02-26 16:43       ` Jeff Garzik
  0 siblings, 2 replies; 22+ messages in thread
From: Nick Piggin @ 2008-02-26  9:16 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Andrew Morton, linux-kernel, linux-fsdevel, Chris Wedgwood

On Tuesday 26 February 2008 18:59, Jamie Lokier wrote:
> Andrew Morton wrote:
> > > On Tue, 26 Feb 2008 07:26:50 +0000 Jamie Lokier <jamie@shareable.org> wrote:
> > > (It would be nicer if sync_file_range()
> > > took a vector of ranges for better elevator scheduling, but let's
> > > ignore that :-)
> >
> > Two passes:
> >
> > Pass 1: shove each of the segments into the queue with
> >         SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE
> >
> > Pass 2: wait for them all to complete and return accumulated result
> >         with SYNC_FILE_RANGE_WAIT_AFTER
>
> Thanks.
>
> Seems ok, though being able to cork the I/O until the last one would
> be a bonus (like TCP_MORE...  SYNC_FILE_RANGE_MORE?)
>
> I'm imagining I'd omit the SYNC_FILE_RANGE_WAIT_BEFORE.  Is there a
> reason why you have it there?  The man page isn't very enlightening.


Yeah, sync_file_range has slightly unusual semantics and introduces
a new concept, "writeout", to userspace (does "writeout" include
"in drive cache"? The kernel doesn't think so, but the only way to
make sync_file_range "safe" is if you do consider it writeout).

If it makes it any easier to understand, we can add in
SYNC_FILE_ASYNC, SYNC_FILE_SYNC parts that just deal with the
safe/unsafe and sync/async semantics that are part of the normal
POSIX API.

Anyway, the idea of making fsync/fdatasync etc. safe by default is
a good idea IMO, and it is a bad bug that we don't do that :(



* Re: Proposal for "proper" durable fsync() and fdatasync()
  2008-02-26  7:43 ` Jeff Garzik
  2008-02-26  7:55   ` Jamie Lokier
@ 2008-02-26  9:25   ` Jamie Lokier
  2008-02-26 12:13   ` Ric Wheeler
  2 siblings, 0 replies; 22+ messages in thread
From: Jamie Lokier @ 2008-02-26  9:25 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: linux-kernel, linux-fsdevel, Chris Wedgwood

Jeff Garzik wrote:
> [snip huge long proposal]
> 
> Rather than invent new APIs, we should fix the existing ones to _really_ 
> flush data to physical media.

Btw, one reason for the length is the current block request API isn't
sufficient even to make fsync() durable with _no_ new APIs.

It offers ordering barriers only, which aren't enough.  I tried to
explain, discuss some changes and then suggest optimisations.

-- Jamie


* Re: Proposal for "proper" durable fsync() and fdatasync()
  2008-02-26  7:43 ` Jeff Garzik
  2008-02-26  7:55   ` Jamie Lokier
  2008-02-26  9:25   ` Jamie Lokier
@ 2008-02-26 12:13   ` Ric Wheeler
  2008-02-26 15:43     ` Jamie Lokier
  2 siblings, 1 reply; 22+ messages in thread
From: Ric Wheeler @ 2008-02-26 12:13 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Jamie Lokier, linux-kernel, linux-fsdevel, Chris Wedgwood

Jeff Garzik wrote:
> Jamie Lokier wrote:
>> By durable, I mean that fsync() should actually commit writes to
>> physical stable storage,
> 
> Yes, it should.
> 
> 
>> I was surprised that fsync() doesn't do this already.  There was a lot
>> of effort put into block I/O write barriers during 2.5, so that
>> journalling filesystems can force correct write ordering, using disk
>> flush cache commands.
>>
>> After all that effort, I was very surprised to notice that Linux 2.6.x
>> doesn't use that capability to ensure fsync() flushes the disk cache
>> onto stable storage.
> 
> It's surprising you are surprised, given that this [lame] fsync behavior 
> has remained consistently lame throughout Linux's history.

Maybe I am confused, but isn't this what fsync() does today whenever 
barriers are enabled (the fsync() invalidates the drive's write cache)?

ric


* Re: Proposal for "proper" durable fsync() and fdatasync()
  2008-02-26  9:16     ` Nick Piggin
@ 2008-02-26 14:09       ` Jörn Engel
  2008-02-26 15:07         ` Jamie Lokier
  2008-02-26 15:28         ` Jamie Lokier
  2008-02-26 16:43       ` Jeff Garzik
  1 sibling, 2 replies; 22+ messages in thread
From: Jörn Engel @ 2008-02-26 14:09 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jamie Lokier, Andrew Morton, linux-kernel, linux-fsdevel, Chris Wedgwood

On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote:
> 
> Yeah, sync_file_range has slightly unusual semantics and introduces
> the new concept, "writeout", to userspace (does "writeout" include
> "in drive cache"? the kernel doesn't think so, but the only way to
> make sync_file_range "safe" is if you do consider it writeout).

If sync_file_range isn't safe, it should get replaced by a noop
implementation.  There really is no point in promising "a little"
safety.

One interesting aspect of this comes with COW filesystems like btrfs or
logfs.  Writing out data pages is not sufficient, because those will get
lost unless their referencing metadata is written as well.  So either we
have to call fsync for those filesystems or add another callback and let
filesystems override the default implementation.

Jörn

-- 
There is no worse hell than that provided by the regrets
for wasted opportunities.
-- Andre-Louis Moreau in Scaramouche


* Re: Proposal for "proper" durable fsync() and fdatasync()
  2008-02-26 14:09       ` Jörn Engel
@ 2008-02-26 15:07         ` Jamie Lokier
  2008-02-26 16:27           ` Andrew Morton
  2008-02-26 15:28         ` Jamie Lokier
  1 sibling, 1 reply; 22+ messages in thread
From: Jamie Lokier @ 2008-02-26 15:07 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Nick Piggin, Andrew Morton, linux-kernel, linux-fsdevel, Chris Wedgwood

Jörn Engel wrote:
> On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote:
> > 
> > Yeah, sync_file_range has slightly unusual semantics and introduces
> > the new concept, "writeout", to userspace (does "writeout" include
> > "in drive cache"? the kernel doesn't think so, but the only way to
> > make sync_file_range "safe" is if you do consider it writeout).
> 
> If sync_file_range isn't safe, it should get replaced by a noop
> implementation.  There really is no point in promising "a little"
> safety.
> 
> One interesting aspect of this comes with COW filesystems like btrfs or
> logfs.  Writing out data pages is not sufficient, because those will get
> lost unless their referencing metadata is written as well.  So either we
> have to call fsync for those filesystems or add another callback and let
> filesystems override the default implementation.

fdatasync() is required to write data pages _and_ the necessary
metadata to reference those changed pages (btrfs tree etc.), but not
non-data metadata.

It's the filesystem's responsibility to interpret that correctly.
In-place writes don't need anything else.  Phase-tree style writes do.
Some kinds of logged writes don't.

I'm under the impression that sync_file_range() is a sort of
restricted-range asynchronous fdatasync().

By limiting the range of file data which must be written out, it
becomes more refined for database and filesystem-in-a-file type
applications.  Just as fsync() is more refined than sync() - it's
useful to sync less - same goes for syncing just part of a file.

It's still the filesystem's responsibility to sync data access
metadata appropriately.  It can sync more if it wants, but not less.

That's what I understand by
   sync_file_range(fd, start, length, SYNC_FILE_RANGE_WAIT_BEFORE
                   | SYNC_FILE_RANGE_WRITE
                   | SYNC_FILE_RANGE_WAIT_AFTER);
Largely because the manual says to use that combination of flags for
an equivalent to fdatasync().
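
As a concrete reading of that flag combination, here is a minimal
sketch in C.  The helper name sync_range_blocking() is made up for
illustration; sync_file_range() and its three flags are the real Linux
interface (glibc, with _GNU_SOURCE).  As discussed in this thread, the
call is not expected to flush the drive's write cache, so treat this as
"write-out has completed", not as a durability guarantee.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait for any write-out already in flight on
 * [offset, offset + nbytes), start write-out of the dirty pages in that
 * range, then wait for that write-out to finish.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int sync_range_blocking(int fd, off_t offset, off_t nbytes)
{
	assert(offset >= 0 && nbytes >= 0);
	return sync_file_range(fd, offset, nbytes,
			       SYNC_FILE_RANGE_WAIT_BEFORE |
			       SYNC_FILE_RANGE_WRITE |
			       SYNC_FILE_RANGE_WAIT_AFTER);
}
```

A caller would typically pwrite() a record and then call
sync_range_blocking(fd, record_offset, record_length) on just that
range.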

The concept of "write-out" is not defined in the manual.  I'm assuming
it to mean this, as a reasonable guess:

SYNC_FILE_RANGE_WRITE scans all pages in the range, looking for dirty
pages which aren't already queued for write-out.  It marks those with
a "write-out" flag, and starts write I/Os at some unspecified time in
the near future; it can be assumed writes for all the pages will
complete eventually if there's no errors.  When I/O completes on a
page, it cleans the page and also clears the write-out flag.

SYNC_FILE_RANGE_WAIT_AFTER waits until all pages in the range don't
have the "write-out" flag set.

SYNC_FILE_RANGE_WAIT_BEFORE does the same wait, but before marking
pages for write-out.  I don't actually see the point in this.  Isn't a
preceding call with SYNC_FILE_RANGE_WAIT_AFTER equivalent, making
BEFORE a redundant flag?

The manual says it is something to do with data-integrity, but it's
not clear to me what that means.

All this implies that "write-out" flag is a concept userspace can rely
on.  That's not so peculiar: WRITE seems to be equivalent to AIO-style
fdatasync() on a limited range of offsets, and WAIT_AFTER seems to be
equivalent to waiting for any previously issued such ops to complete.

Any data access metadata updates that btrfs must make for fdatasync(),
it must also make for sync_file_range(), for the limited range of
offsets.

-- Jamie


* Re: Proposal for "proper" durable fsync() and fdatasync()
  2008-02-26 14:09       ` Jörn Engel
  2008-02-26 15:07         ` Jamie Lokier
@ 2008-02-26 15:28         ` Jamie Lokier
  2008-02-26 17:02           ` Jörn Engel
  1 sibling, 1 reply; 22+ messages in thread
From: Jamie Lokier @ 2008-02-26 15:28 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Nick Piggin, Andrew Morton, linux-kernel, linux-fsdevel, Chris Wedgwood

Jörn Engel wrote:
> On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote:
> > Yeah, sync_file_range has slightly unusual semantics and introduces
> > the new concept, "writeout", to userspace (does "writeout" include
> > "in drive cache"? the kernel doesn't think so, but the only way to
> > make sync_file_range "safe" is if you do consider it writeout).
> 
> If sync_file_range isn't safe, it should get replaced by a noop
> implementation.  There really is no point in promising "a little"
> safety.

Sometimes there is a point in "a little" safety.

There's a spectrum of durability (meaning how safely stored the data
is).  In the cases we're imagining, it's application -> main memory
cache -> disk cache -> disk surface.  There are others.

_None_ of those provide perfect safety for your data.  They are a
spectrum, and how far along you want data to be committed before you
say "fine, the data is safe enough for me" depends on your application.

For example, there are users who like to turn _off_ fdatasync() with
their SQL database of choice.  They prefer speed over safety, and they
don't mind losing an hour's data and doing regular backups (we assume
;-).  Some blogs fall into this category; who cares if a rare crash
costs you a comment or two and a restore from backup; it's acceptable
for the speed.

There's users who would really like fdatasync() to commit data to the
drive platters, so after their database says "done", they are very
confident that a power failure won't cause committed data to be lost.
Accepting credit cards is more at this end.  So should be anyone using
a virtual machine of any kind without a journalling fs in the guest!

And there's users who like it where it is right now: a compromise,
where a system crash won't lose committed data; but a power failure
might.  (I'm making assumptions about drive behaviour on reset here.)

My problem with fdatasync() at the moment is, I can't choose what I
want from it, and there's no mechanism to give me the safest option.
Most annoyingly, in-kernel filesystems _do_ have a mechanism; it just
isn't exported to userspace.

(A quick aside: fdatasync() et al. are actually used for two
_different_ things.  1: Durability: when a program says "I've written
it", it can say so with confidence, e.g. announcing email receipt.
2: Write ordering, as in write-ahead logging: write, fdatasync, write.
When you tease at the details, efficient implementations of them are
different...  Think SCSI tagged commands versus cache flushes.)
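
To illustrate the second use (ordering for write-ahead logging), here
is a minimal sketch in C.  The function name wal_update() and the
two-file layout (a log file plus a data file) are assumptions made for
the example, not taken from any particular database.  fdatasync() is
used here purely as an ordering point between the two writes; whether
it also reaches the platters is exactly the question in this thread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical write-ahead-logging step:
 *   1. append the record to the log,
 *   2. fdatasync() the log, so the record is written out first,
 *   3. only then overwrite the data file in place.
 * Returns 0 on success, -1 on failure.
 */
static int wal_update(int log_fd, int data_fd, off_t data_off,
		      const void *buf, size_t len)
{
	assert(buf != NULL);
	if (write(log_fd, buf, len) != (ssize_t)len)
		return -1;
	if (fdatasync(log_fd) != 0)	/* the ordering point */
		return -1;
	if (pwrite(data_fd, buf, len, data_off) != (ssize_t)len)
		return -1;
	return 0;
}
```

Note that nothing here waits for the in-place write; a later
fdatasync(data_fd) would be needed before the log record can be
discarded.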

> One interesting aspect of this comes with COW filesystems like btrfs or
> logfs.  Writing out data pages is not sufficient, because those will get
> lost unless their referencing metadata is written as well.  So either we
> have to call fsync for those filesystems or add another callback and let
> filesystems override the default implementation.

Doesn't the ->fsync callback get called in the sys_fdatasync() case,
with appropriate arguments?

With barriers/flushes it certainly makes those a bit more complicated.
You have to flush not just the disks with data pages, but the _other_
disks in a software RAID with data pointer metadata pages, but ideally
not all of them (think database journal commit).

That can be implemented with per-buffer pending-barrier/flush flags
(like I described for pages in the first mail), which are equally
useful when a database-like application uses a block device.

-- Jamie


* Re: Proposal for "proper" durable fsync() and fdatasync()
  2008-02-26 12:13   ` Ric Wheeler
@ 2008-02-26 15:43     ` Jamie Lokier
  2008-11-24 21:10       ` Sachin Gaikwad
  0 siblings, 1 reply; 22+ messages in thread
From: Jamie Lokier @ 2008-02-26 15:43 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Jeff Garzik, linux-kernel, linux-fsdevel, Chris Wedgwood

Ric Wheeler wrote:
> >>I was surprised that fsync() doesn't do this already.  There was a lot
> >>of effort put into block I/O write barriers during 2.5, so that
> >>journalling filesystems can force correct write ordering, using disk
> >>flush cache commands.
> >>
> >>After all that effort, I was very surprised to notice that Linux 2.6.x
> >>doesn't use that capability to ensure fsync() flushes the disk cache
> >>onto stable storage.
> >
> >It's surprising you are surprised, given that this [lame] fsync behavior 
> >has remained consistently lame throughout Linux's history.
> 
> Maybe I am confused, but isn't this what fsync() does today whenever 
> barriers are enabled (the fsync() invalidates the drive's write cache)?

No, fsync() doesn't always flush the drive's write cache.  It often
does, and I think many people are under the impression it always does,
but it doesn't.

Try this code on ext3:

	fd = open ("test_file", O_RDWR | O_CREAT | O_TRUNC, 0666);
	while (1) {
		char byte;
		usleep (100000);
		pwrite (fd, &byte, 1, 0);
		fsync (fd);
	}

It will do just over 10 write ops per second on an idle system (13 on
mine), and 1 flush op per second.

That's because ext3 fsync() only does a journal commit when the inode
has changed.  The inode mtime is changed by write only with 1 second
granularity.  Without a journal commit, there's no barrier, which
translates to not flushing disk write cache.

If you add "fchmod (fd, 0644); fchmod (fd, 0664);" between the write
and fsync, you'll see at least 20 write ops and 20 flush ops per
second, and you'll hear the disk seeking more.  That's because the
fchmod dirties the inode, so fsync() writes the inode with a journal
commit.
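
For anyone wanting to repeat the experiment, a self-contained sketch of
both variants follows.  The function name fsync_loop() and its
parameters are invented for this example; the system calls are the same
ones used in the fragment above.  The write and flush rates still have
to be observed externally, since the code only drives the I/O.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: repeat "sleep, overwrite byte 0, fsync" for
 * `rounds` iterations.  If dirty_inode is non-zero, toggle the file
 * mode before each fsync() so that the inode is dirty every time.
 * Returns the number of rounds that completed without error.
 */
static int fsync_loop(int fd, int rounds, useconds_t delay_us,
		      int dirty_inode)
{
	const char byte = 0;
	int done = 0;

	assert(rounds >= 0);
	for (int i = 0; i < rounds; i++) {
		usleep(delay_us);
		if (pwrite(fd, &byte, 1, 0) != 1)
			break;
		if (dirty_inode &&
		    (fchmod(fd, 0644) != 0 || fchmod(fd, 0664) != 0))
			break;
		if (fsync(fd) != 0)
			break;
		done++;
	}
	return done;
}
```

Opening "test_file" with O_RDWR | O_CREAT | O_TRUNC and calling
fsync_loop(fd, 100, 100000, 0) corresponds to the first test;
passing 1 as the last argument corresponds to the fchmod variant.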

It turns out even _that_ is not sufficient according to the kernel
internals.  A journal commit uses an ordered request, which isn't the
same as a flush potentially, it just happens to use flush in this
instance.  I'm not sure if ordered requests are actually implemented
by any drivers at the moment.  If not now, they will be one day.

We could change ext3 fsync() to always do a journal commit, and depend
on the non-existence of block drivers which do ordered (not flush)
barrier requests.  But there's lots of things wrong with that.  Not
least, it sucks performance for database-like applications and virtual
machines, a lot due to unnecessary seeks.  That way lies wrongness.

Rightness is to make fdatasync() work well, with a genuine flush (or
equivalent (see FUA), only when required, and not a mere ordered
barrier), no inode write, and to make sync_file_range()[*] offer the
fancier applications finer controls which reflect what they actually
need.

[*] - or whatever.

-- Jamie


* Re: Proposal for "proper" durable fsync() and fdatasync()
  2008-02-26 15:07         ` Jamie Lokier
@ 2008-02-26 16:27           ` Andrew Morton
  0 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2008-02-26 16:27 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Jörn Engel, Nick Piggin, linux-kernel, linux-fsdevel,
	Chris Wedgwood

On Tue, 26 Feb 2008 15:07:45 +0000 Jamie Lokier <jamie@shareable.org> wrote:

> SYNC_FILE_RANGE_WRITE scans all pages in the range, looking for dirty
> pages which aren't already queued for write-out.  It marks those with
> a "write-out" flag, and starts write I/Os at some unspecified time in
> the near future; it can be assumed writes for all the pages will
> complete eventually if there's no errors.  When I/O completes on a
> page, it cleans the page and also clears the write-out flag.
> 
> SYNC_FILE_RANGE_WAIT_AFTER waits until all pages in the range don't
> have the "write-out" flag set.
> 
> SYNC_FILE_RANGE_WAIT_BEFORE does the same wait, but before marking
> pages for write-out.  I don't actually see the point in this.  Isn't a
> preceding call with SYNC_FILE_RANGE_WAIT_AFTER equivalent, making
> BEFORE a redundant flag?

Consider the case of pages which are dirty but are already under writeout. 
ie: someone redirtied the page after someone started writing the page out. 
For these pages the kernel needs to

a) wait for the current writeout to complete

b) start new writeout

c) wait for that writeout to complete.

Those are the three stages of sync_file_range().  They are independently
selectable and various combinations provide various results.

The reason for providing b) only (SYNC_FILE_RANGE_WRITE) is so that
userspace can get as much data into the queue as possible, to permit the
kernel to optimise IO scheduling better.

If you perform a) and b) together
(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE) then you are guaranteed
that all data which was dirty when sync_file_range() executed will be sent
into the queue, but you won't get as much data into the queue if the kernel
encounters dirty, under-writeout pages.  This is especially hurtful if
you're trying to feed a lot of little segments into the queue.  In that
case perhaps userspace should do an asynchronous pass
(SYNC_FILE_RANGE_WRITE) to stuff as much data as possible into the queue,
then a SYNC_FILE_RANGE_WAIT_AFTER pass, then a
SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER
pass to clean up any stragglers.  Which mode is best very much depends on
the application's file dirtying patterns.  One would have to experiment
with it, and tuning of sync_file_range() usage would occur alongside tuning
of the application's write() design.
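
The three-pass scheme for many small segments could look roughly like
the sketch below.  struct segment, sync_pass() and sync_segments() are
names invented for this example; only sync_file_range() and its flags
are real.  Whether three passes actually beat a single combined pass
depends on the dirtying pattern, as noted above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

struct segment {		/* hypothetical: one byte range of the file */
	off_t offset;
	off_t nbytes;
};

/* Apply the same flags to every segment; stop at the first failure. */
static int sync_pass(int fd, const struct segment *segs, size_t n,
		     unsigned int flags)
{
	for (size_t i = 0; i < n; i++)
		if (sync_file_range(fd, segs[i].offset, segs[i].nbytes,
				    flags) != 0)
			return -1;
	return 0;
}

/* Returns 0 if every pass succeeded, -1 otherwise (errno is set). */
static int sync_segments(int fd, const struct segment *segs, size_t n)
{
	assert(segs != NULL || n == 0);
	/* Pass 1, stage b) only: queue as much as possible, never block. */
	if (sync_pass(fd, segs, n, SYNC_FILE_RANGE_WRITE) != 0)
		return -1;
	/* Pass 2, stage c) only: wait for what pass 1 queued. */
	if (sync_pass(fd, segs, n, SYNC_FILE_RANGE_WAIT_AFTER) != 0)
		return -1;
	/* Pass 3, stages a) + b) + c): pick up pages that were redirtied
	 * while they were already under write-out. */
	return sync_pass(fd, segs, n,
			 SYNC_FILE_RANGE_WAIT_BEFORE |
			 SYNC_FILE_RANGE_WRITE |
			 SYNC_FILE_RANGE_WAIT_AFTER);
}
```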

It's an interesting problem, with potentially high payback.


* Re: Proposal for "proper" durable fsync() and fdatasync()
  2008-02-26  9:16     ` Nick Piggin
  2008-02-26 14:09       ` Jörn Engel
@ 2008-02-26 16:43       ` Jeff Garzik
  2008-02-26 17:00         ` Jamie Lokier
  1 sibling, 1 reply; 22+ messages in thread
From: Jeff Garzik @ 2008-02-26 16:43 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jamie Lokier, Andrew Morton, linux-kernel, linux-fsdevel, Chris Wedgwood

Nick Piggin wrote:
> Anyway, the idea of making fsync/fdatasync etc. safe by default is
> a good idea IMO, and is a bad bug that we don't do that :(

Agreed...  it's also disappointing that [unless I'm mistaken] you have 
to hack each filesystem to support barriers.

It seems far easier to make sync_blkdev() Do The Right Thing, and 
magically make all filesystems data-safe.

	Jeff




* Re: Proposal for "proper" durable fsync() and fdatasync()
  2008-02-26 16:43       ` Jeff Garzik
@ 2008-02-26 17:00         ` Jamie Lokier
  2008-02-26 17:54           ` Jeff Garzik
  0 siblings, 1 reply; 22+ messages in thread
From: Jamie Lokier @ 2008-02-26 17:00 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Nick Piggin, Andrew Morton, linux-kernel, linux-fsdevel, Chris Wedgwood

Jeff Garzik wrote:
> Nick Piggin wrote:
> >Anyway, the idea of making fsync/fdatasync etc. safe by default is
> >a good idea IMO, and is a bad bug that we don't do that :(
> 
> Agreed...  it's also disappointing that [unless I'm mistaken] you have 
> to hack each filesystem to support barriers.
> 
> It seems far easier to make sync_blkdev() Do The Right Thing, and 
> magically make all filesystems data-safe.

Well, you need ordered metadata writes, barriers _and_ flushes with
some filesystems.

Merely writing all the data pages then issuing a drive cache flush
won't Do The Right Thing with those filesystems - someone already
mentioned Btrfs, where it won't.

But I agree that your suggestion would make a superb default, for
filesystems which don't provide their own function.

It's not optimal even then.

  Devices: On a software RAID, you ideally don't want to issue flushes
  to all drives if your database did a 1 block commit entry.  (But they
  probably use O_DIRECT anyway, changing the rules again).  But all that
  can be optimised in generic VFS code eventually.  It doesn't need
  filesystem assistance in most cases.

  Apps: don't always want a full flush; sometimes a barrier would do.

-- Jamie


* Re: Proposal for "proper" durable fsync() and fdatasync()
  2008-02-26 15:28         ` Jamie Lokier
@ 2008-02-26 17:02           ` Jörn Engel
  2008-02-26 17:29             ` Jamie Lokier
  0 siblings, 1 reply; 22+ messages in thread
From: Jörn Engel @ 2008-02-26 17:02 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Jörn Engel, Nick Piggin, Andrew Morton, linux-kernel,
	linux-fsdevel, Chris Wedgwood

On Tue, 26 February 2008 15:28:10 +0000, Jamie Lokier wrote:
> 
> > One interesting aspect of this comes with COW filesystems like btrfs or
> > logfs.  Writing out data pages is not sufficient, because those will get
> > lost unless their referencing metadata is written as well.  So either we
> > have to call fsync for those filesystems or add another callback and let
> > filesystems override the default implementation.
> 
> Doesn't the ->fsync callback get called in the sys_fdatasync() case,
> with appropriate arguments?

My paragraph above was aimed at the sync_file_range() case.  fsync and
fdatasync do the right thing within the limitations you brought up in
this thread.  sync_file_range() without further changes will only write
data pages, not the metadata required to actually access those data
pages.  This works just fine for non-COW filesystems, which covers all
currently merged ones.

With COW filesystems it is currently impossible to do sync_file_range()
properly.  The problem is orthogonal to yours, I just brought it up
since you were already mentioning sync_file_range().


Jörn

-- 
Joern's library part 10:
http://blogs.msdn.com/David_Gristwood/archive/2004/06/24/164849.aspx


* Re: Proposal for "proper" durable fsync() and fdatasync()
  2008-02-26 17:02           ` Jörn Engel
@ 2008-02-26 17:29             ` Jamie Lokier
  2008-02-26 17:38               ` Jörn Engel
  0 siblings, 1 reply; 22+ messages in thread
From: Jamie Lokier @ 2008-02-26 17:29 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Nick Piggin, Andrew Morton, linux-kernel, linux-fsdevel, Chris Wedgwood

Jörn Engel wrote:
> On Tue, 26 February 2008 15:28:10 +0000, Jamie Lokier wrote:
> > 
> > > One interesting aspect of this comes with COW filesystems like btrfs or
> > > logfs.  Writing out data pages is not sufficient, because those will get
> > > lost unless their referencing metadata is written as well.  So either we
> > > have to call fsync for those filesystems or add another callback and let
> > > filesystems override the default implementation.
> > 
> > Doesn't the ->fsync callback get called in the sys_fdatasync() case,
> > with appropriate arguments?
> 
> My paragraph above was aimed at the sync_file_range() case.  fsync and
> fdatasync do the right thing within the limitations you brought up in
> this thread.  sync_file_range() without further changes will only write
> data pages, not the metadata required to actually access those data
> pages.  This works just fine for non-COW filesystems, which covers all
> currently merged ones.
> 
> With COW filesystems it is currently impossible to do sync_file_range()
> properly.  The problem is orthogonal to yours, I just brought it up
> since you were already mentioning sync_file_range().

You're right.  Though, doesn't normal page writeback enqueue the COW
metadata changes?  If not, how do they get written in a timely
fashion?

-- Jamie


* Re: Proposal for "proper" durable fsync() and fdatasync()
  2008-02-26 17:29             ` Jamie Lokier
@ 2008-02-26 17:38               ` Jörn Engel
  0 siblings, 0 replies; 22+ messages in thread
From: Jörn Engel @ 2008-02-26 17:38 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Jörn Engel, Nick Piggin, Andrew Morton, linux-kernel,
	linux-fsdevel, Chris Wedgwood

On Tue, 26 February 2008 17:29:13 +0000, Jamie Lokier wrote:
> 
> You're right.  Though, doesn't normal page writeback enqueue the COW
> metadata changes?  If not, how do they get written in a timely
> fashion?

It does.  But this is not sufficient to guarantee that the pages in
question have been safely committed to the device by the time
sync_file_range() has returned.

Jörn

-- 
Joern's library part 5:
http://www.faqs.org/faqs/compression-faq/part2/section-9.html


* Re: Proposal for "proper" durable fsync() and fdatasync()
  2008-02-26 17:00         ` Jamie Lokier
@ 2008-02-26 17:54           ` Jeff Garzik
  2008-02-27 14:16             ` Jamie Lokier
  0 siblings, 1 reply; 22+ messages in thread
From: Jeff Garzik @ 2008-02-26 17:54 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Nick Piggin, Andrew Morton, linux-kernel, linux-fsdevel, Chris Wedgwood

Jamie Lokier wrote:
> Jeff Garzik wrote:
>> Nick Piggin wrote:
>>> Anyway, the idea of making fsync/fdatasync etc. safe by default is
>>> a good idea IMO, and is a bad bug that we don't do that :(
>> Agreed...  it's also disappointing that [unless I'm mistaken] you have 
>> to hack each filesystem to support barriers.
>>
>> It seems far easier to make sync_blkdev() Do The Right Thing, and 
>> magically make all filesystems data-safe.
> 
> Well, you need ordered metadata writes, barriers _and_ flushes with
> some filesystems.
> 
> Merely writing all the data pages then issuing a drive cache flush
> won't Do The Right Thing with those filesystems - someone already
> mentioned Btrfs, where it won't.

Oh certainly.  That's why we have a VFS :)  fsync for NFS will look 
quite different, too.


> But I agree that your suggestion would make a superb default, for
> filesystems which don't provide their own function.

Yep.  That would immediately cover a bunch of filesystems.


> It's not optimal even then.
> 
>   Devices: On a software RAID, you ideally don't want to issue flushes
>   to all drives if your database did a 1 block commit entry.  (But they
>   probably use O_DIRECT anyway, changing the rules again).  But all that
>   can be optimised in generic VFS code eventually.  It doesn't need
>   filesystem assistance in most cases.

My own idea is that we create a FLUSH command for blkdev request queues, 
to exist alongside READ, WRITE, and the current barrier implementation. 
  Then FLUSH could be passed down through MD or DM.

	Jeff




* Re: Proposal for "proper" durable fsync() and fdatasync()
  2008-02-26 17:54           ` Jeff Garzik
@ 2008-02-27 14:16             ` Jamie Lokier
  0 siblings, 0 replies; 22+ messages in thread
From: Jamie Lokier @ 2008-02-27 14:16 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Nick Piggin, Andrew Morton, linux-kernel, linux-fsdevel, Chris Wedgwood

Jeff Garzik wrote:
> >It's not optimal even then.
> >
> >  Devices: On a software RAID, you ideally don't want to issue flushes
> >  to all drives if your database did a 1 block commit entry.  (But they
> >  probably use O_DIRECT anyway, changing the rules again).  But all that
> >  can be optimised in generic VFS code eventually.  It doesn't need
> >  filesystem assistance in most cases.
> 
> My own idea is that we create a FLUSH command for blkdev request queues, 
> to exist alongside READ, WRITE, and the current barrier implementation. 
>  Then FLUSH could be passed down through MD or DM.

I like your thought, and it has the benefit of being simple.

My thought is very similar, but with (hopefully not premature...)
optimisations:

  - I would merge FLUSH with a preceding write in some cases,
    converting to an FUA-write command.  Probably the generic request
    queue is the best place to detect and merge.  This is so that
    userspace filesystems (including guest VMs) and databases can do
    journal commits with the same I/O sequence as in kernel
    filesystems.

  - I would create BARRIER too, so that a userspace API can ask for
    this weaker form of fsync, which may improve throughput of
    userspace journalling.  

  - I would include a sector range in FLUSH and BARRIER, for MD and DM
    to flush _only_ relevant sub-devices.  This may improve performance
    for journalling both kernel and userspace filesystems, as journal
    commits are often very small and hit one or two sub-devices in RAID.

  - I would ask the nice MD and DM people to take tag-barriers rather
    than flush-barriers on the input queue, converting to
    tag-barriers, flush-barriers and independent FLUSH on the
    sub-device queues according to sector ranges and subsequent
    writes.  It's not obvious, but my barrier proposal which started
    this thread is designed to support an efficient inter-sub-device
    flush-barrier when necessary, and single-sub-device tag-barrier
    when possible.
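
To make the sector-range idea in the third point concrete, here is a
toy model in C.  It assumes plain RAID-0 striping (chunks handed out
round-robin across the sub-devices) and is not based on the actual MD
or DM code; subdevs_to_flush() and its parameters are invented for
illustration.  Given the sector range carried by a FLUSH, it computes
which sub-devices would need to see that flush.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model: sectors are striped across `ndevs` sub-devices
 * in chunks of `chunk_sectors`.  Returns a bitmask with bit i set if
 * sub-device i holds any sector of [start, start + nsectors).
 */
static uint32_t subdevs_to_flush(uint64_t start, uint64_t nsectors,
				 uint64_t chunk_sectors, unsigned ndevs)
{
	uint32_t mask = 0;
	uint64_t first, last;

	assert(chunk_sectors > 0 && ndevs > 0 && ndevs <= 32);
	if (nsectors == 0)
		return 0;
	first = start / chunk_sectors;			/* first chunk touched */
	last = (start + nsectors - 1) / chunk_sectors;	/* last chunk touched */
	if (last - first + 1 >= ndevs)			/* range covers every device */
		return ndevs == 32 ? UINT32_MAX : (UINT32_C(1) << ndevs) - 1;
	for (uint64_t c = first; c <= last; c++)
		mask |= UINT32_C(1) << (c % ndevs);
	return mask;
}
```

With 128-sector chunks on four sub-devices, an 8-sector journal commit
at sector 0 touches only sub-device 0, so three of the four flushes
could be skipped in this model.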

-- Jamie


* Re: Proposal for "proper" durable fsync() and fdatasync()
  2008-02-26 15:43     ` Jamie Lokier
@ 2008-11-24 21:10       ` Sachin Gaikwad
  2008-11-25 10:17         ` Jamie Lokier
  0 siblings, 1 reply; 22+ messages in thread
From: Sachin Gaikwad @ 2008-11-24 21:10 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Ric Wheeler, Jeff Garzik, linux-kernel, linux-fsdevel, Chris Wedgwood

Hi Jamie,

On Tue, Feb 26, 2008 at 10:43 AM, Jamie Lokier <jamie@shareable.org> wrote:
> Ric Wheeler wrote:
>> >>I was surprised that fsync() doesn't do this already.  There was a lot
>> >>of effort put into block I/O write barriers during 2.5, so that
>> >>journalling filesystems can force correct write ordering, using disk
>> >>flush cache commands.
>> >>
>> >>After all that effort, I was very surprised to notice that Linux 2.6.x
>> >>doesn't use that capability to ensure fsync() flushes the disk cache
>> >>onto stable storage.
>> >
>> >It's surprising you are surprised, given that this [lame] fsync behavior
>> >has remained consistently lame throughout Linux's history.
>>
>> Maybe I am confused, but isn't this what fsync() does today whenever
>> barriers are enabled (the fsync() invalidates the drive's write cache)?
>
> No, fsync() doesn't always flush the drive's write cache.  It often
> does, and I think many people are under the impression it always does,
> but it doesn't.
>
> Try this code on ext3:
>
>        fd = open ("test_file", O_RDWR | O_CREAT | O_TRUNC, 0666);
>        while (1) {
>                char byte;
>                usleep (100000);
>                pwrite (fd, &byte, 1, 0);
>                fsync (fd);
>        }
>
> It will do just over 10 write ops per second on an idle system (13 on
> mine), and 1 flush op per second.

How did you measure write-ops and flush-ops ? Is there any tool which
can be used ? I tried looking at what CONFIG_BSD_PROCESS_ACCT
provides, but no luck.

Sachin

>
> That's because ext3 fsync() only does a journal commit when the inode
> has changed.  The inode mtime is changed by write only with 1 second
> granularity.  Without a journal commit, there's no barrier, which
> translates to not flushing disk write cache.
>
> If you add "fchmod (fd, 0644); fchmod (fd, 0664);" between the write
> and fsync, you'll see at least 20 write ops and 20 flush ops per
> second, and you'll hear the disk seeking more.  That's because the
> fchmod dirties the inode, so fsync() writes the inode with a journal
> commit.
>
> It turns out even _that_ is not sufficient according to the kernel
> internals.  A journal commit uses an ordered request, which isn't the
> same as a flush potentially, it just happens to use flush in this
> instance.  I'm not sure if ordered requests are actually implemented
> by any drivers at the moment.  If not now, they will be one day.
>
> We could change ext3 fsync() to always do a journal commit, and depend
> on the non-existence of block drivers which do ordered (not flush)
> barrier requests.  But there's lots of things wrong with that.  Not
> least, it sucks performance for database-like applications and virtual
> machines, a lot due to unnecessary seeks.  That way lies wrongness.
>
> Rightness is to make fdatasync() work well, with a genuine flush (or
> equivalent (see FUA), only when required, and not a mere ordered
> barrier), no inode write, and to make sync_file_range()[*] offer the
> fancier applications finer controls which reflect what they actually
> need.
>
> [*] - or whatever.
>
> -- Jamie


* Re: Proposal for "proper" durable fsync() and fdatasync()
  2008-11-24 21:10       ` Sachin Gaikwad
@ 2008-11-25 10:17         ` Jamie Lokier
  0 siblings, 0 replies; 22+ messages in thread
From: Jamie Lokier @ 2008-11-25 10:17 UTC (permalink / raw)
  To: Sachin Gaikwad
  Cc: Ric Wheeler, Jeff Garzik, linux-kernel, linux-fsdevel, Chris Wedgwood

Sachin Gaikwad wrote:
> > No, fsync() doesn't always flush the drive's write cache.  It often
> > does, and I think many people are under the impression it always does,
> > but it doesn't.
> >
> > Try this code on ext3:
> >
> >        fd = open ("test_file", O_RDWR | O_CREAT | O_TRUNC, 0666);
> >        while (1) {
> >                char byte = 0;
> >                usleep (100000);
> >                pwrite (fd, &byte, 1, 0);
> >                fsync (fd);
> >        }
> >
> > It will do just over 10 write ops per second on an idle system (13 on
> > mine), and 1 flush op per second.
> 
> How did you measure write ops and flush ops?  Is there any tool which
> can be used?  I tried looking at what CONFIG_BSD_PROCESS_ACCT
> provides, but no luck.

I don't remember; it was such a long time ago!

It probably involved looking at /sys/block/*/stat or something like that.
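
As a rough sketch of that approach (not necessarily how the original
numbers were obtained): the code below reads one /sys/block/<dev>/stat
line and picks out the completed-writes counter, which is the fifth
field.  The names parse_blk_stat(), read_blk_stat() and struct blk_stat
are made up for this illustration.  The flush counter is an assumption
about newer kernels only: as far as I know a 16th field holding
completed flushes was added much later, so the code treats it as
optional.

```c
#include <stdio.h>
#include <stdlib.h>

/* Counters of interest from one /sys/block/<dev>/stat line.
 * (Hypothetical helper type, for illustration only.) */
struct blk_stat {
    unsigned long long write_ios;  /* field 5: write I/Os completed */
    unsigned long long flush_ios;  /* field 16: flush I/Os, if present */
    int has_flush;                 /* 1 if the line had a 16th field */
};

/* Parse a stat line.  Returns 0 on success, -1 if it has fewer than
 * five numeric fields. */
int parse_blk_stat(const char *line, struct blk_stat *out)
{
    unsigned long long f[17];
    int n = 0;
    const char *p = line;
    char *end;

    while (n < 17) {
        unsigned long long v = strtoull(p, &end, 10);
        if (end == p)
            break;              /* no more numbers on the line */
        f[n++] = v;
        p = end;
    }
    if (n < 5)
        return -1;

    out->write_ios = f[4];
    /* Assumption: only newer kernels report a flush counter here. */
    out->has_flush = (n >= 16);
    out->flush_ios = out->has_flush ? f[15] : 0;
    return 0;
}

/* Read and parse /sys/block/<dev>/stat, e.g. dev = "sda".
 * Returns 0 on success, -1 on any error. */
int read_blk_stat(const char *dev, struct blk_stat *out)
{
    char path[256], line[512];
    FILE *fp;

    snprintf(path, sizeof path, "/sys/block/%s/stat", dev);
    fp = fopen(path, "r");
    if (!fp)
        return -1;
    if (!fgets(line, sizeof line, fp)) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    return parse_blk_stat(line, out);
}
```

Calling read_blk_stat() twice, one second apart, while the test loop
runs and subtracting the two write_ios values gives write ops per
second for that device.  On a kernel without the flush field, flushes
have to be observed some other way.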

You might find the "blktrace" tool does what you want.

-- Jamie

Thread overview: 22+ messages
2008-02-26  7:26 Proposal for "proper" durable fsync() and fdatasync() Jamie Lokier
2008-02-26  7:43 ` Andrew Morton
2008-02-26  7:59   ` Jamie Lokier
2008-02-26  9:16     ` Nick Piggin
2008-02-26 14:09       ` Jörn Engel
2008-02-26 15:07         ` Jamie Lokier
2008-02-26 16:27           ` Andrew Morton
2008-02-26 15:28         ` Jamie Lokier
2008-02-26 17:02           ` Jörn Engel
2008-02-26 17:29             ` Jamie Lokier
2008-02-26 17:38               ` Jörn Engel
2008-02-26 16:43       ` Jeff Garzik
2008-02-26 17:00         ` Jamie Lokier
2008-02-26 17:54           ` Jeff Garzik
2008-02-27 14:16             ` Jamie Lokier
2008-02-26  7:43 ` Jeff Garzik
2008-02-26  7:55   ` Jamie Lokier
2008-02-26  9:25   ` Jamie Lokier
2008-02-26 12:13   ` Ric Wheeler
2008-02-26 15:43     ` Jamie Lokier
2008-11-24 21:10       ` Sachin Gaikwad
2008-11-25 10:17         ` Jamie Lokier
