LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
From: Jens Axboe <jens.axboe@oracle.com>
To: Robert Hancock <hancockr@shaw.ca>
Cc: Ricardo Correia <rcorreia@wizy.org>,
	linux-kernel <linux-kernel@vger.kernel.org>
Subject: Re: How to flush the disk write cache from userspace
Date: Sun, 21 Jan 2007 14:50:58 +1100	[thread overview]
Message-ID: <20070121035058.GB4658@kernel.dk> (raw)
In-Reply-To: <45B0123C.3080506@shaw.ca>

On Thu, Jan 18 2007, Robert Hancock wrote:
> Ricardo Correia wrote:
> >On Tuesday 16 January 2007 00:38, you wrote:
> >>As always with these things, the devil is in the details. It requires
> >>the device to support a ->prepare_flush() queue hook, and not all
> >>devices do that. It will work for IDE/SATA/SCSI, though. In some devices
> >>you don't want/need to do a real disk flush, it depends on the write
> >>cache settings, battery backing, etc.
> >
> >Is there any chance that someone could implement this (I don't have the 
> >skills, unfortunately)? Maybe add a new ioctl() to block devices, so that 
> >it doesn't break any existing code?
> 
> I think we really should have support for doing cache flushes 
> automatically on fsync, etc. User space code should not have to worry 
> about this problem, it's pretty silly that for example MySQL has to 
> advise people to use hdparm -W 0 to disable the write cache on their IDE 
> drives in order to get proper data integrity guarantees - and disabling 
> the cache on IDE without command queueing really slaughters the 
> performance, unnecessarily in this case.

Completely agree. If you have barriers enabled in your filesystem, then
it should Just Work when you do fsync(). At least that is the case for
reiserfs and XFS, I'm not completely sure that ext3 also handles it
correctly.

For direct block device access, fsync() does need to provide a commit to
stable storage as well though.

> There may be some cases where the controller provides a battery-backed 
> cache and thus we don't want to actually force the controller to flush 
> everything out to the drive on fsync, so we may need to be able to 
> disable this, but these controllers may ignore flushes anyway. I know 
> IBM ServeRAID appears to fail requests for write cache info and so the 
> kernel assumes drive cache: write through and doesn't do any flushes.

That would be the preferable approach, just have the hardware that
doesn't need a flush ignore the FLUSH_CACHE. That would also need to
ignore the FUA bit on writes then. I'm not sure what the spec has to say
on this, basically the requirement is just that data is on stable
storage (eg survives power failure and so on), then that would be fine.
And I would hope it is, it'd be hard to specify anything else.

> >I believe it's a very useful (and relatively simple) feature that 
> >increases data integrity and reliability for applications that need this 
> >functionality.
> >
> >I think it must be considered that most people have disk write caches 
> >enabled and are using IDE, SATA or SCSI disks.
> >
> >I also think there's no point in disabling disks' write caches, since it 
> >slows writes and decreases disks' lifetime, and because there's a better 
> >solution.
> 
> Yes, ideally doing all writes to the drive with write cache enabled and 
> then flushing them out afterwards would be much more efficient (at least 
>  when no command queueing is involved) since the drive can choose what 
> order to complete the writes in.

That only works if you just care about the stream of writes going to
stable storage and don't care about ordering. But the above is
essentially how the barriers work on write back cache + non queued
devices. When the barrier write is received, we commit the previous
writes first with a flush and then write the barrier (followed by
another flush, or possibly not if we have FUA).

-- 
Jens Axboe


  reply	other threads:[~2007-01-21  4:03 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <fa.y+HJNAxqDqX5AHUxcmThAo20Ivo@ifi.uio.no>
     [not found] ` <fa.xbdrjhFpvWMJeTroG2DpPE4wd+M@ifi.uio.no>
     [not found]   ` <fa.lqQRZqIqMX2chyIAM888fc1jCuY@ifi.uio.no>
2007-01-19  0:35     ` Robert Hancock
2007-01-21  3:50       ` Jens Axboe [this message]
2007-01-14  4:05 Ricardo Correia
2007-01-16  0:38 ` Jens Axboe
2007-01-18  1:15   ` Ricardo Correia

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070121035058.GB4658@kernel.dk \
    --to=jens.axboe@oracle.com \
    --cc=hancockr@shaw.ca \
    --cc=linux-kernel@vger.kernel.org \
    --cc=rcorreia@wizy.org \
    --subject='Re: How to flush the disk write cache from userspace' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).