From: David Chinner <dgc@sgi.com>
To: Michael Rubin <mrubin@google.com>
Cc: David Chinner <dgc@sgi.com>, Fengguang Wu <wfg@mail.ustc.edu.cn>,
	a.p.zijlstra@chello.nl, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure
Date: Fri, 18 Jan 2008 19:54:07 +1100
Message-ID: <20080118085407.GV155259@sgi.com>
In-Reply-To: <532480950801172138x44e06780w2b15464845b626fc@mail.gmail.com>

On Thu, Jan 17, 2008 at 09:38:24PM -0800, Michael Rubin wrote:
> On Jan 17, 2008 9:01 PM, David Chinner <dgc@sgi.com> wrote:
> 
> First off thank you for the very detailed reply. This rocks and gives
> me much to think about.
> 
> > On Thu, Jan 17, 2008 at 01:07:05PM -0800, Michael Rubin wrote:
> > This seems suboptimal for large files. If you keep feeding in
> > new least recently dirtied files, the large files will never
> > get an unimpeded go at the disk and hence we'll struggle to
> > get decent bandwidth under anything but pure large file
> > write loads.
> 
> You're right. I understand now. I just changed a dial on my tests,
> ran it and found pdflush not keeping up like it should. I need to
> address this.
> 
> > Switching inodes during writeback implies a seek to the new write
> > location, while continuing to write the same inode has no seek
> > penalty because the writeback is sequential.  It follows from this
> > that allowing large files a disproportionate amount of data
> > writeback is desirable.
> >
> > Also, cycling rapidly through all the large files to write 4MB to each is
> > going to cause us to spend time seeking rather than writing compared
> > to cycling slower and writing 40MB from each large file at a time.
> >
> > i.e. servicing one large file for 100ms is going to result in higher
> > writeback throughput than servicing 10 large files for 10ms each
> > because there's going to be less seeking and more writing done by
> > the disks.
> >
> > That is, think of large file writes like process scheduler batch
> > jobs - bulk throughput is what matters, so the larger the time slice
> > you give them the higher the throughput.
> >
> > IMO, the sort of result we should be looking at is a
> > writeback design that results in cycling somewhat like:
> >
> >         slice 1: iterate over small files
> >         slice 2: flush large file 1
> >         slice 3: iterate over small files
> >         slice 4: flush large file 2
> >         ......
> >         slice n-1: flush large file N
> >         slice n: iterate over small files
> >         slice n+1: flush large file N+1
> >
> > So that we keep the disk busy with a relatively fair mix of
> > small and large I/Os while both are necessary.
> 
> I am getting where you are coming from. But if we are going to make
> changes to optimize for seeks maybe we need to be more aggressive in
> write back in how we organize both time and location. Right now AFAIK
> there is no attention to location in the writeback path.

True. But IMO, locality ordering really only impacts the small file
data writes and the inodes themselves, because there are typically
lots of seeks involved in doing that.

For large sequential writes to a file, writing a significant
chunk of data gives that bit of writeback its own locality
because it does not cause seeks. Hence simply writing large
enough chunks avoids any need to order the writeback by locality.
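
To put rough numbers on the 4MB vs 40MB comparison above
(illustrative figures -- a 10ms average seek and a 100MB/s
sequential rate -- not measurements of any real device):

/*
 * Illustrative arithmetic only -- the 10ms seek and 100MB/s
 * stream rate below are assumed round numbers.
 */
#include <stdio.h>

int main(void)
{
	const double seek_ms = 10.0;		/* assumed average seek */
	const double stream_mb_s = 100.0;	/* assumed sequential rate */
	const double chunks_mb[] = { 4, 40, 400 };
	int i;

	for (i = 0; i < 3; i++) {
		double xfer_ms = chunks_mb[i] / stream_mb_s * 1000.0;
		double eff_mb_s = chunks_mb[i] /
				((seek_ms + xfer_ms) / 1000.0);
		printf("%4.0f MB chunks: %5.1f MB/s effective\n",
		       chunks_mb[i], eff_mb_s);
	}
	return 0;
}

which prints roughly 80, 97.6 and 99.8 MB/s respectively - i.e. a
seek per 4MB chunk costs ~20% of the disk's bandwidth, while a seek
per 40MB chunk costs ~2.5%.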

Hence I see writeback ordering by locality as more a function of
optimising the "iterate over small files" aspect of the writeback.
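
For illustration only, here is a toy user-space simulation of the
kind of cycling quoted above -- all the numbers (file counts, sizes,
slice budgets) are made up for the example, and none of this is
kernel code:

/*
 * Toy simulation of the interleaved slice pattern: short sweeps
 * over many small files alternating with long sequential slices
 * on one large file at a time.  All figures are invented.
 */
#include <stdio.h>

#define NSMALL		200	/* 200 small files x 64KB dirty */
#define NLARGE		4	/* 4 large files x 256MB dirty */
#define SMALL_SLICE_MB	4.0	/* budget per small-file sweep */
#define LARGE_SLICE_MB	40.0	/* budget per large-file slice */

static double small_mb;
static double large_mb[NLARGE] = { 256, 256, 256, 256 };

int main(void)
{
	int cur = 0, slice = 1;

	small_mb = NSMALL * 0.0625;

	while (small_mb > 0 || cur < NLARGE) {
		/* short slice: sweep whatever small files are dirty */
		if (small_mb > 0) {
			double done = small_mb < SMALL_SLICE_MB ?
					small_mb : SMALL_SLICE_MB;
			small_mb -= done;
			printf("slice %2d: small files  %6.2f MB\n",
			       slice++, done);
		}
		/*
		 * long slice: stay on one large file for the whole
		 * budget so its I/O remains sequential and seek-free
		 */
		if (cur < NLARGE) {
			int f = cur;
			double done = large_mb[f] < LARGE_SLICE_MB ?
					large_mb[f] : LARGE_SLICE_MB;
			large_mb[f] -= done;
			if (large_mb[f] <= 0)
				cur++;
			printf("slice %2d: large file %d %6.1f MB\n",
			       slice++, f + 1, done);
		}
	}
	return 0;
}

The point is only the shape of the output: small-file sweeps
interleaved with long sequential slices, one large file at a time.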

> >         The higher the bandwidth of the device, the more frequently
> >         we need to be servicing the inodes with large amounts of
> >         dirty data to be written to maintain write throughput at a
> >         significant percentage of the device capability.
> >
> 
> Could you expand that to say it's not the inodes of large files but
> the ones with data that we can exploit locality?

Not sure I understand what you mean. Can you rephrase that?

> Often large files are fragmented.

Then the filesystem is not doing its job. Fragmentation does
not happen very frequently in XFS for large files - that is one
of the reasons it is extremely good for large files and high
throughput applications...

> Would it make more sense to pursue cracking the inodes and
> grouping their blocks' locations? Or is this all overkill and should
> be handled at a lower level like the elevator?

For large files it is overkill. For filesystems that do delayed
allocation, it is often impossible (no block mapping until
the writeback is executed unless it's an overwrite).
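
You can poke at that from user space with the FIBMAP ioctl -- a
demonstration sketch, with the caveats that it needs root and that
some filesystems flush dirty pages as a side effect of their bmap
method, in which case both calls will print a real block:

/*
 * Demonstration sketch: with delayed allocation, freshly dirtied
 * data may have no physical block until writeback runs.  Needs
 * root for FIBMAP; results vary by filesystem (some flush in
 * their bmap method); error handling mostly elided.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	char buf[4096];
	int fd, blk = 0;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <testfile>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);

	memset(buf, 'x', sizeof(buf));
	write(fd, buf, sizeof(buf));	/* dirty one block */

	ioctl(fd, FIBMAP, &blk);	/* often 0: no mapping yet */
	printf("before fsync: block %d\n", blk);

	fsync(fd);			/* forces allocation + writeback */
	blk = 0;
	ioctl(fd, FIBMAP, &blk);	/* now a real disk block */
	printf("after fsync:  block %d\n", blk);

	close(fd);
	return 0;
}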

At this point, I'd say it is best to leave it to the filesystem and
the elevator to do their jobs properly.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
