Linux-Fsdevel Archive on
From: Dave Chinner <>
To: Ritesh Harjani <>
Cc: Anju T Sudhakar <>,,,,,,
Subject: Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend().
Date: Sat, 22 Aug 2020 07:53:58 +1000
Message-ID: <20200821215358.GG7941@dread.disaster.area>
In-Reply-To: <>

On Fri, Aug 21, 2020 at 10:15:33AM +0530, Ritesh Harjani wrote:
> Hello Dave,
> Thanks for reviewing this.
> On 8/21/20 4:41 AM, Dave Chinner wrote:
> > On Wed, Aug 19, 2020 at 03:58:41PM +0530, Anju T Sudhakar wrote:
> > > From: Ritesh Harjani <>
> > > 
> > > __bio_try_merge_page() may return same_page = 1 and merged = 0.
> > > This could happen when bio->bi_iter.bi_size + len > UINT_MAX.
> > 
> > Ummm, silly question, but exactly how are we getting a bio that
> > large in ->writepages getting built? Even with 64kB pages, that's a
> > bio with 2^16 pages attached to it. We shouldn't be building single
> > bios in writeback that large - what storage hardware is allowing
> > such huge bios to be built? (i.e. can you dump all the values in
> > /sys/block/<dev>/queue/* for that device for us?)
> Please correct me here, but as I see, bio has only these two limits
> which it checks for adding page to bio. It doesn't check for limits
> of /sys/block/<dev>/queue/* no? I guess then it could be checked
> by block layer below b4 submitting the bio?
> static inline bool bio_full(struct bio *bio, unsigned len)
> {
>         if (bio->bi_vcnt >= bio->bi_max_vecs)
>                 return true;
>
>         if (bio->bi_iter.bi_size > UINT_MAX - len)
>                 return true;
>
>         return false;
> }

But iomap only allows BIO_MAX_PAGES when creating the bio. And:

#define BIO_MAX_PAGES 256

So even on a 64k page machine, we should not be building a bio with
more than 16MB of data in it. So how are we getting 4GB of data into
it?

Further, the writeback code is designed around the bios having a
bound size that is relatively small to keep IO submission occurring
as we pack pages into bios. This keeps IO latency down and minimises
the individual IO completion overhead of each IO. This is especially
important as the writeback path is critical for memory reclaim to
make progress because we do not want to trap gigabytes of dirty
memory in the writeback IO path.

IOWs, seeing huge bios being built by writeback is indicative of
design assumptions and constraints being violated - huge bios on the
buffered writeback path like this are not a good thing to see.

FWIW, we've also recently got reports of hard lockups in IO
completion of overwrites because our ioend bio chains have grown to
almost 3 million pages and all the writeback pages get processed as
a single completion. This is a similar situation to this bug report
in that the bio chains are unbounded in length, and I'm betting the
cause is the same: overwrite a 10GB file in memory (with dirty
limits turned up), then run fsync so we do a single writepages call
that tries to write 10GB of dirty pages....

The only reason we don't normally see this is that background
writeback caps the number of pages written per writepages call to
1024, i.e. it caps writeback IO sizes to a small amount so that IO
latency, writeback fairness across dirty inodes, etc can be
maintained for background writeback - no one dirty file can
monopolise the available writeback bandwidth and starve writeback
to other dirty inodes.

So combine the two, and we've got a problem that the writeback IO
sizes are not being bounded to sane IO sizes. I have no problems with
building individual bios that are 4MB or even 16MB in size - that
allows the block layer to work efficiently. Problems at a system
level start to occur, however, when individual bios or bio chains built
by writeback end up being orders of magnitude larger than this....

i.e. I'm not looking at this as a "bio overflow bug" - I'm
commenting on what this overflow implies from an architectural point
of view. i.e. that uncapped bio sizes and bio chain lengths in
writeback are actually a bad thing and something we've always
tried to avoid doing....


> /sys/block/<dev>/queue/*
> ========================
> setup:/run/perf$ cat /sys/block/loop1/queue/max_segments
> 128
> setup:/run/perf$ cat /sys/block/loop1/queue/max_segment_size
> 65536

A maximally sized bio (16MB) will get split into two bios for this
hardware based on this (8MB max size).

> setup:/run/perf$ cat /sys/block/loop1/queue/max_hw_sectors_kb
> 1280

Except this says 1280kB is the max size, so it will actually get
split into roughly 13 bios.

So a stream of 16MB bios from writeback will be more than large
enough to keep this hardware's pipeline full....


Dave Chinner

