Linux-Fsdevel Archive on lore.kernel.org
* [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). @ 2020-08-19 10:28 Anju T Sudhakar 2020-08-20 23:11 ` Dave Chinner ` (2 more replies) 0 siblings, 3 replies; 27+ messages in thread From: Anju T Sudhakar @ 2020-08-19 10:28 UTC (permalink / raw) To: hch, darrick.wong Cc: linux-xfs, linux-fsdevel, linux-kernel, willy, riteshh, anju From: Ritesh Harjani <riteshh@linux.ibm.com> __bio_try_merge_page() may return same_page = 1 and merged = 0. This could happen when bio->bi_iter.bi_size + len > UINT_MAX. Handle this case in iomap_add_to_ioend() by incrementing write_count. This scenario mostly happens where we have too much dirty data accumulated. w/o the patch we hit below kernel warning, WARNING: CPU: 18 PID: 5130 at fs/iomap/buffered-io.c:74 iomap_page_release+0x120/0x150 CPU: 18 PID: 5130 Comm: fio Kdump: loaded Tainted: G W 5.8.0-rc3 #6 Call Trace: __remove_mapping+0x154/0x320 (unreliable) iomap_releasepage+0x80/0x180 try_to_release_page+0x94/0xe0 invalidate_inode_page+0xc8/0x110 invalidate_mapping_pages+0x1dc/0x540 generic_fadvise+0x3c8/0x450 xfs_file_fadvise+0x2c/0xe0 [xfs] vfs_fadvise+0x3c/0x60 ksys_fadvise64_64+0x68/0xe0 sys_fadvise64+0x28/0x40 system_call_exception+0xf8/0x1c0 system_call_common+0xf0/0x278 Reported-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com> --- fs/iomap/buffered-io.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index bcfc288dba3f..4e8062279e66 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -1332,10 +1332,12 @@ iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page, merged = __bio_try_merge_page(wpc->ioend->io_bio, page, len, poff, &same_page); - if (iop && !same_page) + if (iop && merged && !same_page) atomic_inc(&iop->write_count); if (!merged) { + if (iop) + atomic_inc(&iop->write_count); if (bio_full(wpc->ioend->io_bio, len)) { wpc->ioend->io_bio = iomap_chain_bio(wpc->ioend->io_bio); -- 2.25.4 ^ permalink raw reply [flat|nested] 27+ messages in thread
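As a rough illustration of the corner case described in the commit message above, here is a small standalone userspace model of the merge path. The struct and helper names are invented stand-ins, not the kernel's definitions; only the control flow of interest is kept: the helper reports same_page = true because the page is physically contiguous with the tail bio_vec, yet refuses the merge because bi_size + len would exceed UINT_MAX, which gives exactly the merged = 0 / same_page = 1 combination the patch handles.

#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

struct toy_bio {
	unsigned int bi_size;		/* stands in for bio->bi_iter.bi_size */
};

/* Simplified stand-in for __bio_try_merge_page(): contiguity first, size check second. */
static bool toy_try_merge_page(struct toy_bio *bio, unsigned int len, bool *same_page)
{
	*same_page = true;		/* page is contiguous with the tail bio_vec */
	if (bio->bi_size > UINT_MAX - len)
		return false;		/* merge refused: merged = 0, *same_page stays 1 */
	bio->bi_size += len;
	return true;
}

int main(void)
{
	struct toy_bio bio = { .bi_size = 0xfffff000 };	/* value seen in the perf trace below */
	bool same_page = false;
	bool merged = toy_try_merge_page(&bio, 0x1000, &same_page);

	/*
	 * Before the patch, iomap_add_to_ioend() only bumped write_count for
	 * !same_page, so this merged = 0 / same_page = 1 case left the page's
	 * write_count reference untaken and the WARN fired at release time.
	 */
	printf("merged=%d same_page=%d\n", merged, same_page);
	return 0;
}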
* Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). 2020-08-19 10:28 [PATCH] iomap: Fix the write_count in iomap_add_to_ioend() Anju T Sudhakar @ 2020-08-20 23:11 ` Dave Chinner 2020-08-21 4:45 ` Ritesh Harjani 2020-08-21 6:01 ` Christoph Hellwig 2020-08-21 6:07 ` Christoph Hellwig 2020-08-21 13:31 ` Matthew Wilcox 2 siblings, 2 replies; 27+ messages in thread From: Dave Chinner @ 2020-08-20 23:11 UTC (permalink / raw) To: Anju T Sudhakar Cc: hch, darrick.wong, linux-xfs, linux-fsdevel, linux-kernel, willy, riteshh On Wed, Aug 19, 2020 at 03:58:41PM +0530, Anju T Sudhakar wrote: > From: Ritesh Harjani <riteshh@linux.ibm.com> > > __bio_try_merge_page() may return same_page = 1 and merged = 0. > This could happen when bio->bi_iter.bi_size + len > UINT_MAX. Ummm, silly question, but exactly how are we getting a bio that large in ->writepages getting built? Even with 64kB pages, that's a bio with 2^16 pages attached to it. We shouldn't be building single bios in writeback that large - what storage hardware is allowing such huge bios to be built? (i.e. can you dump all the values in /sys/block/<dev>/queue/* for that device for us?) Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). 2020-08-20 23:11 ` Dave Chinner @ 2020-08-21 4:45 ` Ritesh Harjani 2020-08-21 6:00 ` Christoph Hellwig 2020-08-21 21:53 ` Dave Chinner 2020-08-21 6:01 ` Christoph Hellwig 1 sibling, 2 replies; 27+ messages in thread From: Ritesh Harjani @ 2020-08-21 4:45 UTC (permalink / raw) To: Dave Chinner, Anju T Sudhakar Cc: hch, darrick.wong, linux-xfs, linux-fsdevel, linux-kernel, willy Hello Dave, Thanks for reviewing this. On 8/21/20 4:41 AM, Dave Chinner wrote: > On Wed, Aug 19, 2020 at 03:58:41PM +0530, Anju T Sudhakar wrote: >> From: Ritesh Harjani <riteshh@linux.ibm.com> >> >> __bio_try_merge_page() may return same_page = 1 and merged = 0. >> This could happen when bio->bi_iter.bi_size + len > UINT_MAX. > > Ummm, silly question, but exactly how are we getting a bio that > large in ->writepages getting built? Even with 64kB pages, that's a > bio with 2^16 pages attached to it. We shouldn't be building single > bios in writeback that large - what storage hardware is allowing > such huge bios to be built? (i.e. can you dump all the values in > /sys/block/<dev>/queue/* for that device for us?) Please correct me here, but as I see, bio has only these two limits which it checks for adding page to bio. It doesn't check for limits of /sys/block/<dev>/queue/* no? I guess then it could be checked by block layer below b4 submitting the bio? 113 static inline bool bio_full(struct bio *bio, unsigned len) 114 { 115 if (bio->bi_vcnt >= bio->bi_max_vecs) 116 return true; 117 118 if (bio->bi_iter.bi_size > UINT_MAX - len) 119 return true; 120 121 return false; 122 } This issue was first observed while running a fio run on a system with huge memory. But then here is an easy way we figured out to trigger the issue almost everytime with loop device on my VM setup. I have provided all the details on this below. 
<cmds to trigger it fairly quickly> =================================== echo 99999999 > /proc/sys/vm/dirtytime_expire_seconds echo 99999999 > /proc/sys/vm/dirty_expire_centisecs echo 90 > /proc/sys/vm/dirty_rati0 echo 90 > /proc/sys/vm/dirty_background_ratio echo 0 > /proc/sys/vm/dirty_writeback_centisecs sudo perf probe -s ~/host_shared/src/linux/ -a '__bio_try_merge_page:10 bio page page->index bio->bi_iter.bi_size len same_page[0]' sudo perf record -e probe:__bio_try_merge_page_L10 -a --filter 'bi_size > 0xff000000' sudo fio --rw=write --bs=1M --numjobs=1 --name=/mnt/testfile --size=24G --ioengine=libaio # on running this 2nd time it gets hit everytime on my setup sudo perf record -e probe:__bio_try_merge_page_L10 -a --filter 'bi_size > 0xff000000' sudo fio --rw=write --bs=1M --numjobs=1 --name=/mnt/testfile --size=24G --ioengine=libaio Perf o/p from above filter causing overflow =========================================== <...> fio 25194 [029] 70471.559084: probe:__bio_try_merge_page_L10: (c000000000aa054c) bio=0xc0000013d49a4b80 page=0xc00c000004029d80 index=0x10a9d bi_size=0xffff8000 len=0x1000 same_page=0x1 fio 25194 [029] 70471.559087: probe:__bio_try_merge_page_L10: (c000000000aa054c) bio=0xc0000013d49a4b80 page=0xc00c000004029d80 index=0x10a9d bi_size=0xffff9000 len=0x1000 same_page=0x1 fio 25194 [029] 70471.559090: probe:__bio_try_merge_page_L10: (c000000000aa054c) bio=0xc0000013d49a4b80 page=0xc00c000004029d80 index=0x10a9d bi_size=0xffffa000 len=0x1000 same_page=0x1 fio 25194 [029] 70471.559093: probe:__bio_try_merge_page_L10: (c000000000aa054c) bio=0xc0000013d49a4b80 page=0xc00c000004029d80 index=0x10a9d bi_size=0xffffb000 len=0x1000 same_page=0x1 fio 25194 [029] 70471.559095: probe:__bio_try_merge_page_L10: (c000000000aa054c) bio=0xc0000013d49a4b80 page=0xc00c000004029d80 index=0x10a9d bi_size=0xffffc000 len=0x1000 same_page=0x1 fio 25194 [029] 70471.559098: probe:__bio_try_merge_page_L10: (c000000000aa054c) bio=0xc0000013d49a4b80 page=0xc00c000004029d80 index=0x10a9d bi_size=0xffffd000 len=0x1000 same_page=0x1 fio 25194 [029] 70471.559101: probe:__bio_try_merge_page_L10: (c000000000aa054c) bio=0xc0000013d49a4b80 page=0xc00c000004029d80 index=0x10a9d bi_size=0xffffe000 len=0x1000 same_page=0x1 fio 25194 [029] 70471.559104: probe:__bio_try_merge_page_L10: (c000000000aa054c) bio=0xc0000013d49a4b80 page=0xc00c000004029d80 index=0x10a9d bi_size=0xfffff000 len=0x1000 same_page=0x1 ^^^^^^ (this could cause an overflow) loop dev ========= NAME SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE DIO LOG-SEC /dev/loop1 0 0 0 0 /mnt1/filefs 0 512 mount o/p ========= /dev/loop1 on /mnt type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota) /sys/block/<dev>/queue/* ======================== setup:/run/perf$ cat /sys/block/loop1/queue/max_segments 128 setup:/run/perf$ cat /sys/block/loop1/queue/max_segment_size 65536 setup:/run/perf$ cat /sys/block/loop1/queue/max_hw_sectors_kb 1280 setup:/run/perf$ cat /sys/block/loop1/queue/logical_block_size 512 setup:/run/perf$ cat /sys/block/loop1/queue/max_sectors_kb 1280 setup:/run/perf$ cat /sys/block/loop1/queue/hw_sector_size 512 setup:/run/perf$ cat /sys/block/loop1/queue/discard_max_bytes 4294966784 setup:/run/perf$ cat /sys/block/loop1/queue/discard_max_hw_bytes 4294966784 setup:/run/perf$ cat /sys/block/loop1/queue/discard_zeroes_data 0 setup:/run/perf$ cat /sys/block/loop1/queue/discard_granularity 4096 setup:/run/perf$ cat /sys/block/loop1/queue/chunk_sectors 0 setup:/run/perf$ cat /sys/block/loop1/queue/max_discard_segments 1 
setup:/run/perf$ cat /sys/block/loop1/queue/read_ahead_kb 128 setup:/run/perf$ cat /sys/block/loop1/queue/rotational 1 setup:/run/perf$ cat /sys/block/loop1/queue/physical_block_size 512 setup:/run/perf$ cat /sys/block/loop1/queue/write_same_max_bytes 0 setup:/run/perf$ cat /sys/block/loop1/queue/write_zeroes_max_bytes 4294966784 ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). 2020-08-21 4:45 ` Ritesh Harjani @ 2020-08-21 6:00 ` Christoph Hellwig 2020-08-21 9:09 ` Ritesh Harjani 2020-08-21 21:53 ` Dave Chinner 1 sibling, 1 reply; 27+ messages in thread From: Christoph Hellwig @ 2020-08-21 6:00 UTC (permalink / raw) To: Ritesh Harjani Cc: Dave Chinner, Anju T Sudhakar, hch, darrick.wong, linux-xfs, linux-fsdevel, linux-kernel, willy On Fri, Aug 21, 2020 at 10:15:33AM +0530, Ritesh Harjani wrote: > Please correct me here, but as I see, bio has only these two limits > which it checks for adding page to bio. It doesn't check for limits > of /sys/block/<dev>/queue/* no? I guess then it could be checked > by block layer below b4 submitting the bio? The bio does not, but the blk-mq code will split the bios when mapping it to requests, take a look at blk_mq_submit_bio and __blk_queue_split. But while the default limits are quite low, they can be increased siginificantly, which tends to help with performance and is often also done by scripts shipped by the distributions. > This issue was first observed while running a fio run on a system with > huge memory. But then here is an easy way we figured out to trigger the > issue almost everytime with loop device on my VM setup. I have provided > all the details on this below. Can you wire this up for xfstests? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). 2020-08-21 6:00 ` Christoph Hellwig @ 2020-08-21 9:09 ` Ritesh Harjani 0 siblings, 0 replies; 27+ messages in thread From: Ritesh Harjani @ 2020-08-21 9:09 UTC (permalink / raw) To: Christoph Hellwig Cc: Dave Chinner, Anju T Sudhakar, darrick.wong, linux-xfs, linux-fsdevel, linux-kernel, willy On 8/21/20 11:30 AM, Christoph Hellwig wrote: > On Fri, Aug 21, 2020 at 10:15:33AM +0530, Ritesh Harjani wrote: >> Please correct me here, but as I see, bio has only these two limits >> which it checks for adding page to bio. It doesn't check for limits >> of /sys/block/<dev>/queue/* no? I guess then it could be checked >> by block layer below b4 submitting the bio? > > The bio does not, but the blk-mq code will split the bios when mapping > it to requests, take a look at blk_mq_submit_bio and __blk_queue_split. Thanks :) > > But while the default limits are quite low, they can be increased > siginificantly, which tends to help with performance and is often > also done by scripts shipped by the distributions. > >> This issue was first observed while running a fio run on a system with >> huge memory. But then here is an easy way we figured out to trigger the >> issue almost everytime with loop device on my VM setup. I have provided >> all the details on this below. > > Can you wire this up for xfstests? > Sure, will do that. I do see generic/460 does play with such vm dirty_ratio params which this test would also require to manipulate to trigger this issue. Thanks for the suggestion! -ritesh ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). 2020-08-21 4:45 ` Ritesh Harjani 2020-08-21 6:00 ` Christoph Hellwig @ 2020-08-21 21:53 ` Dave Chinner 2020-08-22 13:13 ` Christoph Hellwig 1 sibling, 1 reply; 27+ messages in thread From: Dave Chinner @ 2020-08-21 21:53 UTC (permalink / raw) To: Ritesh Harjani Cc: Anju T Sudhakar, hch, darrick.wong, linux-xfs, linux-fsdevel, linux-kernel, willy On Fri, Aug 21, 2020 at 10:15:33AM +0530, Ritesh Harjani wrote: > Hello Dave, > > Thanks for reviewing this. > > On 8/21/20 4:41 AM, Dave Chinner wrote: > > On Wed, Aug 19, 2020 at 03:58:41PM +0530, Anju T Sudhakar wrote: > > > From: Ritesh Harjani <riteshh@linux.ibm.com> > > > > > > __bio_try_merge_page() may return same_page = 1 and merged = 0. > > > This could happen when bio->bi_iter.bi_size + len > UINT_MAX. > > > > Ummm, silly question, but exactly how are we getting a bio that > > large in ->writepages getting built? Even with 64kB pages, that's a > > bio with 2^16 pages attached to it. We shouldn't be building single > > bios in writeback that large - what storage hardware is allowing > > such huge bios to be built? (i.e. can you dump all the values in > > /sys/block/<dev>/queue/* for that device for us?) > > Please correct me here, but as I see, bio has only these two limits > which it checks for adding page to bio. It doesn't check for limits > of /sys/block/<dev>/queue/* no? I guess then it could be checked > by block layer below b4 submitting the bio? > > 113 static inline bool bio_full(struct bio *bio, unsigned len) > 114 { > 115 if (bio->bi_vcnt >= bio->bi_max_vecs) > 116 return true; > 117 > 118 if (bio->bi_iter.bi_size > UINT_MAX - len) > 119 return true; > 120 > 121 return false; > 122 } but iomap only allows BIO_MAX_PAGES when creating the bio. And: #define BIO_MAX_PAGES 256 So even on a 64k page machine, we should not be building a bio with more than 16MB of data in it. So how are we getting 4GB of data into it? Further, the writeback code is designed around the bios having a bound size that is relatively small to keep IO submission occurring as we pack pages into bios. This keeps IO latency down and minimises the individual IO completion overhead of each IO. This is especially important as the writeback path is critical for memory relcaim to make progress because we do not want to trap gigabytes of dirty memory in the writeback IO path. IOWs, seeing huge bios being built by writeback is indicative of design assumptions and contraints being violated - huge bios on the buffered writeback path like this are not a good thing to see. FWIW, We've also recently got reports of hard lockups in IO completion of overwrites because our ioend bio chains have grown to almost 3 million pages and all the writeback pages get processed as a single completion. This is a similar situation to this bug report in that the bio chains are unbound in length, and I'm betting the cause is the same: overwrite a 10GB file in memory (with dirty limits turned up), then run fsync so we do a single writepages call that tries to write 10GB of dirty pages.... The only reason we don't normally see this is that background writeback caps the number of pages written per writepages call to 1024. i.e. it caps writeback IO sizes to a small amount so that IO latency, writeback fairness across dirty inodes, etc can be maintained for background writeback - no one dirty file can monopolise the available writeback bandwidth and starve writeback to other dirty inodes. 
So combine the two, and we've got a problem that the writeback IO sizes are not being bound to sane IO sizes. I have no problems with building individual bios that are 4MB or even 16MB in size - that allows the block layer to work efficiently. Problems at a system level start to occur, however, when individual bios or bio chains built by writeback end up being orders of magnitude larger than this.... i.e. I'm not looking at this as a "bio overflow bug" - I'm commenting on what this overflow implies from an architectural point of view. i.e. that uncapped bio sizes and bio chain lengths in writeback are actually a bad thing and something we've always tried to avoid doing.... ..... > /sys/block/<dev>/queue/* > ======================== > > setup:/run/perf$ cat /sys/block/loop1/queue/max_segments > 128 > setup:/run/perf$ cat /sys/block/loop1/queue/max_segment_size > 65536 A maximally sized bio (16MB) will get split into two bios for this hardware based on this (8MB max size). > setup:/run/perf$ cat /sys/block/loop1/queue/max_hw_sectors_kb > 1280 Except this says 1280kB is the max size, so it will actually get split into 14 bios. So a stream of 16MB bios from writeback will be more than large enough to keep this hardware's pipeline full.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). 2020-08-21 21:53 ` Dave Chinner @ 2020-08-22 13:13 ` Christoph Hellwig 2020-08-24 14:28 ` Brian Foster 0 siblings, 1 reply; 27+ messages in thread From: Christoph Hellwig @ 2020-08-22 13:13 UTC (permalink / raw) To: Dave Chinner Cc: Ritesh Harjani, Anju T Sudhakar, hch, darrick.wong, linux-xfs, linux-fsdevel, linux-kernel, willy On Sat, Aug 22, 2020 at 07:53:58AM +1000, Dave Chinner wrote: > but iomap only allows BIO_MAX_PAGES when creating the bio. And: > > #define BIO_MAX_PAGES 256 > > So even on a 64k page machine, we should not be building a bio with > more than 16MB of data in it. So how are we getting 4GB of data into > it? BIO_MAX_PAGES is the number of bio_vecs in the bio, it has no direct implication on the size, as each entry can fit up to UINT_MAX bytes. ^ permalink raw reply [flat|nested] 27+ messages in thread
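Rough numbers behind this exchange, assuming a 4 KiB page size for illustration (BIO_MAX_PAGES and the UINT_MAX check are the values quoted above): with one page per bio_vec the vector count caps a bio at a few MB, but once physically contiguous pages are folded into an existing bio_vec, only the UINT_MAX bound on bi_iter.bi_size is left, which is how a bio can approach 4GB.

#include <limits.h>
#include <stdio.h>

#define BIO_MAX_PAGES	256		/* bio_vecs per bio, as quoted above */

int main(void)
{
	unsigned long page_sz = 4096;	/* illustrative; 65536 on a 64k-page box */

	/* One page per bio_vec: the bound Dave is reasoning from. */
	printf("one page per vec: %lu KiB per bio\n",
	       BIO_MAX_PAGES * page_sz / 1024);

	/*
	 * With physically contiguous pages merged into an existing bio_vec,
	 * bi_vcnt stops growing and only the bi_iter.bi_size check binds.
	 */
	printf("bi_size bound:    UINT_MAX = %u bytes (just under 4 GiB)\n",
	       UINT_MAX);
	return 0;
}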
* Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). 2020-08-22 13:13 ` Christoph Hellwig @ 2020-08-24 14:28 ` Brian Foster 2020-08-24 15:04 ` Christoph Hellwig 0 siblings, 1 reply; 27+ messages in thread From: Brian Foster @ 2020-08-24 14:28 UTC (permalink / raw) To: Christoph Hellwig Cc: Dave Chinner, Ritesh Harjani, Anju T Sudhakar, darrick.wong, linux-xfs, linux-fsdevel, linux-kernel, willy On Sat, Aug 22, 2020 at 02:13:12PM +0100, Christoph Hellwig wrote: > On Sat, Aug 22, 2020 at 07:53:58AM +1000, Dave Chinner wrote: > > but iomap only allows BIO_MAX_PAGES when creating the bio. And: > > > > #define BIO_MAX_PAGES 256 > > > > So even on a 64k page machine, we should not be building a bio with > > more than 16MB of data in it. So how are we getting 4GB of data into > > it? > > BIO_MAX_PAGES is the number of bio_vecs in the bio, it has no > direct implication on the size, as each entry can fit up to UINT_MAX > bytes. > Do I understand the current code (__bio_try_merge_page() -> page_is_mergeable()) correctly in that we're checking for physical page contiguity and not necessarily requiring a new bio_vec per physical page? With regard to Dave's earlier point around seeing excessively sized bio chains.. If I set up a large memory box with high dirty mem ratios and do contiguous buffered overwrites over a 32GB range followed by fsync, I can see upwards of 1GB per bio and thus chains on the order of 32+ bios for the entire write. If I play games with how the buffered overwrite is submitted (i.e., in reverse) however, then I can occasionally reproduce a ~32GB chain of ~32k bios, which I think is what leads to problems in I/O completion on some systems. Granted, I don't reproduce soft lockup issues on my system with that behavior, so perhaps there's more to that particular issue. Regardless, it seems reasonable to me to at least have a conservative limit on the length of an ioend bio chain. Would anybody object to iomap_ioend growing a chain counter and perhaps forcing into a new ioend if we chain something like more than 1k bios at once? Brian ^ permalink raw reply [flat|nested] 27+ messages in thread
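To put a rough number on the chain-cap idea above (all figures are illustrative assumptions: 4 KiB pages, one page per bio_vec, and the ~1k-bio cap floated in the mail), such a limit would bound the dirty pagecache pinned per ioend as sketched below. The single-page-per-vec assumption is exactly what the contiguity/multipage-bvec point undermines, which is why the later proposal in this thread caps ioend bytes rather than bio count.

#include <stdio.h>

#define BIO_MAX_PAGES		256	/* single-page bio_vecs per bio, illustrative */
#define MAX_BIOS_PER_IOEND	1024	/* the ~1k-bio cap floated above */

int main(void)
{
	unsigned long long pages = (unsigned long long)BIO_MAX_PAGES * MAX_BIOS_PER_IOEND;
	unsigned long long bytes = pages * 4096;	/* 4 KiB pages, illustrative */

	/* ~256k pages, i.e. roughly 1 GiB of pagecache pinned per ioend at most. */
	printf("cap: %llu pages = %llu MiB per ioend\n", pages, bytes >> 20);
	return 0;
}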
* Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). 2020-08-24 14:28 ` Brian Foster @ 2020-08-24 15:04 ` Christoph Hellwig 2020-08-24 15:48 ` Brian Foster 0 siblings, 1 reply; 27+ messages in thread From: Christoph Hellwig @ 2020-08-24 15:04 UTC (permalink / raw) To: Brian Foster Cc: Christoph Hellwig, Dave Chinner, Ritesh Harjani, Anju T Sudhakar, darrick.wong, linux-xfs, linux-fsdevel, linux-kernel, willy On Mon, Aug 24, 2020 at 10:28:23AM -0400, Brian Foster wrote: > Do I understand the current code (__bio_try_merge_page() -> > page_is_mergeable()) correctly in that we're checking for physical page > contiguity and not necessarily requiring a new bio_vec per physical > page? Yes. > With regard to Dave's earlier point around seeing excessively sized bio > chains.. If I set up a large memory box with high dirty mem ratios and > do contiguous buffered overwrites over a 32GB range followed by fsync, I > can see upwards of 1GB per bio and thus chains on the order of 32+ bios > for the entire write. If I play games with how the buffered overwrite is > submitted (i.e., in reverse) however, then I can occasionally reproduce > a ~32GB chain of ~32k bios, which I think is what leads to problems in > I/O completion on some systems. Granted, I don't reproduce soft lockup > issues on my system with that behavior, so perhaps there's more to that > particular issue. > > Regardless, it seems reasonable to me to at least have a conservative > limit on the length of an ioend bio chain. Would anybody object to > iomap_ioend growing a chain counter and perhaps forcing into a new ioend > if we chain something like more than 1k bios at once? So what exactly is the problem of processing a long chain in the workqueue vs multiple small chains? Maybe we need a cond_resched() here and there, but I don't see how we'd substantially change behavior. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). 2020-08-24 15:04 ` Christoph Hellwig @ 2020-08-24 15:48 ` Brian Foster 2020-08-25 0:42 ` Dave Chinner 0 siblings, 1 reply; 27+ messages in thread From: Brian Foster @ 2020-08-24 15:48 UTC (permalink / raw) To: Christoph Hellwig Cc: Dave Chinner, Ritesh Harjani, Anju T Sudhakar, darrick.wong, linux-xfs, linux-fsdevel, linux-kernel, willy On Mon, Aug 24, 2020 at 04:04:17PM +0100, Christoph Hellwig wrote: > On Mon, Aug 24, 2020 at 10:28:23AM -0400, Brian Foster wrote: > > Do I understand the current code (__bio_try_merge_page() -> > > page_is_mergeable()) correctly in that we're checking for physical page > > contiguity and not necessarily requiring a new bio_vec per physical > > page? > > > Yes. > Ok. I also realize now that this occurs on a kernel without commit 07173c3ec276 ("block: enable multipage bvecs"). That is probably a contributing factor, but it's not clear to me whether it's feasible to backport whatever supporting infrastructure is required for that mechanism to work (I suspect not). > > With regard to Dave's earlier point around seeing excessively sized bio > > chains.. If I set up a large memory box with high dirty mem ratios and > > do contiguous buffered overwrites over a 32GB range followed by fsync, I > > can see upwards of 1GB per bio and thus chains on the order of 32+ bios > > for the entire write. If I play games with how the buffered overwrite is > > submitted (i.e., in reverse) however, then I can occasionally reproduce > > a ~32GB chain of ~32k bios, which I think is what leads to problems in > > I/O completion on some systems. Granted, I don't reproduce soft lockup > > issues on my system with that behavior, so perhaps there's more to that > > particular issue. > > > > Regardless, it seems reasonable to me to at least have a conservative > > limit on the length of an ioend bio chain. Would anybody object to > > iomap_ioend growing a chain counter and perhaps forcing into a new ioend > > if we chain something like more than 1k bios at once? > > So what exactly is the problem of processing a long chain in the > workqueue vs multiple small chains? Maybe we need a cond_resched() > here and there, but I don't see how we'd substantially change behavior. > The immediate problem is a watchdog lockup detection in bio completion: NMI watchdog: Watchdog detected hard LOCKUP on cpu 25 This effectively lands at the following segment of iomap_finish_ioend(): ... /* walk each page on bio, ending page IO on them */ bio_for_each_segment_all(bv, bio, iter_all) iomap_finish_page_writeback(inode, bv->bv_page, error); I suppose we could add a cond_resched(), but is that safe directly inside of a ->bi_end_io() handler? Another option could be to dump large chains into the completion workqueue, but we may still need to track the length to do that. Thoughts? Brian ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). 2020-08-24 15:48 ` Brian Foster @ 2020-08-25 0:42 ` Dave Chinner 2020-08-25 14:49 ` Brian Foster 0 siblings, 1 reply; 27+ messages in thread From: Dave Chinner @ 2020-08-25 0:42 UTC (permalink / raw) To: Brian Foster Cc: Christoph Hellwig, Ritesh Harjani, Anju T Sudhakar, darrick.wong, linux-xfs, linux-fsdevel, linux-kernel, willy On Mon, Aug 24, 2020 at 11:48:41AM -0400, Brian Foster wrote: > On Mon, Aug 24, 2020 at 04:04:17PM +0100, Christoph Hellwig wrote: > > On Mon, Aug 24, 2020 at 10:28:23AM -0400, Brian Foster wrote: > > > Do I understand the current code (__bio_try_merge_page() -> > > > page_is_mergeable()) correctly in that we're checking for physical page > > > contiguity and not necessarily requiring a new bio_vec per physical > > > page? > > > > > > Yes. > > > > Ok. I also realize now that this occurs on a kernel without commit > 07173c3ec276 ("block: enable multipage bvecs"). That is probably a > contributing factor, but it's not clear to me whether it's feasible to > backport whatever supporting infrastructure is required for that > mechanism to work (I suspect not). > > > > With regard to Dave's earlier point around seeing excessively sized bio > > > chains.. If I set up a large memory box with high dirty mem ratios and > > > do contiguous buffered overwrites over a 32GB range followed by fsync, I > > > can see upwards of 1GB per bio and thus chains on the order of 32+ bios > > > for the entire write. If I play games with how the buffered overwrite is > > > submitted (i.e., in reverse) however, then I can occasionally reproduce > > > a ~32GB chain of ~32k bios, which I think is what leads to problems in > > > I/O completion on some systems. Granted, I don't reproduce soft lockup > > > issues on my system with that behavior, so perhaps there's more to that > > > particular issue. > > > > > > Regardless, it seems reasonable to me to at least have a conservative > > > limit on the length of an ioend bio chain. Would anybody object to > > > iomap_ioend growing a chain counter and perhaps forcing into a new ioend > > > if we chain something like more than 1k bios at once? > > > > So what exactly is the problem of processing a long chain in the > > workqueue vs multiple small chains? Maybe we need a cond_resched() > > here and there, but I don't see how we'd substantially change behavior. > > > > The immediate problem is a watchdog lockup detection in bio completion: > > NMI watchdog: Watchdog detected hard LOCKUP on cpu 25 > > This effectively lands at the following segment of iomap_finish_ioend(): > > ... > /* walk each page on bio, ending page IO on them */ > bio_for_each_segment_all(bv, bio, iter_all) > iomap_finish_page_writeback(inode, bv->bv_page, error); > > I suppose we could add a cond_resched(), but is that safe directly > inside of a ->bi_end_io() handler? Another option could be to dump large > chains into the completion workqueue, but we may still need to track the > length to do that. Thoughts? We have ioend completion merging that will run the compeltion once for all the pending ioend completions on that inode. IOWs, we do not need to build huge chains at submission time to batch up completions efficiently. However, huge bio chains at submission time do cause issues with writeback fairness, pinning GBs of ram as unreclaimable for seconds because they are queued for completion while we are still submitting the bio chain and submission is being throttled by the block layer writeback throttle, etc. 
Not to mention the latency of stable pages in a situation like this - a mmap() write fault could stall for many seconds waiting for a huge bio chain to finish submission and run completion processing even when the IO for the given page we faulted on was completed before the page fault occurred... Hence I think we really do need to cap the length of the bio chains here so that we start completing and ending page writeback on large writeback ranges long before the writeback code finishes submitting the range it was asked to write back. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). 2020-08-25 0:42 ` Dave Chinner @ 2020-08-25 14:49 ` Brian Foster 2020-08-31 4:01 ` Ming Lei 2020-09-16 0:12 ` Darrick J. Wong 0 siblings, 2 replies; 27+ messages in thread From: Brian Foster @ 2020-08-25 14:49 UTC (permalink / raw) To: Dave Chinner Cc: Christoph Hellwig, Ritesh Harjani, Anju T Sudhakar, darrick.wong, linux-xfs, linux-fsdevel, linux-kernel, willy, minlei cc Ming On Tue, Aug 25, 2020 at 10:42:03AM +1000, Dave Chinner wrote: > On Mon, Aug 24, 2020 at 11:48:41AM -0400, Brian Foster wrote: > > On Mon, Aug 24, 2020 at 04:04:17PM +0100, Christoph Hellwig wrote: > > > On Mon, Aug 24, 2020 at 10:28:23AM -0400, Brian Foster wrote: > > > > Do I understand the current code (__bio_try_merge_page() -> > > > > page_is_mergeable()) correctly in that we're checking for physical page > > > > contiguity and not necessarily requiring a new bio_vec per physical > > > > page? > > > > > > > > > Yes. > > > > > > > Ok. I also realize now that this occurs on a kernel without commit > > 07173c3ec276 ("block: enable multipage bvecs"). That is probably a > > contributing factor, but it's not clear to me whether it's feasible to > > backport whatever supporting infrastructure is required for that > > mechanism to work (I suspect not). > > > > > > With regard to Dave's earlier point around seeing excessively sized bio > > > > chains.. If I set up a large memory box with high dirty mem ratios and > > > > do contiguous buffered overwrites over a 32GB range followed by fsync, I > > > > can see upwards of 1GB per bio and thus chains on the order of 32+ bios > > > > for the entire write. If I play games with how the buffered overwrite is > > > > submitted (i.e., in reverse) however, then I can occasionally reproduce > > > > a ~32GB chain of ~32k bios, which I think is what leads to problems in > > > > I/O completion on some systems. Granted, I don't reproduce soft lockup > > > > issues on my system with that behavior, so perhaps there's more to that > > > > particular issue. > > > > > > > > Regardless, it seems reasonable to me to at least have a conservative > > > > limit on the length of an ioend bio chain. Would anybody object to > > > > iomap_ioend growing a chain counter and perhaps forcing into a new ioend > > > > if we chain something like more than 1k bios at once? > > > > > > So what exactly is the problem of processing a long chain in the > > > workqueue vs multiple small chains? Maybe we need a cond_resched() > > > here and there, but I don't see how we'd substantially change behavior. > > > > > > > The immediate problem is a watchdog lockup detection in bio completion: > > > > NMI watchdog: Watchdog detected hard LOCKUP on cpu 25 > > > > This effectively lands at the following segment of iomap_finish_ioend(): > > > > ... > > /* walk each page on bio, ending page IO on them */ > > bio_for_each_segment_all(bv, bio, iter_all) > > iomap_finish_page_writeback(inode, bv->bv_page, error); > > > > I suppose we could add a cond_resched(), but is that safe directly > > inside of a ->bi_end_io() handler? Another option could be to dump large > > chains into the completion workqueue, but we may still need to track the > > length to do that. Thoughts? > > We have ioend completion merging that will run the compeltion once > for all the pending ioend completions on that inode. IOWs, we do not > need to build huge chains at submission time to batch up completions > efficiently. 
However, huge bio chains at submission time do cause > issues with writeback fairness, pinning GBs of ram as unreclaimable > for seconds because they are queued for completion while we are > still submitting the bio chain and submission is being throttled by > the block layer writeback throttle, etc. Not to mention the latency > of stable pages in a situation like this - a mmap() write fault > could stall for many seconds waiting for a huge bio chain to finish > submission and run completion processing even when the IO for the > given page we faulted on was completed before the page fault > occurred... > > Hence I think we really do need to cap the length of the bio > chains here so that we start completing and ending page writeback on > large writeback ranges long before the writeback code finishes > submitting the range it was asked to write back. > Ming pointed out separately that limiting the bio chain itself might not be enough because with multipage bvecs, we can effectively capture the same number of pages in much fewer bios. Given that, what do you think about something like the patch below to limit ioend size? This effectively limits the number of pages per ioend regardless of whether in-core state results in a small chain of dense bios or a large chain of smaller bios, without requiring any new explicit page count tracking. Brian --- 8< --- diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index 6ae98d3cb157..4aa96705ffd7 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -1301,7 +1301,7 @@ iomap_chain_bio(struct bio *prev) static bool iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset, - sector_t sector) + unsigned len, sector_t sector) { if ((wpc->iomap.flags & IOMAP_F_SHARED) != (wpc->ioend->io_flags & IOMAP_F_SHARED)) @@ -1312,6 +1312,8 @@ iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset, return false; if (sector != bio_end_sector(wpc->ioend->io_bio)) return false; + if (wpc->ioend->io_size + len > IOEND_MAX_IOSIZE) + return false; return true; } @@ -1329,7 +1331,7 @@ iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page, unsigned poff = offset & (PAGE_SIZE - 1); bool merged, same_page = false; - if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, sector)) { + if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, len, sector)) { if (wpc->ioend) list_add(&wpc->ioend->io_list, iolist); wpc->ioend = iomap_alloc_ioend(inode, wpc, offset, sector, wbc); diff --git a/include/linux/iomap.h b/include/linux/iomap.h index 4d1d3c3469e9..5d1b1a08ec96 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -200,6 +200,8 @@ struct iomap_ioend { struct bio io_inline_bio; /* MUST BE LAST! */ }; +#define IOEND_MAX_IOSIZE (262144 << PAGE_SHIFT) + struct iomap_writeback_ops { /* * Required, maps the blocks so that writeback can be performed on ^ permalink raw reply [flat|nested] 27+ messages in thread
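For reference, what the proposed constant works out to (the PAGE_SHIFT values below are the common 4 KiB and 64 KiB page configurations; whether this is the right bound is questioned later in the thread): IOEND_MAX_IOSIZE is a byte cap equivalent to 262144 pages, and 262144 happens to equal 256 * 1024, i.e. the earlier "1k bios of 256 single-page vecs" figure expressed in bytes.

#include <stdio.h>

int main(void)
{
	unsigned long long pages = 262144;	/* 262144 = 256 * 1024 */

	/* IOEND_MAX_IOSIZE = (262144 << PAGE_SHIFT) bytes */
	printf("PAGE_SHIFT=12 (4 KiB pages):  %llu GiB per ioend\n",
	       (pages << 12) >> 30);		/* 1 GiB  */
	printf("PAGE_SHIFT=16 (64 KiB pages): %llu GiB per ioend\n",
	       (pages << 16) >> 30);		/* 16 GiB */
	return 0;
}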
* Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). 2020-08-25 14:49 ` Brian Foster @ 2020-08-31 4:01 ` Ming Lei 2020-08-31 14:35 ` Brian Foster 2020-09-16 0:12 ` Darrick J. Wong 1 sibling, 1 reply; 27+ messages in thread From: Ming Lei @ 2020-08-31 4:01 UTC (permalink / raw) To: Brian Foster Cc: Dave Chinner, Christoph Hellwig, Ritesh Harjani, Anju T Sudhakar, darrick.wong, linux-xfs, linux-fsdevel, linux-kernel, willy, minlei On Tue, Aug 25, 2020 at 10:49:17AM -0400, Brian Foster wrote: > cc Ming > > On Tue, Aug 25, 2020 at 10:42:03AM +1000, Dave Chinner wrote: > > On Mon, Aug 24, 2020 at 11:48:41AM -0400, Brian Foster wrote: > > > On Mon, Aug 24, 2020 at 04:04:17PM +0100, Christoph Hellwig wrote: > > > > On Mon, Aug 24, 2020 at 10:28:23AM -0400, Brian Foster wrote: > > > > > Do I understand the current code (__bio_try_merge_page() -> > > > > > page_is_mergeable()) correctly in that we're checking for physical page > > > > > contiguity and not necessarily requiring a new bio_vec per physical > > > > > page? > > > > > > > > > > > > Yes. > > > > > > > > > > Ok. I also realize now that this occurs on a kernel without commit > > > 07173c3ec276 ("block: enable multipage bvecs"). That is probably a > > > contributing factor, but it's not clear to me whether it's feasible to > > > backport whatever supporting infrastructure is required for that > > > mechanism to work (I suspect not). > > > > > > > > With regard to Dave's earlier point around seeing excessively sized bio > > > > > chains.. If I set up a large memory box with high dirty mem ratios and > > > > > do contiguous buffered overwrites over a 32GB range followed by fsync, I > > > > > can see upwards of 1GB per bio and thus chains on the order of 32+ bios > > > > > for the entire write. If I play games with how the buffered overwrite is > > > > > submitted (i.e., in reverse) however, then I can occasionally reproduce > > > > > a ~32GB chain of ~32k bios, which I think is what leads to problems in > > > > > I/O completion on some systems. Granted, I don't reproduce soft lockup > > > > > issues on my system with that behavior, so perhaps there's more to that > > > > > particular issue. > > > > > > > > > > Regardless, it seems reasonable to me to at least have a conservative > > > > > limit on the length of an ioend bio chain. Would anybody object to > > > > > iomap_ioend growing a chain counter and perhaps forcing into a new ioend > > > > > if we chain something like more than 1k bios at once? > > > > > > > > So what exactly is the problem of processing a long chain in the > > > > workqueue vs multiple small chains? Maybe we need a cond_resched() > > > > here and there, but I don't see how we'd substantially change behavior. > > > > > > > > > > The immediate problem is a watchdog lockup detection in bio completion: > > > > > > NMI watchdog: Watchdog detected hard LOCKUP on cpu 25 > > > > > > This effectively lands at the following segment of iomap_finish_ioend(): > > > > > > ... > > > /* walk each page on bio, ending page IO on them */ > > > bio_for_each_segment_all(bv, bio, iter_all) > > > iomap_finish_page_writeback(inode, bv->bv_page, error); > > > > > > I suppose we could add a cond_resched(), but is that safe directly > > > inside of a ->bi_end_io() handler? Another option could be to dump large > > > chains into the completion workqueue, but we may still need to track the > > > length to do that. Thoughts? 
> > > > We have ioend completion merging that will run the compeltion once > > for all the pending ioend completions on that inode. IOWs, we do not > > need to build huge chains at submission time to batch up completions > > efficiently. However, huge bio chains at submission time do cause > > issues with writeback fairness, pinning GBs of ram as unreclaimable > > for seconds because they are queued for completion while we are > > still submitting the bio chain and submission is being throttled by > > the block layer writeback throttle, etc. Not to mention the latency > > of stable pages in a situation like this - a mmap() write fault > > could stall for many seconds waiting for a huge bio chain to finish > > submission and run completion processing even when the IO for the > > given page we faulted on was completed before the page fault > > occurred... > > > > Hence I think we really do need to cap the length of the bio > > chains here so that we start completing and ending page writeback on > > large writeback ranges long before the writeback code finishes > > submitting the range it was asked to write back. > > > > Ming pointed out separately that limiting the bio chain itself might not > be enough because with multipage bvecs, we can effectively capture the > same number of pages in much fewer bios. Given that, what do you think > about something like the patch below to limit ioend size? This > effectively limits the number of pages per ioend regardless of whether > in-core state results in a small chain of dense bios or a large chain of > smaller bios, without requiring any new explicit page count tracking. Hello Brian, This patch looks fine. However, I am wondering why iomap has to chain bios in one ioend, and why not submit each bio in usual way just like what fs/direct-io.c does? Then each bio can complete the pages in its own .bi_end_io(). thanks, Ming ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). 2020-08-31 4:01 ` Ming Lei @ 2020-08-31 14:35 ` Brian Foster 0 siblings, 0 replies; 27+ messages in thread From: Brian Foster @ 2020-08-31 14:35 UTC (permalink / raw) To: Ming Lei Cc: Dave Chinner, Christoph Hellwig, Ritesh Harjani, Anju T Sudhakar, darrick.wong, linux-xfs, linux-fsdevel, linux-kernel, willy, minlei On Mon, Aug 31, 2020 at 12:01:07PM +0800, Ming Lei wrote: > On Tue, Aug 25, 2020 at 10:49:17AM -0400, Brian Foster wrote: > > cc Ming > > > > On Tue, Aug 25, 2020 at 10:42:03AM +1000, Dave Chinner wrote: > > > On Mon, Aug 24, 2020 at 11:48:41AM -0400, Brian Foster wrote: > > > > On Mon, Aug 24, 2020 at 04:04:17PM +0100, Christoph Hellwig wrote: > > > > > On Mon, Aug 24, 2020 at 10:28:23AM -0400, Brian Foster wrote: > > > > > > Do I understand the current code (__bio_try_merge_page() -> > > > > > > page_is_mergeable()) correctly in that we're checking for physical page > > > > > > contiguity and not necessarily requiring a new bio_vec per physical > > > > > > page? > > > > > > > > > > > > > > > Yes. > > > > > > > > > > > > > Ok. I also realize now that this occurs on a kernel without commit > > > > 07173c3ec276 ("block: enable multipage bvecs"). That is probably a > > > > contributing factor, but it's not clear to me whether it's feasible to > > > > backport whatever supporting infrastructure is required for that > > > > mechanism to work (I suspect not). > > > > > > > > > > With regard to Dave's earlier point around seeing excessively sized bio > > > > > > chains.. If I set up a large memory box with high dirty mem ratios and > > > > > > do contiguous buffered overwrites over a 32GB range followed by fsync, I > > > > > > can see upwards of 1GB per bio and thus chains on the order of 32+ bios > > > > > > for the entire write. If I play games with how the buffered overwrite is > > > > > > submitted (i.e., in reverse) however, then I can occasionally reproduce > > > > > > a ~32GB chain of ~32k bios, which I think is what leads to problems in > > > > > > I/O completion on some systems. Granted, I don't reproduce soft lockup > > > > > > issues on my system with that behavior, so perhaps there's more to that > > > > > > particular issue. > > > > > > > > > > > > Regardless, it seems reasonable to me to at least have a conservative > > > > > > limit on the length of an ioend bio chain. Would anybody object to > > > > > > iomap_ioend growing a chain counter and perhaps forcing into a new ioend > > > > > > if we chain something like more than 1k bios at once? > > > > > > > > > > So what exactly is the problem of processing a long chain in the > > > > > workqueue vs multiple small chains? Maybe we need a cond_resched() > > > > > here and there, but I don't see how we'd substantially change behavior. > > > > > > > > > > > > > The immediate problem is a watchdog lockup detection in bio completion: > > > > > > > > NMI watchdog: Watchdog detected hard LOCKUP on cpu 25 > > > > > > > > This effectively lands at the following segment of iomap_finish_ioend(): > > > > > > > > ... > > > > /* walk each page on bio, ending page IO on them */ > > > > bio_for_each_segment_all(bv, bio, iter_all) > > > > iomap_finish_page_writeback(inode, bv->bv_page, error); > > > > > > > > I suppose we could add a cond_resched(), but is that safe directly > > > > inside of a ->bi_end_io() handler? Another option could be to dump large > > > > chains into the completion workqueue, but we may still need to track the > > > > length to do that. Thoughts? 
> > > > > > We have ioend completion merging that will run the compeltion once > > > for all the pending ioend completions on that inode. IOWs, we do not > > > need to build huge chains at submission time to batch up completions > > > efficiently. However, huge bio chains at submission time do cause > > > issues with writeback fairness, pinning GBs of ram as unreclaimable > > > for seconds because they are queued for completion while we are > > > still submitting the bio chain and submission is being throttled by > > > the block layer writeback throttle, etc. Not to mention the latency > > > of stable pages in a situation like this - a mmap() write fault > > > could stall for many seconds waiting for a huge bio chain to finish > > > submission and run completion processing even when the IO for the > > > given page we faulted on was completed before the page fault > > > occurred... > > > > > > Hence I think we really do need to cap the length of the bio > > > chains here so that we start completing and ending page writeback on > > > large writeback ranges long before the writeback code finishes > > > submitting the range it was asked to write back. > > > > > > > Ming pointed out separately that limiting the bio chain itself might not > > be enough because with multipage bvecs, we can effectively capture the > > same number of pages in much fewer bios. Given that, what do you think > > about something like the patch below to limit ioend size? This > > effectively limits the number of pages per ioend regardless of whether > > in-core state results in a small chain of dense bios or a large chain of > > smaller bios, without requiring any new explicit page count tracking. > > Hello Brian, > > This patch looks fine. > > However, I am wondering why iomap has to chain bios in one ioend, and why not > submit each bio in usual way just like what fs/direct-io.c does? Then each bio > can complete the pages in its own .bi_end_io(). > I think it's mainly for efficiency and code simplicity reasons. The ioend describes a contiguous range of blocks with the same io type (written, unwritten, append, etc.), so whatever post-completion action might be required for a particular ioend (i.e. unwritten conversion) shouldn't execute until I/O completes on the entire range. I believe this goes back to XFS commit 0e51a8e191db ("xfs: optimize bio handling in the buffer writeback path"), which basically reimplemented similar, custom ioend behavior to rely on bio chains and was eventually lifted from XFS into iomap. Brian > > thanks, > Ming > ^ permalink raw reply [flat|nested] 27+ messages in thread
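A toy model of the point being made here, with invented names rather than iomap's API: an action that covers the whole ioend range, such as unwritten-to-written conversion, may only run once the last member bio completes, which is what chaining bios under one shared ioend provides without per-bio bookkeeping.

#include <stdio.h>

struct toy_ioend {
	long long offset, size;		/* one contiguous range, one io type */
	int bios_outstanding;		/* member bios still in flight */
};

/* Stand-in for a per-bio completion: only the last one may act on the range. */
static void toy_bio_end_io(struct toy_ioend *ioend)
{
	if (--ioend->bios_outstanding == 0)
		printf("all bios done: convert unwritten [%lld, %lld) in one go\n",
		       ioend->offset, ioend->offset + ioend->size);
}

int main(void)
{
	/* Three chained bios covering one 3 MiB unwritten range. */
	struct toy_ioend ioend = { .offset = 0, .size = 3 << 20, .bios_outstanding = 3 };

	for (int i = 0; i < 3; i++)
		toy_bio_end_io(&ioend);
	return 0;
}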
* Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). 2020-08-25 14:49 ` Brian Foster 2020-08-31 4:01 ` Ming Lei @ 2020-09-16 0:12 ` Darrick J. Wong 2020-09-16 8:45 ` Christoph Hellwig 1 sibling, 1 reply; 27+ messages in thread From: Darrick J. Wong @ 2020-09-16 0:12 UTC (permalink / raw) To: Brian Foster Cc: Dave Chinner, Christoph Hellwig, Ritesh Harjani, Anju T Sudhakar, linux-xfs, linux-fsdevel, linux-kernel, willy, minlei On Tue, Aug 25, 2020 at 10:49:17AM -0400, Brian Foster wrote: > cc Ming > > On Tue, Aug 25, 2020 at 10:42:03AM +1000, Dave Chinner wrote: > > On Mon, Aug 24, 2020 at 11:48:41AM -0400, Brian Foster wrote: > > > On Mon, Aug 24, 2020 at 04:04:17PM +0100, Christoph Hellwig wrote: > > > > On Mon, Aug 24, 2020 at 10:28:23AM -0400, Brian Foster wrote: > > > > > Do I understand the current code (__bio_try_merge_page() -> > > > > > page_is_mergeable()) correctly in that we're checking for physical page > > > > > contiguity and not necessarily requiring a new bio_vec per physical > > > > > page? > > > > > > > > > > > > Yes. > > > > > > > > > > Ok. I also realize now that this occurs on a kernel without commit > > > 07173c3ec276 ("block: enable multipage bvecs"). That is probably a > > > contributing factor, but it's not clear to me whether it's feasible to > > > backport whatever supporting infrastructure is required for that > > > mechanism to work (I suspect not). > > > > > > > > With regard to Dave's earlier point around seeing excessively sized bio > > > > > chains.. If I set up a large memory box with high dirty mem ratios and > > > > > do contiguous buffered overwrites over a 32GB range followed by fsync, I > > > > > can see upwards of 1GB per bio and thus chains on the order of 32+ bios > > > > > for the entire write. If I play games with how the buffered overwrite is > > > > > submitted (i.e., in reverse) however, then I can occasionally reproduce > > > > > a ~32GB chain of ~32k bios, which I think is what leads to problems in > > > > > I/O completion on some systems. Granted, I don't reproduce soft lockup > > > > > issues on my system with that behavior, so perhaps there's more to that > > > > > particular issue. > > > > > > > > > > Regardless, it seems reasonable to me to at least have a conservative > > > > > limit on the length of an ioend bio chain. Would anybody object to > > > > > iomap_ioend growing a chain counter and perhaps forcing into a new ioend > > > > > if we chain something like more than 1k bios at once? > > > > > > > > So what exactly is the problem of processing a long chain in the > > > > workqueue vs multiple small chains? Maybe we need a cond_resched() > > > > here and there, but I don't see how we'd substantially change behavior. > > > > > > > > > > The immediate problem is a watchdog lockup detection in bio completion: > > > > > > NMI watchdog: Watchdog detected hard LOCKUP on cpu 25 > > > > > > This effectively lands at the following segment of iomap_finish_ioend(): > > > > > > ... > > > /* walk each page on bio, ending page IO on them */ > > > bio_for_each_segment_all(bv, bio, iter_all) > > > iomap_finish_page_writeback(inode, bv->bv_page, error); > > > > > > I suppose we could add a cond_resched(), but is that safe directly > > > inside of a ->bi_end_io() handler? Another option could be to dump large > > > chains into the completion workqueue, but we may still need to track the > > > length to do that. Thoughts? 
> > > > We have ioend completion merging that will run the compeltion once > > for all the pending ioend completions on that inode. IOWs, we do not > > need to build huge chains at submission time to batch up completions > > efficiently. However, huge bio chains at submission time do cause > > issues with writeback fairness, pinning GBs of ram as unreclaimable > > for seconds because they are queued for completion while we are > > still submitting the bio chain and submission is being throttled by > > the block layer writeback throttle, etc. Not to mention the latency > > of stable pages in a situation like this - a mmap() write fault > > could stall for many seconds waiting for a huge bio chain to finish > > submission and run completion processing even when the IO for the > > given page we faulted on was completed before the page fault > > occurred... > > > > Hence I think we really do need to cap the length of the bio > > chains here so that we start completing and ending page writeback on > > large writeback ranges long before the writeback code finishes > > submitting the range it was asked to write back. > > > > Ming pointed out separately that limiting the bio chain itself might not > be enough because with multipage bvecs, we can effectively capture the > same number of pages in much fewer bios. Given that, what do you think > about something like the patch below to limit ioend size? This > effectively limits the number of pages per ioend regardless of whether > in-core state results in a small chain of dense bios or a large chain of > smaller bios, without requiring any new explicit page count tracking. > > Brian Dave was asking on IRC if I was going to pull this patch in. I'm unsure of its status (other than it hasn't been sent as a proper [PATCH]) so I wonder, is this necessary, and if so, can it be cleaned up and submitted? --D > --- 8< --- > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > index 6ae98d3cb157..4aa96705ffd7 100644 > --- a/fs/iomap/buffered-io.c > +++ b/fs/iomap/buffered-io.c > @@ -1301,7 +1301,7 @@ iomap_chain_bio(struct bio *prev) > > static bool > iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset, > - sector_t sector) > + unsigned len, sector_t sector) > { > if ((wpc->iomap.flags & IOMAP_F_SHARED) != > (wpc->ioend->io_flags & IOMAP_F_SHARED)) > @@ -1312,6 +1312,8 @@ iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset, > return false; > if (sector != bio_end_sector(wpc->ioend->io_bio)) > return false; > + if (wpc->ioend->io_size + len > IOEND_MAX_IOSIZE) > + return false; > return true; > } > > @@ -1329,7 +1331,7 @@ iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page, > unsigned poff = offset & (PAGE_SIZE - 1); > bool merged, same_page = false; > > - if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, sector)) { > + if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, len, sector)) { > if (wpc->ioend) > list_add(&wpc->ioend->io_list, iolist); > wpc->ioend = iomap_alloc_ioend(inode, wpc, offset, sector, wbc); > diff --git a/include/linux/iomap.h b/include/linux/iomap.h > index 4d1d3c3469e9..5d1b1a08ec96 100644 > --- a/include/linux/iomap.h > +++ b/include/linux/iomap.h > @@ -200,6 +200,8 @@ struct iomap_ioend { > struct bio io_inline_bio; /* MUST BE LAST! */ > }; > > +#define IOEND_MAX_IOSIZE (262144 << PAGE_SHIFT) > + > struct iomap_writeback_ops { > /* > * Required, maps the blocks so that writeback can be performed on > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). 2020-09-16 0:12 ` Darrick J. Wong @ 2020-09-16 8:45 ` Christoph Hellwig 2020-09-16 13:07 ` Brian Foster 0 siblings, 1 reply; 27+ messages in thread From: Christoph Hellwig @ 2020-09-16 8:45 UTC (permalink / raw) To: Darrick J. Wong Cc: Brian Foster, Dave Chinner, Christoph Hellwig, Ritesh Harjani, Anju T Sudhakar, linux-xfs, linux-fsdevel, linux-kernel, willy, minlei On Tue, Sep 15, 2020 at 05:12:42PM -0700, Darrick J. Wong wrote: > On Tue, Aug 25, 2020 at 10:49:17AM -0400, Brian Foster wrote: > > cc Ming > > > > On Tue, Aug 25, 2020 at 10:42:03AM +1000, Dave Chinner wrote: > > > On Mon, Aug 24, 2020 at 11:48:41AM -0400, Brian Foster wrote: > > > > On Mon, Aug 24, 2020 at 04:04:17PM +0100, Christoph Hellwig wrote: > > > > > On Mon, Aug 24, 2020 at 10:28:23AM -0400, Brian Foster wrote: > > > > > > Do I understand the current code (__bio_try_merge_page() -> > > > > > > page_is_mergeable()) correctly in that we're checking for physical page > > > > > > contiguity and not necessarily requiring a new bio_vec per physical > > > > > > page? > > > > > > > > > > > > > > > Yes. > > > > > > > > > > > > > Ok. I also realize now that this occurs on a kernel without commit > > > > 07173c3ec276 ("block: enable multipage bvecs"). That is probably a > > > > contributing factor, but it's not clear to me whether it's feasible to > > > > backport whatever supporting infrastructure is required for that > > > > mechanism to work (I suspect not). > > > > > > > > > > With regard to Dave's earlier point around seeing excessively sized bio > > > > > > chains.. If I set up a large memory box with high dirty mem ratios and > > > > > > do contiguous buffered overwrites over a 32GB range followed by fsync, I > > > > > > can see upwards of 1GB per bio and thus chains on the order of 32+ bios > > > > > > for the entire write. If I play games with how the buffered overwrite is > > > > > > submitted (i.e., in reverse) however, then I can occasionally reproduce > > > > > > a ~32GB chain of ~32k bios, which I think is what leads to problems in > > > > > > I/O completion on some systems. Granted, I don't reproduce soft lockup > > > > > > issues on my system with that behavior, so perhaps there's more to that > > > > > > particular issue. > > > > > > > > > > > > Regardless, it seems reasonable to me to at least have a conservative > > > > > > limit on the length of an ioend bio chain. Would anybody object to > > > > > > iomap_ioend growing a chain counter and perhaps forcing into a new ioend > > > > > > if we chain something like more than 1k bios at once? > > > > > > > > > > So what exactly is the problem of processing a long chain in the > > > > > workqueue vs multiple small chains? Maybe we need a cond_resched() > > > > > here and there, but I don't see how we'd substantially change behavior. > > > > > > > > > > > > > The immediate problem is a watchdog lockup detection in bio completion: > > > > > > > > NMI watchdog: Watchdog detected hard LOCKUP on cpu 25 > > > > > > > > This effectively lands at the following segment of iomap_finish_ioend(): > > > > > > > > ... > > > > /* walk each page on bio, ending page IO on them */ > > > > bio_for_each_segment_all(bv, bio, iter_all) > > > > iomap_finish_page_writeback(inode, bv->bv_page, error); > > > > > > > > I suppose we could add a cond_resched(), but is that safe directly > > > > inside of a ->bi_end_io() handler? 
Another option could be to dump large > > > > chains into the completion workqueue, but we may still need to track the > > > > length to do that. Thoughts? > > > > > > We have ioend completion merging that will run the compeltion once > > > for all the pending ioend completions on that inode. IOWs, we do not > > > need to build huge chains at submission time to batch up completions > > > efficiently. However, huge bio chains at submission time do cause > > > issues with writeback fairness, pinning GBs of ram as unreclaimable > > > for seconds because they are queued for completion while we are > > > still submitting the bio chain and submission is being throttled by > > > the block layer writeback throttle, etc. Not to mention the latency > > > of stable pages in a situation like this - a mmap() write fault > > > could stall for many seconds waiting for a huge bio chain to finish > > > submission and run completion processing even when the IO for the > > > given page we faulted on was completed before the page fault > > > occurred... > > > > > > Hence I think we really do need to cap the length of the bio > > > chains here so that we start completing and ending page writeback on > > > large writeback ranges long before the writeback code finishes > > > submitting the range it was asked to write back. > > > > > > > Ming pointed out separately that limiting the bio chain itself might not > > be enough because with multipage bvecs, we can effectively capture the > > same number of pages in much fewer bios. Given that, what do you think > > about something like the patch below to limit ioend size? This > > effectively limits the number of pages per ioend regardless of whether > > in-core state results in a small chain of dense bios or a large chain of > > smaller bios, without requiring any new explicit page count tracking. > > > > Brian > > Dave was asking on IRC if I was going to pull this patch in. I'm unsure > of its status (other than it hasn't been sent as a proper [PATCH]) so I > wonder, is this necessary, and if so, can it be cleaned up and > submitted? Maybe it is lost somewhere, but what is the point of this patch? What does the magic number try to represent? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). 2020-09-16 8:45 ` Christoph Hellwig @ 2020-09-16 13:07 ` Brian Foster 2020-09-17 8:04 ` Christoph Hellwig 0 siblings, 1 reply; 27+ messages in thread From: Brian Foster @ 2020-09-16 13:07 UTC (permalink / raw) To: Christoph Hellwig Cc: Darrick J. Wong, Dave Chinner, Ritesh Harjani, Anju T Sudhakar, linux-xfs, linux-fsdevel, linux-kernel, willy, minlei On Wed, Sep 16, 2020 at 09:45:10AM +0100, Christoph Hellwig wrote: > On Tue, Sep 15, 2020 at 05:12:42PM -0700, Darrick J. Wong wrote: > > On Tue, Aug 25, 2020 at 10:49:17AM -0400, Brian Foster wrote: > > > cc Ming > > > > > > On Tue, Aug 25, 2020 at 10:42:03AM +1000, Dave Chinner wrote: > > > > On Mon, Aug 24, 2020 at 11:48:41AM -0400, Brian Foster wrote: > > > > > On Mon, Aug 24, 2020 at 04:04:17PM +0100, Christoph Hellwig wrote: > > > > > > On Mon, Aug 24, 2020 at 10:28:23AM -0400, Brian Foster wrote: > > > > > > > Do I understand the current code (__bio_try_merge_page() -> > > > > > > > page_is_mergeable()) correctly in that we're checking for physical page > > > > > > > contiguity and not necessarily requiring a new bio_vec per physical > > > > > > > page? > > > > > > > > > > > > > > > > > > Yes. > > > > > > > > > > > > > > > > Ok. I also realize now that this occurs on a kernel without commit > > > > > 07173c3ec276 ("block: enable multipage bvecs"). That is probably a > > > > > contributing factor, but it's not clear to me whether it's feasible to > > > > > backport whatever supporting infrastructure is required for that > > > > > mechanism to work (I suspect not). > > > > > > > > > > > > With regard to Dave's earlier point around seeing excessively sized bio > > > > > > > chains.. If I set up a large memory box with high dirty mem ratios and > > > > > > > do contiguous buffered overwrites over a 32GB range followed by fsync, I > > > > > > > can see upwards of 1GB per bio and thus chains on the order of 32+ bios > > > > > > > for the entire write. If I play games with how the buffered overwrite is > > > > > > > submitted (i.e., in reverse) however, then I can occasionally reproduce > > > > > > > a ~32GB chain of ~32k bios, which I think is what leads to problems in > > > > > > > I/O completion on some systems. Granted, I don't reproduce soft lockup > > > > > > > issues on my system with that behavior, so perhaps there's more to that > > > > > > > particular issue. > > > > > > > > > > > > > > Regardless, it seems reasonable to me to at least have a conservative > > > > > > > limit on the length of an ioend bio chain. Would anybody object to > > > > > > > iomap_ioend growing a chain counter and perhaps forcing into a new ioend > > > > > > > if we chain something like more than 1k bios at once? > > > > > > > > > > > > So what exactly is the problem of processing a long chain in the > > > > > > workqueue vs multiple small chains? Maybe we need a cond_resched() > > > > > > here and there, but I don't see how we'd substantially change behavior. > > > > > > > > > > > > > > > > The immediate problem is a watchdog lockup detection in bio completion: > > > > > > > > > > NMI watchdog: Watchdog detected hard LOCKUP on cpu 25 > > > > > > > > > > This effectively lands at the following segment of iomap_finish_ioend(): > > > > > > > > > > ... 
> > > > > /* walk each page on bio, ending page IO on them */ > > > > > bio_for_each_segment_all(bv, bio, iter_all) > > > > > iomap_finish_page_writeback(inode, bv->bv_page, error); > > > > > > > > > > I suppose we could add a cond_resched(), but is that safe directly > > > > > inside of a ->bi_end_io() handler? Another option could be to dump large > > > > > chains into the completion workqueue, but we may still need to track the > > > > > length to do that. Thoughts? > > > > > > > > We have ioend completion merging that will run the compeltion once > > > > for all the pending ioend completions on that inode. IOWs, we do not > > > > need to build huge chains at submission time to batch up completions > > > > efficiently. However, huge bio chains at submission time do cause > > > > issues with writeback fairness, pinning GBs of ram as unreclaimable > > > > for seconds because they are queued for completion while we are > > > > still submitting the bio chain and submission is being throttled by > > > > the block layer writeback throttle, etc. Not to mention the latency > > > > of stable pages in a situation like this - a mmap() write fault > > > > could stall for many seconds waiting for a huge bio chain to finish > > > > submission and run completion processing even when the IO for the > > > > given page we faulted on was completed before the page fault > > > > occurred... > > > > > > > > Hence I think we really do need to cap the length of the bio > > > > chains here so that we start completing and ending page writeback on > > > > large writeback ranges long before the writeback code finishes > > > > submitting the range it was asked to write back. > > > > > > > > > > Ming pointed out separately that limiting the bio chain itself might not > > > be enough because with multipage bvecs, we can effectively capture the > > > same number of pages in much fewer bios. Given that, what do you think > > > about something like the patch below to limit ioend size? This > > > effectively limits the number of pages per ioend regardless of whether > > > in-core state results in a small chain of dense bios or a large chain of > > > smaller bios, without requiring any new explicit page count tracking. > > > > > > Brian > > > > Dave was asking on IRC if I was going to pull this patch in. I'm unsure > > of its status (other than it hasn't been sent as a proper [PATCH]) so I > > wonder, is this necessary, and if so, can it be cleaned up and > > submitted? > I was waiting on some feedback from a few different angles before posting a proper patch.. > Maybe it is lost somewhere, but what is the point of this patch? > What does the magic number try to represent? > Dave described the main purpose earlier in this thread [1]. The initial motivation is that we've had downstream reports of soft lockup problems in writeback bio completion down in the bio -> bvec loop of iomap_finish_ioend() that has to finish writeback on each individual page of insanely large bios and/or chains. We've also had an upstream reports of a similar problem on linux-xfs [2]. The magic number itself was just pulled out of a hat. I picked it because it seemed conservative enough to still allow large contiguous bios (1GB w/ 4k pages) while hopefully preventing I/O completion problems, but was hoping for some feedback on that bit if the general approach was acceptable. I was also waiting for some feedback on either of the two users who reported the problem but I don't think I've heard back on that yet... 
Brian [1] https://lore.kernel.org/linux-fsdevel/20200821215358.GG7941@dread.disaster.area/ [2] https://lore.kernel.org/linux-xfs/alpine.LRH.2.02.2008311513150.7870@file01.intranet.prod.int.rdu2.redhat.com/ ^ permalink raw reply [flat|nested] 27+ messages in thread
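For a quick sanity check of what the proposed cap amounts to (not part of the patch itself, just arithmetic on the IOEND_MAX_IOSIZE definition quoted earlier): IOEND_MAX_IOSIZE = (262144 << PAGE_SHIFT), i.e. 2^18 pages worth of bytes per ioend regardless of page size:

	4 KiB pages (PAGE_SHIFT = 12):  262144 * 4096  = 1 GiB of I/O per ioend
	64 KiB pages (PAGE_SHIFT = 16): 262144 * 65536 = 16 GiB of I/O per ioend

which matches the "1GB w/ 4k pages" figure above: the cap bounds the per-ioend page walk in iomap_finish_ioend() at roughly 256k pages.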
* Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). 2020-09-16 13:07 ` Brian Foster @ 2020-09-17 8:04 ` Christoph Hellwig 2020-09-17 10:42 ` Brian Foster 2020-09-17 23:13 ` Ming Lei 0 siblings, 2 replies; 27+ messages in thread From: Christoph Hellwig @ 2020-09-17 8:04 UTC (permalink / raw) To: Brian Foster Cc: Christoph Hellwig, Darrick J. Wong, Dave Chinner, Ritesh Harjani, Anju T Sudhakar, linux-xfs, linux-fsdevel, linux-kernel, willy, minlei On Wed, Sep 16, 2020 at 09:07:14AM -0400, Brian Foster wrote: > Dave described the main purpose earlier in this thread [1]. The initial > motivation is that we've had downstream reports of soft lockup problems > in writeback bio completion down in the bio -> bvec loop of > iomap_finish_ioend() that has to finish writeback on each individual > page of insanely large bios and/or chains. We've also had an upstream > reports of a similar problem on linux-xfs [2]. > > The magic number itself was just pulled out of a hat. I picked it > because it seemed conservative enough to still allow large contiguous > bios (1GB w/ 4k pages) while hopefully preventing I/O completion > problems, but was hoping for some feedback on that bit if the general > approach was acceptable. I was also waiting for some feedback on either > of the two users who reported the problem but I don't think I've heard > back on that yet... I think the saner answer is to always run large completions in the workqueue, and add a bunch of cond_resched() calls, rather than arbitrarily breaking up the I/O size. ^ permalink raw reply [flat|nested] 27+ messages in thread
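A minimal sketch of the cond_resched() half of that suggestion, assuming the iomap_finish_ioends() loop over merged ioends keeps roughly its current mainline shape (illustrative only, not a tested or posted patch; the "always run large completions in the workqueue" half would additionally require routing big ioends away from the inline ->bi_end_io() completion):

	/*
	 * Give up the CPU between ioend completions; only safe because this
	 * function is called from the ioend workqueue, never directly from
	 * bio completion.
	 */
	void
	iomap_finish_ioends(struct iomap_ioend *ioend, int error)
	{
		struct list_head tmp;

		list_replace_init(&ioend->io_list, &tmp);
		iomap_finish_ioend(ioend, error);

		while (!list_empty(&tmp)) {
			cond_resched();
			ioend = list_first_entry(&tmp, struct iomap_ioend, io_list);
			list_del_init(&ioend->io_list);
			iomap_finish_ioend(ioend, error);
		}
	}

A reschedule point here only helps when many ioends have been merged; the reported lockups happen inside a single huge ioend's bio walk, so a similar cond_resched() would likely also be needed per bio inside iomap_finish_ioend() once that path is guaranteed to run in process context.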
* Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). 2020-09-17 8:04 ` Christoph Hellwig @ 2020-09-17 10:42 ` Brian Foster 2020-09-17 14:48 ` Christoph Hellwig 2020-09-17 23:13 ` Ming Lei 1 sibling, 1 reply; 27+ messages in thread From: Brian Foster @ 2020-09-17 10:42 UTC (permalink / raw) To: Christoph Hellwig Cc: Darrick J. Wong, Dave Chinner, Ritesh Harjani, Anju T Sudhakar, linux-xfs, linux-fsdevel, linux-kernel, willy, minlei On Thu, Sep 17, 2020 at 09:04:55AM +0100, Christoph Hellwig wrote: > On Wed, Sep 16, 2020 at 09:07:14AM -0400, Brian Foster wrote: > > Dave described the main purpose earlier in this thread [1]. The initial > > motivation is that we've had downstream reports of soft lockup problems > > in writeback bio completion down in the bio -> bvec loop of > > iomap_finish_ioend() that has to finish writeback on each individual > > page of insanely large bios and/or chains. We've also had an upstream > > reports of a similar problem on linux-xfs [2]. > > > > The magic number itself was just pulled out of a hat. I picked it > > because it seemed conservative enough to still allow large contiguous > > bios (1GB w/ 4k pages) while hopefully preventing I/O completion > > problems, but was hoping for some feedback on that bit if the general > > approach was acceptable. I was also waiting for some feedback on either > > of the two users who reported the problem but I don't think I've heard > > back on that yet... > > I think the saner answer is to always run large completions in the > workqueue, and add a bunch of cond_resched() calls, rather than > arbitrarily breaking up the I/O size. > That wouldn't address the latency concern Dave brought up. That said, I have no issue with this as a targeted solution for the softlockup issue. iomap_finish_ioend[s]() is common code for both the workqueue and ->bi_end_io() contexts so that would require either some kind of context detection (and my understanding is in_atomic() is unreliable/frowned upon) or a new "atomic" parameter through iomap_finish_ioend[s]() to indicate whether it's safe to reschedule. Preference? Brian ^ permalink raw reply [flat|nested] 27+ messages in thread
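A rough sketch of that flag-passing variant, built on the iomap_finish_ioend() structure quoted earlier in the thread (names and plumbing are illustrative, error reporting is trimmed, and this was never posted as a patch):

	static void
	iomap_finish_ioend(struct iomap_ioend *ioend, int error, bool atomic)
	{
		struct inode *inode = ioend->io_inode;
		struct bio *bio = &ioend->io_inline_bio;
		struct bio *last = ioend->io_bio, *next;

		for (bio = &ioend->io_inline_bio; bio; bio = next) {
			struct bio_vec *bv;
			struct bvec_iter_all iter_all;

			/* the last bio's bi_private points back at the ioend */
			next = (bio == last) ? NULL : bio->bi_private;

			/* walk each page on bio, ending page IO on them */
			bio_for_each_segment_all(bv, bio, iter_all)
				iomap_finish_page_writeback(inode, bv->bv_page, error);
			bio_put(bio);

			/* only callers in process context may reschedule */
			if (!atomic)
				cond_resched();
		}
	}

iomap_writepage_end_bio(), the ->bi_end_io() handler, would pass atomic = true; the workqueue-driven iomap_finish_ioends() path would pass atomic = false. The awkward part Brian points out is exactly this plumbing: the completion code cannot reliably detect its own context, so every caller has to say whether sleeping is allowed.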
* Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). 2020-09-17 10:42 ` Brian Foster @ 2020-09-17 14:48 ` Christoph Hellwig 2020-09-17 21:33 ` Darrick J. Wong 0 siblings, 1 reply; 27+ messages in thread From: Christoph Hellwig @ 2020-09-17 14:48 UTC (permalink / raw) To: Brian Foster Cc: Christoph Hellwig, Darrick J. Wong, Dave Chinner, Ritesh Harjani, Anju T Sudhakar, linux-xfs, linux-fsdevel, linux-kernel, willy, minlei On Thu, Sep 17, 2020 at 06:42:19AM -0400, Brian Foster wrote: > That wouldn't address the latency concern Dave brought up. That said, I > have no issue with this as a targeted solution for the softlockup issue. > iomap_finish_ioend[s]() is common code for both the workqueue and > ->bi_end_io() contexts so that would require either some kind of context > detection (and my understanding is in_atomic() is unreliable/frowned > upon) or a new "atomic" parameter through iomap_finish_ioend[s]() to > indicate whether it's safe to reschedule. Preference? True, it would not help with latency. But then again the latency should be controlled by the writeback code not doing giant writebacks to start with, shouldn't it? Any XFS/iomap specific limit also would not help with the block layer merging bios. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). 2020-09-17 14:48 ` Christoph Hellwig @ 2020-09-17 21:33 ` Darrick J. Wong 0 siblings, 0 replies; 27+ messages in thread From: Darrick J. Wong @ 2020-09-17 21:33 UTC (permalink / raw) To: Christoph Hellwig Cc: Brian Foster, Dave Chinner, Ritesh Harjani, Anju T Sudhakar, linux-xfs, linux-fsdevel, linux-kernel, willy, minlei On Thu, Sep 17, 2020 at 03:48:04PM +0100, Christoph Hellwig wrote: > On Thu, Sep 17, 2020 at 06:42:19AM -0400, Brian Foster wrote: > > That wouldn't address the latency concern Dave brought up. That said, I > > have no issue with this as a targeted solution for the softlockup issue. > > iomap_finish_ioend[s]() is common code for both the workqueue and > > ->bi_end_io() contexts so that would require either some kind of context > > detection (and my understanding is in_atomic() is unreliable/frowned > > upon) or a new "atomic" parameter through iomap_finish_ioend[s]() to > > indicate whether it's safe to reschedule. Preference? > > True, it would not help with latency. But then again the latency > should be controlled by the writeback code not doing giant writebacks > to start with, shouldn't it? > > Any XFS/iomap specific limit also would not help with the block layer > merging bios. /me hasn't totally been following this thread, but iomap will also aggregate the ioend completions; do we need to cap that to keep latencies down? I was assuming that amortization was always favorable, but maybe not? --D ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). 2020-09-17 8:04 ` Christoph Hellwig 2020-09-17 10:42 ` Brian Foster @ 2020-09-17 23:13 ` Ming Lei 1 sibling, 0 replies; 27+ messages in thread From: Ming Lei @ 2020-09-17 23:13 UTC (permalink / raw) To: Christoph Hellwig Cc: Brian Foster, Darrick J. Wong, Dave Chinner, Ritesh Harjani, Anju T Sudhakar, linux-xfs, linux-fsdevel, linux-kernel, willy, minlei On Thu, Sep 17, 2020 at 09:04:55AM +0100, Christoph Hellwig wrote: > On Wed, Sep 16, 2020 at 09:07:14AM -0400, Brian Foster wrote: > > Dave described the main purpose earlier in this thread [1]. The initial > > motivation is that we've had downstream reports of soft lockup problems > > in writeback bio completion down in the bio -> bvec loop of > > iomap_finish_ioend() that has to finish writeback on each individual > > page of insanely large bios and/or chains. We've also had an upstream > > reports of a similar problem on linux-xfs [2]. > > > > The magic number itself was just pulled out of a hat. I picked it > > because it seemed conservative enough to still allow large contiguous > > bios (1GB w/ 4k pages) while hopefully preventing I/O completion > > problems, but was hoping for some feedback on that bit if the general > > approach was acceptable. I was also waiting for some feedback on either > > of the two users who reported the problem but I don't think I've heard > > back on that yet... > > I think the saner answer is to always run large completions in the > workqueue, and add a bunch of cond_resched() calls, rather than > arbitrarily breaking up the I/O size. Completing all of an ioend's pages in a single bio->bi_end_io() may pin too many pages for an unnecessarily long time, and adding cond_resched() can make that issue worse. thanks, Ming ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). 2020-08-20 23:11 ` Dave Chinner 2020-08-21 4:45 ` Ritesh Harjani @ 2020-08-21 6:01 ` Christoph Hellwig 1 sibling, 0 replies; 27+ messages in thread From: Christoph Hellwig @ 2020-08-21 6:01 UTC (permalink / raw) To: Dave Chinner Cc: Anju T Sudhakar, hch, darrick.wong, linux-xfs, linux-fsdevel, linux-kernel, willy, riteshh On Fri, Aug 21, 2020 at 09:11:40AM +1000, Dave Chinner wrote: > On Wed, Aug 19, 2020 at 03:58:41PM +0530, Anju T Sudhakar wrote: > > From: Ritesh Harjani <riteshh@linux.ibm.com> > > > > __bio_try_merge_page() may return same_page = 1 and merged = 0. > > This could happen when bio->bi_iter.bi_size + len > UINT_MAX. > > Ummm, silly question, but exactly how are we getting a bio that > large in ->writepages getting built? Even with 64kB pages, that's a > bio with 2^16 pages attached to it. We shouldn't be building single > bios in writeback that large - what storage hardware is allowing > such huge bios to be built? (i.e. can you dump all the values in > /sys/block/<dev>/queue/* for that device for us?) NVMe controllers should not have a problem with such huge I/O, especially if they support SGLs (extent-based I/O :)) instead of the default dumb PRP scheme. Higher-end SCSI controllers should also have huge limits. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). 2020-08-19 10:28 [PATCH] iomap: Fix the write_count in iomap_add_to_ioend() Anju T Sudhakar 2020-08-20 23:11 ` Dave Chinner @ 2020-08-21 6:07 ` Christoph Hellwig 2020-08-21 8:53 ` Ritesh Harjani 2020-08-21 14:49 ` Jens Axboe 2020-08-21 13:31 ` Matthew Wilcox 2 siblings, 2 replies; 27+ messages in thread From: Christoph Hellwig @ 2020-08-21 6:07 UTC (permalink / raw) To: Anju T Sudhakar Cc: hch, darrick.wong, linux-xfs, linux-fsdevel, linux-kernel, willy, riteshh, linux-block On Wed, Aug 19, 2020 at 03:58:41PM +0530, Anju T Sudhakar wrote: > From: Ritesh Harjani <riteshh@linux.ibm.com> > > __bio_try_merge_page() may return same_page = 1 and merged = 0. > This could happen when bio->bi_iter.bi_size + len > UINT_MAX. > Handle this case in iomap_add_to_ioend() by incrementing write_count. > This scenario mostly happens where we have too much dirty data accumulated. > > w/o the patch we hit below kernel warning, I think this is better fixed in the block layer rather than working around the problem in the callers. Something like this: diff --git a/block/bio.c b/block/bio.c index c63ba04bd62967..ef321cd1072e4e 100644 --- a/block/bio.c +++ b/block/bio.c @@ -879,8 +879,10 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page, struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1]; if (page_is_mergeable(bv, page, len, off, same_page)) { - if (bio->bi_iter.bi_size > UINT_MAX - len) + if (bio->bi_iter.bi_size > UINT_MAX - len) { + *same_page = false; return false; + } bv->bv_len += len; bio->bi_iter.bi_size += len; return true; ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). 2020-08-21 6:07 ` Christoph Hellwig @ 2020-08-21 8:53 ` Ritesh Harjani 2020-08-21 14:49 ` Jens Axboe 0 siblings, 0 replies; 27+ messages in thread From: Ritesh Harjani @ 2020-08-21 8:53 UTC (permalink / raw) To: Christoph Hellwig, Anju T Sudhakar, linux-block Cc: darrick.wong, linux-xfs, linux-fsdevel, linux-kernel, willy On 8/21/20 11:37 AM, Christoph Hellwig wrote: > On Wed, Aug 19, 2020 at 03:58:41PM +0530, Anju T Sudhakar wrote: >> From: Ritesh Harjani <riteshh@linux.ibm.com> >> >> __bio_try_merge_page() may return same_page = 1 and merged = 0. >> This could happen when bio->bi_iter.bi_size + len > UINT_MAX. >> Handle this case in iomap_add_to_ioend() by incrementing write_count. >> This scenario mostly happens where we have too much dirty data accumulated. >> >> w/o the patch we hit below kernel warning, > > I think this is better fixed in the block layer rather than working > around the problem in the callers. Something like this: > > diff --git a/block/bio.c b/block/bio.c > index c63ba04bd62967..ef321cd1072e4e 100644 > --- a/block/bio.c > +++ b/block/bio.c > @@ -879,8 +879,10 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page, > struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1]; > > if (page_is_mergeable(bv, page, len, off, same_page)) { > - if (bio->bi_iter.bi_size > UINT_MAX - len) > + if (bio->bi_iter.bi_size > UINT_MAX - len) { > + *same_page = false; > return false; > + } > bv->bv_len += len; > bio->bi_iter.bi_size += len; > return true; > Yes, we had thought of that. But our reasoning at the time was that maybe the API does return the right thing: same_page is true, but the page couldn't be merged, hence it returned ret = false. With that thought we fixed this in the caller. But I agree with you that with ret = false there is no meaning in same_page being true. OK, so let linux-block comment on whether the above also looks good. If yes, I can spin out an official patch with all the details. -ritesh ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). 2020-08-21 6:07 ` Christoph Hellwig 2020-08-21 8:53 ` Ritesh Harjani @ 2020-08-21 14:49 ` Jens Axboe 1 sibling, 0 replies; 27+ messages in thread From: Jens Axboe @ 2020-08-21 14:49 UTC (permalink / raw) To: Christoph Hellwig, Anju T Sudhakar Cc: darrick.wong, linux-xfs, linux-fsdevel, linux-kernel, willy, riteshh, linux-block On 8/21/20 12:07 AM, Christoph Hellwig wrote: > On Wed, Aug 19, 2020 at 03:58:41PM +0530, Anju T Sudhakar wrote: >> From: Ritesh Harjani <riteshh@linux.ibm.com> >> >> __bio_try_merge_page() may return same_page = 1 and merged = 0. >> This could happen when bio->bi_iter.bi_size + len > UINT_MAX. >> Handle this case in iomap_add_to_ioend() by incrementing write_count. >> This scenario mostly happens where we have too much dirty data accumulated. >> >> w/o the patch we hit below kernel warning, > > I think this is better fixed in the block layer rather than working > around the problem in the callers. Something like this: > > diff --git a/block/bio.c b/block/bio.c > index c63ba04bd62967..ef321cd1072e4e 100644 > --- a/block/bio.c > +++ b/block/bio.c > @@ -879,8 +879,10 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page, > struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1]; > > if (page_is_mergeable(bv, page, len, off, same_page)) { > - if (bio->bi_iter.bi_size > UINT_MAX - len) > + if (bio->bi_iter.bi_size > UINT_MAX - len) { > + *same_page = false; > return false; > + } > bv->bv_len += len; > bio->bi_iter.bi_size += len; > return true; > This looks good to me. -- Jens Axboe ^ permalink raw reply [flat|nested] 27+ messages in thread
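With the block-layer change Christoph proposed above, the __bio_try_merge_page() contract becomes: same_page can only be true if the merge actually happened. A hedged sketch of a caller relying on that contract (generic illustration, not code from any particular tree):

	bool same_page = false;

	if (!__bio_try_merge_page(bio, page, len, off, &same_page)) {
		/*
		 * Nothing was merged. After the fix above, same_page is
		 * guaranteed to be false here even when the merge was
		 * rejected only because bi_size would overflow UINT_MAX,
		 * so per-page accounting keyed off !same_page (as iomap's
		 * write_count is) stays correct without the caller-side
		 * workaround this thread started with.
		 */
		if (!bio_add_page(bio, page, len, off)) {
			/* bio full: chain/allocate a new bio and retry */
		}
	}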
* Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend(). 2020-08-19 10:28 [PATCH] iomap: Fix the write_count in iomap_add_to_ioend() Anju T Sudhakar 2020-08-20 23:11 ` Dave Chinner 2020-08-21 6:07 ` Christoph Hellwig @ 2020-08-21 13:31 ` Matthew Wilcox 2 siblings, 0 replies; 27+ messages in thread From: Matthew Wilcox @ 2020-08-21 13:31 UTC (permalink / raw) To: Anju T Sudhakar Cc: hch, darrick.wong, linux-xfs, linux-fsdevel, linux-kernel, riteshh On Wed, Aug 19, 2020 at 03:58:41PM +0530, Anju T Sudhakar wrote: > __bio_try_merge_page() may return same_page = 1 and merged = 0. > This could happen when bio->bi_iter.bi_size + len > UINT_MAX. > Handle this case in iomap_add_to_ioend() by incrementing write_count. One of the patches I have pending ignores same_page by just using the write_count as a byte count instead of a segment count. It's currently mixed into this patch but needs to be separated. http://git.infradead.org/users/willy/pagecache.git/commitdiff/0186d1dde949a458584c56b706fa8dfd252466ff (another patch does the same thing to the read count). ^ permalink raw reply [flat|nested] 27+ messages in thread
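The byte-count idea, sketched very roughly (this is not the patch at the URL above; the field and helper names below are made up for illustration): account outstanding writeback per page in bytes rather than in bio segments, so it stops mattering whether the block layer merged a range into an existing bvec or added a new one:

	/*
	 * Hypothetical byte-based accounting; "write_bytes_pending" is an
	 * illustrative name, not necessarily what the real patch uses.
	 */
	static void
	iomap_account_page_writeback(struct iomap_page *iop, unsigned int len)
	{
		if (iop)
			atomic_add(len, &iop->write_bytes_pending);
	}

	static void
	iomap_finish_page_writeback(struct inode *inode, struct page *page,
			int error, unsigned int len)
	{
		struct iomap_page *iop = to_iomap_page(page);

		if (error) {
			SetPageError(page);
			mapping_set_error(inode->i_mapping, -EIO);
		}

		/* end writeback only once every submitted byte has completed */
		if (!iop || atomic_sub_and_test(len, &iop->write_bytes_pending))
			end_page_writeback(page);
	}

with the completion loop passing bv->bv_len for each segment:

	bio_for_each_segment_all(bv, bio, iter_all)
		iomap_finish_page_writeback(inode, bv->bv_page, error,
				bv->bv_len);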
end of thread, other threads:[~2020-09-18 4:00 UTC | newest] Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-08-19 10:28 [PATCH] iomap: Fix the write_count in iomap_add_to_ioend() Anju T Sudhakar 2020-08-20 23:11 ` Dave Chinner 2020-08-21 4:45 ` Ritesh Harjani 2020-08-21 6:00 ` Christoph Hellwig 2020-08-21 9:09 ` Ritesh Harjani 2020-08-21 21:53 ` Dave Chinner 2020-08-22 13:13 ` Christoph Hellwig 2020-08-24 14:28 ` Brian Foster 2020-08-24 15:04 ` Christoph Hellwig 2020-08-24 15:48 ` Brian Foster 2020-08-25 0:42 ` Dave Chinner 2020-08-25 14:49 ` Brian Foster 2020-08-31 4:01 ` Ming Lei 2020-08-31 14:35 ` Brian Foster 2020-09-16 0:12 ` Darrick J. Wong 2020-09-16 8:45 ` Christoph Hellwig 2020-09-16 13:07 ` Brian Foster 2020-09-17 8:04 ` Christoph Hellwig 2020-09-17 10:42 ` Brian Foster 2020-09-17 14:48 ` Christoph Hellwig 2020-09-17 21:33 ` Darrick J. Wong 2020-09-17 23:13 ` Ming Lei 2020-08-21 6:01 ` Christoph Hellwig 2020-08-21 6:07 ` Christoph Hellwig 2020-08-21 8:53 ` Ritesh Harjani 2020-08-21 14:49 ` Jens Axboe 2020-08-21 13:31 ` Matthew Wilcox