LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH 0/6] congestion_wait() and GFP_NOFAIL
@ 2021-09-14  0:13 NeilBrown
  2021-09-14  0:13 ` [PATCH 2/6] MM: annotate congestion_wait() and wait_iff_congested() as ineffective NeilBrown
                   ` (5 more replies)
  0 siblings, 6 replies; 34+ messages in thread
From: NeilBrown @ 2021-09-14  0:13 UTC (permalink / raw)
  To: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman
  Cc: linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel

While working on an NFS issue recently I was informed (or maybe
"reminded") that congestion_wait() doesn't really do what we think it
does.   It is indistinguishable from schedule_timeout_uninterruptible().

Some current users for congestion_wait() would be better suited by
__GFP_NOFAIL.   In related discussions it was pointed out that the 
__GFP_NOFAIL documentation could usefully clarify the costs of its use.

So this set of patch addresses some of these issues.  The patches are
all independent and can safely be applied separately in different tress
as appropriate.

They:
 - add or improve documentation relating to these issues
 - make a tiny fix to the page_alloc_bulk_*
 - replace those calls to congestion_wait() which are simply waiting
   to retry a memory allocation.

These are the easy bits.  There are 5 calls to congestion_wait() and one
to wait_iff_congested() in mm/ which need consideration.  There are
multiple calls to congestion_wait in fs/, particularly fs/f2fs/ which
need to be addressed too.  I'll try to form an opinion about these in
coming weeks.

Thanks,
NeilBrown


---

NeilBrown (6):
      MM: improve documentation for __GFP_NOFAIL
      MM: annotate congestion_wait() and wait_iff_congested() as ineffective.
      EXT4: Remove ENOMEM/congestion_wait() loops.
      EXT4: remove congestion_wait from ext4_bio_write_page, and simplify
      XFS: remove congestion_wait() loop from kmem_alloc()
      XFS: remove congestion_wait() loop from xfs_buf_alloc_pages()


 fs/ext4/ext4.h              |  2 +-
 fs/ext4/ext4_jbd2.c         |  8 +++++-
 fs/ext4/extents.c           | 49 ++++++++++++++-----------------------
 fs/ext4/extents_status.c    | 35 ++++++++++++++------------
 fs/ext4/extents_status.h    |  2 +-
 fs/ext4/indirect.c          |  2 +-
 fs/ext4/inode.c             |  6 ++---
 fs/ext4/ioctl.c             |  4 +--
 fs/ext4/page-io.c           | 13 ++++------
 fs/ext4/super.c             |  2 +-
 fs/jbd2/transaction.c       |  8 +++---
 fs/xfs/kmem.c               | 16 +++---------
 fs/xfs/xfs_buf.c            |  6 ++---
 include/linux/backing-dev.h |  7 ++++++
 mm/backing-dev.c            |  9 +++++++
 15 files changed, 86 insertions(+), 83 deletions(-)

--
Signature


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/6] MM: improve documentation for __GFP_NOFAIL
  2021-09-14  0:13 [PATCH 0/6] congestion_wait() and GFP_NOFAIL NeilBrown
                   ` (2 preceding siblings ...)
  2021-09-14  0:13 ` [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops NeilBrown
@ 2021-09-14  0:13 ` NeilBrown
  2021-09-15 11:51   ` Michal Hocko
  2021-09-14  0:13 ` [PATCH 6/6] XFS: remove congestion_wait() loop from xfs_buf_alloc_pages() NeilBrown
  2021-09-14  0:13 ` [PATCH 4/6] EXT4: remove congestion_wait from ext4_bio_write_page, and simplify NeilBrown
  5 siblings, 1 reply; 34+ messages in thread
From: NeilBrown @ 2021-09-14  0:13 UTC (permalink / raw)
  To: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman
  Cc: linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel

__GFP_NOFAIL is documented both in gfp.h and memory-allocation.rst.
The details are not entirely consistent.

This patch ensures both places state that:
 - there is a cost potentially imposed on other subsystems
 - it should only be used when there is no real alternative
 - it is preferable to an endless loop
 - it is strongly discourages for costly-order allocations.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 Documentation/core-api/memory-allocation.rst |    9 ++++++++-
 include/linux/gfp.h                          |    4 ++++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/Documentation/core-api/memory-allocation.rst b/Documentation/core-api/memory-allocation.rst
index 5954ddf6ee13..9458ce72d31c 100644
--- a/Documentation/core-api/memory-allocation.rst
+++ b/Documentation/core-api/memory-allocation.rst
@@ -126,7 +126,14 @@ or another request.
 
   * ``GFP_KERNEL | __GFP_NOFAIL`` - overrides the default allocator behavior
     and all allocation requests will loop endlessly until they succeed.
-    This might be really dangerous especially for larger orders.
+    The allocator may provide access to memory that would otherwise be
+    reserved in order to satisfy this allocation which might adversely
+    affect other subsystems.  So it should only be used when there is no
+    reasonable failure policy and when the memory is likely to be freed
+    again in the near future.  Its use is strong discourage (via a
+    WARN_ON) for allocations larger than ``PAGE_ALLOC_COSTLY_ORDER``.
+    While this flag is best avoided, it is still preferable to endless
+    loops around the allocator.
 
 Selecting memory allocator
 ==========================
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 55b2ec1f965a..101479373738 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -209,6 +209,10 @@ struct vm_area_struct;
  * used only when there is no reasonable failure policy) but it is
  * definitely preferable to use the flag rather than opencode endless
  * loop around allocator.
+ * Use of this flag may provide access to memory which would otherwise be
+ * reserved.  As such it must be understood that there can be a cost imposed
+ * on other subsystems as well as the obvious cost of placing the calling
+ * thread in an uninterruptible indefinite wait.
  * Using this flag for costly allocations is _highly_ discouraged.
  */
 #define __GFP_IO	((__force gfp_t)___GFP_IO)



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 2/6] MM: annotate congestion_wait() and wait_iff_congested() as ineffective.
  2021-09-14  0:13 [PATCH 0/6] congestion_wait() and GFP_NOFAIL NeilBrown
@ 2021-09-14  0:13 ` NeilBrown
  2021-09-15 11:56   ` Michal Hocko
  2021-09-14  0:13 ` [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc() NeilBrown
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 34+ messages in thread
From: NeilBrown @ 2021-09-14  0:13 UTC (permalink / raw)
  To: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman
  Cc: linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel

Only 4 subsystems call set_bdi_congested() or clear_bdi_congested():
 block/pktcdvd, fs/ceph fs/fuse fs/nfs

It may make sense to use congestion_wait() or wait_iff_congested()
within these subsystems, but they have no value outside of these.

Add documentation comments to these functions to discourage further use.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/linux/backing-dev.h |    7 +++++++
 mm/backing-dev.c            |    9 +++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index ac7f231b8825..cc9513840351 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -153,6 +153,13 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
 	return wb->congested & cong_bits;
 }
 
+/* NOTE congestion_wait() and wait_iff_congested() are
+ * largely useless except as documentation.
+ * congestion_wait() will (almost) always wait for the given timeout.
+ * wait_iff_congested() will (almost) never wait, but will call
+ * cond_resched().
+ * Were possible an alternative waiting strategy should be found.
+ */
 long congestion_wait(int sync, long timeout);
 long wait_iff_congested(int sync, long timeout);
 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 4a9d4e27d0d9..53472ab38796 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -1023,6 +1023,11 @@ EXPORT_SYMBOL(set_bdi_congested);
  * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
  * write congestion.  If no backing_devs are congested then just wait for the
  * next write to be completed.
+ *
+ * NOTE: in the current implementation, hardly any backing_devs are ever
+ * marked as congested, and write-completion is rarely reported (see calls
+ * to clear_bdi_congested).  So this should not be assumed to ever wake before
+ * the timeout.
  */
 long congestion_wait(int sync, long timeout)
 {
@@ -1054,6 +1059,10 @@ EXPORT_SYMBOL(congestion_wait);
  * The return value is 0 if the sleep is for the full timeout. Otherwise,
  * it is the number of jiffies that were still remaining when the function
  * returned. return_value == timeout implies the function did not sleep.
+ *
+ * NOTE: in the current implementation, hardly any backing_devs are ever
+ * marked as congested, and write-completion is rarely reported (see calls
+ * to clear_bdi_congested).  So this should not be assumed to sleep at all.
  */
 long wait_iff_congested(int sync, long timeout)
 {



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops.
  2021-09-14  0:13 [PATCH 0/6] congestion_wait() and GFP_NOFAIL NeilBrown
  2021-09-14  0:13 ` [PATCH 2/6] MM: annotate congestion_wait() and wait_iff_congested() as ineffective NeilBrown
  2021-09-14  0:13 ` [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc() NeilBrown
@ 2021-09-14  0:13 ` NeilBrown
  2021-09-14 16:34   ` Mel Gorman
  2021-09-15  0:28   ` Theodore Ts'o
  2021-09-14  0:13 ` [PATCH 1/6] MM: improve documentation for __GFP_NOFAIL NeilBrown
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 34+ messages in thread
From: NeilBrown @ 2021-09-14  0:13 UTC (permalink / raw)
  To: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman
  Cc: linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel

Indefinite loops waiting for memory allocation are discouraged by
documentation in gfp.h which says the use of __GFP_NOFAIL that it

 is definitely preferable to use the flag rather than opencode endless
 loop around allocator.

Such loops that use congestion_wait() are particularly unwise as
congestion_wait() is indistinguishable from
schedule_timeout_uninterruptible() in practice - and should be
deprecated.

So this patch changes the two loops in ext4_ext_truncate() to use
__GFP_NOFAIL instead of looping.

As the allocation is multiple layers deeper in the call stack, this
requires passing the EXT4_EX_NOFAIL flag down and handling it in various
places.

Of particular interest is the ext4_journal_start family of calls which
can now have EXT4_EX_NOFAIL 'or'ed in to the 'type'.  This could be seen
as a blurring of types.  However 'type' is 8 bits, and EXT4_EX_NOFAIL is
a high bit, so it is safe in practice.

jbd2__journal_start() is enhanced so that the gfp_t flags passed are
used for *all* allocations.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/ext4/ext4.h           |    2 +-
 fs/ext4/ext4_jbd2.c      |    8 +++++++-
 fs/ext4/extents.c        |   49 +++++++++++++++++-----------------------------
 fs/ext4/extents_status.c |   35 +++++++++++++++++++--------------
 fs/ext4/extents_status.h |    2 +-
 fs/ext4/indirect.c       |    2 +-
 fs/ext4/inode.c          |    6 +++---
 fs/ext4/ioctl.c          |    4 ++--
 fs/ext4/super.c          |    2 +-
 fs/jbd2/transaction.c    |    8 ++++----
 10 files changed, 58 insertions(+), 60 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 90ff5acaf11f..52a34f5dfda2 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3720,7 +3720,7 @@ extern int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 			       struct ext4_map_blocks *map, int flags);
 extern int ext4_ext_truncate(handle_t *, struct inode *);
 extern int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
-				 ext4_lblk_t end);
+				 ext4_lblk_t end, int nofail);
 extern void ext4_ext_init(struct super_block *);
 extern void ext4_ext_release(struct super_block *);
 extern long ext4_fallocate(struct file *file, int mode, loff_t offset,
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 6def7339056d..2bdda3b7a3e6 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -92,6 +92,12 @@ handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
 {
 	journal_t *journal;
 	int err;
+	gfp_t gfp_mask = GFP_NOFS;
+
+	if (type & EXT4_EX_NOFAIL) {
+		gfp_mask |= __GFP_NOFAIL;
+		type &= ~EXT4_EX_NOFAIL;
+	}
 
 	trace_ext4_journal_start(sb, blocks, rsv_blocks, revoke_creds,
 				 _RET_IP_);
@@ -103,7 +109,7 @@ handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
 	if (!journal || (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))
 		return ext4_get_nojournal();
 	return jbd2__journal_start(journal, blocks, rsv_blocks, revoke_creds,
-				   GFP_NOFS, type, line);
+				   gfp_mask, type, line);
 }
 
 int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index c0de30f25185..b7bc12aedf78 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1488,7 +1488,7 @@ static int ext4_ext_search_left(struct inode *inode,
 static int ext4_ext_search_right(struct inode *inode,
 				 struct ext4_ext_path *path,
 				 ext4_lblk_t *logical, ext4_fsblk_t *phys,
-				 struct ext4_extent *ret_ex)
+				 struct ext4_extent *ret_ex, int nofail)
 {
 	struct buffer_head *bh = NULL;
 	struct ext4_extent_header *eh;
@@ -1565,7 +1565,7 @@ static int ext4_ext_search_right(struct inode *inode,
 	while (++depth < path->p_depth) {
 		/* subtract from p_depth to get proper eh_depth */
 		bh = read_extent_tree_block(inode, block,
-					    path->p_depth - depth, 0);
+					    path->p_depth - depth, nofail);
 		if (IS_ERR(bh))
 			return PTR_ERR(bh);
 		eh = ext_block_hdr(bh);
@@ -1574,7 +1574,7 @@ static int ext4_ext_search_right(struct inode *inode,
 		put_bh(bh);
 	}
 
-	bh = read_extent_tree_block(inode, block, path->p_depth - depth, 0);
+	bh = read_extent_tree_block(inode, block, path->p_depth - depth, nofail);
 	if (IS_ERR(bh))
 		return PTR_ERR(bh);
 	eh = ext_block_hdr(bh);
@@ -2773,7 +2773,7 @@ ext4_ext_more_to_rm(struct ext4_ext_path *path)
 }
 
 int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
-			  ext4_lblk_t end)
+			  ext4_lblk_t end, int nofail)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	int depth = ext_depth(inode);
@@ -2789,7 +2789,8 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 	ext_debug(inode, "truncate since %u to %u\n", start, end);
 
 	/* probably first extent we're gonna free will be last in block */
-	handle = ext4_journal_start_with_revoke(inode, EXT4_HT_TRUNCATE,
+	handle = ext4_journal_start_with_revoke(inode,
+			EXT4_HT_TRUNCATE | nofail,
 			depth + 1,
 			ext4_free_metadata_revoke_credits(inode->i_sb, depth));
 	if (IS_ERR(handle))
@@ -2877,7 +2878,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 			 */
 			lblk = ex_end + 1;
 			err = ext4_ext_search_right(inode, path, &lblk, &pblk,
-						    NULL);
+						    NULL, nofail);
 			if (err < 0)
 				goto out;
 			if (pblk) {
@@ -2899,10 +2900,6 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 	} else {
 		path = kcalloc(depth + 1, sizeof(struct ext4_ext_path),
 			       GFP_NOFS | __GFP_NOFAIL);
-		if (path == NULL) {
-			ext4_journal_stop(handle);
-			return -ENOMEM;
-		}
 		path[0].p_maxdepth = path[0].p_depth = depth;
 		path[0].p_hdr = ext_inode_hdr(inode);
 		i = 0;
@@ -2955,7 +2952,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 			memset(path + i + 1, 0, sizeof(*path));
 			bh = read_extent_tree_block(inode,
 				ext4_idx_pblock(path[i].p_idx), depth - i - 1,
-				EXT4_EX_NOCACHE);
+				EXT4_EX_NOCACHE | nofail);
 			if (IS_ERR(bh)) {
 				/* should we reset i_size? */
 				err = PTR_ERR(bh);
@@ -4186,7 +4183,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 	if (err)
 		goto out;
 	ar.lright = map->m_lblk;
-	err = ext4_ext_search_right(inode, path, &ar.lright, &ar.pright, &ex2);
+	err = ext4_ext_search_right(inode, path, &ar.lright, &ar.pright, &ex2, 0);
 	if (err < 0)
 		goto out;
 
@@ -4368,23 +4365,13 @@ int ext4_ext_truncate(handle_t *handle, struct inode *inode)
 
 	last_block = (inode->i_size + sb->s_blocksize - 1)
 			>> EXT4_BLOCK_SIZE_BITS(sb);
-retry:
 	err = ext4_es_remove_extent(inode, last_block,
-				    EXT_MAX_BLOCKS - last_block);
-	if (err == -ENOMEM) {
-		cond_resched();
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
-		goto retry;
-	}
+				    EXT_MAX_BLOCKS - last_block,
+				    EXT4_EX_NOFAIL);
 	if (err)
 		return err;
-retry_remove_space:
-	err = ext4_ext_remove_space(inode, last_block, EXT_MAX_BLOCKS - 1);
-	if (err == -ENOMEM) {
-		cond_resched();
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
-		goto retry_remove_space;
-	}
+	err = ext4_ext_remove_space(inode, last_block, EXT_MAX_BLOCKS - 1,
+				    EXT4_EX_NOFAIL);
 	return err;
 }
 
@@ -5322,13 +5309,13 @@ static int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
 	ext4_discard_preallocations(inode, 0);
 
 	ret = ext4_es_remove_extent(inode, punch_start,
-				    EXT_MAX_BLOCKS - punch_start);
+				    EXT_MAX_BLOCKS - punch_start, 0);
 	if (ret) {
 		up_write(&EXT4_I(inode)->i_data_sem);
 		goto out_stop;
 	}
 
-	ret = ext4_ext_remove_space(inode, punch_start, punch_stop - 1);
+	ret = ext4_ext_remove_space(inode, punch_start, punch_stop - 1, 0);
 	if (ret) {
 		up_write(&EXT4_I(inode)->i_data_sem);
 		goto out_stop;
@@ -5510,7 +5497,7 @@ static int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
 	}
 
 	ret = ext4_es_remove_extent(inode, offset_lblk,
-			EXT_MAX_BLOCKS - offset_lblk);
+				    EXT_MAX_BLOCKS - offset_lblk, 0);
 	if (ret) {
 		up_write(&EXT4_I(inode)->i_data_sem);
 		goto out_stop;
@@ -5574,10 +5561,10 @@ ext4_swap_extents(handle_t *handle, struct inode *inode1,
 	BUG_ON(!inode_is_locked(inode1));
 	BUG_ON(!inode_is_locked(inode2));
 
-	*erp = ext4_es_remove_extent(inode1, lblk1, count);
+	*erp = ext4_es_remove_extent(inode1, lblk1, count, 0);
 	if (unlikely(*erp))
 		return 0;
-	*erp = ext4_es_remove_extent(inode2, lblk2, count);
+	*erp = ext4_es_remove_extent(inode2, lblk2, count, 0);
 	if (unlikely(*erp))
 		return 0;
 
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 9a3a8996aacf..7f7711a2ea44 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -144,9 +144,10 @@
 static struct kmem_cache *ext4_es_cachep;
 static struct kmem_cache *ext4_pending_cachep;
 
-static int __es_insert_extent(struct inode *inode, struct extent_status *newes);
+static int __es_insert_extent(struct inode *inode, struct extent_status *newes,
+			      int nofail);
 static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
-			      ext4_lblk_t end, int *reserved);
+			      ext4_lblk_t end, int *reserved, int nofail);
 static int es_reclaim_extents(struct ext4_inode_info *ei, int *nr_to_scan);
 static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
 		       struct ext4_inode_info *locked_ei);
@@ -452,10 +453,11 @@ static void ext4_es_list_del(struct inode *inode)
 
 static struct extent_status *
 ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
-		     ext4_fsblk_t pblk)
+		     ext4_fsblk_t pblk, int nofail)
 {
 	struct extent_status *es;
-	es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC);
+	es = kmem_cache_alloc(ext4_es_cachep,
+			      GFP_ATOMIC | (nofail ? __GFP_NOFAIL : 0));
 	if (es == NULL)
 		return NULL;
 	es->es_lblk = lblk;
@@ -754,7 +756,8 @@ static inline void ext4_es_insert_extent_check(struct inode *inode,
 }
 #endif
 
-static int __es_insert_extent(struct inode *inode, struct extent_status *newes)
+static int __es_insert_extent(struct inode *inode, struct extent_status *newes,
+			      int nofail)
 {
 	struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
 	struct rb_node **p = &tree->root.rb_node;
@@ -795,7 +798,7 @@ static int __es_insert_extent(struct inode *inode, struct extent_status *newes)
 	}
 
 	es = ext4_es_alloc_extent(inode, newes->es_lblk, newes->es_len,
-				  newes->es_pblk);
+				  newes->es_pblk, nofail);
 	if (!es)
 		return -ENOMEM;
 	rb_link_node(&es->rb_node, parent, p);
@@ -848,11 +851,11 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 	ext4_es_insert_extent_check(inode, &newes);
 
 	write_lock(&EXT4_I(inode)->i_es_lock);
-	err = __es_remove_extent(inode, lblk, end, NULL);
+	err = __es_remove_extent(inode, lblk, end, NULL, 0);
 	if (err != 0)
 		goto error;
 retry:
-	err = __es_insert_extent(inode, &newes);
+	err = __es_insert_extent(inode, &newes, 0);
 	if (err == -ENOMEM && __es_shrink(EXT4_SB(inode->i_sb),
 					  128, EXT4_I(inode)))
 		goto retry;
@@ -902,7 +905,7 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
 
 	es = __es_tree_search(&EXT4_I(inode)->i_es_tree.root, lblk);
 	if (!es || es->es_lblk > end)
-		__es_insert_extent(inode, &newes);
+		__es_insert_extent(inode, &newes, 0);
 	write_unlock(&EXT4_I(inode)->i_es_lock);
 }
 
@@ -1294,6 +1297,7 @@ static unsigned int get_rsvd(struct inode *inode, ext4_lblk_t end,
  * @lblk - first block in range
  * @end - last block in range
  * @reserved - number of cluster reservations released
+ * @nofail - EXT4_EX_NOFAIL if __GFP_NOFAIL should be used
  *
  * If @reserved is not NULL and delayed allocation is enabled, counts
  * block/cluster reservations freed by removing range and if bigalloc
@@ -1301,7 +1305,7 @@ static unsigned int get_rsvd(struct inode *inode, ext4_lblk_t end,
  * error code on failure.
  */
 static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
-			      ext4_lblk_t end, int *reserved)
+			      ext4_lblk_t end, int *reserved, int nofail)
 {
 	struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
 	struct rb_node *node;
@@ -1350,7 +1354,7 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 					orig_es.es_len - len2;
 			ext4_es_store_pblock_status(&newes, block,
 						    ext4_es_status(&orig_es));
-			err = __es_insert_extent(inode, &newes);
+			err = __es_insert_extent(inode, &newes, nofail);
 			if (err) {
 				es->es_lblk = orig_es.es_lblk;
 				es->es_len = orig_es.es_len;
@@ -1426,12 +1430,13 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
  * @inode - file containing range
  * @lblk - first block in range
  * @len - number of blocks to remove
+ * @nofail - EXT4_EX_NOFAIL if __GFP_NOFAIL should be used
  *
  * Reduces block/cluster reservation count and for bigalloc cancels pending
  * reservations as needed. Returns 0 on success, error code on failure.
  */
 int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
-			  ext4_lblk_t len)
+			  ext4_lblk_t len, int nofail)
 {
 	ext4_lblk_t end;
 	int err = 0;
@@ -1456,7 +1461,7 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 	 * is reclaimed.
 	 */
 	write_lock(&EXT4_I(inode)->i_es_lock);
-	err = __es_remove_extent(inode, lblk, end, &reserved);
+	err = __es_remove_extent(inode, lblk, end, &reserved, nofail);
 	write_unlock(&EXT4_I(inode)->i_es_lock);
 	ext4_es_print_tree(inode);
 	ext4_da_release_space(inode, reserved);
@@ -2003,11 +2008,11 @@ int ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
 
 	write_lock(&EXT4_I(inode)->i_es_lock);
 
-	err = __es_remove_extent(inode, lblk, lblk, NULL);
+	err = __es_remove_extent(inode, lblk, lblk, NULL, 0);
 	if (err != 0)
 		goto error;
 retry:
-	err = __es_insert_extent(inode, &newes);
+	err = __es_insert_extent(inode, &newes, 0);
 	if (err == -ENOMEM && __es_shrink(EXT4_SB(inode->i_sb),
 					  128, EXT4_I(inode)))
 		goto retry;
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 4ec30a798260..23d77094a165 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -134,7 +134,7 @@ extern void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
 				 ext4_lblk_t len, ext4_fsblk_t pblk,
 				 unsigned int status);
 extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
-				 ext4_lblk_t len);
+				 ext4_lblk_t len, int nofail);
 extern void ext4_es_find_extent_range(struct inode *inode,
 				      int (*match_fn)(struct extent_status *es),
 				      ext4_lblk_t lblk, ext4_lblk_t end,
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 89efa78ed4b2..910e87aea7be 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -1125,7 +1125,7 @@ void ext4_ind_truncate(handle_t *handle, struct inode *inode)
 			return;
 	}
 
-	ext4_es_remove_extent(inode, last_block, EXT_MAX_BLOCKS - last_block);
+	ext4_es_remove_extent(inode, last_block, EXT_MAX_BLOCKS - last_block, 0);
 
 	/*
 	 * The orphan list entry will now protect us from any crash which
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d18852d6029c..24246043d94b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1575,7 +1575,7 @@ static void mpage_release_unused_pages(struct mpage_da_data *mpd,
 		ext4_lblk_t start, last;
 		start = index << (PAGE_SHIFT - inode->i_blkbits);
 		last = end << (PAGE_SHIFT - inode->i_blkbits);
-		ext4_es_remove_extent(inode, start, last - start + 1);
+		ext4_es_remove_extent(inode, start, last - start + 1, 0);
 	}
 
 	pagevec_init(&pvec);
@@ -4109,7 +4109,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 		ext4_discard_preallocations(inode, 0);
 
 		ret = ext4_es_remove_extent(inode, first_block,
-					    stop_block - first_block);
+					    stop_block - first_block, 0);
 		if (ret) {
 			up_write(&EXT4_I(inode)->i_data_sem);
 			goto out_stop;
@@ -4117,7 +4117,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 
 		if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
 			ret = ext4_ext_remove_space(inode, first_block,
-						    stop_block - 1);
+						    stop_block - 1, 0);
 		else
 			ret = ext4_ind_remove_space(handle, inode, first_block,
 						    stop_block);
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 606dee9e08a3..e4de05a6b976 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -79,8 +79,8 @@ static void swap_inode_data(struct inode *inode1, struct inode *inode2)
 		(ei1->i_flags & ~EXT4_FL_SHOULD_SWAP);
 	ei2->i_flags = tmp | (ei2->i_flags & ~EXT4_FL_SHOULD_SWAP);
 	swap(ei1->i_disksize, ei2->i_disksize);
-	ext4_es_remove_extent(inode1, 0, EXT_MAX_BLOCKS);
-	ext4_es_remove_extent(inode2, 0, EXT_MAX_BLOCKS);
+	ext4_es_remove_extent(inode1, 0, EXT_MAX_BLOCKS, 0);
+	ext4_es_remove_extent(inode2, 0, EXT_MAX_BLOCKS, 0);
 
 	isize = i_size_read(inode1);
 	i_size_write(inode1, i_size_read(inode2));
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 0775950ee84e..947e8376a35a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1393,7 +1393,7 @@ void ext4_clear_inode(struct inode *inode)
 	invalidate_inode_buffers(inode);
 	clear_inode(inode);
 	ext4_discard_preallocations(inode, 0);
-	ext4_es_remove_extent(inode, 0, EXT_MAX_BLOCKS);
+	ext4_es_remove_extent(inode, 0, EXT_MAX_BLOCKS, 0);
 	dquot_drop(inode);
 	if (EXT4_I(inode)->jinode) {
 		jbd2_journal_release_jbd_inode(EXT4_JOURNAL(inode),
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 6a3caedd2285..23e0f003d43b 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -476,9 +476,9 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
 }
 
 /* Allocate a new handle.  This should probably be in a slab... */
-static handle_t *new_handle(int nblocks)
+static handle_t *new_handle(int nblocks, gfp_t gfp)
 {
-	handle_t *handle = jbd2_alloc_handle(GFP_NOFS);
+	handle_t *handle = jbd2_alloc_handle(gfp);
 	if (!handle)
 		return NULL;
 	handle->h_total_credits = nblocks;
@@ -505,13 +505,13 @@ handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int rsv_blocks,
 
 	nblocks += DIV_ROUND_UP(revoke_records,
 				journal->j_revoke_records_per_block);
-	handle = new_handle(nblocks);
+	handle = new_handle(nblocks, gfp_mask);
 	if (!handle)
 		return ERR_PTR(-ENOMEM);
 	if (rsv_blocks) {
 		handle_t *rsv_handle;
 
-		rsv_handle = new_handle(rsv_blocks);
+		rsv_handle = new_handle(rsv_blocks, gfp_mask);
 		if (!rsv_handle) {
 			jbd2_free_handle(handle);
 			return ERR_PTR(-ENOMEM);



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 4/6] EXT4: remove congestion_wait from ext4_bio_write_page, and simplify
  2021-09-14  0:13 [PATCH 0/6] congestion_wait() and GFP_NOFAIL NeilBrown
                   ` (4 preceding siblings ...)
  2021-09-14  0:13 ` [PATCH 6/6] XFS: remove congestion_wait() loop from xfs_buf_alloc_pages() NeilBrown
@ 2021-09-14  0:13 ` NeilBrown
  5 siblings, 0 replies; 34+ messages in thread
From: NeilBrown @ 2021-09-14  0:13 UTC (permalink / raw)
  To: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman
  Cc: linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel

congestion_wait() is indistinguishable from
schedule_timeout_uninterruptible().  It is best avoided and should be
deprecated.

It is not needed in ext4_bio_write_page().  There are two cases.
If there are no ->io_bio yet, then it is appropriate to use __GFP_NOFAIL
which does the waiting in a better place.  The code already uses this
flag on the second attempt.  This patch changes to it always use that
flag for this case.

If there *are* ->io_bio (in which case the allocation was non-blocking)
we submit the io and return the first case.  No waiting is needed in
this case.

So remove the congestion_wait() call, and simplify the code so that the
two cases are somewhat clearer.

Remove the "if (io->io_bio)" before calling ext4_io_submit() as that
test is performed internally by that function.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/ext4/page-io.c |   13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index f038d578d8d8..3b6ece0d3ad6 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -506,7 +506,7 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 	 * can't happen in the common case of blocksize == PAGE_SIZE.
 	 */
 	if (fscrypt_inode_uses_fs_layer_crypto(inode) && nr_to_submit) {
-		gfp_t gfp_flags = GFP_NOFS;
+		gfp_t gfp_flags;
 		unsigned int enc_bytes = round_up(len, i_blocksize(inode));
 
 		/*
@@ -514,21 +514,18 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 		 * a waiting mask (i.e. request guaranteed allocation) on the
 		 * first page of the bio.  Otherwise it can deadlock.
 		 */
+	retry_encrypt:
 		if (io->io_bio)
 			gfp_flags = GFP_NOWAIT | __GFP_NOWARN;
-	retry_encrypt:
+		else
+			gfp_flags = GFP_NOFS | __GFP_NOFAIL;
 		bounce_page = fscrypt_encrypt_pagecache_blocks(page, enc_bytes,
 							       0, gfp_flags);
 		if (IS_ERR(bounce_page)) {
 			ret = PTR_ERR(bounce_page);
 			if (ret == -ENOMEM &&
 			    (io->io_bio || wbc->sync_mode == WB_SYNC_ALL)) {
-				gfp_flags = GFP_NOFS;
-				if (io->io_bio)
-					ext4_io_submit(io);
-				else
-					gfp_flags |= __GFP_NOFAIL;
-				congestion_wait(BLK_RW_ASYNC, HZ/50);
+				ext4_io_submit(io);
 				goto retry_encrypt;
 			}
 



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc()
  2021-09-14  0:13 [PATCH 0/6] congestion_wait() and GFP_NOFAIL NeilBrown
  2021-09-14  0:13 ` [PATCH 2/6] MM: annotate congestion_wait() and wait_iff_congested() as ineffective NeilBrown
@ 2021-09-14  0:13 ` NeilBrown
  2021-09-14  1:31   ` Dave Chinner
  2021-09-14  0:13 ` [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops NeilBrown
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 34+ messages in thread
From: NeilBrown @ 2021-09-14  0:13 UTC (permalink / raw)
  To: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman
  Cc: linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel

Documentation commment in gfp.h discourages indefinite retry loops on
ENOMEM and says of __GFP_NOFAIL that it

    is definitely preferable to use the flag rather than opencode
    endless loop around allocator.

So remove the loop, instead specifying __GFP_NOFAIL if KM_MAYFAIL was
not given.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/xfs/kmem.c |   16 ++++------------
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 6f49bf39183c..f545f3633f88 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -13,19 +13,11 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
 {
 	int	retries = 0;
 	gfp_t	lflags = kmem_flags_convert(flags);
-	void	*ptr;
 
 	trace_kmem_alloc(size, flags, _RET_IP_);
 
-	do {
-		ptr = kmalloc(size, lflags);
-		if (ptr || (flags & KM_MAYFAIL))
-			return ptr;
-		if (!(++retries % 100))
-			xfs_err(NULL,
-	"%s(%u) possible memory allocation deadlock size %u in %s (mode:0x%x)",
-				current->comm, current->pid,
-				(unsigned int)size, __func__, lflags);
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
-	} while (1);
+	if (!(flags & KM_MAYFAIL))
+		lflags |= __GFP_NOFAIL;
+
+	return kmalloc(size, lflags);
 }



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 6/6] XFS: remove congestion_wait() loop from xfs_buf_alloc_pages()
  2021-09-14  0:13 [PATCH 0/6] congestion_wait() and GFP_NOFAIL NeilBrown
                   ` (3 preceding siblings ...)
  2021-09-14  0:13 ` [PATCH 1/6] MM: improve documentation for __GFP_NOFAIL NeilBrown
@ 2021-09-14  0:13 ` NeilBrown
  2021-09-14  2:08   ` Dave Chinner
  2021-09-14  0:13 ` [PATCH 4/6] EXT4: remove congestion_wait from ext4_bio_write_page, and simplify NeilBrown
  5 siblings, 1 reply; 34+ messages in thread
From: NeilBrown @ 2021-09-14  0:13 UTC (permalink / raw)
  To: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman
  Cc: linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel

Documentation commment in gfp.h discourages indefinite retry loops on
ENOMEM and says of __GFP_NOFAIL that it

    is definitely preferable to use the flag rather than opencode
    endless loop around allocator.

congestion_wait() is indistinguishable from
schedule_timeout_uninterruptible() in practice and it is not a good way
to wait for memory to become available.

So instead of waiting, allocate a single page using __GFP_NOFAIL, then
loop around and try to get any more pages that might be needed with a
bulk allocation.  This single-page allocation will wait in the most
appropriate way.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/xfs/xfs_buf.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 5fa6cd947dd4..1ae3768f6504 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -372,8 +372,8 @@ xfs_buf_alloc_pages(
 
 	/*
 	 * Bulk filling of pages can take multiple calls. Not filling the entire
-	 * array is not an allocation failure, so don't back off if we get at
-	 * least one extra page.
+	 * array is not an allocation failure, so don't fail or fall back on
+	 * __GFP_NOFAIL if we get at least one extra page.
 	 */
 	for (;;) {
 		long	last = filled;
@@ -394,7 +394,7 @@ xfs_buf_alloc_pages(
 		}
 
 		XFS_STATS_INC(bp->b_mount, xb_page_retries);
-		congestion_wait(BLK_RW_ASYNC, HZ / 50);
+		bp->b_pages[filled++] = alloc_page(gfp_mask | __GFP_NOFAIL);
 	}
 	return 0;
 }



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc()
  2021-09-14  0:13 ` [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc() NeilBrown
@ 2021-09-14  1:31   ` Dave Chinner
  2021-09-14  3:27     ` NeilBrown
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2021-09-14  1:31 UTC (permalink / raw)
  To: NeilBrown
  Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman, linux-xfs,
	linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel

On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote:
> Documentation commment in gfp.h discourages indefinite retry loops on
> ENOMEM and says of __GFP_NOFAIL that it
> 
>     is definitely preferable to use the flag rather than opencode
>     endless loop around allocator.
> 
> So remove the loop, instead specifying __GFP_NOFAIL if KM_MAYFAIL was
> not given.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/xfs/kmem.c |   16 ++++------------
>  1 file changed, 4 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index 6f49bf39183c..f545f3633f88 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -13,19 +13,11 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
>  {
>  	int	retries = 0;
>  	gfp_t	lflags = kmem_flags_convert(flags);
> -	void	*ptr;
>  
>  	trace_kmem_alloc(size, flags, _RET_IP_);
>  
> -	do {
> -		ptr = kmalloc(size, lflags);
> -		if (ptr || (flags & KM_MAYFAIL))
> -			return ptr;
> -		if (!(++retries % 100))
> -			xfs_err(NULL,
> -	"%s(%u) possible memory allocation deadlock size %u in %s (mode:0x%x)",
> -				current->comm, current->pid,
> -				(unsigned int)size, __func__, lflags);
> -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> -	} while (1);
> +	if (!(flags & KM_MAYFAIL))
> +		lflags |= __GFP_NOFAIL;
> +
> +	return kmalloc(size, lflags);
>  }

Which means we no longer get warnings about memory allocation
failing - kmem_flags_convert() sets __GFP_NOWARN for all allocations
in this loop. Hence we'll now get silent deadlocks through this code
instead of getting warnings that memory allocation is failing
repeatedly.

I also wonder about changing the backoff behaviour here (it's a 20ms
wait right now because there are not early wakeups) will affect the
behaviour, as __GFP_NOFAIL won't wait for that extra time between
allocation attempts....

And, of course, how did you test this? Sometimes we see
unpredicted behaviours as a result of "simple" changes like this
under low memory conditions...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/6] XFS: remove congestion_wait() loop from xfs_buf_alloc_pages()
  2021-09-14  0:13 ` [PATCH 6/6] XFS: remove congestion_wait() loop from xfs_buf_alloc_pages() NeilBrown
@ 2021-09-14  2:08   ` Dave Chinner
  2021-09-14  2:35     ` NeilBrown
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2021-09-14  2:08 UTC (permalink / raw)
  To: NeilBrown
  Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman, linux-xfs,
	linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel

On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote:
> Documentation commment in gfp.h discourages indefinite retry loops on
> ENOMEM and says of __GFP_NOFAIL that it
> 
>     is definitely preferable to use the flag rather than opencode
>     endless loop around allocator.
> 
> congestion_wait() is indistinguishable from
> schedule_timeout_uninterruptible() in practice and it is not a good way
> to wait for memory to become available.
> 
> So instead of waiting, allocate a single page using __GFP_NOFAIL, then
> loop around and try to get any more pages that might be needed with a
> bulk allocation.  This single-page allocation will wait in the most
> appropriate way.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/xfs/xfs_buf.c |    6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 5fa6cd947dd4..1ae3768f6504 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -372,8 +372,8 @@ xfs_buf_alloc_pages(
>  
>  	/*
>  	 * Bulk filling of pages can take multiple calls. Not filling the entire
> -	 * array is not an allocation failure, so don't back off if we get at
> -	 * least one extra page.
> +	 * array is not an allocation failure, so don't fail or fall back on
> +	 * __GFP_NOFAIL if we get at least one extra page.
>  	 */
>  	for (;;) {
>  		long	last = filled;
> @@ -394,7 +394,7 @@ xfs_buf_alloc_pages(
>  		}
>  
>  		XFS_STATS_INC(bp->b_mount, xb_page_retries);
> -		congestion_wait(BLK_RW_ASYNC, HZ / 50);
> +		bp->b_pages[filled++] = alloc_page(gfp_mask | __GFP_NOFAIL);

This smells wrong - the whole point of using the bulk page allocator
in this loop is to avoid the costly individual calls to
alloc_page().

What we are implementing here fail-fast semantics for readahead and
fail-never for everything else.  If the bulk allocator fails to get
a page from the fast path free lists, it already falls back to
__alloc_pages(gfp, 0, ...) to allocate a single page. So AFAICT
there's no need to add another call to alloc_page() because we can
just do this instead:

	if (flags & XBF_READ_AHEAD)
		gfp_mask |= __GFP_NORETRY;
	else
-		gfp_mask |= GFP_NOFS;
+		gfp_mask |= GFP_NOFS | __GFP_NOFAIL;

Which should make the __alloc_pages() call in
alloc_pages_bulk_array() do a __GFP_NOFAIL allocation and hence
provide the necessary never-fail guarantee that is needed here.

At which point, the bulk allocation loop can be simplified because
we can only fail bulk allocation for readahead, so something like:

		if (filled == bp->b_page_count) {
			XFS_STATS_INC(bp->b_mount, xb_page_found);
			break;
		}

-		if (filled != last)
+		if (filled == last) {
-			continue;
-
-		if (flags & XBF_READ_AHEAD) {
			ASSERT(flags & XBF_READ_AHEAD);
			xfs_buf_free_pages(bp);
			return -ENOMEM;
		}

		XFS_STATS_INC(bp->b_mount, xb_page_retries);
-		congestion_wait(BLK_RW_ASYNC, HZ / 50);
	}
	return 0;
}

would do the right thing and still record that we are doing
blocking allocations (via the xb_page_retries stat) in this loop.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/6] XFS: remove congestion_wait() loop from xfs_buf_alloc_pages()
  2021-09-14  2:08   ` Dave Chinner
@ 2021-09-14  2:35     ` NeilBrown
  2021-09-14  5:33       ` Dave Chinner
  2021-09-14 16:45       ` Mel Gorman
  0 siblings, 2 replies; 34+ messages in thread
From: NeilBrown @ 2021-09-14  2:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman, linux-xfs,
	linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel

On Tue, 14 Sep 2021, Dave Chinner wrote:
> On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote:
> > Documentation commment in gfp.h discourages indefinite retry loops on
> > ENOMEM and says of __GFP_NOFAIL that it
> > 
> >     is definitely preferable to use the flag rather than opencode
> >     endless loop around allocator.
> > 
> > congestion_wait() is indistinguishable from
> > schedule_timeout_uninterruptible() in practice and it is not a good way
> > to wait for memory to become available.
> > 
> > So instead of waiting, allocate a single page using __GFP_NOFAIL, then
> > loop around and try to get any more pages that might be needed with a
> > bulk allocation.  This single-page allocation will wait in the most
> > appropriate way.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  fs/xfs/xfs_buf.c |    6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index 5fa6cd947dd4..1ae3768f6504 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -372,8 +372,8 @@ xfs_buf_alloc_pages(
> >  
> >  	/*
> >  	 * Bulk filling of pages can take multiple calls. Not filling the entire
> > -	 * array is not an allocation failure, so don't back off if we get at
> > -	 * least one extra page.
> > +	 * array is not an allocation failure, so don't fail or fall back on
> > +	 * __GFP_NOFAIL if we get at least one extra page.
> >  	 */
> >  	for (;;) {
> >  		long	last = filled;
> > @@ -394,7 +394,7 @@ xfs_buf_alloc_pages(
> >  		}
> >  
> >  		XFS_STATS_INC(bp->b_mount, xb_page_retries);
> > -		congestion_wait(BLK_RW_ASYNC, HZ / 50);
> > +		bp->b_pages[filled++] = alloc_page(gfp_mask | __GFP_NOFAIL);
> 
> This smells wrong - the whole point of using the bulk page allocator
> in this loop is to avoid the costly individual calls to
> alloc_page().
> 
> What we are implementing here fail-fast semantics for readahead and
> fail-never for everything else.  If the bulk allocator fails to get
> a page from the fast path free lists, it already falls back to
> __alloc_pages(gfp, 0, ...) to allocate a single page. So AFAICT
> there's no need to add another call to alloc_page() because we can
> just do this instead:
> 
> 	if (flags & XBF_READ_AHEAD)
> 		gfp_mask |= __GFP_NORETRY;
> 	else
> -		gfp_mask |= GFP_NOFS;
> +		gfp_mask |= GFP_NOFS | __GFP_NOFAIL;
> 
> Which should make the __alloc_pages() call in
> alloc_pages_bulk_array() do a __GFP_NOFAIL allocation and hence
> provide the necessary never-fail guarantee that is needed here.

That is a nice simplification.
Mel Gorman told me
  https://lore.kernel.org/linux-nfs/20210907153116.GJ3828@suse.com/
that alloc_pages_bulk ignores GFP_NOFAIL.  I added that to the
documentation comment in an earlier patch.

I had a look at the code and cannot see how it would fail to allocate at
least one page.  Maybe Mel can help....

NeilBrown
 


> 
> At which point, the bulk allocation loop can be simplified because
> we can only fail bulk allocation for readahead, so something like:
> 
> 		if (filled == bp->b_page_count) {
> 			XFS_STATS_INC(bp->b_mount, xb_page_found);
> 			break;
> 		}
> 
> -		if (filled != last)
> +		if (filled == last) {
> -			continue;
> -
> -		if (flags & XBF_READ_AHEAD) {
> 			ASSERT(flags & XBF_READ_AHEAD);
> 			xfs_buf_free_pages(bp);
> 			return -ENOMEM;
> 		}
> 
> 		XFS_STATS_INC(bp->b_mount, xb_page_retries);
> -		congestion_wait(BLK_RW_ASYNC, HZ / 50);
> 	}
> 	return 0;
> }
> 
> would do the right thing and still record that we are doing
> blocking allocations (via the xb_page_retries stat) in this loop.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc()
  2021-09-14  1:31   ` Dave Chinner
@ 2021-09-14  3:27     ` NeilBrown
  2021-09-14  6:05       ` Dave Chinner
  0 siblings, 1 reply; 34+ messages in thread
From: NeilBrown @ 2021-09-14  3:27 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman, linux-xfs,
	linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel

On Tue, 14 Sep 2021, Dave Chinner wrote:
> On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote:
> > Documentation commment in gfp.h discourages indefinite retry loops on
> > ENOMEM and says of __GFP_NOFAIL that it
> > 
> >     is definitely preferable to use the flag rather than opencode
> >     endless loop around allocator.
> > 
> > So remove the loop, instead specifying __GFP_NOFAIL if KM_MAYFAIL was
> > not given.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  fs/xfs/kmem.c |   16 ++++------------
> >  1 file changed, 4 insertions(+), 12 deletions(-)
> > 
> > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > index 6f49bf39183c..f545f3633f88 100644
> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> > @@ -13,19 +13,11 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
> >  {
> >  	int	retries = 0;
> >  	gfp_t	lflags = kmem_flags_convert(flags);
> > -	void	*ptr;
> >  
> >  	trace_kmem_alloc(size, flags, _RET_IP_);
> >  
> > -	do {
> > -		ptr = kmalloc(size, lflags);
> > -		if (ptr || (flags & KM_MAYFAIL))
> > -			return ptr;
> > -		if (!(++retries % 100))
> > -			xfs_err(NULL,
> > -	"%s(%u) possible memory allocation deadlock size %u in %s (mode:0x%x)",
> > -				current->comm, current->pid,
> > -				(unsigned int)size, __func__, lflags);
> > -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > -	} while (1);
> > +	if (!(flags & KM_MAYFAIL))
> > +		lflags |= __GFP_NOFAIL;
> > +
> > +	return kmalloc(size, lflags);
> >  }
> 
> Which means we no longer get warnings about memory allocation
> failing - kmem_flags_convert() sets __GFP_NOWARN for all allocations
> in this loop. Hence we'll now get silent deadlocks through this code
> instead of getting warnings that memory allocation is failing
> repeatedly.

Yes, that is a problem.  Could we just clear __GFP_NOWARN when setting
__GFP_NOFAIL?
Or is the 1-in-100 important? I think default warning is 1 every 10
seconds.

> 
> I also wonder about changing the backoff behaviour here (it's a 20ms
> wait right now because there are not early wakeups) will affect the
> behaviour, as __GFP_NOFAIL won't wait for that extra time between
> allocation attempts....

The internal backoff is 100ms if there is much pending writeout, and
there are 16 internal retries.  If there is not much pending writeout, I
think it just loops with cond_resched().
So adding 20ms can only be at all interesting when the only way to
reclaim memory is something other than writeout.  I don't know how to
think about that.

> 
> And, of course, how did you test this? Sometimes we see
> unpredicted behaviours as a result of "simple" changes like this
> under low memory conditions...

I suspect this is close to untestable.  While I accept that there might
be a scenario where the change might cause some macro effect, it would
most likely be some interplay with some other subsystem struggling with
memory.  Testing XFS by itself would be unlikely to find it.

Thanks,
NeilBrown


> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/6] XFS: remove congestion_wait() loop from xfs_buf_alloc_pages()
  2021-09-14  2:35     ` NeilBrown
@ 2021-09-14  5:33       ` Dave Chinner
  2021-09-14 16:45       ` Mel Gorman
  1 sibling, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2021-09-14  5:33 UTC (permalink / raw)
  To: NeilBrown
  Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman, linux-xfs,
	linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel

On Tue, Sep 14, 2021 at 12:35:59PM +1000, NeilBrown wrote:
> On Tue, 14 Sep 2021, Dave Chinner wrote:
> > On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote:
> > > Documentation commment in gfp.h discourages indefinite retry loops on
> > > ENOMEM and says of __GFP_NOFAIL that it
> > > 
> > >     is definitely preferable to use the flag rather than opencode
> > >     endless loop around allocator.
> > > 
> > > congestion_wait() is indistinguishable from
> > > schedule_timeout_uninterruptible() in practice and it is not a good way
> > > to wait for memory to become available.
> > > 
> > > So instead of waiting, allocate a single page using __GFP_NOFAIL, then
> > > loop around and try to get any more pages that might be needed with a
> > > bulk allocation.  This single-page allocation will wait in the most
> > > appropriate way.
> > > 
> > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > ---
> > >  fs/xfs/xfs_buf.c |    6 +++---
> > >  1 file changed, 3 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > index 5fa6cd947dd4..1ae3768f6504 100644
> > > --- a/fs/xfs/xfs_buf.c
> > > +++ b/fs/xfs/xfs_buf.c
> > > @@ -372,8 +372,8 @@ xfs_buf_alloc_pages(
> > >  
> > >  	/*
> > >  	 * Bulk filling of pages can take multiple calls. Not filling the entire
> > > -	 * array is not an allocation failure, so don't back off if we get at
> > > -	 * least one extra page.
> > > +	 * array is not an allocation failure, so don't fail or fall back on
> > > +	 * __GFP_NOFAIL if we get at least one extra page.
> > >  	 */
> > >  	for (;;) {
> > >  		long	last = filled;
> > > @@ -394,7 +394,7 @@ xfs_buf_alloc_pages(
> > >  		}
> > >  
> > >  		XFS_STATS_INC(bp->b_mount, xb_page_retries);
> > > -		congestion_wait(BLK_RW_ASYNC, HZ / 50);
> > > +		bp->b_pages[filled++] = alloc_page(gfp_mask | __GFP_NOFAIL);
> > 
> > This smells wrong - the whole point of using the bulk page allocator
> > in this loop is to avoid the costly individual calls to
> > alloc_page().
> > 
> > What we are implementing here fail-fast semantics for readahead and
> > fail-never for everything else.  If the bulk allocator fails to get
> > a page from the fast path free lists, it already falls back to
> > __alloc_pages(gfp, 0, ...) to allocate a single page. So AFAICT
> > there's no need to add another call to alloc_page() because we can
> > just do this instead:
> > 
> > 	if (flags & XBF_READ_AHEAD)
> > 		gfp_mask |= __GFP_NORETRY;
> > 	else
> > -		gfp_mask |= GFP_NOFS;
> > +		gfp_mask |= GFP_NOFS | __GFP_NOFAIL;
> > 
> > Which should make the __alloc_pages() call in
> > alloc_pages_bulk_array() do a __GFP_NOFAIL allocation and hence
> > provide the necessary never-fail guarantee that is needed here.
> 
> That is a nice simplification.
> Mel Gorman told me
>   https://lore.kernel.org/linux-nfs/20210907153116.GJ3828@suse.com/
> that alloc_pages_bulk ignores GFP_NOFAIL.  I added that to the
> documentation comment in an earlier patch.

Well, that's a surprise to me - I can't see where it masked out
NOFAIL, and it seems quite arbitrary to just say "different code
needs different fallbacks, so you can't have NOFAIL" despite NOFAIL
being the exact behavioural semantics one of only three users of the
bulk allocator really needs...

> I had a look at the code and cannot see how it would fail to allocate at
> least one page.  Maybe Mel can help....

Yup, clarification is definitely needed here.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc()
  2021-09-14  3:27     ` NeilBrown
@ 2021-09-14  6:05       ` Dave Chinner
  0 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2021-09-14  6:05 UTC (permalink / raw)
  To: NeilBrown
  Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman, linux-xfs,
	linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel

On Tue, Sep 14, 2021 at 01:27:31PM +1000, NeilBrown wrote:
> On Tue, 14 Sep 2021, Dave Chinner wrote:
> > On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote:
> > > Documentation commment in gfp.h discourages indefinite retry loops on
> > > ENOMEM and says of __GFP_NOFAIL that it
> > > 
> > >     is definitely preferable to use the flag rather than opencode
> > >     endless loop around allocator.
> > > 
> > > So remove the loop, instead specifying __GFP_NOFAIL if KM_MAYFAIL was
> > > not given.
> > > 
> > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > ---
> > >  fs/xfs/kmem.c |   16 ++++------------
> > >  1 file changed, 4 insertions(+), 12 deletions(-)
> > > 
> > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > > index 6f49bf39183c..f545f3633f88 100644
> > > --- a/fs/xfs/kmem.c
> > > +++ b/fs/xfs/kmem.c
> > > @@ -13,19 +13,11 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
> > >  {
> > >  	int	retries = 0;
> > >  	gfp_t	lflags = kmem_flags_convert(flags);
> > > -	void	*ptr;
> > >  
> > >  	trace_kmem_alloc(size, flags, _RET_IP_);
> > >  
> > > -	do {
> > > -		ptr = kmalloc(size, lflags);
> > > -		if (ptr || (flags & KM_MAYFAIL))
> > > -			return ptr;
> > > -		if (!(++retries % 100))
> > > -			xfs_err(NULL,
> > > -	"%s(%u) possible memory allocation deadlock size %u in %s (mode:0x%x)",
> > > -				current->comm, current->pid,
> > > -				(unsigned int)size, __func__, lflags);
> > > -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > > -	} while (1);
> > > +	if (!(flags & KM_MAYFAIL))
> > > +		lflags |= __GFP_NOFAIL;
> > > +
> > > +	return kmalloc(size, lflags);
> > >  }
> > 
> > Which means we no longer get warnings about memory allocation
> > failing - kmem_flags_convert() sets __GFP_NOWARN for all allocations
> > in this loop. Hence we'll now get silent deadlocks through this code
> > instead of getting warnings that memory allocation is failing
> > repeatedly.
> 
> Yes, that is a problem.  Could we just clear __GFP_NOWARN when setting
> __GFP_NOFAIL?

Probably.

> Or is the 1-in-100 important? I think default warning is 1 every 10
> seconds.

1-in-100 is an arbitrary number to prevent spamming of logs unless
there is a real likelihood of there being a memory allocation
deadlock. We've typically only ever seen this when trying to do
high-order allocations (e.g. 64kB for xattr buffers) and failing
repeatedly in extreme memory pressure events. It's a canary that we
leave in the logs so that when a user reports problems we know that
they've been running under extended extreme low memory conditions
and can adjust the triage process accordingly.

So, we could remove __GFP_NOWARN, as long as the core allocator code
has sufficient rate limiting that it won't spam the logs due to
extended failure looping...

> > I also wonder about changing the backoff behaviour here (it's a 20ms
> > wait right now because there are not early wakeups) will affect the
> > behaviour, as __GFP_NOFAIL won't wait for that extra time between
> > allocation attempts....
> 
> The internal backoff is 100ms if there is much pending writeout, and
> there are 16 internal retries.  If there is not much pending writeout, I
> think it just loops with cond_resched().
> So adding 20ms can only be at all interesting when the only way to
> reclaim memory is something other than writeout.  I don't know how to
> think about that.

Any cache that uses a shrinker to reclaim (e.g. dentry, inodes, fs
metadata, etc due to recursive directory traversals) can cause
reclaim looping and priority escalation without there being any page
cache writeback or reclaim possible. Especially when you have
GFP_NOFS allocation context and all your memory is in VFS level
caches. At that point, direct reclaim cannot (and will not) make
forwards progress, so we still have to wait for some other
GFP_KERNEL context reclaim (e.g. kswapd) to make progress reclaiming
memory while we wait.

Fundamentally, the memory reclaim backoff code doesn't play well
with shrinkers. Patches from an old patchset which pushed lack of
shrinker progress back up into the vmscan level backoff algorithms
was something I was experimenting with a few years ago. e.g.

https://lore.kernel.org/linux-xfs/20191031234618.15403-16-david@fromorbit.com/
https://lore.kernel.org/linux-xfs/20191031234618.15403-17-david@fromorbit.com/

We didn't end up going this way to solve the XFS inode reclaim
problems - I ended up solving that entirely by pinning XFS buffer
cache memory and modifying the XFS inode shrinker - but it was this
patchset that first exposed the fact that congestion_wait() was no
longer functioning as intended. See the last few paragraphs of the
(long) cover letter for v1 of that patchset here:

https://lore.kernel.org/linux-xfs/20190801021752.4986-1-david@fromorbit.com/

So, yeah, I know full well that congestion_wait() is mostly just
an unconditional timeout these days...

> > And, of course, how did you test this? Sometimes we see
> > unpredicted behaviours as a result of "simple" changes like this
> > under low memory conditions...
> 
> I suspect this is close to untestable.  While I accept that there might
> be a scenario where the change might cause some macro effect, it would
> most likely be some interplay with some other subsystem struggling with
> memory.  Testing XFS by itself would be unlikely to find it.

Filesystem traversal workloads (e.g. chown -R) are the ones that
hammer memory allocation from GFP_NOFS context which creates memory
pressure that cannot be balanced by direct reclaim as direct reclaim
cannot reclaim filesystem caches in this situation. This is where I
would expect extra backoff on failing GFP_NOFS allocations to have
some effect...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops.
  2021-09-14  0:13 ` [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops NeilBrown
@ 2021-09-14 16:34   ` Mel Gorman
  2021-09-14 21:48     ` NeilBrown
  2021-09-14 23:55     ` Dave Chinner
  2021-09-15  0:28   ` Theodore Ts'o
  1 sibling, 2 replies; 34+ messages in thread
From: Mel Gorman @ 2021-09-14 16:34 UTC (permalink / raw)
  To: NeilBrown
  Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Jan Kara, Michal Hocko, Matthew Wilcox,
	linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm,
	linux-kernel

On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote:
> Indefinite loops waiting for memory allocation are discouraged by
> documentation in gfp.h which says the use of __GFP_NOFAIL that it
> 
>  is definitely preferable to use the flag rather than opencode endless
>  loop around allocator.
> 
> Such loops that use congestion_wait() are particularly unwise as
> congestion_wait() is indistinguishable from
> schedule_timeout_uninterruptible() in practice - and should be
> deprecated.
> 
> So this patch changes the two loops in ext4_ext_truncate() to use
> __GFP_NOFAIL instead of looping.
> 
> As the allocation is multiple layers deeper in the call stack, this
> requires passing the EXT4_EX_NOFAIL flag down and handling it in various
> places.
> 
> Of particular interest is the ext4_journal_start family of calls which
> can now have EXT4_EX_NOFAIL 'or'ed in to the 'type'.  This could be seen
> as a blurring of types.  However 'type' is 8 bits, and EXT4_EX_NOFAIL is
> a high bit, so it is safe in practice.
> 
> jbd2__journal_start() is enhanced so that the gfp_t flags passed are
> used for *all* allocations.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>

I'm not a fan. GFP_NOFAIL allows access to emergency reserves increasing
the risk of a livelock if memory is completely depleted where as some
callers can afford to wait.

The key event should be reclaim making progress. The hack below is
intended to vaguely demonstrate how blocking can be based on reclaim
making progress instead of "congestion" but has not even been booted. A
more complete overhaul may involve introducing
reclaim_congestion_wait_nodemask(gfp_t gfp_mask, long timeout, nodemask_t *nodemask)
and
reclaim_congestion_wait_nodemask(gfp_t gfp_mask, long timeout)
and converting congestion_wait and wait_iff_congestion to calling
reclaim_congestion_wait_nodemask which waits on the first usable node
and then audit every single congestion_wait() user to see which API
they should call. Further work would be to establish whether the page allocator should
call reclaim_congestion_wait_nodemask() if direct reclaim is not making
progress or whether that should be in vmscan.c. Conceivably, GFP_NOFAIL
could then soften its access to emergency reserves but I haven't given
it much thought.

Yes it's significant work, but it would be a better than letting
__GFP_NOFAIL propagate further and kicking us down the road.

This hack is terrible, it's not the right way to do it, it's just to
illustrate the idea of "waiting on memory should be based on reclaim
making progress and not the state of storage" is not impossible.

--8<--
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5c0318509f9e..5ed81c5746ec 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -832,6 +832,7 @@ typedef struct pglist_data {
 	unsigned long node_spanned_pages; /* total size of physical page
 					     range, including holes */
 	int node_id;
+	wait_queue_head_t reclaim_wait;
 	wait_queue_head_t kswapd_wait;
 	wait_queue_head_t pfmemalloc_wait;
 	struct task_struct *kswapd;	/* Protected by
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 6122c78ce914..21a9cd693d12 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -13,6 +13,7 @@
 #include <linux/module.h>
 #include <linux/writeback.h>
 #include <linux/device.h>
+#include <linux/swap.h>
 #include <trace/events/writeback.h>
 
 struct backing_dev_info noop_backing_dev_info;
@@ -1013,25 +1014,41 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
 EXPORT_SYMBOL(set_bdi_congested);
 
 /**
- * congestion_wait - wait for a backing_dev to become uncongested
- * @sync: SYNC or ASYNC IO
- * @timeout: timeout in jiffies
+ * congestion_wait - the docs are now worthless but avoiding a rename
  *
- * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
- * write congestion.  If no backing_devs are congested then just wait for the
- * next write to be completed.
+ * New thing -- wait for a timeout or reclaim to make progress
  */
 long congestion_wait(int sync, long timeout)
 {
+	pg_data_t *pgdat;
 	long ret;
 	unsigned long start = jiffies;
 	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
+	wait_queue_head_t *wqh;
 
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	ret = io_schedule_timeout(timeout);
+	/* Never let kswapd sleep on itself */
+	if (current_is_kswapd())
+		goto trace;
+
+	/*
+	 * Dangerous, local memory may be forbidden by cpuset or policies,
+	 * use first eligible zone in zonelists node instead
+	 */
+	preempt_disable();
+	pgdat = NODE_DATA(smp_processor_id());
+	preempt_enable();
+	wqh = &pgdat->reclaim_wait;
+
+	/*
+	 * Should probably check watermark of suitable zones here
+	 * in case this is spuriously called
+	 */
+
+	prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE);
+	ret = schedule_timeout(timeout);
 	finish_wait(wqh, &wait);
 
+trace:
 	trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
 					jiffies_to_usecs(jiffies - start));
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5b09e71c9ce7..4b87b73d1264 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7418,6 +7418,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 	pgdat_init_split_queue(pgdat);
 	pgdat_init_kcompactd(pgdat);
 
+	init_waitqueue_head(&pgdat->reclaim_wait);
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 158c9c93d03c..0ac2cf6be5e3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2888,6 +2888,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 	} while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));
 }
 
+static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx);
+
 static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 {
 	struct reclaim_state *reclaim_state = current->reclaim_state;
@@ -3070,6 +3072,18 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 				    sc))
 		goto again;
 
+	/*
+	 * Might be race-prone, more appropriate to do this when exiting
+	 * direct reclaim and when kswapd finds that pgdat is balanced.
+	 * May also be appropriate to update pgdat_balanced to take
+	 * a watermark level and wakeup when min watermarks are ok
+	 * instead of waiting for the high watermark
+	 */
+	if (waitqueue_active(&pgdat->reclaim_wait) &&
+	    pgdat_balanced(pgdat, 0, ZONE_MOVABLE)) {
+		wake_up_interruptible(&pgdat->reclaim_wait);
+	}
+
 	/*
 	 * Kswapd gives up on balancing particular nodes after too
 	 * many failures to reclaim anything from them and goes to

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/6] XFS: remove congestion_wait() loop from xfs_buf_alloc_pages()
  2021-09-14  2:35     ` NeilBrown
  2021-09-14  5:33       ` Dave Chinner
@ 2021-09-14 16:45       ` Mel Gorman
  2021-09-14 21:13         ` NeilBrown
  1 sibling, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2021-09-14 16:45 UTC (permalink / raw)
  To: NeilBrown
  Cc: Dave Chinner, Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, linux-xfs, linux-ext4,
	linux-fsdevel, linux-nfs, linux-mm, linux-kernel

On Tue, Sep 14, 2021 at 12:35:59PM +1000, NeilBrown wrote:
> On Tue, 14 Sep 2021, Dave Chinner wrote:
> > On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote:
> > > Documentation commment in gfp.h discourages indefinite retry loops on
> > > ENOMEM and says of __GFP_NOFAIL that it
> > > 
> > >     is definitely preferable to use the flag rather than opencode
> > >     endless loop around allocator.
> > > 
> > > congestion_wait() is indistinguishable from
> > > schedule_timeout_uninterruptible() in practice and it is not a good way
> > > to wait for memory to become available.
> > > 
> > > So instead of waiting, allocate a single page using __GFP_NOFAIL, then
> > > loop around and try to get any more pages that might be needed with a
> > > bulk allocation.  This single-page allocation will wait in the most
> > > appropriate way.
> > > 
> > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > ---
> > >  fs/xfs/xfs_buf.c |    6 +++---
> > >  1 file changed, 3 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > index 5fa6cd947dd4..1ae3768f6504 100644
> > > --- a/fs/xfs/xfs_buf.c
> > > +++ b/fs/xfs/xfs_buf.c
> > > @@ -372,8 +372,8 @@ xfs_buf_alloc_pages(
> > >  
> > >  	/*
> > >  	 * Bulk filling of pages can take multiple calls. Not filling the entire
> > > -	 * array is not an allocation failure, so don't back off if we get at
> > > -	 * least one extra page.
> > > +	 * array is not an allocation failure, so don't fail or fall back on
> > > +	 * __GFP_NOFAIL if we get at least one extra page.
> > >  	 */
> > >  	for (;;) {
> > >  		long	last = filled;
> > > @@ -394,7 +394,7 @@ xfs_buf_alloc_pages(
> > >  		}
> > >  
> > >  		XFS_STATS_INC(bp->b_mount, xb_page_retries);
> > > -		congestion_wait(BLK_RW_ASYNC, HZ / 50);
> > > +		bp->b_pages[filled++] = alloc_page(gfp_mask | __GFP_NOFAIL);
> > 
> > This smells wrong - the whole point of using the bulk page allocator
> > in this loop is to avoid the costly individual calls to
> > alloc_page().
> > 
> > What we are implementing here fail-fast semantics for readahead and
> > fail-never for everything else.  If the bulk allocator fails to get
> > a page from the fast path free lists, it already falls back to
> > __alloc_pages(gfp, 0, ...) to allocate a single page. So AFAICT
> > there's no need to add another call to alloc_page() because we can
> > just do this instead:
> > 
> > 	if (flags & XBF_READ_AHEAD)
> > 		gfp_mask |= __GFP_NORETRY;
> > 	else
> > -		gfp_mask |= GFP_NOFS;
> > +		gfp_mask |= GFP_NOFS | __GFP_NOFAIL;
> > 
> > Which should make the __alloc_pages() call in
> > alloc_pages_bulk_array() do a __GFP_NOFAIL allocation and hence
> > provide the necessary never-fail guarantee that is needed here.
> 
> That is a nice simplification.
> Mel Gorman told me
>   https://lore.kernel.org/linux-nfs/20210907153116.GJ3828@suse.com/
> that alloc_pages_bulk ignores GFP_NOFAIL.  I added that to the
> documentation comment in an earlier patch.
> 
> I had a look at the code and cannot see how it would fail to allocate at
> least one page.  Maybe Mel can help....
> 

If there are already at least one page an the array and the first attempt
at bulk allocation fails, it'll simply return. It's an odd corner case
that may never apply but it's possible.  That said, I'm of the opinion that
__GFP_NOFAIL should not be expanded and instead congestion_wait should be
deleted and replaced with something triggered by reclaim making progress.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/6] XFS: remove congestion_wait() loop from xfs_buf_alloc_pages()
  2021-09-14 16:45       ` Mel Gorman
@ 2021-09-14 21:13         ` NeilBrown
  0 siblings, 0 replies; 34+ messages in thread
From: NeilBrown @ 2021-09-14 21:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Dave Chinner, Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, linux-xfs, linux-ext4,
	linux-fsdevel, linux-nfs, linux-mm, linux-kernel

On Wed, 15 Sep 2021, Mel Gorman wrote:
> On Tue, Sep 14, 2021 at 12:35:59PM +1000, NeilBrown wrote:
> > On Tue, 14 Sep 2021, Dave Chinner wrote:
> > > On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote:
> > > > Documentation commment in gfp.h discourages indefinite retry loops on
> > > > ENOMEM and says of __GFP_NOFAIL that it
> > > > 
> > > >     is definitely preferable to use the flag rather than opencode
> > > >     endless loop around allocator.
> > > > 
> > > > congestion_wait() is indistinguishable from
> > > > schedule_timeout_uninterruptible() in practice and it is not a good way
> > > > to wait for memory to become available.
> > > > 
> > > > So instead of waiting, allocate a single page using __GFP_NOFAIL, then
> > > > loop around and try to get any more pages that might be needed with a
> > > > bulk allocation.  This single-page allocation will wait in the most
> > > > appropriate way.
> > > > 
> > > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > > ---
> > > >  fs/xfs/xfs_buf.c |    6 +++---
> > > >  1 file changed, 3 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > index 5fa6cd947dd4..1ae3768f6504 100644
> > > > --- a/fs/xfs/xfs_buf.c
> > > > +++ b/fs/xfs/xfs_buf.c
> > > > @@ -372,8 +372,8 @@ xfs_buf_alloc_pages(
> > > >  
> > > >  	/*
> > > >  	 * Bulk filling of pages can take multiple calls. Not filling the entire
> > > > -	 * array is not an allocation failure, so don't back off if we get at
> > > > -	 * least one extra page.
> > > > +	 * array is not an allocation failure, so don't fail or fall back on
> > > > +	 * __GFP_NOFAIL if we get at least one extra page.
> > > >  	 */
> > > >  	for (;;) {
> > > >  		long	last = filled;
> > > > @@ -394,7 +394,7 @@ xfs_buf_alloc_pages(
> > > >  		}
> > > >  
> > > >  		XFS_STATS_INC(bp->b_mount, xb_page_retries);
> > > > -		congestion_wait(BLK_RW_ASYNC, HZ / 50);
> > > > +		bp->b_pages[filled++] = alloc_page(gfp_mask | __GFP_NOFAIL);
> > > 
> > > This smells wrong - the whole point of using the bulk page allocator
> > > in this loop is to avoid the costly individual calls to
> > > alloc_page().
> > > 
> > > What we are implementing here fail-fast semantics for readahead and
> > > fail-never for everything else.  If the bulk allocator fails to get
> > > a page from the fast path free lists, it already falls back to
> > > __alloc_pages(gfp, 0, ...) to allocate a single page. So AFAICT
> > > there's no need to add another call to alloc_page() because we can
> > > just do this instead:
> > > 
> > > 	if (flags & XBF_READ_AHEAD)
> > > 		gfp_mask |= __GFP_NORETRY;
> > > 	else
> > > -		gfp_mask |= GFP_NOFS;
> > > +		gfp_mask |= GFP_NOFS | __GFP_NOFAIL;
> > > 
> > > Which should make the __alloc_pages() call in
> > > alloc_pages_bulk_array() do a __GFP_NOFAIL allocation and hence
> > > provide the necessary never-fail guarantee that is needed here.
> > 
> > That is a nice simplification.
> > Mel Gorman told me
> >   https://lore.kernel.org/linux-nfs/20210907153116.GJ3828@suse.com/
> > that alloc_pages_bulk ignores GFP_NOFAIL.  I added that to the
> > documentation comment in an earlier patch.
> > 
> > I had a look at the code and cannot see how it would fail to allocate at
> > least one page.  Maybe Mel can help....
> > 
> 
> If there are already at least one page an the array and the first attempt
> at bulk allocation fails, it'll simply return. It's an odd corner case
> that may never apply but it's possible.  That said, I'm of the opinion that
> __GFP_NOFAIL should not be expanded and instead congestion_wait should be
> deleted and replaced with something triggered by reclaim making progress.

Ahh.... that was (I think) fixed by
https://patchwork.kernel.org/project/linux-mm/patch/163027609524.7591.4987241695872857175@noble.neil.brown.name/
(which I cannot find on lore.kernel.org - strange)
which you acked - and which I meant to include in this series but
somehow missed.

NeilBrown

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops.
  2021-09-14 16:34   ` Mel Gorman
@ 2021-09-14 21:48     ` NeilBrown
  2021-09-15 12:06       ` Michal Hocko
  2021-09-14 23:55     ` Dave Chinner
  1 sibling, 1 reply; 34+ messages in thread
From: NeilBrown @ 2021-09-14 21:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Jan Kara, Michal Hocko, Matthew Wilcox,
	linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm,
	linux-kernel

On Wed, 15 Sep 2021, Mel Gorman wrote:
> On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote:
> > Indefinite loops waiting for memory allocation are discouraged by
> > documentation in gfp.h which says the use of __GFP_NOFAIL that it
> > 
> >  is definitely preferable to use the flag rather than opencode endless
> >  loop around allocator.
> > 
> > Such loops that use congestion_wait() are particularly unwise as
> > congestion_wait() is indistinguishable from
> > schedule_timeout_uninterruptible() in practice - and should be
> > deprecated.
> > 
> > So this patch changes the two loops in ext4_ext_truncate() to use
> > __GFP_NOFAIL instead of looping.
> > 
> > As the allocation is multiple layers deeper in the call stack, this
> > requires passing the EXT4_EX_NOFAIL flag down and handling it in various
> > places.
> > 
> > Of particular interest is the ext4_journal_start family of calls which
> > can now have EXT4_EX_NOFAIL 'or'ed in to the 'type'.  This could be seen
> > as a blurring of types.  However 'type' is 8 bits, and EXT4_EX_NOFAIL is
> > a high bit, so it is safe in practice.
> > 
> > jbd2__journal_start() is enhanced so that the gfp_t flags passed are
> > used for *all* allocations.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> 
> I'm not a fan. GFP_NOFAIL allows access to emergency reserves increasing
> the risk of a livelock if memory is completely depleted where as some
> callers can afford to wait.

Maybe we should wind back and focus on the documentation patches.
As quoted above, mm.h says:

> >  is definitely preferable to use the flag rather than opencode endless
> >  loop around allocator.

but you seem to be saying that is wrong.  I'd certainly like to get the
documentation right before changing any code.

Why does __GFP_NOFAIL access the reserves? Why not require that the
relevant "Try harder" flag (__GFP_ATOMIC or __GFP_MEMALLOC) be included
with __GFP_NOFAIL if that is justified?

There are over 100 __GFP_NOFAIL allocation sites.  I don't feel like
reviewing them all and seeing if any really need a try-harder flag.
Can we rename __GFP_NOFAIL to __GFP_NEVERFAIL and then
#define __GFP_NOFAIL (__GFP_NEVERFAIL | __GFP_ATOMIC)
and encourage the use of __GFP_NEVERFAIL in future?

When __GFP_NOFAIL loops, it calls congestion_wait() internally.  That
certainly needs to be fixed and the ideas you present below are
certainly worth considering when trying to understand how to address
that.  I'd rather fix it once there in page_alloc.c rather then export a
waiting API like congestion_wait().  That would provide more
flexibility.  e.g.  a newly freed page could be handed directly back to
the waiter.

Thanks,
NeilBrown



> 
> The key event should be reclaim making progress. The hack below is
> intended to vaguely demonstrate how blocking can be based on reclaim
> making progress instead of "congestion" but has not even been booted. A
> more complete overhaul may involve introducing
> reclaim_congestion_wait_nodemask(gfp_t gfp_mask, long timeout, nodemask_t *nodemask)
> and
> reclaim_congestion_wait_nodemask(gfp_t gfp_mask, long timeout)
> and converting congestion_wait and wait_iff_congestion to calling
> reclaim_congestion_wait_nodemask which waits on the first usable node
> and then audit every single congestion_wait() user to see which API
> they should call. Further work would be to establish whether the page allocator should
> call reclaim_congestion_wait_nodemask() if direct reclaim is not making
> progress or whether that should be in vmscan.c. Conceivably, GFP_NOFAIL
> could then soften its access to emergency reserves but I haven't given
> it much thought.
> 
> Yes it's significant work, but it would be a better than letting
> __GFP_NOFAIL propagate further and kicking us down the road.
> 
> This hack is terrible, it's not the right way to do it, it's just to
> illustrate the idea of "waiting on memory should be based on reclaim
> making progress and not the state of storage" is not impossible.
> 
> --8<--
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 5c0318509f9e..5ed81c5746ec 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -832,6 +832,7 @@ typedef struct pglist_data {
>  	unsigned long node_spanned_pages; /* total size of physical page
>  					     range, including holes */
>  	int node_id;
> +	wait_queue_head_t reclaim_wait;
>  	wait_queue_head_t kswapd_wait;
>  	wait_queue_head_t pfmemalloc_wait;
>  	struct task_struct *kswapd;	/* Protected by
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 6122c78ce914..21a9cd693d12 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -13,6 +13,7 @@
>  #include <linux/module.h>
>  #include <linux/writeback.h>
>  #include <linux/device.h>
> +#include <linux/swap.h>
>  #include <trace/events/writeback.h>
>  
>  struct backing_dev_info noop_backing_dev_info;
> @@ -1013,25 +1014,41 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
>  EXPORT_SYMBOL(set_bdi_congested);
>  
>  /**
> - * congestion_wait - wait for a backing_dev to become uncongested
> - * @sync: SYNC or ASYNC IO
> - * @timeout: timeout in jiffies
> + * congestion_wait - the docs are now worthless but avoiding a rename
>   *
> - * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> - * write congestion.  If no backing_devs are congested then just wait for the
> - * next write to be completed.
> + * New thing -- wait for a timeout or reclaim to make progress
>   */
>  long congestion_wait(int sync, long timeout)
>  {
> +	pg_data_t *pgdat;
>  	long ret;
>  	unsigned long start = jiffies;
>  	DEFINE_WAIT(wait);
> -	wait_queue_head_t *wqh = &congestion_wqh[sync];
> +	wait_queue_head_t *wqh;
>  
> -	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> -	ret = io_schedule_timeout(timeout);
> +	/* Never let kswapd sleep on itself */
> +	if (current_is_kswapd())
> +		goto trace;
> +
> +	/*
> +	 * Dangerous, local memory may be forbidden by cpuset or policies,
> +	 * use first eligible zone in zonelists node instead
> +	 */
> +	preempt_disable();
> +	pgdat = NODE_DATA(smp_processor_id());
> +	preempt_enable();
> +	wqh = &pgdat->reclaim_wait;
> +
> +	/*
> +	 * Should probably check watermark of suitable zones here
> +	 * in case this is spuriously called
> +	 */
> +
> +	prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE);
> +	ret = schedule_timeout(timeout);
>  	finish_wait(wqh, &wait);
>  
> +trace:
>  	trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
>  					jiffies_to_usecs(jiffies - start));
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5b09e71c9ce7..4b87b73d1264 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7418,6 +7418,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
>  	pgdat_init_split_queue(pgdat);
>  	pgdat_init_kcompactd(pgdat);
>  
> +	init_waitqueue_head(&pgdat->reclaim_wait);
>  	init_waitqueue_head(&pgdat->kswapd_wait);
>  	init_waitqueue_head(&pgdat->pfmemalloc_wait);
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 158c9c93d03c..0ac2cf6be5e3 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2888,6 +2888,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  	} while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));
>  }
>  
> +static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx);
> +
>  static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  {
>  	struct reclaim_state *reclaim_state = current->reclaim_state;
> @@ -3070,6 +3072,18 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  				    sc))
>  		goto again;
>  
> +	/*
> +	 * Might be race-prone, more appropriate to do this when exiting
> +	 * direct reclaim and when kswapd finds that pgdat is balanced.
> +	 * May also be appropriate to update pgdat_balanced to take
> +	 * a watermark level and wakeup when min watermarks are ok
> +	 * instead of waiting for the high watermark
> +	 */
> +	if (waitqueue_active(&pgdat->reclaim_wait) &&
> +	    pgdat_balanced(pgdat, 0, ZONE_MOVABLE)) {
> +		wake_up_interruptible(&pgdat->reclaim_wait);
> +	}
> +
>  	/*
>  	 * Kswapd gives up on balancing particular nodes after too
>  	 * many failures to reclaim anything from them and goes to
> 
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops.
  2021-09-14 16:34   ` Mel Gorman
  2021-09-14 21:48     ` NeilBrown
@ 2021-09-14 23:55     ` Dave Chinner
  2021-09-15  8:59       ` Mel Gorman
  1 sibling, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2021-09-14 23:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: NeilBrown, Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Jan Kara, Michal Hocko, Matthew Wilcox,
	linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm,
	linux-kernel

On Tue, Sep 14, 2021 at 05:34:32PM +0100, Mel Gorman wrote:
> On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote:
> > Indefinite loops waiting for memory allocation are discouraged by
> > documentation in gfp.h which says the use of __GFP_NOFAIL that it
> > 
> >  is definitely preferable to use the flag rather than opencode endless
> >  loop around allocator.
> > 
> > Such loops that use congestion_wait() are particularly unwise as
> > congestion_wait() is indistinguishable from
> > schedule_timeout_uninterruptible() in practice - and should be
> > deprecated.
> > 
> > So this patch changes the two loops in ext4_ext_truncate() to use
> > __GFP_NOFAIL instead of looping.
> > 
> > As the allocation is multiple layers deeper in the call stack, this
> > requires passing the EXT4_EX_NOFAIL flag down and handling it in various
> > places.
> > 
> > Of particular interest is the ext4_journal_start family of calls which
> > can now have EXT4_EX_NOFAIL 'or'ed in to the 'type'.  This could be seen
> > as a blurring of types.  However 'type' is 8 bits, and EXT4_EX_NOFAIL is
> > a high bit, so it is safe in practice.
> > 
> > jbd2__journal_start() is enhanced so that the gfp_t flags passed are
> > used for *all* allocations.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> 
> I'm not a fan. GFP_NOFAIL allows access to emergency reserves increasing
> the risk of a livelock if memory is completely depleted where as some
> callers can afford to wait.

Undocumented behaviour, never mentioned or communicated to users in
any __GFP_NOFAIL discussion I've taken part in until now.

How is it different to, say, GFP_ATOMIC? i.e. Does GFP_NOFAIL
actually imply GFP_ATOMIC, or is there some other undocumented
behaviour going on here?

We've already go ~80 __GFP_NOFAIL allocation contexts in fs/ and the
vast majority of the are GFP_KERNEL | __GFP_NOFAIL or GFP_NOFS |
__GFP_NOFAIL, so some clarification on what this actually means
would be really good...

> The key event should be reclaim making progress.

Yup, that's what we need, but I don't see why it needs to be exposed
outside the allocation code at all.

> The hack below is
> intended to vaguely demonstrate how blocking can be based on reclaim
> making progress instead of "congestion" but has not even been booted. A
> more complete overhaul may involve introducing
> reclaim_congestion_wait_nodemask(gfp_t gfp_mask, long timeout, nodemask_t *nodemask)
> and
> reclaim_congestion_wait_nodemask(gfp_t gfp_mask, long timeout)

I think that's racy. There's no guarantee that the node we are
currently running on matches the cpu/node id that we failed to
allocate from. Pre-emptible kernels and all that. IOWs, I think
needs to be completely internal to the reclaim infrastructure and
based on the current context we are trying to reclaim from.

That way "GFP_RETRY_FOREVER" allocation contexts don't have to jump
through an ever changing tangle of hoops to make basic "never-fail"
allocation semantics behave correctly.

> and converting congestion_wait and wait_iff_congestion to calling
> reclaim_congestion_wait_nodemask which waits on the first usable node
> and then audit every single congestion_wait() user to see which API
> they should call. Further work would be to establish whether the page allocator should
> call reclaim_congestion_wait_nodemask() if direct reclaim is not making
> progress or whether that should be in vmscan.c. Conceivably, GFP_NOFAIL
> could then soften its access to emergency reserves but I haven't given
> it much thought.
> 
> Yes it's significant work, but it would be a better than letting
> __GFP_NOFAIL propagate further and kicking us down the road.

Unfortunately, that seems to ignore the fact that we still need
never-fail allocation semantics for stable system performance.  Like
it or not the requirements for __GFP_NOFAIL (and "retry forever"
equivalent semantics) or open coded endless retry loops
are *never* going away.

IOWs, I'd suggest that we should think about how to formally
support "never-fail" allocation semantics in both the API and the
implementation in such a way that we don't end up with this
__GFP_NOFAIL catch-22 ever again. Having the memory reclaim code
wait on forwards progress instead of congestion as you propose here
would be a core part of providing "never-fail" allocations...

> This hack is terrible, it's not the right way to do it, it's just to
> illustrate the idea of "waiting on memory should be based on reclaim
> making progress and not the state of storage" is not impossible.

I've been saying that is how reclaim should work for years. :/

It was LFSMM 2013 or 2014 that I was advocating for memory reclaim
to move to IO-less reclaim throttling based on the rate at which
free pages are returned to the freelists similar to the way IO-less
dirty page throttling is based on the rate dirty pages are cleaned.

Relying on IO interactions (submitting IO or waiting for completion)
for high level page state management has always been a bad way to
throttle demand because it only provides indirect control and has
poor feedback indication.

It's also a good way to remove the dependency on direct reclaim -
just sleep instead of duplicating the work that kswapd should
already be doing in the background to reclaim pages...

> --8<--
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 5c0318509f9e..5ed81c5746ec 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -832,6 +832,7 @@ typedef struct pglist_data {
>  	unsigned long node_spanned_pages; /* total size of physical page
>  					     range, including holes */
>  	int node_id;
> +	wait_queue_head_t reclaim_wait;
>  	wait_queue_head_t kswapd_wait;
>  	wait_queue_head_t pfmemalloc_wait;
>  	struct task_struct *kswapd;	/* Protected by
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 6122c78ce914..21a9cd693d12 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -13,6 +13,7 @@
>  #include <linux/module.h>
>  #include <linux/writeback.h>
>  #include <linux/device.h>
> +#include <linux/swap.h>
>  #include <trace/events/writeback.h>
>  
>  struct backing_dev_info noop_backing_dev_info;
> @@ -1013,25 +1014,41 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
>  EXPORT_SYMBOL(set_bdi_congested);
>  
>  /**
> - * congestion_wait - wait for a backing_dev to become uncongested
> - * @sync: SYNC or ASYNC IO
> - * @timeout: timeout in jiffies
> + * congestion_wait - the docs are now worthless but avoiding a rename
>   *
> - * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> - * write congestion.  If no backing_devs are congested then just wait for the
> - * next write to be completed.
> + * New thing -- wait for a timeout or reclaim to make progress
>   */
>  long congestion_wait(int sync, long timeout)
>  {
> +	pg_data_t *pgdat;
>  	long ret;
>  	unsigned long start = jiffies;
>  	DEFINE_WAIT(wait);
> -	wait_queue_head_t *wqh = &congestion_wqh[sync];
> +	wait_queue_head_t *wqh;
>  
> -	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> -	ret = io_schedule_timeout(timeout);
> +	/* Never let kswapd sleep on itself */
> +	if (current_is_kswapd())
> +		goto trace;

I think this breaks the kswapd 100ms immediate reclaim backoff in
shrink_node().

> +
> +	/*
> +	 * Dangerous, local memory may be forbidden by cpuset or policies,
> +	 * use first eligible zone in zonelists node instead
> +	 */
> +	preempt_disable();
> +	pgdat = NODE_DATA(smp_processor_id());
> +	preempt_enable();
> +	wqh = &pgdat->reclaim_wait;

This goes away if it is kept internal and is passed the reclaim
pgdat context we just failed to reclaim pages from.

> +
> +	/*
> +	 * Should probably check watermark of suitable zones here
> +	 * in case this is spuriously called
> +	 */

Ditto.

These hacks really make me think that an external "wait for memory
reclaim to make progress before retrying allocation" behaviour is
the wrong way to tackle this. It's always been a hack because
open-coded retry loops had to be implemented everywhere for
never-fail allocation semantics.

Neil has the right idea by replacing such fail-never back-offs with
actual allocation attempts that encapsulate waiting for reclaim to
make progress. This needs to be a formally supported function of
memory allocation, and then these backoffs can be properly
integrated into the memory reclaim retry mechanism instead of being
poorly grafted onto the side...

Whether that be __GFP_NOFAIL or GFP_RETRY_FOREVER that doesn't have
the "dip into reserves" behaviour of __GFP_NOFAIL (which we clearly
don't need because open coded retry loops have clearly work well
enough for production systems for many years), I don't really care.

But I think the memory allocation subsystem needs to move beyond
"ahhhh, never-fail is too hard!!!!" and take steps to integrate this
behaviour properly so that it can be made to work a whole lot better
than it currently does....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops.
  2021-09-14  0:13 ` [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops NeilBrown
  2021-09-14 16:34   ` Mel Gorman
@ 2021-09-15  0:28   ` Theodore Ts'o
  2021-09-15  5:25     ` NeilBrown
  1 sibling, 1 reply; 34+ messages in thread
From: Theodore Ts'o @ 2021-09-15  0:28 UTC (permalink / raw)
  To: NeilBrown
  Cc: Andrew Morton, Andreas Dilger, Darrick J. Wong, Matthew Wilcox,
	Mel Gorman, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs,
	linux-mm, linux-kernel

On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote:
> 
> Of particular interest is the ext4_journal_start family of calls which
> can now have EXT4_EX_NOFAIL 'or'ed in to the 'type'.  This could be seen
> as a blurring of types.  However 'type' is 8 bits, and EXT4_EX_NOFAIL is
> a high bit, so it is safe in practice.

I'm really not fond of this type blurring.  What I'd suggeset doing
instead is adding a "gfp_t gfp_mask" parameter to the
__ext4_journal_start_sb().  With the exception of one call site in
fs/ext4/ialloc.c, most of the callers of __ext4_journal_start_sb() are
via #define helper macros or inline funcions.  So it would just
require adding a GFP_NOFS as an extra parameter to the various macros
and inline functions which call __ext4_journal_start_sb() in
ext4_jbd2.h.

The function ext4_journal_start_with_revoke() is called exactly once
so we could just bury the __GFP_NOFAIL in the definition of that
macros, e.g.:

#define ext4_journal_start_with_revoke(inode, type, blocks, revoke_creds) \
	__ext4_journal_start((inode), __LINE__, (type), (blocks), 0,	\
			     GFP_NOFS | __GFP_NOFAIL, (revoke_creds))

but it's probably better to do something like this:

#define ext4_journal_start_with_revoke(gfp_mask, inode, type, blocks, revoke_creds) \
	__ext4_journal_start((inode), __LINE__, (type), (blocks), 0,	\
			     gfp_mask, (revoke_creds))

So it's explicit in the C function ext4_ext_remove_space() in
fs/ext4/extents.c that we are explicitly requesting the __GFP_NOFAIL
behavior.

Does that make sense?

					- Ted

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops.
  2021-09-15  0:28   ` Theodore Ts'o
@ 2021-09-15  5:25     ` NeilBrown
  2021-09-15 17:02       ` Theodore Ts'o
  0 siblings, 1 reply; 34+ messages in thread
From: NeilBrown @ 2021-09-15  5:25 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Andrew Morton, Andreas Dilger, Darrick J. Wong, Matthew Wilcox,
	Mel Gorman, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs,
	linux-mm, linux-kernel

On Wed, 15 Sep 2021, Theodore Ts'o wrote:
> On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote:
> > 
> > Of particular interest is the ext4_journal_start family of calls which
> > can now have EXT4_EX_NOFAIL 'or'ed in to the 'type'.  This could be seen
> > as a blurring of types.  However 'type' is 8 bits, and EXT4_EX_NOFAIL is
> > a high bit, so it is safe in practice.
> 
> I'm really not fond of this type blurring.  What I'd suggeset doing
> instead is adding a "gfp_t gfp_mask" parameter to the
> __ext4_journal_start_sb().  With the exception of one call site in
> fs/ext4/ialloc.c, most of the callers of __ext4_journal_start_sb() are
> via #define helper macros or inline funcions.  So it would just
> require adding a GFP_NOFS as an extra parameter to the various macros
> and inline functions which call __ext4_journal_start_sb() in
> ext4_jbd2.h.
> 
> The function ext4_journal_start_with_revoke() is called exactly once
> so we could just bury the __GFP_NOFAIL in the definition of that
> macros, e.g.:
> 
> #define ext4_journal_start_with_revoke(inode, type, blocks, revoke_creds) \
> 	__ext4_journal_start((inode), __LINE__, (type), (blocks), 0,	\
> 			     GFP_NOFS | __GFP_NOFAIL, (revoke_creds))
> 
> but it's probably better to do something like this:
> 
> #define ext4_journal_start_with_revoke(gfp_mask, inode, type, blocks, revoke_creds) \
> 	__ext4_journal_start((inode), __LINE__, (type), (blocks), 0,	\
> 			     gfp_mask, (revoke_creds))
> 
> So it's explicit in the C function ext4_ext_remove_space() in
> fs/ext4/extents.c that we are explicitly requesting the __GFP_NOFAIL
> behavior.
> 
> Does that make sense?

Mostly.
Adding gfp_mask to __ext4_journal_start_sb() make perfect sense.
There doesn't seem much point adding one to __ext4_journal_start(),
we can have ext4_journal_start_with_revoke() call
__ext4_journal_start_sb() directly.
But I cannot see what it doesn't already do that.
i.e. why have the inline __ext4_journal_start() at all?
Is it OK if I don't use that for ext4_journal_start_with_revoke()?

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops.
  2021-09-14 23:55     ` Dave Chinner
@ 2021-09-15  8:59       ` Mel Gorman
  2021-09-15 12:20         ` Michal Hocko
  2021-09-15 14:35         ` Mel Gorman
  0 siblings, 2 replies; 34+ messages in thread
From: Mel Gorman @ 2021-09-15  8:59 UTC (permalink / raw)
  To: Dave Chinner
  Cc: NeilBrown, Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Jan Kara, Michal Hocko, Matthew Wilcox,
	linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm,
	linux-kernel

On Wed, Sep 15, 2021 at 09:55:35AM +1000, Dave Chinner wrote:
> On Tue, Sep 14, 2021 at 05:34:32PM +0100, Mel Gorman wrote:
> > On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote:
> > > Indefinite loops waiting for memory allocation are discouraged by
> > > documentation in gfp.h which says the use of __GFP_NOFAIL that it
> > > 
> > >  is definitely preferable to use the flag rather than opencode endless
> > >  loop around allocator.
> > > 
> > > Such loops that use congestion_wait() are particularly unwise as
> > > congestion_wait() is indistinguishable from
> > > schedule_timeout_uninterruptible() in practice - and should be
> > > deprecated.
> > > 
> > > So this patch changes the two loops in ext4_ext_truncate() to use
> > > __GFP_NOFAIL instead of looping.
> > > 
> > > As the allocation is multiple layers deeper in the call stack, this
> > > requires passing the EXT4_EX_NOFAIL flag down and handling it in various
> > > places.
> > > 
> > > Of particular interest is the ext4_journal_start family of calls which
> > > can now have EXT4_EX_NOFAIL 'or'ed in to the 'type'.  This could be seen
> > > as a blurring of types.  However 'type' is 8 bits, and EXT4_EX_NOFAIL is
> > > a high bit, so it is safe in practice.
> > > 
> > > jbd2__journal_start() is enhanced so that the gfp_t flags passed are
> > > used for *all* allocations.
> > > 
> > > Signed-off-by: NeilBrown <neilb@suse.de>
> > 
> > I'm not a fan. GFP_NOFAIL allows access to emergency reserves increasing
> > the risk of a livelock if memory is completely depleted where as some
> > callers can afford to wait.
> 
> Undocumented behaviour, never mentioned or communicated to users in
> any __GFP_NOFAIL discussion I've taken part in until now.
> 
> How is it different to, say, GFP_ATOMIC? i.e. Does GFP_NOFAIL
> actually imply GFP_ATOMIC, or is there some other undocumented
> behaviour going on here?
> 

Hmm, it's similar but not the same as GFP_ATOMIC. The most severe aspect
of depleting emergency reserves comes from this block which is relevant
when the system is effectively OOM

        /*
         * XXX: GFP_NOFS allocations should rather fail than rely on
         * other request to make a forward progress.
         * We are in an unfortunate situation where out_of_memory cannot
         * do much for this context but let's try it to at least get
         * access to memory reserved if the current task is killed (see
         * out_of_memory). Once filesystems are ready to handle allocation
         * failures more gracefully we should just bail out here.
         */

        /* Exhausted what can be done so it's blame time */
        if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
                *did_some_progress = 1;

                /*
                 * Help non-failing allocations by giving them access to memory
                 * reserves
                 */
                if (gfp_mask & __GFP_NOFAIL)
                        page = __alloc_pages_cpuset_fallback(gfp_mask, order,
                                        ALLOC_NO_WATERMARKS, ac);
        }

The less severe aspect comes from

                /*
                 * Help non-failing allocations by giving them access to memory
                 * reserves but do not use ALLOC_NO_WATERMARKS because this
                 * could deplete whole memory reserves which would just make
                 * the situation worse
                 */
                page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
                if (page)
                        goto got_pg;

This doesn't dip into reserves as much as an atomic allocation does but
it competes with them.


> We've already go ~80 __GFP_NOFAIL allocation contexts in fs/ and the
> vast majority of the are GFP_KERNEL | __GFP_NOFAIL or GFP_NOFS |
> __GFP_NOFAIL, so some clarification on what this actually means
> would be really good...
> 

I'm not sure how much clarity can be given. Whatever the documented
semantics, at some point under the current implementation __GFP_NOFAIL
potentially competes with the same reserves as GFP_ATOMIC and has a path
where watermarks are ignored entirely.

> > The key event should be reclaim making progress.
> 
> Yup, that's what we need, but I don't see why it needs to be exposed
> outside the allocation code at all.
> 

Probably not. At least some of it could be contained within reclaim
itself to block when reclaim is not making progress as opposed to anything
congestion related. That might still livelock if no progress can be made
but that's not new, the OOM hammer should eventually kick in.

> > The hack below is
> > intended to vaguely demonstrate how blocking can be based on reclaim
> > making progress instead of "congestion" but has not even been booted. A
> > more complete overhaul may involve introducing
> > reclaim_congestion_wait_nodemask(gfp_t gfp_mask, long timeout, nodemask_t *nodemask)
> > and
> > reclaim_congestion_wait_nodemask(gfp_t gfp_mask, long timeout)
> 
> I think that's racy. There's no guarantee that the node we are
> currently running on matches the cpu/node id that we failed to
> allocate from.

I know, I commented

+       /*
+        * Dangerous, local memory may be forbidden by cpuset or policies,
+        * use first eligible zone in zonelists node instead
+        */

There may be multiple nodes "we failed to allocate from", but the first
eligible node is definitely one of them. There is the possibility that
the first eligible node may be completely unreclaimable (all anonymous,
no swap) in which case the timeout kicks in. I don't think this should
be a global workqueue because there will be spurious wakeups.

> Pre-emptible kernels and all that. IOWs, I think
> needs to be completely internal to the reclaim infrastructure and
> based on the current context we are trying to reclaim from.
> 

A further step could be something similar to capture_control
whereby reclaimed pages are immediately assigned to tasks blocked on
reclaim_congestion_wait. It may be excessively complicated and overkill.

> That way "GFP_RETRY_FOREVER" allocation contexts don't have to jump
> through an ever changing tangle of hoops to make basic "never-fail"
> allocation semantics behave correctly.
> 

True and I can see what that is desirable. What I'm saying is that right
now, increasing the use of __GFP_NOFAIL may cause a different set of
problems (unbounded retries combined with ATOMIC allocation failures) as
they compete for similar resources.

> > and converting congestion_wait and wait_iff_congestion to calling
> > reclaim_congestion_wait_nodemask which waits on the first usable node
> > and then audit every single congestion_wait() user to see which API
> > they should call. Further work would be to establish whether the page allocator should
> > call reclaim_congestion_wait_nodemask() if direct reclaim is not making
> > progress or whether that should be in vmscan.c. Conceivably, GFP_NOFAIL
> > could then soften its access to emergency reserves but I haven't given
> > it much thought.
> > 
> > Yes it's significant work, but it would be a better than letting
> > __GFP_NOFAIL propagate further and kicking us down the road.
> 
> Unfortunately, that seems to ignore the fact that we still need
> never-fail allocation semantics for stable system performance.  Like
> it or not the requirements for __GFP_NOFAIL (and "retry forever"
> equivalent semantics) or open coded endless retry loops
> are *never* going away.
> 

I'm aware there will be cases where never-fail allocation semantics are
required, particularly in GFP_NOFS contexts. What I'm saying is that right
now because throttling is based on imaginary "congestion" that increasing
the use could result in live-lock like bugs when multiple users complete
for similar emergency resources to atomic. Note that I didn't NACK this.

> IOWs, I'd suggest that we should think about how to formally
> support "never-fail" allocation semantics in both the API and the
> implementation in such a way that we don't end up with this
> __GFP_NOFAIL catch-22 ever again. Having the memory reclaim code
> wait on forwards progress instead of congestion as you propose here
> would be a core part of providing "never-fail" allocations...
> 
> > This hack is terrible, it's not the right way to do it, it's just to
> > illustrate the idea of "waiting on memory should be based on reclaim
> > making progress and not the state of storage" is not impossible.
> 
> I've been saying that is how reclaim should work for years. :/
> 
> It was LFSMM 2013 or 2014 that I was advocating for memory reclaim
> to move to IO-less reclaim throttling based on the rate at which
> free pages are returned to the freelists similar to the way IO-less
> dirty page throttling is based on the rate dirty pages are cleaned.
> 

I'm going to guess no one ever tried.

> Relying on IO interactions (submitting IO or waiting for completion)
> for high level page state management has always been a bad way to
> throttle demand because it only provides indirect control and has
> poor feedback indication.
> 

Also true.

> It's also a good way to remove the dependency on direct reclaim -
> just sleep instead of duplicating the work that kswapd should
> already be doing in the background to reclaim pages...
> 

Even for direct reclaim, I do think that the number of direct reclaimers
should be limited with the rest going to sleep. At some point, excessive
direct reclaim tasks are simply hammering the lru lock.

> > --8<--
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 5c0318509f9e..5ed81c5746ec 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -832,6 +832,7 @@ typedef struct pglist_data {
> >  	unsigned long node_spanned_pages; /* total size of physical page
> >  					     range, including holes */
> >  	int node_id;
> > +	wait_queue_head_t reclaim_wait;
> >  	wait_queue_head_t kswapd_wait;
> >  	wait_queue_head_t pfmemalloc_wait;
> >  	struct task_struct *kswapd;	/* Protected by
> > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > index 6122c78ce914..21a9cd693d12 100644
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -13,6 +13,7 @@
> >  #include <linux/module.h>
> >  #include <linux/writeback.h>
> >  #include <linux/device.h>
> > +#include <linux/swap.h>
> >  #include <trace/events/writeback.h>
> >  
> >  struct backing_dev_info noop_backing_dev_info;
> > @@ -1013,25 +1014,41 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
> >  EXPORT_SYMBOL(set_bdi_congested);
> >  
> >  /**
> > - * congestion_wait - wait for a backing_dev to become uncongested
> > - * @sync: SYNC or ASYNC IO
> > - * @timeout: timeout in jiffies
> > + * congestion_wait - the docs are now worthless but avoiding a rename
> >   *
> > - * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> > - * write congestion.  If no backing_devs are congested then just wait for the
> > - * next write to be completed.
> > + * New thing -- wait for a timeout or reclaim to make progress
> >   */
> >  long congestion_wait(int sync, long timeout)
> >  {
> > +	pg_data_t *pgdat;
> >  	long ret;
> >  	unsigned long start = jiffies;
> >  	DEFINE_WAIT(wait);
> > -	wait_queue_head_t *wqh = &congestion_wqh[sync];
> > +	wait_queue_head_t *wqh;
> >  
> > -	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> > -	ret = io_schedule_timeout(timeout);
> > +	/* Never let kswapd sleep on itself */
> > +	if (current_is_kswapd())
> > +		goto trace;
> 
> I think this breaks the kswapd 100ms immediate reclaim backoff in
> shrink_node().
> 

Yep, it is. That would definitely need better care.

> > +
> > +	/*
> > +	 * Dangerous, local memory may be forbidden by cpuset or policies,
> > +	 * use first eligible zone in zonelists node instead
> > +	 */
> > +	preempt_disable();
> > +	pgdat = NODE_DATA(smp_processor_id());
> > +	preempt_enable();
> > +	wqh = &pgdat->reclaim_wait;
> 
> This goes away if it is kept internal and is passed the reclaim
> pgdat context we just failed to reclaim pages from.
> 

Yep, that would also work if this was called only from reclaim contexts
or mm internally. Some helper would still be needed to implement an
alternative congestion_wait that looks up the same information until
congestion_wait callers can be removed.

Again, I wasn't trying to offer a correct implementation, only illustrating
that it's perfectly possible to throttle based on reclaim making progress
instead of "congestion".

> > +
> > +	/*
> > +	 * Should probably check watermark of suitable zones here
> > +	 * in case this is spuriously called
> > +	 */
> 
> Ditto.
> 
> These hacks really make me think that an external "wait for memory
> reclaim to make progress before retrying allocation" behaviour is
> the wrong way to tackle this. It's always been a hack because
> open-coded retry loops had to be implemented everywhere for
> never-fail allocation semantics.
> 
> Neil has the right idea by replacing such fail-never back-offs with
> actual allocation attempts that encapsulate waiting for reclaim to
> make progress. This needs to be a formally supported function of
> memory allocation, and then these backoffs can be properly
> integrated into the memory reclaim retry mechanism instead of being
> poorly grafted onto the side...
> 

I'm not necessarily opposed to this. What I'm saying is that doing the
conversion now *MIGHT* mean an increase in live-lock-like bugs because
with the current implementation, the callers may not sleep/throttle in
the same way the crappy "loop around congestion_wait" implementations did.

> Whether that be __GFP_NOFAIL or GFP_RETRY_FOREVER that doesn't have
> the "dip into reserves" behaviour of __GFP_NOFAIL (which we clearly
> don't need because open coded retry loops have clearly work well
> enough for production systems for many years), I don't really care.
> 

I suspected this was true and that it might be appropriate for __GFP_NOFAIL
to obey normal watermarks unless __GFP_HIGH is also specified if it's
absolutely necessary but I'm not sure because I haven't put enough thought
into it.

> But I think the memory allocation subsystem needs to move beyond
> "ahhhh, never-fail is too hard!!!!" and take steps to integrate this
> behaviour properly so that it can be made to work a whole lot better
> than it currently does....
> 

Again, not opposed. It's simply a heads-up that converting now may cause
problems that manifest as livelock-like bugs unless, at minimum, internal
reclaim bases throttling on some reclaim making progress instead of
congestion_wait. Given my current load, I can't promise I'd find the time
to follow through with converting the hack into a proper implementation
but someone reading linux-mm might. Either way, I felt it was necessary
to at least warn about the hazards.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/6] MM: improve documentation for __GFP_NOFAIL
  2021-09-14  0:13 ` [PATCH 1/6] MM: improve documentation for __GFP_NOFAIL NeilBrown
@ 2021-09-15 11:51   ` Michal Hocko
  0 siblings, 0 replies; 34+ messages in thread
From: Michal Hocko @ 2021-09-15 11:51 UTC (permalink / raw)
  To: NeilBrown
  Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman, linux-xfs,
	linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel

On Tue 14-09-21 10:13:04, Neil Brown wrote:
> __GFP_NOFAIL is documented both in gfp.h and memory-allocation.rst.
> The details are not entirely consistent.
> 
> This patch ensures both places state that:
>  - there is a cost potentially imposed on other subsystems
>  - it should only be used when there is no real alternative
>  - it is preferable to an endless loop
>  - it is strongly discourages for costly-order allocations.
>

Yes this is a useful addition to the documentation. Thanks!

> Signed-off-by: NeilBrown <neilb@suse.de>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  Documentation/core-api/memory-allocation.rst |    9 ++++++++-
>  include/linux/gfp.h                          |    4 ++++
>  2 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/core-api/memory-allocation.rst b/Documentation/core-api/memory-allocation.rst
> index 5954ddf6ee13..9458ce72d31c 100644
> --- a/Documentation/core-api/memory-allocation.rst
> +++ b/Documentation/core-api/memory-allocation.rst
> @@ -126,7 +126,14 @@ or another request.
>  
>    * ``GFP_KERNEL | __GFP_NOFAIL`` - overrides the default allocator behavior
>      and all allocation requests will loop endlessly until they succeed.
> -    This might be really dangerous especially for larger orders.
> +    The allocator may provide access to memory that would otherwise be
> +    reserved in order to satisfy this allocation which might adversely
> +    affect other subsystems.  So it should only be used when there is no
> +    reasonable failure policy and when the memory is likely to be freed
> +    again in the near future.  Its use is strong discourage (via a
> +    WARN_ON) for allocations larger than ``PAGE_ALLOC_COSTLY_ORDER``.
> +    While this flag is best avoided, it is still preferable to endless
> +    loops around the allocator.
>  
>  Selecting memory allocator
>  ==========================
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 55b2ec1f965a..101479373738 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -209,6 +209,10 @@ struct vm_area_struct;
>   * used only when there is no reasonable failure policy) but it is
>   * definitely preferable to use the flag rather than opencode endless
>   * loop around allocator.
> + * Use of this flag may provide access to memory which would otherwise be
> + * reserved.  As such it must be understood that there can be a cost imposed
> + * on other subsystems as well as the obvious cost of placing the calling
> + * thread in an uninterruptible indefinite wait.
>   * Using this flag for costly allocations is _highly_ discouraged.
>   */
>  #define __GFP_IO	((__force gfp_t)___GFP_IO)
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/6] MM: annotate congestion_wait() and wait_iff_congested() as ineffective.
  2021-09-14  0:13 ` [PATCH 2/6] MM: annotate congestion_wait() and wait_iff_congested() as ineffective NeilBrown
@ 2021-09-15 11:56   ` Michal Hocko
  2021-09-16 22:13     ` NeilBrown
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2021-09-15 11:56 UTC (permalink / raw)
  To: NeilBrown
  Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman, linux-xfs,
	linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel

On Tue 14-09-21 10:13:04, Neil Brown wrote:
> Only 4 subsystems call set_bdi_congested() or clear_bdi_congested():
>  block/pktcdvd, fs/ceph fs/fuse fs/nfs
> 
> It may make sense to use congestion_wait() or wait_iff_congested()
> within these subsystems, but they have no value outside of these.
> 
> Add documentation comments to these functions to discourage further use.

This is an unfortunate state. The MM layer still relies on the API.
While adding a documentation to clarify the current status can stop more
usage I am wondering what is a real alternative. My experience tells me
that a lack of real alternative will lead to new creative ways of doing
things instead.
 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  include/linux/backing-dev.h |    7 +++++++
>  mm/backing-dev.c            |    9 +++++++++
>  2 files changed, 16 insertions(+)
> 
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index ac7f231b8825..cc9513840351 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -153,6 +153,13 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
>  	return wb->congested & cong_bits;
>  }
>  
> +/* NOTE congestion_wait() and wait_iff_congested() are
> + * largely useless except as documentation.
> + * congestion_wait() will (almost) always wait for the given timeout.
> + * wait_iff_congested() will (almost) never wait, but will call
> + * cond_resched().
> + * Were possible an alternative waiting strategy should be found.
> + */
>  long congestion_wait(int sync, long timeout);
>  long wait_iff_congested(int sync, long timeout);
>  
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 4a9d4e27d0d9..53472ab38796 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -1023,6 +1023,11 @@ EXPORT_SYMBOL(set_bdi_congested);
>   * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
>   * write congestion.  If no backing_devs are congested then just wait for the
>   * next write to be completed.
> + *
> + * NOTE: in the current implementation, hardly any backing_devs are ever
> + * marked as congested, and write-completion is rarely reported (see calls
> + * to clear_bdi_congested).  So this should not be assumed to ever wake before
> + * the timeout.
>   */
>  long congestion_wait(int sync, long timeout)
>  {
> @@ -1054,6 +1059,10 @@ EXPORT_SYMBOL(congestion_wait);
>   * The return value is 0 if the sleep is for the full timeout. Otherwise,
>   * it is the number of jiffies that were still remaining when the function
>   * returned. return_value == timeout implies the function did not sleep.
> + *
> + * NOTE: in the current implementation, hardly any backing_devs are ever
> + * marked as congested, and write-completion is rarely reported (see calls
> + * to clear_bdi_congested).  So this should not be assumed to sleep at all.
>   */
>  long wait_iff_congested(int sync, long timeout)
>  {
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops.
  2021-09-14 21:48     ` NeilBrown
@ 2021-09-15 12:06       ` Michal Hocko
  2021-09-15 22:35         ` NeilBrown
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2021-09-15 12:06 UTC (permalink / raw)
  To: NeilBrown
  Cc: Mel Gorman, Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Jan Kara, Matthew Wilcox, linux-xfs, linux-ext4,
	linux-fsdevel, linux-nfs, linux-mm, linux-kernel

On Wed 15-09-21 07:48:11, Neil Brown wrote:
> On Wed, 15 Sep 2021, Mel Gorman wrote:
> > On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote:
> > > Indefinite loops waiting for memory allocation are discouraged by
> > > documentation in gfp.h which says the use of __GFP_NOFAIL that it
> > > 
> > >  is definitely preferable to use the flag rather than opencode endless
> > >  loop around allocator.
> > > 
> > > Such loops that use congestion_wait() are particularly unwise as
> > > congestion_wait() is indistinguishable from
> > > schedule_timeout_uninterruptible() in practice - and should be
> > > deprecated.
> > > 
> > > So this patch changes the two loops in ext4_ext_truncate() to use
> > > __GFP_NOFAIL instead of looping.
> > > 
> > > As the allocation is multiple layers deeper in the call stack, this
> > > requires passing the EXT4_EX_NOFAIL flag down and handling it in various
> > > places.
> > > 
> > > Of particular interest is the ext4_journal_start family of calls which
> > > can now have EXT4_EX_NOFAIL 'or'ed in to the 'type'.  This could be seen
> > > as a blurring of types.  However 'type' is 8 bits, and EXT4_EX_NOFAIL is
> > > a high bit, so it is safe in practice.
> > > 
> > > jbd2__journal_start() is enhanced so that the gfp_t flags passed are
> > > used for *all* allocations.
> > > 
> > > Signed-off-by: NeilBrown <neilb@suse.de>
> > 
> > I'm not a fan. GFP_NOFAIL allows access to emergency reserves increasing
> > the risk of a livelock if memory is completely depleted where as some
> > callers can afford to wait.
> 
> Maybe we should wind back and focus on the documentation patches.
> As quoted above, mm.h says:
> 
> > >  is definitely preferable to use the flag rather than opencode endless
> > >  loop around allocator.
> 
> but you seem to be saying that is wrong.  I'd certainly like to get the
> documentation right before changing any code.
> 
> Why does __GFP_NOFAIL access the reserves? Why not require that the
> relevant "Try harder" flag (__GFP_ATOMIC or __GFP_MEMALLOC) be included
> with __GFP_NOFAIL if that is justified?

Does 5020e285856c ("mm, oom: give __GFP_NOFAIL allocations access to
memory reserves") help?

I would be worried to make the semantic even more complex than already
is. Access to memory reserves is an implementation detail that the page
allocator does currently. Callers shouldn't really be worried about
that. I do not ever remember any actual NOFAIL triggered memory
exhaustion. I have seen that to happen for unrestricted access to memory
reserves by OOM victim though. Hence cd04ae1e2dc8 ("mm, oom: do not rely
on TIF_MEMDIE for memory reserves access"). We can consider something
similar if NOFAIL allocation really tend to show a similar problem. We
do not want callers to care about OOM sitauations for this kind of
requests.

__GFP_NOFAIL | __GFP_HIGH is certainly something that is a valid usage
but I wouldn't base OOM behavior based on that.

> There are over 100 __GFP_NOFAIL allocation sites.  I don't feel like
> reviewing them all and seeing if any really need a try-harder flag.
> Can we rename __GFP_NOFAIL to __GFP_NEVERFAIL and then
> #define __GFP_NOFAIL (__GFP_NEVERFAIL | __GFP_ATOMIC)
> and encourage the use of __GFP_NEVERFAIL in future?

Doesn't this add even more complexity?

> When __GFP_NOFAIL loops, it calls congestion_wait() internally.  That
> certainly needs to be fixed and the ideas you present below are
> certainly worth considering when trying to understand how to address
> that.  I'd rather fix it once there in page_alloc.c rather then export a
> waiting API like congestion_wait().  That would provide more
> flexibility.  e.g.  a newly freed page could be handed directly back to
> the waiter.

Completely agreed here. We really do not want people to open code NOFAIL
unless they can do something really subsystem specific that would help
to make a forward progress.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops.
  2021-09-15  8:59       ` Mel Gorman
@ 2021-09-15 12:20         ` Michal Hocko
  2021-09-15 14:35         ` Mel Gorman
  1 sibling, 0 replies; 34+ messages in thread
From: Michal Hocko @ 2021-09-15 12:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Dave Chinner, NeilBrown, Andrew Morton, Theodore Ts'o,
	Andreas Dilger, Darrick J. Wong, Jan Kara, Matthew Wilcox,
	linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm,
	linux-kernel

On Wed 15-09-21 09:59:04, Mel Gorman wrote:
> On Wed, Sep 15, 2021 at 09:55:35AM +1000, Dave Chinner wrote:

> > That way "GFP_RETRY_FOREVER" allocation contexts don't have to jump
> > through an ever changing tangle of hoops to make basic "never-fail"
> > allocation semantics behave correctly.
> > 
> 
> True and I can see what that is desirable. What I'm saying is that right
> now, increasing the use of __GFP_NOFAIL may cause a different set of
> problems (unbounded retries combined with ATOMIC allocation failures) as
> they compete for similar resources.

I have commented on reasoning behind the above code in other reply. Let
me just comment on this particular concern. I completely do agree that
any use of __GFP_NOFAIL should be carefully evaluated. This is a very
strong recuirement and it should be used only as a last resort.
On the other hand converting an existing open coded nofail code that
_doesn't_ really do any clever tricks to allow a forward progress (e.g.
dropping locks, kicking some internal caching mechinisms etc.) should
just be turned into __GPF_NOFAIL. Not only it makes it easier to spot
that code but it also allows the page allocator to behave consistently
and predictably.

If the existing heuristic wrt. memory reserves to GFP_NOFAIL turns out
to be suboptimal we can fix it for all those users.

Dropping the rest of the email which talks about reclaim changes because
I will need much more time to digest that.
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops.
  2021-09-15  8:59       ` Mel Gorman
  2021-09-15 12:20         ` Michal Hocko
@ 2021-09-15 14:35         ` Mel Gorman
  2021-09-15 22:38           ` Dave Chinner
  1 sibling, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2021-09-15 14:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: NeilBrown, Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Jan Kara, Michal Hocko, Matthew Wilcox,
	linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm,
	linux-kernel

On Wed, Sep 15, 2021 at 09:59:04AM +0100, Mel Gorman wrote:
> > Yup, that's what we need, but I don't see why it needs to be exposed
> > outside the allocation code at all.
> > 
> 
> Probably not. At least some of it could be contained within reclaim
> itself to block when reclaim is not making progress as opposed to anything
> congestion related. That might still livelock if no progress can be made
> but that's not new, the OOM hammer should eventually kick in.
> 

There are two sides to the reclaim-related throttling

1. throttling because zero progress is being made
2. throttling because there are too many dirty pages or pages under
   writeback cycling through the LRU too quickly.

The dirty page aspects (and the removal of wait_iff_congested which is
almost completely broken) could be done with something like the following
(completly untested). The downside is that end_page_writeback() takes an
atomic penalty if reclaim is throttled but at that point the system is
struggling anyway so I doubt it matters.

--8<--
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index ac7f231b8825..9fb1f0ae273c 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -154,7 +154,6 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
 }
 
 long congestion_wait(int sync, long timeout);
-long wait_iff_congested(int sync, long timeout);
 
 static inline bool mapping_can_writeback(struct address_space *mapping)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6a1d79d84675..5a289ada48cb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -841,6 +841,9 @@ typedef struct pglist_data {
 	int node_id;
 	wait_queue_head_t kswapd_wait;
 	wait_queue_head_t pfmemalloc_wait;
+	wait_queue_head_t reclaim_wait;	/* wq for throttling reclaim */
+	atomic_t nr_reclaim_throttled;	/* nr of throtted tasks */
+	atomic_t nr_reclaim_written;	/* nr pages written since throttled */
 	struct task_struct *kswapd;	/* Protected by
 					   mem_hotplug_begin/end() */
 	int kswapd_order;
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 840d1ba84cf5..3bc759b81897 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -763,13 +763,6 @@ DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait,
 	TP_ARGS(usec_timeout, usec_delayed)
 );
 
-DEFINE_EVENT(writeback_congest_waited_template, writeback_wait_iff_congested,
-
-	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
-
-	TP_ARGS(usec_timeout, usec_delayed)
-);
-
 DECLARE_EVENT_CLASS(writeback_single_inode_template,
 
 	TP_PROTO(struct inode *inode,
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 4a9d4e27d0d9..0ea1a105eae5 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -1041,51 +1041,3 @@ long congestion_wait(int sync, long timeout)
 	return ret;
 }
 EXPORT_SYMBOL(congestion_wait);
-
-/**
- * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a pgdat to complete writes
- * @sync: SYNC or ASYNC IO
- * @timeout: timeout in jiffies
- *
- * In the event of a congested backing_dev (any backing_dev) this waits
- * for up to @timeout jiffies for either a BDI to exit congestion of the
- * given @sync queue or a write to complete.
- *
- * The return value is 0 if the sleep is for the full timeout. Otherwise,
- * it is the number of jiffies that were still remaining when the function
- * returned. return_value == timeout implies the function did not sleep.
- */
-long wait_iff_congested(int sync, long timeout)
-{
-	long ret;
-	unsigned long start = jiffies;
-	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
-
-	/*
-	 * If there is no congestion, yield if necessary instead
-	 * of sleeping on the congestion queue
-	 */
-	if (atomic_read(&nr_wb_congested[sync]) == 0) {
-		cond_resched();
-
-		/* In case we scheduled, work out time remaining */
-		ret = timeout - (jiffies - start);
-		if (ret < 0)
-			ret = 0;
-
-		goto out;
-	}
-
-	/* Sleep until uncongested or a write happens */
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	ret = io_schedule_timeout(timeout);
-	finish_wait(wqh, &wait);
-
-out:
-	trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
-					jiffies_to_usecs(jiffies - start));
-
-	return ret;
-}
-EXPORT_SYMBOL(wait_iff_congested);
diff --git a/mm/filemap.c b/mm/filemap.c
index dae481293b5d..b9be9afa4308 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1606,6 +1606,8 @@ void end_page_writeback(struct page *page)
 	smp_mb__after_atomic();
 	wake_up_page(page, PG_writeback);
 	put_page(page);
+
+	acct_reclaim_writeback(page);
 }
 EXPORT_SYMBOL(end_page_writeback);
 
diff --git a/mm/internal.h b/mm/internal.h
index cf3cb933eba3..47e77009e0d5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -34,6 +34,13 @@
 
 void page_writeback_init(void);
 
+void __acct_reclaim_writeback(struct page *page);
+static inline void acct_reclaim_writeback(struct page *page)
+{
+	if (atomic_read(&page_pgdat(page)->nr_reclaim_throttled))
+		__acct_reclaim_writeback(page);
+}
+
 vm_fault_t do_swap_page(struct vm_fault *vmf);
 
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b37435c274cf..d849ddfc1e51 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7396,6 +7396,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
+	init_waitqueue_head(&pgdat->reclaim_wait);
 
 	pgdat_page_ext_init(pgdat);
 	lruvec_init(&pgdat->__lruvec);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 74296c2d1fed..b209564766b0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1006,6 +1006,40 @@ static void handle_write_error(struct address_space *mapping,
 	unlock_page(page);
 }
 
+static void
+reclaim_writeback_throttle(pg_data_t *pgdat, long timeout)
+{
+	wait_queue_head_t *wqh = &pgdat->reclaim_wait;
+	long ret;
+	DEFINE_WAIT(wait);
+
+	atomic_inc(&pgdat->nr_reclaim_throttled);
+
+	prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE);
+	ret = schedule_timeout(timeout);
+	finish_wait(&pgdat->reclaim_wait, &wait);
+
+	if (atomic_dec_and_test(&pgdat->nr_reclaim_throttled))
+		atomic_set(&pgdat->nr_reclaim_written, 0);
+
+	/* TODO: Add tracepoint to track time sleeping */
+}
+
+/*
+ * Account for pages written if tasks are throttled waiting on dirty
+ * pages to clean. If enough pages have been cleaned since throttling
+ * started then wakeup the throttled tasks.
+ */
+void __acct_reclaim_writeback(struct page *page)
+{
+	pg_data_t *pgdat = page_pgdat(page);
+	int nr_written = atomic_inc_return(&pgdat->nr_reclaim_written);
+	int nr_throttled = atomic_read(&pgdat->nr_reclaim_throttled);
+
+	if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
+		wake_up_interruptible(&pgdat->reclaim_wait);
+}
+
 /* possible outcome of pageout() */
 typedef enum {
 	/* failed to write page out, page is locked */
@@ -1412,9 +1446,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 
 		/*
 		 * The number of dirty pages determines if a node is marked
-		 * reclaim_congested which affects wait_iff_congested. kswapd
-		 * will stall and start writing pages if the tail of the LRU
-		 * is all dirty unqueued pages.
+		 * reclaim_congested. kswapd will stall and start writing
+		 * pages if the tail of the LRU is all dirty unqueued pages.
 		 */
 		page_check_dirty_writeback(page, &dirty, &writeback);
 		if (dirty || writeback)
@@ -3180,19 +3213,20 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		 * If kswapd scans pages marked for immediate
 		 * reclaim and under writeback (nr_immediate), it
 		 * implies that pages are cycling through the LRU
-		 * faster than they are written so also forcibly stall.
+		 * faster than they are written so forcibly stall
+		 * until some pages complete writeback.
 		 */
 		if (sc->nr.immediate)
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
+			reclaim_writeback_throttle(pgdat, HZ/10);
 	}
 
 	/*
 	 * Tag a node/memcg as congested if all the dirty pages
 	 * scanned were backed by a congested BDI and
-	 * wait_iff_congested will stall.
+	 * non-kswapd tasks will stall on reclaim_writeback_throttle.
 	 *
 	 * Legacy memcg will stall in page writeback so avoid forcibly
-	 * stalling in wait_iff_congested().
+	 * stalling in reclaim_writeback_throttle().
 	 */
 	if ((current_is_kswapd() ||
 	     (cgroup_reclaim(sc) && writeback_throttling_sane(sc))) &&
@@ -3208,7 +3242,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	if (!current_is_kswapd() && current_may_throttle() &&
 	    !sc->hibernation_mode &&
 	    test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
-		wait_iff_congested(BLK_RW_ASYNC, HZ/10);
+		reclaim_writeback_throttle(pgdat, HZ/10);
 
 	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
 				    sc))
@@ -4286,6 +4320,8 @@ static int kswapd(void *p)
 
 	WRITE_ONCE(pgdat->kswapd_order, 0);
 	WRITE_ONCE(pgdat->kswapd_highest_zoneidx, MAX_NR_ZONES);
+	atomic_set(&pgdat->nr_reclaim_throttled, 0);
+	atomic_set(&pgdat->nr_reclaim_written, 0);
 	for ( ; ; ) {
 		bool ret;
 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops.
  2021-09-15  5:25     ` NeilBrown
@ 2021-09-15 17:02       ` Theodore Ts'o
  0 siblings, 0 replies; 34+ messages in thread
From: Theodore Ts'o @ 2021-09-15 17:02 UTC (permalink / raw)
  To: NeilBrown
  Cc: Andrew Morton, Andreas Dilger, Darrick J. Wong, Matthew Wilcox,
	Mel Gorman, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs,
	linux-mm, linux-kernel

On Wed, Sep 15, 2021 at 03:25:40PM +1000, NeilBrown wrote:
> Adding gfp_mask to __ext4_journal_start_sb() make perfect sense.
> There doesn't seem much point adding one to __ext4_journal_start(),
> we can have ext4_journal_start_with_revoke() call
> __ext4_journal_start_sb() directly.
> But I cannot see what it doesn't already do that.
> i.e. why have the inline __ext4_journal_start() at all?
> Is it OK if I don't use that for ext4_journal_start_with_revoke()?

Sure.  I think the only reason why we have __ext4_journal_start() as
an inline function at all was for historical reasons.  That is, we
modified __ext4_journal_start() so that it took a struct super, and
instead of changing all of the macros which called
__ext4_journal_start(), we named it to be __ext4_journal_start_sb()
and added the inline definition of __ext4_journal_start() to avoid
changing all of the existing users of __ext4_journal_start().

So sure, it's fine not to use that for
ext4_journal_start_with_revoke(), and we probably should clean up the
use of __ext4_journal_start() at some point.  That's unrelated to your
work, though.

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops.
  2021-09-15 12:06       ` Michal Hocko
@ 2021-09-15 22:35         ` NeilBrown
  2021-09-16  0:37           ` Dave Chinner
  2021-09-16  6:52           ` Michal Hocko
  0 siblings, 2 replies; 34+ messages in thread
From: NeilBrown @ 2021-09-15 22:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Jan Kara, Matthew Wilcox, linux-xfs, linux-ext4,
	linux-fsdevel, linux-nfs, linux-mm, linux-kernel

On Wed, 15 Sep 2021, Michal Hocko wrote:
> On Wed 15-09-21 07:48:11, Neil Brown wrote:
> > 
> > Why does __GFP_NOFAIL access the reserves? Why not require that the
> > relevant "Try harder" flag (__GFP_ATOMIC or __GFP_MEMALLOC) be included
> > with __GFP_NOFAIL if that is justified?
> 
> Does 5020e285856c ("mm, oom: give __GFP_NOFAIL allocations access to
> memory reserves") help?

Yes, that helps.  A bit.

I'm not fond of the clause "the allocation request might have come with some
locks held".  What if it doesn't?  Does it still have to pay the price.

Should we not require that the caller indicate if any locks are held?
That way callers which don't hold locks can use __GFP_NOFAIL without
worrying about imposing on other code.

Or is it so rare that __GFP_NOFAIL would be used without holding a lock
that it doesn't matter?

The other commit of interest is

Commit: 6c18ba7a1899 ("mm: help __GFP_NOFAIL allocations which do not trigger OOM killer")

I don't find the reasoning convincing.  It is a bit like "Robbing Peter
to pay Paul".  It takes from the reserves to allow a __GFP_NOFAIL to
proceed, with out any reason to think this particular allocation has any
more 'right' to the reserves than anything else.

While I don't like the reasoning in either of these, they do make it
clear (to me) that the use of reserves is entirely an internal policy
decision.  They should *not* be seen as part of the API and callers
should not have to be concerned about it when deciding whether to use
__GFP_NOFAIL or not.

The use of these reserves is, at most, a hypothetical problem.  If it
ever looks like becoming a real practical problem, it needs to be fixed
internally to the page allocator.  Maybe an extra water-mark which isn't
quite as permissive as ALLOC_HIGH...

I'm inclined to drop all references to reserves from the documentation
for __GFP_NOFAIL.  I think there are enough users already that adding a
couple more isn't going to make problems substantially more likely.  And
more will be added anyway that the mm/ team won't have the opportunity
or bandwidth to review.

Meanwhile I'll see if I can understand the intricacies of alloc_page so
that I can contibute to making it more predictable.

Question: In those cases where an open-coded loop is appropriate, such
as when you want to handle signals or can drop locks, how bad would it
be to have a tight loop without any sleep?
should_reclaim_retry() will sleep 100ms (sometimes...).  Is that enough?
__GFP_NOFAIL doesn't add any sleep when looping.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops.
  2021-09-15 14:35         ` Mel Gorman
@ 2021-09-15 22:38           ` Dave Chinner
  2021-09-16  9:00             ` Mel Gorman
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2021-09-15 22:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: NeilBrown, Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Jan Kara, Michal Hocko, Matthew Wilcox,
	linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm,
	linux-kernel

On Wed, Sep 15, 2021 at 03:35:10PM +0100, Mel Gorman wrote:
> On Wed, Sep 15, 2021 at 09:59:04AM +0100, Mel Gorman wrote:
> > > Yup, that's what we need, but I don't see why it needs to be exposed
> > > outside the allocation code at all.
> > > 
> > 
> > Probably not. At least some of it could be contained within reclaim
> > itself to block when reclaim is not making progress as opposed to anything
> > congestion related. That might still livelock if no progress can be made
> > but that's not new, the OOM hammer should eventually kick in.
> > 
> 
> There are two sides to the reclaim-related throttling
> 
> 1. throttling because zero progress is being made
> 2. throttling because there are too many dirty pages or pages under
>    writeback cycling through the LRU too quickly.
> 
> The dirty page aspects (and the removal of wait_iff_congested which is
> almost completely broken) could be done with something like the following
> (completly untested). The downside is that end_page_writeback() takes an
> atomic penalty if reclaim is throttled but at that point the system is
> struggling anyway so I doubt it matters.

The atomics are pretty nasty, as is directly accessing the pgdat on
every call to end_page_writeback(). Those will be performance
limiting factors. Indeed, we don't use atomics for dirty page
throttling, which does dirty page accounting via
percpu counters on the BDI and doesn't require wakeups.

Also, we've already got per-node and per-zone counters there for
dirty/write pending stats, so do we actually need new counters and
wakeups here?

i.e. balance_dirty_pages() does not have an explicit wakeup - it
bases it's sleep time on the (memcg aware) measured writeback rate
on the BDI the page belongs to and the amount of outstanding dirty
data on that BDI. i.e. it estimates fairly accurately what the wait
time for this task should be given the dirty page demand and current
writeback progress being made is and just sleeps for that length of
time.

Ideally, that's what should be happening here - we should be able to
calculate a page cleaning rate estimation and then base the sleep
time on that. No wakeups needed - when we've waited for the
estimated time, we try to reclaim again...

In fact, why can't this "too many dirty pages" case just use the
balance_dirty_pages() infrastructure to do the "wait for writeback"
reclaim backoff? Why do we even need to re-invent the wheel here?

> diff --git a/mm/filemap.c b/mm/filemap.c
> index dae481293b5d..b9be9afa4308 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1606,6 +1606,8 @@ void end_page_writeback(struct page *page)
>  	smp_mb__after_atomic();
>  	wake_up_page(page, PG_writeback);
>  	put_page(page);
> +
> +	acct_reclaim_writeback(page);

UAF - that would need to be before the put_page() call...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops.
  2021-09-15 22:35         ` NeilBrown
@ 2021-09-16  0:37           ` Dave Chinner
  2021-09-16  6:52           ` Michal Hocko
  1 sibling, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2021-09-16  0:37 UTC (permalink / raw)
  To: NeilBrown
  Cc: Michal Hocko, Mel Gorman, Andrew Morton, Theodore Ts'o,
	Andreas Dilger, Darrick J. Wong, Jan Kara, Matthew Wilcox,
	linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm,
	linux-kernel

On Thu, Sep 16, 2021 at 08:35:40AM +1000, NeilBrown wrote:
> On Wed, 15 Sep 2021, Michal Hocko wrote:
> > On Wed 15-09-21 07:48:11, Neil Brown wrote:
> > > 
> > > Why does __GFP_NOFAIL access the reserves? Why not require that the
> > > relevant "Try harder" flag (__GFP_ATOMIC or __GFP_MEMALLOC) be included
> > > with __GFP_NOFAIL if that is justified?
> > 
> > Does 5020e285856c ("mm, oom: give __GFP_NOFAIL allocations access to
> > memory reserves") help?
> 
> Yes, that helps.  A bit.
> 
> I'm not fond of the clause "the allocation request might have come with some
> locks held".  What if it doesn't?  Does it still have to pay the price.
> 
> Should we not require that the caller indicate if any locks are held?
> That way callers which don't hold locks can use __GFP_NOFAIL without
> worrying about imposing on other code.
> 
> Or is it so rare that __GFP_NOFAIL would be used without holding a lock
> that it doesn't matter?
> 
> The other commit of interest is
> 
> Commit: 6c18ba7a1899 ("mm: help __GFP_NOFAIL allocations which do not trigger OOM killer")
> 
> I don't find the reasoning convincing.  It is a bit like "Robbing Peter
> to pay Paul".  It takes from the reserves to allow a __GFP_NOFAIL to
> proceed, with out any reason to think this particular allocation has any
> more 'right' to the reserves than anything else.
> 
> While I don't like the reasoning in either of these, they do make it
> clear (to me) that the use of reserves is entirely an internal policy
> decision.  They should *not* be seen as part of the API and callers
> should not have to be concerned about it when deciding whether to use
> __GFP_NOFAIL or not.

Agree totally with this - we just want to block until allocation
succeeds, and if the -filesystem- deadlocks because allocation never
succeeds then that's a problem that needs to be solved in the
filesystem with a different memory allocation strategy...

OTOH, setting up a single __GFP_NOFAIL call site with the ability to
take the entire system down seems somewhat misguided.

> The use of these reserves is, at most, a hypothetical problem.  If it
> ever looks like becoming a real practical problem, it needs to be fixed
> internally to the page allocator.  Maybe an extra water-mark which isn't
> quite as permissive as ALLOC_HIGH...
> 
> I'm inclined to drop all references to reserves from the documentation
> for __GFP_NOFAIL.  I think there are enough users already that adding a
> couple more isn't going to make problems substantially more likely.  And
> more will be added anyway that the mm/ team won't have the opportunity
> or bandwidth to review.

Yup, we've been replacing open coded loops like in kmem_alloc() with
explicit __GFP_NOFAIL usage for a while now:

$ ▶ git grep __GFP_NOFAIL fs/xfs |wc -l
33
$

ANd we've got another 100 or so call sites planned for conversion to
__GFP_NOFAIL. Hence the suggestion to remove the use of
reserves from __GFP_NOFAIL seems like a sensible plan because it has
never been necessary in the past for all the allocation sites we are
converting from open coded loops to __GFP_NOFAIL...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops.
  2021-09-15 22:35         ` NeilBrown
  2021-09-16  0:37           ` Dave Chinner
@ 2021-09-16  6:52           ` Michal Hocko
  1 sibling, 0 replies; 34+ messages in thread
From: Michal Hocko @ 2021-09-16  6:52 UTC (permalink / raw)
  To: NeilBrown
  Cc: Mel Gorman, Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Jan Kara, Matthew Wilcox, linux-xfs, linux-ext4,
	linux-fsdevel, linux-nfs, linux-mm, linux-kernel

On Thu 16-09-21 08:35:40, Neil Brown wrote:
> On Wed, 15 Sep 2021, Michal Hocko wrote:
> > On Wed 15-09-21 07:48:11, Neil Brown wrote:
> > > 
> > > Why does __GFP_NOFAIL access the reserves? Why not require that the
> > > relevant "Try harder" flag (__GFP_ATOMIC or __GFP_MEMALLOC) be included
> > > with __GFP_NOFAIL if that is justified?
> > 
> > Does 5020e285856c ("mm, oom: give __GFP_NOFAIL allocations access to
> > memory reserves") help?
> 
> Yes, that helps.  A bit.
> 
> I'm not fond of the clause "the allocation request might have come with some
> locks held".  What if it doesn't?  Does it still have to pay the price.
> 
> Should we not require that the caller indicate if any locks are held?

I do not think this would help much TBH. What if the lock in question
doesn't impose any dependency through allocation problem?

> That way callers which don't hold locks can use __GFP_NOFAIL without
> worrying about imposing on other code.
> 
> Or is it so rare that __GFP_NOFAIL would be used without holding a lock
> that it doesn't matter?
> 
> The other commit of interest is
> 
> Commit: 6c18ba7a1899 ("mm: help __GFP_NOFAIL allocations which do not trigger OOM killer")
> 
> I don't find the reasoning convincing.  It is a bit like "Robbing Peter
> to pay Paul".  It takes from the reserves to allow a __GFP_NOFAIL to
> proceed, with out any reason to think this particular allocation has any
> more 'right' to the reserves than anything else.

I do agree that this is not really optimal. I do not remember exact
details but these changes were mostly based or inspired by extreme
memory pressure testing by Tetsuo who has managed to trigger quite some
corner cases. Especially those where NOFS was involved were problematic.

> While I don't like the reasoning in either of these, they do make it
> clear (to me) that the use of reserves is entirely an internal policy
> decision.  They should *not* be seen as part of the API and callers
> should not have to be concerned about it when deciding whether to use
> __GFP_NOFAIL or not.

Yes. NOFAIL should have high enough bar to use - essentially there is no
other way than use it - that memory reserves shouldn't be a road block.
If we learn that existing users can seriously deplete memory reserves
then we might need to reconsider the existing logic. So far there are no
indications that NOFAIL would really cause any problems in that area.

> The use of these reserves is, at most, a hypothetical problem.  If it
> ever looks like becoming a real practical problem, it needs to be fixed
> internally to the page allocator.  Maybe an extra water-mark which isn't
> quite as permissive as ALLOC_HIGH...
> 
> I'm inclined to drop all references to reserves from the documentation
> for __GFP_NOFAIL.

I have found your additions to the documentation useful.

> I think there are enough users already that adding a
> couple more isn't going to make problems substantially more likely.  And
> more will be added anyway that the mm/ team won't have the opportunity
> or bandwidth to review.
> 
> Meanwhile I'll see if I can understand the intricacies of alloc_page so
> that I can contibute to making it more predictable.
> 
> Question: In those cases where an open-coded loop is appropriate, such
> as when you want to handle signals or can drop locks, how bad would it
> be to have a tight loop without any sleep?
>
> should_reclaim_retry() will sleep 100ms (sometimes...).  Is that enough?
> __GFP_NOFAIL doesn't add any sleep when looping.

Yeah, NOFAIL doesn't add any explicit sleep points. In general there is
no guarantee that a sleepable allocation will sleep. We do cond_resched
in general but sleeping is enforced only for worker contexts because WQ
concurrency depends on an explicit sleeping. So to answer your question,
if you really need to sleep between retries then you should do it
manually but cond_resched can be implied.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops.
  2021-09-15 22:38           ` Dave Chinner
@ 2021-09-16  9:00             ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2021-09-16  9:00 UTC (permalink / raw)
  To: Dave Chinner
  Cc: NeilBrown, Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Jan Kara, Michal Hocko, Matthew Wilcox,
	linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm,
	linux-kernel

On Thu, Sep 16, 2021 at 08:38:58AM +1000, Dave Chinner wrote:
> On Wed, Sep 15, 2021 at 03:35:10PM +0100, Mel Gorman wrote:
> > On Wed, Sep 15, 2021 at 09:59:04AM +0100, Mel Gorman wrote:
> > > > Yup, that's what we need, but I don't see why it needs to be exposed
> > > > outside the allocation code at all.
> > > > 
> > > 
> > > Probably not. At least some of it could be contained within reclaim
> > > itself to block when reclaim is not making progress as opposed to anything
> > > congestion related. That might still livelock if no progress can be made
> > > but that's not new, the OOM hammer should eventually kick in.
> > > 
> > 
> > There are two sides to the reclaim-related throttling
> > 
> > 1. throttling because zero progress is being made
> > 2. throttling because there are too many dirty pages or pages under
> >    writeback cycling through the LRU too quickly.
> > 
> > The dirty page aspects (and the removal of wait_iff_congested which is
> > almost completely broken) could be done with something like the following
> > (completly untested). The downside is that end_page_writeback() takes an
> > atomic penalty if reclaim is throttled but at that point the system is
> > struggling anyway so I doubt it matters.
> 
> The atomics are pretty nasty, as is directly accessing the pgdat on
> every call to end_page_writeback(). Those will be performance
> limiting factors. Indeed, we don't use atomics for dirty page
> throttling, which does dirty page accounting via
> percpu counters on the BDI and doesn't require wakeups.
> 

Thanks for taking a look!

From end_page_writeback, the first atomic operation is an atomic read
which is READ_ONCE on most architectures (alpha is a counter example as it
has a memory barrier but alpha is niche). The main atomic penalty is when
the system is reclaim throttled but it can be a per-cpu node page state
counter instead. That sacrifices accuracy for speed but in this context,
I think that's ok. As for accessing the pgdat structure, every vmstat
counter for the node involves a pgdat lookup as the API is page-based
and so there are already a bunch of pgdat lookups in the IO path.

> Also, we've already got per-node and per-zone counters there for
> dirty/write pending stats, so do we actually need new counters and
> wakeups here?
> 

I think we need at least a new counter because dirty/write pending
stats do not tell us how many pages were cleaned since reclaim started
hitting problems with dirty pages at the tail of the LRU. Reading
dirty/write_pending stats at two points of time cannot be used to
infer how many pages were cleaned during the same interval. At minimum,
we'd need nr_dirtied and a new nr_cleaned stat to infer pages cleaned
between two points in time. That can be done but if the new counters is
NR_THROTTLED_WRITTEN (NR_WRITTEN while reclaim throttled), we only need one
field in struct zone to track nr_reclaim_throttled when throttling
startsi (updated patch at the end of the mail).

> i.e. balance_dirty_pages() does not have an explicit wakeup - it
> bases it's sleep time on the (memcg aware) measured writeback rate
> on the BDI the page belongs to and the amount of outstanding dirty
> data on that BDI. i.e. it estimates fairly accurately what the wait
> time for this task should be given the dirty page demand and current
> writeback progress being made is and just sleeps for that length of
> time.
> 
> Ideally, that's what should be happening here - we should be able to
> calculate a page cleaning rate estimation and then base the sleep
> time on that. No wakeups needed - when we've waited for the
> estimated time, we try to reclaim again...
> 
> In fact, why can't this "too many dirty pages" case just use the
> balance_dirty_pages() infrastructure to do the "wait for writeback"
> reclaim backoff? Why do we even need to re-invent the wheel here?
> 

Conceptually I can see what you are asking for but am finding it hard to
translate it into an implementation. Dirty page throttling is throttling
heavy writers on a task and bdi basis but does not care about the
positioning of pages on the LRU or what node the page is allocated from.
On the reclaim side, the concern is how many pages that are dirty or
writeback at the tail of the LRU regardless of what task dirtied that
page or BDI it belongs to.

Hence I'm failing to see how the same rate-limiting mechanism could be
used on the reclaim side.

I guess we could look at the reclaim efficiency for a given task by
tracking pages that could not be reclaimed due to dirty/writeback relative
to pages that could be reclaimed and sleeping for increasing lengths of
time unconditionally when the reclaim efficiency is low.  However it's
complex and would be hard to debug. It could hit serious problems in
cases where there are both fast and slow bdi's with the pages backed by a
slow bdi dominating the tail of the LRU -- it could throttle excessively
prematurely. Alternatively, look at taking pages that are dirty/writeback
off the inactive list like what is done for LRU_UNEVICTABLE pages
and throttling based on a high rate of INACTIVE_FILE:LRU_UNEVICTABLE,
but again, it's complex and could incur additional penalties in the
end_page_writeback due to LRU manipulations. Both are essentially
re-inventing a very complex wheel.

I'm aware that what I'm proposing also has its problems. It could wake
prematurely because all the pages cleaned were backed by a fast bdi when
the pages it scanned were backed by a slow bdi. Prehaps this could
be dealt with by tracking the estimated writeback speed of pages cleaned
and comparing it against the estimated writeback speed of pages at the
tail of the LRU but again, the complexity may be excessive.

If the first solution is too complex, it'll get hit with the KISS hammer
with a request to justify the complexity when the basis for comparison is
a broken concept. So I want to start simple, all it has to be is better
than congestion_wait/wait_iff_congested. If that still is not good enough,
the more complex options will have a basis for comparison.

> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index dae481293b5d..b9be9afa4308 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -1606,6 +1606,8 @@ void end_page_writeback(struct page *page)
> >  	smp_mb__after_atomic();
> >  	wake_up_page(page, PG_writeback);
> >  	put_page(page);
> > +
> > +	acct_reclaim_writeback(page);
> 
> UAF - that would need to be before the put_page() call...
> 

UAF indeed.

Here is another version of the same concept that avoids atomic updates
from end_page_writeback() context and limits pgdat lookups. It's still
not tested other than "it boots under kvm".

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index ac7f231b8825..9fb1f0ae273c 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -154,7 +154,6 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
 }
 
 long congestion_wait(int sync, long timeout);
-long wait_iff_congested(int sync, long timeout);
 
 static inline bool mapping_can_writeback(struct address_space *mapping)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6a1d79d84675..12a011912c3c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -199,6 +199,7 @@ enum node_stat_item {
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
+	NR_THROTTLED_WRITTEN,	/* NR_WRITTEN while reclaim throttled */
 	NR_KERNEL_MISC_RECLAIMABLE,	/* reclaimable non-slab kernel pages */
 	NR_FOLL_PIN_ACQUIRED,	/* via: pin_user_page(), gup flag: FOLL_PIN */
 	NR_FOLL_PIN_RELEASED,	/* pages returned via unpin_user_page() */
@@ -841,6 +842,10 @@ typedef struct pglist_data {
 	int node_id;
 	wait_queue_head_t kswapd_wait;
 	wait_queue_head_t pfmemalloc_wait;
+	wait_queue_head_t reclaim_wait;	/* wq for throttling reclaim */
+	atomic_t nr_reclaim_throttled;	/* nr of throtted tasks */
+	unsigned long nr_reclaim_start;	/* nr pages written while throttled
+					 * when throttling started. */
 	struct task_struct *kswapd;	/* Protected by
 					   mem_hotplug_begin/end() */
 	int kswapd_order;
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 840d1ba84cf5..3bc759b81897 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -763,13 +763,6 @@ DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait,
 	TP_ARGS(usec_timeout, usec_delayed)
 );
 
-DEFINE_EVENT(writeback_congest_waited_template, writeback_wait_iff_congested,
-
-	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
-
-	TP_ARGS(usec_timeout, usec_delayed)
-);
-
 DECLARE_EVENT_CLASS(writeback_single_inode_template,
 
 	TP_PROTO(struct inode *inode,
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 4a9d4e27d0d9..0ea1a105eae5 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -1041,51 +1041,3 @@ long congestion_wait(int sync, long timeout)
 	return ret;
 }
 EXPORT_SYMBOL(congestion_wait);
-
-/**
- * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a pgdat to complete writes
- * @sync: SYNC or ASYNC IO
- * @timeout: timeout in jiffies
- *
- * In the event of a congested backing_dev (any backing_dev) this waits
- * for up to @timeout jiffies for either a BDI to exit congestion of the
- * given @sync queue or a write to complete.
- *
- * The return value is 0 if the sleep is for the full timeout. Otherwise,
- * it is the number of jiffies that were still remaining when the function
- * returned. return_value == timeout implies the function did not sleep.
- */
-long wait_iff_congested(int sync, long timeout)
-{
-	long ret;
-	unsigned long start = jiffies;
-	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
-
-	/*
-	 * If there is no congestion, yield if necessary instead
-	 * of sleeping on the congestion queue
-	 */
-	if (atomic_read(&nr_wb_congested[sync]) == 0) {
-		cond_resched();
-
-		/* In case we scheduled, work out time remaining */
-		ret = timeout - (jiffies - start);
-		if (ret < 0)
-			ret = 0;
-
-		goto out;
-	}
-
-	/* Sleep until uncongested or a write happens */
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	ret = io_schedule_timeout(timeout);
-	finish_wait(wqh, &wait);
-
-out:
-	trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
-					jiffies_to_usecs(jiffies - start));
-
-	return ret;
-}
-EXPORT_SYMBOL(wait_iff_congested);
diff --git a/mm/filemap.c b/mm/filemap.c
index dae481293b5d..59187787fbfc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1605,6 +1605,7 @@ void end_page_writeback(struct page *page)
 
 	smp_mb__after_atomic();
 	wake_up_page(page, PG_writeback);
+	acct_reclaim_writeback(page);
 	put_page(page);
 }
 EXPORT_SYMBOL(end_page_writeback);
diff --git a/mm/internal.h b/mm/internal.h
index cf3cb933eba3..cd8b892537a0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -34,6 +34,14 @@
 
 void page_writeback_init(void);
 
+void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page);
+static inline void acct_reclaim_writeback(struct page *page)
+{
+	pg_data_t *pgdat = page_pgdat(page);
+	if (atomic_read(&pgdat->nr_reclaim_throttled))
+		__acct_reclaim_writeback(pgdat, page);
+}
+
 vm_fault_t do_swap_page(struct vm_fault *vmf);
 
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b37435c274cf..d849ddfc1e51 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7396,6 +7396,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
+	init_waitqueue_head(&pgdat->reclaim_wait);
 
 	pgdat_page_ext_init(pgdat);
 	lruvec_init(&pgdat->__lruvec);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 74296c2d1fed..f7908ed079f7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1006,6 +1006,43 @@ static void handle_write_error(struct address_space *mapping,
 	unlock_page(page);
 }
 
+static void
+reclaim_writeback_throttle(pg_data_t *pgdat, long timeout)
+{
+	wait_queue_head_t *wqh = &pgdat->reclaim_wait;
+	long ret;
+	DEFINE_WAIT(wait);
+
+	atomic_inc(&pgdat->nr_reclaim_throttled);
+	WRITE_ONCE(pgdat->nr_reclaim_start,
+		 node_page_state(pgdat, NR_THROTTLED_WRITTEN));
+
+	prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE);
+	ret = schedule_timeout(timeout);
+	finish_wait(&pgdat->reclaim_wait, &wait);
+	atomic_dec(&pgdat->nr_reclaim_throttled);
+
+	/* TODO: Add tracepoint to track time sleeping */
+}
+
+/*
+ * Account for pages written if tasks are throttled waiting on dirty
+ * pages to clean. If enough pages have been cleaned since throttling
+ * started then wakeup the throttled tasks.
+ */
+void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page)
+{
+	unsigned long nr_written;
+	int nr_throttled = atomic_read(&pgdat->nr_reclaim_throttled);
+
+	__inc_node_page_state(page, NR_THROTTLED_WRITTEN);
+	nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
+		READ_ONCE(pgdat->nr_reclaim_start);
+
+	if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
+		wake_up_interruptible(&pgdat->reclaim_wait);
+}
+
 /* possible outcome of pageout() */
 typedef enum {
 	/* failed to write page out, page is locked */
@@ -1412,9 +1449,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 
 		/*
 		 * The number of dirty pages determines if a node is marked
-		 * reclaim_congested which affects wait_iff_congested. kswapd
-		 * will stall and start writing pages if the tail of the LRU
-		 * is all dirty unqueued pages.
+		 * reclaim_congested. kswapd will stall and start writing
+		 * pages if the tail of the LRU is all dirty unqueued pages.
 		 */
 		page_check_dirty_writeback(page, &dirty, &writeback);
 		if (dirty || writeback)
@@ -3180,19 +3216,20 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		 * If kswapd scans pages marked for immediate
 		 * reclaim and under writeback (nr_immediate), it
 		 * implies that pages are cycling through the LRU
-		 * faster than they are written so also forcibly stall.
+		 * faster than they are written so forcibly stall
+		 * until some pages complete writeback.
 		 */
 		if (sc->nr.immediate)
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
+			reclaim_writeback_throttle(pgdat, HZ/10);
 	}
 
 	/*
 	 * Tag a node/memcg as congested if all the dirty pages
 	 * scanned were backed by a congested BDI and
-	 * wait_iff_congested will stall.
+	 * non-kswapd tasks will stall on reclaim_writeback_throttle.
 	 *
 	 * Legacy memcg will stall in page writeback so avoid forcibly
-	 * stalling in wait_iff_congested().
+	 * stalling in reclaim_writeback_throttle().
 	 */
 	if ((current_is_kswapd() ||
 	     (cgroup_reclaim(sc) && writeback_throttling_sane(sc))) &&
@@ -3208,7 +3245,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	if (!current_is_kswapd() && current_may_throttle() &&
 	    !sc->hibernation_mode &&
 	    test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
-		wait_iff_congested(BLK_RW_ASYNC, HZ/10);
+		reclaim_writeback_throttle(pgdat, HZ/10);
 
 	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
 				    sc))
@@ -4286,6 +4323,7 @@ static int kswapd(void *p)
 
 	WRITE_ONCE(pgdat->kswapd_order, 0);
 	WRITE_ONCE(pgdat->kswapd_highest_zoneidx, MAX_NR_ZONES);
+	atomic_set(&pgdat->nr_reclaim_throttled, 0);
 	for ( ; ; ) {
 		bool ret;
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8ce2620344b2..9b2bc9d61d4b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1225,6 +1225,7 @@ const char * const vmstat_text[] = {
 	"nr_vmscan_immediate_reclaim",
 	"nr_dirtied",
 	"nr_written",
+	"nr_throttled_written",
 	"nr_kernel_misc_reclaimable",
 	"nr_foll_pin_acquired",
 	"nr_foll_pin_released",

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/6] MM: annotate congestion_wait() and wait_iff_congested() as ineffective.
  2021-09-15 11:56   ` Michal Hocko
@ 2021-09-16 22:13     ` NeilBrown
  0 siblings, 0 replies; 34+ messages in thread
From: NeilBrown @ 2021-09-16 22:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman, linux-xfs,
	linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel

On Wed, 15 Sep 2021, Michal Hocko wrote:
> On Tue 14-09-21 10:13:04, Neil Brown wrote:
> > Only 4 subsystems call set_bdi_congested() or clear_bdi_congested():
> >  block/pktcdvd, fs/ceph fs/fuse fs/nfs
> > 
> > It may make sense to use congestion_wait() or wait_iff_congested()
> > within these subsystems, but they have no value outside of these.
> > 
> > Add documentation comments to these functions to discourage further use.
> 
> This is an unfortunate state. The MM layer still relies on the API.
> While adding a documentation to clarify the current status can stop more
> usage I am wondering what is a real alternative. My experience tells me
> that a lack of real alternative will lead to new creative ways of doing
> things instead.

That is a valid concern.  Discouraging the use of an interface without
providing a clear alternative risks people doing worse things.

At lease if people continue to use congestion_wait(), then we will be
able to find those uses when we are able to provide a better approach.

I'll drop this patch.

Thanks,
NeilBrown


>  
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  include/linux/backing-dev.h |    7 +++++++
> >  mm/backing-dev.c            |    9 +++++++++
> >  2 files changed, 16 insertions(+)
> > 
> > diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> > index ac7f231b8825..cc9513840351 100644
> > --- a/include/linux/backing-dev.h
> > +++ b/include/linux/backing-dev.h
> > @@ -153,6 +153,13 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
> >  	return wb->congested & cong_bits;
> >  }
> >  
> > +/* NOTE congestion_wait() and wait_iff_congested() are
> > + * largely useless except as documentation.
> > + * congestion_wait() will (almost) always wait for the given timeout.
> > + * wait_iff_congested() will (almost) never wait, but will call
> > + * cond_resched().
> > + * Were possible an alternative waiting strategy should be found.
> > + */
> >  long congestion_wait(int sync, long timeout);
> >  long wait_iff_congested(int sync, long timeout);
> >  
> > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > index 4a9d4e27d0d9..53472ab38796 100644
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -1023,6 +1023,11 @@ EXPORT_SYMBOL(set_bdi_congested);
> >   * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> >   * write congestion.  If no backing_devs are congested then just wait for the
> >   * next write to be completed.
> > + *
> > + * NOTE: in the current implementation, hardly any backing_devs are ever
> > + * marked as congested, and write-completion is rarely reported (see calls
> > + * to clear_bdi_congested).  So this should not be assumed to ever wake before
> > + * the timeout.
> >   */
> >  long congestion_wait(int sync, long timeout)
> >  {
> > @@ -1054,6 +1059,10 @@ EXPORT_SYMBOL(congestion_wait);
> >   * The return value is 0 if the sleep is for the full timeout. Otherwise,
> >   * it is the number of jiffies that were still remaining when the function
> >   * returned. return_value == timeout implies the function did not sleep.
> > + *
> > + * NOTE: in the current implementation, hardly any backing_devs are ever
> > + * marked as congested, and write-completion is rarely reported (see calls
> > + * to clear_bdi_congested).  So this should not be assumed to sleep at all.
> >   */
> >  long wait_iff_congested(int sync, long timeout)
> >  {
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs
> 
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops.
  2021-09-17  2:56 [PATCH 0/6 v2] congestion_wait() and GFP_NOFAIL NeilBrown
@ 2021-09-17  2:56 ` NeilBrown
  0 siblings, 0 replies; 34+ messages in thread
From: NeilBrown @ 2021-09-17  2:56 UTC (permalink / raw)
  To: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman, Michal Hocko,
	. Dave Chinner, Jonathan Corbet
  Cc: linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm,
	linux-kernel, linux-doc

Indefinite loops waiting for memory allocation are discouraged by
documentation in gfp.h which says the use of __GFP_NOFAIL that it

 is definitely preferable to use the flag rather than opencode endless
 loop around allocator.

Such loops that use congestion_wait() are particularly unwise as
congestion_wait() is indistinguishable from
schedule_timeout_uninterruptible() in practice - and should be
deprecated.

So this patch changes the two loops in ext4_ext_truncate() to use
__GFP_NOFAIL instead of looping.

As the allocation is multiple layers deeper in the call stack, this
requires passing the EXT4_EX_NOFAIL flag down and handling it in various
places.

__ext4_journal_start_sb() is given a new 'gfp_mask' argument which is
passed through to jbd2__journal_start() and always receives GFP_NOFS
except when called through ext4_journal_state_with_revoke from
ext4_ext_remove_space(), which now can include __GFP_NOFAIL.

jbd2__journal_start() is enhanced so that the gfp_t flags passed are
used for *all* allocations.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/ext4/ext4.h           |    2 +-
 fs/ext4/ext4_jbd2.c      |    4 ++-
 fs/ext4/ext4_jbd2.h      |   14 +++++++-----
 fs/ext4/extents.c        |   53 +++++++++++++++++++---------------------------
 fs/ext4/extents_status.c |   35 +++++++++++++++++-------------
 fs/ext4/extents_status.h |    2 +-
 fs/ext4/ialloc.c         |    3 ++-
 fs/ext4/indirect.c       |    2 +-
 fs/ext4/inode.c          |    6 +++--
 fs/ext4/ioctl.c          |    4 ++-
 fs/ext4/super.c          |    2 +-
 fs/jbd2/transaction.c    |    8 +++----
 12 files changed, 67 insertions(+), 68 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 90ff5acaf11f..52a34f5dfda2 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3720,7 +3720,7 @@ extern int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 			       struct ext4_map_blocks *map, int flags);
 extern int ext4_ext_truncate(handle_t *, struct inode *);
 extern int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
-				 ext4_lblk_t end);
+				 ext4_lblk_t end, int nofail);
 extern void ext4_ext_init(struct super_block *);
 extern void ext4_ext_release(struct super_block *);
 extern long ext4_fallocate(struct file *file, int mode, loff_t offset,
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 6def7339056d..780fde556dc0 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -88,7 +88,7 @@ static int ext4_journal_check_start(struct super_block *sb)
 
 handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
 				  int type, int blocks, int rsv_blocks,
-				  int revoke_creds)
+				  int revoke_creds, gfp_t gfp_mask)
 {
 	journal_t *journal;
 	int err;
@@ -103,7 +103,7 @@ handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
 	if (!journal || (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))
 		return ext4_get_nojournal();
 	return jbd2__journal_start(journal, blocks, rsv_blocks, revoke_creds,
-				   GFP_NOFS, type, line);
+				   gfp_mask, type, line);
 }
 
 int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle)
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 0e4fa644df01..da79be8ffff4 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -263,7 +263,7 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
 
 handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
 				  int type, int blocks, int rsv_blocks,
-				  int revoke_creds);
+				  int revoke_creds, gfp_t gfp_mask);
 int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle);
 
 #define EXT4_NOJOURNAL_MAX_REF_COUNT ((unsigned long) 4096)
@@ -304,7 +304,8 @@ static inline int ext4_trans_default_revoke_credits(struct super_block *sb)
 
 #define ext4_journal_start_sb(sb, type, nblocks)			\
 	__ext4_journal_start_sb((sb), __LINE__, (type), (nblocks), 0,	\
-				ext4_trans_default_revoke_credits(sb))
+				ext4_trans_default_revoke_credits(sb),	\
+				GFP_NOFS)
 
 #define ext4_journal_start(inode, type, nblocks)			\
 	__ext4_journal_start((inode), __LINE__, (type), (nblocks), 0,	\
@@ -314,9 +315,9 @@ static inline int ext4_trans_default_revoke_credits(struct super_block *sb)
 	__ext4_journal_start((inode), __LINE__, (type), (blocks), (rsv_blocks),\
 			     ext4_trans_default_revoke_credits((inode)->i_sb))
 
-#define ext4_journal_start_with_revoke(inode, type, blocks, revoke_creds) \
-	__ext4_journal_start((inode), __LINE__, (type), (blocks), 0,	\
-			     (revoke_creds))
+#define ext4_journal_start_with_revoke(gfp_flags, inode, type, blocks, revoke_creds) \
+	__ext4_journal_start_sb((inode)->i_sb, __LINE__, (type), (blocks),\
+				0, (revoke_creds), (gfp_flags))
 
 static inline handle_t *__ext4_journal_start(struct inode *inode,
 					     unsigned int line, int type,
@@ -324,7 +325,8 @@ static inline handle_t *__ext4_journal_start(struct inode *inode,
 					     int revoke_creds)
 {
 	return __ext4_journal_start_sb(inode->i_sb, line, type, blocks,
-				       rsv_blocks, revoke_creds);
+				       rsv_blocks, revoke_creds,
+				       GFP_NOFS);
 }
 
 #define ext4_journal_stop(handle) \
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index c0de30f25185..c6c16aa8b618 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1488,7 +1488,7 @@ static int ext4_ext_search_left(struct inode *inode,
 static int ext4_ext_search_right(struct inode *inode,
 				 struct ext4_ext_path *path,
 				 ext4_lblk_t *logical, ext4_fsblk_t *phys,
-				 struct ext4_extent *ret_ex)
+				 struct ext4_extent *ret_ex, int nofail)
 {
 	struct buffer_head *bh = NULL;
 	struct ext4_extent_header *eh;
@@ -1565,7 +1565,7 @@ static int ext4_ext_search_right(struct inode *inode,
 	while (++depth < path->p_depth) {
 		/* subtract from p_depth to get proper eh_depth */
 		bh = read_extent_tree_block(inode, block,
-					    path->p_depth - depth, 0);
+					    path->p_depth - depth, nofail);
 		if (IS_ERR(bh))
 			return PTR_ERR(bh);
 		eh = ext_block_hdr(bh);
@@ -1574,7 +1574,7 @@ static int ext4_ext_search_right(struct inode *inode,
 		put_bh(bh);
 	}
 
-	bh = read_extent_tree_block(inode, block, path->p_depth - depth, 0);
+	bh = read_extent_tree_block(inode, block, path->p_depth - depth, nofail);
 	if (IS_ERR(bh))
 		return PTR_ERR(bh);
 	eh = ext_block_hdr(bh);
@@ -2773,7 +2773,7 @@ ext4_ext_more_to_rm(struct ext4_ext_path *path)
 }
 
 int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
-			  ext4_lblk_t end)
+			  ext4_lblk_t end, int nofail)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	int depth = ext_depth(inode);
@@ -2781,6 +2781,10 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 	struct partial_cluster partial;
 	handle_t *handle;
 	int i = 0, err = 0;
+	gfp_t gfp_mask = GFP_NOFS;
+
+	if (nofail)
+		gfp_mask |= __GFP_NOFAIL;
 
 	partial.pclu = 0;
 	partial.lblk = 0;
@@ -2789,7 +2793,8 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 	ext_debug(inode, "truncate since %u to %u\n", start, end);
 
 	/* probably first extent we're gonna free will be last in block */
-	handle = ext4_journal_start_with_revoke(inode, EXT4_HT_TRUNCATE,
+	handle = ext4_journal_start_with_revoke(gfp_mask,
+			inode, EXT4_HT_TRUNCATE,
 			depth + 1,
 			ext4_free_metadata_revoke_credits(inode->i_sb, depth));
 	if (IS_ERR(handle))
@@ -2877,7 +2882,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 			 */
 			lblk = ex_end + 1;
 			err = ext4_ext_search_right(inode, path, &lblk, &pblk,
-						    NULL);
+						    NULL, nofail);
 			if (err < 0)
 				goto out;
 			if (pblk) {
@@ -2899,10 +2904,6 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 	} else {
 		path = kcalloc(depth + 1, sizeof(struct ext4_ext_path),
 			       GFP_NOFS | __GFP_NOFAIL);
-		if (path == NULL) {
-			ext4_journal_stop(handle);
-			return -ENOMEM;
-		}
 		path[0].p_maxdepth = path[0].p_depth = depth;
 		path[0].p_hdr = ext_inode_hdr(inode);
 		i = 0;
@@ -2955,7 +2956,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 			memset(path + i + 1, 0, sizeof(*path));
 			bh = read_extent_tree_block(inode,
 				ext4_idx_pblock(path[i].p_idx), depth - i - 1,
-				EXT4_EX_NOCACHE);
+				EXT4_EX_NOCACHE | nofail);
 			if (IS_ERR(bh)) {
 				/* should we reset i_size? */
 				err = PTR_ERR(bh);
@@ -4186,7 +4187,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 	if (err)
 		goto out;
 	ar.lright = map->m_lblk;
-	err = ext4_ext_search_right(inode, path, &ar.lright, &ar.pright, &ex2);
+	err = ext4_ext_search_right(inode, path, &ar.lright, &ar.pright, &ex2, 0);
 	if (err < 0)
 		goto out;
 
@@ -4368,23 +4369,13 @@ int ext4_ext_truncate(handle_t *handle, struct inode *inode)
 
 	last_block = (inode->i_size + sb->s_blocksize - 1)
 			>> EXT4_BLOCK_SIZE_BITS(sb);
-retry:
 	err = ext4_es_remove_extent(inode, last_block,
-				    EXT_MAX_BLOCKS - last_block);
-	if (err == -ENOMEM) {
-		cond_resched();
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
-		goto retry;
-	}
+				    EXT_MAX_BLOCKS - last_block,
+				    EXT4_EX_NOFAIL);
 	if (err)
 		return err;
-retry_remove_space:
-	err = ext4_ext_remove_space(inode, last_block, EXT_MAX_BLOCKS - 1);
-	if (err == -ENOMEM) {
-		cond_resched();
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
-		goto retry_remove_space;
-	}
+	err = ext4_ext_remove_space(inode, last_block, EXT_MAX_BLOCKS - 1,
+				    EXT4_EX_NOFAIL);
 	return err;
 }
 
@@ -5322,13 +5313,13 @@ static int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
 	ext4_discard_preallocations(inode, 0);
 
 	ret = ext4_es_remove_extent(inode, punch_start,
-				    EXT_MAX_BLOCKS - punch_start);
+				    EXT_MAX_BLOCKS - punch_start, 0);
 	if (ret) {
 		up_write(&EXT4_I(inode)->i_data_sem);
 		goto out_stop;
 	}
 
-	ret = ext4_ext_remove_space(inode, punch_start, punch_stop - 1);
+	ret = ext4_ext_remove_space(inode, punch_start, punch_stop - 1, 0);
 	if (ret) {
 		up_write(&EXT4_I(inode)->i_data_sem);
 		goto out_stop;
@@ -5510,7 +5501,7 @@ static int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
 	}
 
 	ret = ext4_es_remove_extent(inode, offset_lblk,
-			EXT_MAX_BLOCKS - offset_lblk);
+				    EXT_MAX_BLOCKS - offset_lblk, 0);
 	if (ret) {
 		up_write(&EXT4_I(inode)->i_data_sem);
 		goto out_stop;
@@ -5574,10 +5565,10 @@ ext4_swap_extents(handle_t *handle, struct inode *inode1,
 	BUG_ON(!inode_is_locked(inode1));
 	BUG_ON(!inode_is_locked(inode2));
 
-	*erp = ext4_es_remove_extent(inode1, lblk1, count);
+	*erp = ext4_es_remove_extent(inode1, lblk1, count, 0);
 	if (unlikely(*erp))
 		return 0;
-	*erp = ext4_es_remove_extent(inode2, lblk2, count);
+	*erp = ext4_es_remove_extent(inode2, lblk2, count, 0);
 	if (unlikely(*erp))
 		return 0;
 
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 9a3a8996aacf..7f7711a2ea44 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -144,9 +144,10 @@
 static struct kmem_cache *ext4_es_cachep;
 static struct kmem_cache *ext4_pending_cachep;
 
-static int __es_insert_extent(struct inode *inode, struct extent_status *newes);
+static int __es_insert_extent(struct inode *inode, struct extent_status *newes,
+			      int nofail);
 static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
-			      ext4_lblk_t end, int *reserved);
+			      ext4_lblk_t end, int *reserved, int nofail);
 static int es_reclaim_extents(struct ext4_inode_info *ei, int *nr_to_scan);
 static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
 		       struct ext4_inode_info *locked_ei);
@@ -452,10 +453,11 @@ static void ext4_es_list_del(struct inode *inode)
 
 static struct extent_status *
 ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
-		     ext4_fsblk_t pblk)
+		     ext4_fsblk_t pblk, int nofail)
 {
 	struct extent_status *es;
-	es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC);
+	es = kmem_cache_alloc(ext4_es_cachep,
+			      GFP_ATOMIC | (nofail ? __GFP_NOFAIL : 0));
 	if (es == NULL)
 		return NULL;
 	es->es_lblk = lblk;
@@ -754,7 +756,8 @@ static inline void ext4_es_insert_extent_check(struct inode *inode,
 }
 #endif
 
-static int __es_insert_extent(struct inode *inode, struct extent_status *newes)
+static int __es_insert_extent(struct inode *inode, struct extent_status *newes,
+			      int nofail)
 {
 	struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
 	struct rb_node **p = &tree->root.rb_node;
@@ -795,7 +798,7 @@ static int __es_insert_extent(struct inode *inode, struct extent_status *newes)
 	}
 
 	es = ext4_es_alloc_extent(inode, newes->es_lblk, newes->es_len,
-				  newes->es_pblk);
+				  newes->es_pblk, nofail);
 	if (!es)
 		return -ENOMEM;
 	rb_link_node(&es->rb_node, parent, p);
@@ -848,11 +851,11 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 	ext4_es_insert_extent_check(inode, &newes);
 
 	write_lock(&EXT4_I(inode)->i_es_lock);
-	err = __es_remove_extent(inode, lblk, end, NULL);
+	err = __es_remove_extent(inode, lblk, end, NULL, 0);
 	if (err != 0)
 		goto error;
 retry:
-	err = __es_insert_extent(inode, &newes);
+	err = __es_insert_extent(inode, &newes, 0);
 	if (err == -ENOMEM && __es_shrink(EXT4_SB(inode->i_sb),
 					  128, EXT4_I(inode)))
 		goto retry;
@@ -902,7 +905,7 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
 
 	es = __es_tree_search(&EXT4_I(inode)->i_es_tree.root, lblk);
 	if (!es || es->es_lblk > end)
-		__es_insert_extent(inode, &newes);
+		__es_insert_extent(inode, &newes, 0);
 	write_unlock(&EXT4_I(inode)->i_es_lock);
 }
 
@@ -1294,6 +1297,7 @@ static unsigned int get_rsvd(struct inode *inode, ext4_lblk_t end,
  * @lblk - first block in range
  * @end - last block in range
  * @reserved - number of cluster reservations released
+ * @nofail - EXT4_EX_NOFAIL if __GFP_NOFAIL should be used
  *
  * If @reserved is not NULL and delayed allocation is enabled, counts
  * block/cluster reservations freed by removing range and if bigalloc
@@ -1301,7 +1305,7 @@ static unsigned int get_rsvd(struct inode *inode, ext4_lblk_t end,
  * error code on failure.
  */
 static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
-			      ext4_lblk_t end, int *reserved)
+			      ext4_lblk_t end, int *reserved, int nofail)
 {
 	struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
 	struct rb_node *node;
@@ -1350,7 +1354,7 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 					orig_es.es_len - len2;
 			ext4_es_store_pblock_status(&newes, block,
 						    ext4_es_status(&orig_es));
-			err = __es_insert_extent(inode, &newes);
+			err = __es_insert_extent(inode, &newes, nofail);
 			if (err) {
 				es->es_lblk = orig_es.es_lblk;
 				es->es_len = orig_es.es_len;
@@ -1426,12 +1430,13 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
  * @inode - file containing range
  * @lblk - first block in range
  * @len - number of blocks to remove
+ * @nofail - EXT4_EX_NOFAIL if __GFP_NOFAIL should be used
  *
  * Reduces block/cluster reservation count and for bigalloc cancels pending
  * reservations as needed. Returns 0 on success, error code on failure.
  */
 int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
-			  ext4_lblk_t len)
+			  ext4_lblk_t len, int nofail)
 {
 	ext4_lblk_t end;
 	int err = 0;
@@ -1456,7 +1461,7 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 	 * is reclaimed.
 	 */
 	write_lock(&EXT4_I(inode)->i_es_lock);
-	err = __es_remove_extent(inode, lblk, end, &reserved);
+	err = __es_remove_extent(inode, lblk, end, &reserved, nofail);
 	write_unlock(&EXT4_I(inode)->i_es_lock);
 	ext4_es_print_tree(inode);
 	ext4_da_release_space(inode, reserved);
@@ -2003,11 +2008,11 @@ int ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
 
 	write_lock(&EXT4_I(inode)->i_es_lock);
 
-	err = __es_remove_extent(inode, lblk, lblk, NULL);
+	err = __es_remove_extent(inode, lblk, lblk, NULL, 0);
 	if (err != 0)
 		goto error;
 retry:
-	err = __es_insert_extent(inode, &newes);
+	err = __es_insert_extent(inode, &newes, 0);
 	if (err == -ENOMEM && __es_shrink(EXT4_SB(inode->i_sb),
 					  128, EXT4_I(inode)))
 		goto retry;
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 4ec30a798260..23d77094a165 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -134,7 +134,7 @@ extern void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
 				 ext4_lblk_t len, ext4_fsblk_t pblk,
 				 unsigned int status);
 extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
-				 ext4_lblk_t len);
+				 ext4_lblk_t len, int nofail);
 extern void ext4_es_find_extent_range(struct inode *inode,
 				      int (*match_fn)(struct extent_status *es),
 				      ext4_lblk_t lblk, ext4_lblk_t end,
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index f73e5eb43eae..82049fb94314 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -1079,7 +1079,8 @@ struct inode *__ext4_new_inode(struct user_namespace *mnt_userns,
 			BUG_ON(nblocks <= 0);
 			handle = __ext4_journal_start_sb(dir->i_sb, line_no,
 				 handle_type, nblocks, 0,
-				 ext4_trans_default_revoke_credits(sb));
+				 ext4_trans_default_revoke_credits(sb),
+				 GFP_NOFS);
 			if (IS_ERR(handle)) {
 				err = PTR_ERR(handle);
 				ext4_std_error(sb, err);
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 89efa78ed4b2..910e87aea7be 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -1125,7 +1125,7 @@ void ext4_ind_truncate(handle_t *handle, struct inode *inode)
 			return;
 	}
 
-	ext4_es_remove_extent(inode, last_block, EXT_MAX_BLOCKS - last_block);
+	ext4_es_remove_extent(inode, last_block, EXT_MAX_BLOCKS - last_block, 0);
 
 	/*
 	 * The orphan list entry will now protect us from any crash which
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d18852d6029c..24246043d94b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1575,7 +1575,7 @@ static void mpage_release_unused_pages(struct mpage_da_data *mpd,
 		ext4_lblk_t start, last;
 		start = index << (PAGE_SHIFT - inode->i_blkbits);
 		last = end << (PAGE_SHIFT - inode->i_blkbits);
-		ext4_es_remove_extent(inode, start, last - start + 1);
+		ext4_es_remove_extent(inode, start, last - start + 1, 0);
 	}
 
 	pagevec_init(&pvec);
@@ -4109,7 +4109,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 		ext4_discard_preallocations(inode, 0);
 
 		ret = ext4_es_remove_extent(inode, first_block,
-					    stop_block - first_block);
+					    stop_block - first_block, 0);
 		if (ret) {
 			up_write(&EXT4_I(inode)->i_data_sem);
 			goto out_stop;
@@ -4117,7 +4117,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 
 		if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
 			ret = ext4_ext_remove_space(inode, first_block,
-						    stop_block - 1);
+						    stop_block - 1, 0);
 		else
 			ret = ext4_ind_remove_space(handle, inode, first_block,
 						    stop_block);
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 606dee9e08a3..e4de05a6b976 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -79,8 +79,8 @@ static void swap_inode_data(struct inode *inode1, struct inode *inode2)
 		(ei1->i_flags & ~EXT4_FL_SHOULD_SWAP);
 	ei2->i_flags = tmp | (ei2->i_flags & ~EXT4_FL_SHOULD_SWAP);
 	swap(ei1->i_disksize, ei2->i_disksize);
-	ext4_es_remove_extent(inode1, 0, EXT_MAX_BLOCKS);
-	ext4_es_remove_extent(inode2, 0, EXT_MAX_BLOCKS);
+	ext4_es_remove_extent(inode1, 0, EXT_MAX_BLOCKS, 0);
+	ext4_es_remove_extent(inode2, 0, EXT_MAX_BLOCKS, 0);
 
 	isize = i_size_read(inode1);
 	i_size_write(inode1, i_size_read(inode2));
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 0775950ee84e..947e8376a35a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1393,7 +1393,7 @@ void ext4_clear_inode(struct inode *inode)
 	invalidate_inode_buffers(inode);
 	clear_inode(inode);
 	ext4_discard_preallocations(inode, 0);
-	ext4_es_remove_extent(inode, 0, EXT_MAX_BLOCKS);
+	ext4_es_remove_extent(inode, 0, EXT_MAX_BLOCKS, 0);
 	dquot_drop(inode);
 	if (EXT4_I(inode)->jinode) {
 		jbd2_journal_release_jbd_inode(EXT4_JOURNAL(inode),
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 6a3caedd2285..23e0f003d43b 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -476,9 +476,9 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
 }
 
 /* Allocate a new handle.  This should probably be in a slab... */
-static handle_t *new_handle(int nblocks)
+static handle_t *new_handle(int nblocks, gfp_t gfp)
 {
-	handle_t *handle = jbd2_alloc_handle(GFP_NOFS);
+	handle_t *handle = jbd2_alloc_handle(gfp);
 	if (!handle)
 		return NULL;
 	handle->h_total_credits = nblocks;
@@ -505,13 +505,13 @@ handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int rsv_blocks,
 
 	nblocks += DIV_ROUND_UP(revoke_records,
 				journal->j_revoke_records_per_block);
-	handle = new_handle(nblocks);
+	handle = new_handle(nblocks, gfp_mask);
 	if (!handle)
 		return ERR_PTR(-ENOMEM);
 	if (rsv_blocks) {
 		handle_t *rsv_handle;
 
-		rsv_handle = new_handle(rsv_blocks);
+		rsv_handle = new_handle(rsv_blocks, gfp_mask);
 		if (!rsv_handle) {
 			jbd2_free_handle(handle);
 			return ERR_PTR(-ENOMEM);



^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2021-09-17  2:59 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-14  0:13 [PATCH 0/6] congestion_wait() and GFP_NOFAIL NeilBrown
2021-09-14  0:13 ` [PATCH 2/6] MM: annotate congestion_wait() and wait_iff_congested() as ineffective NeilBrown
2021-09-15 11:56   ` Michal Hocko
2021-09-16 22:13     ` NeilBrown
2021-09-14  0:13 ` [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc() NeilBrown
2021-09-14  1:31   ` Dave Chinner
2021-09-14  3:27     ` NeilBrown
2021-09-14  6:05       ` Dave Chinner
2021-09-14  0:13 ` [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops NeilBrown
2021-09-14 16:34   ` Mel Gorman
2021-09-14 21:48     ` NeilBrown
2021-09-15 12:06       ` Michal Hocko
2021-09-15 22:35         ` NeilBrown
2021-09-16  0:37           ` Dave Chinner
2021-09-16  6:52           ` Michal Hocko
2021-09-14 23:55     ` Dave Chinner
2021-09-15  8:59       ` Mel Gorman
2021-09-15 12:20         ` Michal Hocko
2021-09-15 14:35         ` Mel Gorman
2021-09-15 22:38           ` Dave Chinner
2021-09-16  9:00             ` Mel Gorman
2021-09-15  0:28   ` Theodore Ts'o
2021-09-15  5:25     ` NeilBrown
2021-09-15 17:02       ` Theodore Ts'o
2021-09-14  0:13 ` [PATCH 1/6] MM: improve documentation for __GFP_NOFAIL NeilBrown
2021-09-15 11:51   ` Michal Hocko
2021-09-14  0:13 ` [PATCH 6/6] XFS: remove congestion_wait() loop from xfs_buf_alloc_pages() NeilBrown
2021-09-14  2:08   ` Dave Chinner
2021-09-14  2:35     ` NeilBrown
2021-09-14  5:33       ` Dave Chinner
2021-09-14 16:45       ` Mel Gorman
2021-09-14 21:13         ` NeilBrown
2021-09-14  0:13 ` [PATCH 4/6] EXT4: remove congestion_wait from ext4_bio_write_page, and simplify NeilBrown
2021-09-17  2:56 [PATCH 0/6 v2] congestion_wait() and GFP_NOFAIL NeilBrown
2021-09-17  2:56 ` [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops NeilBrown

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).