LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH] mm: fix page cache convergence regression
@ 2019-05-24 15:31 Johannes Weiner
  2019-05-24 16:04 ` Matthew Wilcox
  0 siblings, 1 reply; 9+ messages in thread
From: Johannes Weiner @ 2019-05-24 15:31 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Matthew Wilcox
  Cc: linux-mm, cgroups, linux-kernel, kernel-team

Since a28334862993 ("page cache: Finish XArray conversion"), on most
major Linux distributions, the page cache doesn't correctly transition
when the hot data set is changing, and leaves the new pages thrashing
indefinitely instead of kicking out the cold ones.

On a freshly booted, freshly ssh'd into virtual machine with 1G RAM
running stock Arch Linux:

[root@ham ~]# ./reclaimtest.sh
+ dd of=workingset-a bs=1M count=0 seek=600
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ ./mincore workingset-a
153600/153600 workingset-a
+ dd of=workingset-b bs=1M count=0 seek=600
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
104029/153600 workingset-a
120086/153600 workingset-b
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
104029/153600 workingset-a
120268/153600 workingset-b

workingset-b is a 600M file on a 1G host that is otherwise entirely
idle. No matter how often it's being accessed, it won't get cached.

While investigating, I noticed that the non-resident information gets
aggressively reclaimed - /proc/vmstat::workingset_nodereclaim. This is
a problem because a workingset transition like this relies on the
non-resident information tracked in the page cache tree of evicted
file ranges: when the cache faults are refaults of recently evicted
cache, we challenge the existing active set, and that allows a new
workingset to establish itself.

Tracing the shrinker that maintains this memory revealed that all page
cache tree nodes were allocated to the root cgroup. This is a problem,
because 1) the shrinker sizes the amount of non-resident information
it keeps to the size of the cgroup's other memory and 2) on most major
Linux distributions, only kernel threads live in the root cgroup and
everything else gets put into services or session groups:

[root@ham ~]# cat /proc/self/cgroup
0::/user.slice/user-0.slice/session-c1.scope

As a result, we basically maintain no non-resident information for the
workloads running on the system, thus breaking the caching algorithm.

Looking through the code, I found the culprit in the above-mentioned
patch: when switching from the radix tree to xarray, it dropped the
__GFP_ACCOUNT flag from the tree node allocations - the flag that
makes sure the allocated memory gets charged to and tracked by the
cgroup of the calling process - in this case, the one doing the fault.

To fix this, allow xarray users to specify per-tree gfp flags that
supplement the hardcoded gfp flags inside the xarray expansion code.
This is analogous to the radix tree API. Then restore the page cache
tree annotation that passes the __GFP_ACCOUNT flag during expansions.

With this patch applied, the page cache correctly converges on new
workingsets again after just a few iterations:

[root@ham ~]# ./reclaimtest.sh
+ dd of=workingset-a bs=1M count=0 seek=600
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ ./mincore workingset-a
153600/153600 workingset-a
+ dd of=workingset-b bs=1M count=0 seek=600
+ cat workingset-b
+ ./mincore workingset-a workingset-b
124607/153600 workingset-a
87876/153600 workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
81313/153600 workingset-a
133321/153600 workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
63036/153600 workingset-a
153600/153600 workingset-b

Cc: stable@vger.kernel.org # 4.20+
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/inode.c             | 1 +
 include/linux/xarray.h | 2 ++
 lib/xarray.c           | 8 ++++++--
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index e9d18b2c3f91..3b454d2119c4 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -362,6 +362,7 @@ EXPORT_SYMBOL(inc_nlink);
 static void __address_space_init_once(struct address_space *mapping)
 {
 	xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ);
+	mapping->i_pages.xa_gfp = __GFP_ACCOUNT;
 	init_rwsem(&mapping->i_mmap_rwsem);
 	INIT_LIST_HEAD(&mapping->private_list);
 	spin_lock_init(&mapping->private_lock);
diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 0e01e6129145..cbbf76e4c973 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -292,6 +292,7 @@ struct xarray {
 	spinlock_t	xa_lock;
 /* private: The rest of the data structure is not to be used directly. */
 	gfp_t		xa_flags;
+	gfp_t		xa_gfp;
 	void __rcu *	xa_head;
 };
 
@@ -374,6 +375,7 @@ static inline void xa_init_flags(struct xarray *xa, gfp_t flags)
 {
 	spin_lock_init(&xa->xa_lock);
 	xa->xa_flags = flags;
+	xa->xa_gfp = 0;
 	xa->xa_head = NULL;
 }
 
diff --git a/lib/xarray.c b/lib/xarray.c
index 6be3acbb861f..324be9534861 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -298,6 +298,7 @@ bool xas_nomem(struct xa_state *xas, gfp_t gfp)
 		xas_destroy(xas);
 		return false;
 	}
+	gfp |= xas->xa->xa_gfp;
 	xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
 	if (!xas->xa_alloc)
 		return false;
@@ -325,6 +326,7 @@ static bool __xas_nomem(struct xa_state *xas, gfp_t gfp)
 		xas_destroy(xas);
 		return false;
 	}
+	gfp |= xas->xa->xa_gfp;
 	if (gfpflags_allow_blocking(gfp)) {
 		xas_unlock_type(xas, lock_type);
 		xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
@@ -358,8 +360,10 @@ static void *xas_alloc(struct xa_state *xas, unsigned int shift)
 	if (node) {
 		xas->xa_alloc = NULL;
 	} else {
-		node = kmem_cache_alloc(radix_tree_node_cachep,
-					GFP_NOWAIT | __GFP_NOWARN);
+		gfp_t gfp;
+
+		gfp = GFP_NOWAIT | __GFP_NOWARN | xas->xa->xa_gfp;
+		node = kmem_cache_alloc(radix_tree_node_cachep, gfp);
 		if (!node) {
 			xas_set_err(xas, -ENOMEM);
 			return NULL;
-- 
2.21.0


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: fix page cache convergence regression
  2019-05-24 15:31 [PATCH] mm: fix page cache convergence regression Johannes Weiner
@ 2019-05-24 16:04 ` Matthew Wilcox
  2019-05-24 17:39   ` Johannes Weiner
  0 siblings, 1 reply; 9+ messages in thread
From: Matthew Wilcox @ 2019-05-24 16:04 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linus Torvalds, Andrew Morton, linux-mm, cgroups, linux-kernel,
	kernel-team

On Fri, May 24, 2019 at 11:31:48AM -0400, Johannes Weiner wrote:
> diff --git a/include/linux/xarray.h b/include/linux/xarray.h
> index 0e01e6129145..cbbf76e4c973 100644
> --- a/include/linux/xarray.h
> +++ b/include/linux/xarray.h
> @@ -292,6 +292,7 @@ struct xarray {
>  	spinlock_t	xa_lock;
>  /* private: The rest of the data structure is not to be used directly. */
>  	gfp_t		xa_flags;
> +	gfp_t		xa_gfp;
>  	void __rcu *	xa_head;
>  };

No.  I'm willing to go for a xa_flag which says to use __GFP_ACCOUNT, but
you can't add another element to the struct xarray.

We haven't even finished the discussion from yesterday.  I'm going to
go back to that thread and keep discussing there.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: fix page cache convergence regression
  2019-05-24 16:04 ` Matthew Wilcox
@ 2019-05-24 17:39   ` Johannes Weiner
  2019-05-24 18:04     ` Shakeel Butt
  2019-05-30 16:15     ` Johannes Weiner
  0 siblings, 2 replies; 9+ messages in thread
From: Johannes Weiner @ 2019-05-24 17:39 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Andrew Morton, linux-mm, cgroups, linux-kernel,
	kernel-team

On Fri, May 24, 2019 at 09:04:17AM -0700, Matthew Wilcox wrote:
> On Fri, May 24, 2019 at 11:31:48AM -0400, Johannes Weiner wrote:
> > diff --git a/include/linux/xarray.h b/include/linux/xarray.h
> > index 0e01e6129145..cbbf76e4c973 100644
> > --- a/include/linux/xarray.h
> > +++ b/include/linux/xarray.h
> > @@ -292,6 +292,7 @@ struct xarray {
> >  	spinlock_t	xa_lock;
> >  /* private: The rest of the data structure is not to be used directly. */
> >  	gfp_t		xa_flags;
> > +	gfp_t		xa_gfp;
> >  	void __rcu *	xa_head;
> >  };
> 
> No.  I'm willing to go for a xa_flag which says to use __GFP_ACCOUNT, but
> you can't add another element to the struct xarray.

Ok, we can generalize per-tree gfp flags later if necessary.

Below is the updated fix that uses an XA_FLAGS_ACCOUNT flag instead.

---
From 63a0dbc571ff38f7c072c62d6bc28192debe37ac Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Fri, 24 May 2019 10:12:46 -0400
Subject: [PATCH] mm: fix page cache convergence regression

Since a28334862993 ("page cache: Finish XArray conversion"), on most
major Linux distributions, the page cache doesn't correctly transition
when the hot data set is changing, and leaves the new pages thrashing
indefinitely instead of kicking out the cold ones.

On a freshly booted, freshly ssh'd into virtual machine with 1G RAM
running stock Arch Linux:

[root@ham ~]# ./reclaimtest.sh
+ dd of=workingset-a bs=1M count=0 seek=600
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ ./mincore workingset-a
153600/153600 workingset-a
+ dd of=workingset-b bs=1M count=0 seek=600
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
104029/153600 workingset-a
120086/153600 workingset-b
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
104029/153600 workingset-a
120268/153600 workingset-b

workingset-b is a 600M file on a 1G host that is otherwise entirely
idle. No matter how often it's being accessed, it won't get cached.

While investigating, I noticed that the non-resident information gets
aggressively reclaimed - /proc/vmstat::workingset_nodereclaim. This is
a problem because a workingset transition like this relies on the
non-resident information tracked in the page cache tree of evicted
file ranges: when the cache faults are refaults of recently evicted
cache, we challenge the existing active set, and that allows a new
workingset to establish itself.

Tracing the shrinker that maintains this memory revealed that all page
cache tree nodes were allocated to the root cgroup. This is a problem,
because 1) the shrinker sizes the amount of non-resident information
it keeps to the size of the cgroup's other memory and 2) on most major
Linux distributions, only kernel threads live in the root cgroup and
everything else gets put into services or session groups:

[root@ham ~]# cat /proc/self/cgroup
0::/user.slice/user-0.slice/session-c1.scope

As a result, we basically maintain no non-resident information for the
workloads running on the system, thus breaking the caching algorithm.

Looking through the code, I found the culprit in the above-mentioned
patch: when switching from the radix tree to xarray, it dropped the
__GFP_ACCOUNT flag from the tree node allocations - the flag that
makes sure the allocated memory gets charged to and tracked by the
cgroup of the calling process - in this case, the one doing the fault.

To fix this, allow xarray users to specify per-tree flag that makes
xarray allocate nodes using __GFP_ACCOUNT. Then restore the page cache
tree annotation to request such cgroup tracking for the cache nodes.

With this patch applied, the page cache correctly converges on new
workingsets again after just a few iterations:

[root@ham ~]# ./reclaimtest.sh
+ dd of=workingset-a bs=1M count=0 seek=600
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ ./mincore workingset-a
153600/153600 workingset-a
+ dd of=workingset-b bs=1M count=0 seek=600
+ cat workingset-b
+ ./mincore workingset-a workingset-b
124607/153600 workingset-a
87876/153600 workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
81313/153600 workingset-a
133321/153600 workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
63036/153600 workingset-a
153600/153600 workingset-b

Cc: stable@vger.kernel.org # 4.20+
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/inode.c             |  2 +-
 include/linux/xarray.h |  1 +
 lib/xarray.c           | 12 ++++++++++--
 3 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index e9d18b2c3f91..cd67859dbaf1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -361,7 +361,7 @@ EXPORT_SYMBOL(inc_nlink);
 
 static void __address_space_init_once(struct address_space *mapping)
 {
-	xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ);
+	xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
 	init_rwsem(&mapping->i_mmap_rwsem);
 	INIT_LIST_HEAD(&mapping->private_list);
 	spin_lock_init(&mapping->private_lock);
diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 0e01e6129145..5921599b6dc4 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -265,6 +265,7 @@ enum xa_lock_type {
 #define XA_FLAGS_TRACK_FREE	((__force gfp_t)4U)
 #define XA_FLAGS_ZERO_BUSY	((__force gfp_t)8U)
 #define XA_FLAGS_ALLOC_WRAPPED	((__force gfp_t)16U)
+#define XA_FLAGS_ACCOUNT	((__force gfp_t)32U)
 #define XA_FLAGS_MARK(mark)	((__force gfp_t)((1U << __GFP_BITS_SHIFT) << \
 						(__force unsigned)(mark)))
 
diff --git a/lib/xarray.c b/lib/xarray.c
index 6be3acbb861f..446b956c9188 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -298,6 +298,8 @@ bool xas_nomem(struct xa_state *xas, gfp_t gfp)
 		xas_destroy(xas);
 		return false;
 	}
+	if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
+		gfp |= __GFP_ACCOUNT;
 	xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
 	if (!xas->xa_alloc)
 		return false;
@@ -325,6 +327,8 @@ static bool __xas_nomem(struct xa_state *xas, gfp_t gfp)
 		xas_destroy(xas);
 		return false;
 	}
+	if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
+		gfp |= __GFP_ACCOUNT;
 	if (gfpflags_allow_blocking(gfp)) {
 		xas_unlock_type(xas, lock_type);
 		xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
@@ -358,8 +362,12 @@ static void *xas_alloc(struct xa_state *xas, unsigned int shift)
 	if (node) {
 		xas->xa_alloc = NULL;
 	} else {
-		node = kmem_cache_alloc(radix_tree_node_cachep,
-					GFP_NOWAIT | __GFP_NOWARN);
+		gfp_t gfp = GFP_NOWAIT | __GFP_NOWARN;
+
+		if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
+			gfp |= __GFP_ACCOUNT;
+
+		node = kmem_cache_alloc(radix_tree_node_cachep, gfp);
 		if (!node) {
 			xas_set_err(xas, -ENOMEM);
 			return NULL;
-- 
2.21.0


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: fix page cache convergence regression
  2019-05-24 17:39   ` Johannes Weiner
@ 2019-05-24 18:04     ` Shakeel Butt
  2019-05-30 16:15     ` Johannes Weiner
  1 sibling, 0 replies; 9+ messages in thread
From: Shakeel Butt @ 2019-05-24 18:04 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Matthew Wilcox, Linus Torvalds, Andrew Morton, Linux MM, Cgroups,
	LKML, Kernel Team

On Fri, May 24, 2019 at 10:41 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, May 24, 2019 at 09:04:17AM -0700, Matthew Wilcox wrote:
> > On Fri, May 24, 2019 at 11:31:48AM -0400, Johannes Weiner wrote:
> > > diff --git a/include/linux/xarray.h b/include/linux/xarray.h
> > > index 0e01e6129145..cbbf76e4c973 100644
> > > --- a/include/linux/xarray.h
> > > +++ b/include/linux/xarray.h
> > > @@ -292,6 +292,7 @@ struct xarray {
> > >     spinlock_t      xa_lock;
> > >  /* private: The rest of the data structure is not to be used directly. */
> > >     gfp_t           xa_flags;
> > > +   gfp_t           xa_gfp;
> > >     void __rcu *    xa_head;
> > >  };
> >
> > No.  I'm willing to go for a xa_flag which says to use __GFP_ACCOUNT, but
> > you can't add another element to the struct xarray.
>
> Ok, we can generalize per-tree gfp flags later if necessary.
>
> Below is the updated fix that uses an XA_FLAGS_ACCOUNT flag instead.
>
> ---
> From 63a0dbc571ff38f7c072c62d6bc28192debe37ac Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Fri, 24 May 2019 10:12:46 -0400
> Subject: [PATCH] mm: fix page cache convergence regression
>
> Since a28334862993 ("page cache: Finish XArray conversion"), on most
> major Linux distributions, the page cache doesn't correctly transition
> when the hot data set is changing, and leaves the new pages thrashing
> indefinitely instead of kicking out the cold ones.
>
> On a freshly booted, freshly ssh'd into virtual machine with 1G RAM
> running stock Arch Linux:
>
> [root@ham ~]# ./reclaimtest.sh
> + dd of=workingset-a bs=1M count=0 seek=600
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + ./mincore workingset-a
> 153600/153600 workingset-a
> + dd of=workingset-b bs=1M count=0 seek=600
> + cat workingset-b
> + cat workingset-b
> + cat workingset-b
> + cat workingset-b
> + ./mincore workingset-a workingset-b
> 104029/153600 workingset-a
> 120086/153600 workingset-b
> + cat workingset-b
> + cat workingset-b
> + cat workingset-b
> + cat workingset-b
> + ./mincore workingset-a workingset-b
> 104029/153600 workingset-a
> 120268/153600 workingset-b
>
> workingset-b is a 600M file on a 1G host that is otherwise entirely
> idle. No matter how often it's being accessed, it won't get cached.
>
> While investigating, I noticed that the non-resident information gets
> aggressively reclaimed - /proc/vmstat::workingset_nodereclaim. This is
> a problem because a workingset transition like this relies on the
> non-resident information tracked in the page cache tree of evicted
> file ranges: when the cache faults are refaults of recently evicted
> cache, we challenge the existing active set, and that allows a new
> workingset to establish itself.
>
> Tracing the shrinker that maintains this memory revealed that all page
> cache tree nodes were allocated to the root cgroup. This is a problem,
> because 1) the shrinker sizes the amount of non-resident information
> it keeps to the size of the cgroup's other memory and 2) on most major
> Linux distributions, only kernel threads live in the root cgroup and
> everything else gets put into services or session groups:
>
> [root@ham ~]# cat /proc/self/cgroup
> 0::/user.slice/user-0.slice/session-c1.scope
>
> As a result, we basically maintain no non-resident information for the
> workloads running on the system, thus breaking the caching algorithm.
>
> Looking through the code, I found the culprit in the above-mentioned
> patch: when switching from the radix tree to xarray, it dropped the
> __GFP_ACCOUNT flag from the tree node allocations - the flag that
> makes sure the allocated memory gets charged to and tracked by the
> cgroup of the calling process - in this case, the one doing the fault.
>
> To fix this, allow xarray users to specify per-tree flag that makes
> xarray allocate nodes using __GFP_ACCOUNT. Then restore the page cache
> tree annotation to request such cgroup tracking for the cache nodes.
>
> With this patch applied, the page cache correctly converges on new
> workingsets again after just a few iterations:
>
> [root@ham ~]# ./reclaimtest.sh
> + dd of=workingset-a bs=1M count=0 seek=600
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + ./mincore workingset-a
> 153600/153600 workingset-a
> + dd of=workingset-b bs=1M count=0 seek=600
> + cat workingset-b
> + ./mincore workingset-a workingset-b
> 124607/153600 workingset-a
> 87876/153600 workingset-b
> + cat workingset-b
> + ./mincore workingset-a workingset-b
> 81313/153600 workingset-a
> 133321/153600 workingset-b
> + cat workingset-b
> + ./mincore workingset-a workingset-b
> 63036/153600 workingset-a
> 153600/153600 workingset-b
>
> Cc: stable@vger.kernel.org # 4.20+
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: fix page cache convergence regression
  2019-05-24 17:39   ` Johannes Weiner
  2019-05-24 18:04     ` Shakeel Butt
@ 2019-05-30 16:15     ` Johannes Weiner
  2019-05-30 17:13       ` Matthew Wilcox
  1 sibling, 1 reply; 9+ messages in thread
From: Johannes Weiner @ 2019-05-30 16:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Andrew Morton, linux-mm, cgroups, linux-kernel,
	kernel-team

Are there any objections or feedback on the proposed fix below? This
is kind of a serious regression.

On Fri, May 24, 2019 at 01:39:00PM -0400, Johannes Weiner wrote:
> From 63a0dbc571ff38f7c072c62d6bc28192debe37ac Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Fri, 24 May 2019 10:12:46 -0400
> Subject: [PATCH] mm: fix page cache convergence regression
> 
> Since a28334862993 ("page cache: Finish XArray conversion"), on most
> major Linux distributions, the page cache doesn't correctly transition
> when the hot data set is changing, and leaves the new pages thrashing
> indefinitely instead of kicking out the cold ones.
> 
> On a freshly booted, freshly ssh'd into virtual machine with 1G RAM
> running stock Arch Linux:
> 
> [root@ham ~]# ./reclaimtest.sh
> + dd of=workingset-a bs=1M count=0 seek=600
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + ./mincore workingset-a
> 153600/153600 workingset-a
> + dd of=workingset-b bs=1M count=0 seek=600
> + cat workingset-b
> + cat workingset-b
> + cat workingset-b
> + cat workingset-b
> + ./mincore workingset-a workingset-b
> 104029/153600 workingset-a
> 120086/153600 workingset-b
> + cat workingset-b
> + cat workingset-b
> + cat workingset-b
> + cat workingset-b
> + ./mincore workingset-a workingset-b
> 104029/153600 workingset-a
> 120268/153600 workingset-b
> 
> workingset-b is a 600M file on a 1G host that is otherwise entirely
> idle. No matter how often it's being accessed, it won't get cached.
> 
> While investigating, I noticed that the non-resident information gets
> aggressively reclaimed - /proc/vmstat::workingset_nodereclaim. This is
> a problem because a workingset transition like this relies on the
> non-resident information tracked in the page cache tree of evicted
> file ranges: when the cache faults are refaults of recently evicted
> cache, we challenge the existing active set, and that allows a new
> workingset to establish itself.
> 
> Tracing the shrinker that maintains this memory revealed that all page
> cache tree nodes were allocated to the root cgroup. This is a problem,
> because 1) the shrinker sizes the amount of non-resident information
> it keeps to the size of the cgroup's other memory and 2) on most major
> Linux distributions, only kernel threads live in the root cgroup and
> everything else gets put into services or session groups:
> 
> [root@ham ~]# cat /proc/self/cgroup
> 0::/user.slice/user-0.slice/session-c1.scope
> 
> As a result, we basically maintain no non-resident information for the
> workloads running on the system, thus breaking the caching algorithm.
> 
> Looking through the code, I found the culprit in the above-mentioned
> patch: when switching from the radix tree to xarray, it dropped the
> __GFP_ACCOUNT flag from the tree node allocations - the flag that
> makes sure the allocated memory gets charged to and tracked by the
> cgroup of the calling process - in this case, the one doing the fault.
> 
> To fix this, allow xarray users to specify per-tree flag that makes
> xarray allocate nodes using __GFP_ACCOUNT. Then restore the page cache
> tree annotation to request such cgroup tracking for the cache nodes.
> 
> With this patch applied, the page cache correctly converges on new
> workingsets again after just a few iterations:
> 
> [root@ham ~]# ./reclaimtest.sh
> + dd of=workingset-a bs=1M count=0 seek=600
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + ./mincore workingset-a
> 153600/153600 workingset-a
> + dd of=workingset-b bs=1M count=0 seek=600
> + cat workingset-b
> + ./mincore workingset-a workingset-b
> 124607/153600 workingset-a
> 87876/153600 workingset-b
> + cat workingset-b
> + ./mincore workingset-a workingset-b
> 81313/153600 workingset-a
> 133321/153600 workingset-b
> + cat workingset-b
> + ./mincore workingset-a workingset-b
> 63036/153600 workingset-a
> 153600/153600 workingset-b
> 
> Cc: stable@vger.kernel.org # 4.20+
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  fs/inode.c             |  2 +-
>  include/linux/xarray.h |  1 +
>  lib/xarray.c           | 12 ++++++++++--
>  3 files changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index e9d18b2c3f91..cd67859dbaf1 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -361,7 +361,7 @@ EXPORT_SYMBOL(inc_nlink);
>  
>  static void __address_space_init_once(struct address_space *mapping)
>  {
> -	xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ);
> +	xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
>  	init_rwsem(&mapping->i_mmap_rwsem);
>  	INIT_LIST_HEAD(&mapping->private_list);
>  	spin_lock_init(&mapping->private_lock);
> diff --git a/include/linux/xarray.h b/include/linux/xarray.h
> index 0e01e6129145..5921599b6dc4 100644
> --- a/include/linux/xarray.h
> +++ b/include/linux/xarray.h
> @@ -265,6 +265,7 @@ enum xa_lock_type {
>  #define XA_FLAGS_TRACK_FREE	((__force gfp_t)4U)
>  #define XA_FLAGS_ZERO_BUSY	((__force gfp_t)8U)
>  #define XA_FLAGS_ALLOC_WRAPPED	((__force gfp_t)16U)
> +#define XA_FLAGS_ACCOUNT	((__force gfp_t)32U)
>  #define XA_FLAGS_MARK(mark)	((__force gfp_t)((1U << __GFP_BITS_SHIFT) << \
>  						(__force unsigned)(mark)))
>  
> diff --git a/lib/xarray.c b/lib/xarray.c
> index 6be3acbb861f..446b956c9188 100644
> --- a/lib/xarray.c
> +++ b/lib/xarray.c
> @@ -298,6 +298,8 @@ bool xas_nomem(struct xa_state *xas, gfp_t gfp)
>  		xas_destroy(xas);
>  		return false;
>  	}
> +	if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
> +		gfp |= __GFP_ACCOUNT;
>  	xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
>  	if (!xas->xa_alloc)
>  		return false;
> @@ -325,6 +327,8 @@ static bool __xas_nomem(struct xa_state *xas, gfp_t gfp)
>  		xas_destroy(xas);
>  		return false;
>  	}
> +	if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
> +		gfp |= __GFP_ACCOUNT;
>  	if (gfpflags_allow_blocking(gfp)) {
>  		xas_unlock_type(xas, lock_type);
>  		xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
> @@ -358,8 +362,12 @@ static void *xas_alloc(struct xa_state *xas, unsigned int shift)
>  	if (node) {
>  		xas->xa_alloc = NULL;
>  	} else {
> -		node = kmem_cache_alloc(radix_tree_node_cachep,
> -					GFP_NOWAIT | __GFP_NOWARN);
> +		gfp_t gfp = GFP_NOWAIT | __GFP_NOWARN;
> +
> +		if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
> +			gfp |= __GFP_ACCOUNT;
> +
> +		node = kmem_cache_alloc(radix_tree_node_cachep, gfp);
>  		if (!node) {
>  			xas_set_err(xas, -ENOMEM);
>  			return NULL;
> -- 
> 2.21.0
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: fix page cache convergence regression
  2019-05-30 16:15     ` Johannes Weiner
@ 2019-05-30 17:13       ` Matthew Wilcox
  2019-05-30 17:58         ` Johannes Weiner
  2019-06-24 15:19         ` Johannes Weiner
  0 siblings, 2 replies; 9+ messages in thread
From: Matthew Wilcox @ 2019-05-30 17:13 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linus Torvalds, Andrew Morton, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu, May 30, 2019 at 12:15:48PM -0400, Johannes Weiner wrote:
> Are there any objections or feedback on the proposed fix below? This
> is kind of a serious regression.

I'll drop it into the xarray tree for merging in a week, if that's ok
with you?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: fix page cache convergence regression
  2019-05-30 17:13       ` Matthew Wilcox
@ 2019-05-30 17:58         ` Johannes Weiner
  2019-06-24 15:19         ` Johannes Weiner
  1 sibling, 0 replies; 9+ messages in thread
From: Johannes Weiner @ 2019-05-30 17:58 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Andrew Morton, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu, May 30, 2019 at 10:13:56AM -0700, Matthew Wilcox wrote:
> On Thu, May 30, 2019 at 12:15:48PM -0400, Johannes Weiner wrote:
> > Are there any objections or feedback on the proposed fix below? This
> > is kind of a serious regression.
> 
> I'll drop it into the xarray tree for merging in a week, if that's ok
> with you?

That sounds great, thank you.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: fix page cache convergence regression
  2019-05-30 17:13       ` Matthew Wilcox
  2019-05-30 17:58         ` Johannes Weiner
@ 2019-06-24 15:19         ` Johannes Weiner
  2019-06-24 21:52           ` Linus Torvalds
  1 sibling, 1 reply; 9+ messages in thread
From: Johannes Weiner @ 2019-06-24 15:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Andrew Morton, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu, May 30, 2019 at 10:13:56AM -0700, Matthew Wilcox wrote:
> On Thu, May 30, 2019 at 12:15:48PM -0400, Johannes Weiner wrote:
> > Are there any objections or feedback on the proposed fix below? This
> > is kind of a serious regression.
> 
> I'll drop it into the xarray tree for merging in a week, if that's ok
> with you?

Hey, it's three weeks later and we're about to miss 5.2.

This sucks, Matthew. You introduced a serious regression to the MM
subsystem, whose process and patch routing you largely bypassed. When
I encountered the problem and provided a reproducer and a fix, you
gave me a hard time on cosmetic grounds. I incorporated all your
feedback, and still you show no urgency to get this patch or a fix of
your own into mainline. It's your bug, please fix it.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: fix page cache convergence regression
  2019-06-24 15:19         ` Johannes Weiner
@ 2019-06-24 21:52           ` Linus Torvalds
  0 siblings, 0 replies; 9+ messages in thread
From: Linus Torvalds @ 2019-06-24 21:52 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Matthew Wilcox, Andrew Morton, Linux-MM, Cgroups,
	Linux List Kernel Mailing, kernel-team

On Mon, Jun 24, 2019 at 11:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Hey, it's three weeks later and we're about to miss 5.2.
>
> This sucks, Matthew.

Yes.

And I do think that having a real gfp field there would be better than
the very odd xa_flags that is *marked* as being gfp_t, but isn't
really a gfp_t at all.

So how about we apply Johannes' patch, and then work on making that
xa_flags field be a proper type of its own. Because it really isn't a
gfp_t, and never has been, even if there might be some very limited
and hacky overlap right now.

Alternatrively, the subset of bits that _can_ be used as a gfp should
actually be used as such, in xas_alloc/xas_nomem. So that you can do

    xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ | __GFP_ACCOUNT);

in __address_space_init_once() and it would do what it is supposed to do.

Willy?

               Linus

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2019-06-24 21:52 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-24 15:31 [PATCH] mm: fix page cache convergence regression Johannes Weiner
2019-05-24 16:04 ` Matthew Wilcox
2019-05-24 17:39   ` Johannes Weiner
2019-05-24 18:04     ` Shakeel Butt
2019-05-30 16:15     ` Johannes Weiner
2019-05-30 17:13       ` Matthew Wilcox
2019-05-30 17:58         ` Johannes Weiner
2019-06-24 15:19         ` Johannes Weiner
2019-06-24 21:52           ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).