LKML Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH 0/2] mm/madvise: teach MADV_PAGEOUT about swap cache
@ 2020-03-23 23:41 Dave Hansen
  2020-03-23 23:41 ` [PATCH 1/2] mm/madvise: help MADV_PAGEOUT to find swap cache pages Dave Hansen
  2020-03-23 23:41 ` [PATCH 2/2] mm/madvise: skip MADV_PAGEOUT on shared " Dave Hansen
  0 siblings, 2 replies; 7+ messages in thread
From: Dave Hansen @ 2020-03-23 23:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, mhocko, jannh, vbabka, minchan, dancol, joel, akpm

MADV_PAGEOUT ignores swap cache pages which are not mapped
into the calling process.  That's nasty because it means that
some of the most easily reclaimable pages are inaccessible to
a mechanism designed to reclaim pages.

This all turned out to be a bit nastier and more complicated
than I would have liked.  This has been lightly tested, but I
did pass a normal barrage of compile tests.

I rigged up a little test program to try to create these
situations.  Basically, if the parent "reader" RSS changes
in response to MADV_PAGEOUT actions in the child, there is
a problem.

	https://www.sr71.net/~dave/intel/madv-pageout.c

I'd add this to selftests, but it *requires* swap to work
and its parsing of /proc/self/maps is quite icky.

P.S. mincore() really doesn't work at all in this situation,
probably something we need to look at too.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Jann Horn <jannh@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Cc: Andrew Morton <akpm@linux-foundation.org>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 1/2] mm/madvise: help MADV_PAGEOUT to find swap cache pages
  2020-03-23 23:41 [PATCH 0/2] mm/madvise: teach MADV_PAGEOUT about swap cache Dave Hansen
@ 2020-03-23 23:41 ` Dave Hansen
  2020-03-26  6:24   ` Minchan Kim
  2020-03-23 23:41 ` [PATCH 2/2] mm/madvise: skip MADV_PAGEOUT on shared " Dave Hansen
  1 sibling, 1 reply; 7+ messages in thread
From: Dave Hansen @ 2020-03-23 23:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, mhocko, jannh, vbabka, minchan, dancol, joel, akpm


From: Dave Hansen <dave.hansen@linux.intel.com>

tl;dr: MADV_PAGEOUT ignores unmapped swap cache pages.  Enable
MADV_PAGEOUT to find and reclaim swap cache.

The long story:

Looking for another issue, I wrote a simple test which had two
processes: a parent and a fork()'d child.  The parent reads a
memory buffer shared by the fork() and the child calls
madvise(MADV_PAGEOUT) on the same buffer.

The first call to MADV_PAGEOUT does what is expected: it pages
the memory out and causes faults in the parent.  However, after
that, it does not cause any faults in the parent.  MADV_PAGEOUT
only works once!  This was a surprise.

The PTEs in the shared buffer start out pte_present()==1 in
both parent and child.  The first MADV_PAGEOUT operation replaces
those with pte_present()==0 swap PTEs.  The parent process
quickly faults and recreates pte_present()==1.  However, the
child process (the one calling MADV_PAGEOUT) never touches the
memory and has retained the non-present swap PTEs.

This situation could also happen in the case where a single
process had some of its data placed in the swap cache but where
the memory has not yet been reclaimed.

The MADV_PAGEOUT code has a pte_present()==0 check.  It will
essentially ignore any pte_present()==0 pages.  This essentially
makes unmapped swap cache immune from MADV_PAGEOUT, which is not
very friendly behavior.

Enable MADV_PAGEOUT to find and reclaim swap cache.  Because
swap cache is not pinned by holding the PTE lock, a reference
must be held until the page is isolated, where a second
reference is obtained.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jann Horn <jannh@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
---

 b/mm/madvise.c |   68 +++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 57 insertions(+), 11 deletions(-)

diff -puN mm/madvise.c~madv-pageout-find-swap-cache mm/madvise.c
--- a/mm/madvise.c~madv-pageout-find-swap-cache	2020-03-23 16:30:48.505385896 -0700
+++ b/mm/madvise.c	2020-03-23 16:30:48.509385896 -0700
@@ -250,6 +250,52 @@ static void force_shm_swapin_readahead(s
 #endif		/* CONFIG_SWAP */
 
 /*
+ * Given a PTE, find the corresponding 'struct page'
+ * and acquire a reference.  Also handles non-present
+ * swap PTEs.
+ *
+ * Returns NULL when there is no page to reclaim.
+ */
+static struct page *pte_get_reclaim_page(struct vm_area_struct *vma,
+					 unsigned long addr, pte_t ptent)
+{
+	swp_entry_t entry;
+	struct page *page;
+
+	/* Totally empty PTE: */
+	if (pte_none(ptent))
+		return NULL;
+
+	/* Handle present or PROT_NONE ptes: */
+	if (!is_swap_pte(ptent)) {
+		page = vm_normal_page(vma, addr, ptent);
+		if (page)
+			get_page(page);
+		return page;
+	}
+
+	/*
+	 * 'ptent' is now definitely a (non-present) swap
+	 * PTE in this process.  Go look for additional
+	 * references to the swap cache.
+	 */
+
+	/*
+	 * Is it one of the "swap PTEs" that's not really
+	 * swap?  Do not try to reclaim those.
+	 */
+	entry = pte_to_swp_entry(ptent);
+	if (non_swap_entry(entry))
+		return NULL;
+
+	/*
+	 * The PTE was a true swap entry.  The page may be in
+	 * the swap cache.
+	 */
+	return lookup_swap_cache(entry, vma, addr);
+}
+
+/*
  * Schedule all required I/O operations.  Do not wait for completion.
  */
 static long madvise_willneed(struct vm_area_struct *vma,
@@ -398,13 +444,8 @@ regular_page:
 	for (; addr < end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;
 
-		if (pte_none(ptent))
-			continue;
-
-		if (!pte_present(ptent))
-			continue;
-
-		page = vm_normal_page(vma, addr, ptent);
+		/* 'page' can be mapped, in the swap cache or both */
+		page = pte_get_reclaim_page(vma, addr, ptent);
 		if (!page)
 			continue;
 
@@ -413,9 +454,10 @@ regular_page:
 		 * are sure it's worth. Split it if we are only owner.
 		 */
 		if (PageTransCompound(page)) {
-			if (page_mapcount(page) != 1)
+			if (page_mapcount(page) != 1) {
+				put_page(page);
 				break;
-			get_page(page);
+			}
 			if (!trylock_page(page)) {
 				put_page(page);
 				break;
@@ -436,12 +478,14 @@ regular_page:
 		}
 
 		/* Do not interfere with other mappings of this page */
-		if (page_mapcount(page) != 1)
+		if (page_mapcount(page) != 1) {
+			put_page(page);
 			continue;
+		}
 
 		VM_BUG_ON_PAGE(PageTransCompound(page), page);
 
-		if (pte_young(ptent)) {
+		if (!is_swap_pte(ptent) && pte_young(ptent)) {
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
 			ptent = pte_mkold(ptent);
@@ -466,6 +510,8 @@ regular_page:
 			}
 		} else
 			deactivate_page(page);
+		/* drop ref acquired in pte_get_reclaim_page() */
+		put_page(page);
 	}
 
 	arch_leave_lazy_mmu_mode();
_

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 2/2] mm/madvise: skip MADV_PAGEOUT on shared swap cache pages
  2020-03-23 23:41 [PATCH 0/2] mm/madvise: teach MADV_PAGEOUT about swap cache Dave Hansen
  2020-03-23 23:41 ` [PATCH 1/2] mm/madvise: help MADV_PAGEOUT to find swap cache pages Dave Hansen
@ 2020-03-23 23:41 ` Dave Hansen
  2020-03-26  6:28   ` Minchan Kim
  1 sibling, 1 reply; 7+ messages in thread
From: Dave Hansen @ 2020-03-23 23:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, mhocko, jannh, vbabka, minchan, dancol, joel, akpm


From: Dave Hansen <dave.hansen@linux.intel.com>

MADV_PAGEOUT might interfere with other processes if it is
allowed to reclaim pages shared with other processses.  A
previous patch tried to avoid this for anonymous pages
which were shared by a fork().  It did this by checking
page_mapcount().

That works great for mapped pages.  But, it can not detect
unmapped swap cache pages.  This has not been a problem,
until the previous patch which added the ability for
MADV_PAGEOUT to *find* swap cache pages.

A process doing MADV_PAGEOUT which finds an unmapped swap
cache page and evicts it might interfere with another process
which had the same page mapped.  But, such a page would have
a page_mapcount() of 1 since the page is only actually mapped
in the *other* process.  The page_mapcount() test would fail
to detect the situation.

Thankfully, there is a reference count for swap entries.
To fix this, simply consult both page_mapcount() and the swap
reference count via page_swapcount().

I rigged up a little test program to try to create these
situations.  Basically, if the parent "reader" RSS changes
in response to MADV_PAGEOUT actions in the child, there is
a problem.

	https://www.sr71.net/~dave/intel/madv-pageout.c

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jann Horn <jannh@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
---

 b/mm/madvise.c |   37 +++++++++++++++++++++++++++++--------
 1 file changed, 29 insertions(+), 8 deletions(-)

diff -puN mm/madvise.c~madv-pageout-ignore-shared-swap-cache mm/madvise.c
--- a/mm/madvise.c~madv-pageout-ignore-shared-swap-cache	2020-03-23 16:30:52.022385888 -0700
+++ b/mm/madvise.c	2020-03-23 16:41:15.448384333 -0700
@@ -261,6 +261,7 @@ static struct page *pte_get_reclaim_page
 {
 	swp_entry_t entry;
 	struct page *page;
+	int nr_page_references = 0;
 
 	/* Totally empty PTE: */
 	if (pte_none(ptent))
@@ -271,7 +272,7 @@ static struct page *pte_get_reclaim_page
 		page = vm_normal_page(vma, addr, ptent);
 		if (page)
 			get_page(page);
-		return page;
+		goto got_page;
 	}
 
 	/*
@@ -292,7 +293,33 @@ static struct page *pte_get_reclaim_page
 	 * The PTE was a true swap entry.  The page may be in
 	 * the swap cache.
 	 */
-	return lookup_swap_cache(entry, vma, addr);
+	page = lookup_swap_cache(entry, vma, addr);
+	if (!page)
+		return NULL;
+got_page:
+	/*
+	 * Account for references to the swap entry.  These
+	 * might be "upgraded" to a normal mapping at any
+	 * time.
+	 */
+	if (PageSwapCache(page))
+		nr_page_references += page_swapcount(page);
+
+	/*
+	 * Account for all mappings of the page, including
+	 * when it is in the swap cache.  This ensures that
+	 * MADV_PAGOUT not interfere with anything shared
+	 * with another process.
+	 */
+	nr_page_references += page_mapcount(page);
+
+	/* Any extra references?  Do not reclaim it. */
+	if (nr_page_references > 1) {
+		put_page(page);
+		return NULL;
+	}
+
+	return page;
 }
 
 /*
@@ -477,12 +504,6 @@ regular_page:
 			continue;
 		}
 
-		/* Do not interfere with other mappings of this page */
-		if (page_mapcount(page) != 1) {
-			put_page(page);
-			continue;
-		}
-
 		VM_BUG_ON_PAGE(PageTransCompound(page), page);
 
 		if (!is_swap_pte(ptent) && pte_young(ptent)) {
_

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] mm/madvise: help MADV_PAGEOUT to find swap cache pages
  2020-03-23 23:41 ` [PATCH 1/2] mm/madvise: help MADV_PAGEOUT to find swap cache pages Dave Hansen
@ 2020-03-26  6:24   ` Minchan Kim
  0 siblings, 0 replies; 7+ messages in thread
From: Minchan Kim @ 2020-03-26  6:24 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, mhocko, jannh, vbabka, dancol, joel, akpm

On Mon, Mar 23, 2020 at 04:41:49PM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> tl;dr: MADV_PAGEOUT ignores unmapped swap cache pages.  Enable
> MADV_PAGEOUT to find and reclaim swap cache.
> 
> The long story:
> 
> Looking for another issue, I wrote a simple test which had two
> processes: a parent and a fork()'d child.  The parent reads a
> memory buffer shared by the fork() and the child calls
> madvise(MADV_PAGEOUT) on the same buffer.
> 
> The first call to MADV_PAGEOUT does what is expected: it pages
> the memory out and causes faults in the parent.  However, after
> that, it does not cause any faults in the parent.  MADV_PAGEOUT
> only works once!  This was a surprise.
> 
> The PTEs in the shared buffer start out pte_present()==1 in
> both parent and child.  The first MADV_PAGEOUT operation replaces
> those with pte_present()==0 swap PTEs.  The parent process
> quickly faults and recreates pte_present()==1.  However, the
> child process (the one calling MADV_PAGEOUT) never touches the
> memory and has retained the non-present swap PTEs.
> 
> This situation could also happen in the case where a single
> process had some of its data placed in the swap cache but where
> the memory has not yet been reclaimed.
> 
> The MADV_PAGEOUT code has a pte_present()==0 check.  It will
> essentially ignore any pte_present()==0 pages.  This essentially
> makes unmapped swap cache immune from MADV_PAGEOUT, which is not
> very friendly behavior.
> 
> Enable MADV_PAGEOUT to find and reclaim swap cache.  Because
> swap cache is not pinned by holding the PTE lock, a reference
> must be held until the page is isolated, where a second
> reference is obtained.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Minchan Kim <minchan@kernel.org>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] mm/madvise: skip MADV_PAGEOUT on shared swap cache pages
  2020-03-23 23:41 ` [PATCH 2/2] mm/madvise: skip MADV_PAGEOUT on shared " Dave Hansen
@ 2020-03-26  6:28   ` Minchan Kim
  2020-03-26 23:00     ` Dave Hansen
  0 siblings, 1 reply; 7+ messages in thread
From: Minchan Kim @ 2020-03-26  6:28 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, mhocko, jannh, vbabka, dancol, joel, akpm

On Mon, Mar 23, 2020 at 04:41:51PM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> MADV_PAGEOUT might interfere with other processes if it is
> allowed to reclaim pages shared with other processses.  A
> previous patch tried to avoid this for anonymous pages
> which were shared by a fork().  It did this by checking
> page_mapcount().
> 
> That works great for mapped pages.  But, it can not detect
> unmapped swap cache pages.  This has not been a problem,
> until the previous patch which added the ability for
> MADV_PAGEOUT to *find* swap cache pages.
> 
> A process doing MADV_PAGEOUT which finds an unmapped swap
> cache page and evicts it might interfere with another process
> which had the same page mapped.  But, such a page would have
> a page_mapcount() of 1 since the page is only actually mapped
> in the *other* process.  The page_mapcount() test would fail
> to detect the situation.
> 
> Thankfully, there is a reference count for swap entries.
> To fix this, simply consult both page_mapcount() and the swap
> reference count via page_swapcount().
> 
> I rigged up a little test program to try to create these
> situations.  Basically, if the parent "reader" RSS changes
> in response to MADV_PAGEOUT actions in the child, there is
> a problem.
> 
> 	https://www.sr71.net/~dave/intel/madv-pageout.c
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Daniel Colascione <dancol@google.com>
> Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  b/mm/madvise.c |   37 +++++++++++++++++++++++++++++--------
>  1 file changed, 29 insertions(+), 8 deletions(-)
> 
> diff -puN mm/madvise.c~madv-pageout-ignore-shared-swap-cache mm/madvise.c
> --- a/mm/madvise.c~madv-pageout-ignore-shared-swap-cache	2020-03-23 16:30:52.022385888 -0700
> +++ b/mm/madvise.c	2020-03-23 16:41:15.448384333 -0700
> @@ -261,6 +261,7 @@ static struct page *pte_get_reclaim_page
>  {
>  	swp_entry_t entry;
>  	struct page *page;
> +	int nr_page_references = 0;

nit: just 'referenced' would be enough.

Acked-by: Minchan Kim <minchan@kernel.org>

Thanks, Dave!

_

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] mm/madvise: skip MADV_PAGEOUT on shared swap cache pages
  2020-03-26  6:28   ` Minchan Kim
@ 2020-03-26 23:00     ` Dave Hansen
  2020-03-27  6:42       ` Minchan Kim
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Hansen @ 2020-03-26 23:00 UTC (permalink / raw)
  To: Minchan Kim, Dave Hansen
  Cc: linux-kernel, mhocko, jannh, vbabka, dancol, joel, akpm

On 3/25/20 11:28 PM, Minchan Kim wrote:
>> diff -puN mm/madvise.c~madv-pageout-ignore-shared-swap-cache mm/madvise.c
>> --- a/mm/madvise.c~madv-pageout-ignore-shared-swap-cache	2020-03-23 16:30:52.022385888 -0700
>> +++ b/mm/madvise.c	2020-03-23 16:41:15.448384333 -0700
>> @@ -261,6 +261,7 @@ static struct page *pte_get_reclaim_page
>>  {
>>  	swp_entry_t entry;
>>  	struct page *page;
>> +	int nr_page_references = 0;
> nit: just 'referenced' would be enough.

I guess I could track one bit like that.  But, it would require checking
both page_mapcount() and page_swapcount() for being >1.  This way, I
just accumulate the count and have a check at a single place.

I think it ends up much simpler this way.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] mm/madvise: skip MADV_PAGEOUT on shared swap cache pages
  2020-03-26 23:00     ` Dave Hansen
@ 2020-03-27  6:42       ` Minchan Kim
  0 siblings, 0 replies; 7+ messages in thread
From: Minchan Kim @ 2020-03-27  6:42 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, linux-kernel, mhocko, jannh, vbabka, dancol, joel, akpm

On Thu, Mar 26, 2020 at 04:00:09PM -0700, Dave Hansen wrote:
> On 3/25/20 11:28 PM, Minchan Kim wrote:
> >> diff -puN mm/madvise.c~madv-pageout-ignore-shared-swap-cache mm/madvise.c
> >> --- a/mm/madvise.c~madv-pageout-ignore-shared-swap-cache	2020-03-23 16:30:52.022385888 -0700
> >> +++ b/mm/madvise.c	2020-03-23 16:41:15.448384333 -0700
> >> @@ -261,6 +261,7 @@ static struct page *pte_get_reclaim_page
> >>  {
> >>  	swp_entry_t entry;
> >>  	struct page *page;
> >> +	int nr_page_references = 0;
> > nit: just 'referenced' would be enough.
> 
> I guess I could track one bit like that.  But, it would require checking
> both page_mapcount() and page_swapcount() for being >1.  This way, I
> just accumulate the count and have a check at a single place.
> 
> I think it ends up much simpler this way.

I meant the variable name. 'referenced' would be enough for indication
like rmap.c and khugepaged.c. Anyway, it's up to you.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, back to index

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-23 23:41 [PATCH 0/2] mm/madvise: teach MADV_PAGEOUT about swap cache Dave Hansen
2020-03-23 23:41 ` [PATCH 1/2] mm/madvise: help MADV_PAGEOUT to find swap cache pages Dave Hansen
2020-03-26  6:24   ` Minchan Kim
2020-03-23 23:41 ` [PATCH 2/2] mm/madvise: skip MADV_PAGEOUT on shared " Dave Hansen
2020-03-26  6:28   ` Minchan Kim
2020-03-26 23:00     ` Dave Hansen
2020-03-27  6:42       ` Minchan Kim

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lkml.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lkml.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lkml.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lkml.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lkml.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lkml.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lkml.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lkml.kernel.org/lkml/7 lkml/git/7.git
	git clone --mirror https://lkml.kernel.org/lkml/8 lkml/git/8.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lkml.kernel.org/lkml \
		linux-kernel@vger.kernel.org
	public-inbox-index lkml

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git