LKML Archive on lore.kernel.org
From: Dave Hansen <dave.hansen@linux.intel.com>
To: linux-kernel@vger.kernel.org
Cc: Dave Hansen <dave.hansen@linux.intel.com>,
	mhocko@suse.com, jannh@google.com, vbabka@suse.cz,
	minchan@kernel.org, dancol@google.com, joel@joelfernandes.org,
	akpm@linux-foundation.org
Subject: [PATCH 1/2] mm/madvise: help MADV_PAGEOUT to find swap cache pages
Date: Mon, 23 Mar 2020 16:41:49 -0700
Message-ID: <20200323234149.9FE95081@viggo.jf.intel.com> (raw)
In-Reply-To: <20200323234147.558EBA81@viggo.jf.intel.com>


From: Dave Hansen <dave.hansen@linux.intel.com>

tl;dr: MADV_PAGEOUT ignores unmapped swap cache pages.  Enable
MADV_PAGEOUT to find and reclaim swap cache.

The long story:

Looking for another issue, I wrote a simple test with two
processes: a parent and a fork()'d child.  The parent reads a
memory buffer shared across the fork(), and the child calls
madvise(MADV_PAGEOUT) on the same buffer.

The first call to MADV_PAGEOUT does what is expected: it pages
the memory out and causes faults in the parent.  However, after
that, it does not cause any faults in the parent.  MADV_PAGEOUT
only works once!  This was a surprise.

The PTEs in the shared buffer start out pte_present()==1 in
both parent and child.  The first MADV_PAGEOUT operation replaces
those with pte_present()==0 swap PTEs.  The parent process
quickly faults and recreates pte_present()==1.  However, the
child process (the one calling MADV_PAGEOUT) never touches the
memory and has retained the non-present swap PTEs.

This situation can also arise in a single process that has had
some of its data placed in the swap cache but whose memory has
not yet been reclaimed.

The MADV_PAGEOUT code has a pte_present()==0 check and skips
any page whose PTE is non-present.  This makes unmapped swap
cache immune from MADV_PAGEOUT, which is not very friendly
behavior.

Enable MADV_PAGEOUT to find and reclaim swap cache.  Because
swap cache is not pinned by holding the PTE lock, a reference
must be held until the page is isolated, where a second
reference is obtained.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jann Horn <jannh@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
---

 b/mm/madvise.c |   68 +++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 57 insertions(+), 11 deletions(-)

diff -puN mm/madvise.c~madv-pageout-find-swap-cache mm/madvise.c
--- a/mm/madvise.c~madv-pageout-find-swap-cache	2020-03-23 16:30:48.505385896 -0700
+++ b/mm/madvise.c	2020-03-23 16:30:48.509385896 -0700
@@ -250,6 +250,52 @@ static void force_shm_swapin_readahead(s
 #endif		/* CONFIG_SWAP */
 
 /*
+ * Given a PTE, find the corresponding 'struct page'
+ * and acquire a reference.  Also handles non-present
+ * swap PTEs.
+ *
+ * Returns NULL when there is no page to reclaim.
+ */
+static struct page *pte_get_reclaim_page(struct vm_area_struct *vma,
+					 unsigned long addr, pte_t ptent)
+{
+	swp_entry_t entry;
+	struct page *page;
+
+	/* Totally empty PTE: */
+	if (pte_none(ptent))
+		return NULL;
+
+	/* Handle present or PROT_NONE ptes: */
+	if (!is_swap_pte(ptent)) {
+		page = vm_normal_page(vma, addr, ptent);
+		if (page)
+			get_page(page);
+		return page;
+	}
+
+	/*
+	 * 'ptent' is now definitely a (non-present) swap
+	 * PTE in this process.  Go look for additional
+	 * references to the swap cache.
+	 */
+
+	/*
+	 * Is it one of the "swap PTEs" that's not really
+	 * swap?  Do not try to reclaim those.
+	 */
+	entry = pte_to_swp_entry(ptent);
+	if (non_swap_entry(entry))
+		return NULL;
+
+	/*
+	 * The PTE was a true swap entry.  The page may be in
+	 * the swap cache.
+	 */
+	return lookup_swap_cache(entry, vma, addr);
+}
+
+/*
  * Schedule all required I/O operations.  Do not wait for completion.
  */
 static long madvise_willneed(struct vm_area_struct *vma,
@@ -398,13 +444,8 @@ regular_page:
 	for (; addr < end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;
 
-		if (pte_none(ptent))
-			continue;
-
-		if (!pte_present(ptent))
-			continue;
-
-		page = vm_normal_page(vma, addr, ptent);
+		/* 'page' can be mapped, in the swap cache or both */
+		page = pte_get_reclaim_page(vma, addr, ptent);
 		if (!page)
 			continue;
 
@@ -413,9 +454,10 @@ regular_page:
 		 * are sure it's worth. Split it if we are only owner.
 		 */
 		if (PageTransCompound(page)) {
-			if (page_mapcount(page) != 1)
+			if (page_mapcount(page) != 1) {
+				put_page(page);
 				break;
-			get_page(page);
+			}
 			if (!trylock_page(page)) {
 				put_page(page);
 				break;
@@ -436,12 +478,14 @@ regular_page:
 		}
 
 		/* Do not interfere with other mappings of this page */
-		if (page_mapcount(page) != 1)
+		if (page_mapcount(page) != 1) {
+			put_page(page);
 			continue;
+		}
 
 		VM_BUG_ON_PAGE(PageTransCompound(page), page);
 
-		if (pte_young(ptent)) {
+		if (!is_swap_pte(ptent) && pte_young(ptent)) {
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
 			ptent = pte_mkold(ptent);
@@ -466,6 +510,8 @@ regular_page:
 			}
 		} else
 			deactivate_page(page);
+		/* drop ref acquired in pte_get_reclaim_page() */
+		put_page(page);
 	}
 
 	arch_leave_lazy_mmu_mode();
_


Thread overview: 7+ messages
2020-03-23 23:41 [PATCH 0/2] mm/madvise: teach MADV_PAGEOUT about swap cache Dave Hansen
2020-03-23 23:41 ` Dave Hansen [this message]
2020-03-26  6:24   ` [PATCH 1/2] mm/madvise: help MADV_PAGEOUT to find swap cache pages Minchan Kim
2020-03-23 23:41 ` [PATCH 2/2] mm/madvise: skip MADV_PAGEOUT on shared " Dave Hansen
2020-03-26  6:28   ` Minchan Kim
2020-03-26 23:00     ` Dave Hansen
2020-03-27  6:42       ` Minchan Kim
