LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH 1/4] mm: free swp_entry in madvise_free
@ 2015-03-11  1:20 Minchan Kim
  2015-03-11  1:20 ` [PATCH 2/4] mm: change deactivate_page with deactivate_file_page Minchan Kim
                   ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: Minchan Kim @ 2015-03-11  1:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Johannes Weiner,
	Mel Gorman, Rik van Riel, Shaohua Li, Yalin.Wang, Minchan Kim

When I test below piece of code with 12 processes(ie, 512M * 12 = 6G consume)
on my (3G ram + 12 cpu + 8G swap, the madvise_free is siginficat slower
(ie, 2x times) than madvise_dontneed.

loop = 5;
mmap(512M);
while (loop--) {
        memset(512M);
        madvise(MADV_FREE or MADV_DONTNEED);
}

The reason is lots of swapin.

1) dontneed: 1,612 swapin
2) madvfree: 879,585 swapin

If we find hinted pages were already swapped out when syscall is called,
it's pointless to keep the swapped-out pages in pte.
Instead, let's free the cold page because swapin is more expensive
than (alloc page + zeroing).

With this patch, it reduced swapin from 879,585 to 1,878 so elapsed time

1) dontneed: 6.10user 233.50system 0:50.44elapsed
2) madvfree: 6.03user 401.17system 1:30.67elapsed
2) madvfree + below patch: 6.70user 339.14system 1:04.45elapsed

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/madvise.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 6d0fcb8..ebe692e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -274,7 +274,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	spinlock_t *ptl;
 	pte_t *pte, ptent;
 	struct page *page;
+	swp_entry_t entry;
 	unsigned long next;
+	int nr_swap = 0;
 
 	next = pmd_addr_end(addr, end);
 	if (pmd_trans_huge(*pmd)) {
@@ -293,8 +295,22 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;
 
-		if (!pte_present(ptent))
+		if (pte_none(ptent))
 			continue;
+		/*
+		 * If the pte has swp_entry, just clear page table to
+		 * prevent swap-in which is more expensive rather than
+		 * (page allocation + zeroing).
+		 */
+		if (!pte_present(ptent)) {
+			entry = pte_to_swp_entry(ptent);
+			if (non_swap_entry(entry))
+				continue;
+			nr_swap--;
+			free_swap_and_cache(entry);
+			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+			continue;
+		}
 
 		page = vm_normal_page(vma, addr, ptent);
 		if (!page)
@@ -326,6 +342,14 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		set_pte_at(mm, addr, pte, ptent);
 		tlb_remove_tlb_entry(tlb, pte, addr);
 	}
+
+	if (nr_swap) {
+		if (current->mm == mm)
+			sync_mm_rss(mm);
+
+		add_mm_counter(mm, MM_SWAPENTS, nr_swap);
+	}
+
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 next:
-- 
1.9.3


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 2/4] mm: change deactivate_page with deactivate_file_page
  2015-03-11  1:20 [PATCH 1/4] mm: free swp_entry in madvise_free Minchan Kim
@ 2015-03-11  1:20 ` Minchan Kim
  2015-03-11  1:20 ` [PATCH 3/4] mm: move lazy free pages to inactive list Minchan Kim
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 28+ messages in thread
From: Minchan Kim @ 2015-03-11  1:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Johannes Weiner,
	Mel Gorman, Rik van Riel, Shaohua Li, Yalin.Wang, Minchan Kim

Deactivate_page was born for file invalidation so it has too
specific logic for file-backed pages. So, let's change the name
of function to file-specific and yield the generic name.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/swap.h |  2 +-
 mm/swap.c            | 20 ++++++++++----------
 mm/truncate.c        |  2 +-
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7067eca..cee108c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -307,7 +307,7 @@ extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
-extern void deactivate_page(struct page *page);
+extern void deactivate_file_page(struct page *page);
 extern void swap_setup(void);
 
 extern void add_page_to_unevictable_list(struct page *page);
diff --git a/mm/swap.c b/mm/swap.c
index cd3a5e6..5b2a605 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -42,7 +42,7 @@ int page_cluster;
 
 static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
-static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
 
 /*
  * This path almost never happens for VM activity - pages are normally
@@ -743,7 +743,7 @@ void lru_cache_add_active_or_unevictable(struct page *page,
  * be write it out by flusher threads as this is much more effective
  * than the single-page writeout from reclaim.
  */
-static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 			      void *arg)
 {
 	int lru, file;
@@ -811,22 +811,22 @@ void lru_add_drain_cpu(int cpu)
 		local_irq_restore(flags);
 	}
 
-	pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+	pvec = &per_cpu(lru_deactivate_file_pvecs, cpu);
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
 
 	activate_page_drain(cpu);
 }
 
 /**
- * deactivate_page - forcefully deactivate a page
+ * deactivate_file_page - forcefully deactivate a file page
  * @page: page to deactivate
  *
  * This function hints the VM that @page is a good reclaim candidate,
  * for example if its invalidation fails due to the page being dirty
  * or under writeback.
  */
-void deactivate_page(struct page *page)
+void deactivate_file_page(struct page *page)
 {
 	/*
 	 * In a workload with many unevictable page such as mprotect, unevictable
@@ -836,11 +836,11 @@ void deactivate_page(struct page *page)
 		return;
 
 	if (likely(get_page_unless_zero(page))) {
-		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+		struct pagevec *pvec = &get_cpu_var(lru_deactivate_file_pvecs);
 
 		if (!pagevec_add(pvec, page))
-			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
-		put_cpu_var(lru_deactivate_pvecs);
+			pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
+		put_cpu_var(lru_deactivate_file_pvecs);
 	}
 }
 
@@ -872,7 +872,7 @@ void lru_add_drain_all(void)
 
 		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
 		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
+		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
 		    need_activate_page_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			schedule_work_on(cpu, work);
diff --git a/mm/truncate.c b/mm/truncate.c
index 7a9d8a3..66af903 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -490,7 +490,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 			 * of interest and try to speed up its reclaim.
 			 */
 			if (!ret)
-				deactivate_page(page);
+				deactivate_file_page(page);
 			count += ret;
 		}
 		pagevec_remove_exceptionals(&pvec);
-- 
1.9.3


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 3/4] mm: move lazy free pages to inactive list
  2015-03-11  1:20 [PATCH 1/4] mm: free swp_entry in madvise_free Minchan Kim
  2015-03-11  1:20 ` [PATCH 2/4] mm: change deactivate_page with deactivate_file_page Minchan Kim
@ 2015-03-11  1:20 ` Minchan Kim
  2015-03-11  2:14   ` Wang, Yalin
                     ` (3 more replies)
  2015-03-11  1:20 ` [PATCH 4/4] mm: make every pte dirty on do_swap_page Minchan Kim
  2015-03-19  0:46 ` [PATCH 1/4] mm: free swp_entry in madvise_free Minchan Kim
  3 siblings, 4 replies; 28+ messages in thread
From: Minchan Kim @ 2015-03-11  1:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Johannes Weiner,
	Mel Gorman, Rik van Riel, Shaohua Li, Yalin.Wang, Minchan Kim

MADV_FREE is hint that it's okay to discard pages if there is
memory pressure and we uses reclaimers(ie, kswapd and direct reclaim)
to free them so there is no worth to remain them in active anonymous LRU
so this patch moves them to inactive LRU list's head.

This means that MADV_FREE-ed pages which were living on the inactive list
are reclaimed first because they are more likely to be cold rather than
recently active pages.

A arguable issue for the approach would be whether we should put it to
head or tail in inactive list. I selected *head* because kernel cannot
make sure it's really cold or warm for every MADV_FREE usecase but
at least we know it's not *hot* so landing of inactive head would be
comprimise for various usecases.

This is fixing a suboptimal behavior of MADV_FREE when pages living on
the active list will sit there for a long time even under memory
pressure while the inactive list is reclaimed heavily. This basically
breaks the whole purpose of using MADV_FREE to help the system to free
memory which is might not be used.

Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/swap.h |  1 +
 mm/madvise.c         |  2 ++
 mm/swap.c            | 35 +++++++++++++++++++++++++++++++++++
 3 files changed, 38 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index cee108c..0428e4c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -308,6 +308,7 @@ extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
 extern void swap_setup(void);
 
 extern void add_page_to_unevictable_list(struct page *page);
diff --git a/mm/madvise.c b/mm/madvise.c
index ebe692e..22e8f0c 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -340,6 +340,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		ptent = pte_mkold(ptent);
 		ptent = pte_mkclean(ptent);
 		set_pte_at(mm, addr, pte, ptent);
+		if (PageActive(page))
+			deactivate_page(page);
 		tlb_remove_tlb_entry(tlb, pte, addr);
 	}
 
diff --git a/mm/swap.c b/mm/swap.c
index 5b2a605..393968c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -43,6 +43,7 @@ int page_cluster;
 static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
 
 /*
  * This path almost never happens for VM activity - pages are normally
@@ -789,6 +790,23 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 	update_page_reclaim_stat(lruvec, file, 0);
 }
 
+
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+			    void *arg)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		int file = page_is_file_cache(page);
+		int lru = page_lru_base_type(page);
+
+		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
+		ClearPageActive(page);
+		add_page_to_lru_list(page, lruvec, lru);
+
+		__count_vm_event(PGDEACTIVATE);
+		update_page_reclaim_stat(lruvec, file, 0);
+	}
+}
+
 /*
  * Drain pages out of the cpu's pagevecs.
  * Either "cpu" is the current CPU, and preemption has already been
@@ -815,6 +833,10 @@ void lru_add_drain_cpu(int cpu)
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
 
+	pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+	if (pagevec_count(pvec))
+		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+
 	activate_page_drain(cpu);
 }
 
@@ -844,6 +866,18 @@ void deactivate_file_page(struct page *page)
 	}
 }
 
+void deactivate_page(struct page *page)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+
+		page_cache_get(page);
+		if (!pagevec_add(pvec, page))
+			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+		put_cpu_var(lru_deactivate_pvecs);
+	}
+}
+
 void lru_add_drain(void)
 {
 	lru_add_drain_cpu(get_cpu());
@@ -873,6 +907,7 @@ void lru_add_drain_all(void)
 		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
 		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
 		    need_activate_page_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			schedule_work_on(cpu, work);
-- 
1.9.3


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 4/4] mm: make every pte dirty on do_swap_page
  2015-03-11  1:20 [PATCH 1/4] mm: free swp_entry in madvise_free Minchan Kim
  2015-03-11  1:20 ` [PATCH 2/4] mm: change deactivate_page with deactivate_file_page Minchan Kim
  2015-03-11  1:20 ` [PATCH 3/4] mm: move lazy free pages to inactive list Minchan Kim
@ 2015-03-11  1:20 ` Minchan Kim
  2015-03-30  5:22   ` Minchan Kim
                     ` (2 more replies)
  2015-03-19  0:46 ` [PATCH 1/4] mm: free swp_entry in madvise_free Minchan Kim
  3 siblings, 3 replies; 28+ messages in thread
From: Minchan Kim @ 2015-03-11  1:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Johannes Weiner,
	Mel Gorman, Rik van Riel, Shaohua Li, Yalin.Wang, Minchan Kim,
	Hugh Dickins, Cyrill Gorcunov, Pavel Emelyanov, Yalin Wang

Bascially, MADV_FREE relys on the pte dirty to decide whether
it allows VM to discard the page. However, if there is swap-in,
pte pointed out the page has no pte_dirty. So, MADV_FREE checks
PageDirty and PageSwapCache for those pages to not discard it
because swapped-in page could live on swap cache or PageDirty
when it is removed from swapcache.

The problem in here is that anonymous pages can have PageDirty if
it is removed from swapcache so that VM cannot parse those pages
as freeable even if we did madvise_free. Look at below example.

ptr = malloc();
memset(ptr);
..
heavy memory pressure -> swap-out all of pages
..
out of memory pressure so there are lots of free pages
..
var = *ptr; -> swap-in page/remove the page from swapcache. so pte_clean
               but SetPageDirty

madvise_free(ptr);
..
..
heavy memory pressure -> VM cannot discard the page by PageDirty.

PageDirty for anonymous page aims for avoiding duplicating
swapping out. In other words, if a page have swapped-in but
live swapcache(ie, !PageDirty), we could save swapout if the page
is selected as victim by VM in future because swap device have
kept previous swapped-out contents of the page.

So, rather than relying on the PG_dirty for working madvise_free,
pte_dirty is more straightforward. Inherently, swapped-out page was
pte_dirty so this patch restores the dirtiness when swap-in fault
happens so madvise_free doesn't rely on the PageDirty any more.

Cc: Hugh Dickins <hughd@google.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Reported-by: Yalin Wang <yalin.wang@sonymobile.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/madvise.c | 1 -
 mm/memory.c  | 9 +++++++--
 mm/rmap.c    | 2 +-
 mm/vmscan.c  | 3 +--
 4 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 22e8f0c..a045798 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -325,7 +325,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				continue;
 			}
 
-			ClearPageDirty(page);
 			unlock_page(page);
 		}
 
diff --git a/mm/memory.c b/mm/memory.c
index 0f96a4a..40428a5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2521,9 +2521,14 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	dec_mm_counter_fast(mm, MM_SWAPENTS);
-	pte = mk_pte(page, vma->vm_page_prot);
+
+	/*
+	 * Every page swapped-out was pte_dirty so we make pte dirty again.
+	 * MADV_FREE relies on it.
+	 */
+	pte = pte_mkdirty(mk_pte(page, vma->vm_page_prot));
 	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
-		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+		pte = maybe_mkwrite(pte, vma);
 		flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
 		exclusive = 1;
diff --git a/mm/rmap.c b/mm/rmap.c
index 47b3ba8..34c1d66 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1268,7 +1268,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 		if (flags & TTU_FREE) {
 			VM_BUG_ON_PAGE(PageSwapCache(page), page);
-			if (!dirty && !PageDirty(page)) {
+			if (!dirty) {
 				/* It's a freeable page by MADV_FREE */
 				dec_mm_counter(mm, MM_ANONPAGES);
 				goto discard;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 260c413..3357ffa 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -805,8 +805,7 @@ static enum page_references page_check_references(struct page *page,
 		return PAGEREF_KEEP;
 	}
 
-	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) &&
-			!PageDirty(page))
+	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page))
 		*freeable = true;
 
 	/* Reclaim if clean, defer dirty pages to writeback */
-- 
1.9.3


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH 3/4] mm: move lazy free pages to inactive list
  2015-03-11  1:20 ` [PATCH 3/4] mm: move lazy free pages to inactive list Minchan Kim
@ 2015-03-11  2:14   ` Wang, Yalin
  2015-03-11  4:30     ` Minchan Kim
  2015-04-01 20:38     ` Rik van Riel
  2015-03-11  9:05   ` [RFC ] mm: don't ignore file map pages for madvise_free( ) Wang, Yalin
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 28+ messages in thread
From: Wang, Yalin @ 2015-03-11  2:14 UTC (permalink / raw)
  To: 'Minchan Kim', Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Johannes Weiner,
	Mel Gorman, Rik van Riel, Shaohua Li

> -----Original Message-----
> From: Minchan Kim [mailto:minchan@kernel.org]
> Sent: Wednesday, March 11, 2015 9:21 AM
> To: Andrew Morton
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; Michal Hocko;
> Johannes Weiner; Mel Gorman; Rik van Riel; Shaohua Li; Wang, Yalin; Minchan
> Kim
> Subject: [PATCH 3/4] mm: move lazy free pages to inactive list
> 
> MADV_FREE is hint that it's okay to discard pages if there is
> memory pressure and we uses reclaimers(ie, kswapd and direct reclaim)
> to free them so there is no worth to remain them in active anonymous LRU
> so this patch moves them to inactive LRU list's head.
> 
> This means that MADV_FREE-ed pages which were living on the inactive list
> are reclaimed first because they are more likely to be cold rather than
> recently active pages.
> 
> A arguable issue for the approach would be whether we should put it to
> head or tail in inactive list. I selected *head* because kernel cannot
> make sure it's really cold or warm for every MADV_FREE usecase but
> at least we know it's not *hot* so landing of inactive head would be
> comprimise for various usecases.
> 
> This is fixing a suboptimal behavior of MADV_FREE when pages living on
> the active list will sit there for a long time even under memory
> pressure while the inactive list is reclaimed heavily. This basically
> breaks the whole purpose of using MADV_FREE to help the system to free
> memory which is might not be used.
> 
> Acked-by: Michal Hocko <mhocko@suse.cz>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  include/linux/swap.h |  1 +
>  mm/madvise.c         |  2 ++
>  mm/swap.c            | 35 +++++++++++++++++++++++++++++++++++
>  3 files changed, 38 insertions(+)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index cee108c..0428e4c 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -308,6 +308,7 @@ extern void lru_add_drain_cpu(int cpu);
>  extern void lru_add_drain_all(void);
>  extern void rotate_reclaimable_page(struct page *page);
>  extern void deactivate_file_page(struct page *page);
> +extern void deactivate_page(struct page *page);
>  extern void swap_setup(void);
> 
>  extern void add_page_to_unevictable_list(struct page *page);
> diff --git a/mm/madvise.c b/mm/madvise.c
> index ebe692e..22e8f0c 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -340,6 +340,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
> long addr,
>  		ptent = pte_mkold(ptent);
>  		ptent = pte_mkclean(ptent);
>  		set_pte_at(mm, addr, pte, ptent);
> +		if (PageActive(page))
> +			deactivate_page(page);
>  		tlb_remove_tlb_entry(tlb, pte, addr);
>  	}

I think this place should be changed like this:
  +		if (!page_referenced(page, false, NULL, NULL, NULL) && PageActive(page))
  +			deactivate_page(page);
Because we don't know if other processes are reference this page,
If it is true, don't need deactivate this page.

Thanks

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/4] mm: move lazy free pages to inactive list
  2015-03-11  2:14   ` Wang, Yalin
@ 2015-03-11  4:30     ` Minchan Kim
  2015-04-01 20:38     ` Rik van Riel
  1 sibling, 0 replies; 28+ messages in thread
From: Minchan Kim @ 2015-03-11  4:30 UTC (permalink / raw)
  To: Wang, Yalin
  Cc: Andrew Morton, linux-kernel, linux-mm, Michal Hocko,
	Johannes Weiner, Mel Gorman, Rik van Riel, Shaohua Li

On Wed, Mar 11, 2015 at 10:14:51AM +0800, Wang, Yalin wrote:
> > -----Original Message-----
> > From: Minchan Kim [mailto:minchan@kernel.org]
> > Sent: Wednesday, March 11, 2015 9:21 AM
> > To: Andrew Morton
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; Michal Hocko;
> > Johannes Weiner; Mel Gorman; Rik van Riel; Shaohua Li; Wang, Yalin; Minchan
> > Kim
> > Subject: [PATCH 3/4] mm: move lazy free pages to inactive list
> > 
> > MADV_FREE is hint that it's okay to discard pages if there is
> > memory pressure and we uses reclaimers(ie, kswapd and direct reclaim)
> > to free them so there is no worth to remain them in active anonymous LRU
> > so this patch moves them to inactive LRU list's head.
> > 
> > This means that MADV_FREE-ed pages which were living on the inactive list
> > are reclaimed first because they are more likely to be cold rather than
> > recently active pages.
> > 
> > A arguable issue for the approach would be whether we should put it to
> > head or tail in inactive list. I selected *head* because kernel cannot
> > make sure it's really cold or warm for every MADV_FREE usecase but
> > at least we know it's not *hot* so landing of inactive head would be
> > comprimise for various usecases.
> > 
> > This is fixing a suboptimal behavior of MADV_FREE when pages living on
> > the active list will sit there for a long time even under memory
> > pressure while the inactive list is reclaimed heavily. This basically
> > breaks the whole purpose of using MADV_FREE to help the system to free
> > memory which is might not be used.
> > 
> > Acked-by: Michal Hocko <mhocko@suse.cz>
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  include/linux/swap.h |  1 +
> >  mm/madvise.c         |  2 ++
> >  mm/swap.c            | 35 +++++++++++++++++++++++++++++++++++
> >  3 files changed, 38 insertions(+)
> > 
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index cee108c..0428e4c 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -308,6 +308,7 @@ extern void lru_add_drain_cpu(int cpu);
> >  extern void lru_add_drain_all(void);
> >  extern void rotate_reclaimable_page(struct page *page);
> >  extern void deactivate_file_page(struct page *page);
> > +extern void deactivate_page(struct page *page);
> >  extern void swap_setup(void);
> > 
> >  extern void add_page_to_unevictable_list(struct page *page);
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index ebe692e..22e8f0c 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -340,6 +340,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
> > long addr,
> >  		ptent = pte_mkold(ptent);
> >  		ptent = pte_mkclean(ptent);
> >  		set_pte_at(mm, addr, pte, ptent);
> > +		if (PageActive(page))
> > +			deactivate_page(page);
> >  		tlb_remove_tlb_entry(tlb, pte, addr);
> >  	}
> 
> I think this place should be changed like this:
>   +		if (!page_referenced(page, false, NULL, NULL, NULL) && PageActive(page))
>   +			deactivate_page(page);
> Because we don't know if other processes are reference this page,
> If it is true, don't need deactivate this page.

The page_referenced is too much heavy operation to do it
in madvise_free fast path.
If other processes(parent or child) referenced the page,
shrink_page_list in slow path could filter it out and
activates the page.

In addition, shared case for anon pages happens by fork mostly
so we could expect child will do exec soonish in many cases.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC ] mm: don't ignore file map pages for madvise_free( )
  2015-03-11  1:20 ` [PATCH 3/4] mm: move lazy free pages to inactive list Minchan Kim
  2015-03-11  2:14   ` Wang, Yalin
@ 2015-03-11  9:05   ` Wang, Yalin
  2015-03-11  9:47   ` [RFC] mm:do recheck for freeable page in reclaim path Wang, Yalin
  2015-03-20 22:43   ` [PATCH 3/4] mm: move lazy free pages to inactive list Andrew Morton
  3 siblings, 0 replies; 28+ messages in thread
From: Wang, Yalin @ 2015-03-11  9:05 UTC (permalink / raw)
  To: 'Minchan Kim', Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Johannes Weiner,
	Mel Gorman, Rik van Riel, Shaohua Li

Hi

I just want to explain my ideas about file map pages for madvise_free() syscall.
As the following patch,
For file map vma, there is 2 types:
1. private file map
	In this type, the pages of this vma are file map pages or anon page (when COW happened),
2. shared file map
	In this type, the pages of this vma are all file map pages.

No matter which type file map,
We can handle file map vma as the following:
If the page is file map pages,
We just clear its pte young bit(pte_mkold()),
This will have some advantages, it will make page
Reclaim path move this file map page into inactive
lru list aggressively.

If the page is anon map page, we just handle it in the
Same way as for the pages in anon vma.


---

diff --git a/mm/madvise.c b/mm/madvise.c
index 6d0fcb8..8fdc82f 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -322,7 +322,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
                ptent = ptep_get_and_clear_full(mm, addr, pte,
                                                tlb->fullmm);
                ptent = pte_mkold(ptent);
-               ptent = pte_mkclean(ptent);
+               if (PageAnon(page))
+                       ptent = pte_mkclean(ptent);
                set_pte_at(mm, addr, pte, ptent);
                tlb_remove_tlb_entry(tlb, pte, addr);
        }
@@ -364,10 +365,6 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
        if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
                return -EINVAL;

-       /* MADV_FREE works for only anon vma at the moment */
-       if (vma->vm_file)
-               return -EINVAL;
-
        start = max(vma->vm_start, start_addr);
        if (start >= vma->vm_end)
                return -EINVAL;

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC] mm:do recheck for freeable page in reclaim path
  2015-03-11  1:20 ` [PATCH 3/4] mm: move lazy free pages to inactive list Minchan Kim
  2015-03-11  2:14   ` Wang, Yalin
  2015-03-11  9:05   ` [RFC ] mm: don't ignore file map pages for madvise_free( ) Wang, Yalin
@ 2015-03-11  9:47   ` Wang, Yalin
  2015-03-20 22:43   ` [PATCH 3/4] mm: move lazy free pages to inactive list Andrew Morton
  3 siblings, 0 replies; 28+ messages in thread
From: Wang, Yalin @ 2015-03-11  9:47 UTC (permalink / raw)
  To: 'Minchan Kim',
	Andrew Morton, linux-kernel, linux-mm, Michal Hocko,
	Johannes Weiner, Mel Gorman, Rik van Riel, Shaohua Li

In reclaim path, if encounter a freeable page,
the try_to_unmap may fail, because the page's pte is
dirty, we can recheck this page as normal non-freeable page,
this means we can swap out this page into swap partition.

Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
---
 mm/vmscan.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 260c413..9930850 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1000,6 +1000,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
+recheck:
 		if (!force_reclaim)
 			references = page_check_references(page, sc,
 							&freeable);
@@ -1045,6 +1046,10 @@ unmap:
 			switch (try_to_unmap(page,
 				freeable ? TTU_FREE : ttu_flags)) {
 			case SWAP_FAIL:
+				if (freeable) {
+					freeable = false;
+					goto recheck;
+				}
 				goto activate_locked;
 			case SWAP_AGAIN:
 				goto keep_locked;
-- 
2.2.2

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/4] mm: free swp_entry in madvise_free
  2015-03-11  1:20 [PATCH 1/4] mm: free swp_entry in madvise_free Minchan Kim
                   ` (2 preceding siblings ...)
  2015-03-11  1:20 ` [PATCH 4/4] mm: make every pte dirty on do_swap_page Minchan Kim
@ 2015-03-19  0:46 ` Minchan Kim
  3 siblings, 0 replies; 28+ messages in thread
From: Minchan Kim @ 2015-03-19  0:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Johannes Weiner,
	Mel Gorman, Rik van Riel, Shaohua Li, Yalin.Wang

Bump up.

On Wed, Mar 11, 2015 at 10:20:35AM +0900, Minchan Kim wrote:
> When I test below piece of code with 12 processes(ie, 512M * 12 = 6G consume)
> on my (3G ram + 12 cpu + 8G swap, the madvise_free is siginficat slower
> (ie, 2x times) than madvise_dontneed.
> 
> loop = 5;
> mmap(512M);
> while (loop--) {
>         memset(512M);
>         madvise(MADV_FREE or MADV_DONTNEED);
> }
> 
> The reason is lots of swapin.
> 
> 1) dontneed: 1,612 swapin
> 2) madvfree: 879,585 swapin
> 
> If we find hinted pages were already swapped out when syscall is called,
> it's pointless to keep the swapped-out pages in pte.
> Instead, let's free the cold page because swapin is more expensive
> than (alloc page + zeroing).
> 
> With this patch, it reduced swapin from 879,585 to 1,878 so elapsed time
> 
> 1) dontneed: 6.10user 233.50system 0:50.44elapsed
> 2) madvfree: 6.03user 401.17system 1:30.67elapsed
> 2) madvfree + below patch: 6.70user 339.14system 1:04.45elapsed
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  mm/madvise.c | 26 +++++++++++++++++++++++++-
>  1 file changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6d0fcb8..ebe692e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -274,7 +274,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  	spinlock_t *ptl;
>  	pte_t *pte, ptent;
>  	struct page *page;
> +	swp_entry_t entry;
>  	unsigned long next;
> +	int nr_swap = 0;
>  
>  	next = pmd_addr_end(addr, end);
>  	if (pmd_trans_huge(*pmd)) {
> @@ -293,8 +295,22 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  	for (; addr != end; pte++, addr += PAGE_SIZE) {
>  		ptent = *pte;
>  
> -		if (!pte_present(ptent))
> +		if (pte_none(ptent))
>  			continue;
> +		/*
> +		 * If the pte has swp_entry, just clear page table to
> +		 * prevent swap-in which is more expensive rather than
> +		 * (page allocation + zeroing).
> +		 */
> +		if (!pte_present(ptent)) {
> +			entry = pte_to_swp_entry(ptent);
> +			if (non_swap_entry(entry))
> +				continue;
> +			nr_swap--;
> +			free_swap_and_cache(entry);
> +			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> +			continue;
> +		}
>  
>  		page = vm_normal_page(vma, addr, ptent);
>  		if (!page)
> @@ -326,6 +342,14 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  		set_pte_at(mm, addr, pte, ptent);
>  		tlb_remove_tlb_entry(tlb, pte, addr);
>  	}
> +
> +	if (nr_swap) {
> +		if (current->mm == mm)
> +			sync_mm_rss(mm);
> +
> +		add_mm_counter(mm, MM_SWAPENTS, nr_swap);
> +	}
> +
>  	arch_leave_lazy_mmu_mode();
>  	pte_unmap_unlock(pte - 1, ptl);
>  next:
> -- 
> 1.9.3
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/4] mm: move lazy free pages to inactive list
  2015-03-11  1:20 ` [PATCH 3/4] mm: move lazy free pages to inactive list Minchan Kim
                     ` (2 preceding siblings ...)
  2015-03-11  9:47   ` [RFC] mm:do recheck for freeable page in reclaim path Wang, Yalin
@ 2015-03-20 22:43   ` Andrew Morton
  2015-03-30  5:35     ` Minchan Kim
  3 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2015-03-20 22:43 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, linux-mm, Michal Hocko, Johannes Weiner,
	Mel Gorman, Rik van Riel, Shaohua Li, Yalin.Wang

On Wed, 11 Mar 2015 10:20:37 +0900 Minchan Kim <minchan@kernel.org> wrote:

> MADV_FREE is hint that it's okay to discard pages if there is
> memory pressure and we uses reclaimers(ie, kswapd and direct reclaim)
> to free them so there is no worth to remain them in active anonymous LRU
> so this patch moves them to inactive LRU list's head.
> 
> This means that MADV_FREE-ed pages which were living on the inactive list
> are reclaimed first because they are more likely to be cold rather than
> recently active pages.
> 
> A arguable issue for the approach would be whether we should put it to
> head or tail in inactive list. I selected *head* because kernel cannot
> make sure it's really cold or warm for every MADV_FREE usecase but
> at least we know it's not *hot* so landing of inactive head would be
> comprimise for various usecases.
> 
> This is fixing a suboptimal behavior of MADV_FREE when pages living on
> the active list will sit there for a long time even under memory
> pressure while the inactive list is reclaimed heavily. This basically
> breaks the whole purpose of using MADV_FREE to help the system to free
> memory which is might not be used.
> 
> @@ -789,6 +790,23 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
>  	update_page_reclaim_stat(lruvec, file, 0);
>  }
>  
> +
> +static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
> +			    void *arg)
>
> ...
>
> @@ -844,6 +866,18 @@ void deactivate_file_page(struct page *page)
>  	}
>  }
>  
> +void deactivate_page(struct page *page)
> +{

lru_deactivate_file_fn() and deactivate_file_page() are carefully
documented and lru_deactivate_fn() and deactivate_page() should
be as well.  In fact it becomes more important now that we have two
similar-looking things.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] mm: make every pte dirty on do_swap_page
  2015-03-11  1:20 ` [PATCH 4/4] mm: make every pte dirty on do_swap_page Minchan Kim
@ 2015-03-30  5:22   ` Minchan Kim
  2015-03-30  8:51     ` Cyrill Gorcunov
  2015-04-08 23:50   ` Minchan Kim
  2015-04-11 21:40   ` Hugh Dickins
  2 siblings, 1 reply; 28+ messages in thread
From: Minchan Kim @ 2015-03-30  5:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Johannes Weiner,
	Mel Gorman, Rik van Riel, Shaohua Li, Yalin.Wang, Hugh Dickins,
	Cyrill Gorcunov, Pavel Emelyanov

2nd description trial.

>From ccfc6c79634f6cec69d8fb23b0e863ebfa5b893c Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Mon, 30 Mar 2015 13:43:08 +0900
Subject: [PATCH v2] mm: make every pte dirty on do_swap_page

Bascially, MADV_FREE relys on the dirty bit in page table entry
to decide whether VM allows to discard the page or not.
IOW, if page table entry includes marked dirty bit, VM shouldn't
discard the page.

However, if swap-in by read fault happens, page table entry
point out the page doesn't have marked dirty bit so MADV_FREE
might discard the page wrongly. For avoiding the problem,
MADV_FREE did more checks with PageDirty and PageSwapCache.
It worked out because swapped-in page lives on swap cache
and since it was evicted from the swap cache, the page has
PG_dirty flag. So both page flags checks effectvely prevent
wrong discarding by MADV_FREE.

A problem in above logic is that swapped-in page has PG_dirty
since they are removed from swap cache so VM cannot consider
those pages as freeable any more alghouth madvise_free is
called in future. Look at below example for detail.

ptr = malloc();
memset(ptr);
..
..
.. heavy memory pressure so all of pages are swapped out
..
..
var = *ptr; -> a page swapped-in and removed from swapcache.
               page table doesn't mark dirty bit and page
               descriptor includes PG_dirty
..
..
madvise_free(ptr);
..
..
..
.. heavy memory pressure again.
.. In this time, VM cannot discard the page because the page
.. has *PG_dirty*

Rather than relying on the PG_dirty of page descriptor
for preventing discarding a page, dirty bit in page table is more
straightforward and simple. So, this patch makes page table dirty
bit marked whenever swap-in happens. Inherenty, page table entry
point out swapped-out page had dirty bit so I think it's no prblem.

With this, it removes complicated logic and makes freeable page
checking by madvise_free simple. Of course, we could solve
above mentioned example.

Cc: Hugh Dickins <hughd@google.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Reported-by: Yalin Wang <yalin.wang@sonymobile.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---

* From v1:
  * Rewrite description - Andrew

 mm/madvise.c |  1 -
 mm/memory.c  | 10 ++++++++--
 mm/rmap.c    |  2 +-
 mm/vmscan.c  |  3 +--
 4 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 22e8f0c..a045798 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -325,7 +325,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				continue;
 			}
 
-			ClearPageDirty(page);
 			unlock_page(page);
 		}
 
diff --git a/mm/memory.c b/mm/memory.c
index 6743966..48ff537 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2521,9 +2521,15 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	dec_mm_counter_fast(mm, MM_SWAPENTS);
-	pte = mk_pte(page, vma->vm_page_prot);
+
+	/*
+	 * The page is swapping in now was dirty before it was swapped out
+	 * so restore the state again(ie, pte_mkdirty) because MADV_FREE
+	 * relies on the dirty bit on page table.
+	 */
+	pte = pte_mkdirty(mk_pte(page, vma->vm_page_prot));
 	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
-		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+		pte = maybe_mkwrite(pte, vma);
 		flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
 		exclusive = 1;
diff --git a/mm/rmap.c b/mm/rmap.c
index dad23a4..281e806 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1275,7 +1275,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 		if (flags & TTU_FREE) {
 			VM_BUG_ON_PAGE(PageSwapCache(page), page);
-			if (!dirty && !PageDirty(page)) {
+			if (!dirty) {
 				/* It's a freeable page by MADV_FREE */
 				dec_mm_counter(mm, MM_ANONPAGES);
 				goto discard;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index dc6cd51..fffebf0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -805,8 +805,7 @@ static enum page_references page_check_references(struct page *page,
 		return PAGEREF_KEEP;
 	}
 
-	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) &&
-			!PageDirty(page))
+	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page))
 		*freeable = true;
 
 	/* Reclaim if clean, defer dirty pages to writeback */
-- 
1.9.3

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/4] mm: move lazy free pages to inactive list
  2015-03-20 22:43   ` [PATCH 3/4] mm: move lazy free pages to inactive list Andrew Morton
@ 2015-03-30  5:35     ` Minchan Kim
  2015-03-30 21:20       ` Andrew Morton
  0 siblings, 1 reply; 28+ messages in thread
From: Minchan Kim @ 2015-03-30  5:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Johannes Weiner,
	Mel Gorman, Rik van Riel, Shaohua Li, Yalin.Wang

Hello Andrew,

On Fri, Mar 20, 2015 at 03:43:58PM -0700, Andrew Morton wrote:
> On Wed, 11 Mar 2015 10:20:37 +0900 Minchan Kim <minchan@kernel.org> wrote:
> 
> > MADV_FREE is hint that it's okay to discard pages if there is
> > memory pressure and we uses reclaimers(ie, kswapd and direct reclaim)
> > to free them so there is no worth to remain them in active anonymous LRU
> > so this patch moves them to inactive LRU list's head.
> > 
> > This means that MADV_FREE-ed pages which were living on the inactive list
> > are reclaimed first because they are more likely to be cold rather than
> > recently active pages.
> > 
> > A arguable issue for the approach would be whether we should put it to
> > head or tail in inactive list. I selected *head* because kernel cannot
> > make sure it's really cold or warm for every MADV_FREE usecase but
> > at least we know it's not *hot* so landing of inactive head would be
> > comprimise for various usecases.
> > 
> > This is fixing a suboptimal behavior of MADV_FREE when pages living on
> > the active list will sit there for a long time even under memory
> > pressure while the inactive list is reclaimed heavily. This basically
> > breaks the whole purpose of using MADV_FREE to help the system to free
> > memory which is might not be used.
> > 
> > @@ -789,6 +790,23 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
> >  	update_page_reclaim_stat(lruvec, file, 0);
> >  }
> >  
> > +
> > +static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
> > +			    void *arg)
> >
> > ...
> >
> > @@ -844,6 +866,18 @@ void deactivate_file_page(struct page *page)
> >  	}
> >  }
> >  
> > +void deactivate_page(struct page *page)
> > +{
> 
> lru_deactivate_file_fn() and deactivate_file_page() are carefully
> documented and lru_deactivate_fn() and deactivate_page() should
> be as well.  In fact it becomes more important now that we have two
> similar-looking things.

Sorry, I have missed this comment.

Acutally, deactive_file_page was too specific on file-backed page
invalidation when I implemented first time. That's why it had a lot
description but deactivate_page is too general so I think short comment
is enough. :)

Here it goes.

Thanks.

>From 1dbff1d18876962e5248346b59e41014561c09ac Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Mon, 30 Mar 2015 14:30:44 +0900
Subject: [PATCH] mm: document deactivate_page

This patch adds function description for deactivate_page.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/swap.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/swap.c b/mm/swap.c
index 6b5adc7..b65fc8c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -866,6 +866,13 @@ void deactivate_file_page(struct page *page)
 	}
 }
 
+/**
+ * deactivate_page - deactivate a page
+ * @page: page to deactivate
+ *
+ * This function moves @page to inactive list if @page was on active list and
+ * was not unevictable page to accelerate to reclaim @page.
+ */
 void deactivate_page(struct page *page)
 {
 	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
-- 
1.9.3


> 
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] mm: make every pte dirty on do_swap_page
  2015-03-30  5:22   ` Minchan Kim
@ 2015-03-30  8:51     ` Cyrill Gorcunov
  2015-03-30  8:59       ` Minchan Kim
  0 siblings, 1 reply; 28+ messages in thread
From: Cyrill Gorcunov @ 2015-03-30  8:51 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michal Hocko,
	Johannes Weiner, Mel Gorman, Rik van Riel, Shaohua Li,
	Yalin.Wang, Hugh Dickins, Pavel Emelyanov

On Mon, Mar 30, 2015 at 02:22:50PM +0900, Minchan Kim wrote:
> 2nd description trial.
...
Hi Minchan, could you please point for which repo this patch,
linux-next?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] mm: make every pte dirty on do_swap_page
  2015-03-30  8:51     ` Cyrill Gorcunov
@ 2015-03-30  8:59       ` Minchan Kim
  2015-03-30 21:14         ` Cyrill Gorcunov
  0 siblings, 1 reply; 28+ messages in thread
From: Minchan Kim @ 2015-03-30  8:59 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Andrew Morton, linux-kernel, linux-mm, Michal Hocko,
	Johannes Weiner, Mel Gorman, Rik van Riel, Shaohua Li,
	Yalin.Wang, Hugh Dickins, Pavel Emelyanov

Hi Cyrill,

On Mon, Mar 30, 2015 at 11:51:12AM +0300, Cyrill Gorcunov wrote:
> On Mon, Mar 30, 2015 at 02:22:50PM +0900, Minchan Kim wrote:
> > 2nd description trial.
> ...
> Hi Minchan, could you please point for which repo this patch,
> linux-next?

It was based on v4.0-rc5-mmotm-2015-03-24-17-02.
As well, I confirmed it was applied on local-next-20150327.

Thanks.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] mm: make every pte dirty on do_swap_page
  2015-03-30  8:59       ` Minchan Kim
@ 2015-03-30 21:14         ` Cyrill Gorcunov
  2015-03-31  4:38           ` Minchan Kim
  0 siblings, 1 reply; 28+ messages in thread
From: Cyrill Gorcunov @ 2015-03-30 21:14 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michal Hocko,
	Johannes Weiner, Mel Gorman, Rik van Riel, Shaohua Li,
	Yalin.Wang, Hugh Dickins, Pavel Emelyanov

On Mon, Mar 30, 2015 at 05:59:15PM +0900, Minchan Kim wrote:
> Hi Cyrill,
> 
> On Mon, Mar 30, 2015 at 11:51:12AM +0300, Cyrill Gorcunov wrote:
> > On Mon, Mar 30, 2015 at 02:22:50PM +0900, Minchan Kim wrote:
> > > 2nd description trial.
> > ...
> > Hi Minchan, could you please point for which repo this patch,
> > linux-next?
> 
> It was based on v4.0-rc5-mmotm-2015-03-24-17-02.
> As well, I confirmed it was applied on local-next-20150327.
> 
> Thanks.

Hi Minchan! I managed to fetch mmotm and the change looks
reasonable to me. Still better to wait for review from Mel
or Hugh, maybe I miss something obvious.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/4] mm: move lazy free pages to inactive list
  2015-03-30  5:35     ` Minchan Kim
@ 2015-03-30 21:20       ` Andrew Morton
  2015-03-31  4:45         ` Minchan Kim
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2015-03-30 21:20 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, linux-mm, Michal Hocko, Johannes Weiner,
	Mel Gorman, Rik van Riel, Shaohua Li, Yalin.Wang

On Mon, 30 Mar 2015 14:35:02 +0900 Minchan Kim <minchan@kernel.org> wrote:

> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -866,6 +866,13 @@ void deactivate_file_page(struct page *page)
>  	}
>  }
>  
> +/**
> + * deactivate_page - deactivate a page
> + * @page: page to deactivate
> + *
> + * This function moves @page to inactive list if @page was on active list and
> + * was not unevictable page to accelerate to reclaim @page.
> + */
>  void deactivate_page(struct page *page)
>  {
>  	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {

Thanks.

deactivate_page() doesn't look at or alter PageReferenced().  Should it?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] mm: make every pte dirty on do_swap_page
  2015-03-30 21:14         ` Cyrill Gorcunov
@ 2015-03-31  4:38           ` Minchan Kim
  0 siblings, 0 replies; 28+ messages in thread
From: Minchan Kim @ 2015-03-31  4:38 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Andrew Morton, linux-kernel, linux-mm, Michal Hocko,
	Johannes Weiner, Mel Gorman, Rik van Riel, Shaohua Li,
	Yalin.Wang, Hugh Dickins, Pavel Emelyanov

On Tue, Mar 31, 2015 at 12:14:46AM +0300, Cyrill Gorcunov wrote:
> On Mon, Mar 30, 2015 at 05:59:15PM +0900, Minchan Kim wrote:
> > Hi Cyrill,
> > 
> > On Mon, Mar 30, 2015 at 11:51:12AM +0300, Cyrill Gorcunov wrote:
> > > On Mon, Mar 30, 2015 at 02:22:50PM +0900, Minchan Kim wrote:
> > > > 2nd description trial.
> > > ...
> > > Hi Minchan, could you please point for which repo this patch,
> > > linux-next?
> > 
> > It was based on v4.0-rc5-mmotm-2015-03-24-17-02.
> > As well, I confirmed it was applied on local-next-20150327.
> > 
> > Thanks.
> 
> Hi Minchan! I managed to fetch mmotm and the change looks
> reasonable to me. Still better to wait for review from Mel
> or Hugh, maybe I miss something obvious.

Thanks for the review!

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/4] mm: move lazy free pages to inactive list
  2015-03-30 21:20       ` Andrew Morton
@ 2015-03-31  4:45         ` Minchan Kim
  2015-03-31  5:28           ` Andrew Morton
  0 siblings, 1 reply; 28+ messages in thread
From: Minchan Kim @ 2015-03-31  4:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Johannes Weiner,
	Mel Gorman, Rik van Riel, Shaohua Li, Yalin.Wang

Hello Andrew,

On Mon, Mar 30, 2015 at 02:20:10PM -0700, Andrew Morton wrote:
> On Mon, 30 Mar 2015 14:35:02 +0900 Minchan Kim <minchan@kernel.org> wrote:
> 
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -866,6 +866,13 @@ void deactivate_file_page(struct page *page)
> >  	}
> >  }
> >  
> > +/**
> > + * deactivate_page - deactivate a page
> > + * @page: page to deactivate
> > + *
> > + * This function moves @page to inactive list if @page was on active list and
> > + * was not unevictable page to accelerate to reclaim @page.
> > + */
> >  void deactivate_page(struct page *page)
> >  {
> >  	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> 
> Thanks.
> 
> deactivate_page() doesn't look at or alter PageReferenced().  Should it?

Absolutely true. Thanks.
Here it goes.

>From 2b2c92eb73a1cceac615b9abd4c0f5f0c3395ff5 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Tue, 31 Mar 2015 13:38:46 +0900
Subject: [PATCH] mm: lru_deactivate_fn should clear PG_referenced

deactivate_page aims for accelerate for reclaiming through
moving pages from active list to inactive list so we should
clear PG_referenced for the goal.

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/swap.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/swap.c b/mm/swap.c
index b65fc8c..6b420022 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -800,6 +800,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
 
 		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
 		ClearPageActive(page);
+		ClearPageReferenced(page);
 		add_page_to_lru_list(page, lruvec, lru);
 
 		__count_vm_event(PGDEACTIVATE);
-- 
1.9.3



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/4] mm: move lazy free pages to inactive list
  2015-03-31  4:45         ` Minchan Kim
@ 2015-03-31  5:28           ` Andrew Morton
  2015-03-31  5:57             ` Minchan Kim
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2015-03-31  5:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, linux-mm, Michal Hocko, Johannes Weiner,
	Mel Gorman, Rik van Riel, Shaohua Li, Yalin.Wang

On Tue, 31 Mar 2015 13:45:25 +0900 Minchan Kim <minchan@kernel.org> wrote:
> > 
> > deactivate_page() doesn't look at or alter PageReferenced().  Should it?
> 
> Absolutely true. Thanks.
> Here it goes.
> 
> >From 2b2c92eb73a1cceac615b9abd4c0f5f0c3395ff5 Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan@kernel.org>
> Date: Tue, 31 Mar 2015 13:38:46 +0900
> Subject: [PATCH] mm: lru_deactivate_fn should clear PG_referenced
> 
> deactivate_page aims for accelerate for reclaiming through
> moving pages from active list to inactive list so we should
> clear PG_referenced for the goal.
> 
> ...
>
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -800,6 +800,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
>  
>  		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
>  		ClearPageActive(page);
> +		ClearPageReferenced(page);
>  		add_page_to_lru_list(page, lruvec, lru);
>  
>  		__count_vm_event(PGDEACTIVATE);

What if we have

	PageLRU(page) && !PageActive(page) && PageReferenced(page)

if we really want to "accelerate the reclaim of @page" then we should
clear PG_referenced there too.

(And what about page_referenced(page) :))

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/4] mm: move lazy free pages to inactive list
  2015-03-31  5:28           ` Andrew Morton
@ 2015-03-31  5:57             ` Minchan Kim
  0 siblings, 0 replies; 28+ messages in thread
From: Minchan Kim @ 2015-03-31  5:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Johannes Weiner,
	Mel Gorman, Rik van Riel, Shaohua Li, Yalin.Wang

On Mon, Mar 30, 2015 at 10:28:47PM -0700, Andrew Morton wrote:
> On Tue, 31 Mar 2015 13:45:25 +0900 Minchan Kim <minchan@kernel.org> wrote:
> > > 
> > > deactivate_page() doesn't look at or alter PageReferenced().  Should it?
> > 
> > Absolutely true. Thanks.
> > Here it goes.
> > 
> > >From 2b2c92eb73a1cceac615b9abd4c0f5f0c3395ff5 Mon Sep 17 00:00:00 2001
> > From: Minchan Kim <minchan@kernel.org>
> > Date: Tue, 31 Mar 2015 13:38:46 +0900
> > Subject: [PATCH] mm: lru_deactivate_fn should clear PG_referenced
> > 
> > deactivate_page aims for accelerate for reclaiming through
> > moving pages from active list to inactive list so we should
> > clear PG_referenced for the goal.
> > 
> > ...
> >
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -800,6 +800,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
> >  
> >  		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
> >  		ClearPageActive(page);
> > +		ClearPageReferenced(page);
> >  		add_page_to_lru_list(page, lruvec, lru);
> >  
> >  		__count_vm_event(PGDEACTIVATE);
> 
> What if we have
> 
> 	PageLRU(page) && !PageActive(page) && PageReferenced(page)
> 
> if we really want to "accelerate the reclaim of @page" then we should
> clear PG_referenced there too.

The function's name is *deactivate*_page. IOW, I think it should work
for only pages in active list, IMHO.

> 
> (And what about page_referenced(page) :))

Yes, I considered it when you mentioned PG_referenced. Now, madvise_free
clear out access bit of page table when the syscall is called so
shrink_page_list could reclaim pages easily.

Of course, we could clear access bit by page_referenced for general purpose,
not only madvise_free but it would hurt performance for madvise_free so
I'd like to leave it unless there is a need for the function.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/4] mm: move lazy free pages to inactive list
  2015-03-11  2:14   ` Wang, Yalin
  2015-03-11  4:30     ` Minchan Kim
@ 2015-04-01 20:38     ` Rik van Riel
  1 sibling, 0 replies; 28+ messages in thread
From: Rik van Riel @ 2015-04-01 20:38 UTC (permalink / raw)
  To: Wang, Yalin, 'Minchan Kim', Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Johannes Weiner,
	Mel Gorman, Shaohua Li

On 03/10/2015 10:14 PM, Wang, Yalin wrote:
>> -----Original Message-----
>> From: Minchan Kim [mailto:minchan@kernel.org]
>> Sent: Wednesday, March 11, 2015 9:21 AM
>> To: Andrew Morton
>> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; Michal Hocko;
>> Johannes Weiner; Mel Gorman; Rik van Riel; Shaohua Li; Wang, Yalin; Minchan
>> Kim
>> Subject: [PATCH 3/4] mm: move lazy free pages to inactive list
>>
>> MADV_FREE is hint that it's okay to discard pages if there is
>> memory pressure and we uses reclaimers(ie, kswapd and direct reclaim)
>> to free them so there is no worth to remain them in active anonymous LRU
>> so this patch moves them to inactive LRU list's head.
>>
>> This means that MADV_FREE-ed pages which were living on the inactive list
>> are reclaimed first because they are more likely to be cold rather than
>> recently active pages.
>>
>> A arguable issue for the approach would be whether we should put it to
>> head or tail in inactive list. I selected *head* because kernel cannot
>> make sure it's really cold or warm for every MADV_FREE usecase but
>> at least we know it's not *hot* so landing of inactive head would be
>> comprimise for various usecases.
>>
>> This is fixing a suboptimal behavior of MADV_FREE when pages living on
>> the active list will sit there for a long time even under memory
>> pressure while the inactive list is reclaimed heavily. This basically
>> breaks the whole purpose of using MADV_FREE to help the system to free
>> memory which is might not be used.
>>
>> Acked-by: Michal Hocko <mhocko@suse.cz>
>> Signed-off-by: Minchan Kim <minchan@kernel.org>
>> ---
>>  include/linux/swap.h |  1 +
>>  mm/madvise.c         |  2 ++
>>  mm/swap.c            | 35 +++++++++++++++++++++++++++++++++++
>>  3 files changed, 38 insertions(+)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index cee108c..0428e4c 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -308,6 +308,7 @@ extern void lru_add_drain_cpu(int cpu);
>>  extern void lru_add_drain_all(void);
>>  extern void rotate_reclaimable_page(struct page *page);
>>  extern void deactivate_file_page(struct page *page);
>> +extern void deactivate_page(struct page *page);
>>  extern void swap_setup(void);
>>
>>  extern void add_page_to_unevictable_list(struct page *page);
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index ebe692e..22e8f0c 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -340,6 +340,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>> long addr,
>>  		ptent = pte_mkold(ptent);
>>  		ptent = pte_mkclean(ptent);
>>  		set_pte_at(mm, addr, pte, ptent);
>> +		if (PageActive(page))
>> +			deactivate_page(page);
>>  		tlb_remove_tlb_entry(tlb, pte, addr);
>>  	}
> 
> I think this place should be changed like this:
>   +		if (!page_referenced(page, false, NULL, NULL, NULL) && PageActive(page))
>   +			deactivate_page(page);
> Because we don't know if other processes are reference this page,
> If it is true, don't need deactivate this page.

We never clear the page and pte referenced bits on an active
page, that is only done when the page is moved to the inactive
list through LRU movement.

In other words, the page_referenced() check will return true
most of the time, even if the page was last referenced half
an hour ago (but there was no memory pressure).

Minchan's code looks correct.

The code may even want a ClearPageReferenced(page) in there...


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] mm: make every pte dirty on do_swap_page
  2015-03-11  1:20 ` [PATCH 4/4] mm: make every pte dirty on do_swap_page Minchan Kim
  2015-03-30  5:22   ` Minchan Kim
@ 2015-04-08 23:50   ` Minchan Kim
  2015-04-09 20:59     ` Andrew Morton
  2015-04-11 21:40   ` Hugh Dickins
  2 siblings, 1 reply; 28+ messages in thread
From: Minchan Kim @ 2015-04-08 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Johannes Weiner,
	Mel Gorman, Rik van Riel, Shaohua Li, Yalin.Wang, Hugh Dickins,
	Cyrill Gorcunov, Pavel Emelyanov

Bump.

On Wed, Mar 11, 2015 at 10:20:38AM +0900, Minchan Kim wrote:
> Bascially, MADV_FREE relys on the pte dirty to decide whether
> it allows VM to discard the page. However, if there is swap-in,
> pte pointed out the page has no pte_dirty. So, MADV_FREE checks
> PageDirty and PageSwapCache for those pages to not discard it
> because swapped-in page could live on swap cache or PageDirty
> when it is removed from swapcache.
> 
> The problem in here is that anonymous pages can have PageDirty if
> it is removed from swapcache so that VM cannot parse those pages
> as freeable even if we did madvise_free. Look at below example.
> 
> ptr = malloc();
> memset(ptr);
> ..
> heavy memory pressure -> swap-out all of pages
> ..
> out of memory pressure so there are lots of free pages
> ..
> var = *ptr; -> swap-in page/remove the page from swapcache. so pte_clean
>                but SetPageDirty
> 
> madvise_free(ptr);
> ..
> ..
> heavy memory pressure -> VM cannot discard the page by PageDirty.
> 
> PageDirty for anonymous page aims for avoiding duplicating
> swapping out. In other words, if a page have swapped-in but
> live swapcache(ie, !PageDirty), we could save swapout if the page
> is selected as victim by VM in future because swap device have
> kept previous swapped-out contents of the page.
> 
> So, rather than relying on the PG_dirty for working madvise_free,
> pte_dirty is more straightforward. Inherently, swapped-out page was
> pte_dirty so this patch restores the dirtiness when swap-in fault
> happens so madvise_free doesn't rely on the PageDirty any more.
> 
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Cyrill Gorcunov <gorcunov@gmail.com>
> Cc: Pavel Emelyanov <xemul@parallels.com>
> Reported-by: Yalin Wang <yalin.wang@sonymobile.com>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  mm/madvise.c | 1 -
>  mm/memory.c  | 9 +++++++--
>  mm/rmap.c    | 2 +-
>  mm/vmscan.c  | 3 +--
>  4 files changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 22e8f0c..a045798 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -325,7 +325,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  				continue;
>  			}
>  
> -			ClearPageDirty(page);
>  			unlock_page(page);
>  		}
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 0f96a4a..40428a5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2521,9 +2521,14 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  
>  	inc_mm_counter_fast(mm, MM_ANONPAGES);
>  	dec_mm_counter_fast(mm, MM_SWAPENTS);
> -	pte = mk_pte(page, vma->vm_page_prot);
> +
> +	/*
> +	 * Every page swapped-out was pte_dirty so we make pte dirty again.
> +	 * MADV_FREE relies on it.
> +	 */
> +	pte = pte_mkdirty(mk_pte(page, vma->vm_page_prot));
>  	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
> -		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> +		pte = maybe_mkwrite(pte, vma);
>  		flags &= ~FAULT_FLAG_WRITE;
>  		ret |= VM_FAULT_WRITE;
>  		exclusive = 1;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 47b3ba8..34c1d66 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1268,7 +1268,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  
>  		if (flags & TTU_FREE) {
>  			VM_BUG_ON_PAGE(PageSwapCache(page), page);
> -			if (!dirty && !PageDirty(page)) {
> +			if (!dirty) {
>  				/* It's a freeable page by MADV_FREE */
>  				dec_mm_counter(mm, MM_ANONPAGES);
>  				goto discard;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 260c413..3357ffa 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -805,8 +805,7 @@ static enum page_references page_check_references(struct page *page,
>  		return PAGEREF_KEEP;
>  	}
>  
> -	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) &&
> -			!PageDirty(page))
> +	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page))
>  		*freeable = true;
>  
>  	/* Reclaim if clean, defer dirty pages to writeback */
> -- 
> 1.9.3
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] mm: make every pte dirty on do_swap_page
  2015-04-08 23:50   ` Minchan Kim
@ 2015-04-09 20:59     ` Andrew Morton
  2015-04-10  0:08       ` Minchan Kim
  2015-04-10  0:14       ` Rik van Riel
  0 siblings, 2 replies; 28+ messages in thread
From: Andrew Morton @ 2015-04-09 20:59 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, linux-mm, Michal Hocko, Johannes Weiner,
	Mel Gorman, Rik van Riel, Shaohua Li, Yalin.Wang, Hugh Dickins,
	Cyrill Gorcunov, Pavel Emelyanov

On Thu, 9 Apr 2015 08:50:25 +0900 Minchan Kim <minchan@kernel.org> wrote:

> Bump.

I'm getting the feeling that MADV_FREE is out of control.

Below is the overall rollup of

mm-support-madvisemadv_free.patch
mm-support-madvisemadv_free-fix.patch
mm-support-madvisemadv_free-fix-2.patch
mm-dont-split-thp-page-when-syscall-is-called.patch
mm-dont-split-thp-page-when-syscall-is-called-fix.patch
mm-dont-split-thp-page-when-syscall-is-called-fix-2.patch
mm-free-swp_entry-in-madvise_free.patch
mm-move-lazy-free-pages-to-inactive-list.patch
mm-move-lazy-free-pages-to-inactive-list-fix.patch
mm-move-lazy-free-pages-to-inactive-list-fix-fix.patch
mm-move-lazy-free-pages-to-inactive-list-fix-fix-fix.patch
mm-make-every-pte-dirty-on-do_swap_page.patch


It's pretty large and has its sticky little paws in all sorts of places.


The feature would need to be pretty darn useful to justify a mainline
merge.  Has any such usefulness been demonstrated?




 arch/alpha/include/uapi/asm/mman.h     |    1 
 arch/mips/include/uapi/asm/mman.h      |    1 
 arch/parisc/include/uapi/asm/mman.h    |    1 
 arch/xtensa/include/uapi/asm/mman.h    |    1 
 include/linux/huge_mm.h                |    4 
 include/linux/rmap.h                   |    9 -
 include/linux/swap.h                   |    1 
 include/linux/vm_event_item.h          |    1 
 include/uapi/asm-generic/mman-common.h |    1 
 mm/huge_memory.c                       |   35 ++++
 mm/madvise.c                           |  175 +++++++++++++++++++++++
 mm/memory.c                            |   10 +
 mm/rmap.c                              |   46 +++++-
 mm/swap.c                              |   44 +++++
 mm/vmscan.c                            |   63 ++++++--
 mm/vmstat.c                            |    1 
 16 files changed, 372 insertions(+), 22 deletions(-)

diff -puN include/linux/rmap.h~mm-support-madvisemadv_free include/linux/rmap.h
--- a/include/linux/rmap.h~mm-support-madvisemadv_free
+++ a/include/linux/rmap.h
@@ -85,6 +85,7 @@ enum ttu_flags {
 	TTU_UNMAP = 1,			/* unmap mode */
 	TTU_MIGRATION = 2,		/* migration mode */
 	TTU_MUNLOCK = 4,		/* munlock mode */
+	TTU_FREE = 8,			/* free mode */
 
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
@@ -183,7 +184,8 @@ static inline void page_dup_rmap(struct
  * Called from mm/vmscan.c to handle paging out
  */
 int page_referenced(struct page *, int is_locked,
-			struct mem_cgroup *memcg, unsigned long *vm_flags);
+			struct mem_cgroup *memcg, unsigned long *vm_flags,
+			int *is_pte_dirty);
 
 #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
 
@@ -260,9 +262,12 @@ int rmap_walk(struct page *page, struct
 
 static inline int page_referenced(struct page *page, int is_locked,
 				  struct mem_cgroup *memcg,
-				  unsigned long *vm_flags)
+				  unsigned long *vm_flags,
+				  int *is_pte_dirty)
 {
 	*vm_flags = 0;
+	if (is_pte_dirty)
+		*is_pte_dirty = 0;
 	return 0;
 }
 
diff -puN include/linux/vm_event_item.h~mm-support-madvisemadv_free include/linux/vm_event_item.h
--- a/include/linux/vm_event_item.h~mm-support-madvisemadv_free
+++ a/include/linux/vm_event_item.h
@@ -25,6 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
 		FOR_ALL_ZONES(PGALLOC),
 		PGFREE, PGACTIVATE, PGDEACTIVATE,
 		PGFAULT, PGMAJFAULT,
+		PGLAZYFREED,
 		FOR_ALL_ZONES(PGREFILL),
 		FOR_ALL_ZONES(PGSTEAL_KSWAPD),
 		FOR_ALL_ZONES(PGSTEAL_DIRECT),
diff -puN include/uapi/asm-generic/mman-common.h~mm-support-madvisemadv_free include/uapi/asm-generic/mman-common.h
--- a/include/uapi/asm-generic/mman-common.h~mm-support-madvisemadv_free
+++ a/include/uapi/asm-generic/mman-common.h
@@ -34,6 +34,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* free pages only if memory pressure */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
diff -puN mm/madvise.c~mm-support-madvisemadv_free mm/madvise.c
--- a/mm/madvise.c~mm-support-madvisemadv_free
+++ a/mm/madvise.c
@@ -19,6 +19,14 @@
 #include <linux/blkdev.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/mmu_notifier.h>
+
+#include <asm/tlb.h>
+
+struct madvise_free_private {
+	struct vm_area_struct *vma;
+	struct mmu_gather *tlb;
+};
 
 /*
  * Any behaviour which results in changes to the vma->vm_flags needs to
@@ -31,6 +39,7 @@ static int madvise_need_mmap_write(int b
 	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
+	case MADV_FREE:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -254,6 +263,163 @@ static long madvise_willneed(struct vm_a
 	return 0;
 }
 
+static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
+
+{
+	struct madvise_free_private *fp = walk->private;
+	struct mmu_gather *tlb = fp->tlb;
+	struct mm_struct *mm = tlb->mm;
+	struct vm_area_struct *vma = fp->vma;
+	spinlock_t *ptl;
+	pte_t *pte, ptent;
+	struct page *page;
+	swp_entry_t entry;
+	unsigned long next;
+	int nr_swap = 0;
+
+	next = pmd_addr_end(addr, end);
+	if (pmd_trans_huge(*pmd)) {
+		if (next - addr != HPAGE_PMD_SIZE)
+			split_huge_page_pmd(vma, addr, pmd);
+		else if (!madvise_free_huge_pmd(tlb, vma, pmd, addr))
+			goto next;
+		/* fall through */
+	}
+
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	arch_enter_lazy_mmu_mode();
+	for (; addr != end; pte++, addr += PAGE_SIZE) {
+		ptent = *pte;
+
+		if (pte_none(ptent))
+			continue;
+		/*
+		 * If the pte has swp_entry, just clear page table to
+		 * prevent swap-in which is more expensive rather than
+		 * (page allocation + zeroing).
+		 */
+		if (!pte_present(ptent)) {
+			entry = pte_to_swp_entry(ptent);
+			if (non_swap_entry(entry))
+				continue;
+			nr_swap--;
+			free_swap_and_cache(entry);
+			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+			continue;
+		}
+
+		page = vm_normal_page(vma, addr, ptent);
+		if (!page)
+			continue;
+
+		if (PageSwapCache(page)) {
+			if (!trylock_page(page))
+				continue;
+
+			if (!try_to_free_swap(page)) {
+				unlock_page(page);
+				continue;
+			}
+
+			unlock_page(page);
+		}
+
+		/*
+		 * Some of architecture(ex, PPC) don't update TLB
+		 * with set_pte_at and tlb_remove_tlb_entry so for
+		 * the portability, remap the pte with old|clean
+		 * after pte clearing.
+		 */
+		ptent = ptep_get_and_clear_full(mm, addr, pte,
+						tlb->fullmm);
+		ptent = pte_mkold(ptent);
+		ptent = pte_mkclean(ptent);
+		set_pte_at(mm, addr, pte, ptent);
+		if (PageActive(page))
+			deactivate_page(page);
+		tlb_remove_tlb_entry(tlb, pte, addr);
+	}
+
+	if (nr_swap) {
+		if (current->mm == mm)
+			sync_mm_rss(mm);
+
+		add_mm_counter(mm, MM_SWAPENTS, nr_swap);
+	}
+
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(pte - 1, ptl);
+next:
+	cond_resched();
+	return 0;
+}
+
+static void madvise_free_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end)
+{
+	struct madvise_free_private fp = {
+		.vma = vma,
+		.tlb = tlb,
+	};
+
+	struct mm_walk free_walk = {
+		.pmd_entry = madvise_free_pte_range,
+		.mm = vma->vm_mm,
+		.private = &fp,
+	};
+
+	BUG_ON(addr >= end);
+	tlb_start_vma(tlb, vma);
+	walk_page_range(addr, end, &free_walk);
+	tlb_end_vma(tlb, vma);
+}
+
+static int madvise_free_single_vma(struct vm_area_struct *vma,
+			unsigned long start_addr, unsigned long end_addr)
+{
+	unsigned long start, end;
+	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_gather tlb;
+
+	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+		return -EINVAL;
+
+	/* MADV_FREE works for only anon vma at the moment */
+	if (vma->vm_file)
+		return -EINVAL;
+
+	start = max(vma->vm_start, start_addr);
+	if (start >= vma->vm_end)
+		return -EINVAL;
+	end = min(vma->vm_end, end_addr);
+	if (end <= vma->vm_start)
+		return -EINVAL;
+
+	lru_add_drain();
+	tlb_gather_mmu(&tlb, mm, start, end);
+	update_hiwater_rss(mm);
+
+	mmu_notifier_invalidate_range_start(mm, start, end);
+	madvise_free_page_range(&tlb, vma, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end);
+	tlb_finish_mmu(&tlb, start, end);
+
+	return 0;
+}
+
+static long madvise_free(struct vm_area_struct *vma,
+			     struct vm_area_struct **prev,
+			     unsigned long start, unsigned long end)
+{
+	*prev = vma;
+	return madvise_free_single_vma(vma, start, end);
+}
+
 /*
  * Application no longer needs these pages.  If the pages are dirty,
  * it's OK to just throw them away.  The app will be more careful about
@@ -377,6 +543,14 @@ madvise_vma(struct vm_area_struct *vma,
 		return madvise_remove(vma, prev, start, end);
 	case MADV_WILLNEED:
 		return madvise_willneed(vma, prev, start, end);
+	case MADV_FREE:
+		/*
+		 * XXX: In this implementation, MADV_FREE works like
+		 * MADV_DONTNEED on swapless system or full swap.
+		 */
+		if (get_nr_swap_pages() > 0)
+			return madvise_free(vma, prev, start, end);
+		/* passthrough */
 	case MADV_DONTNEED:
 		return madvise_dontneed(vma, prev, start, end);
 	default:
@@ -396,6 +570,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
+	case MADV_FREE:
 #ifdef CONFIG_KSM
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
diff -puN mm/rmap.c~mm-support-madvisemadv_free mm/rmap.c
--- a/mm/rmap.c~mm-support-madvisemadv_free
+++ a/mm/rmap.c
@@ -712,6 +712,7 @@ int page_mapped_in_vma(struct page *page
 }
 
 struct page_referenced_arg {
+	int dirtied;
 	int mapcount;
 	int referenced;
 	unsigned long vm_flags;
@@ -726,6 +727,7 @@ static int page_referenced_one(struct pa
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *ptl;
 	int referenced = 0;
+	int dirty = 0;
 	struct page_referenced_arg *pra = arg;
 
 	if (unlikely(PageTransHuge(page))) {
@@ -749,6 +751,15 @@ static int page_referenced_one(struct pa
 		/* go ahead even if the pmd is pmd_trans_splitting() */
 		if (pmdp_clear_flush_young_notify(vma, address, pmd))
 			referenced++;
+
+		/*
+		 * Use pmd_freeable instead of raw pmd_dirty because in some
+		 * of architecture, pmd_dirty is not defined unless
+		 * CONFIG_TRANSPARENT_HUGEPAGE is enabled
+		 */
+		if (!pmd_freeable(*pmd))
+			dirty++;
+
 		spin_unlock(ptl);
 	} else {
 		pte_t *pte;
@@ -778,6 +789,10 @@ static int page_referenced_one(struct pa
 			if (likely(!(vma->vm_flags & VM_SEQ_READ)))
 				referenced++;
 		}
+
+		if (pte_dirty(*pte))
+			dirty++;
+
 		pte_unmap_unlock(pte, ptl);
 	}
 
@@ -786,6 +801,9 @@ static int page_referenced_one(struct pa
 		pra->vm_flags |= vma->vm_flags;
 	}
 
+	if (dirty)
+		pra->dirtied++;
+
 	pra->mapcount--;
 	if (!pra->mapcount)
 		return SWAP_SUCCESS; /* To break the loop */
@@ -810,6 +828,7 @@ static bool invalid_page_referenced_vma(
  * @is_locked: caller holds lock on the page
  * @memcg: target memory cgroup
  * @vm_flags: collect encountered vma->vm_flags who actually referenced the page
+ * @is_pte_dirty: ptes which have marked dirty bit - used for lazyfree page
  *
  * Quick test_and_clear_referenced for all mappings to a page,
  * returns the number of ptes which referenced the page.
@@ -817,7 +836,8 @@ static bool invalid_page_referenced_vma(
 int page_referenced(struct page *page,
 		    int is_locked,
 		    struct mem_cgroup *memcg,
-		    unsigned long *vm_flags)
+		    unsigned long *vm_flags,
+		    int *is_pte_dirty)
 {
 	int ret;
 	int we_locked = 0;
@@ -832,6 +852,9 @@ int page_referenced(struct page *page,
 	};
 
 	*vm_flags = 0;
+	if (is_pte_dirty)
+		*is_pte_dirty = 0;
+
 	if (!page_mapped(page))
 		return 0;
 
@@ -859,6 +882,9 @@ int page_referenced(struct page *page,
 	if (we_locked)
 		unlock_page(page);
 
+	if (is_pte_dirty)
+		*is_pte_dirty = pra.dirtied;
+
 	return pra.referenced;
 }
 
@@ -1187,6 +1213,7 @@ static int try_to_unmap_one(struct page
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
 	enum ttu_flags flags = (enum ttu_flags)arg;
+	int dirty = 0;
 
 	pte = page_check_address(page, mm, address, &ptl, 0);
 	if (!pte)
@@ -1216,7 +1243,8 @@ static int try_to_unmap_one(struct page
 	pteval = ptep_clear_flush(vma, address, pte);
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
-	if (pte_dirty(pteval))
+	dirty = pte_dirty(pteval);
+	if (dirty)
 		set_page_dirty(page);
 
 	/* Update high watermark before we lower rss */
@@ -1245,6 +1273,19 @@ static int try_to_unmap_one(struct page
 		swp_entry_t entry = { .val = page_private(page) };
 		pte_t swp_pte;
 
+		if (flags & TTU_FREE) {
+			VM_BUG_ON_PAGE(PageSwapCache(page), page);
+			if (!dirty) {
+				/* It's a freeable page by MADV_FREE */
+				dec_mm_counter(mm, MM_ANONPAGES);
+				goto discard;
+			} else {
+				set_pte_at(mm, address, pte, pteval);
+				ret = SWAP_FAIL;
+				goto out_unmap;
+			}
+		}
+
 		if (PageSwapCache(page)) {
 			/*
 			 * Store the swap location in the pte.
@@ -1285,6 +1326,7 @@ static int try_to_unmap_one(struct page
 	} else
 		dec_mm_counter(mm, MM_FILEPAGES);
 
+discard:
 	page_remove_rmap(page);
 	page_cache_release(page);
 
diff -puN mm/vmscan.c~mm-support-madvisemadv_free mm/vmscan.c
--- a/mm/vmscan.c~mm-support-madvisemadv_free
+++ a/mm/vmscan.c
@@ -754,13 +754,17 @@ enum page_references {
 };
 
 static enum page_references page_check_references(struct page *page,
-						  struct scan_control *sc)
+						  struct scan_control *sc,
+						  bool *freeable)
 {
 	int referenced_ptes, referenced_page;
 	unsigned long vm_flags;
+	int pte_dirty;
+
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
 	referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
-					  &vm_flags);
+					  &vm_flags, &pte_dirty);
 	referenced_page = TestClearPageReferenced(page);
 
 	/*
@@ -801,6 +805,9 @@ static enum page_references page_check_r
 		return PAGEREF_KEEP;
 	}
 
+	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page))
+		*freeable = true;
+
 	/* Reclaim if clean, defer dirty pages to writeback */
 	if (referenced_page && !PageSwapBacked(page))
 		return PAGEREF_RECLAIM_CLEAN;
@@ -869,6 +876,7 @@ static unsigned long shrink_page_list(st
 		int may_enter_fs;
 		enum page_references references = PAGEREF_RECLAIM_CLEAN;
 		bool dirty, writeback;
+		bool freeable = false;
 
 		cond_resched();
 
@@ -992,7 +1000,8 @@ static unsigned long shrink_page_list(st
 		}
 
 		if (!force_reclaim)
-			references = page_check_references(page, sc);
+			references = page_check_references(page, sc,
+							&freeable);
 
 		switch (references) {
 		case PAGEREF_ACTIVATE:
@@ -1009,22 +1018,31 @@ static unsigned long shrink_page_list(st
 		 * Try to allocate it some swap space here.
 		 */
 		if (PageAnon(page) && !PageSwapCache(page)) {
-			if (!(sc->gfp_mask & __GFP_IO))
-				goto keep_locked;
-			if (!add_to_swap(page, page_list))
-				goto activate_locked;
-			may_enter_fs = 1;
-
-			/* Adding to swap updated mapping */
-			mapping = page_mapping(page);
+			if (!freeable) {
+				if (!(sc->gfp_mask & __GFP_IO))
+					goto keep_locked;
+				if (!add_to_swap(page, page_list))
+					goto activate_locked;
+				may_enter_fs = 1;
+				/* Adding to swap updated mapping */
+				mapping = page_mapping(page);
+			} else {
+				if (likely(!PageTransHuge(page)))
+					goto unmap;
+				/* try_to_unmap isn't aware of THP page */
+				if (unlikely(split_huge_page_to_list(page,
+								page_list)))
+					goto keep_locked;
+			}
 		}
-
+unmap:
 		/*
 		 * The page is mapped into the page tables of one or more
 		 * processes. Try to unmap it here.
 		 */
-		if (page_mapped(page) && mapping) {
-			switch (try_to_unmap(page, ttu_flags)) {
+		if (page_mapped(page) && (mapping || freeable)) {
+			switch (try_to_unmap(page,
+				freeable ? TTU_FREE : ttu_flags)) {
 			case SWAP_FAIL:
 				goto activate_locked;
 			case SWAP_AGAIN:
@@ -1032,7 +1050,20 @@ static unsigned long shrink_page_list(st
 			case SWAP_MLOCK:
 				goto cull_mlocked;
 			case SWAP_SUCCESS:
-				; /* try to free the page below */
+				/* try to free the page below */
+				if (!freeable)
+					break;
+				/*
+				 * Freeable anon page doesn't have mapping
+				 * due to skipping of swapcache so we free
+				 * page in here rather than __remove_mapping.
+				 */
+				VM_BUG_ON_PAGE(PageSwapCache(page), page);
+				if (!page_freeze_refs(page, 1))
+					goto keep_locked;
+				__ClearPageLocked(page);
+				count_vm_event(PGLAZYFREED);
+				goto free_it;
 			}
 		}
 
@@ -1789,7 +1820,7 @@ static void shrink_active_list(unsigned
 		}
 
 		if (page_referenced(page, 0, sc->target_mem_cgroup,
-				    &vm_flags)) {
+				    &vm_flags, NULL)) {
 			nr_rotated += hpage_nr_pages(page);
 			/*
 			 * Identify referenced, file-backed active pages and
diff -puN mm/vmstat.c~mm-support-madvisemadv_free mm/vmstat.c
--- a/mm/vmstat.c~mm-support-madvisemadv_free
+++ a/mm/vmstat.c
@@ -759,6 +759,7 @@ const char * const vmstat_text[] = {
 
 	"pgfault",
 	"pgmajfault",
+	"pglazyfreed",
 
 	TEXTS_FOR_ZONES("pgrefill")
 	TEXTS_FOR_ZONES("pgsteal_kswapd")
diff -puN arch/alpha/include/uapi/asm/mman.h~mm-support-madvisemadv_free arch/alpha/include/uapi/asm/mman.h
--- a/arch/alpha/include/uapi/asm/mman.h~mm-support-madvisemadv_free
+++ a/arch/alpha/include/uapi/asm/mman.h
@@ -44,6 +44,7 @@
 #define MADV_WILLNEED	3		/* will need these pages */
 #define	MADV_SPACEAVAIL	5		/* ensure resources are available */
 #define MADV_DONTNEED	6		/* don't need these pages */
+#define MADV_FREE	7		/* free pages only if memory pressure */
 
 /* common/generic parameters */
 #define MADV_REMOVE	9		/* remove these pages & resources */
diff -puN arch/mips/include/uapi/asm/mman.h~mm-support-madvisemadv_free arch/mips/include/uapi/asm/mman.h
--- a/arch/mips/include/uapi/asm/mman.h~mm-support-madvisemadv_free
+++ a/arch/mips/include/uapi/asm/mman.h
@@ -67,6 +67,7 @@
 #define MADV_SEQUENTIAL 2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* free pages only if memory pressure */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
diff -puN arch/parisc/include/uapi/asm/mman.h~mm-support-madvisemadv_free arch/parisc/include/uapi/asm/mman.h
--- a/arch/parisc/include/uapi/asm/mman.h~mm-support-madvisemadv_free
+++ a/arch/parisc/include/uapi/asm/mman.h
@@ -40,6 +40,7 @@
 #define MADV_SPACEAVAIL 5               /* insure that resources are reserved */
 #define MADV_VPS_PURGE  6               /* Purge pages from VM page cache */
 #define MADV_VPS_INHERIT 7              /* Inherit parents page size */
+#define MADV_FREE	8		/* free pages only if memory pressure */
 
 /* common/generic parameters */
 #define MADV_REMOVE	9		/* remove these pages & resources */
diff -puN arch/xtensa/include/uapi/asm/mman.h~mm-support-madvisemadv_free arch/xtensa/include/uapi/asm/mman.h
--- a/arch/xtensa/include/uapi/asm/mman.h~mm-support-madvisemadv_free
+++ a/arch/xtensa/include/uapi/asm/mman.h
@@ -80,6 +80,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* free pages only if memory pressure */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
diff -puN include/linux/huge_mm.h~mm-support-madvisemadv_free include/linux/huge_mm.h
--- a/include/linux/huge_mm.h~mm-support-madvisemadv_free
+++ a/include/linux/huge_mm.h
@@ -19,6 +19,9 @@ extern struct page *follow_trans_huge_pm
 					  unsigned long addr,
 					  pmd_t *pmd,
 					  unsigned int flags);
+extern int madvise_free_huge_pmd(struct mmu_gather *tlb,
+			struct vm_area_struct *vma,
+			pmd_t *pmd, unsigned long addr);
 extern int zap_huge_pmd(struct mmu_gather *tlb,
 			struct vm_area_struct *vma,
 			pmd_t *pmd, unsigned long addr);
@@ -56,6 +59,7 @@ extern pmd_t *page_check_address_pmd(str
 				     unsigned long address,
 				     enum page_check_address_pmd_flag flag,
 				     spinlock_t **ptl);
+extern int pmd_freeable(pmd_t pmd);
 
 #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
 #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
diff -puN mm/huge_memory.c~mm-support-madvisemadv_free mm/huge_memory.c
--- a/mm/huge_memory.c~mm-support-madvisemadv_free
+++ a/mm/huge_memory.c
@@ -1384,6 +1384,36 @@ out:
 	return 0;
 }
 
+int madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		 pmd_t *pmd, unsigned long addr)
+
+{
+	spinlock_t *ptl;
+	struct mm_struct *mm = tlb->mm;
+	int ret = 1;
+
+	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+		struct page *page;
+		pmd_t orig_pmd;
+
+		orig_pmd = pmdp_get_and_clear(mm, addr, pmd);
+
+		/* No hugepage in swapcache */
+		page = pmd_page(orig_pmd);
+		VM_BUG_ON_PAGE(PageSwapCache(page), page);
+
+		orig_pmd = pmd_mkold(orig_pmd);
+		orig_pmd = pmd_mkclean(orig_pmd);
+
+		set_pmd_at(mm, addr, pmd, orig_pmd);
+		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+		spin_unlock(ptl);
+		ret = 0;
+	}
+
+	return ret;
+}
+
 int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pmd_t *pmd, unsigned long addr)
 {
@@ -1599,6 +1629,11 @@ unlock:
 	return NULL;
 }
 
+int pmd_freeable(pmd_t pmd)
+{
+	return !pmd_dirty(pmd);
+}
+
 static int __split_huge_page_splitting(struct page *page,
 				       struct vm_area_struct *vma,
 				       unsigned long address)
diff -puN include/linux/swap.h~mm-support-madvisemadv_free include/linux/swap.h
--- a/include/linux/swap.h~mm-support-madvisemadv_free
+++ a/include/linux/swap.h
@@ -308,6 +308,7 @@ extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
 extern void swap_setup(void);
 
 extern void add_page_to_unevictable_list(struct page *page);
diff -puN mm/swap.c~mm-support-madvisemadv_free mm/swap.c
--- a/mm/swap.c~mm-support-madvisemadv_free
+++ a/mm/swap.c
@@ -44,6 +44,7 @@ int page_cluster;
 static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
 
 /*
  * This path almost never happens for VM activity - pages are normally
@@ -797,6 +798,24 @@ static void lru_deactivate_file_fn(struc
 	update_page_reclaim_stat(lruvec, file, 0);
 }
 
+
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+			    void *arg)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		int file = page_is_file_cache(page);
+		int lru = page_lru_base_type(page);
+
+		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
+		ClearPageActive(page);
+		ClearPageReferenced(page);
+		add_page_to_lru_list(page, lruvec, lru);
+
+		__count_vm_event(PGDEACTIVATE);
+		update_page_reclaim_stat(lruvec, file, 0);
+	}
+}
+
 /*
  * Drain pages out of the cpu's pagevecs.
  * Either "cpu" is the current CPU, and preemption has already been
@@ -823,6 +842,10 @@ void lru_add_drain_cpu(int cpu)
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
 
+	pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+	if (pagevec_count(pvec))
+		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+
 	activate_page_drain(cpu);
 }
 
@@ -852,6 +875,26 @@ void deactivate_file_page(struct page *p
 	}
 }
 
+/**
+ * deactivate_page - deactivate a page
+ * @page: page to deactivate
+ *
+ * deactivate_page() moves @page to the inactive list if @page was on the active
+ * list and was not an unevictable page.  This is done to accelerate the reclaim
+ * of @page.
+ */
+void deactivate_page(struct page *page)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+
+		page_cache_get(page);
+		if (!pagevec_add(pvec, page))
+			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+		put_cpu_var(lru_deactivate_pvecs);
+	}
+}
+
 void lru_add_drain(void)
 {
 	lru_add_drain_cpu(get_cpu());
@@ -881,6 +924,7 @@ void lru_add_drain_all(void)
 		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
 		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
 		    need_activate_page_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			schedule_work_on(cpu, work);
diff -puN mm/memory.c~mm-support-madvisemadv_free mm/memory.c
--- a/mm/memory.c~mm-support-madvisemadv_free
+++ a/mm/memory.c
@@ -2555,9 +2555,15 @@ static int do_swap_page(struct mm_struct
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	dec_mm_counter_fast(mm, MM_SWAPENTS);
-	pte = mk_pte(page, vma->vm_page_prot);
+
+	/*
+	 * The page is swapping in now was dirty before it was swapped out
+	 * so restore the state again(ie, pte_mkdirty) because MADV_FREE
+	 * relies on the dirty bit on page table.
+	 */
+	pte = pte_mkdirty(mk_pte(page, vma->vm_page_prot));
 	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
-		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+		pte = maybe_mkwrite(pte, vma);
 		flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
 		exclusive = 1;
_


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] mm: make every pte dirty on do_swap_page
  2015-04-09 20:59     ` Andrew Morton
@ 2015-04-10  0:08       ` Minchan Kim
  2015-04-10  0:14       ` Rik van Riel
  1 sibling, 0 replies; 28+ messages in thread
From: Minchan Kim @ 2015-04-10  0:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Johannes Weiner,
	Mel Gorman, Rik van Riel, Shaohua Li, Yalin.Wang, Hugh Dickins,
	Cyrill Gorcunov, Pavel Emelyanov

Hello Andrew,

On Thu, Apr 09, 2015 at 01:59:39PM -0700, Andrew Morton wrote:
> On Thu, 9 Apr 2015 08:50:25 +0900 Minchan Kim <minchan@kernel.org> wrote:
> 
> > Bump.
> 
> I'm getting the feeling that MADV_FREE is out of control.
> 
> Below is the overall rollup of
> 
> mm-support-madvisemadv_free.patch
> mm-support-madvisemadv_free-fix.patch
> mm-support-madvisemadv_free-fix-2.patch
> mm-dont-split-thp-page-when-syscall-is-called.patch
> mm-dont-split-thp-page-when-syscall-is-called-fix.patch
> mm-dont-split-thp-page-when-syscall-is-called-fix-2.patch
> mm-free-swp_entry-in-madvise_free.patch
> mm-move-lazy-free-pages-to-inactive-list.patch
> mm-move-lazy-free-pages-to-inactive-list-fix.patch
> mm-move-lazy-free-pages-to-inactive-list-fix-fix.patch
> mm-move-lazy-free-pages-to-inactive-list-fix-fix-fix.patch
> mm-make-every-pte-dirty-on-do_swap_page.patch
> 
> 
> It's pretty large and has its sticky little paws in all sorts of places.
> 
> 
> The feature would need to be pretty darn useful to justify a mainline
> merge.  Has any such usefulness been demonstrated?

Jemalloc has used MADV_FREE instead of MADV_DONTNEED for a long time
in MADV_FREE supporting OSes(FreeBSD, Solaris, Darwin, Windows).
It used MADV_DONTNEED on only Linux because there was no the feature.

========================== &< ===========================

jemalloc:

/*
 * Methods for purging unused pages differ between operating systems.
 *
 *   madvise(..., MADV_DONTNEED) : On Linux, this immediately discards pages,
 *                                 such that new pages will be demand-zeroed if
 *                                 the address region is later touched.
 *   madvise(..., MADV_FREE) : On FreeBSD and Darwin, this marks pages as being
 *                             unused, such that they will be discarded rather
 *                             than swapped out.
 */
...

bool
pages_purge(void *addr, size_t length)
{
        bool unzeroed;

#ifdef _WIN32
        VirtualAlloc(addr, length, MEM_RESET, PAGE_READWRITE);
        unzeroed = true;
#elif defined(JEMALLOC_HAVE_MADVISE)
#  ifdef JEMALLOC_PURGE_MADVISE_DONTNEED
#    define JEMALLOC_MADV_PURGE MADV_DONTNEED
#    define JEMALLOC_MADV_ZEROS true
#  elif defined(JEMALLOC_PURGE_MADVISE_FREE)
#    define JEMALLOC_MADV_PURGE MADV_FREE
#    define JEMALLOC_MADV_ZEROS false
#  else
#    error "No madvise(2) flag defined for purging unused dirty pages."
#  endif
        int err = madvise(addr, length, JEMALLOC_MADV_PURGE);
        unzeroed = (!JEMALLOC_MADV_ZEROS || err != 0); 
#  undef JEMALLOC_MADV_PURGE
#  undef JEMALLOC_MADV_ZEROS
#else
        /* Last resort no-op. */
        unzeroed = true;
#endif
        return (unzeroed);
}


Tcmalloc is same page.

========================== &< ===========================

// MADV_FREE is specifically designed for use by malloc(), but only
// FreeBSD supports it; in linux we fall back to the somewhat inferior
// MADV_DONTNEED.
#if !defined(MADV_FREE) && defined(MADV_DONTNEED)
# define MADV_FREE  MADV_DONTNEED
#endif

..

bool TCMalloc_SystemRelease(void* start, size_t length) {
#ifdef MADV_FREE
  if (FLAGS_malloc_devmem_start) {
    // It's not safe to use MADV_FREE/MADV_DONTNEED if we've been
    // mapping /dev/mem for heap memory.
    return false;
  }
  if (FLAGS_malloc_disable_memory_release) return false;
  if (pagesize == 0) pagesize = getpagesize();
  const size_t pagemask = pagesize - 1;

  size_t new_start = reinterpret_cast<size_t>(start);
  size_t end = new_start + length;
  size_t new_end = end;

  // Round up the starting address and round down the ending address
  // to be page aligned:
  new_start = (new_start + pagesize - 1) & ~pagemask;
  new_end = new_end & ~pagemask;

  ASSERT((new_start & pagemask) == 0);
  ASSERT((new_end & pagemask) == 0);
  ASSERT(new_start >= reinterpret_cast<size_t>(start));
  ASSERT(new_end <= end);

  if (new_end > new_start) {
    int result;
    do {
      result = madvise(reinterpret_cast<char*>(new_start),
          new_end - new_start, MADV_FREE);
    } while (result == -1 && errno == EAGAIN);

    return result != -1;
  }
#endif
  return false;
}

glibc want it, too.
https://sourceware.org/ml/libc-alpha/2015-02/msg00197.html


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] mm: make every pte dirty on do_swap_page
  2015-04-09 20:59     ` Andrew Morton
  2015-04-10  0:08       ` Minchan Kim
@ 2015-04-10  0:14       ` Rik van Riel
  1 sibling, 0 replies; 28+ messages in thread
From: Rik van Riel @ 2015-04-10  0:14 UTC (permalink / raw)
  To: Andrew Morton, Minchan Kim
  Cc: linux-kernel, linux-mm, Michal Hocko, Johannes Weiner,
	Mel Gorman, Shaohua Li, Yalin.Wang, Hugh Dickins,
	Cyrill Gorcunov, Pavel Emelyanov

On 04/09/2015 04:59 PM, Andrew Morton wrote:
> On Thu, 9 Apr 2015 08:50:25 +0900 Minchan Kim <minchan@kernel.org> wrote:
> 
>> Bump.
> 
> I'm getting the feeling that MADV_FREE is out of control.
> 
> Below is the overall rollup of
> 
> mm-support-madvisemadv_free.patch
> mm-support-madvisemadv_free-fix.patch
> mm-support-madvisemadv_free-fix-2.patch
> mm-dont-split-thp-page-when-syscall-is-called.patch
> mm-dont-split-thp-page-when-syscall-is-called-fix.patch
> mm-dont-split-thp-page-when-syscall-is-called-fix-2.patch
> mm-free-swp_entry-in-madvise_free.patch
> mm-move-lazy-free-pages-to-inactive-list.patch
> mm-move-lazy-free-pages-to-inactive-list-fix.patch
> mm-move-lazy-free-pages-to-inactive-list-fix-fix.patch
> mm-move-lazy-free-pages-to-inactive-list-fix-fix-fix.patch
> mm-make-every-pte-dirty-on-do_swap_page.patch
> 
> 
> It's pretty large and has its sticky little paws in all sorts of places.
> 
> 
> The feature would need to be pretty darn useful to justify a mainline
> merge.  Has any such usefulness been demonstrated?

The performance increase of MADV_FREE over MADV_DONTNEED is
quite significant. I suspect this is especially important for
mobile devices, which are more memory starved than desktop
systems and servers.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] mm: make every pte dirty on do_swap_page
  2015-03-11  1:20 ` [PATCH 4/4] mm: make every pte dirty on do_swap_page Minchan Kim
  2015-03-30  5:22   ` Minchan Kim
  2015-04-08 23:50   ` Minchan Kim
@ 2015-04-11 21:40   ` Hugh Dickins
  2015-04-12 14:48     ` Minchan Kim
  2 siblings, 1 reply; 28+ messages in thread
From: Hugh Dickins @ 2015-04-11 21:40 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, mm-commits, Michal Hocko,
	Johannes Weiner, Mel Gorman, Rik van Riel, Shaohua Li,
	Yalin Wang, Hugh Dickins, Cyrill Gorcunov, Pavel Emelyanov,
	Yalin Wang

On Wed, 11 Mar 2015, Minchan Kim wrote:

> Bascially, MADV_FREE relys on the pte dirty to decide whether
> it allows VM to discard the page. However, if there is swap-in,
> pte pointed out the page has no pte_dirty. So, MADV_FREE checks
> PageDirty and PageSwapCache for those pages to not discard it
> because swapped-in page could live on swap cache or PageDirty
> when it is removed from swapcache.
> 
> The problem in here is that anonymous pages can have PageDirty if
> it is removed from swapcache so that VM cannot parse those pages
> as freeable even if we did madvise_free. Look at below example.
> 
> ptr = malloc();
> memset(ptr);
> ..
> heavy memory pressure -> swap-out all of pages
> ..
> out of memory pressure so there are lots of free pages
> ..
> var = *ptr; -> swap-in page/remove the page from swapcache. so pte_clean
>                but SetPageDirty
> 
> madvise_free(ptr);
> ..
> ..
> heavy memory pressure -> VM cannot discard the page by PageDirty.
> 
> PageDirty for anonymous page aims for avoiding duplicating
> swapping out. In other words, if a page have swapped-in but
> live swapcache(ie, !PageDirty), we could save swapout if the page
> is selected as victim by VM in future because swap device have
> kept previous swapped-out contents of the page.
> 
> So, rather than relying on the PG_dirty for working madvise_free,
> pte_dirty is more straightforward. Inherently, swapped-out page was
> pte_dirty so this patch restores the dirtiness when swap-in fault
> happens so madvise_free doesn't rely on the PageDirty any more.
> 
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Cyrill Gorcunov <gorcunov@gmail.com>
> Cc: Pavel Emelyanov <xemul@parallels.com>
> Reported-by: Yalin Wang <yalin.wang@sonymobile.com>
> Signed-off-by: Minchan Kim <minchan@kernel.org>

Sorry, but NAK to this patch,
mm-make-every-pte-dirty-on-do_swap_page.patch in akpm's mm tree
(I hope it hasn't reached linux-next yet).

You may well be right that pte_dirty<->PageDirty can be handled
differently, in a way more favourable to MADV_FREE.  And this patch
may be a step in the right direction, but I've barely given it thought.

As it stands, it segfaults more than any patch I've seen in years:
I just tried applying it to 4.0-rc7-mm1, and running kernel builds
in low memory with swap.  Even if I leave KSM out, and memcg out, and
swapoff out, and THP out, and tmpfs out, it still SIGSEGVs very soon.

I have a choice: spend a few hours tracking down the errors, and
post a fix patch on top of yours?  But even then I'd want to spend
a lot longer thinking through every dirty/Dirty in the source before
I'd feel comfortable to give an ack.

This is users' data, and we need to be very careful with it: errors
in MADV_FREE are one thing, for now that's easy to avoid; but in this
patch you're changing the rules for Anon PageDirty for everyone.

I think for now I'll have to leave it to you to do much more source
diligence and testing, before coming back with a corrected patch for
us then to review, slowly and carefully.

Hugh

> ---
>  mm/madvise.c | 1 -
>  mm/memory.c  | 9 +++++++--
>  mm/rmap.c    | 2 +-
>  mm/vmscan.c  | 3 +--
>  4 files changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 22e8f0c..a045798 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -325,7 +325,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  				continue;
>  			}
>  
> -			ClearPageDirty(page);
>  			unlock_page(page);
>  		}
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 0f96a4a..40428a5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2521,9 +2521,14 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  
>  	inc_mm_counter_fast(mm, MM_ANONPAGES);
>  	dec_mm_counter_fast(mm, MM_SWAPENTS);
> -	pte = mk_pte(page, vma->vm_page_prot);
> +
> +	/*
> +	 * Every page swapped-out was pte_dirty so we make pte dirty again.
> +	 * MADV_FREE relies on it.
> +	 */
> +	pte = pte_mkdirty(mk_pte(page, vma->vm_page_prot));
>  	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
> -		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> +		pte = maybe_mkwrite(pte, vma);
>  		flags &= ~FAULT_FLAG_WRITE;
>  		ret |= VM_FAULT_WRITE;
>  		exclusive = 1;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 47b3ba8..34c1d66 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1268,7 +1268,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  
>  		if (flags & TTU_FREE) {
>  			VM_BUG_ON_PAGE(PageSwapCache(page), page);
> -			if (!dirty && !PageDirty(page)) {
> +			if (!dirty) {
>  				/* It's a freeable page by MADV_FREE */
>  				dec_mm_counter(mm, MM_ANONPAGES);
>  				goto discard;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 260c413..3357ffa 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -805,8 +805,7 @@ static enum page_references page_check_references(struct page *page,
>  		return PAGEREF_KEEP;
>  	}
>  
> -	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) &&
> -			!PageDirty(page))
> +	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page))
>  		*freeable = true;
>  
>  	/* Reclaim if clean, defer dirty pages to writeback */
> -- 
> 1.9.3
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] mm: make every pte dirty on do_swap_page
  2015-04-11 21:40   ` Hugh Dickins
@ 2015-04-12 14:48     ` Minchan Kim
  2015-04-15  6:49       ` Minchan Kim
  0 siblings, 1 reply; 28+ messages in thread
From: Minchan Kim @ 2015-04-12 14:48 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-kernel, linux-mm, mm-commits, Michal Hocko,
	Johannes Weiner, Mel Gorman, Rik van Riel, Shaohua Li,
	Yalin Wang, Cyrill Gorcunov, Pavel Emelyanov

Hello Hugh,

On Sat, Apr 11, 2015 at 02:40:46PM -0700, Hugh Dickins wrote:
> On Wed, 11 Mar 2015, Minchan Kim wrote:
> 
> > Bascially, MADV_FREE relys on the pte dirty to decide whether
> > it allows VM to discard the page. However, if there is swap-in,
> > pte pointed out the page has no pte_dirty. So, MADV_FREE checks
> > PageDirty and PageSwapCache for those pages to not discard it
> > because swapped-in page could live on swap cache or PageDirty
> > when it is removed from swapcache.
> > 
> > The problem in here is that anonymous pages can have PageDirty if
> > it is removed from swapcache so that VM cannot parse those pages
> > as freeable even if we did madvise_free. Look at below example.
> > 
> > ptr = malloc();
> > memset(ptr);
> > ..
> > heavy memory pressure -> swap-out all of pages
> > ..
> > out of memory pressure so there are lots of free pages
> > ..
> > var = *ptr; -> swap-in page/remove the page from swapcache. so pte_clean
> >                but SetPageDirty
> > 
> > madvise_free(ptr);
> > ..
> > ..
> > heavy memory pressure -> VM cannot discard the page by PageDirty.
> > 
> > PageDirty for anonymous page aims for avoiding duplicating
> > swapping out. In other words, if a page have swapped-in but
> > live swapcache(ie, !PageDirty), we could save swapout if the page
> > is selected as victim by VM in future because swap device have
> > kept previous swapped-out contents of the page.
> > 
> > So, rather than relying on the PG_dirty for working madvise_free,
> > pte_dirty is more straightforward. Inherently, swapped-out page was
> > pte_dirty so this patch restores the dirtiness when swap-in fault
> > happens so madvise_free doesn't rely on the PageDirty any more.
> > 
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Cyrill Gorcunov <gorcunov@gmail.com>
> > Cc: Pavel Emelyanov <xemul@parallels.com>
> > Reported-by: Yalin Wang <yalin.wang@sonymobile.com>
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> 
> Sorry, but NAK to this patch,
> mm-make-every-pte-dirty-on-do_swap_page.patch in akpm's mm tree
> (I hope it hasn't reached linux-next yet).
> 
> You may well be right that pte_dirty<->PageDirty can be handled
> differently, in a way more favourable to MADV_FREE.  And this patch
> may be a step in the right direction, but I've barely given it thought.
> 
> As it stands, it segfaults more than any patch I've seen in years:
> I just tried applying it to 4.0-rc7-mm1, and running kernel builds
> in low memory with swap.  Even if I leave KSM out, and memcg out, and
> swapoff out, and THP out, and tmpfs out, it still SIGSEGVs very soon.
> 
> I have a choice: spend a few hours tracking down the errors, and
> post a fix patch on top of yours?  But even then I'd want to spend
> a lot longer thinking through every dirty/Dirty in the source before
> I'd feel comfortable to give an ack.
> 
> This is users' data, and we need to be very careful with it: errors
> in MADV_FREE are one thing, for now that's easy to avoid; but in this
> patch you're changing the rules for Anon PageDirty for everyone.
> 
> I think for now I'll have to leave it to you to do much more source
> diligence and testing, before coming back with a corrected patch for
> us then to review, slowly and carefully.

Sorry for my bad. I will keep your advise in mind.
I will investigate the problem as soon as I get back to work
after vacation.

Thanks for the the review.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] mm: make every pte dirty on do_swap_page
  2015-04-12 14:48     ` Minchan Kim
@ 2015-04-15  6:49       ` Minchan Kim
  0 siblings, 0 replies; 28+ messages in thread
From: Minchan Kim @ 2015-04-15  6:49 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-kernel, linux-mm, mm-commits, Michal Hocko,
	Johannes Weiner, Mel Gorman, Rik van Riel, Shaohua Li,
	Yalin Wang, Cyrill Gorcunov, Pavel Emelyanov

On Sun, Apr 12, 2015 at 11:48:23PM +0900, Minchan Kim wrote:
> Hello Hugh,
> 
> On Sat, Apr 11, 2015 at 02:40:46PM -0700, Hugh Dickins wrote:
> > On Wed, 11 Mar 2015, Minchan Kim wrote:
> > 
> > > Bascially, MADV_FREE relys on the pte dirty to decide whether
> > > it allows VM to discard the page. However, if there is swap-in,
> > > pte pointed out the page has no pte_dirty. So, MADV_FREE checks
> > > PageDirty and PageSwapCache for those pages to not discard it
> > > because swapped-in page could live on swap cache or PageDirty
> > > when it is removed from swapcache.
> > > 
> > > The problem in here is that anonymous pages can have PageDirty if
> > > it is removed from swapcache so that VM cannot parse those pages
> > > as freeable even if we did madvise_free. Look at below example.
> > > 
> > > ptr = malloc();
> > > memset(ptr);
> > > ..
> > > heavy memory pressure -> swap-out all of pages
> > > ..
> > > out of memory pressure so there are lots of free pages
> > > ..
> > > var = *ptr; -> swap-in page/remove the page from swapcache. so pte_clean
> > >                but SetPageDirty
> > > 
> > > madvise_free(ptr);
> > > ..
> > > ..
> > > heavy memory pressure -> VM cannot discard the page by PageDirty.
> > > 
> > > PageDirty for anonymous page aims for avoiding duplicating
> > > swapping out. In other words, if a page have swapped-in but
> > > live swapcache(ie, !PageDirty), we could save swapout if the page
> > > is selected as victim by VM in future because swap device have
> > > kept previous swapped-out contents of the page.
> > > 
> > > So, rather than relying on the PG_dirty for working madvise_free,
> > > pte_dirty is more straightforward. Inherently, swapped-out page was
> > > pte_dirty so this patch restores the dirtiness when swap-in fault
> > > happens so madvise_free doesn't rely on the PageDirty any more.
> > > 
> > > Cc: Hugh Dickins <hughd@google.com>
> > > Cc: Cyrill Gorcunov <gorcunov@gmail.com>
> > > Cc: Pavel Emelyanov <xemul@parallels.com>
> > > Reported-by: Yalin Wang <yalin.wang@sonymobile.com>
> > > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > 
> > Sorry, but NAK to this patch,
> > mm-make-every-pte-dirty-on-do_swap_page.patch in akpm's mm tree
> > (I hope it hasn't reached linux-next yet).
> > 
> > You may well be right that pte_dirty<->PageDirty can be handled
> > differently, in a way more favourable to MADV_FREE.  And this patch
> > may be a step in the right direction, but I've barely given it thought.
> > 
> > As it stands, it segfaults more than any patch I've seen in years:
> > I just tried applying it to 4.0-rc7-mm1, and running kernel builds
> > in low memory with swap.  Even if I leave KSM out, and memcg out, and
> > swapoff out, and THP out, and tmpfs out, it still SIGSEGVs very soon.
> > 
> > I have a choice: spend a few hours tracking down the errors, and
> > post a fix patch on top of yours?  But even then I'd want to spend
> > a lot longer thinking through every dirty/Dirty in the source before
> > I'd feel comfortable to give an ack.
> > 
> > This is users' data, and we need to be very careful with it: errors
> > in MADV_FREE are one thing, for now that's easy to avoid; but in this
> > patch you're changing the rules for Anon PageDirty for everyone.
> > 
> > I think for now I'll have to leave it to you to do much more source
> > diligence and testing, before coming back with a corrected patch for
> > us then to review, slowly and carefully.
> 
> Sorry for my bad. I will keep your advise in mind.
> I will investigate the problem as soon as I get back to work
> after vacation.
> 
> Thanks for the the review.

When I look at the code, migration doesn't restore dirty bit of pte
in remove_migration_pte and relys on PG_dirty which was set by
try_to_unmap_one. I think it was a reason you saw segfault.
I will spend more time to investigate another code piece which might
ignore dirty bit restore.

Thanks.



> 
> -- 
> Kind regards,
> Minchan Kim

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2015-04-15  6:50 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-11  1:20 [PATCH 1/4] mm: free swp_entry in madvise_free Minchan Kim
2015-03-11  1:20 ` [PATCH 2/4] mm: change deactivate_page with deactivate_file_page Minchan Kim
2015-03-11  1:20 ` [PATCH 3/4] mm: move lazy free pages to inactive list Minchan Kim
2015-03-11  2:14   ` Wang, Yalin
2015-03-11  4:30     ` Minchan Kim
2015-04-01 20:38     ` Rik van Riel
2015-03-11  9:05   ` [RFC ] mm: don't ignore file map pages for madvise_free( ) Wang, Yalin
2015-03-11  9:47   ` [RFC] mm:do recheck for freeable page in reclaim path Wang, Yalin
2015-03-20 22:43   ` [PATCH 3/4] mm: move lazy free pages to inactive list Andrew Morton
2015-03-30  5:35     ` Minchan Kim
2015-03-30 21:20       ` Andrew Morton
2015-03-31  4:45         ` Minchan Kim
2015-03-31  5:28           ` Andrew Morton
2015-03-31  5:57             ` Minchan Kim
2015-03-11  1:20 ` [PATCH 4/4] mm: make every pte dirty on do_swap_page Minchan Kim
2015-03-30  5:22   ` Minchan Kim
2015-03-30  8:51     ` Cyrill Gorcunov
2015-03-30  8:59       ` Minchan Kim
2015-03-30 21:14         ` Cyrill Gorcunov
2015-03-31  4:38           ` Minchan Kim
2015-04-08 23:50   ` Minchan Kim
2015-04-09 20:59     ` Andrew Morton
2015-04-10  0:08       ` Minchan Kim
2015-04-10  0:14       ` Rik van Riel
2015-04-11 21:40   ` Hugh Dickins
2015-04-12 14:48     ` Minchan Kim
2015-04-15  6:49       ` Minchan Kim
2015-03-19  0:46 ` [PATCH 1/4] mm: free swp_entry in madvise_free Minchan Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).