LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH v1 0/4] Introduce MADV_COLD and MADV_PAGEOUT
@ 2019-06-03  5:36 Minchan Kim
  2019-06-03  5:36 ` [PATCH v1 1/4] mm: introduce MADV_COLD Minchan Kim
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Minchan Kim @ 2019-06-03  5:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg, christian, oleksandr, hdanton, Minchan Kim

This patchset is part of a previous series:
https://lore.kernel.org/lkml/20190531064313.193437-1-minchan@kernel.org/T/#u
Originally, it was created for an external madvise hinting feature.

https://lkml.org/lkml/2019/5/31/463
Michal wanted to separate that discussion from the external hinting
interface, so this patchset includes only the first part of the entire
series:
  - introduce the MADV_COLD and MADV_PAGEOUT hints for madvise.

However, I am keeping the entire description so that others can more
easily understand why these kinds of hints were born.

Thanks.

This patchset is against next-20190530.

Below is the description of the previous, entire patchset.
================= 8< =====================

- Background

The Android terminology used for forking a new process and starting an app
from scratch is a cold start, while resuming an existing app is a hot start.
While we continually try to improve the performance of cold starts, hot
starts will always be significantly less power hungry as well as faster,
so we try to make hot starts more likely than cold starts.

To make hot starts more likely, Android userspace manages the order in
which apps should be killed through a process called ActivityManagerService.
ActivityManagerService tracks every Android app or service that the user
could be interacting with at any time and translates that into a ranked
list for lmkd (the low memory killer daemon). Apps low on that list are
likely to be killed by lmkd if the system has to reclaim memory; in that
sense they are similar to entries in any other cache.
Those apps are kept alive for opportunistic performance improvements but
those performance improvements will vary based on the memory requirements of
individual workloads.

- Problem

Naturally, cached apps were dominant consumers of memory on the system.
However, they were not significant consumers of swap, even though they
are good candidates for swap. On investigation we found that swapping out
only begins once the low zone watermark is hit and kswapd wakes up, but
the overall allocation rate in the system might trip lmkd thresholds
first and cause a cached process to be killed. (We measured the
performance of swapping out vs. zapping the memory by killing a process;
unsurprisingly, zapping is about 10x faster, even though we use zram,
which is much faster than real storage.) A kill from lmkd will often
satisfy the high zone watermark, resulting in very few pages actually
being moved to swap.

- Approach

The approach we chose was to use a new interface to allow userspace to
proactively reclaim entire processes by leveraging platform information.
This allowed us to bypass the inaccuracy of the kernel's LRUs for pages
that are known to be cold from userspace and to avoid races with lmkd
by reclaiming apps as soon as they entered the cached state. Additionally,
it gives the platform many more chances to use the information it has to
optimize memory efficiency.

To achieve that goal, this patchset introduces two new madvise options.
One is MADV_COLD, which deactivates active pages, and the other is
MADV_PAGEOUT, which reclaims private pages immediately. These new options
complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
gain some free memory. MADV_PAGEOUT is similar to MADV_DONTNEED in that
it hints to the kernel that the memory region is not currently needed and
should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in
that it hints to the kernel that the memory region is not currently
needed and should be reclaimed when memory pressure rises.

This approach is similar in spirit to madvise(MADV_DONTNEED), but the
information required to make the reclaim decision is not known to the app.
Instead, it is known to a centralized userspace daemon, and that daemon
must be able to initiate reclaim on its own without any app involvement.
To address that, this patchset introduces a new syscall:

    struct pr_madvise_param {
            int size;               /* the size of this structure */
            int cookie;             /* reserved to support atomicity */
            int nr_elem;            /* count of the array fields below */
            int __user *hints;      /* hints for each range */
            /* to store result of each operation */
            const struct iovec __user *results;
            /* input address ranges */
            const struct iovec __user *ranges;
    };
    
    int process_madvise(int pidfd, struct pr_madvise_param *u_param,
                            unsigned long flags);

The syscall takes a pidfd to give hints to an external process, and
provides a pair of result/range vector arguments so that it can apply
several hints to multiple address ranges all at once. It also has a
cookie variable to support atomicity of the API across address range
operations. IOW, if the target process has changed its address space
since the monitor process parsed the address ranges via map_files or
maps, the API can detect the race and cancel the entire address space
operation. That part is not implemented yet; Daniel Colascione suggested
an idea (please read the description in patch [6/6] of the original
series), and this patchset adds the cookie variable for the future.

- Experiment

We did a bunch of testing with several hundred real users on Android,
not just artificial benchmarks. We saw about a 17% reduction in cold
starts without any significant battery or app startup latency issues.
With an artificial benchmark that launches and switches between apps, we
saw on average a 7% improvement in app launch time, 18% fewer lmkd
kills, and good numbers from vmstat.

A is vanilla and B is process_madvise.

                                       A          B      delta   ratio(%)
               allocstall_dma          0          0          0       0.00
           allocstall_movable       1464        457      -1007     -69.00
            allocstall_normal     263210     190763     -72447     -28.00
             allocstall_total     264674     191220     -73454     -28.00
          compact_daemon_wake      26912      25294      -1618      -7.00
                 compact_fail      17885      14151      -3734     -21.00
         compact_free_scanned 4204766409 3835994922 -368771487      -9.00
             compact_isolated    3446484    2967618    -478866     -14.00
      compact_migrate_scanned 1621336411 1324695710 -296640701     -19.00
                compact_stall      19387      15343      -4044     -21.00
              compact_success       1502       1192       -310     -21.00
kswapd_high_wmark_hit_quickly        234        184        -50     -22.00
            kswapd_inodesteal     221635     233093      11458       5.00
 kswapd_low_wmark_hit_quickly      66065      54009     -12056     -19.00
                   nr_dirtied     259934     296476      36542      14.00
  nr_vmscan_immediate_reclaim       2587       2356       -231      -9.00
              nr_vmscan_write    1274232    2661733    1387501     108.00
                   nr_written    1514060    2937560    1423500      94.00
                   pageoutrun      67561      55133     -12428     -19.00
                   pgactivate    2335060    1984882    -350178     -15.00
                  pgalloc_dma   13743011   14096463     353452       2.00
              pgalloc_movable          0          0          0       0.00
               pgalloc_normal   18742440   16802065   -1940375     -11.00
                pgalloc_total   32485451   30898528   -1586923      -5.00
                 pgdeactivate    4262210    2930670   -1331540     -32.00
                      pgfault   30812334   31085065     272731       0.00
                       pgfree   33553970   31765164   -1788806      -6.00
                 pginodesteal      33411      15084     -18327     -55.00
                  pglazyfreed          0          0          0       0.00
                   pgmajfault     551312    1508299     956987     173.00
               pgmigrate_fail      43927      29330     -14597     -34.00
            pgmigrate_success    1399851    1203922    -195929     -14.00
                       pgpgin   24141776   19032156   -5109620     -22.00
                      pgpgout     959344    1103316     143972      15.00
                 pgpgoutclean    4639732    3765868    -873864     -19.00
                     pgrefill    4884560    3006938   -1877622     -39.00
                    pgrotated      37828      25897     -11931     -32.00
                pgscan_direct    1456037     957567    -498470     -35.00
       pgscan_direct_throttle          0          0          0       0.00
                pgscan_kswapd    6667767    5047360   -1620407     -25.00
                 pgscan_total    8123804    6004927   -2118877     -27.00
                   pgskip_dma          0          0          0       0.00
               pgskip_movable          0          0          0       0.00
                pgskip_normal      14907      25382      10475      70.00
                 pgskip_total      14907      25382      10475      70.00
               pgsteal_direct    1118986     690215    -428771     -39.00
               pgsteal_kswapd    4750223    3657107   -1093116     -24.00
                pgsteal_total    5869209    4347322   -1521887     -26.00
                       pswpin     417613    1392647     975034     233.00
                      pswpout    1274224    2661731    1387507     108.00
                slabs_scanned   13686905   10807200   -2879705     -22.00
          workingset_activate     668966     569444     -99522     -15.00
       workingset_nodereclaim      38957      32621      -6336     -17.00
           workingset_refault    2816795    2179782    -637013     -23.00
           workingset_restore     294320     168601    -125719     -43.00

pgmajfault increases by 173% because swapin increases by about 200% due
to the process_madvise hint. However, a swap read from zram is much
cheaper than file IO from a performance point of view, and an app hot
start via swapin is also cheaper than a cold start from scratch, which
needs a lot of IO from storage plus initialization steps.

Brian Geffon on the ChromeOS team ran an experiment with process_madvise(2).
Quoting him:
"What I found is that by using process_madvise after a tab has been back
grounded for more than 45 seconds reduced the average tab switch times by
25%! This is a huge result and very obvious validation that process_madvise
hints works well for the ChromeOS use case."


Minchan Kim (4):
  mm: introduce MADV_COLD
  mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
  mm: account nr_isolated_xxx in [isolate|putback]_lru_page
  mm: introduce MADV_PAGEOUT

 include/linux/page-flags.h             |   1 +
 include/linux/page_idle.h              |  15 ++
 include/linux/swap.h                   |   2 +
 include/uapi/asm-generic/mman-common.h |   2 +
 mm/compaction.c                        |   2 -
 mm/gup.c                               |   7 +-
 mm/internal.h                          |   2 +-
 mm/khugepaged.c                        |   3 -
 mm/madvise.c                           | 241 ++++++++++++++++++++++++-
 mm/memory-failure.c                    |   3 -
 mm/memory_hotplug.c                    |   4 -
 mm/mempolicy.c                         |   6 +-
 mm/migrate.c                           |  37 +---
 mm/oom_kill.c                          |   2 +-
 mm/swap.c                              |  43 +++++
 mm/vmscan.c                            |  62 ++++++-
 16 files changed, 367 insertions(+), 65 deletions(-)

-- 
2.22.0.rc1.311.g5d7573a151-goog


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v1 1/4] mm: introduce MADV_COLD
  2019-06-03  5:36 [PATCH v1 0/4] Introduce MADV_COLD and MADV_PAGEOUT Minchan Kim
@ 2019-06-03  5:36 ` Minchan Kim
  2019-06-04 20:38   ` Joel Fernandes
  2019-06-03  5:36 ` [PATCH v1 2/4] mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM Minchan Kim
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2019-06-03  5:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg, christian, oleksandr, hdanton, Minchan Kim

When a process expects no accesses to a certain memory range, it can
give a hint to the kernel that the pages can be reclaimed when memory
pressure happens, but that the data should be preserved for future use.
This can reduce workingset eviction and thus improve performance.

This patch introduces the new MADV_COLD hint to madvise(2) syscall.
MADV_COLD can be used by a process to mark a memory range as not expected
to be used in the near future. The hint can help kernel in deciding which
pages to evict early during memory pressure.

It works on every LRU page, like MADV_[DONTNEED|FREE]. IOW, it moves

	active file page -> inactive file LRU
	active anon page -> inactive anon LRU

Unlike MADV_FREE, it doesn't move active anonymous pages to the head of
the inactive file LRU, because MADV_COLD has slightly different
semantics. MADV_FREE means it's okay to discard the pages under memory
pressure because their contents are *garbage*, so freeing such pages has
almost zero overhead: there is no need to swap them out, and a later
access causes only a minor fault. Thus, it makes sense to put those
freeable pages on the inactive file LRU to compete with other used-once
pages, and it even gives them a bonus so they can be reclaimed on a
swapless system. MADV_COLD pages, however, are not garbage, so
reclaiming them requires a swap-out and a later swap-in. So it's better
to move them to the inactive anon LRU list, not the file LRU.
Furthermore, this helps avoid unnecessary scanning of cold anonymous
pages if the system doesn't have a swap device.

All of the error rules are the same as for MADV_DONTNEED.

Note:
This hint works only with private pages (IOW, page_mapcount(page) < 2),
because a shared page has a higher chance of being accessed by the other
processes sharing it even after the caller resets the reference bits.
That would end up preventing reclaim of the page and wasting CPU cycles.

* RFCv2
 * add more description - mhocko

* RFCv1
 * renaming from MADV_COOL to MADV_COLD - hannes

* internal review
 * use clear_page_young in deactivate_page - joelaf
 * Revise the description - surenb
 * Renaming from MADV_WARM to MADV_COOL - surenb

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/page-flags.h             |   1 +
 include/linux/page_idle.h              |  15 ++++
 include/linux/swap.h                   |   1 +
 include/uapi/asm-generic/mman-common.h |   1 +
 mm/internal.h                          |   2 +-
 mm/madvise.c                           | 115 ++++++++++++++++++++++++-
 mm/oom_kill.c                          |   2 +-
 mm/swap.c                              |  43 +++++++++
 8 files changed, 176 insertions(+), 4 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9f8712a4b1a5..58b06654c8dd 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -424,6 +424,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
 TESTPAGEFLAG(Young, young, PF_ANY)
 SETPAGEFLAG(Young, young, PF_ANY)
 TESTCLEARFLAG(Young, young, PF_ANY)
+CLEARPAGEFLAG(Young, young, PF_ANY)
 PAGEFLAG(Idle, idle, PF_ANY)
 #endif
 
diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 1e894d34bdce..f3f43b317150 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -19,6 +19,11 @@ static inline void set_page_young(struct page *page)
 	SetPageYoung(page);
 }
 
+static inline void clear_page_young(struct page *page)
+{
+	ClearPageYoung(page);
+}
+
 static inline bool test_and_clear_page_young(struct page *page)
 {
 	return TestClearPageYoung(page);
@@ -65,6 +70,16 @@ static inline void set_page_young(struct page *page)
 	set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
 }
 
+static void clear_page_young(struct page *page)
+{
+	struct page_ext *page_ext = lookup_page_ext(page);
+
+	if (unlikely(!page_ext))
+		return;
+
+	clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
+}
+
 static inline bool test_and_clear_page_young(struct page *page)
 {
 	struct page_ext *page_ext = lookup_page_ext(page);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index de2c67a33b7e..0ce997edb8bb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
 
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index bea0278f65ab..1190f4e7f7b9 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -43,6 +43,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_COLD	5		/* deactivate these pages */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_FREE	8		/* free pages only if memory pressure */
diff --git a/mm/internal.h b/mm/internal.h
index 9eeaf2b95166..75a4f96ec0fb 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -43,7 +43,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf);
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
-static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
+static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
 {
 	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
 }
diff --git a/mm/madvise.c b/mm/madvise.c
index 628022e674a7..ab158766858a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -40,6 +40,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
+	case MADV_COLD:
 	case MADV_FREE:
 		return 0;
 	default:
@@ -307,6 +308,113 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	return 0;
 }
 
+static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
+{
+	pte_t *orig_pte, *pte, ptent;
+	spinlock_t *ptl;
+	struct page *page;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long next;
+
+	next = pmd_addr_end(addr, end);
+	if (pmd_trans_huge(*pmd)) {
+		ptl = pmd_trans_huge_lock(pmd, vma);
+		if (!ptl)
+			return 0;
+
+		if (is_huge_zero_pmd(*pmd))
+			goto huge_unlock;
+
+		page = pmd_page(*pmd);
+		if (page_mapcount(page) > 1)
+			goto huge_unlock;
+
+		if (next - addr != HPAGE_PMD_SIZE) {
+			int err;
+
+			get_page(page);
+			spin_unlock(ptl);
+			lock_page(page);
+			err = split_huge_page(page);
+			unlock_page(page);
+			put_page(page);
+			if (!err)
+				goto regular_page;
+			return 0;
+		}
+
+		pmdp_test_and_clear_young(vma, addr, pmd);
+		deactivate_page(page);
+huge_unlock:
+		spin_unlock(ptl);
+		return 0;
+	}
+
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+regular_page:
+	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
+		ptent = *pte;
+
+		if (pte_none(ptent))
+			continue;
+
+		if (!pte_present(ptent))
+			continue;
+
+		page = vm_normal_page(vma, addr, ptent);
+		if (!page)
+			continue;
+
+		if (page_mapcount(page) > 1)
+			continue;
+
+		ptep_test_and_clear_young(vma, addr, pte);
+		deactivate_page(page);
+	}
+
+	pte_unmap_unlock(orig_pte, ptl);
+	cond_resched();
+
+	return 0;
+}
+
+static void madvise_cold_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end)
+{
+	struct mm_walk cool_walk = {
+		.pmd_entry = madvise_cold_pte_range,
+		.mm = vma->vm_mm,
+	};
+
+	tlb_start_vma(tlb, vma);
+	walk_page_range(addr, end, &cool_walk);
+	tlb_end_vma(tlb, vma);
+}
+
+static long madvise_cold(struct vm_area_struct *vma,
+			struct vm_area_struct **prev,
+			unsigned long start_addr, unsigned long end_addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_gather tlb;
+
+	*prev = vma;
+	if (!can_madv_lru_vma(vma))
+		return -EINVAL;
+
+	lru_add_drain();
+	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
+	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
+	tlb_finish_mmu(&tlb, start_addr, end_addr);
+
+	return 0;
+}
+
 static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
 
@@ -519,7 +627,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 				  int behavior)
 {
 	*prev = vma;
-	if (!can_madv_dontneed_vma(vma))
+	if (!can_madv_lru_vma(vma))
 		return -EINVAL;
 
 	if (!userfaultfd_remove(vma, start, end)) {
@@ -541,7 +649,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 			 */
 			return -ENOMEM;
 		}
-		if (!can_madv_dontneed_vma(vma))
+		if (!can_madv_lru_vma(vma))
 			return -EINVAL;
 		if (end > vma->vm_end) {
 			/*
@@ -695,6 +803,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		return madvise_remove(vma, prev, start, end);
 	case MADV_WILLNEED:
 		return madvise_willneed(vma, prev, start, end);
+	case MADV_COLD:
+		return madvise_cold(vma, prev, start, end);
 	case MADV_FREE:
 	case MADV_DONTNEED:
 		return madvise_dontneed_free(vma, prev, start, end, behavior);
@@ -716,6 +826,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
 	case MADV_FREE:
+	case MADV_COLD:
 #ifdef CONFIG_KSM
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5a58778c91d4..f73d5f5145f0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -515,7 +515,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 	set_bit(MMF_UNSTABLE, &mm->flags);
 
 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
-		if (!can_madv_dontneed_vma(vma))
+		if (!can_madv_lru_vma(vma))
 			continue;
 
 		/*
diff --git a/mm/swap.c b/mm/swap.c
index 7b079976cbec..cebedab15aa2 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -47,6 +47,7 @@ int page_cluster;
 static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
 #ifdef CONFIG_SMP
 static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
@@ -538,6 +539,23 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 	update_page_reclaim_stat(lruvec, file, 0);
 }
 
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+			    void *arg)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		int file = page_is_file_cache(page);
+		int lru = page_lru_base_type(page);
+
+		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
+		ClearPageActive(page);
+		ClearPageReferenced(page);
+		clear_page_young(page);
+		add_page_to_lru_list(page, lruvec, lru);
+
+		__count_vm_events(PGDEACTIVATE, hpage_nr_pages(page));
+		update_page_reclaim_stat(lruvec, file, 0);
+	}
+}
 
 static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
 			    void *arg)
@@ -590,6 +608,10 @@ void lru_add_drain_cpu(int cpu)
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
 
+	pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+	if (pagevec_count(pvec))
+		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+
 	pvec = &per_cpu(lru_lazyfree_pvecs, cpu);
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
@@ -623,6 +645,26 @@ void deactivate_file_page(struct page *page)
 	}
 }
 
+/*
+ * deactivate_page - deactivate a page
+ * @page: page to deactivate
+ *
+ * deactivate_page() moves @page to the inactive list if @page was on the active
+ * list and was not an unevictable page.  This is done to accelerate the reclaim
+ * of @page.
+ */
+void deactivate_page(struct page *page)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+
+		get_page(page);
+		if (!pagevec_add(pvec, page) || PageCompound(page))
+			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+		put_cpu_var(lru_deactivate_pvecs);
+	}
+}
+
 /**
  * mark_page_lazyfree - make an anon page lazyfree
  * @page: page to deactivate
@@ -687,6 +729,7 @@ void lru_add_drain_all(void)
 		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
 		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_lazyfree_pvecs, cpu)) ||
 		    need_activate_page_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
-- 
2.22.0.rc1.311.g5d7573a151-goog


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v1 2/4] mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
  2019-06-03  5:36 [PATCH v1 0/4] Introduce MADV_COLD and MADV_PAGEOUT Minchan Kim
  2019-06-03  5:36 ` [PATCH v1 1/4] mm: introduce MADV_COLD Minchan Kim
@ 2019-06-03  5:36 ` Minchan Kim
  2019-06-03  5:36 ` [PATCH v1 3/4] mm: account nr_isolated_xxx in [isolate|putback]_lru_page Minchan Kim
  2019-06-03  5:36 ` [PATCH v1 4/4] mm: introduce MADV_PAGEOUT Minchan Kim
  3 siblings, 0 replies; 15+ messages in thread
From: Minchan Kim @ 2019-06-03  5:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg, christian, oleksandr, hdanton, Minchan Kim

The local variable references in shrink_page_list defaults to
PAGEREF_RECLAIM_CLEAN. It is there to prevent reclaiming dirty pages
when CMA tries to migrate pages. Strictly speaking, we don't need it,
because CMA already disallows writeout via .may_writepage = 0 in
reclaim_clean_pages_from_list.

Moreover, it prevents anonymous pages from being swapped out even when
force_reclaim = true in shrink_page_list, which is a problem for the
upcoming patch. So this patch changes references's default value to
PAGEREF_RECLAIM and renames force_reclaim to ignore_references to make
its role clearer.

This is preparatory work for the next patch.

* RFCv1
 * use ignore_references as parameter name - hannes

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/vmscan.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 84dcb651d05c..0973a46a0472 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1102,7 +1102,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				      struct scan_control *sc,
 				      enum ttu_flags ttu_flags,
 				      struct reclaim_stat *stat,
-				      bool force_reclaim)
+				      bool ignore_references)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
@@ -1116,7 +1116,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		struct address_space *mapping;
 		struct page *page;
 		int may_enter_fs;
-		enum page_references references = PAGEREF_RECLAIM_CLEAN;
+		enum page_references references = PAGEREF_RECLAIM;
 		bool dirty, writeback;
 		unsigned int nr_pages;
 
@@ -1247,7 +1247,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
-		if (!force_reclaim)
+		if (!ignore_references)
 			references = page_check_references(page, sc);
 
 		switch (references) {
-- 
2.22.0.rc1.311.g5d7573a151-goog


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v1 3/4] mm: account nr_isolated_xxx in [isolate|putback]_lru_page
  2019-06-03  5:36 [PATCH v1 0/4] Introduce MADV_COLD and MADV_PAGEOUT Minchan Kim
  2019-06-03  5:36 ` [PATCH v1 1/4] mm: introduce MADV_COLD Minchan Kim
  2019-06-03  5:36 ` [PATCH v1 2/4] mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM Minchan Kim
@ 2019-06-03  5:36 ` Minchan Kim
  2019-06-03  5:36 ` [PATCH v1 4/4] mm: introduce MADV_PAGEOUT Minchan Kim
  3 siblings, 0 replies; 15+ messages in thread
From: Minchan Kim @ 2019-06-03  5:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg, christian, oleksandr, hdanton, Minchan Kim

The isolated-page counters are percpu counters, so batching the updates
would not be a huge gain. Rather than complicating the callers to batch
them, let's make it more straightforward by moving the counting logic
into the [isolate|putback]_lru_page APIs.

Link: http://lkml.kernel.org/r/20190531165927.GA20067@cmpxchg.org
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/compaction.c     |  2 --
 mm/gup.c            |  7 +------
 mm/khugepaged.c     |  3 ---
 mm/memory-failure.c |  3 ---
 mm/memory_hotplug.c |  4 ----
 mm/mempolicy.c      |  6 +-----
 mm/migrate.c        | 37 ++++++++-----------------------------
 mm/vmscan.c         | 22 ++++++++++++++++------
 8 files changed, 26 insertions(+), 58 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 9e1b9acb116b..c6591682deda 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -982,8 +982,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 		/* Successfully isolated */
 		del_page_from_lru_list(page, lruvec, page_lru(page));
-		inc_node_page_state(page,
-				NR_ISOLATED_ANON + page_is_file_cache(page));
 
 isolate_success:
 		list_add(&page->lru, &cc->migratepages);
diff --git a/mm/gup.c b/mm/gup.c
index 63ac50e48072..2d9a9bc358c7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1360,13 +1360,8 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk,
 					drain_allow = false;
 				}
 
-				if (!isolate_lru_page(head)) {
+				if (!isolate_lru_page(head))
 					list_add_tail(&head->lru, &cma_page_list);
-					mod_node_page_state(page_pgdat(head),
-							    NR_ISOLATED_ANON +
-							    page_is_file_cache(head),
-							    hpage_nr_pages(head));
-				}
 			}
 		}
 	}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a335f7c1fac4..3359df994fb4 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -503,7 +503,6 @@ void __khugepaged_exit(struct mm_struct *mm)
 
 static void release_pte_page(struct page *page)
 {
-	dec_node_page_state(page, NR_ISOLATED_ANON + page_is_file_cache(page));
 	unlock_page(page);
 	putback_lru_page(page);
 }
@@ -602,8 +601,6 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			result = SCAN_DEL_PAGE_LRU;
 			goto out;
 		}
-		inc_node_page_state(page,
-				NR_ISOLATED_ANON + page_is_file_cache(page));
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index bc749265a8f3..2187bad7ceff 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1796,9 +1796,6 @@ static int __soft_offline_page(struct page *page, int flags)
 		 * so use !__PageMovable instead for LRU page's mapping
 		 * cannot have PAGE_MAPPING_MOVABLE.
 		 */
-		if (!__PageMovable(page))
-			inc_node_page_state(page, NR_ISOLATED_ANON +
-						page_is_file_cache(page));
 		list_add(&page->lru, &pagelist);
 		ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
 					MIGRATE_SYNC, MR_MEMORY_FAILURE);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a88c5f334e5a..a41bea24d0c9 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1390,10 +1390,6 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 			ret = isolate_movable_page(page, ISOLATE_UNEVICTABLE);
 		if (!ret) { /* Success */
 			list_add_tail(&page->lru, &source);
-			if (!__PageMovable(page))
-				inc_node_page_state(page, NR_ISOLATED_ANON +
-						    page_is_file_cache(page));
-
 		} else {
 			pr_warn("failed to isolate pfn %lx\n", pfn);
 			dump_page(page, "isolation failed");
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 5b3bf1747c19..cfb0590f69bb 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -948,12 +948,8 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
 	 * Avoid migrating a page that is shared with others.
 	 */
 	if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(head) == 1) {
-		if (!isolate_lru_page(head)) {
+		if (!isolate_lru_page(head))
 			list_add_tail(&head->lru, pagelist);
-			mod_node_page_state(page_pgdat(head),
-				NR_ISOLATED_ANON + page_is_file_cache(head),
-				hpage_nr_pages(head));
-		}
 	}
 }
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 572b4bc85d76..39b95ba04d3e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -190,8 +190,6 @@ void putback_movable_pages(struct list_head *l)
 			unlock_page(page);
 			put_page(page);
 		} else {
-			mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
-					page_is_file_cache(page), -hpage_nr_pages(page));
 			putback_lru_page(page);
 		}
 	}
@@ -1181,10 +1179,17 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		return -ENOMEM;
 
 	if (page_count(page) == 1) {
+		bool is_lru = !__PageMovable(page);
+
 		/* page was freed from under us. So we are done. */
 		ClearPageActive(page);
 		ClearPageUnevictable(page);
-		if (unlikely(__PageMovable(page))) {
+		if (likely(is_lru))
+			mod_node_page_state(page_pgdat(page),
+						NR_ISOLATED_ANON +
+						page_is_file_cache(page),
+						hpage_nr_pages(page));
+		else {
 			lock_page(page);
 			if (!PageMovable(page))
 				__ClearPageIsolated(page);
@@ -1210,15 +1215,6 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		 * restored.
 		 */
 		list_del(&page->lru);
-
-		/*
-		 * Compaction can migrate also non-LRU pages which are
-		 * not accounted to NR_ISOLATED_*. They can be recognized
-		 * as __PageMovable
-		 */
-		if (likely(!__PageMovable(page)))
-			mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
-					page_is_file_cache(page), -hpage_nr_pages(page));
 	}
 
 	/*
@@ -1572,9 +1568,6 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 
 		err = 0;
 		list_add_tail(&head->lru, pagelist);
-		mod_node_page_state(page_pgdat(head),
-			NR_ISOLATED_ANON + page_is_file_cache(head),
-			hpage_nr_pages(head));
 	}
 out_putpage:
 	/*
@@ -1890,8 +1883,6 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 
 static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 {
-	int page_lru;
-
 	VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page);
 
 	/* Avoid migrating to a node that is nearly full */
@@ -1913,10 +1904,6 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 		return 0;
 	}
 
-	page_lru = page_is_file_cache(page);
-	mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru,
-				hpage_nr_pages(page));
-
 	/*
 	 * Isolating the page has taken another reference, so the
 	 * caller's reference can be safely dropped without the page
@@ -1971,8 +1958,6 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 	if (nr_remaining) {
 		if (!list_empty(&migratepages)) {
 			list_del(&page->lru);
-			dec_node_page_state(page, NR_ISOLATED_ANON +
-					page_is_file_cache(page));
 			putback_lru_page(page);
 		}
 		isolated = 0;
@@ -2002,7 +1987,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated = 0;
 	struct page *new_page = NULL;
-	int page_lru = page_is_file_cache(page);
 	unsigned long start = address & HPAGE_PMD_MASK;
 
 	new_page = alloc_pages_node(node,
@@ -2048,8 +2032,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 		/* Retake the callers reference and putback on LRU */
 		get_page(page);
 		putback_lru_page(page);
-		mod_node_page_state(page_pgdat(page),
-			 NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
 
 		goto out_unlock;
 	}
@@ -2099,9 +2081,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
 	count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
 
-	mod_node_page_state(page_pgdat(page),
-			NR_ISOLATED_ANON + page_lru,
-			-HPAGE_PMD_NR);
 	return isolated;
 
 out_fail:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0973a46a0472..56df55e8afcd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -999,6 +999,9 @@ int remove_mapping(struct address_space *mapping, struct page *page)
 void putback_lru_page(struct page *page)
 {
 	lru_cache_add(page);
+	mod_node_page_state(page_pgdat(page),
+				NR_ISOLATED_ANON + page_is_file_cache(page),
+				-hpage_nr_pages(page));
 	put_page(page);		/* drop ref from isolate */
 }
 
@@ -1464,6 +1467,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 */
 		nr_reclaimed += nr_pages;
 
+		mod_node_page_state(pgdat, NR_ISOLATED_ANON +
+						page_is_file_cache(page),
+						-nr_pages);
 		/*
 		 * Is there need to periodically free_page_list? It would
 		 * appear not as the counts should be low
@@ -1539,7 +1545,6 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 	ret = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc,
 			TTU_IGNORE_ACCESS, &dummy_stat, true);
 	list_splice(&clean_pages, page_list);
-	mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -ret);
 	return ret;
 }
 
@@ -1615,6 +1620,9 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 		 */
 		ClearPageLRU(page);
 		ret = 0;
+		__mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
+						page_is_file_cache(page),
+						hpage_nr_pages(page));
 	}
 
 	return ret;
@@ -1746,6 +1754,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
 				    total_scan, skipped, nr_taken, mode, lru);
 	update_lru_sizes(lruvec, lru, nr_zone_taken);
+
 	return nr_taken;
 }
 
@@ -1794,6 +1803,9 @@ int isolate_lru_page(struct page *page)
 			ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, lru);
 			ret = 0;
+			mod_node_page_state(pgdat, NR_ISOLATED_ANON +
+						page_is_file_cache(page),
+						hpage_nr_pages(page));
 		}
 		spin_unlock_irq(&pgdat->lru_lock);
 	}
@@ -1885,6 +1897,9 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
 		list_move(&page->lru, &lruvec->lists[lru]);
 
+		__mod_node_page_state(pgdat, NR_ISOLATED_ANON +
+						page_is_file_cache(page),
+						-hpage_nr_pages(page));
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
@@ -1962,7 +1977,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, lru);
 
-	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
 	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
@@ -1988,8 +2002,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 	move_pages_to_lru(lruvec, &page_list);
 
-	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-
 	spin_unlock_irq(&pgdat->lru_lock);
 
 	mem_cgroup_uncharge_list(&page_list);
@@ -2048,7 +2060,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, lru);
 
-	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
 	__count_vm_events(PGREFILL, nr_scanned);
@@ -2117,7 +2128,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__count_vm_events(PGDEACTIVATE, nr_deactivate);
 	__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
 
-	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 	spin_unlock_irq(&pgdat->lru_lock);
 
 	mem_cgroup_uncharge_list(&l_active);
-- 
2.22.0.rc1.311.g5d7573a151-goog


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v1 4/4] mm: introduce MADV_PAGEOUT
  2019-06-03  5:36 [PATCH v1 0/4] Introduce MADV_COLD and MADV_PAGEOUT Minchan Kim
                   ` (2 preceding siblings ...)
  2019-06-03  5:36 ` [PATCH v1 3/4] mm: account nr_isolated_xxx in [isolate|putback]_lru_page Minchan Kim
@ 2019-06-03  5:36 ` Minchan Kim
  2019-06-03 20:39   ` Johannes Weiner
  2019-06-04 15:38   ` Christopher Lameter
  3 siblings, 2 replies; 15+ messages in thread
From: Minchan Kim @ 2019-06-03  5:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg, christian, oleksandr, hdanton, Minchan Kim

When a process expects no accesses to a certain memory range
for a long time, it can hint the kernel that the pages can be
reclaimed immediately while their data is preserved for future use.
This reduces workingset eviction and thus ends up increasing
performance.

This patch introduces the new MADV_PAGEOUT hint to the madvise(2)
syscall. A process can use MADV_PAGEOUT to mark a memory range as
not expected to be used for a long time, so that the kernel reclaims
*any LRU* pages in the range immediately. The hint helps the kernel
decide which pages to evict proactively.

All error rules are the same as for MADV_DONTNEED.

Note:
    This hint works only with private pages (IOW,
    page_mapcount(page) < 2) because a shared page is more likely
    to be accessed by the other processes mapping it, which would
    cause a major fault soon after reclaim and thus be inefficient.

* RFC v2
 * make reclaim_pages simple via factoring out isolate logic - hannes

* RFCv1
 * rename from MADV_COLD to MADV_PAGEOUT - hannes
 * bail out if process is being killed - Hillf
 * fix reclaim_pages bugs - Hillf

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/swap.h                   |   1 +
 include/uapi/asm-generic/mman-common.h |   1 +
 mm/madvise.c                           | 126 +++++++++++++++++++++++++
 mm/vmscan.c                            |  34 +++++++
 4 files changed, 162 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0ce997edb8bb..063c0c1e112b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -365,6 +365,7 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern unsigned long vm_total_pages;
 
+extern unsigned long reclaim_pages(struct list_head *page_list);
 #ifdef CONFIG_NUMA
 extern int node_reclaim_mode;
 extern int sysctl_min_unmapped_ratio;
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 1190f4e7f7b9..92e347a89ddc 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -44,6 +44,7 @@
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
 #define MADV_COLD	5		/* deactivate these pages */
+#define MADV_PAGEOUT	6		/* reclaim these pages */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_FREE	8		/* free pages only if memory pressure */
diff --git a/mm/madvise.c b/mm/madvise.c
index ab158766858a..b010249cb8b6 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -41,6 +41,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
 	case MADV_COLD:
+	case MADV_PAGEOUT:
 	case MADV_FREE:
 		return 0;
 	default:
@@ -415,6 +416,128 @@ static long madvise_cold(struct vm_area_struct *vma,
 	return 0;
 }
 
+static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
+{
+	pte_t *orig_pte, *pte, ptent;
+	spinlock_t *ptl;
+	LIST_HEAD(page_list);
+	struct page *page;
+	int isolated = 0;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long next;
+
+	if (fatal_signal_pending(current))
+		return -EINTR;
+
+	next = pmd_addr_end(addr, end);
+	if (pmd_trans_huge(*pmd)) {
+		ptl = pmd_trans_huge_lock(pmd, vma);
+		if (!ptl)
+			return 0;
+
+		if (is_huge_zero_pmd(*pmd))
+			goto huge_unlock;
+
+		page = pmd_page(*pmd);
+		if (page_mapcount(page) > 1)
+			goto huge_unlock;
+
+		if (next - addr != HPAGE_PMD_SIZE) {
+			int err;
+
+			get_page(page);
+			spin_unlock(ptl);
+			lock_page(page);
+			err = split_huge_page(page);
+			unlock_page(page);
+			put_page(page);
+			if (!err)
+				goto regular_page;
+			return 0;
+		}
+
+		if (isolate_lru_page(page))
+			goto huge_unlock;
+
+		list_add(&page->lru, &page_list);
+huge_unlock:
+		spin_unlock(ptl);
+		reclaim_pages(&page_list);
+		return 0;
+	}
+
+	if (pmd_trans_unstable(pmd))
+		return 0;
+regular_page:
+	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
+		ptent = *pte;
+		if (!pte_present(ptent))
+			continue;
+
+		page = vm_normal_page(vma, addr, ptent);
+		if (!page)
+			continue;
+
+		if (page_mapcount(page) > 1)
+			continue;
+
+		if (isolate_lru_page(page))
+			continue;
+
+		isolated++;
+		list_add(&page->lru, &page_list);
+		if (isolated >= SWAP_CLUSTER_MAX) {
+			pte_unmap_unlock(orig_pte, ptl);
+			reclaim_pages(&page_list);
+			isolated = 0;
+			pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+			orig_pte = pte;
+		}
+	}
+
+	pte_unmap_unlock(orig_pte, ptl);
+	reclaim_pages(&page_list);
+	cond_resched();
+
+	return 0;
+}
+
+static void madvise_pageout_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end)
+{
+	struct mm_walk warm_walk = {
+		.pmd_entry = madvise_pageout_pte_range,
+		.mm = vma->vm_mm,
+	};
+
+	tlb_start_vma(tlb, vma);
+	walk_page_range(addr, end, &warm_walk);
+	tlb_end_vma(tlb, vma);
+}
+
+
+static long madvise_pageout(struct vm_area_struct *vma,
+			struct vm_area_struct **prev,
+			unsigned long start_addr, unsigned long end_addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_gather tlb;
+
+	*prev = vma;
+	if (!can_madv_lru_vma(vma))
+		return -EINVAL;
+
+	lru_add_drain();
+	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
+	madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
+	tlb_finish_mmu(&tlb, start_addr, end_addr);
+
+	return 0;
+}
+
 static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
 
@@ -805,6 +928,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		return madvise_willneed(vma, prev, start, end);
 	case MADV_COLD:
 		return madvise_cold(vma, prev, start, end);
+	case MADV_PAGEOUT:
+		return madvise_pageout(vma, prev, start, end);
 	case MADV_FREE:
 	case MADV_DONTNEED:
 		return madvise_dontneed_free(vma, prev, start, end, behavior);
@@ -827,6 +952,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_DONTNEED:
 	case MADV_FREE:
 	case MADV_COLD:
+	case MADV_PAGEOUT:
 #ifdef CONFIG_KSM
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 56df55e8afcd..2c2cf442db58 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2136,6 +2136,40 @@ static void shrink_active_list(unsigned long nr_to_scan,
 			nr_deactivate, nr_rotated, sc->priority, file);
 }
 
+unsigned long reclaim_pages(struct list_head *page_list)
+{
+	unsigned long nr_reclaimed = 0;
+	LIST_HEAD(node_page_list);
+	struct reclaim_stat dummy_stat;
+	struct scan_control sc = {
+		.gfp_mask = GFP_KERNEL,
+		.priority = DEF_PRIORITY,
+		.may_writepage = 1,
+		.may_unmap = 1,
+		.may_swap = 1,
+	};
+
+	while (!list_empty(page_list)) {
+		struct page *page;
+
+		page = lru_to_page(page_list);
+		list_move(&page->lru, &node_page_list);
+		nr_reclaimed += shrink_page_list(&node_page_list,
+						page_pgdat(page),
+						&sc, TTU_IGNORE_ACCESS,
+						&dummy_stat, true);
+		if (!list_empty(&node_page_list)) {
+			struct page *page = lru_to_page(&node_page_list);
+
+			list_del(&page->lru);
+			putback_lru_page(page);
+
+		}
+	}
+
+	return nr_reclaimed;
+}
+
 /*
  * The inactive anon list should be small enough that the VM never has
  * to do too much work.
-- 
2.22.0.rc1.311.g5d7573a151-goog



* Re: [PATCH v1 4/4] mm: introduce MADV_PAGEOUT
  2019-06-03  5:36 ` [PATCH v1 4/4] mm: introduce MADV_PAGEOUT Minchan Kim
@ 2019-06-03 20:39   ` Johannes Weiner
  2019-06-04  3:55     ` Minchan Kim
  2019-06-04 15:38   ` Christopher Lameter
  1 sibling, 1 reply; 15+ messages in thread
From: Johannes Weiner @ 2019-06-03 20:39 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg, christian, oleksandr, hdanton

On Mon, Jun 03, 2019 at 02:36:55PM +0900, Minchan Kim wrote:
> [...]
> +unsigned long reclaim_pages(struct list_head *page_list)
> +{
> +	unsigned long nr_reclaimed = 0;
> +	LIST_HEAD(node_page_list);
> +	struct reclaim_stat dummy_stat;
> +	struct scan_control sc = {
> +		.gfp_mask = GFP_KERNEL,
> +		.priority = DEF_PRIORITY,
> +		.may_writepage = 1,
> +		.may_unmap = 1,
> +		.may_swap = 1,
> +	};
> +
> +	while (!list_empty(page_list)) {
> +		struct page *page;
> +
> +		page = lru_to_page(page_list);
> +		list_move(&page->lru, &node_page_list);
> +		nr_reclaimed += shrink_page_list(&node_page_list,
> +						page_pgdat(page),
> +						&sc, TTU_IGNORE_ACCESS,
> +						&dummy_stat, true);
> +		if (!list_empty(&node_page_list)) {
> +			struct page *page = lru_to_page(&node_page_list);
> +
> +			list_del(&page->lru);
> +			putback_lru_page(page);
> +
> +		}
> +	}

Awesome, this is way more readable now. Thanks for the cleanup!

Regarding the loop, for the vast majority of instances, pages on
page_list will actually be from the same node. It would make sense to
do batching here and collect pages until last_pgdat != pgdat. That
should reduce the number of tlb flushes and memcg uncharge flushes in
shrink_page_list().


* Re: [PATCH v1 4/4] mm: introduce MADV_PAGEOUT
  2019-06-03 20:39   ` Johannes Weiner
@ 2019-06-04  3:55     ` Minchan Kim
  0 siblings, 0 replies; 15+ messages in thread
From: Minchan Kim @ 2019-06-04  3:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg, christian, oleksandr, hdanton

On Mon, Jun 03, 2019 at 04:39:11PM -0400, Johannes Weiner wrote:
> On Mon, Jun 03, 2019 at 02:36:55PM +0900, Minchan Kim wrote:
> > [...]
> > +unsigned long reclaim_pages(struct list_head *page_list)
> > +{
> > +	unsigned long nr_reclaimed = 0;
> > +	LIST_HEAD(node_page_list);
> > +	struct reclaim_stat dummy_stat;
> > +	struct scan_control sc = {
> > +		.gfp_mask = GFP_KERNEL,
> > +		.priority = DEF_PRIORITY,
> > +		.may_writepage = 1,
> > +		.may_unmap = 1,
> > +		.may_swap = 1,
> > +	};
> > +
> > +	while (!list_empty(page_list)) {
> > +		struct page *page;
> > +
> > +		page = lru_to_page(page_list);
> > +		list_move(&page->lru, &node_page_list);
> > +		nr_reclaimed += shrink_page_list(&node_page_list,
> > +						page_pgdat(page),
> > +						&sc, TTU_IGNORE_ACCESS,
> > +						&dummy_stat, true);
> > +		if (!list_empty(&node_page_list)) {
> > +			struct page *page = lru_to_page(&node_page_list);
> > +
> > +			list_del(&page->lru);
> > +			putback_lru_page(page);
> > +
> > +		}
> > +	}
> 
> Awesome, this is way more readable now. Thanks for the cleanup!
> 
> Regarding the loop, for the vast majority of instances, pages on
> page_list will actually be from the same node. It would make sense to
> do batching here and collect pages until last_pgdat != pgdat. That
> should reduce the number of TLB flushes and memcg uncharge flushes in
> shrink_page_list().

Sure, Thanks!

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1 4/4] mm: introduce MADV_PAGEOUT
  2019-06-03  5:36 ` [PATCH v1 4/4] mm: introduce MADV_PAGEOUT Minchan Kim
  2019-06-03 20:39   ` Johannes Weiner
@ 2019-06-04 15:38   ` Christopher Lameter
  1 sibling, 0 replies; 15+ messages in thread
From: Christopher Lameter @ 2019-06-04 15:38 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg, christian, oleksandr, hdanton

On Mon, 3 Jun 2019, Minchan Kim wrote:

> @@ -415,6 +416,128 @@ static long madvise_cold(struct vm_area_struct *vma,
>  	return 0;
>  }
>
> +static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> +				unsigned long end, struct mm_walk *walk)
> +{
> +	pte_t *orig_pte, *pte, ptent;
> +	spinlock_t *ptl;
> +	LIST_HEAD(page_list);
> +	struct page *page;
> +	int isolated = 0;
> +	struct vm_area_struct *vma = walk->vma;
> +	unsigned long next;
> +
> +	if (fatal_signal_pending(current))
> +		return -EINTR;
> +
> +	next = pmd_addr_end(addr, end);
> +	if (pmd_trans_huge(*pmd)) {
> +		ptl = pmd_trans_huge_lock(pmd, vma);
> +		if (!ptl)
> +			return 0;
> +
> +		if (is_huge_zero_pmd(*pmd))
> +			goto huge_unlock;
> +
> +		page = pmd_page(*pmd);
> +		if (page_mapcount(page) > 1)
> +			goto huge_unlock;
> +
> +		if (next - addr != HPAGE_PMD_SIZE) {
> +			int err;
> +
> +			get_page(page);
> +			spin_unlock(ptl);
> +			lock_page(page);
> +			err = split_huge_page(page);
> +			unlock_page(page);
> +			put_page(page);
> +			if (!err)
> +				goto regular_page;
> +			return 0;
> +		}

I have seen this before multiple times. Is there a way to avoid
replicating the whole shebang?


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1 1/4] mm: introduce MADV_COLD
  2019-06-03  5:36 ` [PATCH v1 1/4] mm: introduce MADV_COLD Minchan Kim
@ 2019-06-04 20:38   ` Joel Fernandes
  2019-06-10 10:09     ` Minchan Kim
  0 siblings, 1 reply; 15+ messages in thread
From: Joel Fernandes @ 2019-06-04 20:38 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Johannes Weiner, Tim Murray, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg, christian, oleksandr, hdanton

On Mon, Jun 03, 2019 at 02:36:52PM +0900, Minchan Kim wrote:
> When a process expects no accesses to a certain memory range, it could
> give a hint to kernel that the pages can be reclaimed when memory pressure
> happens but data should be preserved for future use.  This could reduce
> workingset eviction so it ends up increasing performance.
> 
> This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> MADV_COLD can be used by a process to mark a memory range as not expected
> to be used in the near future. The hint can help kernel in deciding which
> pages to evict early during memory pressure.
> 
> It works for all LRU pages, like MADV_[DONTNEED|FREE]. IOW, it moves
> 
> 	active file page -> inactive file LRU
> 	active anon page -> inactive anon LRU
> 
> Unlike MADV_FREE, it doesn't move active anonymous pages to the
> inactive file LRU's head because MADV_COLD has slightly different
> semantics. MADV_FREE means it's okay to discard the page under memory
> pressure because its content is *garbage*, so freeing such pages has
> almost zero overhead: we don't need to swap them out, and a later
> access causes only a minor fault. Thus, it makes sense to put those
> freeable pages on the inactive file LRU to compete with other
> used-once pages; it could even give them a bonus so they are reclaimed
> on a swapless system. However, MADV_COLD doesn't mean the content is
> garbage, so reclaiming such pages eventually requires swap-out/in.
> So it's better to move them to the inactive anon LRU list, not the
> file LRU. Furthermore, this helps avoid unnecessary scanning of cold
> anonymous pages if the system doesn't have a swap device.
> 
> All error rules are the same as for MADV_DONTNEED.
> 
> Note:
> This hint works only with private pages (IOW, page_mapcount(page) < 2)
> because a shared page has a higher chance of being accessed from other
> processes sharing it even after the caller resets the reference bits.
> That ends up preventing reclaim of the page and wastes CPU cycles.
> 
> * RFCv2
>  * add more description - mhocko
> 
> * RFCv1
>  * renaming from MADV_COOL to MADV_COLD - hannes
> 
> * internal review
>  * use clear_page_young in deactivate_page - joelaf
>  * Revise the description - surenb
>  * Renaming from MADV_WARM to MADV_COOL - surenb
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  include/linux/page-flags.h             |   1 +
>  include/linux/page_idle.h              |  15 ++++
>  include/linux/swap.h                   |   1 +
>  include/uapi/asm-generic/mman-common.h |   1 +
>  mm/internal.h                          |   2 +-
>  mm/madvise.c                           | 115 ++++++++++++++++++++++++-
>  mm/oom_kill.c                          |   2 +-
>  mm/swap.c                              |  43 +++++++++
>  8 files changed, 176 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 9f8712a4b1a5..58b06654c8dd 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -424,6 +424,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
>  TESTPAGEFLAG(Young, young, PF_ANY)
>  SETPAGEFLAG(Young, young, PF_ANY)
>  TESTCLEARFLAG(Young, young, PF_ANY)
> +CLEARPAGEFLAG(Young, young, PF_ANY)
>  PAGEFLAG(Idle, idle, PF_ANY)
>  #endif
>  
> diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
> index 1e894d34bdce..f3f43b317150 100644
> --- a/include/linux/page_idle.h
> +++ b/include/linux/page_idle.h
> @@ -19,6 +19,11 @@ static inline void set_page_young(struct page *page)
>  	SetPageYoung(page);
>  }
>  
> +static inline void clear_page_young(struct page *page)
> +{
> +	ClearPageYoung(page);
> +}
> +
>  static inline bool test_and_clear_page_young(struct page *page)
>  {
>  	return TestClearPageYoung(page);
> @@ -65,6 +70,16 @@ static inline void set_page_young(struct page *page)
>  	set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
>  }
>  
> +static void clear_page_young(struct page *page)
> +{
> +	struct page_ext *page_ext = lookup_page_ext(page);
> +
> +	if (unlikely(!page_ext))
> +		return;
> +
> +	clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
> +}
> +
>  static inline bool test_and_clear_page_young(struct page *page)
>  {
>  	struct page_ext *page_ext = lookup_page_ext(page);
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index de2c67a33b7e..0ce997edb8bb 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
>  extern void lru_add_drain_all(void);
>  extern void rotate_reclaimable_page(struct page *page);
>  extern void deactivate_file_page(struct page *page);
> +extern void deactivate_page(struct page *page);
>  extern void mark_page_lazyfree(struct page *page);
>  extern void swap_setup(void);
>  
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index bea0278f65ab..1190f4e7f7b9 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -43,6 +43,7 @@
>  #define MADV_SEQUENTIAL	2		/* expect sequential page references */
>  #define MADV_WILLNEED	3		/* will need these pages */
>  #define MADV_DONTNEED	4		/* don't need these pages */
> +#define MADV_COLD	5		/* deactivate these pages */
>  
>  /* common parameters: try to keep these consistent across architectures */
>  #define MADV_FREE	8		/* free pages only if memory pressure */
> diff --git a/mm/internal.h b/mm/internal.h
> index 9eeaf2b95166..75a4f96ec0fb 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -43,7 +43,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf);
>  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>  		unsigned long floor, unsigned long ceiling);
>  
> -static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
> +static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
>  {
>  	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
>  }
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 628022e674a7..ab158766858a 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -40,6 +40,7 @@ static int madvise_need_mmap_write(int behavior)
>  	case MADV_REMOVE:
>  	case MADV_WILLNEED:
>  	case MADV_DONTNEED:
> +	case MADV_COLD:
>  	case MADV_FREE:
>  		return 0;
>  	default:
> @@ -307,6 +308,113 @@ static long madvise_willneed(struct vm_area_struct *vma,
>  	return 0;
>  }
>  
> +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> +				unsigned long end, struct mm_walk *walk)
> +{
> +	pte_t *orig_pte, *pte, ptent;
> +	spinlock_t *ptl;
> +	struct page *page;
> +	struct vm_area_struct *vma = walk->vma;
> +	unsigned long next;
> +
> +	next = pmd_addr_end(addr, end);
> +	if (pmd_trans_huge(*pmd)) {
> +		ptl = pmd_trans_huge_lock(pmd, vma);
> +		if (!ptl)
> +			return 0;
> +
> +		if (is_huge_zero_pmd(*pmd))
> +			goto huge_unlock;
> +
> +		page = pmd_page(*pmd);
> +		if (page_mapcount(page) > 1)
> +			goto huge_unlock;
> +
> +		if (next - addr != HPAGE_PMD_SIZE) {
> +			int err;
> +
> +			get_page(page);
> +			spin_unlock(ptl);
> +			lock_page(page);
> +			err = split_huge_page(page);
> +			unlock_page(page);
> +			put_page(page);
> +			if (!err)
> +				goto regular_page;
> +			return 0;
> +		}
> +
> +		pmdp_test_and_clear_young(vma, addr, pmd);
> +		deactivate_page(page);
> +huge_unlock:
> +		spin_unlock(ptl);
> +		return 0;
> +	}
> +
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +
> +regular_page:
> +	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> +		ptent = *pte;
> +
> +		if (pte_none(ptent))
> +			continue;
> +
> +		if (!pte_present(ptent))
> +			continue;
> +
> +		page = vm_normal_page(vma, addr, ptent);
> +		if (!page)
> +			continue;
> +
> +		if (page_mapcount(page) > 1)
> +			continue;
> +
> +		ptep_test_and_clear_young(vma, addr, pte);

Wondering here how this interacts with idle page tracking. Since the young
flag is cleared by the cold hint, page_referenced_one() or
page_idle_clear_pte_refs_one() will not be able to clear the page-idle flag
if it was previously set, because they no longer know that the page was
actively referenced.

Should you also be clearing the page-idle flag here if the ptep young/accessed
bit was previously set, just so that page-idle tracking works smoothly when
this hint is concurrently applied?

> +		deactivate_page(page);
> +	}
> +
> +	pte_unmap_unlock(orig_pte, ptl);
> +	cond_resched();
> +
> +	return 0;
> +}
> +
> +static void madvise_cold_page_range(struct mmu_gather *tlb,
> +			     struct vm_area_struct *vma,
> +			     unsigned long addr, unsigned long end)
> +{
> +	struct mm_walk cool_walk = {
> +		.pmd_entry = madvise_cold_pte_range,
> +		.mm = vma->vm_mm,
> +	};
> +
> +	tlb_start_vma(tlb, vma);
> +	walk_page_range(addr, end, &cool_walk);
> +	tlb_end_vma(tlb, vma);
> +}
> +
> +static long madvise_cold(struct vm_area_struct *vma,
> +			struct vm_area_struct **prev,
> +			unsigned long start_addr, unsigned long end_addr)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct mmu_gather tlb;
> +
> +	*prev = vma;
> +	if (!can_madv_lru_vma(vma))
> +		return -EINVAL;
> +
> +	lru_add_drain();
> +	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> +	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
> +	tlb_finish_mmu(&tlb, start_addr, end_addr);
> +
> +	return 0;
> +}
> +
>  static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  				unsigned long end, struct mm_walk *walk)
>  
> @@ -519,7 +627,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
>  				  int behavior)
>  {
>  	*prev = vma;
> -	if (!can_madv_dontneed_vma(vma))
> +	if (!can_madv_lru_vma(vma))
>  		return -EINVAL;
>  
>  	if (!userfaultfd_remove(vma, start, end)) {
> @@ -541,7 +649,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
>  			 */
>  			return -ENOMEM;
>  		}
> -		if (!can_madv_dontneed_vma(vma))
> +		if (!can_madv_lru_vma(vma))
>  			return -EINVAL;
>  		if (end > vma->vm_end) {
>  			/*
> @@ -695,6 +803,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  		return madvise_remove(vma, prev, start, end);
>  	case MADV_WILLNEED:
>  		return madvise_willneed(vma, prev, start, end);
> +	case MADV_COLD:
> +		return madvise_cold(vma, prev, start, end);
>  	case MADV_FREE:
>  	case MADV_DONTNEED:
>  		return madvise_dontneed_free(vma, prev, start, end, behavior);
> @@ -716,6 +826,7 @@ madvise_behavior_valid(int behavior)
>  	case MADV_WILLNEED:
>  	case MADV_DONTNEED:
>  	case MADV_FREE:
> +	case MADV_COLD:
>  #ifdef CONFIG_KSM
>  	case MADV_MERGEABLE:
>  	case MADV_UNMERGEABLE:
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5a58778c91d4..f73d5f5145f0 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -515,7 +515,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
>  	set_bit(MMF_UNSTABLE, &mm->flags);
>  
>  	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> -		if (!can_madv_dontneed_vma(vma))
> +		if (!can_madv_lru_vma(vma))
>  			continue;
>  
>  		/*
> diff --git a/mm/swap.c b/mm/swap.c
> index 7b079976cbec..cebedab15aa2 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -47,6 +47,7 @@ int page_cluster;
>  static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
>  static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
>  static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
> +static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
>  static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
>  #ifdef CONFIG_SMP
>  static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
> @@ -538,6 +539,23 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
>  	update_page_reclaim_stat(lruvec, file, 0);
>  }
>  
> +static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
> +			    void *arg)
> +{
> +	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> +		int file = page_is_file_cache(page);
> +		int lru = page_lru_base_type(page);
> +
> +		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
> +		ClearPageActive(page);
> +		ClearPageReferenced(page);
> +		clear_page_young(page);

I was curious, how does the above interact with clear_refs?

It appears that a range marked as cold will look unreferenced (since the
referenced flag is cleared) even though it may have been referenced in the
past. At the very least, I believe this behavior should be documented
somewhere.

I believe it is best for us to make the clear_refs technique behave
similarly to idle-page tracking for the purposes of figuring out which pages
are referenced, so that clear_refs is more immune to this hint.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1 1/4] mm: introduce MADV_COLD
  2019-06-04 20:38   ` Joel Fernandes
@ 2019-06-10 10:09     ` Minchan Kim
  2019-06-12 17:21       ` Joel Fernandes
  0 siblings, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2019-06-10 10:09 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Johannes Weiner, Tim Murray, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg, christian, oleksandr, hdanton

Hi Joel,

On Tue, Jun 04, 2019 at 04:38:41PM -0400, Joel Fernandes wrote:
> On Mon, Jun 03, 2019 at 02:36:52PM +0900, Minchan Kim wrote:
> > When a process expects no accesses to a certain memory range, it could
> > give a hint to kernel that the pages can be reclaimed when memory pressure
> > happens but data should be preserved for future use.  This could reduce
> > workingset eviction so it ends up increasing performance.
> > 
> > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > MADV_COLD can be used by a process to mark a memory range as not expected
> > to be used in the near future. The hint can help kernel in deciding which
> > pages to evict early during memory pressure.
> > 
> > It works for all LRU pages, like MADV_[DONTNEED|FREE]. IOW, it moves
> > 
> > 	active file page -> inactive file LRU
> > 	active anon page -> inactive anon LRU
> > 
> > Unlike MADV_FREE, it doesn't move active anonymous pages to the
> > inactive file LRU's head because MADV_COLD has slightly different
> > semantics. MADV_FREE means it's okay to discard the page under memory
> > pressure because its content is *garbage*, so freeing such pages has
> > almost zero overhead: we don't need to swap them out, and a later
> > access causes only a minor fault. Thus, it makes sense to put those
> > freeable pages on the inactive file LRU to compete with other
> > used-once pages; it could even give them a bonus so they are reclaimed
> > on a swapless system. However, MADV_COLD doesn't mean the content is
> > garbage, so reclaiming such pages eventually requires swap-out/in.
> > So it's better to move them to the inactive anon LRU list, not the
> > file LRU. Furthermore, this helps avoid unnecessary scanning of cold
> > anonymous pages if the system doesn't have a swap device.
> > 
> > All error rules are the same as for MADV_DONTNEED.
> > 
> > Note:
> > This hint works only with private pages (IOW, page_mapcount(page) < 2)
> > because a shared page has a higher chance of being accessed from other
> > processes sharing it even after the caller resets the reference bits.
> > That ends up preventing reclaim of the page and wastes CPU cycles.
> > 
> > * RFCv2
> >  * add more description - mhocko
> > 
> > * RFCv1
> >  * renaming from MADV_COOL to MADV_COLD - hannes
> > 
> > * internal review
> >  * use clear_page_young in deactivate_page - joelaf
> >  * Revise the description - surenb
> >  * Renaming from MADV_WARM to MADV_COOL - surenb
> > 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  include/linux/page-flags.h             |   1 +
> >  include/linux/page_idle.h              |  15 ++++
> >  include/linux/swap.h                   |   1 +
> >  include/uapi/asm-generic/mman-common.h |   1 +
> >  mm/internal.h                          |   2 +-
> >  mm/madvise.c                           | 115 ++++++++++++++++++++++++-
> >  mm/oom_kill.c                          |   2 +-
> >  mm/swap.c                              |  43 +++++++++
> >  8 files changed, 176 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 9f8712a4b1a5..58b06654c8dd 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -424,6 +424,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
> >  TESTPAGEFLAG(Young, young, PF_ANY)
> >  SETPAGEFLAG(Young, young, PF_ANY)
> >  TESTCLEARFLAG(Young, young, PF_ANY)
> > +CLEARPAGEFLAG(Young, young, PF_ANY)
> >  PAGEFLAG(Idle, idle, PF_ANY)
> >  #endif
> >  
> > diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
> > index 1e894d34bdce..f3f43b317150 100644
> > --- a/include/linux/page_idle.h
> > +++ b/include/linux/page_idle.h
> > @@ -19,6 +19,11 @@ static inline void set_page_young(struct page *page)
> >  	SetPageYoung(page);
> >  }
> >  
> > +static inline void clear_page_young(struct page *page)
> > +{
> > +	ClearPageYoung(page);
> > +}
> > +
> >  static inline bool test_and_clear_page_young(struct page *page)
> >  {
> >  	return TestClearPageYoung(page);
> > @@ -65,6 +70,16 @@ static inline void set_page_young(struct page *page)
> >  	set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
> >  }
> >  
> > +static void clear_page_young(struct page *page)
> > +{
> > +	struct page_ext *page_ext = lookup_page_ext(page);
> > +
> > +	if (unlikely(!page_ext))
> > +		return;
> > +
> > +	clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
> > +}
> > +
> >  static inline bool test_and_clear_page_young(struct page *page)
> >  {
> >  	struct page_ext *page_ext = lookup_page_ext(page);
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index de2c67a33b7e..0ce997edb8bb 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
> >  extern void lru_add_drain_all(void);
> >  extern void rotate_reclaimable_page(struct page *page);
> >  extern void deactivate_file_page(struct page *page);
> > +extern void deactivate_page(struct page *page);
> >  extern void mark_page_lazyfree(struct page *page);
> >  extern void swap_setup(void);
> >  
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index bea0278f65ab..1190f4e7f7b9 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -43,6 +43,7 @@
> >  #define MADV_SEQUENTIAL	2		/* expect sequential page references */
> >  #define MADV_WILLNEED	3		/* will need these pages */
> >  #define MADV_DONTNEED	4		/* don't need these pages */
> > +#define MADV_COLD	5		/* deactivate these pages */
> >  
> >  /* common parameters: try to keep these consistent across architectures */
> >  #define MADV_FREE	8		/* free pages only if memory pressure */
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 9eeaf2b95166..75a4f96ec0fb 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -43,7 +43,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf);
> >  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> >  		unsigned long floor, unsigned long ceiling);
> >  
> > -static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
> > +static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
> >  {
> >  	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
> >  }
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 628022e674a7..ab158766858a 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -40,6 +40,7 @@ static int madvise_need_mmap_write(int behavior)
> >  	case MADV_REMOVE:
> >  	case MADV_WILLNEED:
> >  	case MADV_DONTNEED:
> > +	case MADV_COLD:
> >  	case MADV_FREE:
> >  		return 0;
> >  	default:
> > @@ -307,6 +308,113 @@ static long madvise_willneed(struct vm_area_struct *vma,
> >  	return 0;
> >  }
> >  
> > +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> > +				unsigned long end, struct mm_walk *walk)
> > +{
> > +	pte_t *orig_pte, *pte, ptent;
> > +	spinlock_t *ptl;
> > +	struct page *page;
> > +	struct vm_area_struct *vma = walk->vma;
> > +	unsigned long next;
> > +
> > +	next = pmd_addr_end(addr, end);
> > +	if (pmd_trans_huge(*pmd)) {
> > +		ptl = pmd_trans_huge_lock(pmd, vma);
> > +		if (!ptl)
> > +			return 0;
> > +
> > +		if (is_huge_zero_pmd(*pmd))
> > +			goto huge_unlock;
> > +
> > +		page = pmd_page(*pmd);
> > +		if (page_mapcount(page) > 1)
> > +			goto huge_unlock;
> > +
> > +		if (next - addr != HPAGE_PMD_SIZE) {
> > +			int err;
> > +
> > +			get_page(page);
> > +			spin_unlock(ptl);
> > +			lock_page(page);
> > +			err = split_huge_page(page);
> > +			unlock_page(page);
> > +			put_page(page);
> > +			if (!err)
> > +				goto regular_page;
> > +			return 0;
> > +		}
> > +
> > +		pmdp_test_and_clear_young(vma, addr, pmd);
> > +		deactivate_page(page);
> > +huge_unlock:
> > +		spin_unlock(ptl);
> > +		return 0;
> > +	}
> > +
> > +	if (pmd_trans_unstable(pmd))
> > +		return 0;
> > +
> > +regular_page:
> > +	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> > +		ptent = *pte;
> > +
> > +		if (pte_none(ptent))
> > +			continue;
> > +
> > +		if (!pte_present(ptent))
> > +			continue;
> > +
> > +		page = vm_normal_page(vma, addr, ptent);
> > +		if (!page)
> > +			continue;
> > +
> > +		if (page_mapcount(page) > 1)
> > +			continue;
> > +
> > +		ptep_test_and_clear_young(vma, addr, pte);
> 
> Wondering here how this interacts with idle page tracking. Since the
> young flag is cleared by the cold hint, page_referenced_one() or
> page_idle_clear_pte_refs_one() will not be able to clear the page-idle
> flag if it was previously set, because they no longer know that the
> page was actively referenced.

ptep_test_and_clear_young doesn't change PG_idle/PG_young, so idle page
tracking isn't affected.

> 
> Should you also be clearing the page-idle flag here if the ptep young/accessed
> bit was previously set, just so that page-idle tracking works smoothly when
> this hint is concurrently applied?

deactivate_page removes the PG_young bit so that the page can be reclaimed.
Am I missing your point?

> 
> > +		deactivate_page(page);
> > +	}
> > +
> > +	pte_unmap_unlock(orig_pte, ptl);
> > +	cond_resched();
> > +
> > +	return 0;
> > +}
> > +
> > +static void madvise_cold_page_range(struct mmu_gather *tlb,
> > +			     struct vm_area_struct *vma,
> > +			     unsigned long addr, unsigned long end)
> > +{
> > +	struct mm_walk cool_walk = {
> > +		.pmd_entry = madvise_cold_pte_range,
> > +		.mm = vma->vm_mm,
> > +	};
> > +
> > +	tlb_start_vma(tlb, vma);
> > +	walk_page_range(addr, end, &cool_walk);
> > +	tlb_end_vma(tlb, vma);
> > +}
> > +
> > +static long madvise_cold(struct vm_area_struct *vma,
> > +			struct vm_area_struct **prev,
> > +			unsigned long start_addr, unsigned long end_addr)
> > +{
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	struct mmu_gather tlb;
> > +
> > +	*prev = vma;
> > +	if (!can_madv_lru_vma(vma))
> > +		return -EINVAL;
> > +
> > +	lru_add_drain();
> > +	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> > +	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
> > +	tlb_finish_mmu(&tlb, start_addr, end_addr);
> > +
> > +	return 0;
> > +}
> > +
> >  static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >  				unsigned long end, struct mm_walk *walk)
> >  
> > @@ -519,7 +627,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> >  				  int behavior)
> >  {
> >  	*prev = vma;
> > -	if (!can_madv_dontneed_vma(vma))
> > +	if (!can_madv_lru_vma(vma))
> >  		return -EINVAL;
> >  
> >  	if (!userfaultfd_remove(vma, start, end)) {
> > @@ -541,7 +649,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> >  			 */
> >  			return -ENOMEM;
> >  		}
> > -		if (!can_madv_dontneed_vma(vma))
> > +		if (!can_madv_lru_vma(vma))
> >  			return -EINVAL;
> >  		if (end > vma->vm_end) {
> >  			/*
> > @@ -695,6 +803,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >  		return madvise_remove(vma, prev, start, end);
> >  	case MADV_WILLNEED:
> >  		return madvise_willneed(vma, prev, start, end);
> > +	case MADV_COLD:
> > +		return madvise_cold(vma, prev, start, end);
> >  	case MADV_FREE:
> >  	case MADV_DONTNEED:
> >  		return madvise_dontneed_free(vma, prev, start, end, behavior);
> > @@ -716,6 +826,7 @@ madvise_behavior_valid(int behavior)
> >  	case MADV_WILLNEED:
> >  	case MADV_DONTNEED:
> >  	case MADV_FREE:
> > +	case MADV_COLD:
> >  #ifdef CONFIG_KSM
> >  	case MADV_MERGEABLE:
> >  	case MADV_UNMERGEABLE:
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 5a58778c91d4..f73d5f5145f0 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -515,7 +515,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
> >  	set_bit(MMF_UNSTABLE, &mm->flags);
> >  
> >  	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> > -		if (!can_madv_dontneed_vma(vma))
> > +		if (!can_madv_lru_vma(vma))
> >  			continue;
> >  
> >  		/*
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 7b079976cbec..cebedab15aa2 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -47,6 +47,7 @@ int page_cluster;
> >  static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
> >  static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
> >  static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
> > +static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
> >  static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
> >  #ifdef CONFIG_SMP
> >  static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
> > @@ -538,6 +539,23 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
> >  	update_page_reclaim_stat(lruvec, file, 0);
> >  }
> >  
> > +static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
> > +			    void *arg)
> > +{
> > +	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> > +		int file = page_is_file_cache(page);
> > +		int lru = page_lru_base_type(page);
> > +
> > +		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
> > +		ClearPageActive(page);
> > +		ClearPageReferenced(page);
> > +		clear_page_young(page);
> 
> I was curious, how does the above interact with clear_refs?

clear_refs can no longer see the access. However, that is intentional
because we couldn't reclaim the page otherwise.

> 
> It appears that a range that is marked as cold will appear to be unreferenced
> (since referenced flag is cleared) even though it may have been referenced in
> the past. Very least, I believe this behavior should be documented somewhere.

Okay, I will add some comment.

Thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1 1/4] mm: introduce MADV_COLD
  2019-06-10 10:09     ` Minchan Kim
@ 2019-06-12 17:21       ` Joel Fernandes
  2019-06-13  4:48         ` Minchan Kim
  0 siblings, 1 reply; 15+ messages in thread
From: Joel Fernandes @ 2019-06-12 17:21 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Johannes Weiner, Tim Murray, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg, christian, oleksandr, hdanton

On Mon, Jun 10, 2019 at 07:09:04PM +0900, Minchan Kim wrote:
> Hi Joel,
> 
> On Tue, Jun 04, 2019 at 04:38:41PM -0400, Joel Fernandes wrote:
> > On Mon, Jun 03, 2019 at 02:36:52PM +0900, Minchan Kim wrote:
> > > When a process expects no accesses to a certain memory range, it can
> > > give the kernel a hint that the pages can be reclaimed when memory
> > > pressure happens but that the data should be preserved for future use.
> > > This could reduce workingset eviction and so end up increasing
> > > performance.
> > > 
> > > This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
> > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > to be used in the near future. The hint can help the kernel decide which
> > > pages to evict early during memory pressure.
> > > 
> > > It works for all LRU pages like MADV_[DONTNEED|FREE]. IOW, it moves
> > > 
> > > 	active file page -> inactive file LRU
> > > 	active anon page -> inactive anon LRU
> > > 
> > > Unlike MADV_FREE, it doesn't move active anonymous pages to the head of
> > > the inactive file LRU because MADV_COLD has slightly different semantics.
> > > MADV_FREE means it's okay to discard the pages under memory pressure
> > > because their content is *garbage*, so freeing them has almost zero
> > > overhead: we don't need to swap them out, and a later access causes only
> > > a minor fault. Thus, it makes sense to put those freeable pages on the
> > > inactive file LRU to compete with other used-once pages, and it could
> > > even give them a bonus so they are reclaimed on a swapless system.
> > > However, MADV_COLD doesn't mean garbage, so reclaiming those pages
> > > eventually requires swap-out/in. It's therefore better to move them to
> > > the inactive anon LRU list, not the file LRU. Furthermore, this helps
> > > avoid unnecessary scanning of cold anonymous pages if the system doesn't
> > > have a swap device.
> > > 
> > > All error rules are the same as for MADV_DONTNEED.
> > > 
> > > Note:
> > > This hint works only with private pages (IOW, page_mapcount(page) < 2)
> > > because a shared page is more likely to be accessed by the other
> > > processes sharing it even after the caller resets the reference bits,
> > > which ends up preventing reclaim of the page and wasting CPU cycles.
> > > 
> > > * RFCv2
> > >  * add more description - mhocko
> > > 
> > > * RFCv1
> > >  * renaming from MADV_COOL to MADV_COLD - hannes
> > > 
> > > * internal review
> > >  * use clear_page_young in deactivate_page - joelaf
> > >  * Revise the description - surenb
> > >  * Renaming from MADV_WARM to MADV_COOL - surenb
> > > 
> > > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > ---
> > >  include/linux/page-flags.h             |   1 +
> > >  include/linux/page_idle.h              |  15 ++++
> > >  include/linux/swap.h                   |   1 +
> > >  include/uapi/asm-generic/mman-common.h |   1 +
> > >  mm/internal.h                          |   2 +-
> > >  mm/madvise.c                           | 115 ++++++++++++++++++++++++-
> > >  mm/oom_kill.c                          |   2 +-
> > >  mm/swap.c                              |  43 +++++++++
> > >  8 files changed, 176 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > > index 9f8712a4b1a5..58b06654c8dd 100644
> > > --- a/include/linux/page-flags.h
> > > +++ b/include/linux/page-flags.h
> > > @@ -424,6 +424,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
> > >  TESTPAGEFLAG(Young, young, PF_ANY)
> > >  SETPAGEFLAG(Young, young, PF_ANY)
> > >  TESTCLEARFLAG(Young, young, PF_ANY)
> > > +CLEARPAGEFLAG(Young, young, PF_ANY)
> > >  PAGEFLAG(Idle, idle, PF_ANY)
> > >  #endif
> > >  
> > > diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
> > > index 1e894d34bdce..f3f43b317150 100644
> > > --- a/include/linux/page_idle.h
> > > +++ b/include/linux/page_idle.h
> > > @@ -19,6 +19,11 @@ static inline void set_page_young(struct page *page)
> > >  	SetPageYoung(page);
> > >  }
> > >  
> > > +static inline void clear_page_young(struct page *page)
> > > +{
> > > +	ClearPageYoung(page);
> > > +}
> > > +
> > >  static inline bool test_and_clear_page_young(struct page *page)
> > >  {
> > >  	return TestClearPageYoung(page);
> > > @@ -65,6 +70,16 @@ static inline void set_page_young(struct page *page)
> > >  	set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
> > >  }
> > >  
> > > +static void clear_page_young(struct page *page)
> > > +{
> > > +	struct page_ext *page_ext = lookup_page_ext(page);
> > > +
> > > +	if (unlikely(!page_ext))
> > > +		return;
> > > +
> > > +	clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
> > > +}
> > > +
> > >  static inline bool test_and_clear_page_young(struct page *page)
> > >  {
> > >  	struct page_ext *page_ext = lookup_page_ext(page);
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index de2c67a33b7e..0ce997edb8bb 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
> > >  extern void lru_add_drain_all(void);
> > >  extern void rotate_reclaimable_page(struct page *page);
> > >  extern void deactivate_file_page(struct page *page);
> > > +extern void deactivate_page(struct page *page);
> > >  extern void mark_page_lazyfree(struct page *page);
> > >  extern void swap_setup(void);
> > >  
> > > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > > index bea0278f65ab..1190f4e7f7b9 100644
> > > --- a/include/uapi/asm-generic/mman-common.h
> > > +++ b/include/uapi/asm-generic/mman-common.h
> > > @@ -43,6 +43,7 @@
> > >  #define MADV_SEQUENTIAL	2		/* expect sequential page references */
> > >  #define MADV_WILLNEED	3		/* will need these pages */
> > >  #define MADV_DONTNEED	4		/* don't need these pages */
> > > +#define MADV_COLD	5		/* deactivate these pages */
> > >  
> > >  /* common parameters: try to keep these consistent across architectures */
> > >  #define MADV_FREE	8		/* free pages only if memory pressure */
> > > diff --git a/mm/internal.h b/mm/internal.h
> > > index 9eeaf2b95166..75a4f96ec0fb 100644
> > > --- a/mm/internal.h
> > > +++ b/mm/internal.h
> > > @@ -43,7 +43,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf);
> > >  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> > >  		unsigned long floor, unsigned long ceiling);
> > >  
> > > -static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
> > > +static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
> > >  {
> > >  	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
> > >  }
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 628022e674a7..ab158766858a 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -40,6 +40,7 @@ static int madvise_need_mmap_write(int behavior)
> > >  	case MADV_REMOVE:
> > >  	case MADV_WILLNEED:
> > >  	case MADV_DONTNEED:
> > > +	case MADV_COLD:
> > >  	case MADV_FREE:
> > >  		return 0;
> > >  	default:
> > > @@ -307,6 +308,113 @@ static long madvise_willneed(struct vm_area_struct *vma,
> > >  	return 0;
> > >  }
> > >  
> > > +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> > > +				unsigned long end, struct mm_walk *walk)
> > > +{
> > > +	pte_t *orig_pte, *pte, ptent;
> > > +	spinlock_t *ptl;
> > > +	struct page *page;
> > > +	struct vm_area_struct *vma = walk->vma;
> > > +	unsigned long next;
> > > +
> > > +	next = pmd_addr_end(addr, end);
> > > +	if (pmd_trans_huge(*pmd)) {
> > > +		ptl = pmd_trans_huge_lock(pmd, vma);
> > > +		if (!ptl)
> > > +			return 0;
> > > +
> > > +		if (is_huge_zero_pmd(*pmd))
> > > +			goto huge_unlock;
> > > +
> > > +		page = pmd_page(*pmd);
> > > +		if (page_mapcount(page) > 1)
> > > +			goto huge_unlock;
> > > +
> > > +		if (next - addr != HPAGE_PMD_SIZE) {
> > > +			int err;
> > > +
> > > +			get_page(page);
> > > +			spin_unlock(ptl);
> > > +			lock_page(page);
> > > +			err = split_huge_page(page);
> > > +			unlock_page(page);
> > > +			put_page(page);
> > > +			if (!err)
> > > +				goto regular_page;
> > > +			return 0;
> > > +		}
> > > +
> > > +		pmdp_test_and_clear_young(vma, addr, pmd);
> > > +		deactivate_page(page);
> > > +huge_unlock:
> > > +		spin_unlock(ptl);
> > > +		return 0;
> > > +	}
> > > +
> > > +	if (pmd_trans_unstable(pmd))
> > > +		return 0;
> > > +
> > > +regular_page:
> > > +	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > +	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> > > +		ptent = *pte;
> > > +
> > > +		if (pte_none(ptent))
> > > +			continue;
> > > +
> > > +		if (!pte_present(ptent))
> > > +			continue;
> > > +
> > > +		page = vm_normal_page(vma, addr, ptent);
> > > +		if (!page)
> > > +			continue;
> > > +
> > > +		if (page_mapcount(page) > 1)
> > > +			continue;
> > > +
> > > +		ptep_test_and_clear_young(vma, addr, pte);
> > 
> > Wondering here how it interacts with idle page tracking. Here since young
> > flag is cleared by the cold hint, page_referenced_one() or
> > page_idle_clear_pte_refs_one() will not be able to clear the page-idle flag
> > if it was previously set since it does not know any more that a page was
> > actively referenced.
> 
> ptep_test_and_clear_young doesn't change PG_idle/young, so idle page
> tracking isn't affected.

Clearing of the young bit in the PTE does affect idle tracking.

Both page_referenced_one() and page_idle_clear_pte_refs_one() check this bit.

> > bit was previously set, just so that page-idle tracking works smoothly when
> > this hint is concurrently applied?
> 
> deactivate_page will remove the PG_young bit so that the page can be reclaimed.
> Am I missing your point?

Say a process has a page whose PTE accessed bit is not set; then idle tracking
is run and PG_Idle is set. Now the page is accessed from userspace, which sets
the accessed PTE bit. Now a remote process sends this process the
process_madvise cold hint (I know your current series does not support remote
processes, but I am saying this for the future when you post that). Because you
cleared the PTE accessed bit through the hint, idle tracking will no longer
know that the page was referenced, and the user gets confused because an
accessed page appears to be idle.

I think to fix this, what you should do is clear the PG_Idle flag if the
young/accessed PTE bits are set. If PG_Idle is already cleared, then you
don't need to do anything.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1 1/4] mm: introduce MADV_COLD
  2019-06-12 17:21       ` Joel Fernandes
@ 2019-06-13  4:48         ` Minchan Kim
  2019-06-19 17:13           ` Joel Fernandes
  0 siblings, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2019-06-13  4:48 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Johannes Weiner, Tim Murray, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg, christian, oleksandr, hdanton

On Wed, Jun 12, 2019 at 01:21:04PM -0400, Joel Fernandes wrote:
> On Mon, Jun 10, 2019 at 07:09:04PM +0900, Minchan Kim wrote:
> > Hi Joel,
> > 
> > On Tue, Jun 04, 2019 at 04:38:41PM -0400, Joel Fernandes wrote:
> > > On Mon, Jun 03, 2019 at 02:36:52PM +0900, Minchan Kim wrote:
> > > > When a process expects no accesses to a certain memory range, it can
> > > > give the kernel a hint that the pages can be reclaimed when memory
> > > > pressure happens but that the data should be preserved for future use.
> > > > This could reduce workingset eviction and so end up increasing
> > > > performance.
> > > > 
> > > > This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
> > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > to be used in the near future. The hint can help the kernel decide which
> > > > pages to evict early during memory pressure.
> > > > 
> > > > It works for all LRU pages like MADV_[DONTNEED|FREE]. IOW, it moves
> > > > 
> > > > 	active file page -> inactive file LRU
> > > > 	active anon page -> inactive anon LRU
> > > > 
> > > > Unlike MADV_FREE, it doesn't move active anonymous pages to the head of
> > > > the inactive file LRU because MADV_COLD has slightly different semantics.
> > > > MADV_FREE means it's okay to discard the pages under memory pressure
> > > > because their content is *garbage*, so freeing them has almost zero
> > > > overhead: we don't need to swap them out, and a later access causes only
> > > > a minor fault. Thus, it makes sense to put those freeable pages on the
> > > > inactive file LRU to compete with other used-once pages, and it could
> > > > even give them a bonus so they are reclaimed on a swapless system.
> > > > However, MADV_COLD doesn't mean garbage, so reclaiming those pages
> > > > eventually requires swap-out/in. It's therefore better to move them to
> > > > the inactive anon LRU list, not the file LRU. Furthermore, this helps
> > > > avoid unnecessary scanning of cold anonymous pages if the system doesn't
> > > > have a swap device.
> > > > 
> > > > All error rules are the same as for MADV_DONTNEED.
> > > > 
> > > > Note:
> > > > This hint works only with private pages (IOW, page_mapcount(page) < 2)
> > > > because a shared page is more likely to be accessed by the other
> > > > processes sharing it even after the caller resets the reference bits,
> > > > which ends up preventing reclaim of the page and wasting CPU cycles.
> > > > 
> > > > * RFCv2
> > > >  * add more description - mhocko
> > > > 
> > > > * RFCv1
> > > >  * renaming from MADV_COOL to MADV_COLD - hannes
> > > > 
> > > > * internal review
> > > >  * use clear_page_young in deactivate_page - joelaf
> > > >  * Revise the description - surenb
> > > >  * Renaming from MADV_WARM to MADV_COOL - surenb
> > > > 
> > > > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > > ---
> > > >  include/linux/page-flags.h             |   1 +
> > > >  include/linux/page_idle.h              |  15 ++++
> > > >  include/linux/swap.h                   |   1 +
> > > >  include/uapi/asm-generic/mman-common.h |   1 +
> > > >  mm/internal.h                          |   2 +-
> > > >  mm/madvise.c                           | 115 ++++++++++++++++++++++++-
> > > >  mm/oom_kill.c                          |   2 +-
> > > >  mm/swap.c                              |  43 +++++++++
> > > >  8 files changed, 176 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > > > index 9f8712a4b1a5..58b06654c8dd 100644
> > > > --- a/include/linux/page-flags.h
> > > > +++ b/include/linux/page-flags.h
> > > > @@ -424,6 +424,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
> > > >  TESTPAGEFLAG(Young, young, PF_ANY)
> > > >  SETPAGEFLAG(Young, young, PF_ANY)
> > > >  TESTCLEARFLAG(Young, young, PF_ANY)
> > > > +CLEARPAGEFLAG(Young, young, PF_ANY)
> > > >  PAGEFLAG(Idle, idle, PF_ANY)
> > > >  #endif
> > > >  
> > > > diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
> > > > index 1e894d34bdce..f3f43b317150 100644
> > > > --- a/include/linux/page_idle.h
> > > > +++ b/include/linux/page_idle.h
> > > > @@ -19,6 +19,11 @@ static inline void set_page_young(struct page *page)
> > > >  	SetPageYoung(page);
> > > >  }
> > > >  
> > > > +static inline void clear_page_young(struct page *page)
> > > > +{
> > > > +	ClearPageYoung(page);
> > > > +}
> > > > +
> > > >  static inline bool test_and_clear_page_young(struct page *page)
> > > >  {
> > > >  	return TestClearPageYoung(page);
> > > > @@ -65,6 +70,16 @@ static inline void set_page_young(struct page *page)
> > > >  	set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
> > > >  }
> > > >  
> > > > +static void clear_page_young(struct page *page)
> > > > +{
> > > > +	struct page_ext *page_ext = lookup_page_ext(page);
> > > > +
> > > > +	if (unlikely(!page_ext))
> > > > +		return;
> > > > +
> > > > +	clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
> > > > +}
> > > > +
> > > >  static inline bool test_and_clear_page_young(struct page *page)
> > > >  {
> > > >  	struct page_ext *page_ext = lookup_page_ext(page);
> > > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > > index de2c67a33b7e..0ce997edb8bb 100644
> > > > --- a/include/linux/swap.h
> > > > +++ b/include/linux/swap.h
> > > > @@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
> > > >  extern void lru_add_drain_all(void);
> > > >  extern void rotate_reclaimable_page(struct page *page);
> > > >  extern void deactivate_file_page(struct page *page);
> > > > +extern void deactivate_page(struct page *page);
> > > >  extern void mark_page_lazyfree(struct page *page);
> > > >  extern void swap_setup(void);
> > > >  
> > > > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > > > index bea0278f65ab..1190f4e7f7b9 100644
> > > > --- a/include/uapi/asm-generic/mman-common.h
> > > > +++ b/include/uapi/asm-generic/mman-common.h
> > > > @@ -43,6 +43,7 @@
> > > >  #define MADV_SEQUENTIAL	2		/* expect sequential page references */
> > > >  #define MADV_WILLNEED	3		/* will need these pages */
> > > >  #define MADV_DONTNEED	4		/* don't need these pages */
> > > > +#define MADV_COLD	5		/* deactivate these pages */
> > > >  
> > > >  /* common parameters: try to keep these consistent across architectures */
> > > >  #define MADV_FREE	8		/* free pages only if memory pressure */
> > > > diff --git a/mm/internal.h b/mm/internal.h
> > > > index 9eeaf2b95166..75a4f96ec0fb 100644
> > > > --- a/mm/internal.h
> > > > +++ b/mm/internal.h
> > > > @@ -43,7 +43,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf);
> > > >  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> > > >  		unsigned long floor, unsigned long ceiling);
> > > >  
> > > > -static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
> > > > +static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
> > > >  {
> > > >  	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
> > > >  }
> > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > index 628022e674a7..ab158766858a 100644
> > > > --- a/mm/madvise.c
> > > > +++ b/mm/madvise.c
> > > > @@ -40,6 +40,7 @@ static int madvise_need_mmap_write(int behavior)
> > > >  	case MADV_REMOVE:
> > > >  	case MADV_WILLNEED:
> > > >  	case MADV_DONTNEED:
> > > > +	case MADV_COLD:
> > > >  	case MADV_FREE:
> > > >  		return 0;
> > > >  	default:
> > > > @@ -307,6 +308,113 @@ static long madvise_willneed(struct vm_area_struct *vma,
> > > >  	return 0;
> > > >  }
> > > >  
> > > > +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> > > > +				unsigned long end, struct mm_walk *walk)
> > > > +{
> > > > +	pte_t *orig_pte, *pte, ptent;
> > > > +	spinlock_t *ptl;
> > > > +	struct page *page;
> > > > +	struct vm_area_struct *vma = walk->vma;
> > > > +	unsigned long next;
> > > > +
> > > > +	next = pmd_addr_end(addr, end);
> > > > +	if (pmd_trans_huge(*pmd)) {
> > > > +		ptl = pmd_trans_huge_lock(pmd, vma);
> > > > +		if (!ptl)
> > > > +			return 0;
> > > > +
> > > > +		if (is_huge_zero_pmd(*pmd))
> > > > +			goto huge_unlock;
> > > > +
> > > > +		page = pmd_page(*pmd);
> > > > +		if (page_mapcount(page) > 1)
> > > > +			goto huge_unlock;
> > > > +
> > > > +		if (next - addr != HPAGE_PMD_SIZE) {
> > > > +			int err;
> > > > +
> > > > +			get_page(page);
> > > > +			spin_unlock(ptl);
> > > > +			lock_page(page);
> > > > +			err = split_huge_page(page);
> > > > +			unlock_page(page);
> > > > +			put_page(page);
> > > > +			if (!err)
> > > > +				goto regular_page;
> > > > +			return 0;
> > > > +		}
> > > > +
> > > > +		pmdp_test_and_clear_young(vma, addr, pmd);
> > > > +		deactivate_page(page);
> > > > +huge_unlock:
> > > > +		spin_unlock(ptl);
> > > > +		return 0;
> > > > +	}
> > > > +
> > > > +	if (pmd_trans_unstable(pmd))
> > > > +		return 0;
> > > > +
> > > > +regular_page:
> > > > +	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > +	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> > > > +		ptent = *pte;
> > > > +
> > > > +		if (pte_none(ptent))
> > > > +			continue;
> > > > +
> > > > +		if (!pte_present(ptent))
> > > > +			continue;
> > > > +
> > > > +		page = vm_normal_page(vma, addr, ptent);
> > > > +		if (!page)
> > > > +			continue;
> > > > +
> > > > +		if (page_mapcount(page) > 1)
> > > > +			continue;
> > > > +
> > > > +		ptep_test_and_clear_young(vma, addr, pte);
> > > 
> > > Wondering here how it interacts with idle page tracking. Here since young
> > > flag is cleared by the cold hint, page_referenced_one() or
> > > page_idle_clear_pte_refs_one() will not be able to clear the page-idle flag
> > > if it was previously set since it does not know any more that a page was
> > > actively referenced.
> > 
> > ptep_test_and_clear_young doesn't change PG_idle/young, so idle page
> > tracking isn't affected.

You said *young flag* in the comment, which confused me. I thought you meant
the PG_young flag, but you meant the PTE accessed bit.

> 
> Clearing of the young bit in the PTE does affect idle tracking.
> 
> Both page_referenced_one() and page_idle_clear_pte_refs_one() check this bit.
> 
> > > bit was previously set, just so that page-idle tracking works smoothly when
> > > this hint is concurrently applied?
> > 
> > deactivate_page will remove the PG_young bit so that the page can be reclaimed.
> > Am I missing your point?
> 
> Say a process has a page whose PTE accessed bit is not set; then idle tracking
> is run and PG_Idle is set. Now the page is accessed from userspace, which sets
> the accessed PTE bit. Now a remote process sends this process the
> process_madvise cold hint (I know your current series does not support remote
> processes, but I am saying this for the future when you post that). Because you
> cleared the PTE accessed bit through the hint, idle tracking will no longer
> know that the page was referenced, and the user gets confused because an
> accessed page appears to be idle.

Right.

> 
> I think to fix this, what you should do is clear the PG_Idle flag if the
> young/accessed PTE bits are set. If PG_Idle is already cleared, then you
> don't need to do anything.

I'm not sure. What makes MADV_COLD special here?
What about MADV_FREE|MADV_DONTNEED?
Why don't they clear PG_Idle if the pte was young when tearing down the pte?
The page could be shared by other processes, so if we fail to clear
PG_idle there, page idle tracking could miss the access history forever.

If that's not what you want, maybe we need to fix all the places at once.
However, I'm not sure. Instead, I would rather keep PG_idle for these hints
even though the pte was accessed, because the process is now giving a strong
hint: "the page is idle from now on". That's valid because the process knows
itself better than anyone else, even the admin. IOW, it declares that the
page is no longer part of its workingset.
What's the problem if the page idle tracking feature misses it?
If other processes still have the access bit set in their ptes for the page,
page idle tracking can still find the page non-idle, so that's no problem
either.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1 1/4] mm: introduce MADV_COLD
  2019-06-13  4:48         ` Minchan Kim
@ 2019-06-19 17:13           ` Joel Fernandes
  2019-06-20  5:01             ` Minchan Kim
  0 siblings, 1 reply; 15+ messages in thread
From: Joel Fernandes @ 2019-06-19 17:13 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Johannes Weiner, Tim Murray, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg, christian, oleksandr, hdanton

Sorry for the late reply, Minchan; I had to travel for a visa interview and
then the weekend came. I replied below:

On Thu, Jun 13, 2019 at 01:48:24PM +0900, Minchan Kim wrote:
> On Wed, Jun 12, 2019 at 01:21:04PM -0400, Joel Fernandes wrote:
> > On Mon, Jun 10, 2019 at 07:09:04PM +0900, Minchan Kim wrote:
> > > Hi Joel,
> > > 
> > > On Tue, Jun 04, 2019 at 04:38:41PM -0400, Joel Fernandes wrote:
> > > > On Mon, Jun 03, 2019 at 02:36:52PM +0900, Minchan Kim wrote:
> > > > > When a process expects no accesses to a certain memory range, it can
> > > > > give the kernel a hint that the pages can be reclaimed when memory
> > > > > pressure happens but that the data should be preserved for future use.
> > > > > This could reduce workingset eviction and so end up increasing
> > > > > performance.
> > > > > 
> > > > > This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
> > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > to be used in the near future. The hint can help the kernel decide which
> > > > > pages to evict early during memory pressure.
> > > > > 
> > > > > It works for all LRU pages like MADV_[DONTNEED|FREE]. IOW, it moves
> > > > > 
> > > > > 	active file page -> inactive file LRU
> > > > > 	active anon page -> inactive anon LRU
> > > > > 
> > > > > Unlike MADV_FREE, it doesn't move active anonymous pages to the head of
> > > > > the inactive file LRU because MADV_COLD has slightly different semantics.
> > > > > MADV_FREE means it's okay to discard the pages under memory pressure
> > > > > because their content is *garbage*, so freeing them has almost zero
> > > > > overhead: we don't need to swap them out, and a later access causes only
> > > > > a minor fault. Thus, it makes sense to put those freeable pages on the
> > > > > inactive file LRU to compete with other used-once pages, and it could
> > > > > even give them a bonus so they are reclaimed on a swapless system.
> > > > > However, MADV_COLD doesn't mean garbage, so reclaiming those pages
> > > > > eventually requires swap-out/in. It's therefore better to move them to
> > > > > the inactive anon LRU list, not the file LRU. Furthermore, this helps
> > > > > avoid unnecessary scanning of cold anonymous pages if the system doesn't
> > > > > have a swap device.
> > > > > 
> > > > > All error rules are the same as for MADV_DONTNEED.
> > > > > 
> > > > > Note:
> > > > > This hint works only with private pages (IOW, page_mapcount(page) < 2)
> > > > > because a shared page is more likely to be accessed by the other
> > > > > processes sharing it even after the caller resets the reference bits,
> > > > > which ends up preventing reclaim of the page and wasting CPU cycles.
> > > > > 
> > > > > * RFCv2
> > > > >  * add more description - mhocko
> > > > > 
> > > > > * RFCv1
> > > > >  * renaming from MADV_COOL to MADV_COLD - hannes
> > > > > 
> > > > > * internal review
> > > > >  * use clear_page_young in deactivate_page - joelaf
> > > > >  * Revise the description - surenb
> > > > >  * Renaming from MADV_WARM to MADV_COOL - surenb
> > > > > 
> > > > > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > > > ---
> > > > >  include/linux/page-flags.h             |   1 +
> > > > >  include/linux/page_idle.h              |  15 ++++
> > > > >  include/linux/swap.h                   |   1 +
> > > > >  include/uapi/asm-generic/mman-common.h |   1 +
> > > > >  mm/internal.h                          |   2 +-
> > > > >  mm/madvise.c                           | 115 ++++++++++++++++++++++++-
> > > > >  mm/oom_kill.c                          |   2 +-
> > > > >  mm/swap.c                              |  43 +++++++++
> > > > >  8 files changed, 176 insertions(+), 4 deletions(-)
> > > > > 
> > > > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > > > > index 9f8712a4b1a5..58b06654c8dd 100644
> > > > > --- a/include/linux/page-flags.h
> > > > > +++ b/include/linux/page-flags.h
> > > > > @@ -424,6 +424,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
> > > > >  TESTPAGEFLAG(Young, young, PF_ANY)
> > > > >  SETPAGEFLAG(Young, young, PF_ANY)
> > > > >  TESTCLEARFLAG(Young, young, PF_ANY)
> > > > > +CLEARPAGEFLAG(Young, young, PF_ANY)
> > > > >  PAGEFLAG(Idle, idle, PF_ANY)
> > > > >  #endif
> > > > >  
> > > > > diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
> > > > > index 1e894d34bdce..f3f43b317150 100644
> > > > > --- a/include/linux/page_idle.h
> > > > > +++ b/include/linux/page_idle.h
> > > > > @@ -19,6 +19,11 @@ static inline void set_page_young(struct page *page)
> > > > >  	SetPageYoung(page);
> > > > >  }
> > > > >  
> > > > > +static inline void clear_page_young(struct page *page)
> > > > > +{
> > > > > +	ClearPageYoung(page);
> > > > > +}
> > > > > +
> > > > >  static inline bool test_and_clear_page_young(struct page *page)
> > > > >  {
> > > > >  	return TestClearPageYoung(page);
> > > > > @@ -65,6 +70,16 @@ static inline void set_page_young(struct page *page)
> > > > >  	set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
> > > > >  }
> > > > >  
> > > > > +static void clear_page_young(struct page *page)
> > > > > +{
> > > > > +	struct page_ext *page_ext = lookup_page_ext(page);
> > > > > +
> > > > > +	if (unlikely(!page_ext))
> > > > > +		return;
> > > > > +
> > > > > +	clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
> > > > > +}
> > > > > +
> > > > >  static inline bool test_and_clear_page_young(struct page *page)
> > > > >  {
> > > > >  	struct page_ext *page_ext = lookup_page_ext(page);
> > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > > > index de2c67a33b7e..0ce997edb8bb 100644
> > > > > --- a/include/linux/swap.h
> > > > > +++ b/include/linux/swap.h
> > > > > @@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
> > > > >  extern void lru_add_drain_all(void);
> > > > >  extern void rotate_reclaimable_page(struct page *page);
> > > > >  extern void deactivate_file_page(struct page *page);
> > > > > +extern void deactivate_page(struct page *page);
> > > > >  extern void mark_page_lazyfree(struct page *page);
> > > > >  extern void swap_setup(void);
> > > > >  
> > > > > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > > > > index bea0278f65ab..1190f4e7f7b9 100644
> > > > > --- a/include/uapi/asm-generic/mman-common.h
> > > > > +++ b/include/uapi/asm-generic/mman-common.h
> > > > > @@ -43,6 +43,7 @@
> > > > >  #define MADV_SEQUENTIAL	2		/* expect sequential page references */
> > > > >  #define MADV_WILLNEED	3		/* will need these pages */
> > > > >  #define MADV_DONTNEED	4		/* don't need these pages */
> > > > > +#define MADV_COLD	5		/* deactivate these pages */
> > > > >  
> > > > >  /* common parameters: try to keep these consistent across architectures */
> > > > >  #define MADV_FREE	8		/* free pages only if memory pressure */
> > > > > diff --git a/mm/internal.h b/mm/internal.h
> > > > > index 9eeaf2b95166..75a4f96ec0fb 100644
> > > > > --- a/mm/internal.h
> > > > > +++ b/mm/internal.h
> > > > > @@ -43,7 +43,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf);
> > > > >  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> > > > >  		unsigned long floor, unsigned long ceiling);
> > > > >  
> > > > > -static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
> > > > > +static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
> > > > >  {
> > > > >  	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
> > > > >  }
> > > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > > index 628022e674a7..ab158766858a 100644
> > > > > --- a/mm/madvise.c
> > > > > +++ b/mm/madvise.c
> > > > > @@ -40,6 +40,7 @@ static int madvise_need_mmap_write(int behavior)
> > > > >  	case MADV_REMOVE:
> > > > >  	case MADV_WILLNEED:
> > > > >  	case MADV_DONTNEED:
> > > > > +	case MADV_COLD:
> > > > >  	case MADV_FREE:
> > > > >  		return 0;
> > > > >  	default:
> > > > > @@ -307,6 +308,113 @@ static long madvise_willneed(struct vm_area_struct *vma,
> > > > >  	return 0;
> > > > >  }
> > > > >  
> > > > > +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> > > > > +				unsigned long end, struct mm_walk *walk)
> > > > > +{
> > > > > +	pte_t *orig_pte, *pte, ptent;
> > > > > +	spinlock_t *ptl;
> > > > > +	struct page *page;
> > > > > +	struct vm_area_struct *vma = walk->vma;
> > > > > +	unsigned long next;
> > > > > +
> > > > > +	next = pmd_addr_end(addr, end);
> > > > > +	if (pmd_trans_huge(*pmd)) {
> > > > > +		ptl = pmd_trans_huge_lock(pmd, vma);
> > > > > +		if (!ptl)
> > > > > +			return 0;
> > > > > +
> > > > > +		if (is_huge_zero_pmd(*pmd))
> > > > > +			goto huge_unlock;
> > > > > +
> > > > > +		page = pmd_page(*pmd);
> > > > > +		if (page_mapcount(page) > 1)
> > > > > +			goto huge_unlock;
> > > > > +
> > > > > +		if (next - addr != HPAGE_PMD_SIZE) {
> > > > > +			int err;
> > > > > +
> > > > > +			get_page(page);
> > > > > +			spin_unlock(ptl);
> > > > > +			lock_page(page);
> > > > > +			err = split_huge_page(page);
> > > > > +			unlock_page(page);
> > > > > +			put_page(page);
> > > > > +			if (!err)
> > > > > +				goto regular_page;
> > > > > +			return 0;
> > > > > +		}
> > > > > +
> > > > > +		pmdp_test_and_clear_young(vma, addr, pmd);
> > > > > +		deactivate_page(page);
> > > > > +huge_unlock:
> > > > > +		spin_unlock(ptl);
> > > > > +		return 0;
> > > > > +	}
> > > > > +
> > > > > +	if (pmd_trans_unstable(pmd))
> > > > > +		return 0;
> > > > > +
> > > > > +regular_page:
> > > > > +	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > > +	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> > > > > +		ptent = *pte;
> > > > > +
> > > > > +		if (pte_none(ptent))
> > > > > +			continue;
> > > > > +
> > > > > +		if (!pte_present(ptent))
> > > > > +			continue;
> > > > > +
> > > > > +		page = vm_normal_page(vma, addr, ptent);
> > > > > +		if (!page)
> > > > > +			continue;
> > > > > +
> > > > > +		if (page_mapcount(page) > 1)
> > > > > +			continue;
> > > > > +
> > > > > +		ptep_test_and_clear_young(vma, addr, pte);
> > > > 
> > > > Wondering here how this interacts with idle page tracking. Since the young
> > > > flag is cleared by the cold hint, page_referenced_one() or
> > > > page_idle_clear_pte_refs_one() will not be able to clear the page-idle flag
> > > > if it was previously set, since they no longer know that the page was
> > > > actively referenced.
> > > 
> > > ptep_test_and_clear_young doesn't change PG_idle/young, so idle page
> > > tracking isn't affected.
> 
> You said *young flag* in the comment, which confused me. I thought you meant
> the PG_young flag, but you meant the PTE access bit.
> 
> > 
> > Clearing of the young bit in the PTE does affect idle tracking.
> > 
> > Both page_referenced_one() and page_idle_clear_pte_refs_one() check this bit.
> > 
> > > > bit was previously set, just so that page-idle tracking works smoothly when
> > > > this hint is concurrently applied?
> > > 
> > > deactivate_page will remove the PG_young bit so that the page can be reclaimed.
> > > Am I missing your point?
> > 
> > Say a process has the accessed PTE bit not set; then idle tracking is run and
> > PG_idle is set. Now the page is accessed from userspace, thus setting the
> > accessed PTE bit. Now a remote process passes this process_madvise cold hint
> > (I know your current series does not support remote processes, but I am
> > saying this for the future, when you post it). Because you cleared the PTE
> > accessed bit through the hint, idle tracking will no longer know that the
> > page is referenced, and the user gets confused because an accessed page
> > appears to be idle.
> 
> Right.
> 
> > 
> > I think to fix this, what you should do is clear the PG_Idle flag if the
> > young/accessed PTE bits are set. If PG_Idle is already cleared, then you
> > don't need to do anything.
> 
> I'm not sure. What makes MADV_COLD special?
> How about MADV_FREE|MADV_DONTNEED?
> Why don't they clear PG_idle if the pte was young when tearing down the pte?

Good point, so it sounds like those (MADV_FREE|MADV_DONTNEED) also need to be fixed then?

> The page could be shared by other processes, so if we fail to clear
> PG_idle there, page idle tracking could miss the access history forever.

I did not understand this. So say a page X is shared between processes P and Q, and
assume the PG_idle flag is set on the page.

P accesses memory and has the pte accessed bit set. P now gets the MADV_COLD
hint and forgets to clear the idle flag while clearing the pte accessed bit.

Now the page appears to be idle, even though it was not. This has nothing to
do with Q and whether the page is shared or not.

> If it's not what you want, maybe we need to fix all those places at once.
> However, I'm not sure. Rather, I want to keep PG_idle in those hints
> even though the pte was accessed, because the process now gives a strong hint:
> "The page is idle from now on". It's valid because the process knows itself
> better than others, even the admin. IOW, it declares the page is no longer
> part of the working set.

Even if the PG_idle flag is not cleared, it is not a strong hint for working
set size IMHO, because the page *was* accessed, so the process definitely needed
the page at some point even though it now says MADV_COLD. So that is part of
the working set. I don't think we should implicitly provide such hints, and we
should fix it.

Also, as I was saying in the previous email, if process_madvise (a future
extension) is called from, say, the activity manager, then the process and the
user running the idle tracking feature have no idea that the page was accessed,
because the idle flag is still set. That is a bit weird and is a loss of
information.

It may not be a big deal in the long run if the page is accessed a lot, since
the PTE accessed bit will be set again and idle-tracking feature may not miss
it, but why leave it to chance if it is a simple fix?

> What's the problem if page idle tracking feature miss it?

What's the problem if PG_idle flag is cleared here? It is just a software
flag.

> If other processes still have the access bit set in their ptes for the page,
> page idle tracking could find the page as non-idle, so it's no problem either.

Yes, but if the other process also does not access the page, then the access
information is lost.

thanks!

 - Joel


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1 1/4] mm: introduce MADV_COLD
  2019-06-19 17:13           ` Joel Fernandes
@ 2019-06-20  5:01             ` Minchan Kim
  2019-06-20 15:57               ` Joel Fernandes
  0 siblings, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2019-06-20  5:01 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Johannes Weiner, Tim Murray, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg, christian, oleksandr, hdanton, vdavydov.dev

On Wed, Jun 19, 2019 at 01:13:40PM -0400, Joel Fernandes wrote:
< snip >

Ccing Vladimir

> > > > > > +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> > > > > > +				unsigned long end, struct mm_walk *walk)
> > > > > > +{
> > > > > > +	pte_t *orig_pte, *pte, ptent;
> > > > > > +	spinlock_t *ptl;
> > > > > > +	struct page *page;
> > > > > > +	struct vm_area_struct *vma = walk->vma;
> > > > > > +	unsigned long next;
> > > > > > +
> > > > > > +	next = pmd_addr_end(addr, end);
> > > > > > +	if (pmd_trans_huge(*pmd)) {
> > > > > > +		ptl = pmd_trans_huge_lock(pmd, vma);
> > > > > > +		if (!ptl)
> > > > > > +			return 0;
> > > > > > +
> > > > > > +		if (is_huge_zero_pmd(*pmd))
> > > > > > +			goto huge_unlock;
> > > > > > +
> > > > > > +		page = pmd_page(*pmd);
> > > > > > +		if (page_mapcount(page) > 1)
> > > > > > +			goto huge_unlock;
> > > > > > +
> > > > > > +		if (next - addr != HPAGE_PMD_SIZE) {
> > > > > > +			int err;
> > > > > > +
> > > > > > +			get_page(page);
> > > > > > +			spin_unlock(ptl);
> > > > > > +			lock_page(page);
> > > > > > +			err = split_huge_page(page);
> > > > > > +			unlock_page(page);
> > > > > > +			put_page(page);
> > > > > > +			if (!err)
> > > > > > +				goto regular_page;
> > > > > > +			return 0;
> > > > > > +		}
> > > > > > +
> > > > > > +		pmdp_test_and_clear_young(vma, addr, pmd);
> > > > > > +		deactivate_page(page);
> > > > > > +huge_unlock:
> > > > > > +		spin_unlock(ptl);
> > > > > > +		return 0;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (pmd_trans_unstable(pmd))
> > > > > > +		return 0;
> > > > > > +
> > > > > > +regular_page:
> > > > > > +	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > > > +	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> > > > > > +		ptent = *pte;
> > > > > > +
> > > > > > +		if (pte_none(ptent))
> > > > > > +			continue;
> > > > > > +
> > > > > > +		if (!pte_present(ptent))
> > > > > > +			continue;
> > > > > > +
> > > > > > +		page = vm_normal_page(vma, addr, ptent);
> > > > > > +		if (!page)
> > > > > > +			continue;
> > > > > > +
> > > > > > +		if (page_mapcount(page) > 1)
> > > > > > +			continue;
> > > > > > +
> > > > > > +		ptep_test_and_clear_young(vma, addr, pte);
> > > > > 
> > > > > Wondering here how this interacts with idle page tracking. Since the young
> > > > > flag is cleared by the cold hint, page_referenced_one() or
> > > > > page_idle_clear_pte_refs_one() will not be able to clear the page-idle flag
> > > > > if it was previously set, since they no longer know that the page was
> > > > > actively referenced.
> > > > 
> > > > ptep_test_and_clear_young doesn't change PG_idle/young, so idle page
> > > > tracking isn't affected.
> > 
> > You said *young flag* in the comment, which confused me. I thought you meant
> > the PG_young flag, but you meant the PTE access bit.
> > 
> > > 
> > > Clearing of the young bit in the PTE does affect idle tracking.
> > > 
> > > Both page_referenced_one() and page_idle_clear_pte_refs_one() check this bit.
> > > 
> > > > > bit was previously set, just so that page-idle tracking works smoothly when
> > > > > this hint is concurrently applied?
> > > > 
> > > > deactivate_page will remove the PG_young bit so that the page can be reclaimed.
> > > > Am I missing your point?
> > > 
> > > Say a process has the accessed PTE bit not set; then idle tracking is run and
> > > PG_idle is set. Now the page is accessed from userspace, thus setting the
> > > accessed PTE bit. Now a remote process passes this process_madvise cold hint
> > > (I know your current series does not support remote processes, but I am
> > > saying this for the future, when you post it). Because you cleared the PTE
> > > accessed bit through the hint, idle tracking will no longer know that the
> > > page is referenced, and the user gets confused because an accessed page
> > > appears to be idle.
> > 
> > Right.
> > 
> > > 
> > > I think to fix this, what you should do is clear the PG_Idle flag if the
> > > young/accessed PTE bits are set. If PG_Idle is already cleared, then you
> > > don't need to do anything.
> > 
> > I'm not sure. What makes MADV_COLD special?
> > How about MADV_FREE|MADV_DONTNEED?
> > Why don't they clear PG_idle if the pte was young when tearing down the pte?
> 
> Good point, so it sounds like those (MADV_FREE|MADV_DONTNEED) also need to be fixed then?

Not sure. If you want that, maybe you need to fix every pte-clearing and
pte_mkold site, which is more general and would cover places like munmap,
get_user_pages, and so on. Anyway, I don't think it's related to this patchset.

> 
> > The page could be shared by other processes, so if we fail to clear
> > PG_idle there, page idle tracking could miss the access history forever.
> 
> I did not understand this. So say a page X is shared between processes P and Q, and
> assume the PG_idle flag is set on the page.
> 
> P accesses memory and has the pte accessed bit set. P now gets the MADV_COLD
> hint and forgets to clear the idle flag while clearing the pte accessed bit.
> 
> Now the page appears to be idle, even though it was not. This has nothing to
> do with Q and whether the page is shared or not.

What I meant was MADV_FREE|MADV_DONTNEED.

> 
> > If it's not what you want, maybe we need to fix all those places at once.
> > However, I'm not sure. Rather, I want to keep PG_idle in those hints
> > even though the pte was accessed, because the process now gives a strong hint:
> > "The page is idle from now on". It's valid because the process knows itself
> > better than others, even the admin. IOW, it declares the page is no longer
> > part of the working set.
> 
> Even if the PG_idle flag is not cleared, it is not a strong hint for working
> set size IMHO, because the page *was* accessed, so the process definitely needed
> the page at some point even though it now says MADV_COLD. So that is part of
> the working set. I don't think we should implicitly provide such hints, and we
> should fix it.
> 
> Also, as I was saying in the previous email, if process_madvise (a future
> extension) is called from, say, the activity manager, then the process and the
> user running the idle tracking feature have no idea that the page was accessed,
> because the idle flag is still set. That is a bit weird and is a loss of
> information.
> 
> It may not be a big deal in the long run if the page is accessed a lot, since
> the PTE accessed bit will be set again and idle-tracking feature may not miss
> it, but why leave it to chance if it is a simple fix?

Consistency with other madvise hints.

There are many places where you could lose the information, as I mentioned,
and I'm really not convinced we need a fix, because the page-idle tracking
feature is currently biased to work with . If you believe we need to fix it,
it would be better to have a separate discussion, not here.

> 
> > What's the problem if page idle tracking feature miss it?
> 
> What's the problem if PG_idle flag is cleared here? It is just a software
> flag.

Again, consistency. I don't think it's a MADV_PAGEOUT-specific issue.
Since I pointed out other places where idle tracking may be missing (not sure),
let's discuss it separately if you feel we need a fix.

Furthermore, once the page is reclaimed it could be deallocated, so you
automatically don't see any PG_idle on the page.

> 
> > If other processes still have the access bit set in their ptes for the page,
> > page idle tracking could find the page as non-idle, so it's no problem either.
> 
> Yes, but if the other process also does not access the page, then the access
> information is lost.
> 
> thanks!
> 
>  - Joel
> 


* Re: [PATCH v1 1/4] mm: introduce MADV_COLD
  2019-06-20  5:01             ` Minchan Kim
@ 2019-06-20 15:57               ` Joel Fernandes
  0 siblings, 0 replies; 15+ messages in thread
From: Joel Fernandes @ 2019-06-20 15:57 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, Linux API, Michal Hocko,
	Johannes Weiner, Tim Murray, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon,
	Jann Horn, Oleg Nesterov, Christian Brauner, oleksandr, hdanton,
	Vladimir Davydov

On Thu, Jun 20, 2019 at 1:01 AM Minchan Kim <minchan@kernel.org> wrote:
[snip]
> > > >
> > > > I think to fix this, what you should do is clear the PG_Idle flag if the
> > > > young/accessed PTE bits are set. If PG_Idle is already cleared, then you
> > > > don't need to do anything.
> > >
> > > I'm not sure. What makes MADV_COLD special?
> > > How about MADV_FREE|MADV_DONTNEED?
> > > Why don't they clear PG_idle if the pte was young when tearing down the pte?
> >
> > Good point, so it sounds like those (MADV_FREE|MADV_DONTNEED) also need to be fixed then?
>
> Not sure. If you want that, maybe you need to fix every pte-clearing and
> pte_mkold site, which is more general and would cover places like munmap,
> get_user_pages, and so on. Anyway, I don't think it's related to this patchset.

Ok, I can look into this issue on my own when I get time. I'll add it
to my list. No problems with your patch otherwise from my side.

 -Joel


end of thread, other threads:[~2019-06-20 15:57 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-03  5:36 [PATCH v1 0/4] Introduce MADV_COLD and MADV_PAGEOUT Minchan Kim
2019-06-03  5:36 ` [PATCH v1 1/4] mm: introduce MADV_COLD Minchan Kim
2019-06-04 20:38   ` Joel Fernandes
2019-06-10 10:09     ` Minchan Kim
2019-06-12 17:21       ` Joel Fernandes
2019-06-13  4:48         ` Minchan Kim
2019-06-19 17:13           ` Joel Fernandes
2019-06-20  5:01             ` Minchan Kim
2019-06-20 15:57               ` Joel Fernandes
2019-06-03  5:36 ` [PATCH v1 2/4] mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM Minchan Kim
2019-06-03  5:36 ` [PATCH v1 3/4] mm: account nr_isolated_xxx in [isolate|putback]_lru_page Minchan Kim
2019-06-03  5:36 ` [PATCH v1 4/4] mm: introduce MADV_PAGEOUT Minchan Kim
2019-06-03 20:39   ` Johannes Weiner
2019-06-04  3:55     ` Minchan Kim
2019-06-04 15:38   ` Christopher Lameter
