LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH v2 0/5] introduce memory hinting API for external process
@ 2020-01-16 23:59 Minchan Kim
  2020-01-16 23:59 ` [PATCH v2 1/5] mm: factor out madvise's core functionality Minchan Kim
                   ` (4 more replies)
  0 siblings, 5 replies; 39+ messages in thread
From: Minchan Kim @ 2020-01-16 23:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, linux-mm, linux-api, oleksandr, Suren Baghdasaryan,
	Tim Murray, Daniel Colascione, Sandeep Patil, Sonny Rao,
	Brian Geffon, Michal Hocko, Johannes Weiner, Shakeel Butt,
	John Dias, ktkhai, christian.brauner, sjpark, Minchan Kim

Now, we have MADV_PAGEOUT and MADV_COLD as madvise hinting API. With that,
application could give hints to kernel what memory range are preferred to be
reclaimed. However, in some platform(e.g., Android), the information
required to make the hinting decision is not known to the app.
Instead, it is known to a centralized userspace daemon(e.g., ActivityManagerService),
and that daemon must be able to initiate reclaim on its own without any app
involvement.

To solve the concern, this patch introduces new syscall - process_madvise(2).
Bascially, it's same with madvise(2) syscall but it has some differences.

1. It needs pidfd of target process to provide the hint
2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMEREABLE} at this moment.
   Other hints in madvise will be opened when there are explicit requests from
   community to prevent unexpected bugs we couldn't support.
3. Only privileged processes can do something for other process's address
   space.

Minchan Kim (3):
  mm: factor out madvise's core functionality
  mm: introduce external memory hinting API
  mm: support both pid and pidfd for process_madvise

Oleksandr Natalenko (2):
  mm/madvise: employ mmget_still_valid for write lock
  mm/madvise: allow KSM hints for remote API

* from v1 - https://lore.kernel.org/linux-mm/20200110213433.94739-1-minchan@kernel.org/
  * fix syscall number - SeongJae
  * use get_pid_task - Kirill
  * extend API to support pid as well as pidfd - Kirill

 arch/alpha/kernel/syscalls/syscall.tbl      |   1 +
 arch/arm/tools/syscall.tbl                  |   1 +
 arch/arm64/include/asm/unistd.h             |   2 +-
 arch/arm64/include/asm/unistd32.h           |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |   1 +
 arch/s390/kernel/syscalls/syscall.tbl       |   1 +
 arch/sh/kernel/syscalls/syscall.tbl         |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |   1 +
 include/linux/pid.h                         |   1 +
 include/linux/syscalls.h                    |   3 +
 include/uapi/asm-generic/unistd.h           |   5 +-
 kernel/exit.c                               |  17 --
 kernel/pid.c                                |  17 ++
 kernel/sys_ni.c                             |   1 +
 mm/madvise.c                                | 272 ++++++++++++++------
 24 files changed, 233 insertions(+), 102 deletions(-)

-- 
2.25.0.rc1.283.g88dfdc4193-goog


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 1/5] mm: factor out madvise's core functionality
  2020-01-16 23:59 [PATCH v2 0/5] introduce memory hinting API for external process Minchan Kim
@ 2020-01-16 23:59 ` Minchan Kim
  2020-01-17 10:02   ` Kirill Tkhai
  2020-01-16 23:59 ` [PATCH v2 2/5] mm: introduce external memory hinting API Minchan Kim
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 39+ messages in thread
From: Minchan Kim @ 2020-01-16 23:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, linux-mm, linux-api, oleksandr, Suren Baghdasaryan,
	Tim Murray, Daniel Colascione, Sandeep Patil, Sonny Rao,
	Brian Geffon, Michal Hocko, Johannes Weiner, Shakeel Butt,
	John Dias, ktkhai, christian.brauner, sjpark, Minchan Kim

This patch factor out madvise's core functionality so that upcoming
patch can reuse it without duplication. It shouldn't change any behavior.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/madvise.c | 194 +++++++++++++++++++++++++++++----------------------
 1 file changed, 111 insertions(+), 83 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index bcdb6a042787..0c901de531e4 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -35,6 +35,7 @@
 struct madvise_walk_private {
 	struct mmu_gather *tlb;
 	bool pageout;
+	struct task_struct *task;
 };
 
 /*
@@ -306,12 +307,13 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	bool pageout = private->pageout;
 	struct mm_struct *mm = tlb->mm;
 	struct vm_area_struct *vma = walk->vma;
+	struct task_struct *task = private->task;
 	pte_t *orig_pte, *pte, ptent;
 	spinlock_t *ptl;
 	struct page *page = NULL;
 	LIST_HEAD(page_list);
 
-	if (fatal_signal_pending(current))
+	if (fatal_signal_pending(task))
 		return -EINTR;
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -469,12 +471,14 @@ static const struct mm_walk_ops cold_walk_ops = {
 };
 
 static void madvise_cold_page_range(struct mmu_gather *tlb,
+			     struct task_struct *task,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end)
 {
 	struct madvise_walk_private walk_private = {
 		.pageout = false,
 		.tlb = tlb,
+		.task = task,
 	};
 
 	tlb_start_vma(tlb, vma);
@@ -482,7 +486,7 @@ static void madvise_cold_page_range(struct mmu_gather *tlb,
 	tlb_end_vma(tlb, vma);
 }
 
-static long madvise_cold(struct vm_area_struct *vma,
+static long madvise_cold(struct task_struct *task, struct vm_area_struct *vma,
 			struct vm_area_struct **prev,
 			unsigned long start_addr, unsigned long end_addr)
 {
@@ -495,19 +499,21 @@ static long madvise_cold(struct vm_area_struct *vma,
 
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
-	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
+	madvise_cold_page_range(&tlb, task, vma, start_addr, end_addr);
 	tlb_finish_mmu(&tlb, start_addr, end_addr);
 
 	return 0;
 }
 
 static void madvise_pageout_page_range(struct mmu_gather *tlb,
+			     struct task_struct *task,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end)
 {
 	struct madvise_walk_private walk_private = {
 		.pageout = true,
 		.tlb = tlb,
+		.task = task,
 	};
 
 	tlb_start_vma(tlb, vma);
@@ -531,9 +537,9 @@ static inline bool can_do_pageout(struct vm_area_struct *vma)
 		inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0;
 }
 
-static long madvise_pageout(struct vm_area_struct *vma,
-			struct vm_area_struct **prev,
-			unsigned long start_addr, unsigned long end_addr)
+static long madvise_pageout(struct task_struct *task,
+		struct vm_area_struct *vma, struct vm_area_struct **prev,
+		unsigned long start_addr, unsigned long end_addr)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather tlb;
@@ -547,7 +553,7 @@ static long madvise_pageout(struct vm_area_struct *vma,
 
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
-	madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
+	madvise_pageout_page_range(&tlb, task, vma, start_addr, end_addr);
 	tlb_finish_mmu(&tlb, start_addr, end_addr);
 
 	return 0;
@@ -751,7 +757,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
 	return 0;
 }
 
-static long madvise_dontneed_free(struct vm_area_struct *vma,
+static long madvise_dontneed_free(struct mm_struct *mm,
+				  struct vm_area_struct *vma,
 				  struct vm_area_struct **prev,
 				  unsigned long start, unsigned long end,
 				  int behavior)
@@ -763,8 +770,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 	if (!userfaultfd_remove(vma, start, end)) {
 		*prev = NULL; /* mmap_sem has been dropped, prev is stale */
 
-		down_read(&current->mm->mmap_sem);
-		vma = find_vma(current->mm, start);
+		down_read(&mm->mmap_sem);
+		vma = find_vma(mm, start);
 		if (!vma)
 			return -ENOMEM;
 		if (start < vma->vm_start) {
@@ -811,7 +818,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
  * Application wants to free up the pages and associated backing store.
  * This is effectively punching a hole into the middle of a file.
  */
-static long madvise_remove(struct vm_area_struct *vma,
+static long madvise_remove(struct mm_struct *mm,
+				struct vm_area_struct *vma,
 				struct vm_area_struct **prev,
 				unsigned long start, unsigned long end)
 {
@@ -845,13 +853,13 @@ static long madvise_remove(struct vm_area_struct *vma,
 	get_file(f);
 	if (userfaultfd_remove(vma, start, end)) {
 		/* mmap_sem was not released by userfaultfd_remove() */
-		up_read(&current->mm->mmap_sem);
+		up_read(&mm->mmap_sem);
 	}
 	error = vfs_fallocate(f,
 				FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
 				offset, end - start);
 	fput(f);
-	down_read(&current->mm->mmap_sem);
+	down_read(&mm->mmap_sem);
 	return error;
 }
 
@@ -925,21 +933,23 @@ static int madvise_inject_error(int behavior,
 #endif
 
 static long
-madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
+madvise_vma(struct task_struct *task, struct mm_struct *mm,
+		struct vm_area_struct *vma, struct vm_area_struct **prev,
 		unsigned long start, unsigned long end, int behavior)
 {
 	switch (behavior) {
 	case MADV_REMOVE:
-		return madvise_remove(vma, prev, start, end);
+		return madvise_remove(mm, vma, prev, start, end);
 	case MADV_WILLNEED:
 		return madvise_willneed(vma, prev, start, end);
 	case MADV_COLD:
-		return madvise_cold(vma, prev, start, end);
+		return madvise_cold(task, vma, prev, start, end);
 	case MADV_PAGEOUT:
-		return madvise_pageout(vma, prev, start, end);
+		return madvise_pageout(task, vma, prev, start, end);
 	case MADV_FREE:
 	case MADV_DONTNEED:
-		return madvise_dontneed_free(vma, prev, start, end, behavior);
+		return madvise_dontneed_free(mm, vma, prev, start,
+						end, behavior);
 	default:
 		return madvise_behavior(vma, prev, start, end, behavior);
 	}
@@ -984,67 +994,19 @@ madvise_behavior_valid(int behavior)
 }
 
 /*
- * The madvise(2) system call.
+ * madvise_common - request behavior hint to address range of the target process
  *
- * Applications can use madvise() to advise the kernel how it should
- * handle paging I/O in this VM area.  The idea is to help the kernel
- * use appropriate read-ahead and caching techniques.  The information
- * provided is advisory only, and can be safely disregarded by the
- * kernel without affecting the correct operation of the application.
+ * @task: task_struct got behavior hint, not giving the hint
+ * @mm: mm_struct got behavior hint, not giving the hint
+ * @start: base address of the hinted range
+ * @len_in: length of the hinted range
+ * @behavior: requested hint
  *
- * behavior values:
- *  MADV_NORMAL - the default behavior is to read clusters.  This
- *		results in some read-ahead and read-behind.
- *  MADV_RANDOM - the system should read the minimum amount of data
- *		on any access, since it is unlikely that the appli-
- *		cation will need more than what it asks for.
- *  MADV_SEQUENTIAL - pages in the given range will probably be accessed
- *		once, so they can be aggressively read ahead, and
- *		can be freed soon after they are accessed.
- *  MADV_WILLNEED - the application is notifying the system to read
- *		some pages ahead.
- *  MADV_DONTNEED - the application is finished with the given range,
- *		so the kernel can free resources associated with it.
- *  MADV_FREE - the application marks pages in the given range as lazy free,
- *		where actual purges are postponed until memory pressure happens.
- *  MADV_REMOVE - the application wants to free up the given range of
- *		pages and associated backing store.
- *  MADV_DONTFORK - omit this area from child's address space when forking:
- *		typically, to avoid COWing pages pinned by get_user_pages().
- *  MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
- *  MADV_WIPEONFORK - present the child process with zero-filled memory in this
- *              range after a fork.
- *  MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
- *  MADV_HWPOISON - trigger memory error handler as if the given memory range
- *		were corrupted by unrecoverable hardware memory failure.
- *  MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
- *  MADV_MERGEABLE - the application recommends that KSM try to merge pages in
- *		this area with pages of identical content from other such areas.
- *  MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
- *  MADV_HUGEPAGE - the application wants to back the given range by transparent
- *		huge pages in the future. Existing pages might be coalesced and
- *		new pages might be allocated as THP.
- *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
- *		transparent huge pages so the existing pages will not be
- *		coalesced into THP and new pages will not be allocated as THP.
- *  MADV_DONTDUMP - the application wants to prevent pages in the given range
- *		from being included in its core dump.
- *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
- *
- * return values:
- *  zero    - success
- *  -EINVAL - start + len < 0, start is not page-aligned,
- *		"behavior" is not a valid value, or application
- *		is attempting to release locked or shared pages,
- *		or the specified address range includes file, Huge TLB,
- *		MAP_SHARED or VMPFNMAP range.
- *  -ENOMEM - addresses in the specified range are not currently
- *		mapped, or are outside the AS of the process.
- *  -EIO    - an I/O error occurred while paging in data.
- *  -EBADF  - map exists, but area maps something that isn't a file.
- *  -EAGAIN - a kernel resource was temporarily unavailable.
+ * @task could be a zombie leader if it calls sys_exit so accessing mm_struct
+ * via task->mm is prohibited. Please use @mm instead of task->mm.
  */
-SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
+static int madvise_common(struct task_struct *task, struct mm_struct *mm,
+			unsigned long start, size_t len_in, int behavior)
 {
 	unsigned long end, tmp;
 	struct vm_area_struct *vma, *prev;
@@ -1082,10 +1044,10 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 
 	write = madvise_need_mmap_write(behavior);
 	if (write) {
-		if (down_write_killable(&current->mm->mmap_sem))
+		if (down_write_killable(&mm->mmap_sem))
 			return -EINTR;
 	} else {
-		down_read(&current->mm->mmap_sem);
+		down_read(&mm->mmap_sem);
 	}
 
 	/*
@@ -1093,7 +1055,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 	 * ranges, just ignore them, but return -ENOMEM at the end.
 	 * - different from the way of handling in mlock etc.
 	 */
-	vma = find_vma_prev(current->mm, start, &prev);
+	vma = find_vma_prev(mm, start, &prev);
 	if (vma && start > vma->vm_start)
 		prev = vma;
 
@@ -1118,7 +1080,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 			tmp = end;
 
 		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
-		error = madvise_vma(vma, &prev, start, tmp, behavior);
+		error = madvise_vma(task, mm, vma, &prev, start, tmp, behavior);
 		if (error)
 			goto out;
 		start = tmp;
@@ -1130,14 +1092,80 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 		if (prev)
 			vma = prev->vm_next;
 		else	/* madvise_remove dropped mmap_sem */
-			vma = find_vma(current->mm, start);
+			vma = find_vma(mm, start);
 	}
 out:
 	blk_finish_plug(&plug);
 	if (write)
-		up_write(&current->mm->mmap_sem);
+		up_write(&mm->mmap_sem);
 	else
-		up_read(&current->mm->mmap_sem);
+		up_read(&mm->mmap_sem);
 
 	return error;
 }
+
+/*
+ * The madvise(2) system call.
+ *
+ * Applications can use madvise() to advise the kernel how it should
+ * handle paging I/O in this VM area.  The idea is to help the kernel
+ * use appropriate read-ahead and caching techniques.  The information
+ * provided is advisory only, and can be safely disregarded by the
+ * kernel without affecting the correct operation of the application.
+ *
+ * behavior values:
+ *  MADV_NORMAL - the default behavior is to read clusters.  This
+ *		results in some read-ahead and read-behind.
+ *  MADV_RANDOM - the system should read the minimum amount of data
+ *		on any access, since it is unlikely that the appli-
+ *		cation will need more than what it asks for.
+ *  MADV_SEQUENTIAL - pages in the given range will probably be accessed
+ *		once, so they can be aggressively read ahead, and
+ *		can be freed soon after they are accessed.
+ *  MADV_WILLNEED - the application is notifying the system to read
+ *		some pages ahead.
+ *  MADV_DONTNEED - the application is finished with the given range,
+ *		so the kernel can free resources associated with it.
+ *  MADV_FREE - the application marks pages in the given range as lazy free,
+ *		where actual purges are postponed until memory pressure happens.
+ *  MADV_REMOVE - the application wants to free up the given range of
+ *		pages and associated backing store.
+ *  MADV_DONTFORK - omit this area from child's address space when forking:
+ *		typically, to avoid COWing pages pinned by get_user_pages().
+ *  MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
+ *  MADV_WIPEONFORK - present the child process with zero-filled memory in this
+ *              range after a fork.
+ *  MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
+ *  MADV_HWPOISON - trigger memory error handler as if the given memory range
+ *		were corrupted by unrecoverable hardware memory failure.
+ *  MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
+ *  MADV_MERGEABLE - the application recommends that KSM try to merge pages in
+ *		this area with pages of identical content from other such areas.
+ *  MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
+ *  MADV_HUGEPAGE - the application wants to back the given range by transparent
+ *		huge pages in the future. Existing pages might be coalesced and
+ *		new pages might be allocated as THP.
+ *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
+ *		transparent huge pages so the existing pages will not be
+ *		coalesced into THP and new pages will not be allocated as THP.
+ *  MADV_DONTDUMP - the application wants to prevent pages in the given range
+ *		from being included in its core dump.
+ *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
+ *
+ * return values:
+ *  zero    - success
+ *  -EINVAL - start + len < 0, start is not page-aligned,
+ *		"behavior" is not a valid value, or application
+ *		is attempting to release locked or shared pages,
+ *		or the specified address range includes file, Huge TLB,
+ *		MAP_SHARED or VMPFNMAP range.
+ *  -ENOMEM - addresses in the specified range are not currently
+ *		mapped, or are outside the AS of the process.
+ *  -EIO    - an I/O error occurred while paging in data.
+ *  -EBADF  - map exists, but area maps something that isn't a file.
+ *  -EAGAIN - a kernel resource was temporarily unavailable.
+ */
+SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
+{
+	return madvise_common(current, current->mm, start, len_in, behavior);
+}
-- 
2.25.0.rc1.283.g88dfdc4193-goog


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-16 23:59 [PATCH v2 0/5] introduce memory hinting API for external process Minchan Kim
  2020-01-16 23:59 ` [PATCH v2 1/5] mm: factor out madvise's core functionality Minchan Kim
@ 2020-01-16 23:59 ` Minchan Kim
  2020-01-17 11:52   ` Michal Hocko
  2020-01-16 23:59 ` [PATCH v2 3/5] mm/madvise: employ mmget_still_valid for write lock Minchan Kim
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 39+ messages in thread
From: Minchan Kim @ 2020-01-16 23:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, linux-mm, linux-api, oleksandr, Suren Baghdasaryan,
	Tim Murray, Daniel Colascione, Sandeep Patil, Sonny Rao,
	Brian Geffon, Michal Hocko, Johannes Weiner, Shakeel Butt,
	John Dias, ktkhai, christian.brauner, sjpark, Minchan Kim

There is usecase that System Management Software(SMS) want to give
a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
in the case of Android, it is the ActivityManagerService.

It's similar in spirit to madvise(MADV_WONTNEED), but the information
required to make the reclaim decision is not known to the app. Instead,
it is known to the centralized userspace daemon(ActivityManagerService),
and that daemon must be able to initiate reclaim on its own without
any app involvement.

To solve the issue, this patch introduces new syscall process_madvise(2).
It uses pidfd of an external processs to give the hint.

 int process_madvise(int pidfd, void *addr, size_t length, int advise,
			unsigned long flag);

Since it could affect other process's address range, only privileged
process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
gives it the right to ptrace the process could use it successfully.
The flag argument is reserved for future use if we need to extend the
API.

I think supporting all hints madvise has/will supported/support to
process_madvise is rather risky. Because we are not sure all hints make
sense from external process and implementation for the hint may rely on
the caller being in the current context so it could be error-prone.
Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.

If someone want to add other hints, we could hear hear the usecase and
review it for each hint. It's more safe for maintainace rather than
introducing a buggy syscall but hard to fix it later.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
 arch/arm/tools/syscall.tbl                  |  1 +
 arch/arm64/include/asm/unistd.h             |  2 +-
 arch/arm64/include/asm/unistd32.h           |  2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
 arch/s390/kernel/syscalls/syscall.tbl       |  1 +
 arch/sh/kernel/syscalls/syscall.tbl         |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
 include/linux/syscalls.h                    |  2 +
 include/uapi/asm-generic/unistd.h           |  5 +-
 kernel/sys_ni.c                             |  1 +
 mm/madvise.c                                | 63 +++++++++++++++++++++
 21 files changed, 88 insertions(+), 2 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index e56950f23b49..776c61803315 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -477,3 +477,4 @@
 # 545 reserved for clone3
 546	common	watch_devices			sys_watch_devices
 547	common	openat2				sys_openat2
+548	common	process_madvise			sys_process_madvise
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 7fb2f4d59210..a43381542276 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -451,3 +451,4 @@
 435	common	clone3				sys_clone3
 436	common	watch_devices			sys_watch_devices
 437	common	openat2				sys_openat2
+438	common	process_madvise			sys_process_madvise
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 8aa00ccb0b96..b722e47377a5 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		438
+#define __NR_compat_syscalls		439
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 31f0ce25719e..e3643d7fecc3 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -883,6 +883,8 @@ __SYSCALL(__NR_clone3, sys_clone3)
 __SYSCALL(__NR_watch_devices, sys_watch_devices)
 #define __NR_openat2 437
 __SYSCALL(__NR_openat2, sys_openat2)
+#define __NR_process_madvise 438
+__SYSCALL(__NR_process_madvise, process_madvise)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index b9aa59931905..c156abc9a298 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -358,3 +358,4 @@
 # 435 reserved for clone3
 436	common	watch_devices			sys_watch_devices
 437	common	openat2				sys_openat2
+438	common	process_madvise			sys_process_madvise
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 868c1ef89d35..5b6034b6650f 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -437,3 +437,4 @@
 # 435 reserved for clone3
 436	common	watch_devices			sys_watch_devices
 437	common	openat2				sys_openat2
+438	common	process_madvise			sys_process_madvise
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 544b4cef18b3..4bef584af09c 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -443,3 +443,4 @@
 435	common	clone3				sys_clone3
 436	common	watch_devices			sys_watch_devices
 437	common	openat2				sys_openat2
+438	common	process_madvise			sys_process_madvise
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 05e8aee5dae7..7061b2103438 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -376,3 +376,4 @@
 435	n32	clone3				__sys_clone3
 436	n32	watch_devices			sys_watch_devices
 437	n32	openat2				sys_openat2
+438	n32	process_madivse			sys_process_madvise
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 24d6c01328fb..84042d57fbfb 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -352,3 +352,4 @@
 435	n64	clone3				__sys_clone3
 436	n64	watch_devices			sys_watch_devices
 437	n64	openat2				sys_openat2
+438	n64	process_madvise			sys_process_madvise
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 4b5f77a4e1a2..5bfd359c7e6f 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -435,3 +435,4 @@
 435	common	clone3				sys_clone3_wrapper
 436	common	watch_devices			sys_watch_devices
 437	common	openat2				sys_openat2
+438	common	process_madvise			sys_process_madvise
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 9716dc85a517..ffa0e679aca0 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -519,3 +519,4 @@
 435	nospu	clone3				ppc_clone3
 436	common	watch_devices			sys_watch_devices
 437	common	openat2				sys_openat2
+438	common	process_madvise			sys_process_madvise
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 7da330f8b03e..c301717216ca 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -440,3 +440,4 @@
 435  common	clone3			sys_clone3			sys_clone3
 436  common	watch_devices		sys_watch_devices		sys_watch_devices
 437  common	openat2			sys_openat2			sys_openat2
+438  common	process_madvise		sys_process_madvise		sys_process_madvise
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index bb7e68e25337..b8f15701f69f 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -440,3 +440,4 @@
 # 435 reserved for clone3
 436	common	watch_devices			sys_watch_devices
 437	common	openat2				sys_openat2
+438	common	process_madvise			sys_process_madvise
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 646a1fad7218..7ea95f37b222 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -483,3 +483,4 @@
 # 435 reserved for clone3
 436	common	watch_devices			sys_watch_devices
 437	common	openat2			sys_openat2
+438	common	process_madvise		sys_process_madvise
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 57c53acee290..76a2c266fe7e 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -442,3 +442,4 @@
 435	i386	clone3			sys_clone3			__ia32_sys_clone3
 436	i386	watch_devices		sys_watch_devices		__ia32_sys_watch_devices
 437	i386	openat2			sys_openat2			__ia32_sys_openat2
+438	i386	process_madvise		sys_process_madvise		__ia32_sys_process_madvise
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 1dd8d21f6500..b697cd8620cb 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -359,6 +359,7 @@
 435	common	clone3			__x64_sys_clone3/ptregs
 436	common	watch_devices		__x64_sys_watch_devices
 437	common	openat2			__x64_sys_openat2
+438	common	process_madvise		__x64_sys_process_madvise
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 0f48ab7bd75b..2e9813ecfd7d 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -408,3 +408,4 @@
 435	common	clone3				sys_clone3
 436	common	watch_devices			sys_watch_devices
 437	common	openat2				sys_openat2
+438	common	process_madvise			sys_process_madvise
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 433c8c85636e..1b58a11ff49f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -877,6 +877,8 @@ asmlinkage long sys_munlockall(void);
 asmlinkage long sys_mincore(unsigned long start, size_t len,
 				unsigned char __user * vec);
 asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
+asmlinkage long sys_process_madvise(int pidfd, unsigned long start,
+			size_t len, int behavior, unsigned long flags);
 asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
 			unsigned long prot, unsigned long pgoff,
 			unsigned long flags);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 33f3856a9c3c..4a49fbaea013 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -856,8 +856,11 @@ __SYSCALL(__NR_watch_devices, sys_watch_devices)
 #define __NR_openat2 437
 __SYSCALL(__NR_openat2, sys_openat2)
 
+#define __NR_process_madvise 438
+__SYSCALL(__NR_process_madvise, sys_process_madvise)
+
 #undef __NR_syscalls
-#define __NR_syscalls 438
+#define __NR_syscalls 439
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 0e9b275260f8..10ce5eac8b4b 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -281,6 +281,7 @@ COND_SYSCALL(mlockall);
 COND_SYSCALL(munlockall);
 COND_SYSCALL(mincore);
 COND_SYSCALL(madvise);
+COND_SYSCALL(process_madvise);
 COND_SYSCALL(remap_file_pages);
 COND_SYSCALL(mbind);
 COND_SYSCALL_COMPAT(mbind);
diff --git a/mm/madvise.c b/mm/madvise.c
index 0c901de531e4..59b5cc99ef61 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -17,6 +17,7 @@
 #include <linux/falloc.h>
 #include <linux/fadvise.h>
 #include <linux/sched.h>
+#include <linux/sched/mm.h>
 #include <linux/ksm.h>
 #include <linux/fs.h>
 #include <linux/file.h>
@@ -993,6 +994,18 @@ madvise_behavior_valid(int behavior)
 	}
 }
 
+static bool
+process_madvise_behavior_valid(int behavior)
+{
+	switch (behavior) {
+	case MADV_COLD:
+	case MADV_PAGEOUT:
+		return true;
+	default:
+		return false;
+	}
+}
+
 /*
  * madvise_common - request behavior hint to address range of the target process
  *
@@ -1151,6 +1164,10 @@ static int madvise_common(struct task_struct *task, struct mm_struct *mm,
  *  MADV_DONTDUMP - the application wants to prevent pages in the given range
  *		from being included in its core dump.
  *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
+ *  MADV_COLD - the application uses the memory less so the kernel can deactivate
+ *  		the memory to evict them quickly when the memory pressure happen.
+ *  MADV_PAGEOUT - the application uses the memroy very rarely so kernel can
+ *  		page out the memory instantly.
  *
  * return values:
  *  zero    - success
@@ -1169,3 +1186,49 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 {
 	return madvise_common(current, current->mm, start, len_in, behavior);
 }
+
+SYSCALL_DEFINE5(process_madvise, int, pidfd, unsigned long, start,
+		size_t, len_in, int, behavior, unsigned long, flags)
+{
+	int ret;
+	struct fd f;
+	struct pid *pid;
+	struct task_struct *task;
+	struct mm_struct *mm;
+
+	if (flags != 0)
+		return -EINVAL;
+
+	if (!process_madvise_behavior_valid(behavior))
+		return -EINVAL;
+
+	f = fdget(pidfd);
+	if (!f.file)
+		return -EBADF;
+
+	pid = pidfd_pid(f.file);
+	if (IS_ERR(pid)) {
+		ret = PTR_ERR(pid);
+		goto fdput;
+	}
+
+	task = get_pid_task(pid, PIDTYPE_PID);
+	if (!task) {
+		ret = -ESRCH;
+		goto fdput;
+	}
+
+	mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
+	if (IS_ERR_OR_NULL(mm)) {
+		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
+		goto release_task;
+	}
+
+	ret = madvise_common(task, mm, start, len_in, behavior);
+	mmput(mm);
+release_task:
+	put_task_struct(task);
+fdput:
+	fdput(f);
+	return ret;
+}
-- 
2.25.0.rc1.283.g88dfdc4193-goog


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 3/5] mm/madvise: employ mmget_still_valid for write lock
  2020-01-16 23:59 [PATCH v2 0/5] introduce memory hinting API for external process Minchan Kim
  2020-01-16 23:59 ` [PATCH v2 1/5] mm: factor out madvise's core functionality Minchan Kim
  2020-01-16 23:59 ` [PATCH v2 2/5] mm: introduce external memory hinting API Minchan Kim
@ 2020-01-16 23:59 ` Minchan Kim
  2020-01-16 23:59 ` [PATCH v2 4/5] mm/madvise: allow KSM hints for remote API Minchan Kim
  2020-01-16 23:59 ` [PATCH v2 5/5] mm: support both pid and pidfd for process_madvise Minchan Kim
  4 siblings, 0 replies; 39+ messages in thread
From: Minchan Kim @ 2020-01-16 23:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, linux-mm, linux-api, oleksandr, Suren Baghdasaryan,
	Tim Murray, Daniel Colascione, Sandeep Patil, Sonny Rao,
	Brian Geffon, Michal Hocko, Johannes Weiner, Shakeel Butt,
	John Dias, ktkhai, christian.brauner, sjpark, Minchan Kim,
	Minchan Kim

From: Oleksandr Natalenko <oleksandr@redhat.com>

Do the very same trick as we already do since 04f5866e41fb. KSM hints
will require locking mmap_sem for write since they modify vm_flags, so
for remote KSM hinting this additional check is needed.

Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
Signed-off-by: Minchan Kim <minchan@google.com>
---
 mm/madvise.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index 59b5cc99ef61..84cffd0900f1 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1059,6 +1059,8 @@ static int madvise_common(struct task_struct *task, struct mm_struct *mm,
 	if (write) {
 		if (down_write_killable(&mm->mmap_sem))
 			return -EINTR;
+		if (current->mm != mm && !mmget_still_valid(mm))
+			goto skip_mm;
 	} else {
 		down_read(&mm->mmap_sem);
 	}
@@ -1109,6 +1111,7 @@ static int madvise_common(struct task_struct *task, struct mm_struct *mm,
 	}
 out:
 	blk_finish_plug(&plug);
+skip_mm:
 	if (write)
 		up_write(&mm->mmap_sem);
 	else
-- 
2.25.0.rc1.283.g88dfdc4193-goog


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 4/5] mm/madvise: allow KSM hints for remote API
  2020-01-16 23:59 [PATCH v2 0/5] introduce memory hinting API for external process Minchan Kim
                   ` (2 preceding siblings ...)
  2020-01-16 23:59 ` [PATCH v2 3/5] mm/madvise: employ mmget_still_valid for write lock Minchan Kim
@ 2020-01-16 23:59 ` Minchan Kim
  2020-01-17 10:13   ` Kirill Tkhai
  2020-01-16 23:59 ` [PATCH v2 5/5] mm: support both pid and pidfd for process_madvise Minchan Kim
  4 siblings, 1 reply; 39+ messages in thread
From: Minchan Kim @ 2020-01-16 23:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, linux-mm, linux-api, oleksandr, Suren Baghdasaryan,
	Tim Murray, Daniel Colascione, Sandeep Patil, Sonny Rao,
	Brian Geffon, Michal Hocko, Johannes Weiner, Shakeel Butt,
	John Dias, ktkhai, christian.brauner, sjpark, Minchan Kim,
	Minchan Kim

From: Oleksandr Natalenko <oleksandr@redhat.com>

It all began with the fact that KSM works only on memory that is marked
by madvise(). And the only way to get around that is to either:

  * use LD_PRELOAD; or
  * patch the kernel with something like UKSM or PKSM.

(i skip ptrace can of worms here intentionally)

To overcome this restriction, lets employ a new remote madvise API. This
can be used by some small userspace helper daemon that will do auto-KSM
job for us.

I think of two major consumers of remote KSM hints:

  * hosts, that run containers, especially similar ones and especially in
    a trusted environment, sharing the same runtime like Node.js;

  * heavy applications, that can be run in multiple instances, not
    limited to opensource ones like Firefox, but also those that cannot be
    modified since they are binary-only and, maybe, statically linked.

Speaking of statistics, more numbers can be found in the very first
submission, that is related to this one [1]. For my current setup with
two Firefox instances I get 100 to 200 MiB saved for the second instance
depending on the amount of tabs.

1 FF instance with 15 tabs:

   $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
   410

2 FF instances, second one has 12 tabs (all the tabs are different):

   $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
   592

At the very moment I do not have specific numbers for containerised
workload, but those should be comparable in case the containers share
similar/same runtime.

[1] https://lore.kernel.org/patchwork/patch/1012142/

Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
Signed-off-by: Minchan Kim <minchan@google.com>
---
 mm/madvise.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index 84cffd0900f1..89557998d287 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1000,6 +1000,8 @@ process_madvise_behavior_valid(int behavior)
 	switch (behavior) {
 	case MADV_COLD:
 	case MADV_PAGEOUT:
+	case MADV_MERGEABLE:
+	case MADV_UNMERGEABLE:
 		return true;
 	default:
 		return false;
-- 
2.25.0.rc1.283.g88dfdc4193-goog


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 5/5] mm: support both pid and pidfd for process_madvise
  2020-01-16 23:59 [PATCH v2 0/5] introduce memory hinting API for external process Minchan Kim
                   ` (3 preceding siblings ...)
  2020-01-16 23:59 ` [PATCH v2 4/5] mm/madvise: allow KSM hints for remote API Minchan Kim
@ 2020-01-16 23:59 ` Minchan Kim
  4 siblings, 0 replies; 39+ messages in thread
From: Minchan Kim @ 2020-01-16 23:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, linux-mm, linux-api, oleksandr, Suren Baghdasaryan,
	Tim Murray, Daniel Colascione, Sandeep Patil, Sonny Rao,
	Brian Geffon, Michal Hocko, Johannes Weiner, Shakeel Butt,
	John Dias, ktkhai, christian.brauner, sjpark, Minchan Kim

There is a demand[1] to support pid as well pidfd for process_madvise
to reduce unncessary syscall to get pidfd if the user has control of
the targer process(ie, they could gaurantee the process is not gone
or pid is not reused. Or, it might be okay to give a hint to wrong
process).

This patch aims for supporting both options like waitid(2). So, the
syscall is currently,

	int process_madvise(int which, pid_t pid, void *addr,
		size_t length, int advise, unsigned long flag);

@which is actually idtype_t for userspace libray and currently,
it supports P_PID and P_PIDFD.

[1]  https://lore.kernel.org/linux-mm/9d849087-3359-c4ab-fbec-859e8186c509@virtuozzo.com/
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/pid.h      |  1 +
 include/linux/syscalls.h |  3 ++-
 kernel/exit.c            | 17 -----------------
 kernel/pid.c             | 17 +++++++++++++++++
 mm/madvise.c             | 34 ++++++++++++++++++++++------------
 5 files changed, 42 insertions(+), 30 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 998ae7d24450..023d9c3a8edc 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -75,6 +75,7 @@ extern const struct file_operations pidfd_fops;
 struct file;
 
 extern struct pid *pidfd_pid(const struct file *file);
+extern struct pid *pidfd_get_pid(unsigned int fd);
 
 static inline struct pid *get_pid(struct pid *pid)
 {
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 1b58a11ff49f..27060e59db37 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -877,7 +877,8 @@ asmlinkage long sys_munlockall(void);
 asmlinkage long sys_mincore(unsigned long start, size_t len,
 				unsigned char __user * vec);
 asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
-asmlinkage long sys_process_madvise(int pidfd, unsigned long start,
+
+asmlinkage long sys_process_madvise(int which, pid_t pid, unsigned long start,
 			size_t len, int behavior, unsigned long flags);
 asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
 			unsigned long prot, unsigned long pgoff,
diff --git a/kernel/exit.c b/kernel/exit.c
index bcbd59888e67..7698843b1411 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1466,23 +1466,6 @@ static long do_wait(struct wait_opts *wo)
 	return retval;
 }
 
-static struct pid *pidfd_get_pid(unsigned int fd)
-{
-	struct fd f;
-	struct pid *pid;
-
-	f = fdget(fd);
-	if (!f.file)
-		return ERR_PTR(-EBADF);
-
-	pid = pidfd_pid(f.file);
-	if (!IS_ERR(pid))
-		get_pid(pid);
-
-	fdput(f);
-	return pid;
-}
-
 static long kernel_waitid(int which, pid_t upid, struct waitid_info *infop,
 			  int options, struct rusage *ru)
 {
diff --git a/kernel/pid.c b/kernel/pid.c
index 2278e249141d..a41a89d5dad2 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -496,6 +496,23 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
 	return idr_get_next(&ns->idr, &nr);
 }
 
+struct pid *pidfd_get_pid(unsigned int fd)
+{
+	struct fd f;
+	struct pid *pid;
+
+	f = fdget(fd);
+	if (!f.file)
+		return ERR_PTR(-EBADF);
+
+	pid = pidfd_pid(f.file);
+	if (!IS_ERR(pid))
+		get_pid(pid);
+
+	fdput(f);
+	return pid;
+}
+
 /**
  * pidfd_create() - Create a new pid file descriptor.
  *
diff --git a/mm/madvise.c b/mm/madvise.c
index 89557998d287..2ac62716e5b8 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1192,11 +1192,10 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 	return madvise_common(current, current->mm, start, len_in, behavior);
 }
 
-SYSCALL_DEFINE5(process_madvise, int, pidfd, unsigned long, start,
+SYSCALL_DEFINE6(process_madvise, int, which, pid_t, upid, unsigned long, start,
 		size_t, len_in, int, behavior, unsigned long, flags)
 {
 	int ret;
-	struct fd f;
 	struct pid *pid;
 	struct task_struct *task;
 	struct mm_struct *mm;
@@ -1207,20 +1206,31 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, unsigned long, start,
 	if (!process_madvise_behavior_valid(behavior))
 		return -EINVAL;
 
-	f = fdget(pidfd);
-	if (!f.file)
-		return -EBADF;
+	switch (which) {
+	case P_PID:
+		if (upid <= 0)
+			return -EINVAL;
+
+		pid = find_get_pid(upid);
+		if (!pid)
+			return -ESRCH;
+		break;
+	case P_PIDFD:
+		if (upid < 0)
+			return -EINVAL;
 
-	pid = pidfd_pid(f.file);
-	if (IS_ERR(pid)) {
-		ret = PTR_ERR(pid);
-		goto fdput;
+		pid = pidfd_get_pid(upid);
+		if (IS_ERR(pid))
+			return PTR_ERR(pid);
+		break;
+	default:
+		return -EINVAL;
 	}
 
 	task = get_pid_task(pid, PIDTYPE_PID);
 	if (!task) {
 		ret = -ESRCH;
-		goto fdput;
+		goto put_pid;
 	}
 
 	mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
@@ -1233,7 +1243,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, unsigned long, start,
 	mmput(mm);
 release_task:
 	put_task_struct(task);
-fdput:
-	fdput(f);
+put_pid:
+	put_pid(pid);
 	return ret;
 }
-- 
2.25.0.rc1.283.g88dfdc4193-goog


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 1/5] mm: factor out madvise's core functionality
  2020-01-16 23:59 ` [PATCH v2 1/5] mm: factor out madvise's core functionality Minchan Kim
@ 2020-01-17 10:02   ` Kirill Tkhai
  2020-01-17 18:14     ` Minchan Kim
  0 siblings, 1 reply; 39+ messages in thread
From: Kirill Tkhai @ 2020-01-17 10:02 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton
  Cc: LKML, linux-mm, linux-api, oleksandr, Suren Baghdasaryan,
	Tim Murray, Daniel Colascione, Sandeep Patil, Sonny Rao,
	Brian Geffon, Michal Hocko, Johannes Weiner, Shakeel Butt,
	John Dias, christian.brauner, sjpark, Kirill A. Shutemov

On 17.01.2020 02:59, Minchan Kim wrote:
> This patch factor out madvise's core functionality so that upcoming
> patch can reuse it without duplication. It shouldn't change any behavior.
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  mm/madvise.c | 194 +++++++++++++++++++++++++++++----------------------
>  1 file changed, 111 insertions(+), 83 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index bcdb6a042787..0c901de531e4 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -35,6 +35,7 @@
>  struct madvise_walk_private {
>  	struct mmu_gather *tlb;
>  	bool pageout;
> +	struct task_struct *task;
>  };
>  
>  /*
> @@ -306,12 +307,13 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  	bool pageout = private->pageout;
>  	struct mm_struct *mm = tlb->mm;
>  	struct vm_area_struct *vma = walk->vma;
> +	struct task_struct *task = private->task;
>  	pte_t *orig_pte, *pte, ptent;
>  	spinlock_t *ptl;
>  	struct page *page = NULL;
>  	LIST_HEAD(page_list);
>  
> -	if (fatal_signal_pending(current))
> +	if (fatal_signal_pending(task))
>  		return -EINTR;

This EINTR may confuse userspace. Users will think the syscall was interrupted,
and it may be restarted, but this is not true.

What we care here? Current task received fatal signal, while walk_page_range(..&cold_walk_ops..)
is a long cycle. So, this check allows to break the cycle faster.

Iteration over remote task's mm may also be long, and we still may need to break
it if current received a signal.

So, we'd better left fatal_signal_pending(current) here.

Maybe we need both tasks fatal_signal_pending() checks and different retvals here,
but it's up to you.

>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> @@ -469,12 +471,14 @@ static const struct mm_walk_ops cold_walk_ops = {
>  };
>  
>  static void madvise_cold_page_range(struct mmu_gather *tlb,
> +			     struct task_struct *task,
>  			     struct vm_area_struct *vma,
>  			     unsigned long addr, unsigned long end)
>  {
>  	struct madvise_walk_private walk_private = {
>  		.pageout = false,
>  		.tlb = tlb,
> +		.task = task,
>  	};
>  
>  	tlb_start_vma(tlb, vma);
> @@ -482,7 +486,7 @@ static void madvise_cold_page_range(struct mmu_gather *tlb,
>  	tlb_end_vma(tlb, vma);
>  }
>  
> -static long madvise_cold(struct vm_area_struct *vma,
> +static long madvise_cold(struct task_struct *task, struct vm_area_struct *vma,
>  			struct vm_area_struct **prev,
>  			unsigned long start_addr, unsigned long end_addr)
>  {
> @@ -495,19 +499,21 @@ static long madvise_cold(struct vm_area_struct *vma,
>  
>  	lru_add_drain();
>  	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> -	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
> +	madvise_cold_page_range(&tlb, task, vma, start_addr, end_addr);
>  	tlb_finish_mmu(&tlb, start_addr, end_addr);
>  
>  	return 0;
>  }
>  
>  static void madvise_pageout_page_range(struct mmu_gather *tlb,
> +			     struct task_struct *task,
>  			     struct vm_area_struct *vma,
>  			     unsigned long addr, unsigned long end)
>  {
>  	struct madvise_walk_private walk_private = {
>  		.pageout = true,
>  		.tlb = tlb,
> +		.task = task,
>  	};
>  
>  	tlb_start_vma(tlb, vma);
> @@ -531,9 +537,9 @@ static inline bool can_do_pageout(struct vm_area_struct *vma)
>  		inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0;
>  }
>  
> -static long madvise_pageout(struct vm_area_struct *vma,
> -			struct vm_area_struct **prev,
> -			unsigned long start_addr, unsigned long end_addr)
> +static long madvise_pageout(struct task_struct *task,
> +		struct vm_area_struct *vma, struct vm_area_struct **prev,
> +		unsigned long start_addr, unsigned long end_addr)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	struct mmu_gather tlb;
> @@ -547,7 +553,7 @@ static long madvise_pageout(struct vm_area_struct *vma,
>  
>  	lru_add_drain();
>  	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> -	madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
> +	madvise_pageout_page_range(&tlb, task, vma, start_addr, end_addr);

This is new remote VMA iteration.

I found Kirill Shutemov is not in CC. CC Kirill.

>  	tlb_finish_mmu(&tlb, start_addr, end_addr);
>  
>  	return 0;
> @@ -751,7 +757,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
>  	return 0;
>  }
>  
> -static long madvise_dontneed_free(struct vm_area_struct *vma,
> +static long madvise_dontneed_free(struct mm_struct *mm,
> +				  struct vm_area_struct *vma,
>  				  struct vm_area_struct **prev,
>  				  unsigned long start, unsigned long end,
>  				  int behavior)
> @@ -763,8 +770,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
>  	if (!userfaultfd_remove(vma, start, end)) {
>  		*prev = NULL; /* mmap_sem has been dropped, prev is stale */
>  
> -		down_read(&current->mm->mmap_sem);
> -		vma = find_vma(current->mm, start);
> +		down_read(&mm->mmap_sem);
> +		vma = find_vma(mm, start);
>  		if (!vma)
>  			return -ENOMEM;
>  		if (start < vma->vm_start) {
> @@ -811,7 +818,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
>   * Application wants to free up the pages and associated backing store.
>   * This is effectively punching a hole into the middle of a file.
>   */
> -static long madvise_remove(struct vm_area_struct *vma,
> +static long madvise_remove(struct mm_struct *mm,
> +				struct vm_area_struct *vma,
>  				struct vm_area_struct **prev,
>  				unsigned long start, unsigned long end)
>  {
> @@ -845,13 +853,13 @@ static long madvise_remove(struct vm_area_struct *vma,
>  	get_file(f);
>  	if (userfaultfd_remove(vma, start, end)) {
>  		/* mmap_sem was not released by userfaultfd_remove() */
> -		up_read(&current->mm->mmap_sem);
> +		up_read(&mm->mmap_sem);
>  	}
>  	error = vfs_fallocate(f,
>  				FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>  				offset, end - start);
>  	fput(f);
> -	down_read(&current->mm->mmap_sem);
> +	down_read(&mm->mmap_sem);
>  	return error;
>  }
>  
> @@ -925,21 +933,23 @@ static int madvise_inject_error(int behavior,
>  #endif
>  
>  static long
> -madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> +madvise_vma(struct task_struct *task, struct mm_struct *mm,
> +		struct vm_area_struct *vma, struct vm_area_struct **prev,
>  		unsigned long start, unsigned long end, int behavior)
>  {
>  	switch (behavior) {
>  	case MADV_REMOVE:
> -		return madvise_remove(vma, prev, start, end);
> +		return madvise_remove(mm, vma, prev, start, end);
>  	case MADV_WILLNEED:
>  		return madvise_willneed(vma, prev, start, end);
>  	case MADV_COLD:
> -		return madvise_cold(vma, prev, start, end);
> +		return madvise_cold(task, vma, prev, start, end);
>  	case MADV_PAGEOUT:
> -		return madvise_pageout(vma, prev, start, end);
> +		return madvise_pageout(task, vma, prev, start, end);
>  	case MADV_FREE:
>  	case MADV_DONTNEED:
> -		return madvise_dontneed_free(vma, prev, start, end, behavior);
> +		return madvise_dontneed_free(mm, vma, prev, start,
> +						end, behavior);
>  	default:
>  		return madvise_behavior(vma, prev, start, end, behavior);
>  	}
> @@ -984,67 +994,19 @@ madvise_behavior_valid(int behavior)
>  }
>  
>  /*
> - * The madvise(2) system call.
> + * madvise_common - request behavior hint to address range of the target process
>   *
> - * Applications can use madvise() to advise the kernel how it should
> - * handle paging I/O in this VM area.  The idea is to help the kernel
> - * use appropriate read-ahead and caching techniques.  The information
> - * provided is advisory only, and can be safely disregarded by the
> - * kernel without affecting the correct operation of the application.
> + * @task: task_struct got behavior hint, not giving the hint
> + * @mm: mm_struct got behavior hint, not giving the hint
> + * @start: base address of the hinted range
> + * @len_in: length of the hinted range
> + * @behavior: requested hint
>   *
> - * behavior values:
> - *  MADV_NORMAL - the default behavior is to read clusters.  This
> - *		results in some read-ahead and read-behind.
> - *  MADV_RANDOM - the system should read the minimum amount of data
> - *		on any access, since it is unlikely that the appli-
> - *		cation will need more than what it asks for.
> - *  MADV_SEQUENTIAL - pages in the given range will probably be accessed
> - *		once, so they can be aggressively read ahead, and
> - *		can be freed soon after they are accessed.
> - *  MADV_WILLNEED - the application is notifying the system to read
> - *		some pages ahead.
> - *  MADV_DONTNEED - the application is finished with the given range,
> - *		so the kernel can free resources associated with it.
> - *  MADV_FREE - the application marks pages in the given range as lazy free,
> - *		where actual purges are postponed until memory pressure happens.
> - *  MADV_REMOVE - the application wants to free up the given range of
> - *		pages and associated backing store.
> - *  MADV_DONTFORK - omit this area from child's address space when forking:
> - *		typically, to avoid COWing pages pinned by get_user_pages().
> - *  MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
> - *  MADV_WIPEONFORK - present the child process with zero-filled memory in this
> - *              range after a fork.
> - *  MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
> - *  MADV_HWPOISON - trigger memory error handler as if the given memory range
> - *		were corrupted by unrecoverable hardware memory failure.
> - *  MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
> - *  MADV_MERGEABLE - the application recommends that KSM try to merge pages in
> - *		this area with pages of identical content from other such areas.
> - *  MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
> - *  MADV_HUGEPAGE - the application wants to back the given range by transparent
> - *		huge pages in the future. Existing pages might be coalesced and
> - *		new pages might be allocated as THP.
> - *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> - *		transparent huge pages so the existing pages will not be
> - *		coalesced into THP and new pages will not be allocated as THP.
> - *  MADV_DONTDUMP - the application wants to prevent pages in the given range
> - *		from being included in its core dump.
> - *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> - *
> - * return values:
> - *  zero    - success
> - *  -EINVAL - start + len < 0, start is not page-aligned,
> - *		"behavior" is not a valid value, or application
> - *		is attempting to release locked or shared pages,
> - *		or the specified address range includes file, Huge TLB,
> - *		MAP_SHARED or VMPFNMAP range.
> - *  -ENOMEM - addresses in the specified range are not currently
> - *		mapped, or are outside the AS of the process.
> - *  -EIO    - an I/O error occurred while paging in data.
> - *  -EBADF  - map exists, but area maps something that isn't a file.
> - *  -EAGAIN - a kernel resource was temporarily unavailable.
> + * @task could be a zombie leader if it calls sys_exit so accessing mm_struct
> + * via task->mm is prohibited. Please use @mm instead of task->mm.
>   */
> -SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> +static int madvise_common(struct task_struct *task, struct mm_struct *mm,
> +			unsigned long start, size_t len_in, int behavior)
>  {
>  	unsigned long end, tmp;
>  	struct vm_area_struct *vma, *prev;
> @@ -1082,10 +1044,10 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  
>  	write = madvise_need_mmap_write(behavior);
>  	if (write) {
> -		if (down_write_killable(&current->mm->mmap_sem))
> +		if (down_write_killable(&mm->mmap_sem))
>  			return -EINTR;
>  	} else {
> -		down_read(&current->mm->mmap_sem);
> +		down_read(&mm->mmap_sem);
>  	}
>  
>  	/*
> @@ -1093,7 +1055,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  	 * ranges, just ignore them, but return -ENOMEM at the end.
>  	 * - different from the way of handling in mlock etc.
>  	 */
> -	vma = find_vma_prev(current->mm, start, &prev);
> +	vma = find_vma_prev(mm, start, &prev);
>  	if (vma && start > vma->vm_start)
>  		prev = vma;
>  
> @@ -1118,7 +1080,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  			tmp = end;
>  
>  		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> -		error = madvise_vma(vma, &prev, start, tmp, behavior);
> +		error = madvise_vma(task, mm, vma, &prev, start, tmp, behavior);
>  		if (error)
>  			goto out;
>  		start = tmp;
> @@ -1130,14 +1092,80 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  		if (prev)
>  			vma = prev->vm_next;
>  		else	/* madvise_remove dropped mmap_sem */
> -			vma = find_vma(current->mm, start);
> +			vma = find_vma(mm, start);
>  	}
>  out:
>  	blk_finish_plug(&plug);
>  	if (write)
> -		up_write(&current->mm->mmap_sem);
> +		up_write(&mm->mmap_sem);
>  	else
> -		up_read(&current->mm->mmap_sem);
> +		up_read(&mm->mmap_sem);
>  
>  	return error;
>  }
> +
> +/*
> + * The madvise(2) system call.
> + *
> + * Applications can use madvise() to advise the kernel how it should
> + * handle paging I/O in this VM area.  The idea is to help the kernel
> + * use appropriate read-ahead and caching techniques.  The information
> + * provided is advisory only, and can be safely disregarded by the
> + * kernel without affecting the correct operation of the application.
> + *
> + * behavior values:
> + *  MADV_NORMAL - the default behavior is to read clusters.  This
> + *		results in some read-ahead and read-behind.
> + *  MADV_RANDOM - the system should read the minimum amount of data
> + *		on any access, since it is unlikely that the appli-
> + *		cation will need more than what it asks for.
> + *  MADV_SEQUENTIAL - pages in the given range will probably be accessed
> + *		once, so they can be aggressively read ahead, and
> + *		can be freed soon after they are accessed.
> + *  MADV_WILLNEED - the application is notifying the system to read
> + *		some pages ahead.
> + *  MADV_DONTNEED - the application is finished with the given range,
> + *		so the kernel can free resources associated with it.
> + *  MADV_FREE - the application marks pages in the given range as lazy free,
> + *		where actual purges are postponed until memory pressure happens.
> + *  MADV_REMOVE - the application wants to free up the given range of
> + *		pages and associated backing store.
> + *  MADV_DONTFORK - omit this area from child's address space when forking:
> + *		typically, to avoid COWing pages pinned by get_user_pages().
> + *  MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
> + *  MADV_WIPEONFORK - present the child process with zero-filled memory in this
> + *              range after a fork.
> + *  MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
> + *  MADV_HWPOISON - trigger memory error handler as if the given memory range
> + *		were corrupted by unrecoverable hardware memory failure.
> + *  MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
> + *  MADV_MERGEABLE - the application recommends that KSM try to merge pages in
> + *		this area with pages of identical content from other such areas.
> + *  MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
> + *  MADV_HUGEPAGE - the application wants to back the given range by transparent
> + *		huge pages in the future. Existing pages might be coalesced and
> + *		new pages might be allocated as THP.
> + *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> + *		transparent huge pages so the existing pages will not be
> + *		coalesced into THP and new pages will not be allocated as THP.
> + *  MADV_DONTDUMP - the application wants to prevent pages in the given range
> + *		from being included in its core dump.
> + *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> + *
> + * return values:
> + *  zero    - success
> + *  -EINVAL - start + len < 0, start is not page-aligned,
> + *		"behavior" is not a valid value, or application
> + *		is attempting to release locked or shared pages,
> + *		or the specified address range includes file, Huge TLB,
> + *		MAP_SHARED or VMPFNMAP range.
> + *  -ENOMEM - addresses in the specified range are not currently
> + *		mapped, or are outside the AS of the process.
> + *  -EIO    - an I/O error occurred while paging in data.
> + *  -EBADF  - map exists, but area maps something that isn't a file.
> + *  -EAGAIN - a kernel resource was temporarily unavailable.
> + */
> +SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> +{
> +	return madvise_common(current, current->mm, start, len_in, behavior);
> +}
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/5] mm/madvise: allow KSM hints for remote API
  2020-01-16 23:59 ` [PATCH v2 4/5] mm/madvise: allow KSM hints for remote API Minchan Kim
@ 2020-01-17 10:13   ` Kirill Tkhai
  2020-01-17 12:34     ` Oleksandr Natalenko
  0 siblings, 1 reply; 39+ messages in thread
From: Kirill Tkhai @ 2020-01-17 10:13 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton
  Cc: LKML, linux-mm, linux-api, oleksandr, Suren Baghdasaryan,
	Tim Murray, Daniel Colascione, Sandeep Patil, Sonny Rao,
	Brian Geffon, Michal Hocko, Johannes Weiner, Shakeel Butt,
	John Dias, christian.brauner, sjpark, Minchan Kim

On 17.01.2020 02:59, Minchan Kim wrote:
> From: Oleksandr Natalenko <oleksandr@redhat.com>
> 
> It all began with the fact that KSM works only on memory that is marked
> by madvise(). And the only way to get around that is to either:
> 
>   * use LD_PRELOAD; or
>   * patch the kernel with something like UKSM or PKSM.
> 
> (i skip ptrace can of worms here intentionally)
> 
> To overcome this restriction, lets employ a new remote madvise API. This
> can be used by some small userspace helper daemon that will do auto-KSM
> job for us.
> 
> I think of two major consumers of remote KSM hints:
> 
>   * hosts, that run containers, especially similar ones and especially in
>     a trusted environment, sharing the same runtime like Node.js;
> 
>   * heavy applications, that can be run in multiple instances, not
>     limited to opensource ones like Firefox, but also those that cannot be
>     modified since they are binary-only and, maybe, statically linked.
> 
> Speaking of statistics, more numbers can be found in the very first
> submission, that is related to this one [1]. For my current setup with
> two Firefox instances I get 100 to 200 MiB saved for the second instance
> depending on the amount of tabs.
> 
> 1 FF instance with 15 tabs:
> 
>    $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
>    410
> 
> 2 FF instances, second one has 12 tabs (all the tabs are different):
> 
>    $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
>    592
> 
> At the very moment I do not have specific numbers for containerised
> workload, but those should be comparable in case the containers share
> similar/same runtime.
> 
> [1] https://lore.kernel.org/patchwork/patch/1012142/
> 
> Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
> Signed-off-by: Minchan Kim <minchan@google.com>
> ---
>  mm/madvise.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 84cffd0900f1..89557998d287 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1000,6 +1000,8 @@ process_madvise_behavior_valid(int behavior)
>  	switch (behavior) {
>  	case MADV_COLD:
>  	case MADV_PAGEOUT:
> +	case MADV_MERGEABLE:
> +	case MADV_UNMERGEABLE:
>  		return true;
>  	default:
>  		return false;

Remote madvise on KSM parameters should be OK.

One thing is madvise_behavior_valid() places MADV_MERGEABLE/UNMERGEABLE
in #ifdef brackes, so -EINVAL is returned by madvise() syscall if KSM
is not enabled. Here we should follow the same way for symmetry.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-16 23:59 ` [PATCH v2 2/5] mm: introduce external memory hinting API Minchan Kim
@ 2020-01-17 11:52   ` Michal Hocko
  2020-01-17 15:58     ` Kirill A. Shutemov
                       ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Michal Hocko @ 2020-01-17 11:52 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, linux-api, oleksandr,
	Suren Baghdasaryan, Tim Murray, Daniel Colascione, Sandeep Patil,
	Sonny Rao, Brian Geffon, Johannes Weiner, Shakeel Butt,
	John Dias, ktkhai, christian.brauner, sjpark

On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> There is usecase that System Management Software(SMS) want to give
> a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> in the case of Android, it is the ActivityManagerService.
> 
> It's similar in spirit to madvise(MADV_WONTNEED), but the information
> required to make the reclaim decision is not known to the app. Instead,
> it is known to the centralized userspace daemon(ActivityManagerService),
> and that daemon must be able to initiate reclaim on its own without
> any app involvement.
> 
> To solve the issue, this patch introduces new syscall process_madvise(2).
> It uses pidfd of an external processs to give the hint.
> 
>  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> 			unsigned long flag);
> 
> Since it could affect other process's address range, only privileged
> process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> gives it the right to ptrace the process could use it successfully.
> The flag argument is reserved for future use if we need to extend the
> API.
> 
> I think supporting all hints madvise has/will supported/support to
> process_madvise is rather risky. Because we are not sure all hints make
> sense from external process and implementation for the hint may rely on
> the caller being in the current context so it could be error-prone.
> Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> 
> If someone want to add other hints, we could hear hear the usecase and
> review it for each hint. It's more safe for maintainace rather than
> introducing a buggy syscall but hard to fix it later.

I have brought this up when we discussed this in the past but there is
no reflection on that here so let me bring that up again. 

I believe that the interface has an inherent problem that it is racy.
The external entity needs to know the address space layout of the target
process to do anyhing useful on it. The address space is however under
the full control of the target process though and the external entity
has no means to find out that the layout has changed. So
time-to-check-time-to-act is an inherent problem.

This is a serious design flaw and it should be explained why it doesn't
matter or how to use the interface properly to prevent that problem.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/5] mm/madvise: allow KSM hints for remote API
  2020-01-17 10:13   ` Kirill Tkhai
@ 2020-01-17 12:34     ` Oleksandr Natalenko
  2020-01-21 17:45       ` Minchan Kim
  0 siblings, 1 reply; 39+ messages in thread
From: Oleksandr Natalenko @ 2020-01-17 12:34 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, linux-api,
	Suren Baghdasaryan, Tim Murray, Daniel Colascione, Sandeep Patil,
	Sonny Rao, Brian Geffon, Michal Hocko, Johannes Weiner,
	Shakeel Butt, John Dias, christian.brauner, sjpark, Minchan Kim

Hi.

On Fri, Jan 17, 2020 at 01:13:14PM +0300, Kirill Tkhai wrote:
> On 17.01.2020 02:59, Minchan Kim wrote:
> > From: Oleksandr Natalenko <oleksandr@redhat.com>
> > 
> > It all began with the fact that KSM works only on memory that is marked
> > by madvise(). And the only way to get around that is to either:
> > 
> >   * use LD_PRELOAD; or
> >   * patch the kernel with something like UKSM or PKSM.
> > 
> > (i skip ptrace can of worms here intentionally)
> > 
> > To overcome this restriction, lets employ a new remote madvise API. This
> > can be used by some small userspace helper daemon that will do auto-KSM
> > job for us.
> > 
> > I think of two major consumers of remote KSM hints:
> > 
> >   * hosts, that run containers, especially similar ones and especially in
> >     a trusted environment, sharing the same runtime like Node.js;
> > 
> >   * heavy applications, that can be run in multiple instances, not
> >     limited to opensource ones like Firefox, but also those that cannot be
> >     modified since they are binary-only and, maybe, statically linked.
> > 
> > Speaking of statistics, more numbers can be found in the very first
> > submission, that is related to this one [1]. For my current setup with
> > two Firefox instances I get 100 to 200 MiB saved for the second instance
> > depending on the amount of tabs.
> > 
> > 1 FF instance with 15 tabs:
> > 
> >    $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
> >    410
> > 
> > 2 FF instances, second one has 12 tabs (all the tabs are different):
> > 
> >    $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
> >    592
> > 
> > At the very moment I do not have specific numbers for containerised
> > workload, but those should be comparable in case the containers share
> > similar/same runtime.
> > 
> > [1] https://lore.kernel.org/patchwork/patch/1012142/
> > 
> > Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
> > Signed-off-by: Minchan Kim <minchan@google.com>
> > ---
> >  mm/madvise.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 84cffd0900f1..89557998d287 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1000,6 +1000,8 @@ process_madvise_behavior_valid(int behavior)
> >  	switch (behavior) {
> >  	case MADV_COLD:
> >  	case MADV_PAGEOUT:
> > +	case MADV_MERGEABLE:
> > +	case MADV_UNMERGEABLE:
> >  		return true;
> >  	default:
> >  		return false;
> 
> Remote madvise on KSM parameters should be OK.
> 
> One thing is madvise_behavior_valid() places MADV_MERGEABLE/UNMERGEABLE
> in #ifdef brackes, so -EINVAL is returned by madvise() syscall if KSM
> is not enabled. Here we should follow the same way for symmetry.
> 

Thanks for the suggestion.

Minchan, shall you adopt it directly, or I should send a separate patch?

-- 
  Best regards,
    Oleksandr Natalenko (post-factum)
    Senior Software Maintenance Engineer


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-17 11:52   ` Michal Hocko
@ 2020-01-17 15:58     ` Kirill A. Shutemov
  2020-01-17 17:32       ` Minchan Kim
  2020-01-17 17:25     ` Minchan Kim
  2020-01-20 10:24     ` Kirill Tkhai
  2 siblings, 1 reply; 39+ messages in thread
From: Kirill A. Shutemov @ 2020-01-17 15:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, linux-api, oleksandr,
	Suren Baghdasaryan, Tim Murray, Daniel Colascione, Sandeep Patil,
	Sonny Rao, Brian Geffon, Johannes Weiner, Shakeel Butt,
	John Dias, ktkhai, christian.brauner, sjpark

On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > There is usecase that System Management Software(SMS) want to give
> > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > in the case of Android, it is the ActivityManagerService.
> > 
> > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > required to make the reclaim decision is not known to the app. Instead,
> > it is known to the centralized userspace daemon(ActivityManagerService),
> > and that daemon must be able to initiate reclaim on its own without
> > any app involvement.
> > 
> > To solve the issue, this patch introduces new syscall process_madvise(2).
> > It uses pidfd of an external processs to give the hint.
> > 
> >  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > 			unsigned long flag);
> > 
> > Since it could affect other process's address range, only privileged
> > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > gives it the right to ptrace the process could use it successfully.
> > The flag argument is reserved for future use if we need to extend the
> > API.
> > 
> > I think supporting all hints madvise has/will supported/support to
> > process_madvise is rather risky. Because we are not sure all hints make
> > sense from external process and implementation for the hint may rely on
> > the caller being in the current context so it could be error-prone.
> > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > 
> > If someone want to add other hints, we could hear hear the usecase and
> > review it for each hint. It's more safe for maintainace rather than
> > introducing a buggy syscall but hard to fix it later.
> 
> I have brought this up when we discussed this in the past but there is
> no reflection on that here so let me bring that up again. 
> 
> I believe that the interface has an inherent problem that it is racy.
> The external entity needs to know the address space layout of the target
> process to do anyhing useful on it. The address space is however under
> the full control of the target process though and the external entity
> has no means to find out that the layout has changed. So
> time-to-check-time-to-act is an inherent problem.
> 
> This is a serious design flaw and it should be explained why it doesn't
> matter or how to use the interface properly to prevent that problem.

I agree, it looks flawed.

Also I don't see what System Management Software can generically do on
sub-process level. I mean how can it decide which part of address space is
less important than other.

I see how a manager can indicate that this process (or a group of
processes) is less important than other, but on per-addres-range basis?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-17 11:52   ` Michal Hocko
  2020-01-17 15:58     ` Kirill A. Shutemov
@ 2020-01-17 17:25     ` Minchan Kim
  2020-01-20  8:03       ` Michal Hocko
  2020-01-20 10:24     ` Kirill Tkhai
  2 siblings, 1 reply; 39+ messages in thread
From: Minchan Kim @ 2020-01-17 17:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, linux-api, oleksandr,
	Suren Baghdasaryan, Tim Murray, Daniel Colascione, Sandeep Patil,
	Sonny Rao, Brian Geffon, Johannes Weiner, Shakeel Butt,
	John Dias, ktkhai, christian.brauner, sjpark

On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > There is usecase that System Management Software(SMS) want to give
> > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > in the case of Android, it is the ActivityManagerService.
> > 
> > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > required to make the reclaim decision is not known to the app. Instead,
> > it is known to the centralized userspace daemon(ActivityManagerService),
> > and that daemon must be able to initiate reclaim on its own without
> > any app involvement.
> > 
> > To solve the issue, this patch introduces new syscall process_madvise(2).
> > It uses pidfd of an external processs to give the hint.
> > 
> >  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > 			unsigned long flag);
> > 
> > Since it could affect other process's address range, only privileged
> > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > gives it the right to ptrace the process could use it successfully.
> > The flag argument is reserved for future use if we need to extend the
> > API.
> > 
> > I think supporting all hints madvise has/will supported/support to
> > process_madvise is rather risky. Because we are not sure all hints make
> > sense from external process and implementation for the hint may rely on
> > the caller being in the current context so it could be error-prone.
> > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > 
> > If someone want to add other hints, we could hear hear the usecase and
> > review it for each hint. It's more safe for maintainace rather than
> > introducing a buggy syscall but hard to fix it later.
> 
> I have brought this up when we discussed this in the past but there is
> no reflection on that here so let me bring that up again. 
> 
> I believe that the interface has an inherent problem that it is racy.
> The external entity needs to know the address space layout of the target
> process to do anyhing useful on it. The address space is however under
> the full control of the target process though and the external entity
> has no means to find out that the layout has changed. So
> time-to-check-time-to-act is an inherent problem.
> 
> This is a serious design flaw and it should be explained why it doesn't
> matter or how to use the interface properly to prevent that problem.

Sorry for the missing that part.

It's not a particular problem of this API because other APIs already have
done with that(e.g., move_pages, process_vm_writev).

Point is userspace has several ways for the control of target process
like SIGSTOP, cgroup freezer or even no need to control since platform
is already aware of that the process will never run until he grant it
or it's resilient even though the race happens.

In future, if we want to support more fine-grained consistency model
like memory layout, we could provide some API to get cookie(e.g.,
seq count which is updated whenever vma of the process changes).  And then
we could feed the cookie to process_madvise's last argument so that
it can fail if founds it's not matched.
For that API, Daniel already posted RFC - process_getinfo[1].
https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-17 15:58     ` Kirill A. Shutemov
@ 2020-01-17 17:32       ` Minchan Kim
  2020-01-17 21:26         ` Kirill A. Shutemov
  0 siblings, 1 reply; 39+ messages in thread
From: Minchan Kim @ 2020-01-17 17:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, linux-api,
	oleksandr, Suren Baghdasaryan, Tim Murray, Daniel Colascione,
	Sandeep Patil, Sonny Rao, Brian Geffon, Johannes Weiner,
	Shakeel Butt, John Dias, ktkhai, christian.brauner, sjpark

On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote:
> On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > > There is usecase that System Management Software(SMS) want to give
> > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > > in the case of Android, it is the ActivityManagerService.
> > > 
> > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > required to make the reclaim decision is not known to the app. Instead,
> > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > and that daemon must be able to initiate reclaim on its own without
> > > any app involvement.
> > > 
> > > To solve the issue, this patch introduces new syscall process_madvise(2).
> > > It uses pidfd of an external processs to give the hint.
> > > 
> > >  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > 			unsigned long flag);
> > > 
> > > Since it could affect other process's address range, only privileged
> > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > gives it the right to ptrace the process could use it successfully.
> > > The flag argument is reserved for future use if we need to extend the
> > > API.
> > > 
> > > I think supporting all hints madvise has/will supported/support to
> > > process_madvise is rather risky. Because we are not sure all hints make
> > > sense from external process and implementation for the hint may rely on
> > > the caller being in the current context so it could be error-prone.
> > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > > 
> > > If someone want to add other hints, we could hear hear the usecase and
> > > review it for each hint. It's more safe for maintainace rather than
> > > introducing a buggy syscall but hard to fix it later.
> > 
> > I have brought this up when we discussed this in the past but there is
> > no reflection on that here so let me bring that up again. 
> > 
> > I believe that the interface has an inherent problem that it is racy.
> > The external entity needs to know the address space layout of the target
> > process to do anyhing useful on it. The address space is however under
> > the full control of the target process though and the external entity
> > has no means to find out that the layout has changed. So
> > time-to-check-time-to-act is an inherent problem.
> > 
> > This is a serious design flaw and it should be explained why it doesn't
> > matter or how to use the interface properly to prevent that problem.
> 
> I agree, it looks flawed.
> 
> Also I don't see what System Management Software can generically do on
> sub-process level. I mean how can it decide which part of address space is
> less important than other.
> 
> I see how a manager can indicate that this process (or a group of
> processes) is less important than other, but on per-addres-range basis?

For example, memory ranges shared by several processes or critical for the
latency, we could avoid those ranges to be cold/pageout to prevent
unncecessary CPU burning/paging.

I also think people don't want to give an KSM hint to non-mergeable area.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 1/5] mm: factor out madvise's core functionality
  2020-01-17 10:02   ` Kirill Tkhai
@ 2020-01-17 18:14     ` Minchan Kim
  0 siblings, 0 replies; 39+ messages in thread
From: Minchan Kim @ 2020-01-17 18:14 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Andrew Morton, LKML, linux-mm, linux-api, oleksandr,
	Suren Baghdasaryan, Tim Murray, Daniel Colascione, Sandeep Patil,
	Sonny Rao, Brian Geffon, Michal Hocko, Johannes Weiner,
	Shakeel Butt, John Dias, christian.brauner, sjpark,
	Kirill A. Shutemov

On Fri, Jan 17, 2020 at 01:02:34PM +0300, Kirill Tkhai wrote:
> On 17.01.2020 02:59, Minchan Kim wrote:
> > This patch factor out madvise's core functionality so that upcoming
> > patch can reuse it without duplication. It shouldn't change any behavior.
> > 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  mm/madvise.c | 194 +++++++++++++++++++++++++++++----------------------
> >  1 file changed, 111 insertions(+), 83 deletions(-)
> > 
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index bcdb6a042787..0c901de531e4 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -35,6 +35,7 @@
> >  struct madvise_walk_private {
> >  	struct mmu_gather *tlb;
> >  	bool pageout;
> > +	struct task_struct *task;
> >  };
> >  
> >  /*
> > @@ -306,12 +307,13 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >  	bool pageout = private->pageout;
> >  	struct mm_struct *mm = tlb->mm;
> >  	struct vm_area_struct *vma = walk->vma;
> > +	struct task_struct *task = private->task;
> >  	pte_t *orig_pte, *pte, ptent;
> >  	spinlock_t *ptl;
> >  	struct page *page = NULL;
> >  	LIST_HEAD(page_list);
> >  
> > -	if (fatal_signal_pending(current))
> > +	if (fatal_signal_pending(task))
> >  		return -EINTR;
> 
> This EINTR may confuse userspace. Users will think the syscall was interrupted,
> and it may be restarted, but this is not true.

madvise_[pageout|cold] doesn't propagate the error to userspace.

> 
> What we care here? Current task received fatal signal, while walk_page_range(..&cold_walk_ops..)
> is a long cycle. So, this check allows to break the cycle faster.
> 
> Iteration over remote task's mm may also be long, and we still may need to break
> it if current received a signal.
> 
> So, we'd better left fatal_signal_pending(current) here.
> 
> Maybe we need both tasks fatal_signal_pending() checks and different retvals here,
> but it's up to you.

Yub, let's check both processes here to bail out.

Thanks for the review!

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-17 17:32       ` Minchan Kim
@ 2020-01-17 21:26         ` Kirill A. Shutemov
  2020-01-18  9:40           ` SeongJae Park
                             ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2020-01-17 21:26 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, linux-api,
	oleksandr, Suren Baghdasaryan, Tim Murray, Daniel Colascione,
	Sandeep Patil, Sonny Rao, Brian Geffon, Johannes Weiner,
	Shakeel Butt, John Dias, ktkhai, christian.brauner, sjpark

On Fri, Jan 17, 2020 at 09:32:39AM -0800, Minchan Kim wrote:
> On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> > > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > > > There is usecase that System Management Software(SMS) want to give
> > > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > > > in the case of Android, it is the ActivityManagerService.
> > > > 
> > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > > required to make the reclaim decision is not known to the app. Instead,
> > > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > > and that daemon must be able to initiate reclaim on its own without
> > > > any app involvement.
> > > > 
> > > > To solve the issue, this patch introduces new syscall process_madvise(2).
> > > > It uses pidfd of an external processs to give the hint.
> > > > 
> > > >  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > > 			unsigned long flag);
> > > > 
> > > > Since it could affect other process's address range, only privileged
> > > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > > gives it the right to ptrace the process could use it successfully.
> > > > The flag argument is reserved for future use if we need to extend the
> > > > API.
> > > > 
> > > > I think supporting all hints madvise has/will supported/support to
> > > > process_madvise is rather risky. Because we are not sure all hints make
> > > > sense from external process and implementation for the hint may rely on
> > > > the caller being in the current context so it could be error-prone.
> > > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > > > 
> > > > If someone want to add other hints, we could hear hear the usecase and
> > > > review it for each hint. It's more safe for maintainace rather than
> > > > introducing a buggy syscall but hard to fix it later.
> > > 
> > > I have brought this up when we discussed this in the past but there is
> > > no reflection on that here so let me bring that up again. 
> > > 
> > > I believe that the interface has an inherent problem that it is racy.
> > > The external entity needs to know the address space layout of the target
> > > process to do anyhing useful on it. The address space is however under
> > > the full control of the target process though and the external entity
> > > has no means to find out that the layout has changed. So
> > > time-to-check-time-to-act is an inherent problem.
> > > 
> > > This is a serious design flaw and it should be explained why it doesn't
> > > matter or how to use the interface properly to prevent that problem.
> > 
> > I agree, it looks flawed.
> > 
> > Also I don't see what System Management Software can generically do on
> > sub-process level. I mean how can it decide which part of address space is
> > less important than other.
> > 
> > I see how a manager can indicate that this process (or a group of
> > processes) is less important than other, but on per-addres-range basis?
> 
> For example, memory ranges shared by several processes or critical for the
> latency, we could avoid those ranges to be cold/pageout to prevent
> unncecessary CPU burning/paging.

Hmm.. I still don't see why any external entity has a better (or any)
knowledge about the matter. The process has to do this, no?

> I also think people don't want to give an KSM hint to non-mergeable area.

And how the manager knows which data is mergable?

If you are intimate enough with the process' internal state feel free to
inject syscall into the process with ptrace. Why bother with half-measures?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-17 21:26         ` Kirill A. Shutemov
@ 2020-01-18  9:40           ` SeongJae Park
  2020-01-19 16:14           ` sspatil
  2020-01-21 18:11           ` Minchan Kim
  2 siblings, 0 replies; 39+ messages in thread
From: SeongJae Park @ 2020-01-18  9:40 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Minchan Kim, Michal Hocko, Andrew Morton, LKML, linux-mm,
	linux-api, oleksandr, Suren Baghdasaryan, Tim Murray,
	Daniel Colascione, Sandeep Patil, Sonny Rao, Brian Geffon,
	Johannes Weiner, Shakeel Butt, John Dias, ktkhai,
	christian.brauner, sjpark

On Sat, 18 Jan 2020 00:26:53 +0300 "Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> On Fri, Jan 17, 2020 at 09:32:39AM -0800, Minchan Kim wrote:
> > On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> > > > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > > > > There is usecase that System Management Software(SMS) want to give
> > > > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > > > > in the case of Android, it is the ActivityManagerService.
> > > > > 
> > > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > > > required to make the reclaim decision is not known to the app. Instead,
> > > > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > > > and that daemon must be able to initiate reclaim on its own without
> > > > > any app involvement.
> > > > > 
> > > > > To solve the issue, this patch introduces new syscall process_madvise(2).
> > > > > It uses pidfd of an external processs to give the hint.
> > > > > 
> > > > >  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > > > 			unsigned long flag);
> > > > > 
> > > > > Since it could affect other process's address range, only privileged
> > > > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > > > gives it the right to ptrace the process could use it successfully.
> > > > > The flag argument is reserved for future use if we need to extend the
> > > > > API.
> > > > > 
> > > > > I think supporting all hints madvise has/will supported/support to
> > > > > process_madvise is rather risky. Because we are not sure all hints make
> > > > > sense from external process and implementation for the hint may rely on
> > > > > the caller being in the current context so it could be error-prone.
> > > > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > > > > 
> > > > > If someone want to add other hints, we could hear hear the usecase and
> > > > > review it for each hint. It's more safe for maintainace rather than
> > > > > introducing a buggy syscall but hard to fix it later.
> > > > 
> > > > I have brought this up when we discussed this in the past but there is
> > > > no reflection on that here so let me bring that up again. 
> > > > 
> > > > I believe that the interface has an inherent problem that it is racy.
> > > > The external entity needs to know the address space layout of the target
> > > > process to do anyhing useful on it. The address space is however under
> > > > the full control of the target process though and the external entity
> > > > has no means to find out that the layout has changed. So
> > > > time-to-check-time-to-act is an inherent problem.
> > > > 
> > > > This is a serious design flaw and it should be explained why it doesn't
> > > > matter or how to use the interface properly to prevent that problem.
> > > 
> > > I agree, it looks flawed.
> > > 
> > > Also I don't see what System Management Software can generically do on
> > > sub-process level. I mean how can it decide which part of address space is
> > > less important than other.
> > > 
> > > I see how a manager can indicate that this process (or a group of
> > > processes) is less important than other, but on per-addres-range basis?
> > 
> > For example, memory ranges shared by several processes or critical for the
> > latency, we could avoid those ranges to be cold/pageout to prevent
> > unncecessary CPU burning/paging.
> 
> Hmm.. I still don't see why any external entity has a better (or any)
> knowledge about the matter. The process has to do this, no?
> 
> > I also think people don't want to give an KSM hint to non-mergeable area.
> 
> And how the manager knows which data is mergable?

Couldn't 'idle_page_tracking' like features could be used by the external
manager processes to know that?


Thanks,
SeongJae Park

> 
> If you are intimate enough with the process' internal state feel free to
> inject syscall into the process with ptrace. Why bother with half-measures?
> 
> -- 
>  Kirill A. Shutemov
> 
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-17 21:26         ` Kirill A. Shutemov
  2020-01-18  9:40           ` SeongJae Park
@ 2020-01-19 16:14           ` sspatil
  2020-01-20  7:58             ` Michal Hocko
  2020-01-21 18:11           ` Minchan Kim
  2 siblings, 1 reply; 39+ messages in thread
From: sspatil @ 2020-01-19 16:14 UTC (permalink / raw)
  To: kirill, minchan, mhocko, akpm, linux-kernel, linux-mm, linux-api,
	oleksandr, surenb, timmurray, dancol, sonnyrao, bgeffon, hannes,
	shakeelb, joaodias, ktkhai, christian.brauner, sjpark
  Cc: Minchan Kim, Michal Hocko, Andrew Morton, LKML, linux-mm,
	linux-api, oleksandr, Suren Baghdasaryan, Tim Murray,
	Daniel Colascione, Sonny Rao, Brian Geffon, Johannes Weiner,
	Shakeel Butt, John Dias, ktkhai, christian.brauner, sjpark

On Sat, Jan 18, 2020 at 12:26:53AM +0300, Kirill A. Shutemov wrote:
> On Fri, Jan 17, 2020 at 09:32:39AM -0800, Minchan Kim wrote:
> > On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> > > > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > > > > There is usecase that System Management Software(SMS) want to give
> > > > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > > > > in the case of Android, it is the ActivityManagerService.
> > > > > 
> > > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > > > required to make the reclaim decision is not known to the app. Instead,
> > > > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > > > and that daemon must be able to initiate reclaim on its own without
> > > > > any app involvement.
> > > > > 
> > > > > To solve the issue, this patch introduces new syscall process_madvise(2).
> > > > > It uses pidfd of an external processs to give the hint.
> > > > > 
> > > > >  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > > > 			unsigned long flag);
> > > > > 
> > > > > Since it could affect other process's address range, only privileged
> > > > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > > > gives it the right to ptrace the process could use it successfully.
> > > > > The flag argument is reserved for future use if we need to extend the
> > > > > API.
> > > > > 
> > > > > I think supporting all hints madvise has/will supported/support to
> > > > > process_madvise is rather risky. Because we are not sure all hints make
> > > > > sense from external process and implementation for the hint may rely on
> > > > > the caller being in the current context so it could be error-prone.
> > > > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > > > > 
> > > > > If someone want to add other hints, we could hear hear the usecase and
> > > > > review it for each hint. It's more safe for maintainace rather than
> > > > > introducing a buggy syscall but hard to fix it later.
> > > > 
> > > > I have brought this up when we discussed this in the past but there is
> > > > no reflection on that here so let me bring that up again. 
> > > > 
> > > > I believe that the interface has an inherent problem that it is racy.
> > > > The external entity needs to know the address space layout of the target
> > > > process to do anyhing useful on it. The address space is however under
> > > > the full control of the target process though and the external entity
> > > > has no means to find out that the layout has changed. So
> > > > time-to-check-time-to-act is an inherent problem.
> > > > 
> > > > This is a serious design flaw and it should be explained why it doesn't
> > > > matter or how to use the interface properly to prevent that problem.
> > > 
> > > I agree, it looks flawed.
> > > 
> > > Also I don't see what System Management Software can generically do on
> > > sub-process level. I mean how can it decide which part of address space is
> > > less important than other.
> > > 
> > > I see how a manager can indicate that this process (or a group of
> > > processes) is less important than other, but on per-addres-range basis?
> > 
> > For example, memory ranges shared by several processes or critical for the
> > latency, we could avoid those ranges to be cold/pageout to prevent
> > unncecessary CPU burning/paging.
> 
> Hmm.. I still don't see why any external entity has a better (or any)
> knowledge about the matter. The process has to do this, no?

FWIW, I totally agree with the time-to-check-time-to-react problem. However,
I'd like to clarify the ActivityManager/SystemServer case (I'll call it
SystemServer from now on)

For Android, every application (including the special SystemServer) are forked
from Zygote. The reason ofcourse is to share as many libraries and classes between
the two as possible to benefit from the preloading during boot.

After applications start, (almost) all of the APIs  end up calling into this
SystemServer process over IPC (binder) and back to the application.

In a fully running system, the SystemServer monitors every single process
periodically to calculate their PSS / RSS and also decides which process is
"important" to the user for interactivity.

So, because of how these processes start _and_ the fact that the SystemServer
is looping to monitor each process, it does tend to *know* which address
range of the application is not used / useful.

Besides, we can never rely on applications to clean things up themselves.
We've had the "hey app1, the system is low on memory, please trim your
memory usage down" notifications for a long time[1]. They rely on
applications honoring the broadcasts and very few do.

So, if we want to avoid the inevitable killing of the application and
restarting it, some way to be able to tell the OS about unimportant memory in
these applications will be useful.


- ssp

1. https://developer.android.com/topic/performance/memory


> 
> > I also think people don't want to give an KSM hint to non-mergeable area.
> 
> And how the manager knows which data is mergable?
> 
> If you are intimate enough with the process' internal state feel free to
> inject syscall into the process with ptrace. Why bother with half-measures?
> 
> -- 
>  Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-19 16:14           ` sspatil
@ 2020-01-20  7:58             ` Michal Hocko
  2020-01-20 10:39               ` Kirill Tkhai
  2020-01-21 18:32               ` Minchan Kim
  0 siblings, 2 replies; 39+ messages in thread
From: Michal Hocko @ 2020-01-20  7:58 UTC (permalink / raw)
  To: sspatil
  Cc: kirill, minchan, akpm, linux-kernel, linux-mm, linux-api,
	oleksandr, surenb, timmurray, dancol, sonnyrao, bgeffon, hannes,
	shakeelb, joaodias, ktkhai, christian.brauner, sjpark

On Sun 19-01-20 08:14:31, sspatil@google.com wrote:
> On Sat, Jan 18, 2020 at 12:26:53AM +0300, Kirill A. Shutemov wrote:
> > On Fri, Jan 17, 2020 at 09:32:39AM -0800, Minchan Kim wrote:
> > > On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote:
> > > > On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> > > > > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > > > > > There is usecase that System Management Software(SMS) want to give
> > > > > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > > > > > in the case of Android, it is the ActivityManagerService.
> > > > > > 
> > > > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > > > > required to make the reclaim decision is not known to the app. Instead,
> > > > > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > > > > and that daemon must be able to initiate reclaim on its own without
> > > > > > any app involvement.
> > > > > > 
> > > > > > To solve the issue, this patch introduces new syscall process_madvise(2).
> > > > > > It uses pidfd of an external processs to give the hint.
> > > > > > 
> > > > > >  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > > > > 			unsigned long flag);
> > > > > > 
> > > > > > Since it could affect other process's address range, only privileged
> > > > > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > > > > gives it the right to ptrace the process could use it successfully.
> > > > > > The flag argument is reserved for future use if we need to extend the
> > > > > > API.
> > > > > > 
> > > > > > I think supporting all hints madvise has/will supported/support to
> > > > > > process_madvise is rather risky. Because we are not sure all hints make
> > > > > > sense from external process and implementation for the hint may rely on
> > > > > > the caller being in the current context so it could be error-prone.
> > > > > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > > > > > 
> > > > > > If someone want to add other hints, we could hear hear the usecase and
> > > > > > review it for each hint. It's more safe for maintainace rather than
> > > > > > introducing a buggy syscall but hard to fix it later.
> > > > > 
> > > > > I have brought this up when we discussed this in the past but there is
> > > > > no reflection on that here so let me bring that up again. 
> > > > > 
> > > > > I believe that the interface has an inherent problem that it is racy.
> > > > > The external entity needs to know the address space layout of the target
> > > > > process to do anyhing useful on it. The address space is however under
> > > > > the full control of the target process though and the external entity
> > > > > has no means to find out that the layout has changed. So
> > > > > time-to-check-time-to-act is an inherent problem.
> > > > > 
> > > > > This is a serious design flaw and it should be explained why it doesn't
> > > > > matter or how to use the interface properly to prevent that problem.
> > > > 
> > > > I agree, it looks flawed.
> > > > 
> > > > Also I don't see what System Management Software can generically do on
> > > > sub-process level. I mean how can it decide which part of address space is
> > > > less important than other.
> > > > 
> > > > I see how a manager can indicate that this process (or a group of
> > > > processes) is less important than other, but on per-addres-range basis?
> > > 
> > > For example, memory ranges shared by several processes or critical for the
> > > latency, we could avoid those ranges to be cold/pageout to prevent
> > > unncecessary CPU burning/paging.
> > 
> > Hmm.. I still don't see why any external entity has a better (or any)
> > knowledge about the matter. The process has to do this, no?
> 
> FWIW, I totally agree with the time-to-check-time-to-react problem. However,
> I'd like to clarify the ActivityManager/SystemServer case (I'll call it
> SystemServer from now on)
> 
> For Android, every application (including the special SystemServer) are forked
> from Zygote. The reason ofcourse is to share as many libraries and classes between
> the two as possible to benefit from the preloading during boot.
> 
> After applications start, (almost) all of the APIs  end up calling into this
> SystemServer process over IPC (binder) and back to the application.
> 
> In a fully running system, the SystemServer monitors every single process
> periodically to calculate their PSS / RSS and also decides which process is
> "important" to the user for interactivity.
> 
> So, because of how these processes start _and_ the fact that the SystemServer
> is looping to monitor each process, it does tend to *know* which address
> range of the application is not used / useful.
> 
> Besides, we can never rely on applications to clean things up themselves.
> We've had the "hey app1, the system is low on memory, please trim your
> memory usage down" notifications for a long time[1]. They rely on
> applications honoring the broadcasts and very few do.
> 
> So, if we want to avoid the inevitable killing of the application and
> restarting it, some way to be able to tell the OS about unimportant memory in
> these applications will be useful.

This is a useful information that should be a part of the changelog. I
do see how the current form of the API might fit into Android model
without many problems. But we are not designing an API for a single
usecase, right? In a highly cooperative environments you can use ptrace
code injection as mentioned by Kirill. Or is there any fundamental
problem about that?

The interface really has to be robust to future potential usecases.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-17 17:25     ` Minchan Kim
@ 2020-01-20  8:03       ` Michal Hocko
  0 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2020-01-20  8:03 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, linux-api, oleksandr,
	Suren Baghdasaryan, Tim Murray, Daniel Colascione, Sandeep Patil,
	Sonny Rao, Brian Geffon, Johannes Weiner, Shakeel Butt,
	John Dias, ktkhai, christian.brauner, sjpark

On Fri 17-01-20 09:25:42, Minchan Kim wrote:
> On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > > There is usecase that System Management Software(SMS) want to give
> > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > > in the case of Android, it is the ActivityManagerService.
> > > 
> > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > required to make the reclaim decision is not known to the app. Instead,
> > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > and that daemon must be able to initiate reclaim on its own without
> > > any app involvement.
> > > 
> > > To solve the issue, this patch introduces new syscall process_madvise(2).
> > > It uses pidfd of an external processs to give the hint.
> > > 
> > >  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > 			unsigned long flag);
> > > 
> > > Since it could affect other process's address range, only privileged
> > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > gives it the right to ptrace the process could use it successfully.
> > > The flag argument is reserved for future use if we need to extend the
> > > API.
> > > 
> > > I think supporting all hints madvise has/will supported/support to
> > > process_madvise is rather risky. Because we are not sure all hints make
> > > sense from external process and implementation for the hint may rely on
> > > the caller being in the current context so it could be error-prone.
> > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > > 
> > > If someone want to add other hints, we could hear hear the usecase and
> > > review it for each hint. It's more safe for maintainace rather than
> > > introducing a buggy syscall but hard to fix it later.
> > 
> > I have brought this up when we discussed this in the past but there is
> > no reflection on that here so let me bring that up again. 
> > 
> > I believe that the interface has an inherent problem that it is racy.
> > The external entity needs to know the address space layout of the target
> > process to do anyhing useful on it. The address space is however under
> > the full control of the target process though and the external entity
> > has no means to find out that the layout has changed. So
> > time-to-check-time-to-act is an inherent problem.
> > 
> > This is a serious design flaw and it should be explained why it doesn't
> > matter or how to use the interface properly to prevent that problem.
> 
> Sorry for the missing that part.
> 
> It's not a particular problem of this API because other APIs already have
> done with that(e.g., move_pages, process_vm_writev).

I am sorry but this is not really an argument.

> Point is userspace has several ways for the control of target process
> like SIGSTOP, cgroup freezer or even no need to control since platform
> is already aware of that the process will never run until he grant it
> or it's resilient even though the race happens.

If you have that level of control then you can simply inject the code
via ptrace and you do not need a new syscall in the first place.

> In future, if we want to support more fine-grained consistency model
> like memory layout, we could provide some API to get cookie(e.g.,
> seq count which is updated whenever vma of the process changes).  And then
> we could feed the cookie to process_madvise's last argument so that
> it can fail if founds it's not matched.
> For that API, Daniel already posted RFC - process_getinfo[1].
> https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224

So why do not we start with a clean API since the beginning? I do agree
that a remote madvise is an interesting feature and it opens gates to
all sorts of userspace memory management which is not possible this
days. But the syscall has to have a reasonable semantic to allow that.
We cannot simply start with a half proken symantic first based on an
Android usecase and then hit the wall as soon as others with a different
user space model want to use it as well.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-17 11:52   ` Michal Hocko
  2020-01-17 15:58     ` Kirill A. Shutemov
  2020-01-17 17:25     ` Minchan Kim
@ 2020-01-20 10:24     ` Kirill Tkhai
  2020-01-20 11:27       ` Michal Hocko
  2 siblings, 1 reply; 39+ messages in thread
From: Kirill Tkhai @ 2020-01-20 10:24 UTC (permalink / raw)
  To: Michal Hocko, Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, linux-api, oleksandr,
	Suren Baghdasaryan, Tim Murray, Daniel Colascione, Sandeep Patil,
	Sonny Rao, Brian Geffon, Johannes Weiner, Shakeel Butt,
	John Dias, christian.brauner, sjpark

On 17.01.2020 14:52, Michal Hocko wrote:
> On Thu 16-01-20 15:59:50, Minchan Kim wrote:
>> There is usecase that System Management Software(SMS) want to give
>> a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
>> in the case of Android, it is the ActivityManagerService.
>>
>> It's similar in spirit to madvise(MADV_WONTNEED), but the information
>> required to make the reclaim decision is not known to the app. Instead,
>> it is known to the centralized userspace daemon(ActivityManagerService),
>> and that daemon must be able to initiate reclaim on its own without
>> any app involvement.
>>
>> To solve the issue, this patch introduces new syscall process_madvise(2).
>> It uses pidfd of an external processs to give the hint.
>>
>>  int process_madvise(int pidfd, void *addr, size_t length, int advise,
>> 			unsigned long flag);
>>
>> Since it could affect other process's address range, only privileged
>> process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
>> gives it the right to ptrace the process could use it successfully.
>> The flag argument is reserved for future use if we need to extend the
>> API.
>>
>> I think supporting all hints madvise has/will supported/support to
>> process_madvise is rather risky. Because we are not sure all hints make
>> sense from external process and implementation for the hint may rely on
>> the caller being in the current context so it could be error-prone.
>> Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
>>
>> If someone want to add other hints, we could hear hear the usecase and
>> review it for each hint. It's more safe for maintainace rather than
>> introducing a buggy syscall but hard to fix it later.
> 
> I have brought this up when we discussed this in the past but there is
> no reflection on that here so let me bring that up again. 
> 
> I believe that the interface has an inherent problem that it is racy.
> The external entity needs to know the address space layout of the target
> process to do anyhing useful on it. The address space is however under
> the full control of the target process though and the external entity
> has no means to find out that the layout has changed. So
> time-to-check-time-to-act is an inherent problem.
> 
> This is a serious design flaw and it should be explained why it doesn't
> matter or how to use the interface properly to prevent that problem.

Really, any address space manipulation, where more than one process is
involved, is racy. Even two threads on common memory need a synchronization
to manage mappings in a sane way. Managing memory from two processes
is the same in principle, and the only difference is that another level
of synchronization is required.

I'm not fan and user for this feature (at least for now, nobody knows
what will be in the future), but design flaw argument does not look
fully valid.

Regards,
Kirill

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-20  7:58             ` Michal Hocko
@ 2020-01-20 10:39               ` Kirill Tkhai
  2020-01-21 18:32               ` Minchan Kim
  1 sibling, 0 replies; 39+ messages in thread
From: Kirill Tkhai @ 2020-01-20 10:39 UTC (permalink / raw)
  To: Michal Hocko, sspatil
  Cc: kirill, minchan, akpm, linux-kernel, linux-mm, linux-api,
	oleksandr, surenb, timmurray, dancol, sonnyrao, bgeffon, hannes,
	shakeelb, joaodias, christian.brauner, sjpark

On 20.01.2020 10:58, Michal Hocko wrote:
> On Sun 19-01-20 08:14:31, sspatil@google.com wrote:
>> On Sat, Jan 18, 2020 at 12:26:53AM +0300, Kirill A. Shutemov wrote:
>>> On Fri, Jan 17, 2020 at 09:32:39AM -0800, Minchan Kim wrote:
>>>> On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote:
>>>>> On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
>>>>>> On Thu 16-01-20 15:59:50, Minchan Kim wrote:
>>>>>>> There is usecase that System Management Software(SMS) want to give
>>>>>>> a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
>>>>>>> in the case of Android, it is the ActivityManagerService.
>>>>>>>
>>>>>>> It's similar in spirit to madvise(MADV_WONTNEED), but the information
>>>>>>> required to make the reclaim decision is not known to the app. Instead,
>>>>>>> it is known to the centralized userspace daemon(ActivityManagerService),
>>>>>>> and that daemon must be able to initiate reclaim on its own without
>>>>>>> any app involvement.
>>>>>>>
>>>>>>> To solve the issue, this patch introduces new syscall process_madvise(2).
>>>>>>> It uses pidfd of an external processs to give the hint.
>>>>>>>
>>>>>>>  int process_madvise(int pidfd, void *addr, size_t length, int advise,
>>>>>>> 			unsigned long flag);
>>>>>>>
>>>>>>> Since it could affect other process's address range, only privileged
>>>>>>> process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
>>>>>>> gives it the right to ptrace the process could use it successfully.
>>>>>>> The flag argument is reserved for future use if we need to extend the
>>>>>>> API.
>>>>>>>
>>>>>>> I think supporting all hints madvise has/will supported/support to
>>>>>>> process_madvise is rather risky. Because we are not sure all hints make
>>>>>>> sense from external process and implementation for the hint may rely on
>>>>>>> the caller being in the current context so it could be error-prone.
>>>>>>> Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
>>>>>>>
>>>>>>> If someone want to add other hints, we could hear hear the usecase and
>>>>>>> review it for each hint. It's more safe for maintainace rather than
>>>>>>> introducing a buggy syscall but hard to fix it later.
>>>>>>
>>>>>> I have brought this up when we discussed this in the past but there is
>>>>>> no reflection on that here so let me bring that up again. 
>>>>>>
>>>>>> I believe that the interface has an inherent problem that it is racy.
>>>>>> The external entity needs to know the address space layout of the target
>>>>>> process to do anyhing useful on it. The address space is however under
>>>>>> the full control of the target process though and the external entity
>>>>>> has no means to find out that the layout has changed. So
>>>>>> time-to-check-time-to-act is an inherent problem.
>>>>>>
>>>>>> This is a serious design flaw and it should be explained why it doesn't
>>>>>> matter or how to use the interface properly to prevent that problem.
>>>>>
>>>>> I agree, it looks flawed.
>>>>>
>>>>> Also I don't see what System Management Software can generically do on
>>>>> sub-process level. I mean how can it decide which part of address space is
>>>>> less important than other.
>>>>>
>>>>> I see how a manager can indicate that this process (or a group of
>>>>> processes) is less important than other, but on per-addres-range basis?
>>>>
>>>> For example, memory ranges shared by several processes or critical for the
>>>> latency, we could avoid those ranges to be cold/pageout to prevent
>>>> unncecessary CPU burning/paging.
>>>
>>> Hmm.. I still don't see why any external entity has a better (or any)
>>> knowledge about the matter. The process has to do this, no?
>>
>> FWIW, I totally agree with the time-to-check-time-to-react problem. However,
>> I'd like to clarify the ActivityManager/SystemServer case (I'll call it
>> SystemServer from now on)
>>
>> For Android, every application (including the special SystemServer) are forked
>> from Zygote. The reason ofcourse is to share as many libraries and classes between
>> the two as possible to benefit from the preloading during boot.
>>
>> After applications start, (almost) all of the APIs  end up calling into this
>> SystemServer process over IPC (binder) and back to the application.
>>
>> In a fully running system, the SystemServer monitors every single process
>> periodically to calculate their PSS / RSS and also decides which process is
>> "important" to the user for interactivity.
>>
>> So, because of how these processes start _and_ the fact that the SystemServer
>> is looping to monitor each process, it does tend to *know* which address
>> range of the application is not used / useful.
>>
>> Besides, we can never rely on applications to clean things up themselves.
>> We've had the "hey app1, the system is low on memory, please trim your
>> memory usage down" notifications for a long time[1]. They rely on
>> applications honoring the broadcasts and very few do.
>>
>> So, if we want to avoid the inevitable killing of the application and
>> restarting it, some way to be able to tell the OS about unimportant memory in
>> these applications will be useful.
> 
> This is a useful information that should be a part of the changelog. I
> do see how the current form of the API might fit into Android model
> without many problems. But we are not designing an API for a single
> usecase, right? In a highly cooperative environments you can use ptrace
> code injection as mentioned by Kirill. Or is there any fundamental
> problem about that?

There could be only problems with multi-threads applications, which
poll the state of another threads (and exit with error, when they
found that someone is traced). But this may be workarounded by freezer
cgroup making all the thread group is frozen. It should guarantee
consistent state of the whole process after attaching.

Kirill


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-20 10:24     ` Kirill Tkhai
@ 2020-01-20 11:27       ` Michal Hocko
  2020-01-20 12:39         ` Kirill A. Shutemov
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2020-01-20 11:27 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, linux-api, oleksandr,
	Suren Baghdasaryan, Tim Murray, Daniel Colascione, Sandeep Patil,
	Sonny Rao, Brian Geffon, Johannes Weiner, Shakeel Butt,
	John Dias, christian.brauner, sjpark

On Mon 20-01-20 13:24:35, Kirill Tkhai wrote:
> On 17.01.2020 14:52, Michal Hocko wrote:
> > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> >> There is usecase that System Management Software(SMS) want to give
> >> a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> >> in the case of Android, it is the ActivityManagerService.
> >>
> >> It's similar in spirit to madvise(MADV_WONTNEED), but the information
> >> required to make the reclaim decision is not known to the app. Instead,
> >> it is known to the centralized userspace daemon(ActivityManagerService),
> >> and that daemon must be able to initiate reclaim on its own without
> >> any app involvement.
> >>
> >> To solve the issue, this patch introduces new syscall process_madvise(2).
> >> It uses pidfd of an external processs to give the hint.
> >>
> >>  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> >> 			unsigned long flag);
> >>
> >> Since it could affect other process's address range, only privileged
> >> process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> >> gives it the right to ptrace the process could use it successfully.
> >> The flag argument is reserved for future use if we need to extend the
> >> API.
> >>
> >> I think supporting all hints madvise has/will supported/support to
> >> process_madvise is rather risky. Because we are not sure all hints make
> >> sense from external process and implementation for the hint may rely on
> >> the caller being in the current context so it could be error-prone.
> >> Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> >>
> >> If someone want to add other hints, we could hear hear the usecase and
> >> review it for each hint. It's more safe for maintainace rather than
> >> introducing a buggy syscall but hard to fix it later.
> > 
> > I have brought this up when we discussed this in the past but there is
> > no reflection on that here so let me bring that up again. 
> > 
> > I believe that the interface has an inherent problem that it is racy.
> > The external entity needs to know the address space layout of the target
> > process to do anyhing useful on it. The address space is however under
> > the full control of the target process though and the external entity
> > has no means to find out that the layout has changed. So
> > time-to-check-time-to-act is an inherent problem.
> > 
> > This is a serious design flaw and it should be explained why it doesn't
> > matter or how to use the interface properly to prevent that problem.
> 
> Really, any address space manipulation, where more than one process is
> involved, is racy.

They are, indeed. But that is not the point I wanted to make.

> Even two threads on common memory need a synchronization
> to manage mappings in a sane way. Managing memory from two processes
> is the same in principle, and the only difference is that another level
> of synchronization is required.

Well, not really. The operation might simply attempt to perform an
operation on a specific memory area and get a failure if it doesn't
reference the same object anymore. What I think we need is some form of
a handle to operate on. In the past we have discussed several
directions. I was proposing /proc/self/map_anon/ (analogous to
map_files) where you could inspect anonymous memory and get a file
handle for it. madvise would then operate on the fd and then there
shouldn't be a real problem to revalidate that the object is still
valid. But there was no general enthusiasm about that approach. There
are likely some land mines on the way.

There was another approach mentioned by Minchan in other email in this
thread.

What I want to say is that I believe a remove madvise can have a
sensible semantic even without a strong synchronization between the
monitor and the target task. We just have to make sure that the monitor
never operates on a different object then it believes it acts on.

On the other hand, there will never be any way to make this interface
reasonable for destructive operations though because the content of the
memory needs a strong synchronization IMHO.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-20 11:27       ` Michal Hocko
@ 2020-01-20 12:39         ` Kirill A. Shutemov
  2020-01-20 13:24           ` Michal Hocko
  0 siblings, 1 reply; 39+ messages in thread
From: Kirill A. Shutemov @ 2020-01-20 12:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kirill Tkhai, Minchan Kim, Andrew Morton, LKML, linux-mm,
	linux-api, oleksandr, Suren Baghdasaryan, Tim Murray,
	Daniel Colascione, Sandeep Patil, Sonny Rao, Brian Geffon,
	Johannes Weiner, Shakeel Butt, John Dias, christian.brauner,
	sjpark

On Mon, Jan 20, 2020 at 12:27:22PM +0100, Michal Hocko wrote:
> On Mon 20-01-20 13:24:35, Kirill Tkhai wrote:
> > On 17.01.2020 14:52, Michal Hocko wrote:
> > > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > >> There is usecase that System Management Software(SMS) want to give
> > >> a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > >> in the case of Android, it is the ActivityManagerService.
> > >>
> > >> It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > >> required to make the reclaim decision is not known to the app. Instead,
> > >> it is known to the centralized userspace daemon(ActivityManagerService),
> > >> and that daemon must be able to initiate reclaim on its own without
> > >> any app involvement.
> > >>
> > >> To solve the issue, this patch introduces new syscall process_madvise(2).
> > >> It uses pidfd of an external processs to give the hint.
> > >>
> > >>  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > >> 			unsigned long flag);
> > >>
> > >> Since it could affect other process's address range, only privileged
> > >> process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > >> gives it the right to ptrace the process could use it successfully.
> > >> The flag argument is reserved for future use if we need to extend the
> > >> API.
> > >>
> > >> I think supporting all hints madvise has/will supported/support to
> > >> process_madvise is rather risky. Because we are not sure all hints make
> > >> sense from external process and implementation for the hint may rely on
> > >> the caller being in the current context so it could be error-prone.
> > >> Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > >>
> > >> If someone want to add other hints, we could hear hear the usecase and
> > >> review it for each hint. It's more safe for maintainace rather than
> > >> introducing a buggy syscall but hard to fix it later.
> > > 
> > > I have brought this up when we discussed this in the past but there is
> > > no reflection on that here so let me bring that up again. 
> > > 
> > > I believe that the interface has an inherent problem that it is racy.
> > > The external entity needs to know the address space layout of the target
> > > process to do anyhing useful on it. The address space is however under
> > > the full control of the target process though and the external entity
> > > has no means to find out that the layout has changed. So
> > > time-to-check-time-to-act is an inherent problem.
> > > 
> > > This is a serious design flaw and it should be explained why it doesn't
> > > matter or how to use the interface properly to prevent that problem.
> > 
> > Really, any address space manipulation, where more than one process is
> > involved, is racy.
> 
> They are, indeed. But that is not the point I wanted to make.
> 
> > Even two threads on common memory need a synchronization
> > to manage mappings in a sane way. Managing memory from two processes
> > is the same in principle, and the only difference is that another level
> > of synchronization is required.
> 
> Well, not really. The operation might simply attempt to perform an
> operation on a specific memory area and get a failure if it doesn't
> reference the same object anymore. What I think we need is some form of
> a handle to operate on. In the past we have discussed several
> directions. I was proposing /proc/self/map_anon/ (analogous to
> map_files) where you could inspect anonymous memory and get a file
> handle for it. madvise would then operate on the fd and then there
> shouldn't be a real problem to revalidate that the object is still
> valid. But there was no general enthusiasm about that approach. There
> are likely some land mines on the way.

Converting anon memory to file-backed is bad idea and going to backfire.
It will break many things around rmap (for THP in particular). We have
assumption that an anon page cannot be mapped twice in the same process.
Breaking rules around memory inheritance is not helpful too.

It is a major re-design of the subsystem and I don't see any real need for
this.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-20 12:39         ` Kirill A. Shutemov
@ 2020-01-20 13:24           ` Michal Hocko
  2020-01-20 14:21             ` Kirill A. Shutemov
  2020-01-21 18:43             ` Minchan Kim
  0 siblings, 2 replies; 39+ messages in thread
From: Michal Hocko @ 2020-01-20 13:24 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill Tkhai, Minchan Kim, Andrew Morton, LKML, linux-mm,
	linux-api, oleksandr, Suren Baghdasaryan, Tim Murray,
	Daniel Colascione, Sandeep Patil, Sonny Rao, Brian Geffon,
	Johannes Weiner, Shakeel Butt, John Dias, christian.brauner,
	sjpark

On Mon 20-01-20 15:39:35, Kirill A. Shutemov wrote:
> On Mon, Jan 20, 2020 at 12:27:22PM +0100, Michal Hocko wrote:
> > On Mon 20-01-20 13:24:35, Kirill Tkhai wrote:
[...]
> > > Even two threads on common memory need a synchronization
> > > to manage mappings in a sane way. Managing memory from two processes
> > > is the same in principle, and the only difference is that another level
> > > of synchronization is required.
> > 
> > Well, not really. The operation might simply attempt to perform an
> > operation on a specific memory area and get a failure if it doesn't
> > reference the same object anymore. What I think we need is some form of
> > a handle to operate on. In the past we have discussed several
> > directions. I was proposing /proc/self/map_anon/ (analogous to
> > map_files) where you could inspect anonymous memory and get a file
> > handle for it. madvise would then operate on the fd and then there
> > shouldn't be a real problem to revalidate that the object is still
> > valid. But there was no general enthusiasm about that approach. There
> > are likely some land mines on the way.
> 
> Converting anon memory to file-backed is bad idea and going to backfire.

I didn't mean to convert. I meant to expose that information via proc
the same way we do for file backed mappings. That shouldn't really
require to re-design the way how anonymous vma work IMO. But I haven't
tried that so there might be many gotchas there.

There are obvious things to think about though. Such fd cannot be sent
to other processes (SCM stuff), mmap of the file would have to be
disallowed and many others I am not aware of. I am not even pushing this
direction because I am not convinced about how viable it is myself. But
it would sound like a nice extension of the existing mechanism we have
and a file based madvise sounds attractive to me as well because we
already have that.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-20 13:24           ` Michal Hocko
@ 2020-01-20 14:21             ` Kirill A. Shutemov
  2020-01-20 15:44               ` Michal Hocko
  2020-01-21 18:43             ` Minchan Kim
  1 sibling, 1 reply; 39+ messages in thread
From: Kirill A. Shutemov @ 2020-01-20 14:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kirill Tkhai, Minchan Kim, Andrew Morton, LKML, linux-mm,
	linux-api, oleksandr, Suren Baghdasaryan, Tim Murray,
	Daniel Colascione, Sandeep Patil, Sonny Rao, Brian Geffon,
	Johannes Weiner, Shakeel Butt, John Dias, christian.brauner,
	sjpark

On Mon, Jan 20, 2020 at 02:24:05PM +0100, Michal Hocko wrote:
> On Mon 20-01-20 15:39:35, Kirill A. Shutemov wrote:
> > On Mon, Jan 20, 2020 at 12:27:22PM +0100, Michal Hocko wrote:
> > > On Mon 20-01-20 13:24:35, Kirill Tkhai wrote:
> [...]
> > > > Even two threads on common memory need a synchronization
> > > > to manage mappings in a sane way. Managing memory from two processes
> > > > is the same in principle, and the only difference is that another level
> > > > of synchronization is required.
> > > 
> > > Well, not really. The operation might simply attempt to perform an
> > > operation on a specific memory area and get a failure if it doesn't
> > > reference the same object anymore. What I think we need is some form of
> > > a handle to operate on. In the past we have discussed several
> > > directions. I was proposing /proc/self/map_anon/ (analogous to
> > > map_files) where you could inspect anonymous memory and get a file
> > > handle for it. madvise would then operate on the fd and then there
> > > shouldn't be a real problem to revalidate that the object is still
> > > valid. But there was no general enthusiasm about that approach. There
> > > are likely some land mines on the way.
> > 
> > Converting anon memory to file-backed is bad idea and going to backfire.
> 
> I didn't mean to convert. I meant to expose that information via proc
> the same way we do for file backed mappings. That shouldn't really
> require to re-design the way how anonymous vma work IMO. But I haven't
> tried that so there might be many gotchas there.
> 
> There are obvious things to think about though. Such fd cannot be sent
> to other processes (SCM stuff), mmap of the file would have to be
> disallowed and many others I am not aware of. I am not even pushing this
> direction because I am not convinced about how viable it is myself. But
> it would sound like a nice extension of the existing mechanism we have
> and a file based madvise sounds attractive to me as well because we
> already have that.

If the fd cannot be passed around or mmaped what does it represent?
And how is it different from plain address?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-20 14:21             ` Kirill A. Shutemov
@ 2020-01-20 15:44               ` Michal Hocko
  0 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2020-01-20 15:44 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill Tkhai, Minchan Kim, Andrew Morton, LKML, linux-mm,
	linux-api, oleksandr, Suren Baghdasaryan, Tim Murray,
	Daniel Colascione, Sandeep Patil, Sonny Rao, Brian Geffon,
	Johannes Weiner, Shakeel Butt, John Dias, christian.brauner,
	sjpark

On Mon 20-01-20 17:21:32, Kirill A. Shutemov wrote:
> On Mon, Jan 20, 2020 at 02:24:05PM +0100, Michal Hocko wrote:
> > On Mon 20-01-20 15:39:35, Kirill A. Shutemov wrote:
> > > On Mon, Jan 20, 2020 at 12:27:22PM +0100, Michal Hocko wrote:
> > > > On Mon 20-01-20 13:24:35, Kirill Tkhai wrote:
> > [...]
> > > > > Even two threads on common memory need a synchronization
> > > > > to manage mappings in a sane way. Managing memory from two processes
> > > > > is the same in principle, and the only difference is that another level
> > > > > of synchronization is required.
> > > > 
> > > > Well, not really. The operation might simply attempt to perform an
> > > > operation on a specific memory area and get a failure if it doesn't
> > > > reference the same object anymore. What I think we need is some form of
> > > > a handle to operate on. In the past we have discussed several
> > > > directions. I was proposing /proc/self/map_anon/ (analogous to
> > > > map_files) where you could inspect anonymous memory and get a file
> > > > handle for it. madvise would then operate on the fd and then there
> > > > shouldn't be a real problem to revalidate that the object is still
> > > > valid. But there was no general enthusiasm about that approach. There
> > > > are likely some land mines on the way.
> > > 
> > > Converting anon memory to file-backed is bad idea and going to backfire.
> > 
> > I didn't mean to convert. I meant to expose that information via proc
> > the same way we do for file backed mappings. That shouldn't really
> > require to re-design the way how anonymous vma work IMO. But I haven't
> > tried that so there might be many gotchas there.
> > 
> > There are obvious things to think about though. Such fd cannot be sent
> > to other processes (SCM stuff), mmap of the file would have to be
> > disallowed and many others I am not aware of. I am not even pushing this
> > direction because I am not convinced about how viable it is myself. But
> > it would sound like a nice extension of the existing mechanism we have
> > and a file based madvise sounds attractive to me as well because we
> > already have that.
> 
> If the fd cannot be passed around or mmaped what does it represent?

Because it is a cookie maintained by the kernel.

> And how is it different from plain address?

Kernel can revalidate that the given fd is denoting the vma it was
created for and simply fail with ENOENT or whatever suits if somebody
did unmap and mmap to the same address.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/5] mm/madvise: allow KSM hints for remote API
  2020-01-17 12:34     ` Oleksandr Natalenko
@ 2020-01-21 17:45       ` Minchan Kim
  0 siblings, 0 replies; 39+ messages in thread
From: Minchan Kim @ 2020-01-21 17:45 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: Kirill Tkhai, Andrew Morton, LKML, linux-mm, linux-api,
	Suren Baghdasaryan, Tim Murray, Daniel Colascione, Sandeep Patil,
	Sonny Rao, Brian Geffon, Michal Hocko, Johannes Weiner,
	Shakeel Butt, John Dias, christian.brauner, sjpark

On Fri, Jan 17, 2020 at 01:34:00PM +0100, Oleksandr Natalenko wrote:
> Hi.
> 
> On Fri, Jan 17, 2020 at 01:13:14PM +0300, Kirill Tkhai wrote:
> > On 17.01.2020 02:59, Minchan Kim wrote:
> > > From: Oleksandr Natalenko <oleksandr@redhat.com>
> > > 
> > > It all began with the fact that KSM works only on memory that is marked
> > > by madvise(). And the only way to get around that is to either:
> > > 
> > >   * use LD_PRELOAD; or
> > >   * patch the kernel with something like UKSM or PKSM.
> > > 
> > > (i skip ptrace can of worms here intentionally)
> > > 
> > > To overcome this restriction, lets employ a new remote madvise API. This
> > > can be used by some small userspace helper daemon that will do auto-KSM
> > > job for us.
> > > 
> > > I think of two major consumers of remote KSM hints:
> > > 
> > >   * hosts, that run containers, especially similar ones and especially in
> > >     a trusted environment, sharing the same runtime like Node.js;
> > > 
> > >   * heavy applications, that can be run in multiple instances, not
> > >     limited to opensource ones like Firefox, but also those that cannot be
> > >     modified since they are binary-only and, maybe, statically linked.
> > > 
> > > Speaking of statistics, more numbers can be found in the very first
> > > submission, that is related to this one [1]. For my current setup with
> > > two Firefox instances I get 100 to 200 MiB saved for the second instance
> > > depending on the amount of tabs.
> > > 
> > > 1 FF instance with 15 tabs:
> > > 
> > >    $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
> > >    410
> > > 
> > > 2 FF instances, second one has 12 tabs (all the tabs are different):
> > > 
> > >    $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
> > >    592
> > > 
> > > At the very moment I do not have specific numbers for containerised
> > > workload, but those should be comparable in case the containers share
> > > similar/same runtime.
> > > 
> > > [1] https://lore.kernel.org/patchwork/patch/1012142/
> > > 
> > > Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
> > > Signed-off-by: Minchan Kim <minchan@google.com>
> > > ---
> > >  mm/madvise.c | 2 ++
> > >  1 file changed, 2 insertions(+)
> > > 
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 84cffd0900f1..89557998d287 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -1000,6 +1000,8 @@ process_madvise_behavior_valid(int behavior)
> > >  	switch (behavior) {
> > >  	case MADV_COLD:
> > >  	case MADV_PAGEOUT:
> > > +	case MADV_MERGEABLE:
> > > +	case MADV_UNMERGEABLE:
> > >  		return true;
> > >  	default:
> > >  		return false;
> > 
> > Remote madvise on KSM parameters should be OK.
> > 
> > One thing is madvise_behavior_valid() places MADV_MERGEABLE/UNMERGEABLE
> > in #ifdef brackes, so -EINVAL is returned by madvise() syscall if KSM
> > is not enabled. Here we should follow the same way for symmetry.
> > 
> 
> Thanks for the suggestion.
> 
> Minchan, shall you adopt it directly, or I should send a separate patch?

I will handle it in next spin.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-17 21:26         ` Kirill A. Shutemov
  2020-01-18  9:40           ` SeongJae Park
  2020-01-19 16:14           ` sspatil
@ 2020-01-21 18:11           ` Minchan Kim
  2020-01-22 10:44             ` Oleksandr Natalenko
  2 siblings, 1 reply; 39+ messages in thread
From: Minchan Kim @ 2020-01-21 18:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, linux-api,
	oleksandr, Suren Baghdasaryan, Tim Murray, Daniel Colascione,
	Sandeep Patil, Sonny Rao, Brian Geffon, Johannes Weiner,
	Shakeel Butt, John Dias, ktkhai, christian.brauner, sjpark

On Sat, Jan 18, 2020 at 12:26:53AM +0300, Kirill A. Shutemov wrote:
> On Fri, Jan 17, 2020 at 09:32:39AM -0800, Minchan Kim wrote:
> > On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> > > > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > > > > There is usecase that System Management Software(SMS) want to give
> > > > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > > > > in the case of Android, it is the ActivityManagerService.
> > > > > 
> > > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > > > required to make the reclaim decision is not known to the app. Instead,
> > > > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > > > and that daemon must be able to initiate reclaim on its own without
> > > > > any app involvement.
> > > > > 
> > > > > To solve the issue, this patch introduces new syscall process_madvise(2).
> > > > > It uses pidfd of an external processs to give the hint.
> > > > > 
> > > > >  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > > > 			unsigned long flag);
> > > > > 
> > > > > Since it could affect other process's address range, only privileged
> > > > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > > > gives it the right to ptrace the process could use it successfully.
> > > > > The flag argument is reserved for future use if we need to extend the
> > > > > API.
> > > > > 
> > > > > I think supporting all hints madvise has/will supported/support to
> > > > > process_madvise is rather risky. Because we are not sure all hints make
> > > > > sense from external process and implementation for the hint may rely on
> > > > > the caller being in the current context so it could be error-prone.
> > > > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > > > > 
> > > > > If someone want to add other hints, we could hear hear the usecase and
> > > > > review it for each hint. It's more safe for maintainace rather than
> > > > > introducing a buggy syscall but hard to fix it later.
> > > > 
> > > > I have brought this up when we discussed this in the past but there is
> > > > no reflection on that here so let me bring that up again. 
> > > > 
> > > > I believe that the interface has an inherent problem that it is racy.
> > > > The external entity needs to know the address space layout of the target
> > > > process to do anyhing useful on it. The address space is however under
> > > > the full control of the target process though and the external entity
> > > > has no means to find out that the layout has changed. So
> > > > time-to-check-time-to-act is an inherent problem.
> > > > 
> > > > This is a serious design flaw and it should be explained why it doesn't
> > > > matter or how to use the interface properly to prevent that problem.
> > > 
> > > I agree, it looks flawed.
> > > 
> > > Also I don't see what System Management Software can generically do on
> > > sub-process level. I mean how can it decide which part of address space is
> > > less important than other.
> > > 
> > > I see how a manager can indicate that this process (or a group of
> > > processes) is less important than other, but on per-addres-range basis?
> > 
> > For example, memory ranges shared by several processes or critical for the
> > latency, we could avoid those ranges to be cold/pageout to prevent
> > unncecessary CPU burning/paging.
> 
> Hmm.. I still don't see why any external entity has a better (or any)
> knowledge about the matter. The process has to do this, no?

I think Sandeep already gave enough information in other thread.

> 
> > I also think people don't want to give an KSM hint to non-mergeable area.
> 
> And how the manager knows which data is mergable?

Oleksandr, could you say your thought why you need address range based
API?

> 
> If you are intimate enough with the process' internal state feel free to
> inject syscall into the process with ptrace. Why bother with half-measures?

Concern is we want to act the hint in caller's context, not calle because
calle is usually very limited in cpuset/cgroups or even freezed state so
they couldn't act by themselves quick enough, which makes many problems.
One of efforts to solve the issue was "Expedited memory reclaim"

	https://lwn.net/Articles/785709/

That could be also a good candidate for process_madvise API.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-20  7:58             ` Michal Hocko
  2020-01-20 10:39               ` Kirill Tkhai
@ 2020-01-21 18:32               ` Minchan Kim
  2020-01-22  8:28                 ` Michal Hocko
  1 sibling, 1 reply; 39+ messages in thread
From: Minchan Kim @ 2020-01-21 18:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: sspatil, kirill, akpm, linux-kernel, linux-mm, linux-api,
	oleksandr, surenb, timmurray, dancol, sonnyrao, bgeffon, hannes,
	shakeelb, joaodias, ktkhai, christian.brauner, sjpark

On Mon, Jan 20, 2020 at 08:58:25AM +0100, Michal Hocko wrote:
> On Sun 19-01-20 08:14:31, sspatil@google.com wrote:
> > On Sat, Jan 18, 2020 at 12:26:53AM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Jan 17, 2020 at 09:32:39AM -0800, Minchan Kim wrote:
> > > > On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote:
> > > > > On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> > > > > > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > > > > > > There is usecase that System Management Software(SMS) want to give
> > > > > > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > > > > > > in the case of Android, it is the ActivityManagerService.
> > > > > > > 
> > > > > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > > > > > required to make the reclaim decision is not known to the app. Instead,
> > > > > > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > > > > > and that daemon must be able to initiate reclaim on its own without
> > > > > > > any app involvement.
> > > > > > > 
> > > > > > > To solve the issue, this patch introduces new syscall process_madvise(2).
> > > > > > > It uses pidfd of an external processs to give the hint.
> > > > > > > 
> > > > > > >  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > > > > > 			unsigned long flag);
> > > > > > > 
> > > > > > > Since it could affect other process's address range, only privileged
> > > > > > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > > > > > gives it the right to ptrace the process could use it successfully.
> > > > > > > The flag argument is reserved for future use if we need to extend the
> > > > > > > API.
> > > > > > > 
> > > > > > > I think supporting all hints madvise has/will supported/support to
> > > > > > > process_madvise is rather risky. Because we are not sure all hints make
> > > > > > > sense from external process and implementation for the hint may rely on
> > > > > > > the caller being in the current context so it could be error-prone.
> > > > > > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > > > > > > 
> > > > > > > If someone want to add other hints, we could hear hear the usecase and
> > > > > > > review it for each hint. It's more safe for maintainace rather than
> > > > > > > introducing a buggy syscall but hard to fix it later.
> > > > > > 
> > > > > > I have brought this up when we discussed this in the past but there is
> > > > > > no reflection on that here so let me bring that up again. 
> > > > > > 
> > > > > > I believe that the interface has an inherent problem that it is racy.
> > > > > > The external entity needs to know the address space layout of the target
> > > > > > process to do anyhing useful on it. The address space is however under
> > > > > > the full control of the target process though and the external entity
> > > > > > has no means to find out that the layout has changed. So
> > > > > > time-to-check-time-to-act is an inherent problem.
> > > > > > 
> > > > > > This is a serious design flaw and it should be explained why it doesn't
> > > > > > matter or how to use the interface properly to prevent that problem.
> > > > > 
> > > > > I agree, it looks flawed.
> > > > > 
> > > > > Also I don't see what System Management Software can generically do on
> > > > > sub-process level. I mean how can it decide which part of address space is
> > > > > less important than other.
> > > > > 
> > > > > I see how a manager can indicate that this process (or a group of
> > > > > processes) is less important than other, but on per-addres-range basis?
> > > > 
> > > > For example, memory ranges shared by several processes or critical for the
> > > > latency, we could avoid those ranges to be cold/pageout to prevent
> > > > unncecessary CPU burning/paging.
> > > 
> > > Hmm.. I still don't see why any external entity has a better (or any)
> > > knowledge about the matter. The process has to do this, no?
> > 
> > FWIW, I totally agree with the time-to-check-time-to-react problem. However,
> > I'd like to clarify the ActivityManager/SystemServer case (I'll call it
> > SystemServer from now on)
> > 
> > For Android, every application (including the special SystemServer) are forked
> > from Zygote. The reason ofcourse is to share as many libraries and classes between
> > the two as possible to benefit from the preloading during boot.
> > 
> > After applications start, (almost) all of the APIs  end up calling into this
> > SystemServer process over IPC (binder) and back to the application.
> > 
> > In a fully running system, the SystemServer monitors every single process
> > periodically to calculate their PSS / RSS and also decides which process is
> > "important" to the user for interactivity.
> > 
> > So, because of how these processes start _and_ the fact that the SystemServer
> > is looping to monitor each process, it does tend to *know* which address
> > range of the application is not used / useful.
> > 
> > Besides, we can never rely on applications to clean things up themselves.
> > We've had the "hey app1, the system is low on memory, please trim your
> > memory usage down" notifications for a long time[1]. They rely on
> > applications honoring the broadcasts and very few do.
> > 
> > So, if we want to avoid the inevitable killing of the application and
> > restarting it, some way to be able to tell the OS about unimportant memory in
> > these applications will be useful.

Thanks for adding more useful description, Sandeep.

> 
> This is a useful information that should be a part of the changelog. I

I will incldue it in next respin.

> do see how the current form of the API might fit into Android model
> without many problems. But we are not designing an API for a single
> usecase, right? In a highly cooperative environments you can use ptrace
> code injection as mentioned by Kirill. Or is there any fundamental
> problem about that?

I replied it at Kirill's thread so let's discuss there.

> 
> The interface really has to be robust to future potential usecases.

I do understand your concern but for me, it's chicken and egg problem.
We usually do best effort to make something perfect as far as possible
but we also don't do over-engineering without real usecase from the
beginning.

I already told you how we could synchronize among processes and potential
way to be extended Daniel suggested(That's why current API has extra field
for the cookie) even though we don't need it right now.

If you want to suggest the other way, please explain why your idea is
better and why we need it at this moment. I don't think that is a
blocker for the progress of this API since we already have several ways
to synchronize processes.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-20 13:24           ` Michal Hocko
  2020-01-20 14:21             ` Kirill A. Shutemov
@ 2020-01-21 18:43             ` Minchan Kim
  1 sibling, 0 replies; 39+ messages in thread
From: Minchan Kim @ 2020-01-21 18:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kirill A. Shutemov, Kirill Tkhai, Andrew Morton, LKML, linux-mm,
	linux-api, oleksandr, Suren Baghdasaryan, Tim Murray,
	Daniel Colascione, Sandeep Patil, Sonny Rao, Brian Geffon,
	Johannes Weiner, Shakeel Butt, John Dias, christian.brauner,
	sjpark

On Mon, Jan 20, 2020 at 02:24:05PM +0100, Michal Hocko wrote:
> On Mon 20-01-20 15:39:35, Kirill A. Shutemov wrote:
> > On Mon, Jan 20, 2020 at 12:27:22PM +0100, Michal Hocko wrote:
> > > On Mon 20-01-20 13:24:35, Kirill Tkhai wrote:
> [...]
> > > > Even two threads on common memory need a synchronization
> > > > to manage mappings in a sane way. Managing memory from two processes
> > > > is the same in principle, and the only difference is that another level
> > > > of synchronization is required.
> > > 
> > > Well, not really. The operation might simply attempt to perform an
> > > operation on a specific memory area and get a failure if it doesn't
> > > reference the same object anymore. What I think we need is some form of
> > > a handle to operate on. In the past we have discussed several
> > > directions. I was proposing /proc/self/map_anon/ (analogous to
> > > map_files) where you could inspect anonymous memory and get a file
> > > handle for it. madvise would then operate on the fd and then there
> > > shouldn't be a real problem to revalidate that the object is still
> > > valid. But there was no general enthusiasm about that approach. There
> > > are likely some land mines on the way.
> > 
> > Converting anon memory to file-backed is bad idea and going to backfire.
> 
> I didn't mean to convert. I meant to expose that information via proc
> the same way we do for file backed mappings. That shouldn't really
> require to re-design the way how anonymous vma work IMO. But I haven't
> tried that so there might be many gotchas there.
> 
> There are obvious things to think about though. Such fd cannot be sent
> to other processes (SCM stuff), mmap of the file would have to be
> disallowed and many others I am not aware of. I am not even pushing this
> direction because I am not convinced about how viable it is myself. But
> it would sound like a nice extension of the existing mechanism we have
> and a file based madvise sounds attractive to me as well because we
> already have that.

I am not a fan of fd based approach but I already reserved last argument
of the API as extendable field so we could use the field as "fd" when we
really need that kinds of fine-grained synchronization model if it's not
enough with SGISTOP, freezer and so.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-21 18:32               ` Minchan Kim
@ 2020-01-22  8:28                 ` Michal Hocko
  2020-01-22  9:36                   ` SeongJae Park
  2020-01-23  1:41                   ` Minchan Kim
  0 siblings, 2 replies; 39+ messages in thread
From: Michal Hocko @ 2020-01-22  8:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: sspatil, kirill, akpm, linux-kernel, linux-mm, linux-api,
	oleksandr, surenb, timmurray, dancol, sonnyrao, bgeffon, hannes,
	shakeelb, joaodias, ktkhai, christian.brauner, sjpark

On Tue 21-01-20 10:32:12, Minchan Kim wrote:
> On Mon, Jan 20, 2020 at 08:58:25AM +0100, Michal Hocko wrote:
[...]
> > The interface really has to be robust to future potential usecases.
> 
> I do understand your concern but for me, it's chicken and egg problem.
> We usually do best effort to make something perfect as far as possible
> but we also don't do over-engineering without real usecase from the
> beginning.
> 
> I already told you how we could synchronize among processes and potential
> way to be extended Daniel suggested(That's why current API has extra field
> for the cookie) even though we don't need it right now.

If you can synchronize with the target task then you do not need a
remote interface. Just use ptrace and you are done with it.

> If you want to suggest the other way, please explain why your idea is
> better and why we need it at this moment.

I believe I have explained my concerns and why they matter. All you are
saying is that you do not care because your particular usecase doesn't
care. And that is a first signal of a future disaster when we end up
with a broken and unfixable interface we have to maintain for ever.

I will not go as far as to nack this but you should seriously think
about other potential usecases and how they would work and what we are
going to do when a first non-cooperative userspace memory management
usecase materializes.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-22  8:28                 ` Michal Hocko
@ 2020-01-22  9:36                   ` SeongJae Park
  2020-01-22 10:02                     ` Michal Hocko
  2020-01-23  1:41                   ` Minchan Kim
  1 sibling, 1 reply; 39+ messages in thread
From: SeongJae Park @ 2020-01-22  9:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, sspatil, kirill, akpm, linux-kernel, linux-mm,
	linux-api, oleksandr, surenb, timmurray, dancol, sonnyrao,
	bgeffon, hannes, shakeelb, joaodias, ktkhai, christian.brauner,
	sjpark

On Wed, 22 Jan 2020 09:28:53 +0100 Michal Hocko <mhocko@kernel.org> wrote:

> On Tue 21-01-20 10:32:12, Minchan Kim wrote:
> > On Mon, Jan 20, 2020 at 08:58:25AM +0100, Michal Hocko wrote:
> [...]
> > > The interface really has to be robust to future potential usecases.
> > 
> > I do understand your concern but for me, it's chicken and egg problem.
> > We usually do best effort to make something perfect as far as possible
> > but we also don't do over-engineering without real usecase from the
> > beginning.
> > 
> > I already told you how we could synchronize among processes and potential
> > way to be extended Daniel suggested(That's why current API has extra field
> > for the cookie) even though we don't need it right now.
> 
> If you can synchronize with the target task then you do not need a
> remote interface. Just use ptrace and you are done with it.
> 
> > If you want to suggest the other way, please explain why your idea is
> > better and why we need it at this moment.
> 
> I believe I have explained my concerns and why they matter. All you are
> saying is that you do not care because your particular usecase doesn't
> care. And that is a first signal of a future disaster when we end up
> with a broken and unfixable interface we have to maintain for ever.
> 
> I will not go as far as to nack this but you should seriously think
> about other potential usecases and how they would work and what we are
> going to do when a first non-cooperative userspace memory management
> usecase materializes.

Beside of the specific environment of Android, I think there are many ways to
know the address space layout and access patterns of other processes.  The
idle_page_tracking might be an example that widelay available.

Of course, the information might not strictly correct due to the timing issue,
but could be still worth to be used under some extreme situations, such as
memory pressure or fragmentation.  For the same reason, ptrace() would not be
sufficient, as we have no perfect control, but only some level of control that
would be useful under specific situations.

I assume the users of this systemcall would understand the tradeoff and make
decisions.  Also, as the users already have the right to do the tradeoff, I
think it's fair.  In other words, I think the caller has both the power and the
responsibility to deal with the time-to-check-time-to-react problem.

Nonetheless, I also agree this is important concern and the patch would be
better if it adds more detailed documentation regarding this issue.

If I'm missing some points or you have different opinions, please let me know.


Thanks,
SeongJae Park

> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-22  9:36                   ` SeongJae Park
@ 2020-01-22 10:02                     ` Michal Hocko
  2020-01-22 13:28                       ` SeongJae Park
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2020-01-22 10:02 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Minchan Kim, sspatil, kirill, akpm, linux-kernel, linux-mm,
	linux-api, oleksandr, surenb, timmurray, dancol, sonnyrao,
	bgeffon, hannes, shakeelb, joaodias, ktkhai, christian.brauner,
	sjpark

On Wed 22-01-20 10:36:24, SeongJae Park wrote:
> On Wed, 22 Jan 2020 09:28:53 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Tue 21-01-20 10:32:12, Minchan Kim wrote:
> > > On Mon, Jan 20, 2020 at 08:58:25AM +0100, Michal Hocko wrote:
> > [...]
> > > > The interface really has to be robust to future potential usecases.
> > > 
> > > I do understand your concern but for me, it's chicken and egg problem.
> > > We usually do best effort to make something perfect as far as possible
> > > but we also don't do over-engineering without real usecase from the
> > > beginning.
> > > 
> > > I already told you how we could synchronize among processes and potential
> > > way to be extended Daniel suggested(That's why current API has extra field
> > > for the cookie) even though we don't need it right now.
> > 
> > If you can synchronize with the target task then you do not need a
> > remote interface. Just use ptrace and you are done with it.
> > 
> > > If you want to suggest the other way, please explain why your idea is
> > > better and why we need it at this moment.
> > 
> > I believe I have explained my concerns and why they matter. All you are
> > saying is that you do not care because your particular usecase doesn't
> > care. And that is a first signal of a future disaster when we end up
> > with a broken and unfixable interface we have to maintain for ever.
> > 
> > I will not go as far as to nack this but you should seriously think
> > about other potential usecases and how they would work and what we are
> > going to do when a first non-cooperative userspace memory management
> > usecase materializes.
> 
> Beside of the specific environment of Android, I think there are many ways to
> know the address space layout and access patterns of other processes.  The
> idle_page_tracking might be an example that widelay available.
> 
> Of course, the information might not strictly correct due to the timing issue,
> but could be still worth to be used under some extreme situations, such as
> memory pressure or fragmentation.  For the same reason, ptrace() would not be
> sufficient, as we have no perfect control, but only some level of control that
> would be useful under specific situations.

I am not sure I see your point. I am talking about races where a remote
task is operating on a completely different object because the one it
checked for has been unmapped and new one mapped over it. Memory
pressure or a fragmentation will not change the object itself. Sure the
memory might be reclaimed but that should be completely OK unless I am
missing something.

> I assume the users of this systemcall would understand the tradeoff and make
> decisions.

I disagree. My experience tells me that users tend to squeeze the
maximum and beyond and hope they get what they want.

> Also, as the users already have the right to do the tradeoff, I
> think it's fair.  In other words, I think the caller has both the power and the
> responsibility to deal with the time-to-check-time-to-react problem.
> 
> Nonetheless, I also agree this is important concern and the patch would be
> better if it adds more detailed documentation regarding this issue.

If there is _really_ a strong consensus that the racy interface is
reasonable then it absolutely has to be described with a clearly state
that those races might result in hard to predict behavior unless all
tasks sharing the address space are blocked between the check and the
madvise call.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-21 18:11           ` Minchan Kim
@ 2020-01-22 10:44             ` Oleksandr Natalenko
  2020-01-23  1:43               ` Minchan Kim
  0 siblings, 1 reply; 39+ messages in thread
From: Oleksandr Natalenko @ 2020-01-22 10:44 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Kirill A. Shutemov, Michal Hocko, Andrew Morton, LKML, linux-mm,
	linux-api, Suren Baghdasaryan, Tim Murray, Daniel Colascione,
	Sandeep Patil, Sonny Rao, Brian Geffon, Johannes Weiner,
	Shakeel Butt, John Dias, ktkhai, christian.brauner, sjpark

Hello.

On Tue, Jan 21, 2020 at 10:11:13AM -0800, Minchan Kim wrote:
> > > I also think people don't want to give an KSM hint to non-mergeable area.
> > 
> > And how the manager knows which data is mergable?
> 
> Oleksandr, could you say your thought why you need address range based
> API?

It seems I've overlooked an important piece of this submission: one
cannot apply the hint to all the anonymous mapping regardless of address
range. For KSM I'd rather either have a possibility to hint all the
anonymous mappings, or, as it was suggested previously, be able to iterate
over existing mappings using some (fd-based?) API.

-- 
  Best regards,
    Oleksandr Natalenko (post-factum)
    Senior Software Maintenance Engineer


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Re: Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-22 10:02                     ` Michal Hocko
@ 2020-01-22 13:28                       ` SeongJae Park
  0 siblings, 0 replies; 39+ messages in thread
From: SeongJae Park @ 2020-01-22 13:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: SeongJae Park, Minchan Kim, sspatil, kirill, akpm, linux-kernel,
	linux-mm, linux-api, oleksandr, surenb, timmurray, dancol,
	sonnyrao, bgeffon, hannes, shakeelb, joaodias, ktkhai,
	christian.brauner, sjpark

On Wed, 22 Jan 2020 11:02:33 +0100 Michal Hocko <mhocko@kernel.org> wrote:

> On Wed 22-01-20 10:36:24, SeongJae Park wrote:
> > On Wed, 22 Jan 2020 09:28:53 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > > On Tue 21-01-20 10:32:12, Minchan Kim wrote:
> > > > On Mon, Jan 20, 2020 at 08:58:25AM +0100, Michal Hocko wrote:
> > > [...]
> > > > > The interface really has to be robust to future potential usecases.
> > > > 
> > > > I do understand your concern but for me, it's chicken and egg problem.
> > > > We usually do best effort to make something perfect as far as possible
> > > > but we also don't do over-engineering without real usecase from the
> > > > beginning.
> > > > 
> > > > I already told you how we could synchronize among processes and potential
> > > > way to be extended Daniel suggested(That's why current API has extra field
> > > > for the cookie) even though we don't need it right now.
> > > 
> > > If you can synchronize with the target task then you do not need a
> > > remote interface. Just use ptrace and you are done with it.
> > > 
> > > > If you want to suggest the other way, please explain why your idea is
> > > > better and why we need it at this moment.
> > > 
> > > I believe I have explained my concerns and why they matter. All you are
> > > saying is that you do not care because your particular usecase doesn't
> > > care. And that is a first signal of a future disaster when we end up
> > > with a broken and unfixable interface we have to maintain for ever.
> > > 
> > > I will not go as far as to nack this but you should seriously think
> > > about other potential usecases and how they would work and what we are
> > > going to do when a first non-cooperative userspace memory management
> > > usecase materializes.
> > 
> > Beside of the specific environment of Android, I think there are many ways to
> > know the address space layout and access patterns of other processes.  The
> > idle_page_tracking might be an example that widelay available.
> > 
> > Of course, the information might not strictly correct due to the timing issue,
> > but could be still worth to be used under some extreme situations, such as
> > memory pressure or fragmentation.  For the same reason, ptrace() would not be
> > sufficient, as we have no perfect control, but only some level of control that
> > would be useful under specific situations.
> 
> I am not sure I see your point. I am talking about races where a remote
> task is operating on a completely different object because the one it
> checked for has been unmapped and new one mapped over it. Memory
> pressure or a fragmentation will not change the object itself. Sure the
> memory might be reclaimed but that should be completely OK unless I am
> missing something.

Thank you for pointing out your concerns in more detail.  I was assuming a case
using MADV_PAGEOUT or MADV_HUGEPAGE like hints under access frequency
monitoring for better performance under memory pressure or fragmentation,
respectively.  Under the race, such hints might incur some performance
degradation, but no critical problem such as SEGV.  I previously implemented
such optimization for research purpose and it was worthy.  Nonetheless, it was
just a research purpose hack.

MADV_FREE like hints might result in SEGV and thus of course should be avoided.
But, to my perspective, the 4 hints madvise_process() is currently supporting
(COLD, PAGEOUT, MERGEABLE, UNMERGEABLE) are not too risky even under the race.
That's why I said the incorrect information could be worth to be used under
some extreme situations.

> 
> > I assume the users of this systemcall would understand the tradeoff and make
> > decisions.
> 
> I disagree. My experience tells me that users tend to squeeze the
> maximum and beyond and hope they get what they want.
> 
> > Also, as the users already have the right to do the tradeoff, I
> > think it's fair.  In other words, I think the caller has both the power and the
> > responsibility to deal with the time-to-check-time-to-react problem.
> > 
> > Nonetheless, I also agree this is important concern and the patch would be
> > better if it adds more detailed documentation regarding this issue.
> 
> If there is _really_ a strong consensus that the racy interface is
> reasonable then it absolutely has to be described with a clearly state
> that those races might result in hard to predict behavior unless all
> tasks sharing the address space are blocked between the check and the
> madvise call.

So, it's still too risky to simply believe users to do the things well on their
responsibility, but a strong real consensus on needs and clear description
might justify this.  I also agreed.


Thanks,
SeongJae Park

> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-22  8:28                 ` Michal Hocko
  2020-01-22  9:36                   ` SeongJae Park
@ 2020-01-23  1:41                   ` Minchan Kim
  2020-01-23  9:13                     ` Michal Hocko
  1 sibling, 1 reply; 39+ messages in thread
From: Minchan Kim @ 2020-01-23  1:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: sspatil, kirill, akpm, linux-kernel, linux-mm, linux-api,
	oleksandr, surenb, timmurray, dancol, sonnyrao, bgeffon, hannes,
	shakeelb, joaodias, ktkhai, christian.brauner, sjpark

On Wed, Jan 22, 2020 at 09:28:53AM +0100, Michal Hocko wrote:
> On Tue 21-01-20 10:32:12, Minchan Kim wrote:
> > On Mon, Jan 20, 2020 at 08:58:25AM +0100, Michal Hocko wrote:
> [...]
> > > The interface really has to be robust to future potential usecases.
> > 
> > I do understand your concern but for me, it's chicken and egg problem.
> > We usually do best effort to make something perfect as far as possible
> > but we also don't do over-engineering without real usecase from the
> > beginning.
> > 
> > I already told you how we could synchronize among processes and potential
> > way to be extended Daniel suggested(That's why current API has extra field
> > for the cookie) even though we don't need it right now.
> 
> If you can synchronize with the target task then you do not need a
> remote interface. Just use ptrace and you are done with it.

As I mentioned in other reply, we want to do in caller's context, not
callee's one because target processes stay in little cores, which are
much slower than the core the manager lives in.
The other reason is the apps are already freezed so they couldn't response
by ptrace.

> 
> > If you want to suggest the other way, please explain why your idea is
> > better and why we need it at this moment.
> 
> I believe I have explained my concerns and why they matter. All you are
> saying is that you do not care because your particular usecase doesn't
> care. And that is a first signal of a future disaster when we end up
> with a broken and unfixable interface we have to maintain for ever.

We already had suggested cookie and fd based approaches so I reserved a
argument for that to make the API extendable.
Thing is currently it's a just optimization idea since we have several
ways to sychronize processes(e.g., signal, cgroup freezer, userfaultfd
and so). It's a just matter of granularity, not necessary one we should
introduce it from the beginnig.
If someone needs that kinds of fine-grained consistency, we could extend
it then. And that's the usual way we make progress when we couldn't
know the future.

What do you want to see further?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-22 10:44             ` Oleksandr Natalenko
@ 2020-01-23  1:43               ` Minchan Kim
  2020-01-23  7:29                 ` Oleksandr Natalenko
  0 siblings, 1 reply; 39+ messages in thread
From: Minchan Kim @ 2020-01-23  1:43 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: Kirill A. Shutemov, Michal Hocko, Andrew Morton, LKML, linux-mm,
	linux-api, Suren Baghdasaryan, Tim Murray, Daniel Colascione,
	Sandeep Patil, Sonny Rao, Brian Geffon, Johannes Weiner,
	Shakeel Butt, John Dias, ktkhai, christian.brauner, sjpark

On Wed, Jan 22, 2020 at 11:44:24AM +0100, Oleksandr Natalenko wrote:
> Hello.
> 
> On Tue, Jan 21, 2020 at 10:11:13AM -0800, Minchan Kim wrote:
> > > > I also think people don't want to give an KSM hint to non-mergeable area.
> > > 
> > > And how the manager knows which data is mergable?
> > 
> > Oleksandr, could you say your thought why you need address range based
> > API?
> 
> It seems I've overlooked an important piece of this submission: one
> cannot apply the hint to all the anonymous mapping regardless of address
> range. For KSM I'd rather either have a possibility to hint all the
> anonymous mappings, or, as it was suggested previously, be able to iterate
> over existing mappings using some (fd-based?) API.

Thing is how you could identify a certan range is better for KSM than
others from external process?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-23  1:43               ` Minchan Kim
@ 2020-01-23  7:29                 ` Oleksandr Natalenko
  0 siblings, 0 replies; 39+ messages in thread
From: Oleksandr Natalenko @ 2020-01-23  7:29 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Kirill A. Shutemov, Michal Hocko, Andrew Morton, LKML, linux-mm,
	linux-api, Suren Baghdasaryan, Tim Murray, Daniel Colascione,
	Sandeep Patil, Sonny Rao, Brian Geffon, Johannes Weiner,
	Shakeel Butt, John Dias, ktkhai, christian.brauner, sjpark

On Wed, Jan 22, 2020 at 05:43:16PM -0800, Minchan Kim wrote:
> > It seems I've overlooked an important piece of this submission: one
> > cannot apply the hint to all the anonymous mapping regardless of address
> > range. For KSM I'd rather either have a possibility to hint all the
> > anonymous mappings, or, as it was suggested previously, be able to iterate
> > over existing mappings using some (fd-based?) API.
> 
> Thing is how you could identify a certan range is better for KSM than
> others from external process?

I think the info like this is kinda available via /proc/pid/smaps. It
lists the ranges and the vmflags. But using it raises 2 concerns: one is
the absence of guarantee the mappings won't change after smaps is read
and the second one is that there's no separate vmflag for marking a vma
as non-meregable (and IIRC from previous attempts on addressing this,
we've already exhausted all the flags on 32-bit arches, so it is not
something that can be trivially addressed).

-- 
  Best regards,
    Oleksandr Natalenko (post-factum)
    Senior Software Maintenance Engineer


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] mm: introduce external memory hinting API
  2020-01-23  1:41                   ` Minchan Kim
@ 2020-01-23  9:13                     ` Michal Hocko
  0 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2020-01-23  9:13 UTC (permalink / raw)
  To: Minchan Kim
  Cc: sspatil, kirill, akpm, linux-kernel, linux-mm, linux-api,
	oleksandr, surenb, timmurray, dancol, sonnyrao, bgeffon, hannes,
	shakeelb, joaodias, ktkhai, christian.brauner, sjpark

On Wed 22-01-20 17:41:31, Minchan Kim wrote:
[...]
> What do you want to see further?

Either a consensus that it is sufficient to have an inherently racy
interface that requires some form of external sychronization or
a robust interface that can cope with races in a sensible way.
And no, having a flag for future extension is definitely not the
way how a new API should be added. Seriously!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2020-01-23  9:13 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-16 23:59 [PATCH v2 0/5] introduce memory hinting API for external process Minchan Kim
2020-01-16 23:59 ` [PATCH v2 1/5] mm: factor out madvise's core functionality Minchan Kim
2020-01-17 10:02   ` Kirill Tkhai
2020-01-17 18:14     ` Minchan Kim
2020-01-16 23:59 ` [PATCH v2 2/5] mm: introduce external memory hinting API Minchan Kim
2020-01-17 11:52   ` Michal Hocko
2020-01-17 15:58     ` Kirill A. Shutemov
2020-01-17 17:32       ` Minchan Kim
2020-01-17 21:26         ` Kirill A. Shutemov
2020-01-18  9:40           ` SeongJae Park
2020-01-19 16:14           ` sspatil
2020-01-20  7:58             ` Michal Hocko
2020-01-20 10:39               ` Kirill Tkhai
2020-01-21 18:32               ` Minchan Kim
2020-01-22  8:28                 ` Michal Hocko
2020-01-22  9:36                   ` SeongJae Park
2020-01-22 10:02                     ` Michal Hocko
2020-01-22 13:28                       ` SeongJae Park
2020-01-23  1:41                   ` Minchan Kim
2020-01-23  9:13                     ` Michal Hocko
2020-01-21 18:11           ` Minchan Kim
2020-01-22 10:44             ` Oleksandr Natalenko
2020-01-23  1:43               ` Minchan Kim
2020-01-23  7:29                 ` Oleksandr Natalenko
2020-01-17 17:25     ` Minchan Kim
2020-01-20  8:03       ` Michal Hocko
2020-01-20 10:24     ` Kirill Tkhai
2020-01-20 11:27       ` Michal Hocko
2020-01-20 12:39         ` Kirill A. Shutemov
2020-01-20 13:24           ` Michal Hocko
2020-01-20 14:21             ` Kirill A. Shutemov
2020-01-20 15:44               ` Michal Hocko
2020-01-21 18:43             ` Minchan Kim
2020-01-16 23:59 ` [PATCH v2 3/5] mm/madvise: employ mmget_still_valid for write lock Minchan Kim
2020-01-16 23:59 ` [PATCH v2 4/5] mm/madvise: allow KSM hints for remote API Minchan Kim
2020-01-17 10:13   ` Kirill Tkhai
2020-01-17 12:34     ` Oleksandr Natalenko
2020-01-21 17:45       ` Minchan Kim
2020-01-16 23:59 ` [PATCH v2 5/5] mm: support both pid and pidfd for process_madvise Minchan Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).