LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH 1/1] mm: introduce process_reap system call
@ 2021-06-23 19:28 Suren Baghdasaryan
  2021-06-23 19:34 ` Suren Baghdasaryan
                   ` (4 more replies)
  0 siblings, 5 replies; 31+ messages in thread
From: Suren Baghdasaryan @ 2021-06-23 19:28 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, mhocko, rientjes, willy, hannes, guro, riel, minchan,
	christian, hch, oleg, david, jannh, shakeelb, timmurray,
	linux-api, linux-mm, linux-kernel, kernel-team, surenb

In modern systems it's not unusual to have a system component monitoring
memory conditions of the system and tasked with keeping system memory
pressure under control. One way to accomplish that is to kill
non-essential processes to free up memory for more important ones.
Examples of this are Facebook's OOM killer daemon called oomd and
Android's low memory killer daemon called lmkd.
For such system component it's important to be able to free memory
quickly and efficiently. Unfortunately the time process takes to free
up its memory after receiving a SIGKILL might vary based on the state
of the process (uninterruptible sleep), size and OPP level of the core
the process is running. A mechanism to free resources of the target
process in a more predictable way would improve system's ability to
control its memory pressure.
Introduce process_reap system call that reclaims memory of a dying process
from the context of the caller. This way the memory in freed in a more
controllable way with CPU affinity and priority of the caller. The workload
of freeing the memory will also be charged to the caller.
The operation is allowed only on a dying process.

Previously I proposed a number of alternatives to accomplish this:
- https://lore.kernel.org/patchwork/patch/1060407 extending
pidfd_send_signal to allow memory reaping using oom_reaper thread;
- https://lore.kernel.org/patchwork/patch/1338196 extending
pidfd_send_signal to reap memory of the target process synchronously from
the context of the caller;
- https://lore.kernel.org/patchwork/patch/1344419/ to add MADV_DONTNEED
support for process_madvise implementing synchronous memory reaping.

The end of the last discussion culminated with suggestion to introduce a
dedicated system call (https://lore.kernel.org/patchwork/patch/1344418/#1553875)
The reasoning was that the new variant of process_madvise
  a) does not work on an address range
  b) is destructive
  c) doesn't share much code at all with the rest of process_madvise
From the userspace point of view it was awkward and inconvenient to provide
memory range for this operation that operates on the entire address space.
Using special flags or address values to specify the entire address space
was too hacky.

The API is as follows,

          int process_reap(int pidfd, unsigned int flags);

        DESCRIPTION
          The process_reap() system call is used to free the memory of a
          dying process.

          The pidfd selects the process referred to by the PID file
          descriptor.
          (See pidofd_open(2) for further information)

          The flags argument is reserved for future use; currently, this
          argument must be specified as 0.

        RETURN VALUE
          On success, process_reap() returns 0. On error, -1 is returned
          and errno is set to indicate the error.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
 arch/arm/tools/syscall.tbl                  |  1 +
 arch/arm64/include/asm/unistd.h             |  2 +-
 arch/arm64/include/asm/unistd32.h           |  2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
 arch/s390/kernel/syscalls/syscall.tbl       |  1 +
 arch/sh/kernel/syscalls/syscall.tbl         |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
 include/linux/syscalls.h                    |  1 +
 include/uapi/asm-generic/unistd.h           |  4 +-
 kernel/sys_ni.c                             |  1 +
 mm/oom_kill.c                               | 50 +++++++++++++++++++++
 22 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 3000a2e8ee21..14b9e81d2fc4 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -486,3 +486,4 @@
 554	common	landlock_create_ruleset		sys_landlock_create_ruleset
 555	common	landlock_add_rule		sys_landlock_add_rule
 556	common	landlock_restrict_self		sys_landlock_restrict_self
+557	common	process_reap			sys_process_reap
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 28e03b5fec00..889b78d0f63f 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -460,3 +460,4 @@
 444	common	landlock_create_ruleset		sys_landlock_create_ruleset
 445	common	landlock_add_rule		sys_landlock_add_rule
 446	common	landlock_restrict_self		sys_landlock_restrict_self
+447	common	process_reap			sys_process_reap
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 727bfc3be99b..fb7a0be2f3d9 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		447
+#define __NR_compat_syscalls		448
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 5dab69d2c22b..80593454173e 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -900,6 +900,8 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset)
 __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
 #define __NR_landlock_restrict_self 446
 __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
+#define __NR_process_reap 447
+__SYSCALL(__NR_process_reap, sys_process_reap)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index bb11fe4c875a..6c94feedf086 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -367,3 +367,4 @@
 444	common	landlock_create_ruleset		sys_landlock_create_ruleset
 445	common	landlock_add_rule		sys_landlock_add_rule
 446	common	landlock_restrict_self		sys_landlock_restrict_self
+447	common	process_reap			sys_process_reap
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 79c2d24c89dd..e80a7fa55696 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -446,3 +446,4 @@
 444	common	landlock_create_ruleset		sys_landlock_create_ruleset
 445	common	landlock_add_rule		sys_landlock_add_rule
 446	common	landlock_restrict_self		sys_landlock_restrict_self
+447	common	process_reap			sys_process_reap
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index b11395a20c20..511b2bd61fc1 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -452,3 +452,4 @@
 444	common	landlock_create_ruleset		sys_landlock_create_ruleset
 445	common	landlock_add_rule		sys_landlock_add_rule
 446	common	landlock_restrict_self		sys_landlock_restrict_self
+447	common	process_reap			sys_process_reap
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 9220909526f9..1775704c6a24 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -385,3 +385,4 @@
 444	n32	landlock_create_ruleset		sys_landlock_create_ruleset
 445	n32	landlock_add_rule		sys_landlock_add_rule
 446	n32	landlock_restrict_self		sys_landlock_restrict_self
+447	n32	process_reap			sys_process_reap
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 9cd1c34f31b5..d769daca3f79 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -361,3 +361,4 @@
 444	n64	landlock_create_ruleset		sys_landlock_create_ruleset
 445	n64	landlock_add_rule		sys_landlock_add_rule
 446	n64	landlock_restrict_self		sys_landlock_restrict_self
+447	n64	process_reap			sys_process_reap
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index d560c467a8c6..1bd2fc056677 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -434,3 +434,4 @@
 444	o32	landlock_create_ruleset		sys_landlock_create_ruleset
 445	o32	landlock_add_rule		sys_landlock_add_rule
 446	o32	landlock_restrict_self		sys_landlock_restrict_self
+447	o32	process_reap			sys_process_reap
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index aabc37f8cae3..0012561ca557 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -444,3 +444,4 @@
 444	common	landlock_create_ruleset		sys_landlock_create_ruleset
 445	common	landlock_add_rule		sys_landlock_add_rule
 446	common	landlock_restrict_self		sys_landlock_restrict_self
+447	common	process_reap			sys_process_reap
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 8f052ff4058c..89cbcc732b18 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -526,3 +526,4 @@
 444	common	landlock_create_ruleset		sys_landlock_create_ruleset
 445	common	landlock_add_rule		sys_landlock_add_rule
 446	common	landlock_restrict_self		sys_landlock_restrict_self
+447	common	process_reap			sys_process_reap
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 0690263df1dd..7ebd4d809b5e 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -449,3 +449,4 @@
 444  common	landlock_create_ruleset	sys_landlock_create_ruleset	sys_landlock_create_ruleset
 445  common	landlock_add_rule	sys_landlock_add_rule		sys_landlock_add_rule
 446  common	landlock_restrict_self	sys_landlock_restrict_self	sys_landlock_restrict_self
+447  common	process_reap		sys_process_reap		sys_process_reap
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 0b91499ebdcf..178fd47b372e 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -449,3 +449,4 @@
 444	common	landlock_create_ruleset		sys_landlock_create_ruleset
 445	common	landlock_add_rule		sys_landlock_add_rule
 446	common	landlock_restrict_self		sys_landlock_restrict_self
+447	common	process_reap			sys_process_reap
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index e34cc30ef22c..faee121b7ae2 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -492,3 +492,4 @@
 444	common	landlock_create_ruleset		sys_landlock_create_ruleset
 445	common	landlock_add_rule		sys_landlock_add_rule
 446	common	landlock_restrict_self		sys_landlock_restrict_self
+447	common	process_reap			sys_process_reap
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 4bbc267fb36b..cbe070de9884 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -451,3 +451,4 @@
 444	i386	landlock_create_ruleset	sys_landlock_create_ruleset
 445	i386	landlock_add_rule	sys_landlock_add_rule
 446	i386	landlock_restrict_self	sys_landlock_restrict_self
+447	i386	process_reap		sys_process_reap
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index ce18119ea0d0..e6765646731b 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -368,6 +368,7 @@
 444	common	landlock_create_ruleset	sys_landlock_create_ruleset
 445	common	landlock_add_rule	sys_landlock_add_rule
 446	common	landlock_restrict_self	sys_landlock_restrict_self
+447	common	process_reap		sys_process_reap
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index fd2f30227d96..f0e9dbee1a5b 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -417,3 +417,4 @@
 444	common	landlock_create_ruleset		sys_landlock_create_ruleset
 445	common	landlock_add_rule		sys_landlock_add_rule
 446	common	landlock_restrict_self		sys_landlock_restrict_self
+447	common	process_reap			sys_process_reap
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 050511e8f1f8..b6659e09bf0d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -915,6 +915,7 @@ asmlinkage long sys_mincore(unsigned long start, size_t len,
 asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
 asmlinkage long sys_process_madvise(int pidfd, const struct iovec __user *vec,
 			size_t vlen, int behavior, unsigned int flags);
+asmlinkage long sys_process_reap(int pidfd, unsigned int flags);
 asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
 			unsigned long prot, unsigned long pgoff,
 			unsigned long flags);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index d2a942086fcb..b3bf57b928af 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -871,9 +871,11 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset)
 __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
 #define __NR_landlock_restrict_self 446
 __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
+#define __NR_process_reap 447
+__SYSCALL(__NR_process_reap, sys_process_reap)
 
 #undef __NR_syscalls
-#define __NR_syscalls 447
+#define __NR_syscalls 448
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 0ea8128468c3..56eb7c9f8356 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -289,6 +289,7 @@ COND_SYSCALL(munlockall);
 COND_SYSCALL(mincore);
 COND_SYSCALL(madvise);
 COND_SYSCALL(process_madvise);
+COND_SYSCALL(process_reap);
 COND_SYSCALL(remap_file_pages);
 COND_SYSCALL(mbind);
 COND_SYSCALL_COMPAT(mbind);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index eefd3f5fde46..0f85a0442fa5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -28,6 +28,7 @@
 #include <linux/sched/task.h>
 #include <linux/sched/debug.h>
 #include <linux/swap.h>
+#include <linux/syscalls.h>
 #include <linux/timex.h>
 #include <linux/jiffies.h>
 #include <linux/cpuset.h>
@@ -1141,3 +1142,52 @@ void pagefault_out_of_memory(void)
 	out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
 }
+
+SYSCALL_DEFINE2(process_reap, int, pidfd, unsigned int, flags)
+{
+	struct pid *pid;
+	struct task_struct *task;
+	struct mm_struct *mm = NULL;
+	unsigned int f_flags;
+	long ret = 0;
+
+	if (flags != 0)
+		return -EINVAL;
+
+	pid = pidfd_get_pid(pidfd, &f_flags);
+	if (IS_ERR(pid))
+		return PTR_ERR(pid);
+
+	task = get_pid_task(pid, PIDTYPE_PID);
+	if (!task) {
+		ret = -ESRCH;
+		goto put_pid;
+	}
+
+	/*
+	 * If the task is dying and in the process of releasing its memory
+	 * then get its mm.
+	 */
+	task_lock(task);
+	if (task_will_free_mem(task) && (task->flags & PF_KTHREAD) == 0) {
+		mm = task->mm;
+		mmget(mm);
+	}
+	task_unlock(task);
+	if (!mm) {
+		ret = -EINVAL;
+		goto put_task;
+	}
+
+	mmap_read_lock(mm);
+	if (!__oom_reap_task_mm(mm))
+		ret = -EAGAIN;
+	mmap_read_unlock(mm);
+
+	mmput(mm);
+put_task:
+	put_task_struct(task);
+put_pid:
+	put_pid(pid);
+	return ret;
+}
-- 
2.32.0.93.g670b81a890-goog


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-06-23 19:28 [PATCH 1/1] mm: introduce process_reap system call Suren Baghdasaryan
@ 2021-06-23 19:34 ` Suren Baghdasaryan
  2021-06-29 13:13 ` Christian Brauner
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 31+ messages in thread
From: Suren Baghdasaryan @ 2021-06-23 19:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Michal Hocko, David Rientjes, Matthew Wilcox,
	Johannes Weiner, Roman Gushchin, Rik van Riel, Minchan Kim,
	Christian Brauner, Christoph Hellwig, Oleg Nesterov,
	David Hildenbrand, Jann Horn, Shakeel Butt, Tim Murray,
	Linux API, linux-mm, LKML, kernel-team

On Wed, Jun 23, 2021 at 12:28 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> In modern systems it's not unusual to have a system component monitoring
> memory conditions of the system and tasked with keeping system memory
> pressure under control. One way to accomplish that is to kill
> non-essential processes to free up memory for more important ones.
> Examples of this are Facebook's OOM killer daemon called oomd and
> Android's low memory killer daemon called lmkd.
> For such system component it's important to be able to free memory
> quickly and efficiently. Unfortunately the time process takes to free
> up its memory after receiving a SIGKILL might vary based on the state
> of the process (uninterruptible sleep), size and OPP level of the core
> the process is running. A mechanism to free resources of the target
> process in a more predictable way would improve system's ability to
> control its memory pressure.
> Introduce process_reap system call that reclaims memory of a dying process
> from the context of the caller. This way the memory in freed in a more
> controllable way with CPU affinity and priority of the caller. The workload
> of freeing the memory will also be charged to the caller.
> The operation is allowed only on a dying process.
>
> Previously I proposed a number of alternatives to accomplish this:
> - https://lore.kernel.org/patchwork/patch/1060407 extending
> pidfd_send_signal to allow memory reaping using oom_reaper thread;
> - https://lore.kernel.org/patchwork/patch/1338196 extending
> pidfd_send_signal to reap memory of the target process synchronously from
> the context of the caller;
> - https://lore.kernel.org/patchwork/patch/1344419/ to add MADV_DONTNEED
> support for process_madvise implementing synchronous memory reaping.
>
> The end of the last discussion culminated with suggestion to introduce a
> dedicated system call (https://lore.kernel.org/patchwork/patch/1344418/#1553875)
> The reasoning was that the new variant of process_madvise
>   a) does not work on an address range
>   b) is destructive
>   c) doesn't share much code at all with the rest of process_madvise
> From the userspace point of view it was awkward and inconvenient to provide
> memory range for this operation that operates on the entire address space.
> Using special flags or address values to specify the entire address space
> was too hacky.
>
> The API is as follows,
>
>           int process_reap(int pidfd, unsigned int flags);
>
>         DESCRIPTION
>           The process_reap() system call is used to free the memory of a
>           dying process.
>
>           The pidfd selects the process referred to by the PID file
>           descriptor.
>           (See pidofd_open(2) for further information)
>
>           The flags argument is reserved for future use; currently, this
>           argument must be specified as 0.
>
>         RETURN VALUE
>           On success, process_reap() returns 0. On error, -1 is returned
>           and errno is set to indicate the error.
>

I noticed that the patch does not apply to linux-next because of the
new memfd_secret syscall introduced on x86 architecture only. It still
applies to Linus' ToT. If needed I can change it to apply on top of
linux-next.

> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
>  arch/arm/tools/syscall.tbl                  |  1 +
>  arch/arm64/include/asm/unistd.h             |  2 +-
>  arch/arm64/include/asm/unistd32.h           |  2 +
>  arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
>  arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
>  arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
>  arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
>  arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
>  arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
>  arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
>  arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
>  arch/s390/kernel/syscalls/syscall.tbl       |  1 +
>  arch/sh/kernel/syscalls/syscall.tbl         |  1 +
>  arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
>  arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
>  arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
>  arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
>  include/linux/syscalls.h                    |  1 +
>  include/uapi/asm-generic/unistd.h           |  4 +-
>  kernel/sys_ni.c                             |  1 +
>  mm/oom_kill.c                               | 50 +++++++++++++++++++++
>  22 files changed, 74 insertions(+), 2 deletions(-)
>
> diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
> index 3000a2e8ee21..14b9e81d2fc4 100644
> --- a/arch/alpha/kernel/syscalls/syscall.tbl
> +++ b/arch/alpha/kernel/syscalls/syscall.tbl
> @@ -486,3 +486,4 @@
>  554    common  landlock_create_ruleset         sys_landlock_create_ruleset
>  555    common  landlock_add_rule               sys_landlock_add_rule
>  556    common  landlock_restrict_self          sys_landlock_restrict_self
> +557    common  process_reap                    sys_process_reap
> diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
> index 28e03b5fec00..889b78d0f63f 100644
> --- a/arch/arm/tools/syscall.tbl
> +++ b/arch/arm/tools/syscall.tbl
> @@ -460,3 +460,4 @@
>  444    common  landlock_create_ruleset         sys_landlock_create_ruleset
>  445    common  landlock_add_rule               sys_landlock_add_rule
>  446    common  landlock_restrict_self          sys_landlock_restrict_self
> +447    common  process_reap                    sys_process_reap
> diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
> index 727bfc3be99b..fb7a0be2f3d9 100644
> --- a/arch/arm64/include/asm/unistd.h
> +++ b/arch/arm64/include/asm/unistd.h
> @@ -38,7 +38,7 @@
>  #define __ARM_NR_compat_set_tls                (__ARM_NR_COMPAT_BASE + 5)
>  #define __ARM_NR_COMPAT_END            (__ARM_NR_COMPAT_BASE + 0x800)
>
> -#define __NR_compat_syscalls           447
> +#define __NR_compat_syscalls           448
>  #endif
>
>  #define __ARCH_WANT_SYS_CLONE
> diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
> index 5dab69d2c22b..80593454173e 100644
> --- a/arch/arm64/include/asm/unistd32.h
> +++ b/arch/arm64/include/asm/unistd32.h
> @@ -900,6 +900,8 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset)
>  __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
>  #define __NR_landlock_restrict_self 446
>  __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
> +#define __NR_process_reap 447
> +__SYSCALL(__NR_process_reap, sys_process_reap)
>
>  /*
>   * Please add new compat syscalls above this comment and update
> diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
> index bb11fe4c875a..6c94feedf086 100644
> --- a/arch/ia64/kernel/syscalls/syscall.tbl
> +++ b/arch/ia64/kernel/syscalls/syscall.tbl
> @@ -367,3 +367,4 @@
>  444    common  landlock_create_ruleset         sys_landlock_create_ruleset
>  445    common  landlock_add_rule               sys_landlock_add_rule
>  446    common  landlock_restrict_self          sys_landlock_restrict_self
> +447    common  process_reap                    sys_process_reap
> diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
> index 79c2d24c89dd..e80a7fa55696 100644
> --- a/arch/m68k/kernel/syscalls/syscall.tbl
> +++ b/arch/m68k/kernel/syscalls/syscall.tbl
> @@ -446,3 +446,4 @@
>  444    common  landlock_create_ruleset         sys_landlock_create_ruleset
>  445    common  landlock_add_rule               sys_landlock_add_rule
>  446    common  landlock_restrict_self          sys_landlock_restrict_self
> +447    common  process_reap                    sys_process_reap
> diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
> index b11395a20c20..511b2bd61fc1 100644
> --- a/arch/microblaze/kernel/syscalls/syscall.tbl
> +++ b/arch/microblaze/kernel/syscalls/syscall.tbl
> @@ -452,3 +452,4 @@
>  444    common  landlock_create_ruleset         sys_landlock_create_ruleset
>  445    common  landlock_add_rule               sys_landlock_add_rule
>  446    common  landlock_restrict_self          sys_landlock_restrict_self
> +447    common  process_reap                    sys_process_reap
> diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
> index 9220909526f9..1775704c6a24 100644
> --- a/arch/mips/kernel/syscalls/syscall_n32.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
> @@ -385,3 +385,4 @@
>  444    n32     landlock_create_ruleset         sys_landlock_create_ruleset
>  445    n32     landlock_add_rule               sys_landlock_add_rule
>  446    n32     landlock_restrict_self          sys_landlock_restrict_self
> +447    n32     process_reap                    sys_process_reap
> diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
> index 9cd1c34f31b5..d769daca3f79 100644
> --- a/arch/mips/kernel/syscalls/syscall_n64.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
> @@ -361,3 +361,4 @@
>  444    n64     landlock_create_ruleset         sys_landlock_create_ruleset
>  445    n64     landlock_add_rule               sys_landlock_add_rule
>  446    n64     landlock_restrict_self          sys_landlock_restrict_self
> +447    n64     process_reap                    sys_process_reap
> diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
> index d560c467a8c6..1bd2fc056677 100644
> --- a/arch/mips/kernel/syscalls/syscall_o32.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
> @@ -434,3 +434,4 @@
>  444    o32     landlock_create_ruleset         sys_landlock_create_ruleset
>  445    o32     landlock_add_rule               sys_landlock_add_rule
>  446    o32     landlock_restrict_self          sys_landlock_restrict_self
> +447    o32     process_reap                    sys_process_reap
> diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
> index aabc37f8cae3..0012561ca557 100644
> --- a/arch/parisc/kernel/syscalls/syscall.tbl
> +++ b/arch/parisc/kernel/syscalls/syscall.tbl
> @@ -444,3 +444,4 @@
>  444    common  landlock_create_ruleset         sys_landlock_create_ruleset
>  445    common  landlock_add_rule               sys_landlock_add_rule
>  446    common  landlock_restrict_self          sys_landlock_restrict_self
> +447    common  process_reap                    sys_process_reap
> diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
> index 8f052ff4058c..89cbcc732b18 100644
> --- a/arch/powerpc/kernel/syscalls/syscall.tbl
> +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
> @@ -526,3 +526,4 @@
>  444    common  landlock_create_ruleset         sys_landlock_create_ruleset
>  445    common  landlock_add_rule               sys_landlock_add_rule
>  446    common  landlock_restrict_self          sys_landlock_restrict_self
> +447    common  process_reap                    sys_process_reap
> diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
> index 0690263df1dd..7ebd4d809b5e 100644
> --- a/arch/s390/kernel/syscalls/syscall.tbl
> +++ b/arch/s390/kernel/syscalls/syscall.tbl
> @@ -449,3 +449,4 @@
>  444  common    landlock_create_ruleset sys_landlock_create_ruleset     sys_landlock_create_ruleset
>  445  common    landlock_add_rule       sys_landlock_add_rule           sys_landlock_add_rule
>  446  common    landlock_restrict_self  sys_landlock_restrict_self      sys_landlock_restrict_self
> +447  common    process_reap            sys_process_reap                sys_process_reap
> diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
> index 0b91499ebdcf..178fd47b372e 100644
> --- a/arch/sh/kernel/syscalls/syscall.tbl
> +++ b/arch/sh/kernel/syscalls/syscall.tbl
> @@ -449,3 +449,4 @@
>  444    common  landlock_create_ruleset         sys_landlock_create_ruleset
>  445    common  landlock_add_rule               sys_landlock_add_rule
>  446    common  landlock_restrict_self          sys_landlock_restrict_self
> +447    common  process_reap                    sys_process_reap
> diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
> index e34cc30ef22c..faee121b7ae2 100644
> --- a/arch/sparc/kernel/syscalls/syscall.tbl
> +++ b/arch/sparc/kernel/syscalls/syscall.tbl
> @@ -492,3 +492,4 @@
>  444    common  landlock_create_ruleset         sys_landlock_create_ruleset
>  445    common  landlock_add_rule               sys_landlock_add_rule
>  446    common  landlock_restrict_self          sys_landlock_restrict_self
> +447    common  process_reap                    sys_process_reap
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 4bbc267fb36b..cbe070de9884 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -451,3 +451,4 @@
>  444    i386    landlock_create_ruleset sys_landlock_create_ruleset
>  445    i386    landlock_add_rule       sys_landlock_add_rule
>  446    i386    landlock_restrict_self  sys_landlock_restrict_self
> +447    i386    process_reap            sys_process_reap
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index ce18119ea0d0..e6765646731b 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -368,6 +368,7 @@
>  444    common  landlock_create_ruleset sys_landlock_create_ruleset
>  445    common  landlock_add_rule       sys_landlock_add_rule
>  446    common  landlock_restrict_self  sys_landlock_restrict_self
> +447    common  process_reap            sys_process_reap
>
>  #
>  # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
> index fd2f30227d96..f0e9dbee1a5b 100644
> --- a/arch/xtensa/kernel/syscalls/syscall.tbl
> +++ b/arch/xtensa/kernel/syscalls/syscall.tbl
> @@ -417,3 +417,4 @@
>  444    common  landlock_create_ruleset         sys_landlock_create_ruleset
>  445    common  landlock_add_rule               sys_landlock_add_rule
>  446    common  landlock_restrict_self          sys_landlock_restrict_self
> +447    common  process_reap                    sys_process_reap
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 050511e8f1f8..b6659e09bf0d 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -915,6 +915,7 @@ asmlinkage long sys_mincore(unsigned long start, size_t len,
>  asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
>  asmlinkage long sys_process_madvise(int pidfd, const struct iovec __user *vec,
>                         size_t vlen, int behavior, unsigned int flags);
> +asmlinkage long sys_process_reap(int pidfd, unsigned int flags);
>  asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
>                         unsigned long prot, unsigned long pgoff,
>                         unsigned long flags);
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index d2a942086fcb..b3bf57b928af 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -871,9 +871,11 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset)
>  __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
>  #define __NR_landlock_restrict_self 446
>  __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
> +#define __NR_process_reap 447
> +__SYSCALL(__NR_process_reap, sys_process_reap)
>
>  #undef __NR_syscalls
> -#define __NR_syscalls 447
> +#define __NR_syscalls 448
>
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 0ea8128468c3..56eb7c9f8356 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -289,6 +289,7 @@ COND_SYSCALL(munlockall);
>  COND_SYSCALL(mincore);
>  COND_SYSCALL(madvise);
>  COND_SYSCALL(process_madvise);
> +COND_SYSCALL(process_reap);
>  COND_SYSCALL(remap_file_pages);
>  COND_SYSCALL(mbind);
>  COND_SYSCALL_COMPAT(mbind);
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index eefd3f5fde46..0f85a0442fa5 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -28,6 +28,7 @@
>  #include <linux/sched/task.h>
>  #include <linux/sched/debug.h>
>  #include <linux/swap.h>
> +#include <linux/syscalls.h>
>  #include <linux/timex.h>
>  #include <linux/jiffies.h>
>  #include <linux/cpuset.h>
> @@ -1141,3 +1142,52 @@ void pagefault_out_of_memory(void)
>         out_of_memory(&oc);
>         mutex_unlock(&oom_lock);
>  }
> +
> +SYSCALL_DEFINE2(process_reap, int, pidfd, unsigned int, flags)
> +{
> +       struct pid *pid;
> +       struct task_struct *task;
> +       struct mm_struct *mm = NULL;
> +       unsigned int f_flags;
> +       long ret = 0;
> +
> +       if (flags != 0)
> +               return -EINVAL;
> +
> +       pid = pidfd_get_pid(pidfd, &f_flags);
> +       if (IS_ERR(pid))
> +               return PTR_ERR(pid);
> +
> +       task = get_pid_task(pid, PIDTYPE_PID);
> +       if (!task) {
> +               ret = -ESRCH;
> +               goto put_pid;
> +       }
> +
> +       /*
> +        * If the task is dying and in the process of releasing its memory
> +        * then get its mm.
> +        */
> +       task_lock(task);
> +       if (task_will_free_mem(task) && (task->flags & PF_KTHREAD) == 0) {
> +               mm = task->mm;
> +               mmget(mm);
> +       }
> +       task_unlock(task);
> +       if (!mm) {
> +               ret = -EINVAL;
> +               goto put_task;
> +       }
> +
> +       mmap_read_lock(mm);
> +       if (!__oom_reap_task_mm(mm))
> +               ret = -EAGAIN;
> +       mmap_read_unlock(mm);
> +
> +       mmput(mm);
> +put_task:
> +       put_task_struct(task);
> +put_pid:
> +       put_pid(pid);
> +       return ret;
> +}
> --
> 2.32.0.93.g670b81a890-goog
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-06-23 19:28 [PATCH 1/1] mm: introduce process_reap system call Suren Baghdasaryan
  2021-06-23 19:34 ` Suren Baghdasaryan
@ 2021-06-29 13:13 ` Christian Brauner
  2021-06-29 16:15   ` Suren Baghdasaryan
  2021-06-30 18:00 ` Shakeel Butt
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 31+ messages in thread
From: Christian Brauner @ 2021-06-29 13:13 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, mhocko, mhocko, rientjes, willy, hannes, guro, riel,
	minchan, christian, hch, oleg, david, jannh, shakeelb, timmurray,
	linux-api, linux-mm, linux-kernel, kernel-team

On Wed, Jun 23, 2021 at 12:28:22PM -0700, Suren Baghdasaryan wrote:
> In modern systems it's not unusual to have a system component monitoring
> memory conditions of the system and tasked with keeping system memory
> pressure under control. One way to accomplish that is to kill
> non-essential processes to free up memory for more important ones.
> Examples of this are Facebook's OOM killer daemon called oomd and
> Android's low memory killer daemon called lmkd.
> For such system component it's important to be able to free memory
> quickly and efficiently. Unfortunately the time process takes to free
> up its memory after receiving a SIGKILL might vary based on the state
> of the process (uninterruptible sleep), size and OPP level of the core
> the process is running. A mechanism to free resources of the target
> process in a more predictable way would improve system's ability to
> control its memory pressure.
> Introduce process_reap system call that reclaims memory of a dying process
> from the context of the caller. This way the memory in freed in a more
> controllable way with CPU affinity and priority of the caller. The workload
> of freeing the memory will also be charged to the caller.
> The operation is allowed only on a dying process.
> 
> Previously I proposed a number of alternatives to accomplish this:
> - https://lore.kernel.org/patchwork/patch/1060407 extending
> pidfd_send_signal to allow memory reaping using oom_reaper thread;
> - https://lore.kernel.org/patchwork/patch/1338196 extending
> pidfd_send_signal to reap memory of the target process synchronously from
> the context of the caller;
> - https://lore.kernel.org/patchwork/patch/1344419/ to add MADV_DONTNEED
> support for process_madvise implementing synchronous memory reaping.
> 
> The end of the last discussion culminated with suggestion to introduce a
> dedicated system call (https://lore.kernel.org/patchwork/patch/1344418/#1553875)
> The reasoning was that the new variant of process_madvise
>   a) does not work on an address range
>   b) is destructive
>   c) doesn't share much code at all with the rest of process_madvise
> From the userspace point of view it was awkward and inconvenient to provide
> memory range for this operation that operates on the entire address space.
> Using special flags or address values to specify the entire address space
> was too hacky.
> 
> The API is as follows,
> 
>           int process_reap(int pidfd, unsigned int flags);
> 
>         DESCRIPTION
>           The process_reap() system call is used to free the memory of a
>           dying process.
> 
>           The pidfd selects the process referred to by the PID file
>           descriptor.
>           (See pidofd_open(2) for further information)
> 
>           The flags argument is reserved for future use; currently, this
>           argument must be specified as 0.
> 
>         RETURN VALUE
>           On success, process_reap() returns 0. On error, -1 is returned
>           and errno is set to indicate the error.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
>  arch/arm/tools/syscall.tbl                  |  1 +
>  arch/arm64/include/asm/unistd.h             |  2 +-
>  arch/arm64/include/asm/unistd32.h           |  2 +
>  arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
>  arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
>  arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
>  arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
>  arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
>  arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
>  arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
>  arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
>  arch/s390/kernel/syscalls/syscall.tbl       |  1 +
>  arch/sh/kernel/syscalls/syscall.tbl         |  1 +
>  arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
>  arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
>  arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
>  arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
>  include/linux/syscalls.h                    |  1 +
>  include/uapi/asm-generic/unistd.h           |  4 +-
>  kernel/sys_ni.c                             |  1 +
>  mm/oom_kill.c                               | 50 +++++++++++++++++++++
>  22 files changed, 74 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
> index 3000a2e8ee21..14b9e81d2fc4 100644
> --- a/arch/alpha/kernel/syscalls/syscall.tbl
> +++ b/arch/alpha/kernel/syscalls/syscall.tbl
> @@ -486,3 +486,4 @@
>  554	common	landlock_create_ruleset		sys_landlock_create_ruleset
>  555	common	landlock_add_rule		sys_landlock_add_rule
>  556	common	landlock_restrict_self		sys_landlock_restrict_self
> +557	common	process_reap			sys_process_reap
> diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
> index 28e03b5fec00..889b78d0f63f 100644
> --- a/arch/arm/tools/syscall.tbl
> +++ b/arch/arm/tools/syscall.tbl
> @@ -460,3 +460,4 @@
>  444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>  445	common	landlock_add_rule		sys_landlock_add_rule
>  446	common	landlock_restrict_self		sys_landlock_restrict_self
> +447	common	process_reap			sys_process_reap
> diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
> index 727bfc3be99b..fb7a0be2f3d9 100644
> --- a/arch/arm64/include/asm/unistd.h
> +++ b/arch/arm64/include/asm/unistd.h
> @@ -38,7 +38,7 @@
>  #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
>  #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
>  
> -#define __NR_compat_syscalls		447
> +#define __NR_compat_syscalls		448
>  #endif
>  
>  #define __ARCH_WANT_SYS_CLONE
> diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
> index 5dab69d2c22b..80593454173e 100644
> --- a/arch/arm64/include/asm/unistd32.h
> +++ b/arch/arm64/include/asm/unistd32.h
> @@ -900,6 +900,8 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset)
>  __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
>  #define __NR_landlock_restrict_self 446
>  __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
> +#define __NR_process_reap 447
> +__SYSCALL(__NR_process_reap, sys_process_reap)
>  
>  /*
>   * Please add new compat syscalls above this comment and update
> diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
> index bb11fe4c875a..6c94feedf086 100644
> --- a/arch/ia64/kernel/syscalls/syscall.tbl
> +++ b/arch/ia64/kernel/syscalls/syscall.tbl
> @@ -367,3 +367,4 @@
>  444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>  445	common	landlock_add_rule		sys_landlock_add_rule
>  446	common	landlock_restrict_self		sys_landlock_restrict_self
> +447	common	process_reap			sys_process_reap
> diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
> index 79c2d24c89dd..e80a7fa55696 100644
> --- a/arch/m68k/kernel/syscalls/syscall.tbl
> +++ b/arch/m68k/kernel/syscalls/syscall.tbl
> @@ -446,3 +446,4 @@
>  444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>  445	common	landlock_add_rule		sys_landlock_add_rule
>  446	common	landlock_restrict_self		sys_landlock_restrict_self
> +447	common	process_reap			sys_process_reap
> diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
> index b11395a20c20..511b2bd61fc1 100644
> --- a/arch/microblaze/kernel/syscalls/syscall.tbl
> +++ b/arch/microblaze/kernel/syscalls/syscall.tbl
> @@ -452,3 +452,4 @@
>  444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>  445	common	landlock_add_rule		sys_landlock_add_rule
>  446	common	landlock_restrict_self		sys_landlock_restrict_self
> +447	common	process_reap			sys_process_reap
> diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
> index 9220909526f9..1775704c6a24 100644
> --- a/arch/mips/kernel/syscalls/syscall_n32.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
> @@ -385,3 +385,4 @@
>  444	n32	landlock_create_ruleset		sys_landlock_create_ruleset
>  445	n32	landlock_add_rule		sys_landlock_add_rule
>  446	n32	landlock_restrict_self		sys_landlock_restrict_self
> +447	n32	process_reap			sys_process_reap
> diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
> index 9cd1c34f31b5..d769daca3f79 100644
> --- a/arch/mips/kernel/syscalls/syscall_n64.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
> @@ -361,3 +361,4 @@
>  444	n64	landlock_create_ruleset		sys_landlock_create_ruleset
>  445	n64	landlock_add_rule		sys_landlock_add_rule
>  446	n64	landlock_restrict_self		sys_landlock_restrict_self
> +447	n64	process_reap			sys_process_reap
> diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
> index d560c467a8c6..1bd2fc056677 100644
> --- a/arch/mips/kernel/syscalls/syscall_o32.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
> @@ -434,3 +434,4 @@
>  444	o32	landlock_create_ruleset		sys_landlock_create_ruleset
>  445	o32	landlock_add_rule		sys_landlock_add_rule
>  446	o32	landlock_restrict_self		sys_landlock_restrict_self
> +447	o32	process_reap			sys_process_reap
> diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
> index aabc37f8cae3..0012561ca557 100644
> --- a/arch/parisc/kernel/syscalls/syscall.tbl
> +++ b/arch/parisc/kernel/syscalls/syscall.tbl
> @@ -444,3 +444,4 @@
>  444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>  445	common	landlock_add_rule		sys_landlock_add_rule
>  446	common	landlock_restrict_self		sys_landlock_restrict_self
> +447	common	process_reap			sys_process_reap
> diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
> index 8f052ff4058c..89cbcc732b18 100644
> --- a/arch/powerpc/kernel/syscalls/syscall.tbl
> +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
> @@ -526,3 +526,4 @@
>  444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>  445	common	landlock_add_rule		sys_landlock_add_rule
>  446	common	landlock_restrict_self		sys_landlock_restrict_self
> +447	common	process_reap			sys_process_reap
> diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
> index 0690263df1dd..7ebd4d809b5e 100644
> --- a/arch/s390/kernel/syscalls/syscall.tbl
> +++ b/arch/s390/kernel/syscalls/syscall.tbl
> @@ -449,3 +449,4 @@
>  444  common	landlock_create_ruleset	sys_landlock_create_ruleset	sys_landlock_create_ruleset
>  445  common	landlock_add_rule	sys_landlock_add_rule		sys_landlock_add_rule
>  446  common	landlock_restrict_self	sys_landlock_restrict_self	sys_landlock_restrict_self
> +447  common	process_reap		sys_process_reap		sys_process_reap
> diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
> index 0b91499ebdcf..178fd47b372e 100644
> --- a/arch/sh/kernel/syscalls/syscall.tbl
> +++ b/arch/sh/kernel/syscalls/syscall.tbl
> @@ -449,3 +449,4 @@
>  444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>  445	common	landlock_add_rule		sys_landlock_add_rule
>  446	common	landlock_restrict_self		sys_landlock_restrict_self
> +447	common	process_reap			sys_process_reap
> diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
> index e34cc30ef22c..faee121b7ae2 100644
> --- a/arch/sparc/kernel/syscalls/syscall.tbl
> +++ b/arch/sparc/kernel/syscalls/syscall.tbl
> @@ -492,3 +492,4 @@
>  444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>  445	common	landlock_add_rule		sys_landlock_add_rule
>  446	common	landlock_restrict_self		sys_landlock_restrict_self
> +447	common	process_reap			sys_process_reap
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 4bbc267fb36b..cbe070de9884 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -451,3 +451,4 @@
>  444	i386	landlock_create_ruleset	sys_landlock_create_ruleset
>  445	i386	landlock_add_rule	sys_landlock_add_rule
>  446	i386	landlock_restrict_self	sys_landlock_restrict_self
> +447	i386	process_reap		sys_process_reap
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index ce18119ea0d0..e6765646731b 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -368,6 +368,7 @@
>  444	common	landlock_create_ruleset	sys_landlock_create_ruleset
>  445	common	landlock_add_rule	sys_landlock_add_rule
>  446	common	landlock_restrict_self	sys_landlock_restrict_self
> +447	common	process_reap		sys_process_reap
>  
>  #
>  # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
> index fd2f30227d96..f0e9dbee1a5b 100644
> --- a/arch/xtensa/kernel/syscalls/syscall.tbl
> +++ b/arch/xtensa/kernel/syscalls/syscall.tbl
> @@ -417,3 +417,4 @@
>  444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>  445	common	landlock_add_rule		sys_landlock_add_rule
>  446	common	landlock_restrict_self		sys_landlock_restrict_self
> +447	common	process_reap			sys_process_reap
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 050511e8f1f8..b6659e09bf0d 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -915,6 +915,7 @@ asmlinkage long sys_mincore(unsigned long start, size_t len,
>  asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
>  asmlinkage long sys_process_madvise(int pidfd, const struct iovec __user *vec,
>  			size_t vlen, int behavior, unsigned int flags);
> +asmlinkage long sys_process_reap(int pidfd, unsigned int flags);
>  asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
>  			unsigned long prot, unsigned long pgoff,
>  			unsigned long flags);
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index d2a942086fcb..b3bf57b928af 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -871,9 +871,11 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset)
>  __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
>  #define __NR_landlock_restrict_self 446
>  __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
> +#define __NR_process_reap 447
> +__SYSCALL(__NR_process_reap, sys_process_reap)
>  
>  #undef __NR_syscalls
> -#define __NR_syscalls 447
> +#define __NR_syscalls 448
>  
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 0ea8128468c3..56eb7c9f8356 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -289,6 +289,7 @@ COND_SYSCALL(munlockall);
>  COND_SYSCALL(mincore);
>  COND_SYSCALL(madvise);
>  COND_SYSCALL(process_madvise);
> +COND_SYSCALL(process_reap);
>  COND_SYSCALL(remap_file_pages);
>  COND_SYSCALL(mbind);
>  COND_SYSCALL_COMPAT(mbind);
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index eefd3f5fde46..0f85a0442fa5 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -28,6 +28,7 @@
>  #include <linux/sched/task.h>
>  #include <linux/sched/debug.h>
>  #include <linux/swap.h>
> +#include <linux/syscalls.h>
>  #include <linux/timex.h>
>  #include <linux/jiffies.h>
>  #include <linux/cpuset.h>
> @@ -1141,3 +1142,52 @@ void pagefault_out_of_memory(void)
>  	out_of_memory(&oc);
>  	mutex_unlock(&oom_lock);
>  }
> +
> +SYSCALL_DEFINE2(process_reap, int, pidfd, unsigned int, flags)

Hey Suren,

Wouldn't
- process_memory_reap()
- process_reap_memory()
- process_mreap()
be better names?

> +{
> +	struct pid *pid;
> +	struct task_struct *task;
> +	struct mm_struct *mm = NULL;
> +	unsigned int f_flags;
> +	long ret = 0;
> +
> +	if (flags != 0)
> +		return -EINVAL;
> +
> +	pid = pidfd_get_pid(pidfd, &f_flags);
> +	if (IS_ERR(pid))
> +		return PTR_ERR(pid);
> +
> +	task = get_pid_task(pid, PIDTYPE_PID);
> +	if (!task) {
> +		ret = -ESRCH;
> +		goto put_pid;
> +	}

You have a similar pattern in process_madvise():

	pid = pidfd_get_pid(pidfd, &f_flags);
	if (IS_ERR(pid)) {
		ret = PTR_ERR(pid);
		goto free_iov;
	}

	task = get_pid_task(pid, PIDTYPE_PID);
	if (!task) {
		ret = -ESRCH;
		goto put_pid;
	}

I'd suggest you add a tiny helper to kernel/pid.c and call it in both
places.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-06-29 13:13 ` Christian Brauner
@ 2021-06-29 16:15   ` Suren Baghdasaryan
  0 siblings, 0 replies; 31+ messages in thread
From: Suren Baghdasaryan @ 2021-06-29 16:15 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andrew Morton, Michal Hocko, Michal Hocko, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Christoph Hellwig, Oleg Nesterov,
	David Hildenbrand, Jann Horn, Shakeel Butt, Tim Murray,
	Linux API, linux-mm, LKML, kernel-team

On Tue, Jun 29, 2021 at 6:14 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Wed, Jun 23, 2021 at 12:28:22PM -0700, Suren Baghdasaryan wrote:
> > In modern systems it's not unusual to have a system component monitoring
> > memory conditions of the system and tasked with keeping system memory
> > pressure under control. One way to accomplish that is to kill
> > non-essential processes to free up memory for more important ones.
> > Examples of this are Facebook's OOM killer daemon called oomd and
> > Android's low memory killer daemon called lmkd.
> > For such system component it's important to be able to free memory
> > quickly and efficiently. Unfortunately the time process takes to free
> > up its memory after receiving a SIGKILL might vary based on the state
> > of the process (uninterruptible sleep), size and OPP level of the core
> > the process is running. A mechanism to free resources of the target
> > process in a more predictable way would improve system's ability to
> > control its memory pressure.
> > Introduce process_reap system call that reclaims memory of a dying process
> > from the context of the caller. This way the memory in freed in a more
> > controllable way with CPU affinity and priority of the caller. The workload
> > of freeing the memory will also be charged to the caller.
> > The operation is allowed only on a dying process.
> >
> > Previously I proposed a number of alternatives to accomplish this:
> > - https://lore.kernel.org/patchwork/patch/1060407 extending
> > pidfd_send_signal to allow memory reaping using oom_reaper thread;
> > - https://lore.kernel.org/patchwork/patch/1338196 extending
> > pidfd_send_signal to reap memory of the target process synchronously from
> > the context of the caller;
> > - https://lore.kernel.org/patchwork/patch/1344419/ to add MADV_DONTNEED
> > support for process_madvise implementing synchronous memory reaping.
> >
> > The end of the last discussion culminated with suggestion to introduce a
> > dedicated system call (https://lore.kernel.org/patchwork/patch/1344418/#1553875)
> > The reasoning was that the new variant of process_madvise
> >   a) does not work on an address range
> >   b) is destructive
> >   c) doesn't share much code at all with the rest of process_madvise
> > From the userspace point of view it was awkward and inconvenient to provide
> > memory range for this operation that operates on the entire address space.
> > Using special flags or address values to specify the entire address space
> > was too hacky.
> >
> > The API is as follows,
> >
> >           int process_reap(int pidfd, unsigned int flags);
> >
> >         DESCRIPTION
> >           The process_reap() system call is used to free the memory of a
> >           dying process.
> >
> >           The pidfd selects the process referred to by the PID file
> >           descriptor.
> >           (See pidofd_open(2) for further information)
> >
> >           The flags argument is reserved for future use; currently, this
> >           argument must be specified as 0.
> >
> >         RETURN VALUE
> >           On success, process_reap() returns 0. On error, -1 is returned
> >           and errno is set to indicate the error.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
> >  arch/arm/tools/syscall.tbl                  |  1 +
> >  arch/arm64/include/asm/unistd.h             |  2 +-
> >  arch/arm64/include/asm/unistd32.h           |  2 +
> >  arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
> >  arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
> >  arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
> >  arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
> >  arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
> >  arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
> >  arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
> >  arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
> >  arch/s390/kernel/syscalls/syscall.tbl       |  1 +
> >  arch/sh/kernel/syscalls/syscall.tbl         |  1 +
> >  arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
> >  arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
> >  arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
> >  include/linux/syscalls.h                    |  1 +
> >  include/uapi/asm-generic/unistd.h           |  4 +-
> >  kernel/sys_ni.c                             |  1 +
> >  mm/oom_kill.c                               | 50 +++++++++++++++++++++
> >  22 files changed, 74 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
> > index 3000a2e8ee21..14b9e81d2fc4 100644
> > --- a/arch/alpha/kernel/syscalls/syscall.tbl
> > +++ b/arch/alpha/kernel/syscalls/syscall.tbl
> > @@ -486,3 +486,4 @@
> >  554  common  landlock_create_ruleset         sys_landlock_create_ruleset
> >  555  common  landlock_add_rule               sys_landlock_add_rule
> >  556  common  landlock_restrict_self          sys_landlock_restrict_self
> > +557  common  process_reap                    sys_process_reap
> > diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
> > index 28e03b5fec00..889b78d0f63f 100644
> > --- a/arch/arm/tools/syscall.tbl
> > +++ b/arch/arm/tools/syscall.tbl
> > @@ -460,3 +460,4 @@
> >  444  common  landlock_create_ruleset         sys_landlock_create_ruleset
> >  445  common  landlock_add_rule               sys_landlock_add_rule
> >  446  common  landlock_restrict_self          sys_landlock_restrict_self
> > +447  common  process_reap                    sys_process_reap
> > diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
> > index 727bfc3be99b..fb7a0be2f3d9 100644
> > --- a/arch/arm64/include/asm/unistd.h
> > +++ b/arch/arm64/include/asm/unistd.h
> > @@ -38,7 +38,7 @@
> >  #define __ARM_NR_compat_set_tls              (__ARM_NR_COMPAT_BASE + 5)
> >  #define __ARM_NR_COMPAT_END          (__ARM_NR_COMPAT_BASE + 0x800)
> >
> > -#define __NR_compat_syscalls         447
> > +#define __NR_compat_syscalls         448
> >  #endif
> >
> >  #define __ARCH_WANT_SYS_CLONE
> > diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
> > index 5dab69d2c22b..80593454173e 100644
> > --- a/arch/arm64/include/asm/unistd32.h
> > +++ b/arch/arm64/include/asm/unistd32.h
> > @@ -900,6 +900,8 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset)
> >  __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
> >  #define __NR_landlock_restrict_self 446
> >  __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
> > +#define __NR_process_reap 447
> > +__SYSCALL(__NR_process_reap, sys_process_reap)
> >
> >  /*
> >   * Please add new compat syscalls above this comment and update
> > diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
> > index bb11fe4c875a..6c94feedf086 100644
> > --- a/arch/ia64/kernel/syscalls/syscall.tbl
> > +++ b/arch/ia64/kernel/syscalls/syscall.tbl
> > @@ -367,3 +367,4 @@
> >  444  common  landlock_create_ruleset         sys_landlock_create_ruleset
> >  445  common  landlock_add_rule               sys_landlock_add_rule
> >  446  common  landlock_restrict_self          sys_landlock_restrict_self
> > +447  common  process_reap                    sys_process_reap
> > diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
> > index 79c2d24c89dd..e80a7fa55696 100644
> > --- a/arch/m68k/kernel/syscalls/syscall.tbl
> > +++ b/arch/m68k/kernel/syscalls/syscall.tbl
> > @@ -446,3 +446,4 @@
> >  444  common  landlock_create_ruleset         sys_landlock_create_ruleset
> >  445  common  landlock_add_rule               sys_landlock_add_rule
> >  446  common  landlock_restrict_self          sys_landlock_restrict_self
> > +447  common  process_reap                    sys_process_reap
> > diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
> > index b11395a20c20..511b2bd61fc1 100644
> > --- a/arch/microblaze/kernel/syscalls/syscall.tbl
> > +++ b/arch/microblaze/kernel/syscalls/syscall.tbl
> > @@ -452,3 +452,4 @@
> >  444  common  landlock_create_ruleset         sys_landlock_create_ruleset
> >  445  common  landlock_add_rule               sys_landlock_add_rule
> >  446  common  landlock_restrict_self          sys_landlock_restrict_self
> > +447  common  process_reap                    sys_process_reap
> > diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
> > index 9220909526f9..1775704c6a24 100644
> > --- a/arch/mips/kernel/syscalls/syscall_n32.tbl
> > +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
> > @@ -385,3 +385,4 @@
> >  444  n32     landlock_create_ruleset         sys_landlock_create_ruleset
> >  445  n32     landlock_add_rule               sys_landlock_add_rule
> >  446  n32     landlock_restrict_self          sys_landlock_restrict_self
> > +447  n32     process_reap                    sys_process_reap
> > diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
> > index 9cd1c34f31b5..d769daca3f79 100644
> > --- a/arch/mips/kernel/syscalls/syscall_n64.tbl
> > +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
> > @@ -361,3 +361,4 @@
> >  444  n64     landlock_create_ruleset         sys_landlock_create_ruleset
> >  445  n64     landlock_add_rule               sys_landlock_add_rule
> >  446  n64     landlock_restrict_self          sys_landlock_restrict_self
> > +447  n64     process_reap                    sys_process_reap
> > diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
> > index d560c467a8c6..1bd2fc056677 100644
> > --- a/arch/mips/kernel/syscalls/syscall_o32.tbl
> > +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
> > @@ -434,3 +434,4 @@
> >  444  o32     landlock_create_ruleset         sys_landlock_create_ruleset
> >  445  o32     landlock_add_rule               sys_landlock_add_rule
> >  446  o32     landlock_restrict_self          sys_landlock_restrict_self
> > +447  o32     process_reap                    sys_process_reap
> > diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
> > index aabc37f8cae3..0012561ca557 100644
> > --- a/arch/parisc/kernel/syscalls/syscall.tbl
> > +++ b/arch/parisc/kernel/syscalls/syscall.tbl
> > @@ -444,3 +444,4 @@
> >  444  common  landlock_create_ruleset         sys_landlock_create_ruleset
> >  445  common  landlock_add_rule               sys_landlock_add_rule
> >  446  common  landlock_restrict_self          sys_landlock_restrict_self
> > +447  common  process_reap                    sys_process_reap
> > diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
> > index 8f052ff4058c..89cbcc732b18 100644
> > --- a/arch/powerpc/kernel/syscalls/syscall.tbl
> > +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
> > @@ -526,3 +526,4 @@
> >  444  common  landlock_create_ruleset         sys_landlock_create_ruleset
> >  445  common  landlock_add_rule               sys_landlock_add_rule
> >  446  common  landlock_restrict_self          sys_landlock_restrict_self
> > +447  common  process_reap                    sys_process_reap
> > diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
> > index 0690263df1dd..7ebd4d809b5e 100644
> > --- a/arch/s390/kernel/syscalls/syscall.tbl
> > +++ b/arch/s390/kernel/syscalls/syscall.tbl
> > @@ -449,3 +449,4 @@
> >  444  common  landlock_create_ruleset sys_landlock_create_ruleset     sys_landlock_create_ruleset
> >  445  common  landlock_add_rule       sys_landlock_add_rule           sys_landlock_add_rule
> >  446  common  landlock_restrict_self  sys_landlock_restrict_self      sys_landlock_restrict_self
> > +447  common  process_reap            sys_process_reap                sys_process_reap
> > diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
> > index 0b91499ebdcf..178fd47b372e 100644
> > --- a/arch/sh/kernel/syscalls/syscall.tbl
> > +++ b/arch/sh/kernel/syscalls/syscall.tbl
> > @@ -449,3 +449,4 @@
> >  444  common  landlock_create_ruleset         sys_landlock_create_ruleset
> >  445  common  landlock_add_rule               sys_landlock_add_rule
> >  446  common  landlock_restrict_self          sys_landlock_restrict_self
> > +447  common  process_reap                    sys_process_reap
> > diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
> > index e34cc30ef22c..faee121b7ae2 100644
> > --- a/arch/sparc/kernel/syscalls/syscall.tbl
> > +++ b/arch/sparc/kernel/syscalls/syscall.tbl
> > @@ -492,3 +492,4 @@
> >  444  common  landlock_create_ruleset         sys_landlock_create_ruleset
> >  445  common  landlock_add_rule               sys_landlock_add_rule
> >  446  common  landlock_restrict_self          sys_landlock_restrict_self
> > +447  common  process_reap                    sys_process_reap
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 4bbc267fb36b..cbe070de9884 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -451,3 +451,4 @@
> >  444  i386    landlock_create_ruleset sys_landlock_create_ruleset
> >  445  i386    landlock_add_rule       sys_landlock_add_rule
> >  446  i386    landlock_restrict_self  sys_landlock_restrict_self
> > +447  i386    process_reap            sys_process_reap
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> > index ce18119ea0d0..e6765646731b 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -368,6 +368,7 @@
> >  444  common  landlock_create_ruleset sys_landlock_create_ruleset
> >  445  common  landlock_add_rule       sys_landlock_add_rule
> >  446  common  landlock_restrict_self  sys_landlock_restrict_self
> > +447  common  process_reap            sys_process_reap
> >
> >  #
> >  # Due to a historical design error, certain syscalls are numbered differently
> > diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
> > index fd2f30227d96..f0e9dbee1a5b 100644
> > --- a/arch/xtensa/kernel/syscalls/syscall.tbl
> > +++ b/arch/xtensa/kernel/syscalls/syscall.tbl
> > @@ -417,3 +417,4 @@
> >  444  common  landlock_create_ruleset         sys_landlock_create_ruleset
> >  445  common  landlock_add_rule               sys_landlock_add_rule
> >  446  common  landlock_restrict_self          sys_landlock_restrict_self
> > +447  common  process_reap                    sys_process_reap
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index 050511e8f1f8..b6659e09bf0d 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -915,6 +915,7 @@ asmlinkage long sys_mincore(unsigned long start, size_t len,
> >  asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
> >  asmlinkage long sys_process_madvise(int pidfd, const struct iovec __user *vec,
> >                       size_t vlen, int behavior, unsigned int flags);
> > +asmlinkage long sys_process_reap(int pidfd, unsigned int flags);
> >  asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
> >                       unsigned long prot, unsigned long pgoff,
> >                       unsigned long flags);
> > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > index d2a942086fcb..b3bf57b928af 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -871,9 +871,11 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset)
> >  __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
> >  #define __NR_landlock_restrict_self 446
> >  __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
> > +#define __NR_process_reap 447
> > +__SYSCALL(__NR_process_reap, sys_process_reap)
> >
> >  #undef __NR_syscalls
> > -#define __NR_syscalls 447
> > +#define __NR_syscalls 448
> >
> >  /*
> >   * 32 bit systems traditionally used different
> > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> > index 0ea8128468c3..56eb7c9f8356 100644
> > --- a/kernel/sys_ni.c
> > +++ b/kernel/sys_ni.c
> > @@ -289,6 +289,7 @@ COND_SYSCALL(munlockall);
> >  COND_SYSCALL(mincore);
> >  COND_SYSCALL(madvise);
> >  COND_SYSCALL(process_madvise);
> > +COND_SYSCALL(process_reap);
> >  COND_SYSCALL(remap_file_pages);
> >  COND_SYSCALL(mbind);
> >  COND_SYSCALL_COMPAT(mbind);
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index eefd3f5fde46..0f85a0442fa5 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -28,6 +28,7 @@
> >  #include <linux/sched/task.h>
> >  #include <linux/sched/debug.h>
> >  #include <linux/swap.h>
> > +#include <linux/syscalls.h>
> >  #include <linux/timex.h>
> >  #include <linux/jiffies.h>
> >  #include <linux/cpuset.h>
> > @@ -1141,3 +1142,52 @@ void pagefault_out_of_memory(void)
> >       out_of_memory(&oc);
> >       mutex_unlock(&oom_lock);
> >  }
> > +
> > +SYSCALL_DEFINE2(process_reap, int, pidfd, unsigned int, flags)
>
> Hey Suren,
>
> Wouldn't
> - process_memory_reap()
> - process_reap_memory()
> - process_mreap()
> be better names?

Hi Christian,
I'm open to other names, whichever sounds better. From the list
process_reap_memory() sounds best to me but I'm open to others as
well.

>
> > +{
> > +     struct pid *pid;
> > +     struct task_struct *task;
> > +     struct mm_struct *mm = NULL;
> > +     unsigned int f_flags;
> > +     long ret = 0;
> > +
> > +     if (flags != 0)
> > +             return -EINVAL;
> > +
> > +     pid = pidfd_get_pid(pidfd, &f_flags);
> > +     if (IS_ERR(pid))
> > +             return PTR_ERR(pid);
> > +
> > +     task = get_pid_task(pid, PIDTYPE_PID);
> > +     if (!task) {
> > +             ret = -ESRCH;
> > +             goto put_pid;
> > +     }
>
> You have a similar pattern in process_madvise():
>
>         pid = pidfd_get_pid(pidfd, &f_flags);
>         if (IS_ERR(pid)) {
>                 ret = PTR_ERR(pid);
>                 goto free_iov;
>         }
>
>         task = get_pid_task(pid, PIDTYPE_PID);
>         if (!task) {
>                 ret = -ESRCH;
>                 goto put_pid;
>         }
>
> I'd suggest you add a tiny helper to kernel/pid.c and call it in both
> places.

Agree. I'll post the new rev next week to give some more time for
reviews of this version.
Thanks!

>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-06-23 19:28 [PATCH 1/1] mm: introduce process_reap system call Suren Baghdasaryan
  2021-06-23 19:34 ` Suren Baghdasaryan
  2021-06-29 13:13 ` Christian Brauner
@ 2021-06-30 18:00 ` Shakeel Butt
  2021-06-30 18:43   ` Suren Baghdasaryan
  2021-06-30 18:26 ` Andy Lutomirski
  2021-07-07  9:46 ` Florian Weimer
  4 siblings, 1 reply; 31+ messages in thread
From: Shakeel Butt @ 2021-06-30 18:00 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Michal Hocko, Michal Hocko, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Christoph Hellwig, Oleg Nesterov,
	David Hildenbrand, Jann Horn, Tim Murray, Linux API, Linux MM,
	LKML, kernel-team

Hi Suren,

On Wed, Jun 23, 2021 at 12:28 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> In modern systems it's not unusual to have a system component monitoring
> memory conditions of the system and tasked with keeping system memory
> pressure under control. One way to accomplish that is to kill
> non-essential processes to free up memory for more important ones.
> Examples of this are Facebook's OOM killer daemon called oomd and
> Android's low memory killer daemon called lmkd.
> For such system component it's important to be able to free memory
> quickly and efficiently. Unfortunately the time process takes to free
> up its memory after receiving a SIGKILL might vary based on the state
> of the process (uninterruptible sleep), size and OPP level of the core
> the process is running. A mechanism to free resources of the target
> process in a more predictable way would improve system's ability to
> control its memory pressure.
> Introduce process_reap system call that reclaims memory of a dying process
> from the context of the caller. This way the memory in freed in a more
> controllable way with CPU affinity and priority of the caller. The workload
> of freeing the memory will also be charged to the caller.
> The operation is allowed only on a dying process.
>
> Previously I proposed a number of alternatives to accomplish this:
> - https://lore.kernel.org/patchwork/patch/1060407 extending
> pidfd_send_signal to allow memory reaping using oom_reaper thread;
> - https://lore.kernel.org/patchwork/patch/1338196 extending
> pidfd_send_signal to reap memory of the target process synchronously from
> the context of the caller;
> - https://lore.kernel.org/patchwork/patch/1344419/ to add MADV_DONTNEED
> support for process_madvise implementing synchronous memory reaping.
>
> The end of the last discussion culminated with suggestion to introduce a
> dedicated system call (https://lore.kernel.org/patchwork/patch/1344418/#1553875)
> The reasoning was that the new variant of process_madvise
>   a) does not work on an address range
>   b) is destructive
>   c) doesn't share much code at all with the rest of process_madvise
> From the userspace point of view it was awkward and inconvenient to provide
> memory range for this operation that operates on the entire address space.
> Using special flags or address values to specify the entire address space
> was too hacky.
>
> The API is as follows,
>
>           int process_reap(int pidfd, unsigned int flags);
>
>         DESCRIPTION
>           The process_reap() system call is used to free the memory of a
>           dying process.
>
>           The pidfd selects the process referred to by the PID file
>           descriptor.
>           (See pidofd_open(2) for further information)

*pidfd_open

>
>           The flags argument is reserved for future use; currently, this
>           argument must be specified as 0.
>
>         RETURN VALUE
>           On success, process_reap() returns 0. On error, -1 is returned
>           and errno is set to indicate the error.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

Thanks for continuously pushing this. One question I have is how do
you envision this syscall to be used for the cgroup based workloads.
Traverse the target tree, read pids from cgroup.procs files,
pidfd_open them, send SIGKILL and then process_reap them. Is that
right?

Orthogonal to this patch I wonder if we should have an optimized way
to reap processes from a cgroup. Something similar to cgroup.kill (or
maybe overload cgroup.kill with reaping as well).

[...]

> +
> +SYSCALL_DEFINE2(process_reap, int, pidfd, unsigned int, flags)
> +{
> +       struct pid *pid;
> +       struct task_struct *task;
> +       struct mm_struct *mm = NULL;
> +       unsigned int f_flags;
> +       long ret = 0;
> +
> +       if (flags != 0)
> +               return -EINVAL;
> +
> +       pid = pidfd_get_pid(pidfd, &f_flags);
> +       if (IS_ERR(pid))
> +               return PTR_ERR(pid);
> +
> +       task = get_pid_task(pid, PIDTYPE_PID);
> +       if (!task) {
> +               ret = -ESRCH;
> +               goto put_pid;
> +       }
> +
> +       /*
> +        * If the task is dying and in the process of releasing its memory
> +        * then get its mm.
> +        */
> +       task_lock(task);
> +       if (task_will_free_mem(task) && (task->flags & PF_KTHREAD) == 0) {

task_will_free_mem() is fine here but I think in parallel we should
optimize this function. At the moment it is traversing all the
processes on the machine. It is very normal to have tens of thousands
of processes on big machines, so it would be really costly when
reaping a bunch of processes.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-06-23 19:28 [PATCH 1/1] mm: introduce process_reap system call Suren Baghdasaryan
                   ` (2 preceding siblings ...)
  2021-06-30 18:00 ` Shakeel Butt
@ 2021-06-30 18:26 ` Andy Lutomirski
  2021-06-30 18:51   ` Suren Baghdasaryan
  2021-07-07  9:46 ` Florian Weimer
  4 siblings, 1 reply; 31+ messages in thread
From: Andy Lutomirski @ 2021-06-30 18:26 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Michal Hocko, Michal Hocko, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Christoph Hellwig, Oleg Nesterov,
	David Hildenbrand, Jann Horn, Shakeel Butt, Tim Murray,
	Linux API, Linux-MM, LKML, Android Kernel Team

On Wed, Jun 23, 2021 at 12:28 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> In modern systems it's not unusual to have a system component monitoring
> memory conditions of the system and tasked with keeping system memory
> pressure under control. One way to accomplish that is to kill
> non-essential processes to free up memory for more important ones.
> Examples of this are Facebook's OOM killer daemon called oomd and
> Android's low memory killer daemon called lmkd.
> For such system component it's important to be able to free memory
> quickly and efficiently. Unfortunately the time process takes to free
> up its memory after receiving a SIGKILL might vary based on the state
> of the process (uninterruptible sleep), size and OPP level of the core
> the process is running. A mechanism to free resources of the target
> process in a more predictable way would improve system's ability to
> control its memory pressure.
> Introduce process_reap system call that reclaims memory of a dying process
> from the context of the caller. This way the memory in freed in a more
> controllable way with CPU affinity and priority of the caller. The workload
> of freeing the memory will also be charged to the caller.
> The operation is allowed only on a dying process.

At the risk of asking a potentially silly question, should this just
be a file in procfs?

Also, please consider removing all mention of the word "reap" from the
user API.  For better or for worse, "reap" in UNIX refers to what
happens when a dead task gets wait()ed.  I sincerely wish I could go
back in time and gently encourage whomever invented that particular
abomination to change their mind, but my time machine doesn't work.

--Andy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-06-30 18:00 ` Shakeel Butt
@ 2021-06-30 18:43   ` Suren Baghdasaryan
  2021-06-30 19:00     ` Shakeel Butt
  0 siblings, 1 reply; 31+ messages in thread
From: Suren Baghdasaryan @ 2021-06-30 18:43 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Michal Hocko, Michal Hocko, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Christoph Hellwig, Oleg Nesterov,
	David Hildenbrand, Jann Horn, Tim Murray, Linux API, Linux MM,
	LKML, kernel-team

On Wed, Jun 30, 2021 at 11:01 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> Hi Suren,
>
> On Wed, Jun 23, 2021 at 12:28 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > In modern systems it's not unusual to have a system component monitoring
> > memory conditions of the system and tasked with keeping system memory
> > pressure under control. One way to accomplish that is to kill
> > non-essential processes to free up memory for more important ones.
> > Examples of this are Facebook's OOM killer daemon called oomd and
> > Android's low memory killer daemon called lmkd.
> > For such system component it's important to be able to free memory
> > quickly and efficiently. Unfortunately the time process takes to free
> > up its memory after receiving a SIGKILL might vary based on the state
> > of the process (uninterruptible sleep), size and OPP level of the core
> > the process is running. A mechanism to free resources of the target
> > process in a more predictable way would improve system's ability to
> > control its memory pressure.
> > Introduce process_reap system call that reclaims memory of a dying process
> > from the context of the caller. This way the memory in freed in a more
> > controllable way with CPU affinity and priority of the caller. The workload
> > of freeing the memory will also be charged to the caller.
> > The operation is allowed only on a dying process.
> >
> > Previously I proposed a number of alternatives to accomplish this:
> > - https://lore.kernel.org/patchwork/patch/1060407 extending
> > pidfd_send_signal to allow memory reaping using oom_reaper thread;
> > - https://lore.kernel.org/patchwork/patch/1338196 extending
> > pidfd_send_signal to reap memory of the target process synchronously from
> > the context of the caller;
> > - https://lore.kernel.org/patchwork/patch/1344419/ to add MADV_DONTNEED
> > support for process_madvise implementing synchronous memory reaping.
> >
> > The end of the last discussion culminated with suggestion to introduce a
> > dedicated system call (https://lore.kernel.org/patchwork/patch/1344418/#1553875)
> > The reasoning was that the new variant of process_madvise
> >   a) does not work on an address range
> >   b) is destructive
> >   c) doesn't share much code at all with the rest of process_madvise
> > From the userspace point of view it was awkward and inconvenient to provide
> > memory range for this operation that operates on the entire address space.
> > Using special flags or address values to specify the entire address space
> > was too hacky.
> >
> > The API is as follows,
> >
> >           int process_reap(int pidfd, unsigned int flags);
> >
> >         DESCRIPTION
> >           The process_reap() system call is used to free the memory of a
> >           dying process.
> >
> >           The pidfd selects the process referred to by the PID file
> >           descriptor.
> >           (See pidofd_open(2) for further information)
>
> *pidfd_open

Ack

>
> >
> >           The flags argument is reserved for future use; currently, this
> >           argument must be specified as 0.
> >
> >         RETURN VALUE
> >           On success, process_reap() returns 0. On error, -1 is returned
> >           and errno is set to indicate the error.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> Thanks for continuously pushing this. One question I have is how do
> you envision this syscall to be used for the cgroup based workloads.
> Traverse the target tree, read pids from cgroup.procs files,
> pidfd_open them, send SIGKILL and then process_reap them. Is that
> right?

Yes, at least that's how Android does that. It's a bit more involved
but it's a technical detail. Userspace low memory killer kills a
process (sends SIGKILL and calls process_reap) and another system
component detects that a process died and will kill all processes
belonging to the same cgroup (that's how we identify related
processes).

>
> Orthogonal to this patch I wonder if we should have an optimized way
> to reap processes from a cgroup. Something similar to cgroup.kill (or
> maybe overload cgroup.kill with reaping as well).

Seems reasonable to me. We could use that in the above scenario.

>
> [...]
>
> > +
> > +SYSCALL_DEFINE2(process_reap, int, pidfd, unsigned int, flags)
> > +{
> > +       struct pid *pid;
> > +       struct task_struct *task;
> > +       struct mm_struct *mm = NULL;
> > +       unsigned int f_flags;
> > +       long ret = 0;
> > +
> > +       if (flags != 0)
> > +               return -EINVAL;
> > +
> > +       pid = pidfd_get_pid(pidfd, &f_flags);
> > +       if (IS_ERR(pid))
> > +               return PTR_ERR(pid);
> > +
> > +       task = get_pid_task(pid, PIDTYPE_PID);
> > +       if (!task) {
> > +               ret = -ESRCH;
> > +               goto put_pid;
> > +       }
> > +
> > +       /*
> > +        * If the task is dying and in the process of releasing its memory
> > +        * then get its mm.
> > +        */
> > +       task_lock(task);
> > +       if (task_will_free_mem(task) && (task->flags & PF_KTHREAD) == 0) {
>
> task_will_free_mem() is fine here but I think in parallel we should
> optimize this function. At the moment it is traversing all the
> processes on the machine. It is very normal to have tens of thousands
> of processes on big machines, so it would be really costly when
> reaping a bunch of processes.

Hmm. But I think we still need to make sure that the mm is not shared
with another non-dying process. IIUC that's the point of that
traversal. Am I mistaken?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-06-30 18:26 ` Andy Lutomirski
@ 2021-06-30 18:51   ` Suren Baghdasaryan
  2021-06-30 21:45     ` Johannes Weiner
  2021-07-01  0:45     ` Andy Lutomirski
  0 siblings, 2 replies; 31+ messages in thread
From: Suren Baghdasaryan @ 2021-06-30 18:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Michal Hocko, Michal Hocko, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Christoph Hellwig, Oleg Nesterov,
	David Hildenbrand, Jann Horn, Shakeel Butt, Tim Murray,
	Linux API, Linux-MM, LKML, Android Kernel Team

On Wed, Jun 30, 2021 at 11:26 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Wed, Jun 23, 2021 at 12:28 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > In modern systems it's not unusual to have a system component monitoring
> > memory conditions of the system and tasked with keeping system memory
> > pressure under control. One way to accomplish that is to kill
> > non-essential processes to free up memory for more important ones.
> > Examples of this are Facebook's OOM killer daemon called oomd and
> > Android's low memory killer daemon called lmkd.
> > For such system component it's important to be able to free memory
> > quickly and efficiently. Unfortunately the time process takes to free
> > up its memory after receiving a SIGKILL might vary based on the state
> > of the process (uninterruptible sleep), size and OPP level of the core
> > the process is running. A mechanism to free resources of the target
> > process in a more predictable way would improve system's ability to
> > control its memory pressure.
> > Introduce process_reap system call that reclaims memory of a dying process
> > from the context of the caller. This way the memory in freed in a more
> > controllable way with CPU affinity and priority of the caller. The workload
> > of freeing the memory will also be charged to the caller.
> > The operation is allowed only on a dying process.
>
> At the risk of asking a potentially silly question, should this just
> be a file in procfs?

Hmm. I guess it's doable if procfs will not disappear too soon before
memory is released... syscall also supports parameters, in this case
flags can be used in the future to support PIDs in addition to PIDFDs
for example.
Before looking more in that direction, a silly question from my side:
why procfs interface would be preferable to a syscall?

>
> Also, please consider removing all mention of the word "reap" from the
> user API.  For better or for worse, "reap" in UNIX refers to what
> happens when a dead task gets wait()ed.  I sincerely wish I could go
> back in time and gently encourage whomever invented that particular
> abomination to change their mind, but my time machine doesn't work.

I see. Thanks for the note. How about process_mem_release() and
replacing reap with release everywhere?

>
> --Andy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-06-30 18:43   ` Suren Baghdasaryan
@ 2021-06-30 19:00     ` Shakeel Butt
  2021-06-30 19:06       ` Suren Baghdasaryan
  0 siblings, 1 reply; 31+ messages in thread
From: Shakeel Butt @ 2021-06-30 19:00 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Michal Hocko, Michal Hocko, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Christoph Hellwig, Oleg Nesterov,
	David Hildenbrand, Jann Horn, Tim Murray, Linux API, Linux MM,
	LKML, kernel-team

On Wed, Jun 30, 2021 at 11:44 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
[...]
> > > +       /*
> > > +        * If the task is dying and in the process of releasing its memory
> > > +        * then get its mm.
> > > +        */
> > > +       task_lock(task);
> > > +       if (task_will_free_mem(task) && (task->flags & PF_KTHREAD) == 0) {
> >
> > task_will_free_mem() is fine here but I think in parallel we should
> > optimize this function. At the moment it is traversing all the
> > processes on the machine. It is very normal to have tens of thousands
> > of processes on big machines, so it would be really costly when
> > reaping a bunch of processes.
>
> Hmm. But I think we still need to make sure that the mm is not shared
> with another non-dying process. IIUC that's the point of that
> traversal. Am I mistaken?

You are right. I am talking about efficiently finding all processes
which are sharing mm (maybe linked into another list) instead of
traversing all the processes on the system.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-06-30 19:00     ` Shakeel Butt
@ 2021-06-30 19:06       ` Suren Baghdasaryan
  0 siblings, 0 replies; 31+ messages in thread
From: Suren Baghdasaryan @ 2021-06-30 19:06 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Michal Hocko, Michal Hocko, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Christoph Hellwig, Oleg Nesterov,
	David Hildenbrand, Jann Horn, Tim Murray, Linux API, Linux MM,
	LKML, kernel-team

On Wed, Jun 30, 2021 at 12:00 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Wed, Jun 30, 2021 at 11:44 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> [...]
> > > > +       /*
> > > > +        * If the task is dying and in the process of releasing its memory
> > > > +        * then get its mm.
> > > > +        */
> > > > +       task_lock(task);
> > > > +       if (task_will_free_mem(task) && (task->flags & PF_KTHREAD) == 0) {
> > >
> > > task_will_free_mem() is fine here but I think in parallel we should
> > > optimize this function. At the moment it is traversing all the
> > > processes on the machine. It is very normal to have tens of thousands
> > > of processes on big machines, so it would be really costly when
> > > reaping a bunch of processes.
> >
> > Hmm. But I think we still need to make sure that the mm is not shared
> > with another non-dying process. IIUC that's the point of that
> > traversal. Am I mistaken?
>
> You are right. I am talking about efficiently finding all processes
> which are sharing mm (maybe linked into another list) instead of
> traversing all the processes on the system.

Oh, I see. I think that's a good idea but belongs to a separate patch
as an optimization for task_will_free_mem().
Thanks for reviewing and for good suggestions!

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-06-30 18:51   ` Suren Baghdasaryan
@ 2021-06-30 21:45     ` Johannes Weiner
  2021-07-01  0:44       ` Andy Lutomirski
  2021-07-01  0:45     ` Andy Lutomirski
  1 sibling, 1 reply; 31+ messages in thread
From: Johannes Weiner @ 2021-06-30 21:45 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andy Lutomirski, Andrew Morton, Michal Hocko, Michal Hocko,
	David Rientjes, Matthew Wilcox, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Christoph Hellwig, Oleg Nesterov,
	David Hildenbrand, Jann Horn, Shakeel Butt, Tim Murray,
	Linux API, Linux-MM, LKML, Android Kernel Team

On Wed, Jun 30, 2021 at 11:51:36AM -0700, Suren Baghdasaryan wrote:
> On Wed, Jun 30, 2021 at 11:26 AM Andy Lutomirski <luto@kernel.org> wrote:
> > Also, please consider removing all mention of the word "reap" from the
> > user API.  For better or for worse, "reap" in UNIX refers to what
> > happens when a dead task gets wait()ed.  I sincerely wish I could go
> > back in time and gently encourage whomever invented that particular
> > abomination to change their mind, but my time machine doesn't work.
> 
> I see. Thanks for the note. How about process_mem_release() and
> replacing reap with release everywhere?

I don't quite understand the objection. This syscall works on tasks
that are at the end of their life, right? Isn't something like
process_mreap() establishing exactly the mental link we want here?
Release is less descriptive for what this thing is to be used for.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-06-30 21:45     ` Johannes Weiner
@ 2021-07-01  0:44       ` Andy Lutomirski
  2021-07-01 22:59         ` Suren Baghdasaryan
  0 siblings, 1 reply; 31+ messages in thread
From: Andy Lutomirski @ 2021-07-01  0:44 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Suren Baghdasaryan, Andy Lutomirski, Andrew Morton, Michal Hocko,
	Michal Hocko, David Rientjes, Matthew Wilcox, Roman Gushchin,
	Rik van Riel, Minchan Kim, Christian Brauner, Christoph Hellwig,
	Oleg Nesterov, David Hildenbrand, Jann Horn, Shakeel Butt,
	Tim Murray, Linux API, Linux-MM, LKML, Android Kernel Team

On Wed, Jun 30, 2021 at 2:45 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Jun 30, 2021 at 11:51:36AM -0700, Suren Baghdasaryan wrote:
> > On Wed, Jun 30, 2021 at 11:26 AM Andy Lutomirski <luto@kernel.org> wrote:
> > > Also, please consider removing all mention of the word "reap" from the
> > > user API.  For better or for worse, "reap" in UNIX refers to what
> > > happens when a dead task gets wait()ed.  I sincerely wish I could go
> > > back in time and gently encourage whomever invented that particular
> > > abomination to change their mind, but my time machine doesn't work.
> >
> > I see. Thanks for the note. How about process_mem_release() and
> > replacing reap with release everywhere?
>
> I don't quite understand the objection. This syscall works on tasks
> that are at the end of their life, right? Isn't something like
> process_mreap() establishing exactly the mental link we want here?
> Release is less descriptive for what this thing is to be used for.

For better or for worse, "reap" means to make a zombie pid go away.
From the description, this new operation takes a dying process (not
necessarily a zombie yet) and aggressively frees its memory.  This is
a different optioneration.

How about "free_dying_process_memory"?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-06-30 18:51   ` Suren Baghdasaryan
  2021-06-30 21:45     ` Johannes Weiner
@ 2021-07-01  0:45     ` Andy Lutomirski
  2021-07-01 23:08       ` Suren Baghdasaryan
  1 sibling, 1 reply; 31+ messages in thread
From: Andy Lutomirski @ 2021-07-01  0:45 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andy Lutomirski, Andrew Morton, Michal Hocko, Michal Hocko,
	David Rientjes, Matthew Wilcox, Johannes Weiner, Roman Gushchin,
	Rik van Riel, Minchan Kim, Christian Brauner, Christoph Hellwig,
	Oleg Nesterov, David Hildenbrand, Jann Horn, Shakeel Butt,
	Tim Murray, Linux API, Linux-MM, LKML, Android Kernel Team

On Wed, Jun 30, 2021 at 11:51 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Jun 30, 2021 at 11:26 AM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Wed, Jun 23, 2021 at 12:28 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > In modern systems it's not unusual to have a system component monitoring
> > > memory conditions of the system and tasked with keeping system memory
> > > pressure under control. One way to accomplish that is to kill
> > > non-essential processes to free up memory for more important ones.
> > > Examples of this are Facebook's OOM killer daemon called oomd and
> > > Android's low memory killer daemon called lmkd.
> > > For such system component it's important to be able to free memory
> > > quickly and efficiently. Unfortunately the time process takes to free
> > > up its memory after receiving a SIGKILL might vary based on the state
> > > of the process (uninterruptible sleep), size and OPP level of the core
> > > the process is running. A mechanism to free resources of the target
> > > process in a more predictable way would improve system's ability to
> > > control its memory pressure.
> > > Introduce process_reap system call that reclaims memory of a dying process
> > > from the context of the caller. This way the memory in freed in a more
> > > controllable way with CPU affinity and priority of the caller. The workload
> > > of freeing the memory will also be charged to the caller.
> > > The operation is allowed only on a dying process.
> >
> > At the risk of asking a potentially silly question, should this just
> > be a file in procfs?
>
> Hmm. I guess it's doable if procfs will not disappear too soon before
> memory is released... syscall also supports parameters, in this case
> flags can be used in the future to support PIDs in addition to PIDFDs
> for example.
> Before looking more in that direction, a silly question from my side:
> why procfs interface would be preferable to a syscall?

It avoids using a syscall nr.  (Admittedly a syscall nr is not *that*
precious of a resource.)  It also makes it possible to use a shell
script to do this, which is maybe useful.

--Andy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-07-01  0:44       ` Andy Lutomirski
@ 2021-07-01 22:59         ` Suren Baghdasaryan
  2021-07-02 15:27           ` Christian Brauner
  0 siblings, 1 reply; 31+ messages in thread
From: Suren Baghdasaryan @ 2021-07-01 22:59 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Johannes Weiner, Andrew Morton, Michal Hocko, Michal Hocko,
	David Rientjes, Matthew Wilcox, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Christoph Hellwig, Oleg Nesterov,
	David Hildenbrand, Jann Horn, Shakeel Butt, Tim Murray,
	Linux API, Linux-MM, LKML, Android Kernel Team

On Wed, Jun 30, 2021 at 5:44 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Wed, Jun 30, 2021 at 2:45 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, Jun 30, 2021 at 11:51:36AM -0700, Suren Baghdasaryan wrote:
> > > On Wed, Jun 30, 2021 at 11:26 AM Andy Lutomirski <luto@kernel.org> wrote:
> > > > Also, please consider removing all mention of the word "reap" from the
> > > > user API.  For better or for worse, "reap" in UNIX refers to what
> > > > happens when a dead task gets wait()ed.  I sincerely wish I could go
> > > > back in time and gently encourage whomever invented that particular
> > > > abomination to change their mind, but my time machine doesn't work.
> > >
> > > I see. Thanks for the note. How about process_mem_release() and
> > > replacing reap with release everywhere?
> >
> > I don't quite understand the objection. This syscall works on tasks
> > that are at the end of their life, right? Isn't something like
> > process_mreap() establishing exactly the mental link we want here?
> > Release is less descriptive for what this thing is to be used for.
>
> For better or for worse, "reap" means to make a zombie pid go away.
> From the description, this new operation takes a dying process (not
> necessarily a zombie yet) and aggressively frees its memory.  This is
> a different optioneration.
>
> How about "free_dying_process_memory"?

process_mreap sounds definitely better and in line with names like
process_madvise. So maybe we can use it?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-07-01  0:45     ` Andy Lutomirski
@ 2021-07-01 23:08       ` Suren Baghdasaryan
  0 siblings, 0 replies; 31+ messages in thread
From: Suren Baghdasaryan @ 2021-07-01 23:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Michal Hocko, Michal Hocko, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Christoph Hellwig, Oleg Nesterov,
	David Hildenbrand, Jann Horn, Shakeel Butt, Tim Murray,
	Linux API, Linux-MM, LKML, Android Kernel Team

)

On Wed, Jun 30, 2021 at 5:46 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Wed, Jun 30, 2021 at 11:51 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Wed, Jun 30, 2021 at 11:26 AM Andy Lutomirski <luto@kernel.org> wrote:
> > >
> > > On Wed, Jun 23, 2021 at 12:28 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > In modern systems it's not unusual to have a system component monitoring
> > > > memory conditions of the system and tasked with keeping system memory
> > > > pressure under control. One way to accomplish that is to kill
> > > > non-essential processes to free up memory for more important ones.
> > > > Examples of this are Facebook's OOM killer daemon called oomd and
> > > > Android's low memory killer daemon called lmkd.
> > > > For such system component it's important to be able to free memory
> > > > quickly and efficiently. Unfortunately the time process takes to free
> > > > up its memory after receiving a SIGKILL might vary based on the state
> > > > of the process (uninterruptible sleep), size and OPP level of the core
> > > > the process is running. A mechanism to free resources of the target
> > > > process in a more predictable way would improve system's ability to
> > > > control its memory pressure.
> > > > Introduce process_reap system call that reclaims memory of a dying process
> > > > from the context of the caller. This way the memory in freed in a more
> > > > controllable way with CPU affinity and priority of the caller. The workload
> > > > of freeing the memory will also be charged to the caller.
> > > > The operation is allowed only on a dying process.
> > >
> > > At the risk of asking a potentially silly question, should this just
> > > be a file in procfs?
> >
> > Hmm. I guess it's doable if procfs will not disappear too soon before
> > memory is released... syscall also supports parameters, in this case
> > flags can be used in the future to support PIDs in addition to PIDFDs
> > for example.
> > Before looking more in that direction, a silly question from my side:
> > why procfs interface would be preferable to a syscall?
>
> It avoids using a syscall nr.  (Admittedly a syscall nr is not *that*
> precious of a resource.)  It also makes it possible to use a shell
> script to do this, which is maybe useful.

I see. Not really sure if the shell usage is a big usecase for this
operation but let's see if more people like that approach. For my
specific usecase one syscall (process_reap) is better than three
syscalls (open, write, close) and the possibility to extend the
functionality using flags might be of value for the future.

>
> --Andy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-07-01 22:59         ` Suren Baghdasaryan
@ 2021-07-02 15:27           ` Christian Brauner
  2021-07-05  7:41             ` David Hildenbrand
  0 siblings, 1 reply; 31+ messages in thread
From: Christian Brauner @ 2021-07-02 15:27 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andy Lutomirski, Johannes Weiner, Andrew Morton, Michal Hocko,
	Michal Hocko, David Rientjes, Matthew Wilcox, Roman Gushchin,
	Rik van Riel, Minchan Kim, Christian Brauner, Christoph Hellwig,
	Oleg Nesterov, David Hildenbrand, Jann Horn, Shakeel Butt,
	Tim Murray, Linux API, Linux-MM, LKML, Android Kernel Team

On Thu, Jul 01, 2021 at 03:59:48PM -0700, Suren Baghdasaryan wrote:
> On Wed, Jun 30, 2021 at 5:44 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Wed, Jun 30, 2021 at 2:45 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Wed, Jun 30, 2021 at 11:51:36AM -0700, Suren Baghdasaryan wrote:
> > > > On Wed, Jun 30, 2021 at 11:26 AM Andy Lutomirski <luto@kernel.org> wrote:
> > > > > Also, please consider removing all mention of the word "reap" from the
> > > > > user API.  For better or for worse, "reap" in UNIX refers to what
> > > > > happens when a dead task gets wait()ed.  I sincerely wish I could go
> > > > > back in time and gently encourage whomever invented that particular
> > > > > abomination to change their mind, but my time machine doesn't work.
> > > >
> > > > I see. Thanks for the note. How about process_mem_release() and
> > > > replacing reap with release everywhere?
> > >
> > > I don't quite understand the objection. This syscall works on tasks
> > > that are at the end of their life, right? Isn't something like
> > > process_mreap() establishing exactly the mental link we want here?
> > > Release is less descriptive for what this thing is to be used for.
> >
> > For better or for worse, "reap" means to make a zombie pid go away.
> > From the description, this new operation takes a dying process (not
> > necessarily a zombie yet) and aggressively frees its memory.  This is
> > a different optioneration.
> >
> > How about "free_dying_process_memory"?
> 
> process_mreap sounds definitely better and in line with names like
> process_madvise. So maybe we can use it?

That one was my favorite from the list I gave too but maybe we can
satisfy Andy too if we use one of:
- process_mfree()
- process_mrelease()

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-07-02 15:27           ` Christian Brauner
@ 2021-07-05  7:41             ` David Hildenbrand
  2021-07-07 12:38               ` Michal Hocko
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2021-07-05  7:41 UTC (permalink / raw)
  To: Christian Brauner, Suren Baghdasaryan
  Cc: Andy Lutomirski, Johannes Weiner, Andrew Morton, Michal Hocko,
	Michal Hocko, David Rientjes, Matthew Wilcox, Roman Gushchin,
	Rik van Riel, Minchan Kim, Christian Brauner, Christoph Hellwig,
	Oleg Nesterov, Jann Horn, Shakeel Butt, Tim Murray, Linux API,
	Linux-MM, LKML, Android Kernel Team

On 02.07.21 17:27, Christian Brauner wrote:
> On Thu, Jul 01, 2021 at 03:59:48PM -0700, Suren Baghdasaryan wrote:
>> On Wed, Jun 30, 2021 at 5:44 PM Andy Lutomirski <luto@kernel.org> wrote:
>>>
>>> On Wed, Jun 30, 2021 at 2:45 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>>>>
>>>> On Wed, Jun 30, 2021 at 11:51:36AM -0700, Suren Baghdasaryan wrote:
>>>>> On Wed, Jun 30, 2021 at 11:26 AM Andy Lutomirski <luto@kernel.org> wrote:
>>>>>> Also, please consider removing all mention of the word "reap" from the
>>>>>> user API.  For better or for worse, "reap" in UNIX refers to what
>>>>>> happens when a dead task gets wait()ed.  I sincerely wish I could go
>>>>>> back in time and gently encourage whomever invented that particular
>>>>>> abomination to change their mind, but my time machine doesn't work.
>>>>>
>>>>> I see. Thanks for the note. How about process_mem_release() and
>>>>> replacing reap with release everywhere?
>>>>
>>>> I don't quite understand the objection. This syscall works on tasks
>>>> that are at the end of their life, right? Isn't something like
>>>> process_mreap() establishing exactly the mental link we want here?
>>>> Release is less descriptive for what this thing is to be used for.
>>>
>>> For better or for worse, "reap" means to make a zombie pid go away.
>>>  From the description, this new operation takes a dying process (not
>>> necessarily a zombie yet) and aggressively frees its memory.  This is
>>> a different optioneration.
>>>
>>> How about "free_dying_process_memory"?
>>
>> process_mreap sounds definitely better and in line with names like
>> process_madvise. So maybe we can use it?
> 
> That one was my favorite from the list I gave too but maybe we can
> satisfy Andy too if we use one of:
> - process_mfree()
> - process_mrelease()
> 

FWIW, I tend to like process_mrelease(), due to the implied "release" 
("free the memory if there are no other references") semantics. Further, 
a new syscall feels cleaner than some magic sysfs/procfs toggle. Just my 
2 cents.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-06-23 19:28 [PATCH 1/1] mm: introduce process_reap system call Suren Baghdasaryan
                   ` (3 preceding siblings ...)
  2021-06-30 18:26 ` Andy Lutomirski
@ 2021-07-07  9:46 ` Florian Weimer
  2021-07-07 21:07   ` Suren Baghdasaryan
  4 siblings, 1 reply; 31+ messages in thread
From: Florian Weimer @ 2021-07-07  9:46 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, mhocko, mhocko, rientjes, willy, hannes, guro, riel,
	minchan, christian, hch, oleg, david, jannh, shakeelb, timmurray,
	linux-api, linux-mm, linux-kernel, kernel-team

* Suren Baghdasaryan:

> The API is as follows,
>
>           int process_reap(int pidfd, unsigned int flags);
>
>         DESCRIPTION
>           The process_reap() system call is used to free the memory of a
>           dying process.
>
>           The pidfd selects the process referred to by the PID file
>           descriptor.
>           (See pidofd_open(2) for further information)
>
>           The flags argument is reserved for future use; currently, this
>           argument must be specified as 0.
>
>         RETURN VALUE
>           On success, process_reap() returns 0. On error, -1 is returned
>           and errno is set to indicate the error.

I think the manual page should mention what it means for a process to be
“dying”, and how to move a process to this state.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-07-05  7:41             ` David Hildenbrand
@ 2021-07-07 12:38               ` Michal Hocko
  2021-07-07 21:14                 ` Suren Baghdasaryan
  0 siblings, 1 reply; 31+ messages in thread
From: Michal Hocko @ 2021-07-07 12:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Christian Brauner, Suren Baghdasaryan, Andy Lutomirski,
	Johannes Weiner, Andrew Morton, David Rientjes, Matthew Wilcox,
	Roman Gushchin, Rik van Riel, Minchan Kim, Christian Brauner,
	Christoph Hellwig, Oleg Nesterov, Jann Horn, Shakeel Butt,
	Tim Murray, Linux API, Linux-MM, LKML, Android Kernel Team

On Mon 05-07-21 09:41:54, David Hildenbrand wrote:
> On 02.07.21 17:27, Christian Brauner wrote:
[...]
> > That one was my favorite from the list I gave too but maybe we can
> > satisfy Andy too if we use one of:
> > - process_mfree()
> > - process_mrelease()
> > 
> 
> FWIW, I tend to like process_mrelease(), due to the implied "release" ("free
> the memory if there are no other references") semantics.

Agreed.

> Further, a new
> syscall feels cleaner than some magic sysfs/procfs toggle. Just my 2 cents.

Yeah, proc based interface is both tricky to use and kinda ugly now that
pidfd can solve all at in once.

My original preference was a more generic kill syscall to allow flags
but a dedicated syscall doesn't look really bad either.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-07-07  9:46 ` Florian Weimer
@ 2021-07-07 21:07   ` Suren Baghdasaryan
  2021-07-08  5:40     ` Florian Weimer
  0 siblings, 1 reply; 31+ messages in thread
From: Suren Baghdasaryan @ 2021-07-07 21:07 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Andrew Morton, Michal Hocko, Michal Hocko, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Christoph Hellwig, Oleg Nesterov,
	David Hildenbrand, Jann Horn, Shakeel Butt, Tim Murray,
	Linux API, linux-mm, LKML, kernel-team

On Wed, Jul 7, 2021 at 2:47 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Suren Baghdasaryan:
>
> > The API is as follows,
> >
> >           int process_reap(int pidfd, unsigned int flags);
> >
> >         DESCRIPTION
> >           The process_reap() system call is used to free the memory of a
> >           dying process.
> >
> >           The pidfd selects the process referred to by the PID file
> >           descriptor.
> >           (See pidofd_open(2) for further information)
> >
> >           The flags argument is reserved for future use; currently, this
> >           argument must be specified as 0.
> >
> >         RETURN VALUE
> >           On success, process_reap() returns 0. On error, -1 is returned
> >           and errno is set to indicate the error.
>
> I think the manual page should mention what it means for a process to be
> “dying”, and how to move a process to this state.

Thanks for the suggestion, Florian! Would replacing "dying process"
with "process which was sent a SIGKILL signal" be sufficient?

>
> Thanks,
> Florian
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-07-07 12:38               ` Michal Hocko
@ 2021-07-07 21:14                 ` Suren Baghdasaryan
  2021-07-09  8:58                   ` Christian Brauner
  0 siblings, 1 reply; 31+ messages in thread
From: Suren Baghdasaryan @ 2021-07-07 21:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Christian Brauner, Andy Lutomirski,
	Johannes Weiner, Andrew Morton, David Rientjes, Matthew Wilcox,
	Roman Gushchin, Rik van Riel, Minchan Kim, Christian Brauner,
	Christoph Hellwig, Oleg Nesterov, Jann Horn, Shakeel Butt,
	Tim Murray, Linux API, Linux-MM, LKML, Android Kernel Team

On Wed, Jul 7, 2021 at 5:38 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 05-07-21 09:41:54, David Hildenbrand wrote:
> > On 02.07.21 17:27, Christian Brauner wrote:
> [...]
> > > That one was my favorite from the list I gave too but maybe we can
> > > satisfy Andy too if we use one of:
> > > - process_mfree()
> > > - process_mrelease()
> > >
> >
> > FWIW, I tend to like process_mrelease(), due to the implied "release" ("free
> > the memory if there are no other references") semantics.
>
> Agreed.

Ok, sounds like process_mrelease() would be an acceptable compromise.

>
> > Further, a new
> > syscall feels cleaner than some magic sysfs/procfs toggle. Just my 2 cents.
>
> Yeah, proc based interface is both tricky to use and kinda ugly now that
> pidfd can solve all at in once.

Sounds good. Will keep it as is then.

> My original preference was a more generic kill syscall to allow flags
> but a dedicated syscall doesn't look really bad either.

Yeah, I have tried that direction unsuccessfully before arriving at
this one. Hopefully it represents the right compromise which can
satisfy everyone's usecase.

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-07-07 21:07   ` Suren Baghdasaryan
@ 2021-07-08  5:40     ` Florian Weimer
  2021-07-08  6:05       ` Suren Baghdasaryan
  0 siblings, 1 reply; 31+ messages in thread
From: Florian Weimer @ 2021-07-08  5:40 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Michal Hocko, Michal Hocko, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Christoph Hellwig, Oleg Nesterov,
	David Hildenbrand, Jann Horn, Shakeel Butt, Tim Murray,
	Linux API, linux-mm, LKML, kernel-team

* Suren Baghdasaryan:

> On Wed, Jul 7, 2021 at 2:47 AM Florian Weimer <fweimer@redhat.com> wrote:
>>
>> * Suren Baghdasaryan:
>>
>> > The API is as follows,
>> >
>> >           int process_reap(int pidfd, unsigned int flags);
>> >
>> >         DESCRIPTION
>> >           The process_reap() system call is used to free the memory of a
>> >           dying process.
>> >
>> >           The pidfd selects the process referred to by the PID file
>> >           descriptor.
>> >           (See pidofd_open(2) for further information)
>> >
>> >           The flags argument is reserved for future use; currently, this
>> >           argument must be specified as 0.
>> >
>> >         RETURN VALUE
>> >           On success, process_reap() returns 0. On error, -1 is returned
>> >           and errno is set to indicate the error.
>>
>> I think the manual page should mention what it means for a process to be
>> “dying”, and how to move a process to this state.
>
> Thanks for the suggestion, Florian! Would replacing "dying process"
> with "process which was sent a SIGKILL signal" be sufficient?

That explains very clearly the requirement, but it raises the question
why this isn't an si_code flag for rt_sigqueueinfo, reusing the existing
system call.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-07-08  5:40     ` Florian Weimer
@ 2021-07-08  6:05       ` Suren Baghdasaryan
  2021-07-08  6:14         ` Florian Weimer
  2021-07-12 12:51         ` Jan Engelhardt
  0 siblings, 2 replies; 31+ messages in thread
From: Suren Baghdasaryan @ 2021-07-08  6:05 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Andrew Morton, Michal Hocko, Michal Hocko, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Christoph Hellwig, Oleg Nesterov,
	David Hildenbrand, Jann Horn, Shakeel Butt, Tim Murray,
	Linux API, linux-mm, LKML, kernel-team

On Wed, Jul 7, 2021 at 10:41 PM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Suren Baghdasaryan:
>
> > On Wed, Jul 7, 2021 at 2:47 AM Florian Weimer <fweimer@redhat.com> wrote:
> >>
> >> * Suren Baghdasaryan:
> >>
> >> > The API is as follows,
> >> >
> >> >           int process_reap(int pidfd, unsigned int flags);
> >> >
> >> >         DESCRIPTION
> >> >           The process_reap() system call is used to free the memory of a
> >> >           dying process.
> >> >
> >> >           The pidfd selects the process referred to by the PID file
> >> >           descriptor.
> >> >           (See pidofd_open(2) for further information)
> >> >
> >> >           The flags argument is reserved for future use; currently, this
> >> >           argument must be specified as 0.
> >> >
> >> >         RETURN VALUE
> >> >           On success, process_reap() returns 0. On error, -1 is returned
> >> >           and errno is set to indicate the error.
> >>
> >> I think the manual page should mention what it means for a process to be
> >> “dying”, and how to move a process to this state.
> >
> > Thanks for the suggestion, Florian! Would replacing "dying process"
> > with "process which was sent a SIGKILL signal" be sufficient?
>
> That explains very clearly the requirement, but it raises the question
> why this isn't an si_code flag for rt_sigqueueinfo, reusing the existing
> system call.

I think you are suggesting to use sigqueue() to deliver the signal and
perform the reaping when a special value accompanies it. This would be
somewhat similar to my early suggestion to use a flag in
pidfd_send_signal() (see:
https://lore.kernel.org/patchwork/patch/1060407) to implement memory
reaping which has another advantage of operation on PIDFDs instead of
PIDs which can be recycled.
kill()/pidfd_send_signal()/sigqueue() are supposed to deliver the
signal and return without blocking. Changing that behavior was
considered unacceptable in these discussions. On the other hand using
some kthread to do the reaping asynchronously has its disadvantages:
userspace can't control the priority and cpu affinity of the thread
doing the reaping and this work is not charged towards the caller. In
the end a separate blocking syscall was deemed appropriate for this
operation. More details can be found in the links I posted in the
description of the patch.

>
> Thanks,
> Florian
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-07-08  6:05       ` Suren Baghdasaryan
@ 2021-07-08  6:14         ` Florian Weimer
  2021-07-08  6:39           ` Suren Baghdasaryan
  2021-07-12 12:51         ` Jan Engelhardt
  1 sibling, 1 reply; 31+ messages in thread
From: Florian Weimer @ 2021-07-08  6:14 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Michal Hocko, Michal Hocko, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Christoph Hellwig, Oleg Nesterov,
	David Hildenbrand, Jann Horn, Shakeel Butt, Tim Murray,
	Linux API, linux-mm, LKML, kernel-team

* Suren Baghdasaryan:

> On Wed, Jul 7, 2021 at 10:41 PM Florian Weimer <fweimer@redhat.com> wrote:
>>
>> * Suren Baghdasaryan:
>>
>> > On Wed, Jul 7, 2021 at 2:47 AM Florian Weimer <fweimer@redhat.com> wrote:
>> >>
>> >> * Suren Baghdasaryan:
>> >>
>> >> > The API is as follows,
>> >> >
>> >> >           int process_reap(int pidfd, unsigned int flags);
>> >> >
>> >> >         DESCRIPTION
>> >> >           The process_reap() system call is used to free the memory of a
>> >> >           dying process.
>> >> >
>> >> >           The pidfd selects the process referred to by the PID file
>> >> >           descriptor.
>> >> >           (See pidofd_open(2) for further information)
>> >> >
>> >> >           The flags argument is reserved for future use; currently, this
>> >> >           argument must be specified as 0.
>> >> >
>> >> >         RETURN VALUE
>> >> >           On success, process_reap() returns 0. On error, -1 is returned
>> >> >           and errno is set to indicate the error.
>> >>
>> >> I think the manual page should mention what it means for a process to be
>> >> “dying”, and how to move a process to this state.
>> >
>> > Thanks for the suggestion, Florian! Would replacing "dying process"
>> > with "process which was sent a SIGKILL signal" be sufficient?
>>
>> That explains very clearly the requirement, but it raises the question
>> why this isn't an si_code flag for rt_sigqueueinfo, reusing the existing
>> system call.
>
> I think you are suggesting to use sigqueue() to deliver the signal and
> perform the reaping when a special value accompanies it. This would be
> somewhat similar to my early suggestion to use a flag in
> pidfd_send_signal() (see:
> https://lore.kernel.org/patchwork/patch/1060407) to implement memory
> reaping which has another advantage of operation on PIDFDs instead of
> PIDs which can be recycled.
> kill()/pidfd_send_signal()/sigqueue() are supposed to deliver the
> signal and return without blocking. Changing that behavior was
> considered unacceptable in these discussions.

Does this mean that you need two threads, one that sends SIGKILL, and
one that calls process_reap?  Given that sending SIGKILL is blocking
with the existing interfaces?

Please also note that asynchronous deallocation of resources leads to
bugs and can cause unrelated workloads to fail.  For example, in some
configurations, clone can fail with EAGAIN even in cases where the total
number of tasks is clearly bounded because the kernel signals task exit
to applications before all resources are deallocated.  I'm worried that
the new interface makes things quite a bit worse in this regard.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-07-08  6:14         ` Florian Weimer
@ 2021-07-08  6:39           ` Suren Baghdasaryan
  2021-07-08  7:13             ` Florian Weimer
  0 siblings, 1 reply; 31+ messages in thread
From: Suren Baghdasaryan @ 2021-07-08  6:39 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Andrew Morton, Michal Hocko, Michal Hocko, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Christoph Hellwig, Oleg Nesterov,
	David Hildenbrand, Jann Horn, Shakeel Butt, Tim Murray,
	Linux API, linux-mm, LKML, kernel-team

On Wed, Jul 7, 2021 at 11:15 PM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Suren Baghdasaryan:
>
> > On Wed, Jul 7, 2021 at 10:41 PM Florian Weimer <fweimer@redhat.com> wrote:
> >>
> >> * Suren Baghdasaryan:
> >>
> >> > On Wed, Jul 7, 2021 at 2:47 AM Florian Weimer <fweimer@redhat.com> wrote:
> >> >>
> >> >> * Suren Baghdasaryan:
> >> >>
> >> >> > The API is as follows,
> >> >> >
> >> >> >           int process_reap(int pidfd, unsigned int flags);
> >> >> >
> >> >> >         DESCRIPTION
> >> >> >           The process_reap() system call is used to free the memory of a
> >> >> >           dying process.
> >> >> >
> >> >> >           The pidfd selects the process referred to by the PID file
> >> >> >           descriptor.
> >> >> >           (See pidofd_open(2) for further information)
> >> >> >
> >> >> >           The flags argument is reserved for future use; currently, this
> >> >> >           argument must be specified as 0.
> >> >> >
> >> >> >         RETURN VALUE
> >> >> >           On success, process_reap() returns 0. On error, -1 is returned
> >> >> >           and errno is set to indicate the error.
> >> >>
> >> >> I think the manual page should mention what it means for a process to be
> >> >> “dying”, and how to move a process to this state.
> >> >
> >> > Thanks for the suggestion, Florian! Would replacing "dying process"
> >> > with "process which was sent a SIGKILL signal" be sufficient?
> >>
> >> That explains very clearly the requirement, but it raises the question
> >> why this isn't an si_code flag for rt_sigqueueinfo, reusing the existing
> >> system call.
> >
> > I think you are suggesting to use sigqueue() to deliver the signal and
> > perform the reaping when a special value accompanies it. This would be
> > somewhat similar to my early suggestion to use a flag in
> > pidfd_send_signal() (see:
> > https://lore.kernel.org/patchwork/patch/1060407) to implement memory
> > reaping which has another advantage of operation on PIDFDs instead of
> > PIDs which can be recycled.
> > kill()/pidfd_send_signal()/sigqueue() are supposed to deliver the
> > signal and return without blocking. Changing that behavior was
> > considered unacceptable in these discussions.
>
> Does this mean that you need two threads, one that sends SIGKILL, and
> one that calls process_reap?  Given that sending SIGKILL is blocking
> with the existing interfaces?

Sending SIGKILL is blocking in terms of delivering the signal, but it
does not block waiting for SIGKILL to be processed by the signal
recipient and memory to be released. When I was talking about
"blocking", I meant that current kill() and friends do not block to
wait for SIGKILL to be processed.
process_reap() will block until the memory is released. Whether the
userspace caller is using it right after sending a SIGKILL to reclaim
the memory synchronously or spawns a separate thread to reclaim memory
asynchronously is up to the user. Both patterns are supported.

> Please also note that asynchronous deallocation of resources leads to
> bugs and can cause unrelated workloads to fail.  For example, in some
> configurations, clone can fail with EAGAIN even in cases where the total
> number of tasks is clearly bounded because the kernel signals task exit
> to applications before all resources are deallocated.  I'm worried that
> the new interface makes things quite a bit worse in this regard.

The process_reap() releases memory synchronously, no kthreads are
being used. If asynchronous release is required, the userspace would
need to spawn a userspace thread and issue this syscall from it. I
hope this clears your concerns, which I think are about asynchronous
deallocations within the kernel.
Thanks!

>
> Thanks,
> Florian
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-07-08  6:39           ` Suren Baghdasaryan
@ 2021-07-08  7:13             ` Florian Weimer
  0 siblings, 0 replies; 31+ messages in thread
From: Florian Weimer @ 2021-07-08  7:13 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Michal Hocko, Michal Hocko, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Christoph Hellwig, Oleg Nesterov,
	David Hildenbrand, Jann Horn, Shakeel Butt, Tim Murray,
	Linux API, linux-mm, LKML, kernel-team

* Suren Baghdasaryan:

> Sending SIGKILL is blocking in terms of delivering the signal, but it
> does not block waiting for SIGKILL to be processed by the signal
> recipient and memory to be released. When I was talking about
> "blocking", I meant that current kill() and friends do not block to
> wait for SIGKILL to be processed.
> process_reap() will block until the memory is released. Whether the
> userspace caller is using it right after sending a SIGKILL to reclaim
> the memory synchronously or spawns a separate thread to reclaim memory
> asynchronously is up to the user. Both patterns are supported.

I see, this makes sense.

Considering that the pidfd sticks around after process_reap returns, the
issue described in bug 154011 probably does not apply to process_reap.
(This relates to asynchronous resource deallocation, as discussed before.)

Thanks,
Florian


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-07-07 21:14                 ` Suren Baghdasaryan
@ 2021-07-09  8:58                   ` Christian Brauner
  2021-07-09 20:05                     ` Suren Baghdasaryan
  0 siblings, 1 reply; 31+ messages in thread
From: Christian Brauner @ 2021-07-09  8:58 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Michal Hocko, David Hildenbrand, Andy Lutomirski,
	Johannes Weiner, Andrew Morton, David Rientjes, Matthew Wilcox,
	Roman Gushchin, Rik van Riel, Minchan Kim, Christian Brauner,
	Christoph Hellwig, Oleg Nesterov, Jann Horn, Shakeel Butt,
	Tim Murray, Linux API, Linux-MM, LKML, Android Kernel Team

On Wed, Jul 07, 2021 at 02:14:23PM -0700, Suren Baghdasaryan wrote:
> On Wed, Jul 7, 2021 at 5:38 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 05-07-21 09:41:54, David Hildenbrand wrote:
> > > On 02.07.21 17:27, Christian Brauner wrote:
> > [...]
> > > > That one was my favorite from the list I gave too but maybe we can
> > > > satisfy Andy too if we use one of:
> > > > - process_mfree()
> > > > - process_mrelease()
> > > >
> > >
> > > FWIW, I tend to like process_mrelease(), due to the implied "release" ("free
> > > the memory if there are no other references") semantics.
> >
> > Agreed.
> 
> Ok, sounds like process_mrelease() would be an acceptable compromise.
> 
> >
> > > Further, a new
> > > syscall feels cleaner than some magic sysfs/procfs toggle. Just my 2 cents.
> >
> > Yeah, proc based interface is both tricky to use and kinda ugly now that
> > pidfd can solve all at in once.
> 
> Sounds good. Will keep it as is then.
> 
> > My original preference was a more generic kill syscall to allow flags
> > but a dedicated syscall doesn't look really bad either.
> 
> Yeah, I have tried that direction unsuccessfully before arriving at
> this one. Hopefully it represents the right compromise which can
> satisfy everyone's usecase.

I think a syscall is fine and it's not we're running out of numbers
(anymore). :)

Christian

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-07-09  8:58                   ` Christian Brauner
@ 2021-07-09 20:05                     ` Suren Baghdasaryan
  0 siblings, 0 replies; 31+ messages in thread
From: Suren Baghdasaryan @ 2021-07-09 20:05 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Michal Hocko, David Hildenbrand, Andy Lutomirski,
	Johannes Weiner, Andrew Morton, David Rientjes, Matthew Wilcox,
	Roman Gushchin, Rik van Riel, Minchan Kim, Christian Brauner,
	Christoph Hellwig, Oleg Nesterov, Jann Horn, Shakeel Butt,
	Tim Murray, Linux API, Linux-MM, LKML, Android Kernel Team

On Fri, Jul 9, 2021 at 1:59 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Wed, Jul 07, 2021 at 02:14:23PM -0700, Suren Baghdasaryan wrote:
> > On Wed, Jul 7, 2021 at 5:38 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 05-07-21 09:41:54, David Hildenbrand wrote:
> > > > On 02.07.21 17:27, Christian Brauner wrote:
> > > [...]
> > > > > That one was my favorite from the list I gave too but maybe we can
> > > > > satisfy Andy too if we use one of:
> > > > > - process_mfree()
> > > > > - process_mrelease()
> > > > >
> > > >
> > > > FWIW, I tend to like process_mrelease(), due to the implied "release" ("free
> > > > the memory if there are no other references") semantics.
> > >
> > > Agreed.
> >
> > Ok, sounds like process_mrelease() would be an acceptable compromise.
> >
> > >
> > > > Further, a new
> > > > syscall feels cleaner than some magic sysfs/procfs toggle. Just my 2 cents.
> > >
> > > Yeah, proc based interface is both tricky to use and kinda ugly now that
> > > pidfd can solve all at in once.
> >
> > Sounds good. Will keep it as is then.
> >
> > > My original preference was a more generic kill syscall to allow flags
> > > but a dedicated syscall doesn't look really bad either.
> >
> > Yeah, I have tried that direction unsuccessfully before arriving at
> > this one. Hopefully it represents the right compromise which can
> > satisfy everyone's usecase.
>
> I think a syscall is fine and it's not we're running out of numbers
> (anymore). :)

Thanks everyone for the input!
So far I collected:
1. rename the syscall to process_mrelease()
2. replace "dying process" with "process which was sent a SIGKILL
signal" in the manual page text

I'll respin a v2 with these changes next week.
Have a great weekend!
Suren.

>
> Christian

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-07-08  6:05       ` Suren Baghdasaryan
  2021-07-08  6:14         ` Florian Weimer
@ 2021-07-12 12:51         ` Jan Engelhardt
  2021-07-12 18:39           ` Suren Baghdasaryan
  1 sibling, 1 reply; 31+ messages in thread
From: Jan Engelhardt @ 2021-07-12 12:51 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Florian Weimer, Andrew Morton, Michal Hocko, Michal Hocko,
	David Rientjes, Matthew Wilcox, Johannes Weiner, Roman Gushchin,
	Rik van Riel, Minchan Kim, Christian Brauner, Christoph Hellwig,
	Oleg Nesterov, David Hildenbrand, Jann Horn, Shakeel Butt,
	Tim Murray, Linux API, linux-mm, LKML, kernel-team


On Thursday 2021-07-08 08:05, Suren Baghdasaryan wrote:
>>
>> That explains very clearly the requirement, but it raises the question
>> why this isn't an si_code flag for rt_sigqueueinfo, reusing the existing
>> system call.
>
>I think you are suggesting to use sigqueue() to deliver the signal and
>perform the reaping when a special value accompanies it. This would be
>somewhat similar to my early suggestion to use a flag in
>pidfd_send_signal() (see:
>https://lore.kernel.org/patchwork/patch/1060407) to implement memory
>reaping which has another advantage of operation on PIDFDs instead of
>PIDs which can be recycled.
>kill()/pidfd_send_signal()/sigqueue() are supposed to deliver the
>signal and return without blocking. Changing that behavior was
>considered unacceptable in these discussions.

The way I understood the request is that a userspace program (or perhaps two,
if so desired) should issue _two_ calls, one to deliver the signal,
one to perform the reap portion:

	uinfo.si_code = SI_QUEUE;
	sigqueue(pid, SIGKILL, &uinfo);
	uinfo.si_code = SI_REAP;
	sigqueue(pid, SIGKILL, &uinfo);

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-07-12 12:51         ` Jan Engelhardt
@ 2021-07-12 18:39           ` Suren Baghdasaryan
  2021-07-12 19:16             ` Jan Engelhardt
  0 siblings, 1 reply; 31+ messages in thread
From: Suren Baghdasaryan @ 2021-07-12 18:39 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Florian Weimer, Andrew Morton, Michal Hocko, Michal Hocko,
	David Rientjes, Matthew Wilcox, Johannes Weiner, Roman Gushchin,
	Rik van Riel, Minchan Kim, Christian Brauner, Christoph Hellwig,
	Oleg Nesterov, David Hildenbrand, Jann Horn, Shakeel Butt,
	Tim Murray, Linux API, linux-mm, LKML, kernel-team

On Mon, Jul 12, 2021 at 5:51 AM Jan Engelhardt <jengelh@inai.de> wrote:
>
>
> On Thursday 2021-07-08 08:05, Suren Baghdasaryan wrote:
> >>
> >> That explains very clearly the requirement, but it raises the question
> >> why this isn't an si_code flag for rt_sigqueueinfo, reusing the existing
> >> system call.
> >
> >I think you are suggesting to use sigqueue() to deliver the signal and
> >perform the reaping when a special value accompanies it. This would be
> >somewhat similar to my early suggestion to use a flag in
> >pidfd_send_signal() (see:
> >https://lore.kernel.org/patchwork/patch/1060407) to implement memory
> >reaping which has another advantage of operation on PIDFDs instead of
> >PIDs which can be recycled.
> >kill()/pidfd_send_signal()/sigqueue() are supposed to deliver the
> >signal and return without blocking. Changing that behavior was
> >considered unacceptable in these discussions.
>
> The way I understood the request is that a userspace program (or perhaps two,
> if so desired) should issue _two_ calls, one to deliver the signal,
> one to perform the reap portion:
>
>         uinfo.si_code = SI_QUEUE;
>         sigqueue(pid, SIGKILL, &uinfo);
>         uinfo.si_code = SI_REAP;
>         sigqueue(pid, SIGKILL, &uinfo);

This approach would still lead to the same discussion: by design,
sigqueue/kill/pidfd_send_signal deliver the signal but do not wait for
the signal to be processed by the recipient. Changing that would be a
behavior change. Therefore we would have to follow this pattern and
implement memory reaping in an asynchronous manner using a
kthread/workqueue and it won't be done in the context of the calling
process. This is undesirable because we lose the ability to control
priority and cpu affinity for this operation and work won't be charged
to the caller.
That's why the proposed syscall performs memory reaping in the
caller's context and blocks until the operation is done. In this
proposal, your sequence looks like this:

pidfd_send_signal(pidfd, SIGKILL, NULL, 0);
process_reap(pidfd, 0);

except we decided to rename process_reap() to process_mrelease() in
the next revision.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] mm: introduce process_reap system call
  2021-07-12 18:39           ` Suren Baghdasaryan
@ 2021-07-12 19:16             ` Jan Engelhardt
  0 siblings, 0 replies; 31+ messages in thread
From: Jan Engelhardt @ 2021-07-12 19:16 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Florian Weimer, Andrew Morton, Michal Hocko, Michal Hocko,
	David Rientjes, Matthew Wilcox, Johannes Weiner, Roman Gushchin,
	Rik van Riel, Minchan Kim, Christian Brauner, Christoph Hellwig,
	Oleg Nesterov, David Hildenbrand, Jann Horn, Shakeel Butt,
	Tim Murray, Linux API, linux-mm, LKML, kernel-team


On Monday 2021-07-12 20:39, Suren Baghdasaryan wrote:
>>
>> The way I understood the request is that a userspace program (or perhaps two,
>> if so desired) should issue _two_ calls, one to deliver the signal,
>> one to perform the reap portion:
>>
>>         uinfo.si_code = SI_QUEUE;
>>         sigqueue(pid, SIGKILL, &uinfo);
>>         uinfo.si_code = SI_REAP;
>>         sigqueue(pid, SIGKILL, &uinfo);
>
>This approach would still lead to the same discussion: by design,
>sigqueue/kill/pidfd_send_signal deliver the signal but do not wait for
>the signal to be processed by the recipient.

Oh, so the only reason not to do that is because there is some POSIX
specification that says the sigqueue API should be non-waiting for all
possible parameter values (with an implied "present and future
values!"), not because there's some hurdle to actually add a wait
inside within rt_sigqueueinfo if the REAP flag is set.
Gotcha.

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2021-07-12 19:16 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-23 19:28 [PATCH 1/1] mm: introduce process_reap system call Suren Baghdasaryan
2021-06-23 19:34 ` Suren Baghdasaryan
2021-06-29 13:13 ` Christian Brauner
2021-06-29 16:15   ` Suren Baghdasaryan
2021-06-30 18:00 ` Shakeel Butt
2021-06-30 18:43   ` Suren Baghdasaryan
2021-06-30 19:00     ` Shakeel Butt
2021-06-30 19:06       ` Suren Baghdasaryan
2021-06-30 18:26 ` Andy Lutomirski
2021-06-30 18:51   ` Suren Baghdasaryan
2021-06-30 21:45     ` Johannes Weiner
2021-07-01  0:44       ` Andy Lutomirski
2021-07-01 22:59         ` Suren Baghdasaryan
2021-07-02 15:27           ` Christian Brauner
2021-07-05  7:41             ` David Hildenbrand
2021-07-07 12:38               ` Michal Hocko
2021-07-07 21:14                 ` Suren Baghdasaryan
2021-07-09  8:58                   ` Christian Brauner
2021-07-09 20:05                     ` Suren Baghdasaryan
2021-07-01  0:45     ` Andy Lutomirski
2021-07-01 23:08       ` Suren Baghdasaryan
2021-07-07  9:46 ` Florian Weimer
2021-07-07 21:07   ` Suren Baghdasaryan
2021-07-08  5:40     ` Florian Weimer
2021-07-08  6:05       ` Suren Baghdasaryan
2021-07-08  6:14         ` Florian Weimer
2021-07-08  6:39           ` Suren Baghdasaryan
2021-07-08  7:13             ` Florian Weimer
2021-07-12 12:51         ` Jan Engelhardt
2021-07-12 18:39           ` Suren Baghdasaryan
2021-07-12 19:16             ` Jan Engelhardt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).