LKML Archive on lore.kernel.org
* [PATCH V3 0/9] Use copy_process/create_io_thread in vhost layer
@ 2021-10-04 19:21 Mike Christie
  2021-10-04 19:21 ` [PATCH V3 1/9] fork: Make IO worker options flag based Mike Christie
                   ` (8 more replies)
  0 siblings, 9 replies; 17+ messages in thread
From: Mike Christie @ 2021-10-04 19:21 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel

The following patches were made over Linus's tree but also apply over
Jens's for-next io_uring branch and Michael's vhost/next branch.

This is V3 of the patchset. It should handle all the review comments
posted in V1 and V2. If I missed a comment, please let me know.

This patchset allows the vhost layer to do a copy_process on the thread
that does the VHOST_SET_OWNER ioctl like how io_uring does a copy_process
against its userspace app (Jens, the patches make create_io_thread more
generic so that's why you are cc'd). This allows the vhost layer's worker
threads to inherit cgroups, namespaces, address space, etc., and the worker
threads will also be accounted for against the owner/parent process's
RLIMIT_NPROC limit.

If you are not familiar with qemu and vhost, here is a more detailed
problem description:

Qemu will create vhost devices in the kernel which perform network, SCSI,
etc IO and management operations from worker threads created by the
kthread API. Because the kthread API does a copy_process on the kthreadd
thread, the vhost layer has to use kthread_use_mm to access the Qemu
thread's memory and cgroup_attach_task_all to add itself to the Qemu
thread's cgroups.

The problem with this approach is that we then have to add new functions/
args/functionality for everything we want to inherit. I started doing
that here:

https://lkml.org/lkml/2021/6/23/1233

for the RLIMIT_NPROC check, but it seems it might be easier to just
inherit everything from the beginning, because I'd need to do something
like that patch several times. For example, the current approach does not
support cgroups v2 so commands like virsh emulatorpin do not work. The
qemu process can go over its RLIMIT_NPROC. And for future vhost interfaces
where we export the vhost thread pid we will want the namespace info.

V3:
- Add parentheses in p->flag and work_flags check in copy_thread.
- Fix check in arm/arm64 and xtensa which were doing the reverse of other
  archs in their check for PF_IO_WORKER.
V2:
- Rename kernel_copy_process to kernel_worker.
- Instead of exporting functions, make kernel_worker() a proper
  function/API that does common work for the caller.
- Instead of adding new fields to kernel_clone_args for each option
  make it flag based similar to CLONE_*.
- Drop unused completion struct in vhost.
- Fix compile warnings by merging vhost cgroup cleanup patch and
  vhost conversion patch.





* [PATCH V3 1/9] fork: Make IO worker options flag based
  2021-10-04 19:21 [PATCH V3 0/9] Use copy_process/create_io_thread in vhost layer Mike Christie
@ 2021-10-04 19:21 ` Mike Christie
  2021-10-04 19:21 ` [PATCH V3 2/9] fork: pass worker_flags to copy_thread Mike Christie
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Mike Christie @ 2021-10-04 19:21 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel
  Cc: Mike Christie

This patchset adds a couple new options to kernel_clone_args for IO thread
like/related users. Instead of adding new fields to kernel_clone_args for
each option, this moves us to a flags based approach by first converting
io_thread.
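
As a rough userspace illustration of the flag-based approach described
above (a sketch only, not the kernel code): the struct, the DEMO_BIT
macro, DEMO_WORKER_OTHER and demo_is_io_worker() are hypothetical
stand-ins. Only the names KERN_WORKER_IO and worker_flags come from this
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's BIT() macro (assumption). */
#define DEMO_BIT(n)		(1U << (n))

/* Name taken from the patch; defined here only for illustration. */
#define KERN_WORKER_IO		DEMO_BIT(0)
/* Hypothetical second option, showing that no new struct field is needed. */
#define DEMO_WORKER_OTHER	DEMO_BIT(1)

/* Trimmed-down stand-in for kernel_clone_args. */
struct demo_clone_args {
	uint64_t flags;		/* CLONE_* style flags */
	uint32_t worker_flags;	/* one bit per worker option */
};

/* Mirrors the shape of the check in copy_process(): test one bit. */
static bool demo_is_io_worker(const struct demo_clone_args *args)
{
	assert(args != NULL);
	return (args->worker_flags & KERN_WORKER_IO) != 0;
}
```

With this layout a caller ORs options together, e.g.
.worker_flags = KERN_WORKER_IO | DEMO_WORKER_OTHER, and each consumer
tests only the bit it cares about.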

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 include/linux/sched/task.h | 4 +++-
 kernel/fork.c              | 4 ++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ef02be869cf2..48417c735438 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -18,8 +18,11 @@ struct css_set;
 /* All the bits taken by the old clone syscall. */
 #define CLONE_LEGACY_FLAGS 0xffffffffULL
 
+#define KERN_WORKER_IO		BIT(0)
+
 struct kernel_clone_args {
 	u64 flags;
+	u32 worker_flags;
 	int __user *pidfd;
 	int __user *child_tid;
 	int __user *parent_tid;
@@ -31,7 +34,6 @@ struct kernel_clone_args {
 	/* Number of elements in *set_tid */
 	size_t set_tid_size;
 	int cgroup;
-	int io_thread;
 	struct cgroup *cgrp;
 	struct css_set *cset;
 };
diff --git a/kernel/fork.c b/kernel/fork.c
index 38681ad44c76..3988106e9609 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2026,7 +2026,7 @@ static __latent_entropy struct task_struct *copy_process(
 	p = dup_task_struct(current, node);
 	if (!p)
 		goto fork_out;
-	if (args->io_thread) {
+	if (args->worker_flags & KERN_WORKER_IO) {
 		/*
 		 * Mark us an IO worker, and block any signal that isn't
 		 * fatal or STOP
@@ -2526,7 +2526,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
 		.stack		= (unsigned long)fn,
 		.stack_size	= (unsigned long)arg,
-		.io_thread	= 1,
+		.worker_flags	= KERN_WORKER_IO,
 	};
 
 	return copy_process(NULL, 0, node, &args);
-- 
2.25.1



* [PATCH V3 2/9] fork: pass worker_flags to copy_thread
  2021-10-04 19:21 [PATCH V3 0/9] Use copy_process/create_io_thread in vhost layer Mike Christie
  2021-10-04 19:21 ` [PATCH V3 1/9] fork: Make IO worker options flag based Mike Christie
@ 2021-10-04 19:21 ` Mike Christie
  2021-10-04 19:21 ` [PATCH V3 3/9] fork: move PF_IO_WORKER's kernel frame setup to new flag Mike Christie
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Mike Christie @ 2021-10-04 19:21 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel
  Cc: Mike Christie

We need to break up PF_IO_WORKER into the parts that are used for
scheduling and signal handling and the part that tells copy_thread to
treat it as a special type of thread during setup. This patch passes the
worker_flags to copy_thread, so in the next patch we can add new worker
flags that copy_thread can see.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
---
 arch/alpha/kernel/process.c      | 2 +-
 arch/arc/kernel/process.c        | 2 +-
 arch/arm/kernel/process.c        | 3 ++-
 arch/arm64/kernel/process.c      | 3 ++-
 arch/csky/kernel/process.c       | 3 ++-
 arch/h8300/kernel/process.c      | 3 ++-
 arch/hexagon/kernel/process.c    | 2 +-
 arch/ia64/kernel/process.c       | 3 ++-
 arch/m68k/kernel/process.c       | 2 +-
 arch/microblaze/kernel/process.c | 2 +-
 arch/mips/kernel/process.c       | 2 +-
 arch/nds32/kernel/process.c      | 3 ++-
 arch/nios2/kernel/process.c      | 2 +-
 arch/openrisc/kernel/process.c   | 3 ++-
 arch/parisc/kernel/process.c     | 3 ++-
 arch/powerpc/kernel/process.c    | 2 +-
 arch/riscv/kernel/process.c      | 2 +-
 arch/s390/kernel/process.c       | 3 ++-
 arch/sh/kernel/process_32.c      | 2 +-
 arch/sparc/kernel/process_32.c   | 2 +-
 arch/sparc/kernel/process_64.c   | 2 +-
 arch/um/kernel/process.c         | 3 ++-
 arch/x86/kernel/process.c        | 2 +-
 arch/xtensa/kernel/process.c     | 2 +-
 include/linux/sched/task.h       | 2 +-
 kernel/fork.c                    | 3 ++-
 26 files changed, 37 insertions(+), 26 deletions(-)

diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index a5123ea426ce..6005b0dfe7e2 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -235,7 +235,7 @@ release_thread(struct task_struct *dead_task)
  */
 int copy_thread(unsigned long clone_flags, unsigned long usp,
 		unsigned long kthread_arg, struct task_struct *p,
-		unsigned long tls)
+		unsigned long tls, u32 worker_flags)
 {
 	extern void ret_from_fork(void);
 	extern void ret_from_kernel_thread(void);
diff --git a/arch/arc/kernel/process.c b/arch/arc/kernel/process.c
index 3793876f42d9..4e307e5b5205 100644
--- a/arch/arc/kernel/process.c
+++ b/arch/arc/kernel/process.c
@@ -164,7 +164,7 @@ asmlinkage void ret_from_fork(void);
  */
 int copy_thread(unsigned long clone_flags, unsigned long usp,
 		unsigned long kthread_arg, struct task_struct *p,
-		unsigned long tls)
+		unsigned long tls, u32 worker_flags)
 {
 	struct pt_regs *c_regs;        /* child's pt_regs */
 	unsigned long *childksp;       /* to unwind out of __switch_to() */
diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index 0e2d3051741e..07ae4444b6ab 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -230,7 +230,8 @@ void release_thread(struct task_struct *dead_task)
 asmlinkage void ret_from_fork(void) __asm__("ret_from_fork");
 
 int copy_thread(unsigned long clone_flags, unsigned long stack_start,
-		unsigned long stk_sz, struct task_struct *p, unsigned long tls)
+		unsigned long stk_sz, struct task_struct *p, unsigned long tls,
+		u32 worker_flags)
 {
 	struct thread_info *thread = task_thread_info(p);
 	struct pt_regs *childregs = task_pt_regs(p);
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 40adb8cdbf5a..7979ec253c29 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -316,7 +316,8 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 asmlinkage void ret_from_fork(void) asm("ret_from_fork");
 
 int copy_thread(unsigned long clone_flags, unsigned long stack_start,
-		unsigned long stk_sz, struct task_struct *p, unsigned long tls)
+		unsigned long stk_sz, struct task_struct *p, unsigned long tls,
+		u32 worker_flags)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 
diff --git a/arch/csky/kernel/process.c b/arch/csky/kernel/process.c
index 3d0ca22cd0e2..f38b668515ae 100644
--- a/arch/csky/kernel/process.c
+++ b/arch/csky/kernel/process.c
@@ -34,7 +34,8 @@ int copy_thread(unsigned long clone_flags,
 		unsigned long usp,
 		unsigned long kthread_arg,
 		struct task_struct *p,
-		unsigned long tls)
+		unsigned long tls,
+		u32 worker_flags)
 {
 	struct switch_stack *childstack;
 	struct pt_regs *childregs = task_pt_regs(p);
diff --git a/arch/h8300/kernel/process.c b/arch/h8300/kernel/process.c
index 2ac27e4248a4..9a8f6c033ad1 100644
--- a/arch/h8300/kernel/process.c
+++ b/arch/h8300/kernel/process.c
@@ -106,7 +106,8 @@ void flush_thread(void)
 }
 
 int copy_thread(unsigned long clone_flags, unsigned long usp,
-		unsigned long topstk, struct task_struct *p, unsigned long tls)
+		unsigned long topstk, struct task_struct *p, unsigned long tls,
+		u32 worker_flags)
 {
 	struct pt_regs *childregs;
 
diff --git a/arch/hexagon/kernel/process.c b/arch/hexagon/kernel/process.c
index 6a6835fb4242..664367be55e5 100644
--- a/arch/hexagon/kernel/process.c
+++ b/arch/hexagon/kernel/process.c
@@ -51,7 +51,7 @@ void arch_cpu_idle(void)
  * Copy architecture-specific thread state
  */
 int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
-		struct task_struct *p, unsigned long tls)
+		struct task_struct *p, unsigned long tls, u32 worker_flags)
 {
 	struct thread_info *ti = task_thread_info(p);
 	struct hexagon_switch_stack *ss;
diff --git a/arch/ia64/kernel/process.c b/arch/ia64/kernel/process.c
index e56d63f4abf9..a69cc33b5e32 100644
--- a/arch/ia64/kernel/process.c
+++ b/arch/ia64/kernel/process.c
@@ -296,7 +296,8 @@ ia64_load_extra (struct task_struct *task)
  */
 int
 copy_thread(unsigned long clone_flags, unsigned long user_stack_base,
-	    unsigned long user_stack_size, struct task_struct *p, unsigned long tls)
+	    unsigned long user_stack_size, struct task_struct *p, unsigned long tls,
+	    u32 worker_flags)
 {
 	extern char ia64_ret_from_clone;
 	struct switch_stack *child_stack, *stack;
diff --git a/arch/m68k/kernel/process.c b/arch/m68k/kernel/process.c
index 1ab692b952cd..7587291793fb 100644
--- a/arch/m68k/kernel/process.c
+++ b/arch/m68k/kernel/process.c
@@ -139,7 +139,7 @@ asmlinkage int m68k_clone3(struct pt_regs *regs)
 }
 
 int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
-		struct task_struct *p, unsigned long tls)
+		struct task_struct *p, unsigned long tls, u32 worker_flags)
 {
 	struct fork_frame {
 		struct switch_stack sw;
diff --git a/arch/microblaze/kernel/process.c b/arch/microblaze/kernel/process.c
index 62aa237180b6..b8eb544e1fd6 100644
--- a/arch/microblaze/kernel/process.c
+++ b/arch/microblaze/kernel/process.c
@@ -54,7 +54,7 @@ void flush_thread(void)
 }
 
 int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
-		struct task_struct *p, unsigned long tls)
+		struct task_struct *p, unsigned long tls, u32 worker_flags)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 	struct thread_info *ti = task_thread_info(p);
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index 95aa86fa6077..d494e1d76e71 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -107,7 +107,7 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
  */
 int copy_thread(unsigned long clone_flags, unsigned long usp,
 		unsigned long kthread_arg, struct task_struct *p,
-		unsigned long tls)
+		unsigned long tls, u32 worker_flags)
 {
 	struct thread_info *ti = task_thread_info(p);
 	struct pt_regs *childregs, *regs = current_pt_regs();
diff --git a/arch/nds32/kernel/process.c b/arch/nds32/kernel/process.c
index 391895b54d13..1ca8900f9d07 100644
--- a/arch/nds32/kernel/process.c
+++ b/arch/nds32/kernel/process.c
@@ -150,7 +150,8 @@ DEFINE_PER_CPU(struct task_struct *, __entry_task);
 
 asmlinkage void ret_from_fork(void) __asm__("ret_from_fork");
 int copy_thread(unsigned long clone_flags, unsigned long stack_start,
-		unsigned long stk_sz, struct task_struct *p, unsigned long tls)
+		unsigned long stk_sz, struct task_struct *p, unsigned long tls,
+		u32 worker_flags)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 
diff --git a/arch/nios2/kernel/process.c b/arch/nios2/kernel/process.c
index 9ff37ba2bb60..b49dc6500118 100644
--- a/arch/nios2/kernel/process.c
+++ b/arch/nios2/kernel/process.c
@@ -101,7 +101,7 @@ void flush_thread(void)
 }
 
 int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
-		struct task_struct *p, unsigned long tls)
+		struct task_struct *p, unsigned long tls, u32 worker_flags)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 	struct pt_regs *regs;
diff --git a/arch/openrisc/kernel/process.c b/arch/openrisc/kernel/process.c
index b0698d9ce14f..7b356a9a8dc7 100644
--- a/arch/openrisc/kernel/process.c
+++ b/arch/openrisc/kernel/process.c
@@ -127,6 +127,7 @@ extern asmlinkage void ret_from_fork(void);
  * @arg: arg to fn for kernel thread; always NULL for userspace thread
  * @p: the newly created task
  * @tls: the Thread Local Storage pointer for the new process
+ * @worker_flags: kernel_clone_args's worker_flags
  *
  * At the top of a newly initialized kernel stack are two stacked pt_reg
  * structures.  The first (topmost) is the userspace context of the thread.
@@ -153,7 +154,7 @@ extern asmlinkage void ret_from_fork(void);
 
 int
 copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
-	    struct task_struct *p, unsigned long tls)
+	    struct task_struct *p, unsigned long tls, u32 worker_flags)
 {
 	struct pt_regs *userregs;
 	struct pt_regs *kregs;
diff --git a/arch/parisc/kernel/process.c b/arch/parisc/kernel/process.c
index 38ec4ae81239..d9555ccf1e9c 100644
--- a/arch/parisc/kernel/process.c
+++ b/arch/parisc/kernel/process.c
@@ -186,7 +186,8 @@ arch_initcall(parisc_idle_init);
  */
 int
 copy_thread(unsigned long clone_flags, unsigned long usp,
-	    unsigned long kthread_arg, struct task_struct *p, unsigned long tls)
+	    unsigned long kthread_arg, struct task_struct *p, unsigned long tls,
+	    u32 worker_flags)
 {
 	struct pt_regs *cregs = &(p->thread.regs);
 	void *stack = task_stack_page(p);
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 50436b52c213..d2f2301b0ad1 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1682,7 +1682,7 @@ static void setup_ksp_vsid(struct task_struct *p, unsigned long sp)
  */
 int copy_thread(unsigned long clone_flags, unsigned long usp,
 		unsigned long kthread_arg, struct task_struct *p,
-		unsigned long tls)
+		unsigned long tls, u32 worker_flags)
 {
 	struct pt_regs *childregs, *kregs;
 	extern void ret_from_fork(void);
diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
index 03ac3aa611f5..3d0e6390f34c 100644
--- a/arch/riscv/kernel/process.c
+++ b/arch/riscv/kernel/process.c
@@ -120,7 +120,7 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 }
 
 int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
-		struct task_struct *p, unsigned long tls)
+		struct task_struct *p, unsigned long tls, u32 worker_flags)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 350e94d0cac2..01b969bb868e 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -95,7 +95,8 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 }
 
 int copy_thread(unsigned long clone_flags, unsigned long new_stackp,
-		unsigned long arg, struct task_struct *p, unsigned long tls)
+		unsigned long arg, struct task_struct *p, unsigned long tls,
+		u32 worker_flags)
 {
 	struct fake_frame
 	{
diff --git a/arch/sh/kernel/process_32.c b/arch/sh/kernel/process_32.c
index 717de05c81f4..d199805552c0 100644
--- a/arch/sh/kernel/process_32.c
+++ b/arch/sh/kernel/process_32.c
@@ -93,7 +93,7 @@ asmlinkage void ret_from_fork(void);
 asmlinkage void ret_from_kernel_thread(void);
 
 int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
-		struct task_struct *p, unsigned long tls)
+		struct task_struct *p, unsigned long tls, u32 worker_flags)
 {
 	struct thread_info *ti = task_thread_info(p);
 	struct pt_regs *childregs;
diff --git a/arch/sparc/kernel/process_32.c b/arch/sparc/kernel/process_32.c
index bbbe0cfef746..6e04cfc64b99 100644
--- a/arch/sparc/kernel/process_32.c
+++ b/arch/sparc/kernel/process_32.c
@@ -260,7 +260,7 @@ extern void ret_from_fork(void);
 extern void ret_from_kernel_thread(void);
 
 int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
-		struct task_struct *p, unsigned long tls)
+		struct task_struct *p, unsigned long tls, u32 worker_flags)
 {
 	struct thread_info *ti = task_thread_info(p);
 	struct pt_regs *childregs, *regs = current_pt_regs();
diff --git a/arch/sparc/kernel/process_64.c b/arch/sparc/kernel/process_64.c
index d1cc410d2f64..b339eaa1f890 100644
--- a/arch/sparc/kernel/process_64.c
+++ b/arch/sparc/kernel/process_64.c
@@ -575,7 +575,7 @@ void fault_in_user_windows(struct pt_regs *regs)
  * Child  -->  %o0 == parents pid, %o1 == 1
  */
 int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
-		struct task_struct *p, unsigned long tls)
+		struct task_struct *p, unsigned long tls, u32 worker_flags)
 {
 	struct thread_info *t = task_thread_info(p);
 	struct pt_regs *regs = current_pt_regs();
diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index 457a38db368b..0815a43b9f4a 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -154,7 +154,8 @@ void fork_handler(void)
 }
 
 int copy_thread(unsigned long clone_flags, unsigned long sp,
-		unsigned long arg, struct task_struct * p, unsigned long tls)
+		unsigned long arg, struct task_struct *p, unsigned long tls,
+		u32 worker_flags)
 {
 	void (*handler)(void);
 	int kthread = current->flags & (PF_KTHREAD | PF_IO_WORKER);
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 1d9463e3096b..20d9bab61b14 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -117,7 +117,7 @@ static int set_new_tls(struct task_struct *p, unsigned long tls)
 }
 
 int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
-		struct task_struct *p, unsigned long tls)
+		struct task_struct *p, unsigned long tls, u32 worker_flags)
 {
 	struct inactive_task_frame *frame;
 	struct fork_frame *fork_frame;
diff --git a/arch/xtensa/kernel/process.c b/arch/xtensa/kernel/process.c
index 060165340612..a0ad9f0cc0cf 100644
--- a/arch/xtensa/kernel/process.c
+++ b/arch/xtensa/kernel/process.c
@@ -203,7 +203,7 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 
 int copy_thread(unsigned long clone_flags, unsigned long usp_thread_fn,
 		unsigned long thread_fn_arg, struct task_struct *p,
-		unsigned long tls)
+		unsigned long tls, u32 worker_flags)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 48417c735438..ffc7c6a384ad 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -68,7 +68,7 @@ extern void fork_init(void);
 extern void release_task(struct task_struct * p);
 
 extern int copy_thread(unsigned long, unsigned long, unsigned long,
-		       struct task_struct *, unsigned long);
+		       struct task_struct *, unsigned long, u32);
 
 extern void flush_thread(void);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 3988106e9609..3c3624786e4d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2200,7 +2200,8 @@ static __latent_entropy struct task_struct *copy_process(
 	retval = copy_io(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_namespaces;
-	retval = copy_thread(clone_flags, args->stack, args->stack_size, p, args->tls);
+	retval = copy_thread(clone_flags, args->stack, args->stack_size, p,
+			     args->tls, args->worker_flags);
 	if (retval)
 		goto bad_fork_cleanup_io;
 
-- 
2.25.1



* [PATCH V3 3/9] fork: move PF_IO_WORKER's kernel frame setup to new flag
  2021-10-04 19:21 [PATCH V3 0/9] Use copy_process/create_io_thread in vhost layer Mike Christie
  2021-10-04 19:21 ` [PATCH V3 1/9] fork: Make IO worker options flag based Mike Christie
  2021-10-04 19:21 ` [PATCH V3 2/9] fork: pass worker_flags to copy_thread Mike Christie
@ 2021-10-04 19:21 ` Mike Christie
  2021-10-04 19:21 ` [PATCH V3 4/9] fork: add option to not clone or dup files Mike Christie
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Mike Christie @ 2021-10-04 19:21 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel
  Cc: Mike Christie

The vhost worker threads need the same frame setup as io_uring's worker
threads, but handle signals differently and do not need the same
scheduling behavior. This patch separates the frame setup parts of
PF_IO_WORKER into a kernel_clone_args flag, KERN_WORKER_USER.
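
To make the change to the per-arch check easier to follow, here is a
small userspace sketch of the old and new predicates. The DEMO_* bit
values are placeholders chosen for illustration and are not the kernel's
definitions; the helper names are hypothetical as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values, for illustration only. */
#define DEMO_PF_KTHREAD		0x1u	/* stands in for PF_KTHREAD */
#define DEMO_PF_IO_WORKER	0x2u	/* stands in for PF_IO_WORKER */
#define DEMO_KERN_WORKER_IO	0x1u	/* stands in for KERN_WORKER_IO */
#define DEMO_KERN_WORKER_USER	0x2u	/* stands in for KERN_WORKER_USER */

/* Old check: the task flags alone select the kernel thread frame setup. */
static bool demo_kernel_frame_old(uint32_t task_flags)
{
	return (task_flags & (DEMO_PF_KTHREAD | DEMO_PF_IO_WORKER)) != 0;
}

/*
 * New check: PF_KTHREAD still selects it, but for other workers the
 * decision comes from the worker_flags argument, not from PF_IO_WORKER.
 */
static bool demo_kernel_frame_new(uint32_t task_flags, uint32_t worker_flags)
{
	return (task_flags & DEMO_PF_KTHREAD) ||
	       (worker_flags & DEMO_KERN_WORKER_USER);
}

/* Quick self-check of the cases discussed in the commit message. */
static void demo_self_check(void)
{
	/* A worker asking for the frame setup gets it without PF_IO_WORKER. */
	assert(demo_kernel_frame_new(0, DEMO_KERN_WORKER_USER));
	/* PF_IO_WORKER on its own no longer selects the frame setup. */
	assert(demo_kernel_frame_old(DEMO_PF_IO_WORKER));
	assert(!demo_kernel_frame_new(DEMO_PF_IO_WORKER, 0));
}
```

The arm and arm64 hunks use the negated form of the same predicate inside
likely(), since those architectures test for the userspace case first.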

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 arch/alpha/kernel/process.c      | 3 ++-
 arch/arc/kernel/process.c        | 3 ++-
 arch/arm/kernel/process.c        | 3 ++-
 arch/arm64/kernel/process.c      | 3 ++-
 arch/csky/kernel/process.c       | 3 ++-
 arch/h8300/kernel/process.c      | 3 ++-
 arch/hexagon/kernel/process.c    | 3 ++-
 arch/ia64/kernel/process.c       | 3 ++-
 arch/m68k/kernel/process.c       | 3 ++-
 arch/microblaze/kernel/process.c | 3 ++-
 arch/mips/kernel/process.c       | 3 ++-
 arch/nds32/kernel/process.c      | 3 ++-
 arch/nios2/kernel/process.c      | 3 ++-
 arch/openrisc/kernel/process.c   | 3 ++-
 arch/parisc/kernel/process.c     | 3 ++-
 arch/powerpc/kernel/process.c    | 3 ++-
 arch/riscv/kernel/process.c      | 3 ++-
 arch/s390/kernel/process.c       | 3 ++-
 arch/sh/kernel/process_32.c      | 3 ++-
 arch/sparc/kernel/process_32.c   | 3 ++-
 arch/sparc/kernel/process_64.c   | 3 ++-
 arch/um/kernel/process.c         | 3 ++-
 arch/x86/kernel/process.c        | 4 ++--
 arch/xtensa/kernel/process.c     | 3 ++-
 include/linux/sched/task.h       | 1 +
 kernel/fork.c                    | 2 +-
 26 files changed, 50 insertions(+), 26 deletions(-)

diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index 6005b0dfe7e2..e9b2dde444f4 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -249,7 +249,8 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	childti->pcb.ksp = (unsigned long) childstack;
 	childti->pcb.flags = 1;	/* set FEN, clear everything else */
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely((p->flags & PF_KTHREAD) ||
+		     (worker_flags & KERN_WORKER_USER))) {
 		/* kernel thread */
 		memset(childstack, 0,
 			sizeof(struct switch_stack) + sizeof(struct pt_regs));
diff --git a/arch/arc/kernel/process.c b/arch/arc/kernel/process.c
index 4e307e5b5205..2caa80fb9e9c 100644
--- a/arch/arc/kernel/process.c
+++ b/arch/arc/kernel/process.c
@@ -191,7 +191,8 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	childksp[0] = 0;			/* fp */
 	childksp[1] = (unsigned long)ret_from_fork; /* blink */
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely((p->flags & PF_KTHREAD) ||
+		     (worker_flags & KERN_WORKER_USER))) {
 		memset(c_regs, 0, sizeof(struct pt_regs));
 
 		c_callee->r13 = kthread_arg;
diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index 07ae4444b6ab..9f41435d78d9 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -248,7 +248,8 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
 	thread->cpu_domain = get_domain();
 #endif
 
-	if (likely(!(p->flags & (PF_KTHREAD | PF_IO_WORKER)))) {
+	if (likely(!((p->flags & PF_KTHREAD) ||
+		     (worker_flags & KERN_WORKER_USER)))) {
 		*childregs = *current_pt_regs();
 		childregs->ARM_r0 = 0;
 		if (stack_start)
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 7979ec253c29..d149de03bd50 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -334,7 +334,8 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
 
 	ptrauth_thread_init_kernel(p);
 
-	if (likely(!(p->flags & (PF_KTHREAD | PF_IO_WORKER)))) {
+	if (likely(!((p->flags & PF_KTHREAD) ||
+		     (worker_flags & KERN_WORKER_USER)))) {
 		*childregs = *current_pt_regs();
 		childregs->regs[0] = 0;
 
diff --git a/arch/csky/kernel/process.c b/arch/csky/kernel/process.c
index f38b668515ae..10debf43375e 100644
--- a/arch/csky/kernel/process.c
+++ b/arch/csky/kernel/process.c
@@ -50,7 +50,8 @@ int copy_thread(unsigned long clone_flags,
 	/* setup thread.sp for switch_to !!! */
 	p->thread.sp = (unsigned long)childstack;
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely((p->flags & PF_KTHREAD) ||
+		     (worker_flags & KERN_WORKER_USER))) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		childstack->r15 = (unsigned long) ret_from_kernel_thread;
 		childstack->r10 = kthread_arg;
diff --git a/arch/h8300/kernel/process.c b/arch/h8300/kernel/process.c
index 9a8f6c033ad1..e0d69c3afa2a 100644
--- a/arch/h8300/kernel/process.c
+++ b/arch/h8300/kernel/process.c
@@ -113,7 +113,8 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 
 	childregs = (struct pt_regs *) (THREAD_SIZE + task_stack_page(p)) - 1;
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely((p->flags & PF_KTHREAD) ||
+		     (worker_flags & KERN_WORKER_USER))) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		childregs->retpc = (unsigned long) ret_from_kernel_thread;
 		childregs->er4 = topstk; /* arg */
diff --git a/arch/hexagon/kernel/process.c b/arch/hexagon/kernel/process.c
index 664367be55e5..9ea473567a5c 100644
--- a/arch/hexagon/kernel/process.c
+++ b/arch/hexagon/kernel/process.c
@@ -73,7 +73,8 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 						    sizeof(*ss));
 	ss->lr = (unsigned long)ret_from_fork;
 	p->thread.switch_sp = ss;
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely((p->flags & PF_KTHREAD) ||
+		     (worker_flags & KERN_WORKER_USER))) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		/* r24 <- fn, r25 <- arg */
 		ss->r24 = usp;
diff --git a/arch/ia64/kernel/process.c b/arch/ia64/kernel/process.c
index a69cc33b5e32..d7c47ea12703 100644
--- a/arch/ia64/kernel/process.c
+++ b/arch/ia64/kernel/process.c
@@ -339,7 +339,8 @@ copy_thread(unsigned long clone_flags, unsigned long user_stack_base,
 
 	ia64_drop_fpu(p);	/* don't pick up stale state from a CPU's fph */
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely((p->flags & PF_KTHREAD) ||
+		     (worker_flags & KERN_WORKER_USER))) {
 		if (unlikely(!user_stack_base)) {
 			/* fork_idle() called us */
 			return 0;
diff --git a/arch/m68k/kernel/process.c b/arch/m68k/kernel/process.c
index 7587291793fb..a842e6c7bef2 100644
--- a/arch/m68k/kernel/process.c
+++ b/arch/m68k/kernel/process.c
@@ -157,7 +157,8 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 	 */
 	p->thread.fc = USER_DATA;
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely((p->flags & PF_KTHREAD) ||
+		     (worker_flags & KERN_WORKER_USER))) {
 		/* kernel thread */
 		memset(frame, 0, sizeof(struct fork_frame));
 		frame->regs.sr = PS_S;
diff --git a/arch/microblaze/kernel/process.c b/arch/microblaze/kernel/process.c
index b8eb544e1fd6..ba1a45842a70 100644
--- a/arch/microblaze/kernel/process.c
+++ b/arch/microblaze/kernel/process.c
@@ -59,7 +59,8 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 	struct pt_regs *childregs = task_pt_regs(p);
 	struct thread_info *ti = task_thread_info(p);
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely((p->flags & PF_KTHREAD) ||
+		     (worker_flags & KERN_WORKER_USER))) {
 		/* if we're creating a new kernel thread then just zeroing all
 		 * the registers. That's OK for a brand new thread.*/
 		memset(childregs, 0, sizeof(struct pt_regs));
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index d494e1d76e71..b592d1bfab09 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -120,7 +120,8 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	/*  Put the stack after the struct pt_regs.  */
 	childksp = (unsigned long) childregs;
 	p->thread.cp0_status = (read_c0_status() & ~(ST0_CU2|ST0_CU1)) | ST0_KERNEL_CUMASK;
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely((p->flags & PF_KTHREAD) ||
+		     (worker_flags & KERN_WORKER_USER))) {
 		/* kernel thread */
 		unsigned long status = p->thread.cp0_status;
 		memset(childregs, 0, sizeof(struct pt_regs));
diff --git a/arch/nds32/kernel/process.c b/arch/nds32/kernel/process.c
index 1ca8900f9d07..8ec5b725842b 100644
--- a/arch/nds32/kernel/process.c
+++ b/arch/nds32/kernel/process.c
@@ -157,7 +157,8 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
 
 	memset(&p->thread.cpu_context, 0, sizeof(struct cpu_context));
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely((p->flags & PF_KTHREAD) ||
+		     (worker_flags & KERN_WORKER_USER))) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		/* kernel thread fn */
 		p->thread.cpu_context.r6 = stack_start;
diff --git a/arch/nios2/kernel/process.c b/arch/nios2/kernel/process.c
index b49dc6500118..e22b83b64769 100644
--- a/arch/nios2/kernel/process.c
+++ b/arch/nios2/kernel/process.c
@@ -109,7 +109,8 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 	struct switch_stack *childstack =
 		((struct switch_stack *)childregs) - 1;
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely((p->flags & PF_KTHREAD) ||
+		     (worker_flags & KERN_WORKER_USER))) {
 		memset(childstack, 0,
 			sizeof(struct switch_stack) + sizeof(struct pt_regs));
 
diff --git a/arch/openrisc/kernel/process.c b/arch/openrisc/kernel/process.c
index 7b356a9a8dc7..684ef1f0999c 100644
--- a/arch/openrisc/kernel/process.c
+++ b/arch/openrisc/kernel/process.c
@@ -173,7 +173,8 @@ copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 	sp -= sizeof(struct pt_regs);
 	kregs = (struct pt_regs *)sp;
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely((p->flags & PF_KTHREAD) ||
+		     (worker_flags & KERN_WORKER_USER))) {
 		memset(kregs, 0, sizeof(struct pt_regs));
 		kregs->gpr[20] = usp; /* fn, kernel thread */
 		kregs->gpr[22] = arg;
diff --git a/arch/parisc/kernel/process.c b/arch/parisc/kernel/process.c
index d9555ccf1e9c..1c955e6bad83 100644
--- a/arch/parisc/kernel/process.c
+++ b/arch/parisc/kernel/process.c
@@ -198,7 +198,8 @@ copy_thread(unsigned long clone_flags, unsigned long usp,
 	extern void * const ret_from_kernel_thread;
 	extern void * const child_return;
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely((p->flags & PF_KTHREAD) ||
+		     (worker_flags & KERN_WORKER_USER))) {
 		/* kernel thread */
 		memset(cregs, 0, sizeof(struct pt_regs));
 		if (!usp) /* idle thread */
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index d2f2301b0ad1..097f13b43a8f 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1700,7 +1700,8 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	/* Copy registers */
 	sp -= sizeof(struct pt_regs);
 	childregs = (struct pt_regs *) sp;
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely((p->flags & PF_KTHREAD) ||
+		     (worker_flags & KERN_WORKER_USER))) {
 		/* kernel thread */
 		memset(childregs, 0, sizeof(struct pt_regs));
 		childregs->gpr[1] = sp + sizeof(struct pt_regs);
diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
index 3d0e6390f34c..39bb4a79be15 100644
--- a/arch/riscv/kernel/process.c
+++ b/arch/riscv/kernel/process.c
@@ -125,7 +125,8 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 	struct pt_regs *childregs = task_pt_regs(p);
 
 	/* p->thread holds context to be restored by __switch_to() */
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely((p->flags & PF_KTHREAD) ||
+		     (worker_flags & KERN_WORKER_USER))) {
 		/* Kernel thread */
 		memset(childregs, 0, sizeof(struct pt_regs));
 		childregs->gp = gp_in_global;
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 01b969bb868e..29ba92911340 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -131,7 +131,8 @@ int copy_thread(unsigned long clone_flags, unsigned long new_stackp,
 	frame->sf.gprs[9] = (unsigned long)frame;
 
 	/* Store access registers to kernel stack of new process. */
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely((p->flags & PF_KTHREAD) ||
+		     (worker_flags & KERN_WORKER_USER))) {
 		/* kernel thread */
 		memset(&frame->childregs, 0, sizeof(struct pt_regs));
 		frame->childregs.psw.mask = PSW_KERNEL_BITS | PSW_MASK_DAT |
diff --git a/arch/sh/kernel/process_32.c b/arch/sh/kernel/process_32.c
index d199805552c0..8cbd7404df40 100644
--- a/arch/sh/kernel/process_32.c
+++ b/arch/sh/kernel/process_32.c
@@ -114,7 +114,8 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 
 	childregs = task_pt_regs(p);
 	p->thread.sp = (unsigned long) childregs;
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely((p->flags & PF_KTHREAD) ||
+		     (worker_flags & KERN_WORKER_USER))) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		p->thread.pc = (unsigned long) ret_from_kernel_thread;
 		childregs->regs[4] = arg;
diff --git a/arch/sparc/kernel/process_32.c b/arch/sparc/kernel/process_32.c
index 6e04cfc64b99..2522283a63ac 100644
--- a/arch/sparc/kernel/process_32.c
+++ b/arch/sparc/kernel/process_32.c
@@ -296,7 +296,8 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 	ti->ksp = (unsigned long) new_stack;
 	p->thread.kregs = childregs;
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely((p->flags & PF_KTHREAD) ||
+		     (worker_flags & KERN_WORKER_USER))) {
 		extern int nwindows;
 		unsigned long psr;
 		memset(new_stack, 0, STACKFRAME_SZ + TRACEREG_SZ);
diff --git a/arch/sparc/kernel/process_64.c b/arch/sparc/kernel/process_64.c
index b339eaa1f890..a157474c970f 100644
--- a/arch/sparc/kernel/process_64.c
+++ b/arch/sparc/kernel/process_64.c
@@ -594,7 +594,8 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 				       sizeof(struct sparc_stackf));
 	t->fpsaved[0] = 0;
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely((p->flags & PF_KTHREAD) ||
+		     (worker_flags & KERN_WORKER_USER))) {
 		memset(child_trap_frame, 0, child_stack_sz);
 		__thread_flag_byte_ptr(t)[TI_FLAG_BYTE_CWP] = 
 			(current_pt_regs()->tstate + 1) & TSTATE_CWP;
diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index 0815a43b9f4a..28e5c9f67436 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -158,7 +158,8 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
 		u32 worker_flags)
 {
 	void (*handler)(void);
-	int kthread = current->flags & (PF_KTHREAD | PF_IO_WORKER);
+	int kthread = (current->flags & PF_KTHREAD) ||
+		      (worker_flags & KERN_WORKER_USER);
 	int ret = 0;
 
 	p->thread = (struct thread_struct) INIT_THREAD;
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 20d9bab61b14..a904f5524d73 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -178,9 +178,9 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 	task_user_gs(p) = get_user_gs(current_pt_regs());
 #endif
 
-	if (unlikely(p->flags & PF_IO_WORKER)) {
+	if (unlikely(worker_flags & KERN_WORKER_USER)) {
 		/*
-		 * An IO thread is a user space thread, but it doesn't
+		 * A user worker thread is a user space thread, but it doesn't
 		 * return to ret_after_fork().
 		 *
 		 * In order to indicate that to tools like gdb,
diff --git a/arch/xtensa/kernel/process.c b/arch/xtensa/kernel/process.c
index a0ad9f0cc0cf..0af51e94c8dc 100644
--- a/arch/xtensa/kernel/process.c
+++ b/arch/xtensa/kernel/process.c
@@ -217,7 +217,8 @@ int copy_thread(unsigned long clone_flags, unsigned long usp_thread_fn,
 
 	p->thread.sp = (unsigned long)childregs;
 
-	if (!(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (!((p->flags & PF_KTHREAD) ||
+	      (worker_flags & KERN_WORKER_USER))) {
 		struct pt_regs *regs = current_pt_regs();
 		unsigned long usp = usp_thread_fn ?
 			usp_thread_fn : regs->areg[1];
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ffc7c6a384ad..cf7c9fffc839 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -19,6 +19,7 @@ struct css_set;
 #define CLONE_LEGACY_FLAGS 0xffffffffULL
 
 #define KERN_WORKER_IO		BIT(0)
+#define KERN_WORKER_USER	BIT(1)
 
 struct kernel_clone_args {
 	u64 flags;
diff --git a/kernel/fork.c b/kernel/fork.c
index 3c3624786e4d..4b0e8257993b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2527,7 +2527,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
 		.stack		= (unsigned long)fn,
 		.stack_size	= (unsigned long)arg,
-		.worker_flags	= KERN_WORKER_IO,
+		.worker_flags	= KERN_WORKER_IO | KERN_WORKER_USER,
 	};
 
 	return copy_process(NULL, 0, node, &args);
-- 
2.25.1


* [PATCH V3 4/9] fork: add option to not clone or dup files
  2021-10-04 19:21 [PATCH V3 0/9] Use copy_process/create_io_thread in vhost layer Mike Christie
                   ` (2 preceding siblings ...)
  2021-10-04 19:21 ` [PATCH V3 3/9] fork: move PF_IO_WORKER's kernel frame setup to new flag Mike Christie
@ 2021-10-04 19:21 ` Mike Christie
  2021-10-04 19:21 ` [PATCH V3 5/9] fork: add helper to clone a process Mike Christie
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Mike Christie @ 2021-10-04 19:21 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel
  Cc: Mike Christie

Each vhost device gets a thread that is used to perform IO and management
operations. Rather than being a thread that accesses a device, the thread
is part of the device, so when the vhost layer calls the kernel_worker()
function added in the next patch we can't dup or clone the parent's
files/FDs because doing so would take an extra reference on ourselves.

Later, when we do:

Qemu process exits:
        do_exit -> exit_files -> put_files_struct -> close_files

we would leak the device's resources because of that extra refcount
on the fd or file_struct.

This patch adds a no_files option so these worker threads can avoid
taking an extra refcount on themselves.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
---
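Note (not part of the commit message): the refcount problem described
above can be modeled in userspace. This is only a sketch under stated
assumptions. All model_* names are hypothetical, only the CLONE_FILES
sharing path is modeled, and "device_released" stands in for the device
release that happens when the last files reference is dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for files_struct plus the device it keeps alive. */
struct model_files {
	int count;		/* models files_struct->count */
	bool device_released;	/* set when the last reference is dropped */
};

struct model_task {
	struct model_files *files;
};

/* Models copy_files() with CLONE_FILES, plus the no_files short-circuit. */
static void model_copy_files(struct model_task *child,
			     struct model_files *oldf, bool no_files)
{
	if (no_files) {
		child->files = NULL;	/* worker holds no reference */
		return;
	}
	oldf->count++;			/* models atomic_inc(&oldf->count) */
	child->files = oldf;
}

/* Models exit_files(): drop the reference, release on the last put. */
static void model_exit_files(struct model_task *tsk)
{
	struct model_files *files = tsk->files;

	if (!files)
		return;
	tsk->files = NULL;
	assert(files->count > 0);
	if (--files->count == 0)
		files->device_released = true;
}

/* Returns true if the device is released once the owner exits. */
bool model_owner_exit_releases_device(bool no_files)
{
	struct model_files files = { .count = 1, .device_released = false };
	struct model_task owner = { .files = &files };
	struct model_task worker = { .files = NULL };

	model_copy_files(&worker, &files, no_files);
	/* The worker is only stopped from the release path, so only the owner exits. */
	model_exit_files(&owner);
	return files.device_released;
}
```

In this model, model_owner_exit_releases_device(false) returns false
because the worker's reference keeps the count above zero, while passing
true lets the owner's exit drop the last reference.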
 include/linux/sched/task.h |  1 +
 kernel/fork.c              | 11 +++++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index cf7c9fffc839..e165cc67fd3c 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -20,6 +20,7 @@ struct css_set;
 
 #define KERN_WORKER_IO		BIT(0)
 #define KERN_WORKER_USER	BIT(1)
+#define KERN_WORKER_NO_FILES	BIT(2)
 
 struct kernel_clone_args {
 	u64 flags;
diff --git a/kernel/fork.c b/kernel/fork.c
index 4b0e8257993b..98264cf1d6a6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1532,7 +1532,8 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
 	return 0;
 }
 
-static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
+static int copy_files(unsigned long clone_flags, struct task_struct *tsk,
+		      int no_files)
 {
 	struct files_struct *oldf, *newf;
 	int error = 0;
@@ -1544,6 +1545,11 @@ static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
 	if (!oldf)
 		goto out;
 
+	if (no_files) {
+		tsk->files = NULL;
+		goto out;
+	}
+
 	if (clone_flags & CLONE_FILES) {
 		atomic_inc(&oldf->count);
 		goto out;
@@ -2179,7 +2185,8 @@ static __latent_entropy struct task_struct *copy_process(
 	retval = copy_semundo(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_security;
-	retval = copy_files(clone_flags, p);
+	retval = copy_files(clone_flags, p,
+			    args->worker_flags & KERN_WORKER_NO_FILES);
 	if (retval)
 		goto bad_fork_cleanup_semundo;
 	retval = copy_fs(clone_flags, p);
-- 
2.25.1


* [PATCH V3 5/9] fork: add helper to clone a process
  2021-10-04 19:21 [PATCH V3 0/9] Use copy_process/create_io_thread in vhost layer Mike Christie
                   ` (3 preceding siblings ...)
  2021-10-04 19:21 ` [PATCH V3 4/9] fork: add option to not clone or dup files Mike Christie
@ 2021-10-04 19:21 ` Mike Christie
  2021-10-04 20:29   ` Jens Axboe
  2021-10-05 12:50   ` Christian Brauner
  2021-10-04 19:21 ` [PATCH V3 6/9] io_uring: switch to kernel_worker Mike Christie
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 17+ messages in thread
From: Mike Christie @ 2021-10-04 19:21 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel
  Cc: Mike Christie

The vhost layer has requirements similar to io_uring's: its worker
threads need to access the userspace thread's memory, should inherit the
parent's cgroups and namespaces, and should be checked against the parent's
RLIMITs. Right now, the vhost layer uses the kthread API, which has
kthread_use_mm for memory access, and those threads can use
cgroup_attach_task_all for v1 cgroups, but there are no helpers for the
other items.

This adds a helper to clone a process so we can inherit everything we
want in one call. It's a more generic version of create_io_thread which
will be used by the vhost layer and io_uring in later patches in this set.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
---
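Note (not part of the commit message): a userspace sketch of how
kernel_worker() derives the kernel_clone_args fields from its arguments.
The MODEL_* constants and model_* names are hypothetical. The CLONE and
CSIGNAL values are copied from include/uapi/linux/sched.h and the
KERN_WORKER bits from this series.

```c
#include <assert.h>
#include <stdint.h>

/* Values from include/uapi/linux/sched.h, prefixed to avoid clashes. */
#define MODEL_CSIGNAL		0x000000ffULL
#define MODEL_CLONE_VM		0x00000100ULL
#define MODEL_CLONE_UNTRACED	0x00800000ULL

/* Values from include/linux/sched/task.h in this series. */
#define MODEL_KERN_WORKER_IO	(1U << 0)
#define MODEL_KERN_WORKER_USER	(1U << 1)

/* Only the fields that kernel_worker() derives from its arguments. */
struct model_clone_args {
	uint64_t flags;
	int exit_signal;
	uint32_t worker_flags;
};

struct model_clone_args model_kernel_worker_args(unsigned long clone_flags,
						 uint32_t worker_flags)
{
	uint64_t low = (uint32_t)clone_flags;	/* models lower_32_bits() */
	struct model_clone_args args = {
		/* CLONE_VM and CLONE_UNTRACED are always forced on. */
		.flags = (low | MODEL_CLONE_VM | MODEL_CLONE_UNTRACED) &
			 ~MODEL_CSIGNAL,
		/* The low byte carries the exit signal, not a CLONE flag. */
		.exit_signal = (int)(low & MODEL_CSIGNAL),
		/* Every kernel_worker() task is a user worker. */
		.worker_flags = MODEL_KERN_WORKER_USER | worker_flags,
	};

	assert((args.flags & MODEL_CSIGNAL) == 0);
	return args;
}
```

The point of the split is that callers can pass an exit signal in the low
byte of clone_flags without it leaking into .flags.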
 include/linux/sched/task.h |  6 ++++-
 kernel/fork.c              | 48 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index e165cc67fd3c..ba0499b6627c 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -87,7 +87,11 @@ extern void exit_files(struct task_struct *);
 extern void exit_itimers(struct signal_struct *);
 
 extern pid_t kernel_clone(struct kernel_clone_args *kargs);
-struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
+struct task_struct *create_io_thread(int (*fn)(void *i), void *arg, int node);
+struct task_struct *kernel_worker(int (*fn)(void *), void *arg, int node,
+				  unsigned long clone_flags, u32 worker_flags);
+__printf(2, 3)
+void kernel_worker_start(struct task_struct *tsk, const char namefmt[], ...);
 struct task_struct *fork_idle(int);
 struct mm_struct *copy_init_mm(void);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
diff --git a/kernel/fork.c b/kernel/fork.c
index 98264cf1d6a6..3f3fcabffa5f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2540,6 +2540,54 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 	return copy_process(NULL, 0, node, &args);
 }
 
+/**
+ * kernel_worker - create a copy of a process to be used by the kernel
+ * @fn: function to run in the new thread
+ * @arg: data to be passed to fn
+ * @node: numa node to allocate task from
+ * @clone_flags: CLONE flags
+ * @worker_flags: KERN_WORKER flags
+ *
+ * This returns a created task, or an error pointer. The returned task is
+ * inactive, and the caller must fire it up through kernel_worker_start(). If
+ * this is a PF_IO_WORKER, all signals but KILL and STOP are blocked.
+ */
+struct task_struct *kernel_worker(int (*fn)(void *), void *arg, int node,
+				  unsigned long clone_flags, u32 worker_flags)
+{
+	struct kernel_clone_args args = {
+		.flags		= ((lower_32_bits(clone_flags) | CLONE_VM |
+				   CLONE_UNTRACED) & ~CSIGNAL),
+		.exit_signal	= (lower_32_bits(clone_flags) & CSIGNAL),
+		.stack		= (unsigned long)fn,
+		.stack_size	= (unsigned long)arg,
+		.worker_flags	= KERN_WORKER_USER | worker_flags,
+	};
+
+	return copy_process(NULL, 0, node, &args);
+}
+EXPORT_SYMBOL_GPL(kernel_worker);
+
+/**
+ * kernel_worker_start - Start a task created with kernel_worker
+ * @tsk: task to wake up
+ * @namefmt: printf-style format string for the thread name
+ * @...: arguments for @namefmt
+ */
+void kernel_worker_start(struct task_struct *tsk, const char namefmt[], ...)
+{
+	char name[TASK_COMM_LEN];
+	va_list args;
+
+	va_start(args, namefmt);
+	vsnprintf(name, sizeof(name), namefmt, args);
+	set_task_comm(tsk, name);
+	va_end(args);
+
+	wake_up_new_task(tsk);
+}
+EXPORT_SYMBOL_GPL(kernel_worker_start);
+
 /*
  *  Ok, this is the main fork-routine.
  *
-- 
2.25.1


* [PATCH V3 6/9] io_uring: switch to kernel_worker
  2021-10-04 19:21 [PATCH V3 0/9] Use copy_process/create_io_thread in vhost layer Mike Christie
                   ` (4 preceding siblings ...)
  2021-10-04 19:21 ` [PATCH V3 5/9] fork: add helper to clone a process Mike Christie
@ 2021-10-04 19:21 ` Mike Christie
  2021-10-04 20:30   ` Jens Axboe
  2021-10-04 19:21 ` [PATCH V3 7/9] fork: Add worker flag to ignore signals Mike Christie
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 17+ messages in thread
From: Mike Christie @ 2021-10-04 19:21 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel
  Cc: Mike Christie

Convert io_uring and io-wq to use kernel_worker.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
---
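Note (not part of the commit message): the thread naming that moves out
of io_wqe_worker() and io_sq_thread() and into kernel_worker_start() can
be sketched in userspace as below. model_worker_name() and
MODEL_TASK_COMM_LEN are hypothetical names. The sketch only covers the
vsnprintf() step, not set_task_comm() or wake_up_new_task().

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MODEL_TASK_COMM_LEN 16	/* same value as the kernel's TASK_COMM_LEN */

/*
 * Models the formatting step of kernel_worker_start(). Returns the length
 * of the name actually stored, which is at most MODEL_TASK_COMM_LEN - 1
 * because vsnprintf() truncates and always NUL-terminates.
 */
size_t model_worker_name(char out[MODEL_TASK_COMM_LEN],
			 const char *namefmt, ...)
{
	va_list args;

	assert(out && namefmt);
	va_start(args, namefmt);
	vsnprintf(out, MODEL_TASK_COMM_LEN, namefmt, args);
	va_end(args);

	return strlen(out);
}
```

With the formats used in this patch, model_worker_name(buf, "iou-wrk-%d",
1234) stores "iou-wrk-1234", the same name the removed snprintf() calls
would have produced for that pid.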
 fs/io-wq.c                 | 15 ++++++++-------
 fs/io_uring.c              | 11 +++++------
 include/linux/sched/task.h |  1 -
 3 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 5bf8aa81715e..a31c260d39c3 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -69,6 +69,9 @@ struct io_worker {
 
 #define IO_WQ_NR_HASH_BUCKETS	(1u << IO_WQ_HASH_ORDER)
 
+#define IO_WQ_CLONE_FLAGS	(CLONE_FS | CLONE_FILES | CLONE_SIGHAND | \
+				 CLONE_THREAD | CLONE_IO)
+
 struct io_wqe_acct {
 	unsigned nr_workers;
 	unsigned max_workers;
@@ -549,13 +552,9 @@ static int io_wqe_worker(void *data)
 	struct io_wqe *wqe = worker->wqe;
 	struct io_wq *wq = wqe->wq;
 	bool last_timeout = false;
-	char buf[TASK_COMM_LEN];
 
 	worker->flags |= (IO_WORKER_F_UP | IO_WORKER_F_RUNNING);
 
-	snprintf(buf, sizeof(buf), "iou-wrk-%d", wq->task->pid);
-	set_task_comm(current, buf);
-
 	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
 		long ret;
 
@@ -650,7 +649,7 @@ static void io_init_new_worker(struct io_wqe *wqe, struct io_worker *worker,
 	list_add_tail_rcu(&worker->all_list, &wqe->all_list);
 	worker->flags |= IO_WORKER_F_FREE;
 	raw_spin_unlock(&wqe->lock);
-	wake_up_new_task(tsk);
+	kernel_worker_start(tsk, "iou-wrk-%d", wqe->wq->task->pid);
 }
 
 static bool io_wq_work_match_all(struct io_wq_work *work, void *data)
@@ -680,7 +679,8 @@ static void create_worker_cont(struct callback_head *cb)
 	worker = container_of(cb, struct io_worker, create_work);
 	clear_bit_unlock(0, &worker->create_state);
 	wqe = worker->wqe;
-	tsk = create_io_thread(io_wqe_worker, worker, wqe->node);
+	tsk = kernel_worker(io_wqe_worker, worker, wqe->node,
+			    IO_WQ_CLONE_FLAGS, KERN_WORKER_IO);
 	if (!IS_ERR(tsk)) {
 		io_init_new_worker(wqe, worker, tsk);
 		io_worker_release(worker);
@@ -750,7 +750,8 @@ static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
 	if (index == IO_WQ_ACCT_BOUND)
 		worker->flags |= IO_WORKER_F_BOUND;
 
-	tsk = create_io_thread(io_wqe_worker, worker, wqe->node);
+	tsk = kernel_worker(io_wqe_worker, worker, wqe->node, IO_WQ_CLONE_FLAGS,
+			    KERN_WORKER_IO);
 	if (!IS_ERR(tsk)) {
 		io_init_new_worker(wqe, worker, tsk);
 	} else if (!io_should_retry_thread(PTR_ERR(tsk))) {
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 6b9e70208782..a5970fc182b7 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7401,12 +7401,8 @@ static int io_sq_thread(void *data)
 	struct io_sq_data *sqd = data;
 	struct io_ring_ctx *ctx;
 	unsigned long timeout = 0;
-	char buf[TASK_COMM_LEN];
 	DEFINE_WAIT(wait);
 
-	snprintf(buf, sizeof(buf), "iou-sqp-%d", sqd->task_pid);
-	set_task_comm(current, buf);
-
 	if (sqd->sq_cpu != -1)
 		set_cpus_allowed_ptr(current, cpumask_of(sqd->sq_cpu));
 	else
@@ -8626,6 +8622,8 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 		fdput(f);
 	}
 	if (ctx->flags & IORING_SETUP_SQPOLL) {
+		unsigned long flags = CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
+					CLONE_THREAD | CLONE_IO;
 		struct task_struct *tsk;
 		struct io_sq_data *sqd;
 		bool attached;
@@ -8667,7 +8665,8 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 
 		sqd->task_pid = current->pid;
 		sqd->task_tgid = current->tgid;
-		tsk = create_io_thread(io_sq_thread, sqd, NUMA_NO_NODE);
+		tsk = kernel_worker(io_sq_thread, sqd, NUMA_NO_NODE,
+				    flags, KERN_WORKER_IO);
 		if (IS_ERR(tsk)) {
 			ret = PTR_ERR(tsk);
 			goto err_sqpoll;
@@ -8675,7 +8674,7 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 
 		sqd->thread = tsk;
 		ret = io_uring_alloc_task_context(tsk, ctx);
-		wake_up_new_task(tsk);
+		kernel_worker_start(tsk, "iou-sqp-%d", sqd->task_pid);
 		if (ret)
 			goto err;
 	} else if (p->flags & IORING_SETUP_SQ_AFF) {
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ba0499b6627c..781abbc1c288 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -87,7 +87,6 @@ extern void exit_files(struct task_struct *);
 extern void exit_itimers(struct signal_struct *);
 
 extern pid_t kernel_clone(struct kernel_clone_args *kargs);
-struct task_struct *create_io_thread(int (*fn)(void *i), void *arg, int node);
 struct task_struct *kernel_worker(int (*fn)(void *), void *arg, int node,
 				  unsigned long clone_flags, u32 worker_flags);
 __printf(2, 3)
-- 
2.25.1


* [PATCH V3 7/9] fork: Add worker flag to ignore signals
  2021-10-04 19:21 [PATCH V3 0/9] Use copy_process/create_io_thread in vhost layer Mike Christie
                   ` (5 preceding siblings ...)
  2021-10-04 19:21 ` [PATCH V3 6/9] io_uring: switch to kernel_worker Mike Christie
@ 2021-10-04 19:21 ` Mike Christie
  2021-10-04 20:04   ` Jens Axboe
  2021-10-04 19:21 ` [PATCH V3 8/9] vhost: move worker thread fields to new struct Mike Christie
  2021-10-04 19:21 ` [PATCH V3 9/9] vhost: use kernel_worker to check RLIMITs and inherit v2 cgroups Mike Christie
  8 siblings, 1 reply; 17+ messages in thread
From: Mike Christie @ 2021-10-04 19:21 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel
  Cc: Mike Christie

The kthread API creates threads that ignore all signals by default, so
modules like vhost that will move from that API to kernel_worker will
not be expecting them. This patch adds a worker flag that tells
kernel_worker to set up the task to ignore signals.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
---
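Note (not part of the commit message): a small userspace model of what
the new flag does after copy_process() succeeds. All model_* names and
MODEL_NSIG are hypothetical. The model assumes ignore_signals() sets
every handler to "ignore" and flushes pending signals, and it does not
cover the special handling of SIGKILL and SIGSTOP.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NSIG			64		/* table size for the model */
#define MODEL_KERN_WORKER_NO_SIGS	(1U << 3)	/* BIT(3) in this patch */

enum model_disposition { MODEL_SIG_DFL, MODEL_SIG_IGN };

struct model_sig_task {
	enum model_disposition action[MODEL_NSIG];
	uint64_t pending;	/* one bit per queued signal */
};

/* Models ignore_signals(): ignore everything, flush what is pending. */
static void model_ignore_signals(struct model_sig_task *tsk)
{
	for (int i = 0; i < MODEL_NSIG; i++)
		tsk->action[i] = MODEL_SIG_IGN;
	tsk->pending = 0;
}

/* Models the tail of kernel_worker() once the task exists. */
void model_kernel_worker_sigs(struct model_sig_task *tsk,
			      uint32_t worker_flags)
{
	if (worker_flags & MODEL_KERN_WORKER_NO_SIGS)
		model_ignore_signals(tsk);
}

/* Returns true if the signal was queued, false if it was ignored. */
bool model_send_signal(struct model_sig_task *tsk, int sig)
{
	assert(sig >= 1 && sig <= MODEL_NSIG);
	if (tsk->action[sig - 1] == MODEL_SIG_IGN)
		return false;
	tsk->pending |= UINT64_C(1) << (sig - 1);
	return true;
}
```

A task set up without the flag keeps queueing signals, which is the
behavior a former kthread user such as vhost would not be prepared for.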
 include/linux/sched/task.h |  1 +
 kernel/fork.c              | 11 ++++++++++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 781abbc1c288..aefa0d221b57 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -21,6 +21,7 @@ struct css_set;
 #define KERN_WORKER_IO		BIT(0)
 #define KERN_WORKER_USER	BIT(1)
 #define KERN_WORKER_NO_FILES	BIT(2)
+#define KERN_WORKER_NO_SIGS	BIT(3)
 
 struct kernel_clone_args {
 	u64 flags;
diff --git a/kernel/fork.c b/kernel/fork.c
index 3f3fcabffa5f..34d3dca70cfb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2555,6 +2555,8 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 struct task_struct *kernel_worker(int (*fn)(void *), void *arg, int node,
 				  unsigned long clone_flags, u32 worker_flags)
 {
+	struct task_struct *tsk;
+
 	struct kernel_clone_args args = {
 		.flags		= ((lower_32_bits(clone_flags) | CLONE_VM |
 				   CLONE_UNTRACED) & ~CSIGNAL),
@@ -2564,7 +2566,14 @@ struct task_struct *kernel_worker(int (*fn)(void *), void *arg, int node,
 		.worker_flags	= KERN_WORKER_USER | worker_flags,
 	};
 
-	return copy_process(NULL, 0, node, &args);
+	tsk = copy_process(NULL, 0, node, &args);
+	if (IS_ERR(tsk))
+		return tsk;
+
+	if (worker_flags & KERN_WORKER_NO_SIGS)
+		ignore_signals(tsk);
+
+	return tsk;
 }
 EXPORT_SYMBOL_GPL(kernel_worker);
 
-- 
2.25.1


* [PATCH V3 8/9] vhost: move worker thread fields to new struct
  2021-10-04 19:21 [PATCH V3 0/9] Use copy_process/create_io_thread in vhost layer Mike Christie
                   ` (6 preceding siblings ...)
  2021-10-04 19:21 ` [PATCH V3 7/9] fork: Add worker flag to ignore signals Mike Christie
@ 2021-10-04 19:21 ` Mike Christie
  2021-10-04 19:21 ` [PATCH V3 9/9] vhost: use kernel_worker to check RLIMITs and inherit v2 cgroups Mike Christie
  8 siblings, 0 replies; 17+ messages in thread
From: Mike Christie @ 2021-10-04 19:21 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel
  Cc: Mike Christie

This is just a prep patch. It moves the worker-related fields to a new
vhost_worker struct and moves the code around to create some helpers that
will be used in the next patches.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vhost/vhost.c | 98 ++++++++++++++++++++++++++++---------------
 drivers/vhost/vhost.h | 11 +++--
 2 files changed, 72 insertions(+), 37 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 59edb5a1ffe2..c9a1f706989c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -263,8 +263,8 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 		 * sure it was not in the list.
 		 * test_and_set_bit() implies a memory barrier.
 		 */
-		llist_add(&work->node, &dev->work_list);
-		wake_up_process(dev->worker);
+		llist_add(&work->node, &dev->worker->work_list);
+		wake_up_process(dev->worker->task);
 	}
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
@@ -272,7 +272,7 @@ EXPORT_SYMBOL_GPL(vhost_work_queue);
 /* A lockless hint for busy polling code to exit the loop */
 bool vhost_has_work(struct vhost_dev *dev)
 {
-	return !llist_empty(&dev->work_list);
+	return dev->worker && !llist_empty(&dev->worker->work_list);
 }
 EXPORT_SYMBOL_GPL(vhost_has_work);
 
@@ -343,7 +343,8 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 
 static int vhost_worker(void *data)
 {
-	struct vhost_dev *dev = data;
+	struct vhost_worker *worker = data;
+	struct vhost_dev *dev = worker->dev;
 	struct vhost_work *work, *work_next;
 	struct llist_node *node;
 
@@ -358,7 +359,7 @@ static int vhost_worker(void *data)
 			break;
 		}
 
-		node = llist_del_all(&dev->work_list);
+		node = llist_del_all(&worker->work_list);
 		if (!node)
 			schedule();
 
@@ -368,7 +369,7 @@ static int vhost_worker(void *data)
 		llist_for_each_entry_safe(work, work_next, node, node) {
 			clear_bit(VHOST_WORK_QUEUED, &work->flags);
 			__set_current_state(TASK_RUNNING);
-			kcov_remote_start_common(dev->kcov_handle);
+			kcov_remote_start_common(worker->kcov_handle);
 			work->fn(work);
 			kcov_remote_stop();
 			if (need_resched())
@@ -487,7 +488,6 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->byte_weight = byte_weight;
 	dev->use_worker = use_worker;
 	dev->msg_handler = msg_handler;
-	init_llist_head(&dev->work_list);
 	init_waitqueue_head(&dev->wait);
 	INIT_LIST_HEAD(&dev->read_list);
 	INIT_LIST_HEAD(&dev->pending_list);
@@ -579,10 +579,60 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 	dev->mm = NULL;
 }
 
+static void vhost_worker_free(struct vhost_dev *dev)
+{
+	struct vhost_worker *worker = dev->worker;
+
+	if (!worker)
+		return;
+
+	dev->worker = NULL;
+	WARN_ON(!llist_empty(&worker->work_list));
+	kthread_stop(worker->task);
+	kfree(worker);
+}
+
+static int vhost_worker_create(struct vhost_dev *dev)
+{
+	struct vhost_worker *worker;
+	struct task_struct *task;
+	int ret;
+
+	worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
+	if (!worker)
+		return -ENOMEM;
+
+	dev->worker = worker;
+	worker->dev = dev;
+	worker->kcov_handle = kcov_common_handle();
+	init_llist_head(&worker->work_list);
+
+	task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
+	if (IS_ERR(task)) {
+		ret = PTR_ERR(task);
+		goto free_worker;
+	}
+
+	worker->task = task;
+	wake_up_process(task); /* avoid contributing to loadavg */
+
+	ret = vhost_attach_cgroups(dev);
+	if (ret)
+		goto stop_worker;
+
+	return 0;
+
+stop_worker:
+	kthread_stop(worker->task);
+free_worker:
+	kfree(worker);
+	dev->worker = NULL;
+	return ret;
+}
+
 /* Caller should have device mutex */
 long vhost_dev_set_owner(struct vhost_dev *dev)
 {
-	struct task_struct *worker;
 	int err;
 
 	/* Is there an owner already? */
@@ -593,36 +643,21 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 
 	vhost_attach_mm(dev);
 
-	dev->kcov_handle = kcov_common_handle();
 	if (dev->use_worker) {
-		worker = kthread_create(vhost_worker, dev,
-					"vhost-%d", current->pid);
-		if (IS_ERR(worker)) {
-			err = PTR_ERR(worker);
-			goto err_worker;
-		}
-
-		dev->worker = worker;
-		wake_up_process(worker); /* avoid contributing to loadavg */
-
-		err = vhost_attach_cgroups(dev);
+		err = vhost_worker_create(dev);
 		if (err)
-			goto err_cgroup;
+			goto err_worker;
 	}
 
 	err = vhost_dev_alloc_iovecs(dev);
 	if (err)
-		goto err_cgroup;
+		goto err_iovecs;
 
 	return 0;
-err_cgroup:
-	if (dev->worker) {
-		kthread_stop(dev->worker);
-		dev->worker = NULL;
-	}
+err_iovecs:
+	vhost_worker_free(dev);
 err_worker:
 	vhost_detach_mm(dev);
-	dev->kcov_handle = 0;
 err_mm:
 	return err;
 }
@@ -712,12 +747,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 	dev->iotlb = NULL;
 	vhost_clear_msg(dev);
 	wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM);
-	WARN_ON(!llist_empty(&dev->work_list));
-	if (dev->worker) {
-		kthread_stop(dev->worker);
-		dev->worker = NULL;
-		dev->kcov_handle = 0;
-	}
+	vhost_worker_free(dev);
 	vhost_detach_mm(dev);
 }
 EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 638bb640d6b4..102ce25e4e13 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -25,6 +25,13 @@ struct vhost_work {
 	unsigned long		flags;
 };
 
+struct vhost_worker {
+	struct task_struct	*task;
+	struct llist_head	work_list;
+	struct vhost_dev	*dev;
+	u64			kcov_handle;
+};
+
 /* Poll a file (eventfd or socket) */
 /* Note: there's nothing vhost specific about this structure. */
 struct vhost_poll {
@@ -148,8 +155,7 @@ struct vhost_dev {
 	struct vhost_virtqueue **vqs;
 	int nvqs;
 	struct eventfd_ctx *log_ctx;
-	struct llist_head work_list;
-	struct task_struct *worker;
+	struct vhost_worker *worker;
 	struct vhost_iotlb *umem;
 	struct vhost_iotlb *iotlb;
 	spinlock_t iotlb_lock;
@@ -159,7 +165,6 @@ struct vhost_dev {
 	int iov_limit;
 	int weight;
 	int byte_weight;
-	u64 kcov_handle;
 	bool use_worker;
 	int (*msg_handler)(struct vhost_dev *dev,
 			   struct vhost_iotlb_msg *msg);
-- 
2.25.1


* [PATCH V3 9/9] vhost: use kernel_worker to check RLIMITs and inherit v2 cgroups
  2021-10-04 19:21 [PATCH V3 0/9] Use copy_process/create_io_thread in vhost layer Mike Christie
                   ` (7 preceding siblings ...)
  2021-10-04 19:21 ` [PATCH V3 8/9] vhost: move worker thread fields to new struct Mike Christie
@ 2021-10-04 19:21 ` Mike Christie
  8 siblings, 0 replies; 17+ messages in thread
From: Mike Christie @ 2021-10-04 19:21 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel
  Cc: Mike Christie

For vhost workers we use the kthread API, which inherits its values from
and checks against the kthreadd thread. This results in cgroups v2 not
working and the wrong RLIMITs being checked. This patch has us use the
kernel_worker function, which will inherit its values/checks from the
thread that owns the device.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
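Note (not part of the commit message): with kthread_should_stop() gone,
the worker loop exits on a stop flag and reports back through a
completion. The single-threaded sketch below models one pass of that
loop. All model_* names are hypothetical, and the code that sets the
flag is assumed rather than taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_WORK 8

typedef void (*model_work_fn)(void);

struct model_worker {
	bool stop;				/* models VHOST_WORKER_FLAG_STOP */
	bool exit_done;				/* models complete(worker->exit_done) */
	model_work_fn work[MODEL_MAX_WORK];	/* models worker->work_list */
	size_t nr_work;
};

int model_work_ran;	/* counts executed work items */

void model_count_work(void)
{
	model_work_ran++;
}

bool model_work_queue(struct model_worker *w, model_work_fn fn)
{
	assert(w && fn);
	if (w->nr_work == MODEL_MAX_WORK)
		return false;
	w->work[w->nr_work++] = fn;
	return true;
}

/* One pass of the worker loop; returns false once the worker has exited. */
bool model_worker_iterate(struct model_worker *w)
{
	model_work_fn batch[MODEL_MAX_WORK];
	size_t i, n;

	assert(w);
	if (w->stop) {		/* checked before the list is drained */
		w->exit_done = true;
		return false;
	}
	n = w->nr_work;		/* models llist_del_all() */
	memcpy(batch, w->work, n * sizeof(batch[0]));
	w->nr_work = 0;
	for (i = 0; i < n; i++)
		batch[i]();
	return true;
}
```

Because the flag is tested before the list is drained, work queued before
a stop request is left on the list in this model, which is the situation
the WARN_ON() in vhost_worker_free() is there to flag.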
 drivers/vhost/vhost.c | 68 ++++++++++++++++---------------------------
 drivers/vhost/vhost.h |  7 ++++-
 2 files changed, 31 insertions(+), 44 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c9a1f706989c..7a5142dcde1b 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -22,7 +22,6 @@
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
 #include <linux/kthread.h>
-#include <linux/cgroup.h>
 #include <linux/module.h>
 #include <linux/sort.h>
 #include <linux/sched/mm.h>
@@ -344,17 +343,14 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 static int vhost_worker(void *data)
 {
 	struct vhost_worker *worker = data;
-	struct vhost_dev *dev = worker->dev;
 	struct vhost_work *work, *work_next;
 	struct llist_node *node;
 
-	kthread_use_mm(dev->mm);
-
 	for (;;) {
 		/* mb paired w/ kthread_stop */
 		set_current_state(TASK_INTERRUPTIBLE);
 
-		if (kthread_should_stop()) {
+		if (test_bit(VHOST_WORKER_FLAG_STOP, &worker->flags)) {
 			__set_current_state(TASK_RUNNING);
 			break;
 		}
@@ -376,8 +372,9 @@ static int vhost_worker(void *data)
 				schedule();
 		}
 	}
-	kthread_unuse_mm(dev->mm);
-	return 0;
+
+	complete(worker->exit_done);
+	do_exit(0);
 }
 
 static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
@@ -517,31 +514,6 @@ long vhost_dev_check_owner(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_check_owner);
 
-struct vhost_attach_cgroups_struct {
-	struct vhost_work work;
-	struct task_struct *owner;
-	int ret;
-};
-
-static void vhost_attach_cgroups_work(struct vhost_work *work)
-{
-	struct vhost_attach_cgroups_struct *s;
-
-	s = container_of(work, struct vhost_attach_cgroups_struct, work);
-	s->ret = cgroup_attach_task_all(s->owner, current);
-}
-
-static int vhost_attach_cgroups(struct vhost_dev *dev)
-{
-	struct vhost_attach_cgroups_struct attach;
-
-	attach.owner = current;
-	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
-	vhost_work_queue(dev, &attach.work);
-	vhost_work_dev_flush(dev);
-	return attach.ret;
-}
-
 /* Caller should have device mutex */
 bool vhost_dev_has_owner(struct vhost_dev *dev)
 {
@@ -579,6 +551,16 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 	dev->mm = NULL;
 }
 
+static void vhost_worker_stop(struct vhost_worker *worker)
+{
+	DECLARE_COMPLETION_ONSTACK(exit_done);
+
+	worker->exit_done = &exit_done;
+	set_bit(VHOST_WORKER_FLAG_STOP, &worker->flags);
+	wake_up_process(worker->task);
+	wait_for_completion(worker->exit_done);
+}
+
 static void vhost_worker_free(struct vhost_dev *dev)
 {
 	struct vhost_worker *worker = dev->worker;
@@ -588,7 +570,7 @@ static void vhost_worker_free(struct vhost_dev *dev)
 
 	dev->worker = NULL;
 	WARN_ON(!llist_empty(&worker->work_list));
-	kthread_stop(worker->task);
+	vhost_worker_stop(worker);
 	kfree(worker);
 }
 
@@ -603,27 +585,27 @@ static int vhost_worker_create(struct vhost_dev *dev)
 		return -ENOMEM;
 
 	dev->worker = worker;
-	worker->dev = dev;
 	worker->kcov_handle = kcov_common_handle();
 	init_llist_head(&worker->work_list);
 
-	task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
+	/*
+	 * vhost used to use the kthread API which ignores all signals by
+	 * default and the drivers expect this behavior. So we do not want to
+	 * inherit the parent's signal handlers and set our worker to ignore
+	 * everything below.
+	 */
+	task = kernel_worker(vhost_worker, worker, NUMA_NO_NODE,
+			     CLONE_FS | CLONE_CLEAR_SIGHAND,
+			     KERN_WORKER_NO_FILES | KERN_WORKER_NO_SIGS);
 	if (IS_ERR(task)) {
 		ret = PTR_ERR(task);
 		goto free_worker;
 	}
 
 	worker->task = task;
-	wake_up_process(task); /* avoid contributing to loadavg */
-
-	ret = vhost_attach_cgroups(dev);
-	if (ret)
-		goto stop_worker;
-
+	kernel_worker_start(task, "vhost-%d", current->pid);
 	return 0;
 
-stop_worker:
-	kthread_stop(worker->task);
 free_worker:
 	kfree(worker);
 	dev->worker = NULL;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 102ce25e4e13..09748694cb66 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -25,11 +25,16 @@ struct vhost_work {
 	unsigned long		flags;
 };
 
+enum {
+	VHOST_WORKER_FLAG_STOP,
+};
+
 struct vhost_worker {
 	struct task_struct	*task;
+	struct completion	*exit_done;
 	struct llist_head	work_list;
-	struct vhost_dev	*dev;
 	u64			kcov_handle;
+	unsigned long		flags;
 };
 
 /* Poll a file (eventfd or socket) */
-- 
2.25.1



* Re: [PATCH V3 7/9] fork: Add worker flag to ignore signals
  2021-10-04 19:21 ` [PATCH V3 7/9] fork: Add worker flag to ignore signals Mike Christie
@ 2021-10-04 20:04   ` Jens Axboe
  2021-10-05 12:45     ` Christian Brauner
  0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2021-10-04 20:04 UTC (permalink / raw)
  To: Mike Christie, geert, vverma, hdanton, hch, stefanha, jasowang,
	mst, sgarzare, virtualization, christian.brauner, linux-kernel

On 10/4/21 1:21 PM, Mike Christie wrote:
> The kthread API creates threads that ignore all signals by default so
> modules like vhost that will move from that API to kernel_worker will
> not be expecting them. This patch adds a worker flag that tells
> kernel_worker to set up the task to ignore signals.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
> ---
>  include/linux/sched/task.h |  1 +
>  kernel/fork.c              | 11 ++++++++++-
>  2 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 781abbc1c288..aefa0d221b57 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -21,6 +21,7 @@ struct css_set;
>  #define KERN_WORKER_IO		BIT(0)
>  #define KERN_WORKER_USER	BIT(1)
>  #define KERN_WORKER_NO_FILES	BIT(2)
> +#define KERN_WORKER_NO_SIGS	BIT(3)
>  
>  struct kernel_clone_args {
>  	u64 flags;
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3f3fcabffa5f..34d3dca70cfb 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2555,6 +2555,8 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
>  struct task_struct *kernel_worker(int (*fn)(void *), void *arg, int node,
>  				  unsigned long clone_flags, u32 worker_flags)
>  {
> +	struct task_struct *tsk;
> +
>  	struct kernel_clone_args args = {
>  		.flags		= ((lower_32_bits(clone_flags) | CLONE_VM |
>  				   CLONE_UNTRACED) & ~CSIGNAL),
> @@ -2564,7 +2566,14 @@ struct task_struct *kernel_worker(int (*fn)(void *), void *arg, int node,
>  		.worker_flags	= KERN_WORKER_USER | worker_flags,
>  	};
>  
> -	return copy_process(NULL, 0, node, &args);
> +	tsk = copy_process(NULL, 0, node, &args);
> +	if (IS_ERR(tsk))
> +		return tsk;
> +
> +	if (worker_flags & KERN_WORKER_NO_SIGS)
> +		ignore_signals(tsk);
> +
> +	return tsk;

When I originally did it this way, Eric (correctly) pointed out that
it's racy. See where it's currently done as part of copy_process(), not
after.

-- 
Jens Axboe



* Re: [PATCH V3 5/9] fork: add helper to clone a process
  2021-10-04 19:21 ` [PATCH V3 5/9] fork: add helper to clone a process Mike Christie
@ 2021-10-04 20:29   ` Jens Axboe
  2021-10-05 12:50   ` Christian Brauner
  1 sibling, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2021-10-04 20:29 UTC (permalink / raw)
  To: Mike Christie, geert, vverma, hdanton, hch, stefanha, jasowang,
	mst, sgarzare, virtualization, christian.brauner, linux-kernel

On 10/4/21 1:21 PM, Mike Christie wrote:
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index e165cc67fd3c..ba0499b6627c 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -87,7 +87,11 @@ extern void exit_files(struct task_struct *);
>  extern void exit_itimers(struct signal_struct *);
>  
>  extern pid_t kernel_clone(struct kernel_clone_args *kargs);
> -struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
> +struct task_struct *create_io_thread(int (*fn)(void *i), void *arg, int node);

Looks like an extra 'i' snuck in here, causing unrelated changes.

-- 
Jens Axboe



* Re: [PATCH V3 6/9] io_uring: switch to kernel_worker
  2021-10-04 19:21 ` [PATCH V3 6/9] io_uring: switch to kernel_worker Mike Christie
@ 2021-10-04 20:30   ` Jens Axboe
  0 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2021-10-04 20:30 UTC (permalink / raw)
  To: Mike Christie, geert, vverma, hdanton, hch, stefanha, jasowang,
	mst, sgarzare, virtualization, christian.brauner, linux-kernel

On 10/4/21 1:21 PM, Mike Christie wrote:
> Convert io_uring and io-wq to use kernel_worker.

Looks good to me:

Reviewed-by: Jens Axboe <axboe@kernel.dk>

-- 
Jens Axboe



* Re: [PATCH V3 7/9] fork: Add worker flag to ignore signals
  2021-10-04 20:04   ` Jens Axboe
@ 2021-10-05 12:45     ` Christian Brauner
  0 siblings, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2021-10-05 12:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Mike Christie, geert, vverma, hdanton, hch, stefanha, jasowang,
	mst, sgarzare, virtualization, linux-kernel

On Mon, Oct 04, 2021 at 02:04:12PM -0600, Jens Axboe wrote:
> On 10/4/21 1:21 PM, Mike Christie wrote:
> > The kthread API creates threads that ignore all signals by default so
> > modules like vhost that will move from that API to kernel_worker will
> > not be expecting them. This patch adds a worker flag that tells
> > kernel_worker to set up the task to ignore signals.
> > 
> > Signed-off-by: Mike Christie <michael.christie@oracle.com>
> > Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
> > ---
> >  include/linux/sched/task.h |  1 +
> >  kernel/fork.c              | 11 ++++++++++-
> >  2 files changed, 11 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> > index 781abbc1c288..aefa0d221b57 100644
> > --- a/include/linux/sched/task.h
> > +++ b/include/linux/sched/task.h
> > @@ -21,6 +21,7 @@ struct css_set;
> >  #define KERN_WORKER_IO		BIT(0)
> >  #define KERN_WORKER_USER	BIT(1)
> >  #define KERN_WORKER_NO_FILES	BIT(2)
> > +#define KERN_WORKER_NO_SIGS	BIT(3)
> >  
> >  struct kernel_clone_args {
> >  	u64 flags;
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 3f3fcabffa5f..34d3dca70cfb 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -2555,6 +2555,8 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
> >  struct task_struct *kernel_worker(int (*fn)(void *), void *arg, int node,
> >  				  unsigned long clone_flags, u32 worker_flags)
> >  {
> > +	struct task_struct *tsk;
> > +
> >  	struct kernel_clone_args args = {
> >  		.flags		= ((lower_32_bits(clone_flags) | CLONE_VM |
> >  				   CLONE_UNTRACED) & ~CSIGNAL),
> > @@ -2564,7 +2566,14 @@ struct task_struct *kernel_worker(int (*fn)(void *), void *arg, int node,
> >  		.worker_flags	= KERN_WORKER_USER | worker_flags,
> >  	};
> >  
> > -	return copy_process(NULL, 0, node, &args);
> > +	tsk = copy_process(NULL, 0, node, &args);
> > +	if (IS_ERR(tsk))
> > +		return tsk;
> > +
> > +	if (worker_flags & KERN_WORKER_NO_SIGS)
> > +		ignore_signals(tsk);
> > +
> > +	return tsk;
> 
> When I originally did it this way, Eric (correctly) pointed out that
> it's racy. See where it's currently done as part of copy_process(), not
> after.

Since this mirrors kthread's signal-ignore API, introduced in commit
10ab825bdef8 ("change kernel threads to ignore signals instead of
blocking them"), we should also rename KERN_WORKER_NO_SIGS to
KERN_WORKER_SIG_IGN so that the name reflects this and eases the
transition to the new API.
Ignoring signals should be moved into copy_process(), after
copy_sighand() and copy_signal().
Aside from that, we should introduce a helper that verifies that the
arguments passed to kernel_worker() are sane, so we don't end up with
garbage in there by surprise (CLONE_SIGHAND and CLONE_CLEAR_SIGHAND
don't make sense with KERN_WORKER_SIG_IGN). So this should give us
something like:

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 7a5142dcde1b..59891db97d87 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -596,7 +596,7 @@ static int vhost_worker_create(struct vhost_dev *dev)
         */
        task = kernel_worker(vhost_worker, worker, NUMA_NO_NODE,
                             CLONE_FS | CLONE_CLEAR_SIGHAND,
-                            KERN_WORKER_NO_FILES | KERN_WORKER_NO_SIGS);
+                            KERN_WORKER_NO_FILES | KERN_WORKER_SIG_IGN);
        if (IS_ERR(task)) {
                ret = PTR_ERR(task);
                goto free_worker;
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index aefa0d221b57..b4f6007f335b 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -21,7 +21,7 @@ struct css_set;
 #define KERN_WORKER_IO         BIT(0)
 #define KERN_WORKER_USER       BIT(1)
 #define KERN_WORKER_NO_FILES   BIT(2)
-#define KERN_WORKER_NO_SIGS    BIT(3)
+#define KERN_WORKER_SIG_IGN    BIT(3)

 struct kernel_clone_args {
        u64 flags;
diff --git a/kernel/fork.c b/kernel/fork.c
index 34d3dca70cfb..874c356b3e9f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2212,6 +2212,9 @@ static __latent_entropy struct task_struct *copy_process(
        if (retval)
                goto bad_fork_cleanup_io;

+       if (args->worker_flags & KERN_WORKER_SIG_IGN)
+               ignore_signals(p);
+
        stackleak_task_init(p);

        if (pid != &init_struct_pid) {
@@ -2540,6 +2543,24 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
        return copy_process(NULL, 0, node, &args);
 }

+static bool kernel_worker_flags_valid(struct kernel_clone_args *kargs)
+{
+       /* Verify that no unknown flags are passed along. */
+       if (kargs->worker_flags & ~(KERN_WORKER_IO | KERN_WORKER_USER |
+                                   KERN_WORKER_NO_FILES | KERN_WORKER_SIG_IGN))
+               return false;
+
+       /*
+        * If we're ignoring all signals don't allow sharing struct sighand and
+        * don't bother clearing signal handlers.
+        */
+       if ((kargs->flags & (CLONE_SIGHAND | CLONE_CLEAR_SIGHAND)) &&
+           (kargs->worker_flags & KERN_WORKER_SIG_IGN))
+               return false;
+
+       return true;
+}
+
 /**
  * kernel_worker - create a copy of a process to be used by the kernel
  * @fn: thread stack
@@ -2555,8 +2576,6 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 struct task_struct *kernel_worker(int (*fn)(void *), void *arg, int node,
                                  unsigned long clone_flags, u32 worker_flags)
 {
-       struct task_struct *tsk;
-
        struct kernel_clone_args args = {
                .flags          = ((lower_32_bits(clone_flags) | CLONE_VM |
                                   CLONE_UNTRACED) & ~CSIGNAL),
@@ -2566,14 +2585,10 @@ struct task_struct *kernel_worker(int (*fn)(void *), void *arg, int node,
                .worker_flags   = KERN_WORKER_USER | worker_flags,
        };

-       tsk = copy_process(NULL, 0, node, &args);
-       if (IS_ERR(tsk))
-               return tsk;
-
-       if (worker_flags & KERN_WORKER_NO_SIGS)
-               ignore_signals(tsk);
+       if (!kernel_worker_flags_valid(&args))
+               return ERR_PTR(-EINVAL);

-       return tsk;
+       return copy_process(NULL, 0, node, &args);
 }
 EXPORT_SYMBOL_GPL(kernel_worker);
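
To make the intent of the check concrete, here is a minimal userspace
sketch of the validation and of how kernel_worker() splits clone_flags.
It is only a model, not kernel code: the demo_* names and struct
demo_clone_args are hypothetical, the CLONE_* values are copied from
include/uapi/linux/sched.h, and the KERN_WORKER_* bits follow this
series with the proposed rename.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from include/uapi/linux/sched.h. */
#define CSIGNAL			0x000000ffULL
#define CLONE_VM		0x00000100ULL
#define CLONE_FS		0x00000200ULL
#define CLONE_SIGHAND		0x00000800ULL
#define CLONE_UNTRACED		0x00800000ULL
#define CLONE_CLEAR_SIGHAND	0x100000000ULL

/* Worker flags from this series, using the proposed SIG_IGN name. */
#define KERN_WORKER_IO		(1U << 0)
#define KERN_WORKER_USER	(1U << 1)
#define KERN_WORKER_NO_FILES	(1U << 2)
#define KERN_WORKER_SIG_IGN	(1U << 3)

/* Hypothetical stand-in for the relevant fields of kernel_clone_args. */
struct demo_clone_args {
	uint64_t flags;
	int exit_signal;
	uint32_t worker_flags;
};

/* Same logic as the proposed kernel_worker_flags_valid(). */
static bool demo_worker_flags_valid(const struct demo_clone_args *kargs)
{
	/* Reject worker flags this model does not know about. */
	if (kargs->worker_flags & ~(KERN_WORKER_IO | KERN_WORKER_USER |
				    KERN_WORKER_NO_FILES | KERN_WORKER_SIG_IGN))
		return false;

	/* Ignoring all signals conflicts with sharing or clearing handlers. */
	if ((kargs->flags & (CLONE_SIGHAND | CLONE_CLEAR_SIGHAND)) &&
	    (kargs->worker_flags & KERN_WORKER_SIG_IGN))
		return false;

	return true;
}

/* Mirrors how kernel_worker() fills in its args from the caller's flags. */
static struct demo_clone_args demo_build_args(uint64_t clone_flags,
					      uint32_t worker_flags)
{
	/* Equivalent of lower_32_bits(clone_flags). */
	uint32_t low = (uint32_t)(clone_flags & 0xffffffffULL);
	struct demo_clone_args args = {
		.flags		= (low | CLONE_VM | CLONE_UNTRACED) & ~CSIGNAL,
		.exit_signal	= (int)(low & CSIGNAL),
		.worker_flags	= KERN_WORKER_USER | worker_flags,
	};

	return args;
}
```

One thing the model makes visible: CLONE_CLEAR_SIGHAND is bit 32, so
the lower_32_bits() masking drops it before the helper ever sees it.
If rejecting that combination is wanted, the check would presumably
have to look at the unmasked clone_flags.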



* Re: [PATCH V3 5/9] fork: add helper to clone a process
  2021-10-04 19:21 ` [PATCH V3 5/9] fork: add helper to clone a process Mike Christie
  2021-10-04 20:29   ` Jens Axboe
@ 2021-10-05 12:50   ` Christian Brauner
  2021-10-05 17:10     ` Mike Christie
  1 sibling, 1 reply; 17+ messages in thread
From: Christian Brauner @ 2021-10-05 12:50 UTC (permalink / raw)
  To: Mike Christie
  Cc: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, axboe, linux-kernel

On Mon, Oct 04, 2021 at 02:21:24PM -0500, Mike Christie wrote:
> The vhost layer has similar requirements as io_uring where its worker
> threads need to access the userspace thread's memory, want to inherit the
> parent's cgroups and namespaces, and be checked against the parent's
> RLIMITs. Right now, the vhost layer uses the kthread API which has
> kthread_use_mm for mem access, and those threads can use
> cgroup_attach_task_all for v1 cgroups, but there are no helpers for the
> other items.
> 
> This adds a helper to clone a process so we can inherit everything we
> want in one call. It's a more generic version of create_io_thread which
> will be used by the vhost layer and io_uring in later patches in this set.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
> ---
>  include/linux/sched/task.h |  6 ++++-
>  kernel/fork.c              | 48 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 53 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index e165cc67fd3c..ba0499b6627c 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -87,7 +87,11 @@ extern void exit_files(struct task_struct *);
>  extern void exit_itimers(struct signal_struct *);
>  
>  extern pid_t kernel_clone(struct kernel_clone_args *kargs);
> -struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
> +struct task_struct *create_io_thread(int (*fn)(void *i), void *arg, int node);
> +struct task_struct *kernel_worker(int (*fn)(void *), void *arg, int node,
> +				  unsigned long clone_flags, u32 worker_flags);
> +__printf(2, 3)
> +void kernel_worker_start(struct task_struct *tsk, const char namefmt[], ...);
>  struct task_struct *fork_idle(int);
>  struct mm_struct *copy_init_mm(void);
>  extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 98264cf1d6a6..3f3fcabffa5f 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2540,6 +2540,54 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
>  	return copy_process(NULL, 0, node, &args);
>  }
>  
> +/**
> + * kernel_worker - create a copy of a process to be used by the kernel
> + * @fn: thread stack
> + * @arg: data to be passed to fn
> + * @node: numa node to allocate task from
> + * @clone_flags: CLONE flags
> + * @worker_flags: KERN_WORKER flags
> + *
> + * This returns a created task, or an error pointer. The returned task is
> + * inactive, and the caller must fire it up through kernel_worker_start(). If
> + * this is a PF_IO_WORKER, all signals but KILL and STOP are blocked.
> + */
> +struct task_struct *kernel_worker(int (*fn)(void *), void *arg, int node,
> +				  unsigned long clone_flags, u32 worker_flags)
> +{
> +	struct kernel_clone_args args = {
> +		.flags		= ((lower_32_bits(clone_flags) | CLONE_VM |
> +				   CLONE_UNTRACED) & ~CSIGNAL),
> +		.exit_signal	= (lower_32_bits(clone_flags) & CSIGNAL),
> +		.stack		= (unsigned long)fn,
> +		.stack_size	= (unsigned long)arg,
> +		.worker_flags	= KERN_WORKER_USER | worker_flags,
> +	};
> +
> +	return copy_process(NULL, 0, node, &args);
> +}
> +EXPORT_SYMBOL_GPL(kernel_worker);
> +
> +/**
> + * kernel_worker_start - Start a task created with kernel_worker
> + * @tsk: task to wake up
> + * @namefmt: printf-style format string for the thread name
> + * @arg: arguments for @namefmt
> + */
> +void kernel_worker_start(struct task_struct *tsk, const char namefmt[], ...)
> +{
> +	char name[TASK_COMM_LEN];
> +	va_list args;

You could think about reporting an error from this function if
KERN_WORKER_USER isn't set, or only call the below when KERN_WORKER_USER
is set. Both options are fine.

> +
> +	va_start(args, namefmt);
> +	vsnprintf(name, sizeof(name), namefmt, args);
> +	set_task_comm(tsk, name);
> +	va_end(args);
> +
> +	wake_up_new_task(tsk);
> +}
> +EXPORT_SYMBOL_GPL(kernel_worker_start);
> +
>  /*
>   *  Ok, this is the main fork-routine.
>   *
> -- 
> 2.25.1
> 


* Re: [PATCH V3 5/9] fork: add helper to clone a process
  2021-10-05 12:50   ` Christian Brauner
@ 2021-10-05 17:10     ` Mike Christie
  2021-10-06 12:12       ` Christian Brauner
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Christie @ 2021-10-05 17:10 UTC (permalink / raw)
  To: Christian Brauner
  Cc: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, axboe, linux-kernel

On 10/5/21 7:50 AM, Christian Brauner wrote:
> On Mon, Oct 04, 2021 at 02:21:24PM -0500, Mike Christie wrote:
>> The vhost layer has similar requirements as io_uring where its worker
>> threads need to access the userspace thread's memory, want to inherit the
>> parent's cgroups and namespaces, and be checked against the parent's
>> RLIMITs. Right now, the vhost layer uses the kthread API which has
>> kthread_use_mm for mem access, and those threads can use
>> cgroup_attach_task_all for v1 cgroups, but there are no helpers for the
>> other items.
>>
>> This adds a helper to clone a process so we can inherit everything we
>> want in one call. It's a more generic version of create_io_thread which
>> will be used by the vhost layer and io_uring in later patches in this set.
>>
>> Signed-off-by: Mike Christie <michael.christie@oracle.com>
>> Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
>> ---
>>  include/linux/sched/task.h |  6 ++++-
>>  kernel/fork.c              | 48 ++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 53 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
>> index e165cc67fd3c..ba0499b6627c 100644
>> --- a/include/linux/sched/task.h
>> +++ b/include/linux/sched/task.h
>> @@ -87,7 +87,11 @@ extern void exit_files(struct task_struct *);
>>  extern void exit_itimers(struct signal_struct *);
>>  
>>  extern pid_t kernel_clone(struct kernel_clone_args *kargs);
>> -struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
>> +struct task_struct *create_io_thread(int (*fn)(void *i), void *arg, int node);
>> +struct task_struct *kernel_worker(int (*fn)(void *), void *arg, int node,
>> +				  unsigned long clone_flags, u32 worker_flags);
>> +__printf(2, 3)
>> +void kernel_worker_start(struct task_struct *tsk, const char namefmt[], ...);
>>  struct task_struct *fork_idle(int);
>>  struct mm_struct *copy_init_mm(void);
>>  extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 98264cf1d6a6..3f3fcabffa5f 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -2540,6 +2540,54 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
>>  	return copy_process(NULL, 0, node, &args);
>>  }
>>  
>> +/**
>> + * kernel_worker - create a copy of a process to be used by the kernel
>> + * @fn: thread stack
>> + * @arg: data to be passed to fn
>> + * @node: numa node to allocate task from
>> + * @clone_flags: CLONE flags
>> + * @worker_flags: KERN_WORKER flags
>> + *
>> + * This returns a created task, or an error pointer. The returned task is
>> + * inactive, and the caller must fire it up through kernel_worker_start(). If
>> + * this is a PF_IO_WORKER, all signals but KILL and STOP are blocked.
>> + */
>> +struct task_struct *kernel_worker(int (*fn)(void *), void *arg, int node,
>> +				  unsigned long clone_flags, u32 worker_flags)
>> +{
>> +	struct kernel_clone_args args = {
>> +		.flags		= ((lower_32_bits(clone_flags) | CLONE_VM |
>> +				   CLONE_UNTRACED) & ~CSIGNAL),
>> +		.exit_signal	= (lower_32_bits(clone_flags) & CSIGNAL),
>> +		.stack		= (unsigned long)fn,
>> +		.stack_size	= (unsigned long)arg,
>> +		.worker_flags	= KERN_WORKER_USER | worker_flags,
>> +	};
>> +
>> +	return copy_process(NULL, 0, node, &args);
>> +}
>> +EXPORT_SYMBOL_GPL(kernel_worker);
>> +
>> +/**
>> + * kernel_worker_start - Start a task created with kernel_worker
>> + * @tsk: task to wake up
>> + * @namefmt: printf-style format string for the thread name
>> + * @arg: arguments for @namefmt
>> + */
>> +void kernel_worker_start(struct task_struct *tsk, const char namefmt[], ...)
>> +{
>> +	char name[TASK_COMM_LEN];
>> +	va_list args;
> 
> You could think about reporting an error from this function if
> KERN_WORKER_USER isn't set, or only call the below when KERN_WORKER_USER
> is set. Both options are fine.
> 

I'm not sure how to handle this comment, because I might have misread
an older comment or made it up in my head.

KERN_WORKER_USER is only set on the kernel_clone_args, so at this point
we don't have that struct available anymore.

I didn't add a new PF_KTHREAD_WORK_USER flag to sched.h, because I thought
I had gotten a review comment not to add another PF flag for this. However,
I can't find that comment now, so I may have misread it or made it up.

If it's ok I could add a PF_KTHREAD_WORK_USER, then do a:

WARN_ON(!(tsk->flags & PF_KTHREAD_WORK_USER));

so future developers get loud feedback right away that they are doing
the wrong thing.


* Re: [PATCH V3 5/9] fork: add helper to clone a process
  2021-10-05 17:10     ` Mike Christie
@ 2021-10-06 12:12       ` Christian Brauner
  0 siblings, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2021-10-06 12:12 UTC (permalink / raw)
  To: Mike Christie
  Cc: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, axboe, linux-kernel

On Tue, Oct 05, 2021 at 12:10:55PM -0500, Mike Christie wrote:
> On 10/5/21 7:50 AM, Christian Brauner wrote:
> > On Mon, Oct 04, 2021 at 02:21:24PM -0500, Mike Christie wrote:
> >> The vhost layer has similar requirements as io_uring where its worker
> >> threads need to access the userspace thread's memory, want to inherit the
> >> parent's cgroups and namespaces, and be checked against the parent's
> >> RLIMITs. Right now, the vhost layer uses the kthread API which has
> >> kthread_use_mm for mem access, and those threads can use
> >> cgroup_attach_task_all for v1 cgroups, but there are no helpers for the
> >> other items.
> >>
> >> This adds a helper to clone a process so we can inherit everything we
> >> want in one call. It's a more generic version of create_io_thread which
> >> will be used by the vhost layer and io_uring in later patches in this set.
> >>
> >> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> >> Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
> >> ---
> >>  include/linux/sched/task.h |  6 ++++-
> >>  kernel/fork.c              | 48 ++++++++++++++++++++++++++++++++++++++
> >>  2 files changed, 53 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> >> index e165cc67fd3c..ba0499b6627c 100644
> >> --- a/include/linux/sched/task.h
> >> +++ b/include/linux/sched/task.h
> >> @@ -87,7 +87,11 @@ extern void exit_files(struct task_struct *);
> >>  extern void exit_itimers(struct signal_struct *);
> >>  
> >>  extern pid_t kernel_clone(struct kernel_clone_args *kargs);
> >> -struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
> >> +struct task_struct *create_io_thread(int (*fn)(void *i), void *arg, int node);
> >> +struct task_struct *kernel_worker(int (*fn)(void *), void *arg, int node,
> >> +				  unsigned long clone_flags, u32 worker_flags);
> >> +__printf(2, 3)
> >> +void kernel_worker_start(struct task_struct *tsk, const char namefmt[], ...);
> >>  struct task_struct *fork_idle(int);
> >>  struct mm_struct *copy_init_mm(void);
> >>  extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
> >> diff --git a/kernel/fork.c b/kernel/fork.c
> >> index 98264cf1d6a6..3f3fcabffa5f 100644
> >> --- a/kernel/fork.c
> >> +++ b/kernel/fork.c
> >> @@ -2540,6 +2540,54 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
> >>  	return copy_process(NULL, 0, node, &args);
> >>  }
> >>  
> >> +/**
> >> + * kernel_worker - create a copy of a process to be used by the kernel
> >> + * @fn: thread stack
> >> + * @arg: data to be passed to fn
> >> + * @node: numa node to allocate task from
> >> + * @clone_flags: CLONE flags
> >> + * @worker_flags: KERN_WORKER flags
> >> + *
> >> + * This returns a created task, or an error pointer. The returned task is
> >> + * inactive, and the caller must fire it up through kernel_worker_start(). If
> >> + * this is a PF_IO_WORKER, all signals but KILL and STOP are blocked.
> >> + */
> >> +struct task_struct *kernel_worker(int (*fn)(void *), void *arg, int node,
> >> +				  unsigned long clone_flags, u32 worker_flags)
> >> +{
> >> +	struct kernel_clone_args args = {
> >> +		.flags		= ((lower_32_bits(clone_flags) | CLONE_VM |
> >> +				   CLONE_UNTRACED) & ~CSIGNAL),
> >> +		.exit_signal	= (lower_32_bits(clone_flags) & CSIGNAL),
> >> +		.stack		= (unsigned long)fn,
> >> +		.stack_size	= (unsigned long)arg,
> >> +		.worker_flags	= KERN_WORKER_USER | worker_flags,
> >> +	};
> >> +
> >> +	return copy_process(NULL, 0, node, &args);
> >> +}
> >> +EXPORT_SYMBOL_GPL(kernel_worker);
> >> +
> >> +/**
> >> + * kernel_worker_start - Start a task created with kernel_worker
> >> + * @tsk: task to wake up
> >> + * @namefmt: printf-style format string for the thread name
> >> + * @arg: arguments for @namefmt
> >> + */
> >> +void kernel_worker_start(struct task_struct *tsk, const char namefmt[], ...)
> >> +{
> >> +	char name[TASK_COMM_LEN];
> >> +	va_list args;
> > 
> > You could think about reporting an error from this function if
> > KERN_WORKER_USER isn't set, or only call the below when KERN_WORKER_USER
> > is set. Both options are fine.
> > 
> 
> I'm not sure how to handle this comment, because I might have misread
> an older comment or made it up in my head.
> 
> KERN_WORKER_USER is only set on the kernel_clone_args, so at this point
> we don't have that struct available anymore.

Ah, right.

> 
> I didn't add a new PF_KTHREAD_WORK_USER flag to sched.h, because I thought
> I had got a review comment to not add another PF flag for this. However, I
> can't seem to find that comment now so I'm not sure if maybe I misread a
> comment or made it up.
> 
> If it's ok I could add a PF_KTHREAD_WORK_USER, then do a:
> 
> WARN_ON(!(tsk->flags & PF_KTHREAD_WORK_USER));
> 
> so future developers get loud feedback they are doing the
> wrong thing right away.

I think a PF_USER_WORKER might just do fine as it fits with the naming
of PF_IO_WORKER.
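
A rough userspace model of that idea, assuming a PF_USER_WORKER bit
gets set at creation time: the start helper refuses tasks that were not
created as user workers instead of silently waking them. struct
demo_task, demo_worker_start() and the flag value are made up for
illustration; only TASK_COMM_LEN (16) matches the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

#define TASK_COMM_LEN	16	/* size of the kernel's comm buffer */
#define PF_USER_WORKER	0x1U	/* hypothetical value, for illustration only */

/* Hypothetical stand-in for the few task_struct fields used here. */
struct demo_task {
	unsigned int flags;
	char comm[TASK_COMM_LEN];
	int running;
};

/*
 * Model of kernel_worker_start() with the suggested check added.
 * Returns 0 on success, or -EINVAL if @tsk is not a user worker.
 */
static int demo_worker_start(struct demo_task *tsk, const char *namefmt, ...)
{
	va_list args;

	assert(tsk != NULL);

	if (!(tsk->flags & PF_USER_WORKER))
		return -EINVAL;

	/* vsnprintf() truncates the name to fit and always NUL-terminates. */
	va_start(args, namefmt);
	vsnprintf(tsk->comm, sizeof(tsk->comm), namefmt, args);
	va_end(args);

	tsk->running = 1;	/* stands in for wake_up_new_task() */
	return 0;
}
```

Whether the real helper should return an error or just WARN is still
open; the model uses a return value only because that is easy to test
outside the kernel.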

Christian


end of thread, other threads:[~2021-10-06 12:12 UTC | newest]

Thread overview: 17+ messages
2021-10-04 19:21 [PATCH V3 0/9] Use copy_process/create_io_thread in vhost layer Mike Christie
2021-10-04 19:21 ` [PATCH V3 1/9] fork: Make IO worker options flag based Mike Christie
2021-10-04 19:21 ` [PATCH V3 2/9] fork: pass worker_flags to copy_thread Mike Christie
2021-10-04 19:21 ` [PATCH V3 3/9] fork: move PF_IO_WORKER's kernel frame setup to new flag Mike Christie
2021-10-04 19:21 ` [PATCH V3 4/9] fork: add option to not clone or dup files Mike Christie
2021-10-04 19:21 ` [PATCH V3 5/9] fork: add helper to clone a process Mike Christie
2021-10-04 20:29   ` Jens Axboe
2021-10-05 12:50   ` Christian Brauner
2021-10-05 17:10     ` Mike Christie
2021-10-06 12:12       ` Christian Brauner
2021-10-04 19:21 ` [PATCH V3 6/9] io_uring: switch to kernel_worker Mike Christie
2021-10-04 20:30   ` Jens Axboe
2021-10-04 19:21 ` [PATCH V3 7/9] fork: Add worker flag to ignore signals Mike Christie
2021-10-04 20:04   ` Jens Axboe
2021-10-05 12:45     ` Christian Brauner
2021-10-04 19:21 ` [PATCH V3 8/9] vhost: move worker thread fields to new struct Mike Christie
2021-10-04 19:21 ` [PATCH V3 9/9] vhost: use kernel_worker to check RLIMITs and inherit v2 cgroups Mike Christie
