LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH 0/8] Use copy_process/create_io_thread in vhost layer
@ 2021-09-16 21:20 Mike Christie
  2021-09-16 21:20 ` [PATCH 1/8] fork: add helper to clone a process Mike Christie
                   ` (7 more replies)
  0 siblings, 8 replies; 16+ messages in thread
From: Mike Christie @ 2021-09-16 21:20 UTC (permalink / raw)
  To: stefanha, jasowang, mst, sgarzare, virtualization,
	christian.brauner, axboe, linux-kernel

The following patches were made over linus's tree and also apply over
Jens's 5.15 io_ring branch and Michael's vhost branch.

The patchset allows the vhost layer to do a copy_process on the thread
that does the VHOST_SET_OWNER ioctl like how io_uring does a copy_process
against its userspace app (Jens, the patches make create_io_thread more
generic so that's why you are cc'd). This allows the vhost layer's worker
threads to inherit cgroups, namespaces, address space, etc and this worker
thread will also be accounted for against that owner/parent process's
RLIMIT_NPROC limit.

Here is a more detailed problem description:

Qemu will create vhost devices in the kernel which perform network, SCSI,
etc IO and management operations from worker threads created by the
kthread API. Because the kthread API does a copy_process on the kthreadd
thread, the vhost layer has to use kthread_use_mm to access the Qemu
thread's memory and cgroup_attach_task_all to add itself to the Qemu
thread's cgroups.

The problem with this approach is that we then have to add new functions/
args/functionality for every thing we want to inherit. I started doing
that here:

https://lkml.org/lkml/2021/6/23/1233

for the RLIMIT_NPROC check, but it seems it might be easier to just
inherit everything from the beginning, becuase I'd need to do something
like that patch several times. For example, the current approach does not
support cgroups v2 so commands like virsh emulatorpin do not work. The
qemu process can go over its RLIMIT_NPROC. And for future vhost interfaces
where we export the vhost thread pid we will want the namespace info.




^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/8] fork: add helper to clone a process
  2021-09-16 21:20 [PATCH 0/8] Use copy_process/create_io_thread in vhost layer Mike Christie
@ 2021-09-16 21:20 ` Mike Christie
  2021-09-17  6:00   ` Christoph Hellwig
  2021-09-17  8:48   ` Christian Brauner
  2021-09-16 21:20 ` [PATCH 2/8] signal: Export ignore_signals Mike Christie
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 16+ messages in thread
From: Mike Christie @ 2021-09-16 21:20 UTC (permalink / raw)
  To: stefanha, jasowang, mst, sgarzare, virtualization,
	christian.brauner, axboe, linux-kernel
  Cc: Mike Christie

The vhost layer has similar requirements as io_uring where its worker
threads need to access the userspace thread's memory, want to inherit the
parents's cgroups and namespaces, and be checked against the parent's
RLIMITs. Right now, the vhost layer uses the kthread API which has
kthread_use_mm for mem access, and those threads can use
cgroup_attach_task_all for v1 cgroups, but there are no helpers for the
other items.

This adds a helper to clone a process so we can inherit everything we
want in one call. It's a more generic version of create_io_thread which
will be used by the vhost layer and io_uring in later patches in this set.

This patch also exports __set_task_comm and wake_up_new_task which is
needed by modules to use the new helper. io_uring calls these functions
already but its always built into the kernel so was not needed before.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 fs/exec.c                  |  7 +++++++
 include/linux/sched/task.h |  3 +++
 kernel/fork.c              | 29 +++++++++++++++++++++++++++++
 kernel/sched/core.c        |  4 +++-
 4 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/fs/exec.c b/fs/exec.c
index a098c133d8d7..9fc4bb0c5c7e 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1220,6 +1220,12 @@ EXPORT_SYMBOL_GPL(__get_task_comm);
  * so that a new one can be started
  */
 
+/**
+ * __set_task_comm - set the task's executable name
+ * @tsk: task_struct to modify
+ * @buf: executable name
+ * @exec: true if called during a process exec. false for name changes.
+ */
 void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
 {
 	task_lock(tsk);
@@ -1228,6 +1234,7 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
 	task_unlock(tsk);
 	perf_event_comm(tsk, exec);
 }
+EXPORT_SYMBOL_GPL(__set_task_comm);
 
 /*
  * Calling this is the point of no return. None of the failures will be
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ef02be869cf2..c55f1eb69d41 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -84,6 +84,9 @@ extern void exit_itimers(struct signal_struct *);
 
 extern pid_t kernel_clone(struct kernel_clone_args *kargs);
 struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
+struct task_struct *kernel_copy_process(int (*fn)(void *), void *arg, int node,
+					unsigned long clone_flags,
+					int io_thread);
 struct task_struct *fork_idle(int);
 struct mm_struct *copy_init_mm(void);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
diff --git a/kernel/fork.c b/kernel/fork.c
index 38681ad44c76..cec7b6011beb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2532,6 +2532,35 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 	return copy_process(NULL, 0, node, &args);
 }
 
+/**
+ * kernel_copy_process - create a copy of a process to be used by the kernel
+ * @fn: thread stack
+ * @arg: data to be passed to fn
+ * @node: numa node to allocate task from
+ * @clone_flags: CLONE flags
+ * @io_thread: 1 if this will be a PF_IO_WORKER else 0.
+ *
+ * This returns a created task, or an error pointer. The returned task is
+ * inactive, and the caller must fire it up through wake_up_new_task(p). If
+ * this is an PF_IO_WORKER all singals but KILL and STOP are blocked.
+ */
+struct task_struct *kernel_copy_process(int (*fn)(void *), void *arg, int node,
+					unsigned long clone_flags,
+					int io_thread)
+{
+	struct kernel_clone_args args = {
+		.flags		= ((lower_32_bits(clone_flags) | CLONE_VM |
+				    CLONE_UNTRACED) & ~CSIGNAL),
+		.exit_signal	= (lower_32_bits(clone_flags) & CSIGNAL),
+		.stack		= (unsigned long)fn,
+		.stack_size	= (unsigned long)arg,
+		.io_thread	= io_thread,
+	};
+
+	return copy_process(NULL, 0, node, &args);
+}
+EXPORT_SYMBOL_GPL(kernel_copy_process);
+
 /*
  *  Ok, this is the main fork-routine.
  *
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1bba4128a3e6..a0b9508ea202 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4429,8 +4429,9 @@ unsigned long to_ratio(u64 period, u64 runtime)
 	return div64_u64(runtime << BW_SHIFT, period);
 }
 
-/*
+/**
  * wake_up_new_task - wake up a newly created task for the first time.
+ * @p: task to wake up
  *
  * This function will do some initial scheduler statistics housekeeping
  * that must be done for every newly created context, then puts the task
@@ -4476,6 +4477,7 @@ void wake_up_new_task(struct task_struct *p)
 #endif
 	task_rq_unlock(rq, p, &rf);
 }
+EXPORT_SYMBOL_GPL(wake_up_new_task);
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 
-- 
2.25.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 2/8] signal: Export ignore_signals
  2021-09-16 21:20 [PATCH 0/8] Use copy_process/create_io_thread in vhost layer Mike Christie
  2021-09-16 21:20 ` [PATCH 1/8] fork: add helper to clone a process Mike Christie
@ 2021-09-16 21:20 ` Mike Christie
  2021-09-16 21:20 ` [PATCH 3/8] fork: add option to not clone or dup files Mike Christie
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Mike Christie @ 2021-09-16 21:20 UTC (permalink / raw)
  To: stefanha, jasowang, mst, sgarzare, virtualization,
	christian.brauner, axboe, linux-kernel
  Cc: Mike Christie

The kthread API creates threads that ignore all signals by default so
modules like vhost that will move from that API to kernel_copy_process
will not be expecting them. This patch exports ignore_signals so those
modules can keep their existing behavior.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 kernel/signal.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/signal.c b/kernel/signal.c
index 952741f6d0f9..8fb79200c18b 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -534,6 +534,10 @@ void flush_itimer_signals(void)
 }
 #endif
 
+/**
+ * ignore_signals - setup task to ignore all signals
+ * @t: task to setup
+ */
 void ignore_signals(struct task_struct *t)
 {
 	int i;
@@ -543,6 +547,7 @@ void ignore_signals(struct task_struct *t)
 
 	flush_signals(t);
 }
+EXPORT_SYMBOL_GPL(ignore_signals);
 
 /*
  * Flush all handlers for a task.
-- 
2.25.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 3/8] fork: add option to not clone or dup files
  2021-09-16 21:20 [PATCH 0/8] Use copy_process/create_io_thread in vhost layer Mike Christie
  2021-09-16 21:20 ` [PATCH 1/8] fork: add helper to clone a process Mike Christie
  2021-09-16 21:20 ` [PATCH 2/8] signal: Export ignore_signals Mike Christie
@ 2021-09-16 21:20 ` Mike Christie
  2021-09-17  8:54   ` Christian Brauner
  2021-09-16 21:20 ` [PATCH 4/8] fork: move PF_IO_WORKER's kernel frame setup to new flag Mike Christie
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Mike Christie @ 2021-09-16 21:20 UTC (permalink / raw)
  To: stefanha, jasowang, mst, sgarzare, virtualization,
	christian.brauner, axboe, linux-kernel
  Cc: Mike Christie

Each vhost device gets a thread that is used to perform IO and management
operations. Instead of a thread that is accessing a device, the thread is
part of the device, so when it calls kernel_copy_process we can't dup or
clone the parent's (Qemu thread that does the VHOST_SET_OWNER ioctl)
files/FDS because it would do an extra increment on ourself.

Later, when we do:

Qemu process exits:
	do_exit -> exit_files -> put_files_struct -> close_files

we would leak the device's resources because of that extra refcount
on the fd or file_struct.

This patch adds a no_files option so these worker threads can prevent
taking an extra refcount on themselves.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 include/linux/sched/task.h |  3 ++-
 kernel/fork.c              | 14 +++++++++++---
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index c55f1eb69d41..d0b0872f56cc 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -32,6 +32,7 @@ struct kernel_clone_args {
 	size_t set_tid_size;
 	int cgroup;
 	int io_thread;
+	int no_files;
 	struct cgroup *cgrp;
 	struct css_set *cset;
 };
@@ -86,7 +87,7 @@ extern pid_t kernel_clone(struct kernel_clone_args *kargs);
 struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
 struct task_struct *kernel_copy_process(int (*fn)(void *), void *arg, int node,
 					unsigned long clone_flags,
-					int io_thread);
+					int io_thread, int no_files);
 struct task_struct *fork_idle(int);
 struct mm_struct *copy_init_mm(void);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
diff --git a/kernel/fork.c b/kernel/fork.c
index cec7b6011beb..a0468e30b27e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1532,7 +1532,8 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
 	return 0;
 }
 
-static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
+static int copy_files(unsigned long clone_flags, struct task_struct *tsk,
+		      int no_files)
 {
 	struct files_struct *oldf, *newf;
 	int error = 0;
@@ -1544,6 +1545,11 @@ static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
 	if (!oldf)
 		goto out;
 
+	if (no_files) {
+		tsk->files = NULL;
+		goto out;
+	}
+
 	if (clone_flags & CLONE_FILES) {
 		atomic_inc(&oldf->count);
 		goto out;
@@ -2179,7 +2185,7 @@ static __latent_entropy struct task_struct *copy_process(
 	retval = copy_semundo(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_security;
-	retval = copy_files(clone_flags, p);
+	retval = copy_files(clone_flags, p, args->no_files);
 	if (retval)
 		goto bad_fork_cleanup_semundo;
 	retval = copy_fs(clone_flags, p);
@@ -2539,6 +2545,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
  * @node: numa node to allocate task from
  * @clone_flags: CLONE flags
  * @io_thread: 1 if this will be a PF_IO_WORKER else 0.
+ * @no_files: Do not duplicate or copy the parent's open files.
  *
  * This returns a created task, or an error pointer. The returned task is
  * inactive, and the caller must fire it up through wake_up_new_task(p). If
@@ -2546,7 +2553,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
  */
 struct task_struct *kernel_copy_process(int (*fn)(void *), void *arg, int node,
 					unsigned long clone_flags,
-					int io_thread)
+					int io_thread, int no_files)
 {
 	struct kernel_clone_args args = {
 		.flags		= ((lower_32_bits(clone_flags) | CLONE_VM |
@@ -2555,6 +2562,7 @@ struct task_struct *kernel_copy_process(int (*fn)(void *), void *arg, int node,
 		.stack		= (unsigned long)fn,
 		.stack_size	= (unsigned long)arg,
 		.io_thread	= io_thread,
+		.no_files	= no_files,
 	};
 
 	return copy_process(NULL, 0, node, &args);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 4/8] fork: move PF_IO_WORKER's kernel frame setup to new flag
  2021-09-16 21:20 [PATCH 0/8] Use copy_process/create_io_thread in vhost layer Mike Christie
                   ` (2 preceding siblings ...)
  2021-09-16 21:20 ` [PATCH 3/8] fork: add option to not clone or dup files Mike Christie
@ 2021-09-16 21:20 ` Mike Christie
  2021-09-16 21:20 ` [PATCH 5/8] io_uring: switch to kernel_copy_process Mike Christie
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Mike Christie @ 2021-09-16 21:20 UTC (permalink / raw)
  To: stefanha, jasowang, mst, sgarzare, virtualization,
	christian.brauner, axboe, linux-kernel
  Cc: Mike Christie

The vhost worker threads need the same frame setup as io_uring's worker
threads, but do not handle signals and may not even be executing IO. This
moves the setup part to a new flag.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 arch/x86/kernel/process.c  | 4 ++--
 include/linux/sched.h      | 1 +
 include/linux/sched/task.h | 1 +
 kernel/fork.c              | 3 +++
 4 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 1d9463e3096b..ccdec11cd120 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -178,9 +178,9 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 	task_user_gs(p) = get_user_gs(current_pt_regs());
 #endif
 
-	if (unlikely(p->flags & PF_IO_WORKER)) {
+	if (unlikely(p->flags & PF_USER_WORKER)) {
 		/*
-		 * An IO thread is a user space thread, but it doesn't
+		 * An user worker thread is a user space thread, but it doesn't
 		 * return to ret_after_fork().
 		 *
 		 * In order to indicate that to tools like gdb,
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e12b524426b0..817d3a7bec77 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1664,6 +1664,7 @@ extern struct pid *cad_pid;
 #define PF_VCPU			0x00000001	/* I'm a virtual CPU */
 #define PF_IDLE			0x00000002	/* I am an IDLE thread */
 #define PF_EXITING		0x00000004	/* Getting shut down */
+#define PF_USER_WORKER		0x00000008	/* Userspace kernel worker */
 #define PF_IO_WORKER		0x00000010	/* Task is an IO worker */
 #define PF_WQ_WORKER		0x00000020	/* I'm a workqueue worker */
 #define PF_FORKNOEXEC		0x00000040	/* Forked but didn't exec */
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index d0b0872f56cc..4a6100a24894 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -33,6 +33,7 @@ struct kernel_clone_args {
 	int cgroup;
 	int io_thread;
 	int no_files;
+	int user_worker;
 	struct cgroup *cgrp;
 	struct css_set *cset;
 };
diff --git a/kernel/fork.c b/kernel/fork.c
index a0468e30b27e..1dda1d4ea77b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2041,6 +2041,8 @@ static __latent_entropy struct task_struct *copy_process(
 		siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
 	}
 
+	if (args->user_worker)
+		p->flags |= PF_USER_WORKER;
 	/*
 	 * This _must_ happen before we call free_task(), i.e. before we jump
 	 * to any of the bad_fork_* labels. This is to avoid freeing
@@ -2563,6 +2565,7 @@ struct task_struct *kernel_copy_process(int (*fn)(void *), void *arg, int node,
 		.stack_size	= (unsigned long)arg,
 		.io_thread	= io_thread,
 		.no_files	= no_files,
+		.user_worker	= 1,
 	};
 
 	return copy_process(NULL, 0, node, &args);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 5/8] io_uring: switch to kernel_copy_process
  2021-09-16 21:20 [PATCH 0/8] Use copy_process/create_io_thread in vhost layer Mike Christie
                   ` (3 preceding siblings ...)
  2021-09-16 21:20 ` [PATCH 4/8] fork: move PF_IO_WORKER's kernel frame setup to new flag Mike Christie
@ 2021-09-16 21:20 ` Mike Christie
  2021-09-16 21:20 ` [PATCH 6/8] vhost: move worker thread fields to new struct Mike Christie
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Mike Christie @ 2021-09-16 21:20 UTC (permalink / raw)
  To: stefanha, jasowang, mst, sgarzare, virtualization,
	christian.brauner, axboe, linux-kernel
  Cc: Mike Christie

Convert io_uring and io-wq to use kernel_copy_process.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 fs/io-wq.c                 |  9 +++++++--
 fs/io_uring.c              |  5 ++++-
 include/linux/sched/task.h |  1 -
 kernel/fork.c              | 22 ----------------------
 4 files changed, 11 insertions(+), 26 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 6c55362c1f99..6fccba5bdc65 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -68,6 +68,9 @@ struct io_worker {
 
 #define IO_WQ_NR_HASH_BUCKETS	(1u << IO_WQ_HASH_ORDER)
 
+#define IO_WQ_CLONE_FLAGS	CLONE_FS|CLONE_FILES|CLONE_SIGHAND| \
+				CLONE_THREAD|CLONE_IO
+
 struct io_wqe_acct {
 	unsigned nr_workers;
 	unsigned max_workers;
@@ -687,7 +690,8 @@ static void create_worker_cont(struct callback_head *cb)
 	worker = container_of(cb, struct io_worker, create_work);
 	clear_bit_unlock(0, &worker->create_state);
 	wqe = worker->wqe;
-	tsk = create_io_thread(io_wqe_worker, worker, wqe->node);
+	tsk = kernel_copy_process(io_wqe_worker, worker, wqe->node,
+				  IO_WQ_CLONE_FLAGS, 1, 0);
 	if (!IS_ERR(tsk)) {
 		io_init_new_worker(wqe, worker, tsk);
 		io_worker_release(worker);
@@ -757,7 +761,8 @@ static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
 	if (index == IO_WQ_ACCT_BOUND)
 		worker->flags |= IO_WORKER_F_BOUND;
 
-	tsk = create_io_thread(io_wqe_worker, worker, wqe->node);
+	tsk = kernel_copy_process(io_wqe_worker, worker, wqe->node,
+				  IO_WQ_CLONE_FLAGS, 1, 0);
 	if (!IS_ERR(tsk)) {
 		io_init_new_worker(wqe, worker, tsk);
 	} else if (!io_should_retry_thread(PTR_ERR(tsk))) {
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 16fb7436043c..2493a78ddd7d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -8519,6 +8519,8 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 		fdput(f);
 	}
 	if (ctx->flags & IORING_SETUP_SQPOLL) {
+		unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|
+					CLONE_THREAD|CLONE_IO;
 		struct task_struct *tsk;
 		struct io_sq_data *sqd;
 		bool attached;
@@ -8560,7 +8562,8 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 
 		sqd->task_pid = current->pid;
 		sqd->task_tgid = current->tgid;
-		tsk = create_io_thread(io_sq_thread, sqd, NUMA_NO_NODE);
+		tsk = kernel_copy_process(io_sq_thread, sqd, NUMA_NO_NODE,
+					  flags, 1, 0);
 		if (IS_ERR(tsk)) {
 			ret = PTR_ERR(tsk);
 			goto err_sqpoll;
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 4a6100a24894..f43e81c907e1 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -85,7 +85,6 @@ extern void exit_files(struct task_struct *);
 extern void exit_itimers(struct signal_struct *);
 
 extern pid_t kernel_clone(struct kernel_clone_args *kargs);
-struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
 struct task_struct *kernel_copy_process(int (*fn)(void *), void *arg, int node,
 					unsigned long clone_flags,
 					int io_thread, int no_files);
diff --git a/kernel/fork.c b/kernel/fork.c
index 1dda1d4ea77b..9011cbe83fe8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2518,28 +2518,6 @@ struct mm_struct *copy_init_mm(void)
 	return dup_mm(NULL, &init_mm);
 }
 
-/*
- * This is like kernel_clone(), but shaved down and tailored to just
- * creating io_uring workers. It returns a created task, or an error pointer.
- * The returned task is inactive, and the caller must fire it up through
- * wake_up_new_task(p). All signals are blocked in the created task.
- */
-struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
-{
-	unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
-				CLONE_IO;
-	struct kernel_clone_args args = {
-		.flags		= ((lower_32_bits(flags) | CLONE_VM |
-				    CLONE_UNTRACED) & ~CSIGNAL),
-		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
-		.stack		= (unsigned long)fn,
-		.stack_size	= (unsigned long)arg,
-		.io_thread	= 1,
-	};
-
-	return copy_process(NULL, 0, node, &args);
-}
-
 /**
  * kernel_copy_process - create a copy of a process to be used by the kernel
  * @fn: thread stack
-- 
2.25.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 6/8] vhost: move worker thread fields to new struct
  2021-09-16 21:20 [PATCH 0/8] Use copy_process/create_io_thread in vhost layer Mike Christie
                   ` (4 preceding siblings ...)
  2021-09-16 21:20 ` [PATCH 5/8] io_uring: switch to kernel_copy_process Mike Christie
@ 2021-09-16 21:20 ` Mike Christie
  2021-09-16 21:20 ` [PATCH 7/8] vhost: use kernel_copy_process to check RLIMITs and inherit cgroups Mike Christie
  2021-09-16 21:20 ` [PATCH 8/8] vhost: remove cgroup code Mike Christie
  7 siblings, 0 replies; 16+ messages in thread
From: Mike Christie @ 2021-09-16 21:20 UTC (permalink / raw)
  To: stefanha, jasowang, mst, sgarzare, virtualization,
	christian.brauner, axboe, linux-kernel
  Cc: Mike Christie

This is just a prep patch. It moves the worker related fields to a new
vhost_worker struct and moves the code around to create some helpers that
will be used in the next patches.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 drivers/vhost/vhost.c | 98 ++++++++++++++++++++++++++++---------------
 drivers/vhost/vhost.h | 11 +++--
 2 files changed, 72 insertions(+), 37 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 59edb5a1ffe2..c9a1f706989c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -263,8 +263,8 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 		 * sure it was not in the list.
 		 * test_and_set_bit() implies a memory barrier.
 		 */
-		llist_add(&work->node, &dev->work_list);
-		wake_up_process(dev->worker);
+		llist_add(&work->node, &dev->worker->work_list);
+		wake_up_process(dev->worker->task);
 	}
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
@@ -272,7 +272,7 @@ EXPORT_SYMBOL_GPL(vhost_work_queue);
 /* A lockless hint for busy polling code to exit the loop */
 bool vhost_has_work(struct vhost_dev *dev)
 {
-	return !llist_empty(&dev->work_list);
+	return dev->worker && !llist_empty(&dev->worker->work_list);
 }
 EXPORT_SYMBOL_GPL(vhost_has_work);
 
@@ -343,7 +343,8 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 
 static int vhost_worker(void *data)
 {
-	struct vhost_dev *dev = data;
+	struct vhost_worker *worker = data;
+	struct vhost_dev *dev = worker->dev;
 	struct vhost_work *work, *work_next;
 	struct llist_node *node;
 
@@ -358,7 +359,7 @@ static int vhost_worker(void *data)
 			break;
 		}
 
-		node = llist_del_all(&dev->work_list);
+		node = llist_del_all(&worker->work_list);
 		if (!node)
 			schedule();
 
@@ -368,7 +369,7 @@ static int vhost_worker(void *data)
 		llist_for_each_entry_safe(work, work_next, node, node) {
 			clear_bit(VHOST_WORK_QUEUED, &work->flags);
 			__set_current_state(TASK_RUNNING);
-			kcov_remote_start_common(dev->kcov_handle);
+			kcov_remote_start_common(worker->kcov_handle);
 			work->fn(work);
 			kcov_remote_stop();
 			if (need_resched())
@@ -487,7 +488,6 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->byte_weight = byte_weight;
 	dev->use_worker = use_worker;
 	dev->msg_handler = msg_handler;
-	init_llist_head(&dev->work_list);
 	init_waitqueue_head(&dev->wait);
 	INIT_LIST_HEAD(&dev->read_list);
 	INIT_LIST_HEAD(&dev->pending_list);
@@ -579,10 +579,60 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 	dev->mm = NULL;
 }
 
+static void vhost_worker_free(struct vhost_dev *dev)
+{
+	struct vhost_worker *worker = dev->worker;
+
+	if (!worker)
+		return;
+
+	dev->worker = NULL;
+	WARN_ON(!llist_empty(&worker->work_list));
+	kthread_stop(worker->task);
+	kfree(worker);
+}
+
+static int vhost_worker_create(struct vhost_dev *dev)
+{
+	struct vhost_worker *worker;
+	struct task_struct *task;
+	int ret;
+
+	worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
+	if (!worker)
+		return -ENOMEM;
+
+	dev->worker = worker;
+	worker->dev = dev;
+	worker->kcov_handle = kcov_common_handle();
+	init_llist_head(&worker->work_list);
+
+	task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
+	if (IS_ERR(task)) {
+		ret = PTR_ERR(task);
+		goto free_worker;
+	}
+
+	worker->task = task;
+	wake_up_process(task); /* avoid contributing to loadavg */
+
+	ret = vhost_attach_cgroups(dev);
+	if (ret)
+		goto stop_worker;
+
+	return 0;
+
+stop_worker:
+	kthread_stop(worker->task);
+free_worker:
+	kfree(worker);
+	dev->worker = NULL;
+	return ret;
+}
+
 /* Caller should have device mutex */
 long vhost_dev_set_owner(struct vhost_dev *dev)
 {
-	struct task_struct *worker;
 	int err;
 
 	/* Is there an owner already? */
@@ -593,36 +643,21 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 
 	vhost_attach_mm(dev);
 
-	dev->kcov_handle = kcov_common_handle();
 	if (dev->use_worker) {
-		worker = kthread_create(vhost_worker, dev,
-					"vhost-%d", current->pid);
-		if (IS_ERR(worker)) {
-			err = PTR_ERR(worker);
-			goto err_worker;
-		}
-
-		dev->worker = worker;
-		wake_up_process(worker); /* avoid contributing to loadavg */
-
-		err = vhost_attach_cgroups(dev);
+		err = vhost_worker_create(dev);
 		if (err)
-			goto err_cgroup;
+			goto err_worker;
 	}
 
 	err = vhost_dev_alloc_iovecs(dev);
 	if (err)
-		goto err_cgroup;
+		goto err_iovecs;
 
 	return 0;
-err_cgroup:
-	if (dev->worker) {
-		kthread_stop(dev->worker);
-		dev->worker = NULL;
-	}
+err_iovecs:
+	vhost_worker_free(dev);
 err_worker:
 	vhost_detach_mm(dev);
-	dev->kcov_handle = 0;
 err_mm:
 	return err;
 }
@@ -712,12 +747,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 	dev->iotlb = NULL;
 	vhost_clear_msg(dev);
 	wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM);
-	WARN_ON(!llist_empty(&dev->work_list));
-	if (dev->worker) {
-		kthread_stop(dev->worker);
-		dev->worker = NULL;
-		dev->kcov_handle = 0;
-	}
+	vhost_worker_free(dev);
 	vhost_detach_mm(dev);
 }
 EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 638bb640d6b4..102ce25e4e13 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -25,6 +25,13 @@ struct vhost_work {
 	unsigned long		flags;
 };
 
+struct vhost_worker {
+	struct task_struct	*task;
+	struct llist_head	work_list;
+	struct vhost_dev	*dev;
+	u64			kcov_handle;
+};
+
 /* Poll a file (eventfd or socket) */
 /* Note: there's nothing vhost specific about this structure. */
 struct vhost_poll {
@@ -148,8 +155,7 @@ struct vhost_dev {
 	struct vhost_virtqueue **vqs;
 	int nvqs;
 	struct eventfd_ctx *log_ctx;
-	struct llist_head work_list;
-	struct task_struct *worker;
+	struct vhost_worker *worker;
 	struct vhost_iotlb *umem;
 	struct vhost_iotlb *iotlb;
 	spinlock_t iotlb_lock;
@@ -159,7 +165,6 @@ struct vhost_dev {
 	int iov_limit;
 	int weight;
 	int byte_weight;
-	u64 kcov_handle;
 	bool use_worker;
 	int (*msg_handler)(struct vhost_dev *dev,
 			   struct vhost_iotlb_msg *msg);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 7/8] vhost: use kernel_copy_process to check RLIMITs and inherit cgroups
  2021-09-16 21:20 [PATCH 0/8] Use copy_process/create_io_thread in vhost layer Mike Christie
                   ` (5 preceding siblings ...)
  2021-09-16 21:20 ` [PATCH 6/8] vhost: move worker thread fields to new struct Mike Christie
@ 2021-09-16 21:20 ` Mike Christie
  2021-09-20 20:47   ` kernel test robot
  2021-09-16 21:20 ` [PATCH 8/8] vhost: remove cgroup code Mike Christie
  7 siblings, 1 reply; 16+ messages in thread
From: Mike Christie @ 2021-09-16 21:20 UTC (permalink / raw)
  To: stefanha, jasowang, mst, sgarzare, virtualization,
	christian.brauner, axboe, linux-kernel
  Cc: Mike Christie

For vhost workers we use the kthread API which inherit's its values from
and checks against the kthreadd thread. This results in cgroups v2 not
working and the wrong RLIMITs being checked. This patch has us use the
kernel_copy_process function which will inherit its values/checks from the
thread that owns the device.

Note this patch converts us. The next patch will remove the code that is
no longer needed.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/vhost.c | 49 +++++++++++++++++++++++++++----------------
 drivers/vhost/vhost.h |  7 ++++++-
 2 files changed, 37 insertions(+), 19 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c9a1f706989c..6e58417b13fc 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -344,17 +344,14 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 static int vhost_worker(void *data)
 {
 	struct vhost_worker *worker = data;
-	struct vhost_dev *dev = worker->dev;
 	struct vhost_work *work, *work_next;
 	struct llist_node *node;
 
-	kthread_use_mm(dev->mm);
-
 	for (;;) {
 		/* mb paired w/ kthread_stop */
 		set_current_state(TASK_INTERRUPTIBLE);
 
-		if (kthread_should_stop()) {
+		if (test_bit(VHOST_WORKER_FLAG_STOP, &worker->flags)) {
 			__set_current_state(TASK_RUNNING);
 			break;
 		}
@@ -376,8 +373,9 @@ static int vhost_worker(void *data)
 				schedule();
 		}
 	}
-	kthread_unuse_mm(dev->mm);
-	return 0;
+
+	complete(worker->exit_done);
+	do_exit(0);
 }
 
 static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
@@ -579,6 +577,16 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 	dev->mm = NULL;
 }
 
+static void vhost_worker_stop(struct vhost_worker *worker)
+{
+	DECLARE_COMPLETION_ONSTACK(exit_done);
+
+	worker->exit_done = &exit_done;
+	set_bit(VHOST_WORKER_FLAG_STOP, &worker->flags);
+	wake_up_process(worker->task);
+	wait_for_completion(worker->exit_done);
+}
+
 static void vhost_worker_free(struct vhost_dev *dev)
 {
 	struct vhost_worker *worker = dev->worker;
@@ -588,14 +596,16 @@ static void vhost_worker_free(struct vhost_dev *dev)
 
 	dev->worker = NULL;
 	WARN_ON(!llist_empty(&worker->work_list));
-	kthread_stop(worker->task);
+	vhost_worker_stop(worker);
 	kfree(worker);
 }
 
 static int vhost_worker_create(struct vhost_dev *dev)
 {
+	DECLARE_COMPLETION_ONSTACK(start_done);
 	struct vhost_worker *worker;
 	struct task_struct *task;
+	char buf[TASK_COMM_LEN];
 	int ret;
 
 	worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
@@ -603,27 +613,30 @@ static int vhost_worker_create(struct vhost_dev *dev)
 		return -ENOMEM;
 
 	dev->worker = worker;
-	worker->dev = dev;
 	worker->kcov_handle = kcov_common_handle();
 	init_llist_head(&worker->work_list);
 
-	task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
-	if (IS_ERR(task)) {
-		ret = PTR_ERR(task);
+	/*
+	 * vhost used to use the kthread API which ignores all signals by
+	 * default and the drivers expect this behavior. So we do not want to
+	 * ineherit the parent's signal handlers and set our worker to ignore
+	 * everything below.
+	 */
+	task = kernel_copy_process(vhost_worker, worker, NUMA_NO_NODE,
+				   CLONE_FS|CLONE_CLEAR_SIGHAND, 0, 1);
+	if (IS_ERR(task))
 		goto free_worker;
-	}
 
 	worker->task = task;
-	wake_up_process(task); /* avoid contributing to loadavg */
 
-	ret = vhost_attach_cgroups(dev);
-	if (ret)
-		goto stop_worker;
+	snprintf(buf, sizeof(buf), "vhost-%d", current->pid);
+	set_task_comm(task, buf);
+
+	ignore_signals(task);
 
+	wake_up_new_task(task);
 	return 0;
 
-stop_worker:
-	kthread_stop(worker->task);
 free_worker:
 	kfree(worker);
 	dev->worker = NULL;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 102ce25e4e13..09748694cb66 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -25,11 +25,16 @@ struct vhost_work {
 	unsigned long		flags;
 };
 
+enum {
+	VHOST_WORKER_FLAG_STOP,
+};
+
 struct vhost_worker {
 	struct task_struct	*task;
+	struct completion	*exit_done;
 	struct llist_head	work_list;
-	struct vhost_dev	*dev;
 	u64			kcov_handle;
+	unsigned long		flags;
 };
 
 /* Poll a file (eventfd or socket) */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 8/8] vhost: remove cgroup code
  2021-09-16 21:20 [PATCH 0/8] Use copy_process/create_io_thread in vhost layer Mike Christie
                   ` (6 preceding siblings ...)
  2021-09-16 21:20 ` [PATCH 7/8] vhost: use kernel_copy_process to check RLIMITs and inherit cgroups Mike Christie
@ 2021-09-16 21:20 ` Mike Christie
  7 siblings, 0 replies; 16+ messages in thread
From: Mike Christie @ 2021-09-16 21:20 UTC (permalink / raw)
  To: stefanha, jasowang, mst, sgarzare, virtualization,
	christian.brauner, axboe, linux-kernel
  Cc: Mike Christie

We inherit v1 and v2 cgroups from copy_process now, so we can drop the v1
only code.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/vhost.c | 26 --------------------------
 1 file changed, 26 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 6e58417b13fc..b561c5ea00fc 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -22,7 +22,6 @@
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
 #include <linux/kthread.h>
-#include <linux/cgroup.h>
 #include <linux/module.h>
 #include <linux/sort.h>
 #include <linux/sched/mm.h>
@@ -515,31 +514,6 @@ long vhost_dev_check_owner(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_check_owner);
 
-struct vhost_attach_cgroups_struct {
-	struct vhost_work work;
-	struct task_struct *owner;
-	int ret;
-};
-
-static void vhost_attach_cgroups_work(struct vhost_work *work)
-{
-	struct vhost_attach_cgroups_struct *s;
-
-	s = container_of(work, struct vhost_attach_cgroups_struct, work);
-	s->ret = cgroup_attach_task_all(s->owner, current);
-}
-
-static int vhost_attach_cgroups(struct vhost_dev *dev)
-{
-	struct vhost_attach_cgroups_struct attach;
-
-	attach.owner = current;
-	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
-	vhost_work_queue(dev, &attach.work);
-	vhost_work_dev_flush(dev);
-	return attach.ret;
-}
-
 /* Caller should have device mutex */
 bool vhost_dev_has_owner(struct vhost_dev *dev)
 {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/8] fork: add helper to clone a process
  2021-09-16 21:20 ` [PATCH 1/8] fork: add helper to clone a process Mike Christie
@ 2021-09-17  6:00   ` Christoph Hellwig
  2021-09-17  7:44     ` Christian Brauner
  2021-09-17  8:48   ` Christian Brauner
  1 sibling, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2021-09-17  6:00 UTC (permalink / raw)
  To: Mike Christie
  Cc: stefanha, jasowang, mst, sgarzare, virtualization,
	christian.brauner, axboe, linux-kernel

On Thu, Sep 16, 2021 at 04:20:44PM -0500, Mike Christie wrote:
> The vhost layer has similar requirements as io_uring where its worker
> threads need to access the userspace thread's memory, want to inherit the
> parents's cgroups and namespaces, and be checked against the parent's
> RLIMITs. Right now, the vhost layer uses the kthread API which has
> kthread_use_mm for mem access, and those threads can use
> cgroup_attach_task_all for v1 cgroups, but there are no helpers for the
> other items.
> 
> This adds a helper to clone a process so we can inherit everything we
> want in one call. It's a more generic version of create_io_thread which
> will be used by the vhost layer and io_uring in later patches in this set.
> 
> This patch also exports __set_task_comm and wake_up_new_task which is
> needed by modules to use the new helper. io_uring calls these functions
> already but its always built into the kernel so was not needed before.

Can you build proper APIs please?  e.g. the set_task_comm users
generally want a printf-like varargs caling conventions.  I'd also
much prefer to hide as much as possible in the actual helper.  That is
build a helper that gets the name, a flag to ignore the singals etc
instead of exporting all these random low-level helpers.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/8] fork: add helper to clone a process
  2021-09-17  6:00   ` Christoph Hellwig
@ 2021-09-17  7:44     ` Christian Brauner
  2021-09-17  8:01       ` Christoph Hellwig
  0 siblings, 1 reply; 16+ messages in thread
From: Christian Brauner @ 2021-09-17  7:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mike Christie, stefanha, jasowang, mst, sgarzare, virtualization,
	axboe, linux-kernel

On Fri, Sep 17, 2021 at 07:00:35AM +0100, Christoph Hellwig wrote:
> On Thu, Sep 16, 2021 at 04:20:44PM -0500, Mike Christie wrote:
> > The vhost layer has similar requirements as io_uring where its worker
> > threads need to access the userspace thread's memory, want to inherit the
> > parents's cgroups and namespaces, and be checked against the parent's
> > RLIMITs. Right now, the vhost layer uses the kthread API which has
> > kthread_use_mm for mem access, and those threads can use
> > cgroup_attach_task_all for v1 cgroups, but there are no helpers for the
> > other items.
> > 
> > This adds a helper to clone a process so we can inherit everything we
> > want in one call. It's a more generic version of create_io_thread which
> > will be used by the vhost layer and io_uring in later patches in this set.
> > 
> > This patch also exports __set_task_comm and wake_up_new_task which is
> > needed by modules to use the new helper. io_uring calls these functions
> > already but its always built into the kernel so was not needed before.
> 
> Can you build proper APIs please?  e.g. the set_task_comm users
> generally want a printf-like varargs caling conventions.  I'd also
> much prefer to hide as much as possible in the actual helper.  That is
> build a helper that gets the name, a flag to ignore the singals etc
> instead of exporting all these random low-level helpers.

Yes, I think that's really what we want here.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/8] fork: add helper to clone a process
  2021-09-17  7:44     ` Christian Brauner
@ 2021-09-17  8:01       ` Christoph Hellwig
  2021-09-17  8:43         ` Christian Brauner
  0 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2021-09-17  8:01 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, Mike Christie, stefanha, jasowang, mst,
	sgarzare, virtualization, axboe, linux-kernel

On Fri, Sep 17, 2021 at 09:44:40AM +0200, Christian Brauner wrote:
> > generally want a printf-like varargs caling conventions.  I'd also
> > much prefer to hide as much as possible in the actual helper.  That is
> > build a helper that gets the name, a flag to ignore the singals etc
> > instead of exporting all these random low-level helpers.
> 
> Yes, I think that's really what we want here.

In a way this would mean enhancing the kthread API to also support I/O
threads.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/8] fork: add helper to clone a process
  2021-09-17  8:01       ` Christoph Hellwig
@ 2021-09-17  8:43         ` Christian Brauner
  0 siblings, 0 replies; 16+ messages in thread
From: Christian Brauner @ 2021-09-17  8:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mike Christie, stefanha, jasowang, mst, sgarzare, virtualization,
	axboe, linux-kernel

On Fri, Sep 17, 2021 at 09:01:06AM +0100, Christoph Hellwig wrote:
> On Fri, Sep 17, 2021 at 09:44:40AM +0200, Christian Brauner wrote:
> > > generally want a printf-like varargs caling conventions.  I'd also
> > > much prefer to hide as much as possible in the actual helper.  That is
> > > build a helper that gets the name, a flag to ignore the singals etc
> > > instead of exporting all these random low-level helpers.
> > 
> > Yes, I think that's really what we want here.
> 
> In a way this would mean enhancing the kthread API to also support I/O
> threads.

Kinda, it's a kthread in so far as it has been created by the kernel and
not through a regular syscall path. But it's not really using any of the
kthread.c stuff.

I think the general push of the series is right. We should aim for a
helper that allows a driver - similar to what uring already does - to
create a task with the context of the caller.

After the fork/clone rework that I did it shouldn't be that difficult to
do. The uring patch did show that it's fairly straightforward.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/8] fork: add helper to clone a process
  2021-09-16 21:20 ` [PATCH 1/8] fork: add helper to clone a process Mike Christie
  2021-09-17  6:00   ` Christoph Hellwig
@ 2021-09-17  8:48   ` Christian Brauner
  1 sibling, 0 replies; 16+ messages in thread
From: Christian Brauner @ 2021-09-17  8:48 UTC (permalink / raw)
  To: Mike Christie
  Cc: stefanha, jasowang, mst, sgarzare, virtualization, axboe, linux-kernel

On Thu, Sep 16, 2021 at 04:20:44PM -0500, Mike Christie wrote:
> The vhost layer has similar requirements as io_uring where its worker
> threads need to access the userspace thread's memory, want to inherit the
> parents's cgroups and namespaces, and be checked against the parent's
> RLIMITs. Right now, the vhost layer uses the kthread API which has
> kthread_use_mm for mem access, and those threads can use
> cgroup_attach_task_all for v1 cgroups, but there are no helpers for the
> other items.
> 
> This adds a helper to clone a process so we can inherit everything we
> want in one call. It's a more generic version of create_io_thread which
> will be used by the vhost layer and io_uring in later patches in this set.
> 
> This patch also exports __set_task_comm and wake_up_new_task which is
> needed by modules to use the new helper. io_uring calls these functions
> already but its always built into the kernel so was not needed before.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
>  fs/exec.c                  |  7 +++++++
>  include/linux/sched/task.h |  3 +++
>  kernel/fork.c              | 29 +++++++++++++++++++++++++++++
>  kernel/sched/core.c        |  4 +++-
>  4 files changed, 42 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index a098c133d8d7..9fc4bb0c5c7e 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1220,6 +1220,12 @@ EXPORT_SYMBOL_GPL(__get_task_comm);
>   * so that a new one can be started
>   */
>  
> +/**
> + * __set_task_comm - set the task's executable name
> + * @tsk: task_struct to modify
> + * @buf: executable name
> + * @exec: true if called during a process exec. false for name changes.
> + */
>  void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
>  {
>  	task_lock(tsk);
> @@ -1228,6 +1234,7 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
>  	task_unlock(tsk);
>  	perf_event_comm(tsk, exec);
>  }
> +EXPORT_SYMBOL_GPL(__set_task_comm);
>  
>  /*
>   * Calling this is the point of no return. None of the failures will be
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index ef02be869cf2..c55f1eb69d41 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -84,6 +84,9 @@ extern void exit_itimers(struct signal_struct *);
>  
>  extern pid_t kernel_clone(struct kernel_clone_args *kargs);
>  struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
> +struct task_struct *kernel_copy_process(int (*fn)(void *), void *arg, int node,
> +					unsigned long clone_flags,
> +					int io_thread);
>  struct task_struct *fork_idle(int);
>  struct mm_struct *copy_init_mm(void);
>  extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 38681ad44c76..cec7b6011beb 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2532,6 +2532,35 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
>  	return copy_process(NULL, 0, node, &args);
>  }
>  
> +/**
> + * kernel_copy_process - create a copy of a process to be used by the kernel
> + * @fn: thread stack
> + * @arg: data to be passed to fn
> + * @node: numa node to allocate task from
> + * @clone_flags: CLONE flags
> + * @io_thread: 1 if this will be a PF_IO_WORKER else 0.
> + *
> + * This returns a created task, or an error pointer. The returned task is
> + * inactive, and the caller must fire it up through wake_up_new_task(p). If
> + * this is an PF_IO_WORKER all singals but KILL and STOP are blocked.
> + */
> +struct task_struct *kernel_copy_process(int (*fn)(void *), void *arg, int node,
> +					unsigned long clone_flags,
> +					int io_thread)

Hm, not excited about the name. I think

kernel_worker_create()

or simply

kernel_worker()

is better.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/8] fork: add option to not clone or dup files
  2021-09-16 21:20 ` [PATCH 3/8] fork: add option to not clone or dup files Mike Christie
@ 2021-09-17  8:54   ` Christian Brauner
  0 siblings, 0 replies; 16+ messages in thread
From: Christian Brauner @ 2021-09-17  8:54 UTC (permalink / raw)
  To: Mike Christie
  Cc: stefanha, jasowang, mst, sgarzare, virtualization, axboe, linux-kernel

On Thu, Sep 16, 2021 at 04:20:46PM -0500, Mike Christie wrote:
> Each vhost device gets a thread that is used to perform IO and management
> operations. Instead of a thread that is accessing a device, the thread is
> part of the device, so when it calls kernel_copy_process we can't dup or
> clone the parent's (Qemu thread that does the VHOST_SET_OWNER ioctl)
> files/FDS because it would do an extra increment on ourself.
> 
> Later, when we do:
> 
> Qemu process exits:
> 	do_exit -> exit_files -> put_files_struct -> close_files
> 
> we would leak the device's resources because of that extra refcount
> on the fd or file_struct.
> 
> This patch adds a no_files option so these worker threads can prevent
> taking an extra refcount on themselves.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
>  include/linux/sched/task.h |  3 ++-
>  kernel/fork.c              | 14 +++++++++++---
>  2 files changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index c55f1eb69d41..d0b0872f56cc 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -32,6 +32,7 @@ struct kernel_clone_args {
>  	size_t set_tid_size;
>  	int cgroup;
>  	int io_thread;
> +	int no_files;
>  	struct cgroup *cgrp;
>  	struct css_set *cset;
>  };
> @@ -86,7 +87,7 @@ extern pid_t kernel_clone(struct kernel_clone_args *kargs);
>  struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
>  struct task_struct *kernel_copy_process(int (*fn)(void *), void *arg, int node,
>  					unsigned long clone_flags,
> -					int io_thread);
> +					int io_thread, int no_files);
>  struct task_struct *fork_idle(int);
>  struct mm_struct *copy_init_mm(void);
>  extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index cec7b6011beb..a0468e30b27e 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1532,7 +1532,8 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
>  	return 0;
>  }
>  
> -static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
> +static int copy_files(unsigned long clone_flags, struct task_struct *tsk,
> +		      int no_files)
>  {
>  	struct files_struct *oldf, *newf;
>  	int error = 0;
> @@ -1544,6 +1545,11 @@ static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
>  	if (!oldf)
>  		goto out;
>  
> +	if (no_files) {
> +		tsk->files = NULL;
> +		goto out;
> +	}
> +
>  	if (clone_flags & CLONE_FILES) {
>  		atomic_inc(&oldf->count);
>  		goto out;
> @@ -2179,7 +2185,7 @@ static __latent_entropy struct task_struct *copy_process(
>  	retval = copy_semundo(clone_flags, p);
>  	if (retval)
>  		goto bad_fork_cleanup_security;
> -	retval = copy_files(clone_flags, p);
> +	retval = copy_files(clone_flags, p, args->no_files);
>  	if (retval)
>  		goto bad_fork_cleanup_semundo;
>  	retval = copy_fs(clone_flags, p);
> @@ -2539,6 +2545,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
>   * @node: numa node to allocate task from
>   * @clone_flags: CLONE flags
>   * @io_thread: 1 if this will be a PF_IO_WORKER else 0.
> + * @no_files: Do not duplicate or copy the parent's open files.
>   *
>   * This returns a created task, or an error pointer. The returned task is
>   * inactive, and the caller must fire it up through wake_up_new_task(p). If
> @@ -2546,7 +2553,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
>   */
>  struct task_struct *kernel_copy_process(int (*fn)(void *), void *arg, int node,
>  					unsigned long clone_flags,
> -					int io_thread)
> +					int io_thread, int no_files)

I think that the addition of no_files together with io_thread might be a
sign to introduce a set of kernel internal clone flags in
kernel_clone_args.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 7/8] vhost: use kernel_copy_process to check RLIMITs and inherit cgroups
  2021-09-16 21:20 ` [PATCH 7/8] vhost: use kernel_copy_process to check RLIMITs and inherit cgroups Mike Christie
@ 2021-09-20 20:47   ` kernel test robot
  0 siblings, 0 replies; 16+ messages in thread
From: kernel test robot @ 2021-09-20 20:47 UTC (permalink / raw)
  To: Mike Christie, stefanha, jasowang, mst, sgarzare, virtualization,
	christian.brauner, axboe, linux-kernel
  Cc: kbuild-all, Mike Christie

[-- Attachment #1: Type: text/plain, Size: 5684 bytes --]

Hi Mike,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on mst-vhost/linux-next]
[also build test ERROR on linus/master v5.15-rc2 next-20210920]
[cannot apply to tip/sched/core tip/x86/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Mike-Christie/Use-copy_process-create_io_thread-in-vhost-layer/20210917-052427
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux-next
config: i386-allyesconfig (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/57f4cf5028dd552f4895a6a83a8607412ceb8b6c
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Mike-Christie/Use-copy_process-create_io_thread-in-vhost-layer/20210917-052427
        git checkout 57f4cf5028dd552f4895a6a83a8607412ceb8b6c
        # save the attached .config to linux build tree
        make W=1 ARCH=i386 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from include/linux/sched.h:10,
                    from include/linux/eventfd.h:17,
                    from drivers/vhost/vhost.c:13:
   drivers/vhost/vhost.c: In function 'vhost_worker_create':
>> include/uapi/linux/sched.h:12:18: error: conversion from 'long long unsigned int' to 'long unsigned int' changes value from '4294967808' to '512' [-Werror=overflow]
      12 | #define CLONE_FS 0x00000200 /* set if fs info shared between processes */
         |                  ^
   drivers/vhost/vhost.c:626:8: note: in expansion of macro 'CLONE_FS'
     626 |        CLONE_FS|CLONE_CLEAR_SIGHAND, 0, 1);
         |        ^~~~~~~~
   At top level:
   drivers/vhost/vhost.c:532:12: error: 'vhost_attach_cgroups' defined but not used [-Werror=unused-function]
     532 | static int vhost_attach_cgroups(struct vhost_dev *dev)
         |            ^~~~~~~~~~~~~~~~~~~~
   cc1: all warnings being treated as errors


vim +12 include/uapi/linux/sched.h

7f192e3cd316ba Christian Brauner 2019-05-25   6  
607ca46e97a1b6 David Howells     2012-10-13   7  /*
607ca46e97a1b6 David Howells     2012-10-13   8   * cloning flags:
607ca46e97a1b6 David Howells     2012-10-13   9   */
607ca46e97a1b6 David Howells     2012-10-13  10  #define CSIGNAL		0x000000ff	/* signal mask to be sent at exit */
607ca46e97a1b6 David Howells     2012-10-13  11  #define CLONE_VM	0x00000100	/* set if VM shared between processes */
607ca46e97a1b6 David Howells     2012-10-13 @12  #define CLONE_FS	0x00000200	/* set if fs info shared between processes */
607ca46e97a1b6 David Howells     2012-10-13  13  #define CLONE_FILES	0x00000400	/* set if open files shared between processes */
607ca46e97a1b6 David Howells     2012-10-13  14  #define CLONE_SIGHAND	0x00000800	/* set if signal handlers and blocked signals shared */
b3e5838252665e Christian Brauner 2019-03-27  15  #define CLONE_PIDFD	0x00001000	/* set if a pidfd should be placed in parent */
607ca46e97a1b6 David Howells     2012-10-13  16  #define CLONE_PTRACE	0x00002000	/* set if we want to let tracing continue on the child too */
607ca46e97a1b6 David Howells     2012-10-13  17  #define CLONE_VFORK	0x00004000	/* set if the parent wants the child to wake it up on mm_release */
607ca46e97a1b6 David Howells     2012-10-13  18  #define CLONE_PARENT	0x00008000	/* set if we want to have the same parent as the cloner */
607ca46e97a1b6 David Howells     2012-10-13  19  #define CLONE_THREAD	0x00010000	/* Same thread group? */
fcd964dda5ece2 Chen Hanxiao      2014-10-07  20  #define CLONE_NEWNS	0x00020000	/* New mount namespace group */
607ca46e97a1b6 David Howells     2012-10-13  21  #define CLONE_SYSVSEM	0x00040000	/* share system V SEM_UNDO semantics */
607ca46e97a1b6 David Howells     2012-10-13  22  #define CLONE_SETTLS	0x00080000	/* create a new TLS for the child */
607ca46e97a1b6 David Howells     2012-10-13  23  #define CLONE_PARENT_SETTID	0x00100000	/* set the TID in the parent */
607ca46e97a1b6 David Howells     2012-10-13  24  #define CLONE_CHILD_CLEARTID	0x00200000	/* clear the TID in the child */
607ca46e97a1b6 David Howells     2012-10-13  25  #define CLONE_DETACHED		0x00400000	/* Unused, ignored */
607ca46e97a1b6 David Howells     2012-10-13  26  #define CLONE_UNTRACED		0x00800000	/* set if the tracing process can't force CLONE_PTRACE on this clone */
607ca46e97a1b6 David Howells     2012-10-13  27  #define CLONE_CHILD_SETTID	0x01000000	/* set the TID in the child */
5e2bec7c2248ae Aditya Kali       2016-01-29  28  #define CLONE_NEWCGROUP		0x02000000	/* New cgroup namespace */
f622b429dadf83 Chen Hanxiao      2014-11-04  29  #define CLONE_NEWUTS		0x04000000	/* New utsname namespace */
f622b429dadf83 Chen Hanxiao      2014-11-04  30  #define CLONE_NEWIPC		0x08000000	/* New ipc namespace */
607ca46e97a1b6 David Howells     2012-10-13  31  #define CLONE_NEWUSER		0x10000000	/* New user namespace */
607ca46e97a1b6 David Howells     2012-10-13  32  #define CLONE_NEWPID		0x20000000	/* New pid namespace */
607ca46e97a1b6 David Howells     2012-10-13  33  #define CLONE_NEWNET		0x40000000	/* New network namespace */
607ca46e97a1b6 David Howells     2012-10-13  34  #define CLONE_IO		0x80000000	/* Clone io context */
607ca46e97a1b6 David Howells     2012-10-13  35  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 66061 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-09-20 20:49 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-16 21:20 [PATCH 0/8] Use copy_process/create_io_thread in vhost layer Mike Christie
2021-09-16 21:20 ` [PATCH 1/8] fork: add helper to clone a process Mike Christie
2021-09-17  6:00   ` Christoph Hellwig
2021-09-17  7:44     ` Christian Brauner
2021-09-17  8:01       ` Christoph Hellwig
2021-09-17  8:43         ` Christian Brauner
2021-09-17  8:48   ` Christian Brauner
2021-09-16 21:20 ` [PATCH 2/8] signal: Export ignore_signals Mike Christie
2021-09-16 21:20 ` [PATCH 3/8] fork: add option to not clone or dup files Mike Christie
2021-09-17  8:54   ` Christian Brauner
2021-09-16 21:20 ` [PATCH 4/8] fork: move PF_IO_WORKER's kernel frame setup to new flag Mike Christie
2021-09-16 21:20 ` [PATCH 5/8] io_uring: switch to kernel_copy_process Mike Christie
2021-09-16 21:20 ` [PATCH 6/8] vhost: move worker thread fields to new struct Mike Christie
2021-09-16 21:20 ` [PATCH 7/8] vhost: use kernel_copy_process to check RLIMITs and inherit cgroups Mike Christie
2021-09-20 20:47   ` kernel test robot
2021-09-16 21:20 ` [PATCH 8/8] vhost: remove cgroup code Mike Christie

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).