LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [patch 0/4] prctl task isolation interface and vmstat sync
@ 2021-07-27 10:38 Marcelo Tosatti
2021-07-27 10:38 ` [patch 1/4] add basic task isolation prctl interface Marcelo Tosatti
` (3 more replies)
0 siblings, 4 replies; 26+ messages in thread
From: Marcelo Tosatti @ 2021-07-27 10:38 UTC (permalink / raw)
To: linux-kernel
Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
Peter Xu
The logic to disable vmstat worker thread, when entering
nohz full, does not cover all scenarios. For example, it is possible
for the following to happen:
1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
2) app runs mlock, which increases counters for mlock'ed pages.
3) start -RT loop
Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
the mlock, vmstat shepherd can restart vmstat worker thread on
the CPU in question.
To fix this, add task isolation prctl interface to quiesce
deferred actions when returning to userspace.
=============================
Task isolation prctl interface
=============================
Set thread isolation mode and parameters, which allows
informing the kernel that application is
executing latency sensitive code (where interruptions
are undesired).
Its composed of 4 prctl commands (passed as arg1 to
prctl):
PR_ISOL_SET: set isolation parameters for the task
PR_ISOL_GET: get isolation parameters for the task
PR_ISOL_ENTER: indicate that task should be considered
isolated from this point on
PR_ISOL_EXIT: indicate that task should not be considered
isolated from this point on
The isolation parameters and mode are not inherited by
children created by fork(2) and clone(2). The setting is
preserved across execve(2).
The meaning of isolated is specified as follows, when setting arg2 to
PR_ISOL_SET or PR_ISOL_GET, with the following arguments passed as arg3.
Isolation mode (PR_ISOL_MODE):
------------------------------
- PR_ISOL_MODE_NONE (arg4): no per-task isolation (default mode).
PR_ISOL_EXIT sets mode to PR_ISOL_MODE_NONE.
- PR_ISOL_MODE_NORMAL (arg4): applications can perform system calls normally,
and in case of interruption events, the notifications can be collected
by BPF programs.
In this mode, if system calls are performed, deferred actions initiated
by the system call will be executed before return to userspace.
Other modes, which for example send signals upon interruptions events,
can be implemented.
Example
=======
The ``samples/task_isolation/`` directory contains a sample
application.
^ permalink raw reply [flat|nested] 26+ messages in thread
* [patch 1/4] add basic task isolation prctl interface
2021-07-27 10:38 [patch 0/4] prctl task isolation interface and vmstat sync Marcelo Tosatti
@ 2021-07-27 10:38 ` Marcelo Tosatti
2021-07-27 10:48 ` nsaenzju
2021-07-27 10:38 ` [patch 2/4] task isolation: sync vmstats on return to userspace Marcelo Tosatti
` (2 subsequent siblings)
3 siblings, 1 reply; 26+ messages in thread
From: Marcelo Tosatti @ 2021-07-27 10:38 UTC (permalink / raw)
To: linux-kernel
Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
Peter Xu, Marcelo Tosatti
Add basic prctl task isolation interface, which allows
informing the kernel that application is executing
latency sensitive code (where interruptions are undesired).
Interface is described by task_isolation.rst (added by this patch).
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Index: linux-2.6-vmstat-update/Documentation/userspace-api/task_isolation.rst
===================================================================
--- /dev/null
+++ linux-2.6-vmstat-update/Documentation/userspace-api/task_isolation.rst
@@ -0,0 +1,52 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================
+Task isolation prctl interface
+=============================
+
+Set thread isolation mode and parameters, which allows
+informing the kernel that application is
+executing latency sensitive code (where interruptions
+are undesired).
+
+Its composed of 4 prctl commands (passed as arg1 to
+prctl):
+
+PR_ISOL_SET: set isolation parameters for the task
+
+PR_ISOL_GET: get isolation parameters for the task
+
+PR_ISOL_ENTER: indicate that task should be considered
+ isolated from this point on
+
+PR_ISOL_EXIT: indicate that task should not be considered
+ isolated from this point on
+
+The isolation parameters and mode are not inherited by
+children created by fork(2) and clone(2). The setting is
+preserved across execve(2).
+
+The meaning of isolated is specified as follows, when setting arg2 to
+PR_ISOL_SET or PR_ISOL_GET, with the following arguments passed as arg3.
+
+Isolation mode (PR_ISOL_MODE):
+------------------------------
+
+- PR_ISOL_MODE_NONE (arg4): no per-task isolation (default mode).
+ PR_ISOL_EXIT sets mode to PR_ISOL_MODE_NONE.
+
+- PR_ISOL_MODE_NORMAL (arg4): applications can perform system calls normally,
+ and in case of interruption events, the notifications can be collected
+ by BPF programs.
+ In this mode, if system calls are performed, deferred actions initiated
+ by the system call will be executed before return to userspace.
+
+Other modes, which for example send signals upon interruptions events,
+can be implemented.
+
+Example
+=======
+
+The ``samples/task_isolation/`` directory contains a sample
+application.
+
Index: linux-2.6-vmstat-update/include/uapi/linux/prctl.h
===================================================================
--- linux-2.6-vmstat-update.orig/include/uapi/linux/prctl.h
+++ linux-2.6-vmstat-update/include/uapi/linux/prctl.h
@@ -267,4 +267,13 @@ struct prctl_mm_map {
# define PR_SCHED_CORE_SHARE_FROM 3 /* pull core_sched cookie to pid */
# define PR_SCHED_CORE_MAX 4
+/* Task isolation control */
+#define PR_ISOL_SET 62
+#define PR_ISOL_GET 63
+#define PR_ISOL_ENTER 64
+#define PR_ISOL_EXIT 65
+# define PR_ISOL_MODE 1
+
+# define PR_ISOL_MODE_NONE 0
+# define PR_ISOL_MODE_NORMAL 1
#endif /* _LINUX_PRCTL_H */
Index: linux-2.6-vmstat-update/kernel/Makefile
===================================================================
--- linux-2.6-vmstat-update.orig/kernel/Makefile
+++ linux-2.6-vmstat-update/kernel/Makefile
@@ -132,6 +132,8 @@ obj-$(CONFIG_WATCH_QUEUE) += watch_queue
obj-$(CONFIG_RESOURCE_KUNIT_TEST) += resource_kunit.o
obj-$(CONFIG_SYSCTL_KUNIT_TEST) += sysctl-test.o
+obj-$(CONFIG_CPU_ISOLATION) += task_isolation.o
+
CFLAGS_stackleak.o += $(DISABLE_STACKLEAK_PLUGIN)
obj-$(CONFIG_GCC_PLUGIN_STACKLEAK) += stackleak.o
KASAN_SANITIZE_stackleak.o := n
Index: linux-2.6-vmstat-update/kernel/sys.c
===================================================================
--- linux-2.6-vmstat-update.orig/kernel/sys.c
+++ linux-2.6-vmstat-update/kernel/sys.c
@@ -58,6 +58,7 @@
#include <linux/sched/coredump.h>
#include <linux/sched/task.h>
#include <linux/sched/cputime.h>
+#include <linux/task_isolation.h>
#include <linux/rcupdate.h>
#include <linux/uidgid.h>
#include <linux/cred.h>
@@ -2567,6 +2568,18 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
error = sched_core_share_pid(arg2, arg3, arg4, arg5);
break;
#endif
+ case PR_ISOL_SET:
+ error = prctl_task_isolation_set(arg2, arg3, arg4, arg5);
+ break;
+ case PR_ISOL_GET:
+ error = prctl_task_isolation_get(arg2, arg3, arg4, arg5);
+ break;
+ case PR_ISOL_ENTER:
+ error = prctl_task_isolation_enter(arg2, arg3, arg4, arg5);
+ break;
+ case PR_ISOL_EXIT:
+ error = prctl_task_isolation_exit(arg2, arg3, arg4, arg5);
+ break;
default:
error = -EINVAL;
break;
Index: linux-2.6-vmstat-update/samples/task_isolation/task_isolation.c
===================================================================
--- /dev/null
+++ linux-2.6-vmstat-update/samples/task_isolation/task_isolation.c
@@ -0,0 +1,51 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/mman.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/prctl.h>
+#include <linux/prctl.h>
+
+int main(void)
+{
+ int ret;
+ void *buf = malloc(4096);
+
+ memset(buf, 1, 4096);
+ ret = mlock(buf, 4096);
+ if (ret) {
+ perror("mlock");
+ return EXIT_FAILURE;
+ }
+
+ ret = prctl(PR_ISOL_SET, PR_ISOL_MODE, PR_ISOL_MODE_NORMAL, 0, 0);
+ if (ret == -1) {
+ perror("prctl PR_ISOL_SET");
+ return EXIT_FAILURE;
+ }
+
+ ret = prctl(PR_ISOL_ENTER, 0, 0, 0, 0);
+ if (ret == -1) {
+ perror("prctl PR_ISOL_ENTER");
+ exit(0);
+ }
+
+ /* busy loop */
+ while (ret < 99999999) {
+ memset(buf, 0, 10);
+ ret = ret+1;
+ }
+
+ ret = prctl(PR_ISOL_EXIT, 0, 0, 0, 0);
+ if (ret == -1) {
+ perror("prctl PR_ISOL_EXIT");
+ return EXIT_FAILURE;
+ }
+
+ return EXIT_SUCCESS;
+}
+
Index: linux-2.6-vmstat-update/include/linux/sched.h
===================================================================
--- linux-2.6-vmstat-update.orig/include/linux/sched.h
+++ linux-2.6-vmstat-update/include/linux/sched.h
@@ -66,6 +66,7 @@ struct sighand_struct;
struct signal_struct;
struct task_delay_info;
struct task_group;
+struct isol_info;
/*
* Task state bitmask. NOTE! These bits are also
@@ -1400,6 +1401,10 @@ struct task_struct {
struct llist_head kretprobe_instances;
#endif
+#ifdef CONFIG_CPU_ISOLATION
+ struct isol_info *isol_info;
+#endif
+
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
Index: linux-2.6-vmstat-update/init/init_task.c
===================================================================
--- linux-2.6-vmstat-update.orig/init/init_task.c
+++ linux-2.6-vmstat-update/init/init_task.c
@@ -213,6 +213,9 @@ struct task_struct init_task
#ifdef CONFIG_SECCOMP_FILTER
.seccomp = { .filter_count = ATOMIC_INIT(0) },
#endif
+#ifdef CONFIG_CPU_ISOLATION
+ .isol_info = NULL,
+#endif
};
EXPORT_SYMBOL(init_task);
Index: linux-2.6-vmstat-update/kernel/fork.c
===================================================================
--- linux-2.6-vmstat-update.orig/kernel/fork.c
+++ linux-2.6-vmstat-update/kernel/fork.c
@@ -97,6 +97,7 @@
#include <linux/scs.h>
#include <linux/io_uring.h>
#include <linux/bpf.h>
+#include <linux/task_isolation.h>
#include <asm/pgalloc.h>
#include <linux/uaccess.h>
@@ -734,6 +735,7 @@ void __put_task_struct(struct task_struc
WARN_ON(refcount_read(&tsk->usage));
WARN_ON(tsk == current);
+ tsk_isol_exit(tsk);
io_uring_free(tsk);
cgroup_free(tsk);
task_numa_free(tsk, true);
@@ -2084,7 +2086,9 @@ static __latent_entropy struct task_stru
#ifdef CONFIG_BPF_SYSCALL
RCU_INIT_POINTER(p->bpf_storage, NULL);
#endif
-
+#ifdef CONFIG_CPU_ISOLATION
+ p->isol_info = NULL;
+#endif
/* Perform scheduler related setup. Assign this task to a CPU. */
retval = sched_fork(clone_flags, p);
if (retval)
Index: linux-2.6-vmstat-update/include/linux/task_isolation.h
===================================================================
--- /dev/null
+++ linux-2.6-vmstat-update/include/linux/task_isolation.h
@@ -0,0 +1,75 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __LINUX_TASK_ISOL_H
+#define __LINUX_TASK_ISOL_H
+
+#ifdef CONFIG_CPU_ISOLATION
+
+struct isol_info {
+ u8 mode;
+ u8 active;
+};
+
+extern void __tsk_isol_exit(struct task_struct *tsk);
+
+static inline void tsk_isol_exit(struct task_struct *tsk)
+{
+ if (tsk->isol_info)
+ __tsk_isol_exit(tsk);
+}
+
+int prctl_task_isolation_get(unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5);
+
+int prctl_task_isolation_set(unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5);
+
+int prctl_task_isolation_enter(unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5);
+
+int prctl_task_isolation_exit(unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5);
+
+
+#else
+
+static inline void tsk_isol_exit(struct task_struct *tsk)
+{
+}
+
+
+static inline int prctl_task_isolation_get(unsigned long arg2,
+ unsigned long arg3,
+ unsigned long arg4,
+ unsigned long arg5)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isolation_set(unsigned long arg2,
+ unsigned long arg3,
+ unsigned long arg4,
+ unsigned long arg5)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isolation_enter(unsigned long arg2,
+ unsigned long arg3,
+ unsigned long arg4,
+ unsigned long arg5)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isolation_exit(unsigned long arg2,
+ unsigned long arg3,
+ unsigned long arg4,
+ unsigned long arg5)
+{
+ return -EOPNOTSUPP;
+}
+
+#endif /* CONFIG_CPU_ISOLATION */
+
+#endif /* __LINUX_TASK_ISOL_H */
Index: linux-2.6-vmstat-update/kernel/task_isolation.c
===================================================================
--- /dev/null
+++ linux-2.6-vmstat-update/kernel/task_isolation.c
@@ -0,0 +1,96 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Implementation of task isolation.
+ *
+ * Authors:
+ * Chris Metcalf <cmetcalf@mellanox.com>
+ * Alex Belits <abelits@marvell.com>
+ * Yuri Norov <ynorov@marvell.com>
+ * Marcelo Tosatti <mtosatti@redhat.com>
+ */
+
+#include <linux/sched.h>
+#include <linux/task_isolation.h>
+#include <linux/prctl.h>
+#include <linux/slab.h>
+
+static int tsk_isol_alloc_context(struct task_struct *task)
+{
+ struct isol_info *info;
+
+ info = kzalloc(sizeof(*info), GFP_KERNEL);
+ if (unlikely(!info))
+ return -ENOMEM;
+
+ task->isol_info = info;
+ return 0;
+}
+
+void __tsk_isol_exit(struct task_struct *tsk)
+{
+ kfree(tsk->isol_info);
+ tsk->isol_info = NULL;
+}
+
+int prctl_task_isolation_get(unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5)
+{
+ if (arg2 != PR_ISOL_MODE)
+ return -EOPNOTSUPP;
+
+ if (current->isol_info != NULL)
+ return current->isol_info->mode;
+
+ return PR_ISOL_MODE_NONE;
+}
+
+
+int prctl_task_isolation_set(unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5)
+{
+ int ret;
+
+ if (arg2 != PR_ISOL_MODE)
+ return -EOPNOTSUPP;
+
+ if (arg3 != PR_ISOL_MODE_NORMAL)
+ return -EINVAL;
+
+ ret = tsk_isol_alloc_context(current);
+ if (ret)
+ return ret;
+
+ current->isol_info->mode = arg3;
+ return 0;
+}
+
+int prctl_task_isolation_enter(unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5)
+{
+
+ if (current->isol_info == NULL)
+ return -EINVAL;
+
+ if (current->isol_info->mode != PR_ISOL_MODE_NORMAL)
+ return -EINVAL;
+
+ current->isol_info->active = 1;
+
+ return 0;
+}
+
+int prctl_task_isolation_exit(unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5)
+{
+ if (current->isol_info == NULL)
+ return -EINVAL;
+
+ if (current->isol_info->mode != PR_ISOL_MODE_NORMAL)
+ return -EINVAL;
+
+ current->isol_info->active = 0;
+
+ return 0;
+}
+
+
Index: linux-2.6-vmstat-update/samples/Kconfig
===================================================================
--- linux-2.6-vmstat-update.orig/samples/Kconfig
+++ linux-2.6-vmstat-update/samples/Kconfig
@@ -223,4 +223,11 @@ config SAMPLE_WATCH_QUEUE
Build example userspace program to use the new mount_notify(),
sb_notify() syscalls and the KEYCTL_WATCH_KEY keyctl() function.
+config SAMPLE_TASK_ISOLATION
+ bool "task isolation sample"
+ depends on CC_CAN_LINK && HEADERS_INSTALL
+ help
+ Build example userspace program to use prctl task isolation
+ interface.
+
endif # SAMPLES
Index: linux-2.6-vmstat-update/samples/Makefile
===================================================================
--- linux-2.6-vmstat-update.orig/samples/Makefile
+++ linux-2.6-vmstat-update/samples/Makefile
@@ -30,3 +30,4 @@ obj-$(CONFIG_SAMPLE_INTEL_MEI) += mei/
subdir-$(CONFIG_SAMPLE_WATCHDOG) += watchdog
subdir-$(CONFIG_SAMPLE_WATCH_QUEUE) += watch_queue
obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak/
+subdir-$(CONFIG_SAMPLE_TASK_ISOLATION) += task_isolation
Index: linux-2.6-vmstat-update/samples/task_isolation/Makefile
===================================================================
--- /dev/null
+++ linux-2.6-vmstat-update/samples/task_isolation/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
+userprogs-always-y += task_isolation
+
+userccflags += -I usr/include
^ permalink raw reply [flat|nested] 26+ messages in thread
* [patch 2/4] task isolation: sync vmstats on return to userspace
2021-07-27 10:38 [patch 0/4] prctl task isolation interface and vmstat sync Marcelo Tosatti
2021-07-27 10:38 ` [patch 1/4] add basic task isolation prctl interface Marcelo Tosatti
@ 2021-07-27 10:38 ` Marcelo Tosatti
2021-07-27 10:38 ` [patch 3/4] mm: vmstat: move need_update Marcelo Tosatti
2021-07-27 10:38 ` [patch 4/4] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
3 siblings, 0 replies; 26+ messages in thread
From: Marcelo Tosatti @ 2021-07-27 10:38 UTC (permalink / raw)
To: linux-kernel
Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
Peter Xu, Marcelo Tosatti
The logic to disable vmstat worker thread, when entering
nohz full, does not cover all scenarios. For example, it is possible
for the following to happen:
1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
2) app runs mlock, which increases counters for mlock'ed pages.
3) start -RT loop
Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
the mlock, vmstat shepherd can restart vmstat worker thread on
the CPU in question.
To fix this, use the task isolation prctl interface to quiesce
deferred actions when returning to userspace.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Index: linux-2.6-vmstat-update/include/linux/task_isolation.h
===================================================================
--- linux-2.6-vmstat-update.orig/include/linux/task_isolation.h
+++ linux-2.6-vmstat-update/include/linux/task_isolation.h
@@ -30,9 +30,20 @@ int prctl_task_isolation_enter(unsigned
int prctl_task_isolation_exit(unsigned long arg2, unsigned long arg3,
unsigned long arg4, unsigned long arg5);
+void __isolation_exit_to_user_mode_prepare(void);
+
+static inline void isolation_exit_to_user_mode_prepare(void)
+{
+ if (current->isol_info != NULL)
+ __isolation_exit_to_user_mode_prepare();
+}
#else
+static void isolation_exit_to_user_mode_prepare(void)
+{
+}
+
static inline void tsk_isol_exit(struct task_struct *tsk)
{
}
Index: linux-2.6-vmstat-update/include/linux/vmstat.h
===================================================================
--- linux-2.6-vmstat-update.orig/include/linux/vmstat.h
+++ linux-2.6-vmstat-update/include/linux/vmstat.h
@@ -21,6 +21,14 @@ int sysctl_vm_numa_stat_handler(struct c
void *buffer, size_t *length, loff_t *ppos);
#endif
+#ifdef CONFIG_SMP
+void sync_vmstat(void);
+#else
+static inline void sync_vmstat(void)
+{
+}
+#endif
+
struct reclaim_stat {
unsigned nr_dirty;
unsigned nr_unqueued_dirty;
Index: linux-2.6-vmstat-update/kernel/entry/common.c
===================================================================
--- linux-2.6-vmstat-update.orig/kernel/entry/common.c
+++ linux-2.6-vmstat-update/kernel/entry/common.c
@@ -6,6 +6,7 @@
#include <linux/livepatch.h>
#include <linux/audit.h>
#include <linux/tick.h>
+#include <linux/task_isolation.h>
#include "common.h"
@@ -287,6 +288,7 @@ static void syscall_exit_to_user_mode_pr
static __always_inline void __syscall_exit_to_user_mode_work(struct pt_regs *regs)
{
syscall_exit_to_user_mode_prepare(regs);
+ isolation_exit_to_user_mode_prepare();
local_irq_disable_exit_to_user();
exit_to_user_mode_prepare(regs);
}
Index: linux-2.6-vmstat-update/kernel/task_isolation.c
===================================================================
--- linux-2.6-vmstat-update.orig/kernel/task_isolation.c
+++ linux-2.6-vmstat-update/kernel/task_isolation.c
@@ -13,6 +13,8 @@
#include <linux/task_isolation.h>
#include <linux/prctl.h>
#include <linux/slab.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
static int tsk_isol_alloc_context(struct task_struct *task)
{
@@ -93,4 +95,14 @@ int prctl_task_isolation_exit(unsigned l
return 0;
}
+void __isolation_exit_to_user_mode_prepare(void)
+{
+ if (current->isol_info->mode != PR_ISOL_MODE_NORMAL)
+ return;
+
+ if (current->isol_info->active != 1)
+ return;
+
+ sync_vmstat();
+}
Index: linux-2.6-vmstat-update/mm/vmstat.c
===================================================================
--- linux-2.6-vmstat-update.orig/mm/vmstat.c
+++ linux-2.6-vmstat-update/mm/vmstat.c
@@ -1964,6 +1964,27 @@ static void vmstat_shepherd(struct work_
round_jiffies_relative(sysctl_stat_interval));
}
+void sync_vmstat(void)
+{
+ int cpu;
+
+ cpu = get_cpu();
+
+ refresh_cpu_vm_stats(false);
+ put_cpu();
+
+ /*
+ * If task is migrated to another CPU between put_cpu
+ * and cancel_delayed_work_sync, the code below might
+ * cancel vmstat_update work for a different cpu
+ * (than the one from which the vmstats were flushed).
+ *
+ * However, vmstat shepherd will re-enable it later,
+ * so its harmless.
+ */
+ cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
+}
+
static void __init start_shepherd_timer(void)
{
int cpu;
^ permalink raw reply [flat|nested] 26+ messages in thread
* [patch 3/4] mm: vmstat: move need_update
2021-07-27 10:38 [patch 0/4] prctl task isolation interface and vmstat sync Marcelo Tosatti
2021-07-27 10:38 ` [patch 1/4] add basic task isolation prctl interface Marcelo Tosatti
2021-07-27 10:38 ` [patch 2/4] task isolation: sync vmstats on return to userspace Marcelo Tosatti
@ 2021-07-27 10:38 ` Marcelo Tosatti
2021-07-27 10:38 ` [patch 4/4] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
3 siblings, 0 replies; 26+ messages in thread
From: Marcelo Tosatti @ 2021-07-27 10:38 UTC (permalink / raw)
To: linux-kernel
Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
Peter Xu, Marcelo Tosatti
Move need_update() function up in vmstat.c, needed by next patch.
No code changes.
Remove a duplicate comment while at it.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Index: linux-2.6-vmstat-update/mm/vmstat.c
===================================================================
--- linux-2.6-vmstat-update.orig/mm/vmstat.c
+++ linux-2.6-vmstat-update/mm/vmstat.c
@@ -1794,6 +1794,37 @@ static const struct seq_operations vmsta
static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
int sysctl_stat_interval __read_mostly = HZ;
+/*
+ * Check if the diffs for a certain cpu indicate that
+ * an update is needed.
+ */
+static bool need_update(int cpu)
+{
+ pg_data_t *last_pgdat = NULL;
+ struct zone *zone;
+
+ for_each_populated_zone(zone) {
+ struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
+ struct per_cpu_nodestat *n;
+
+ /*
+ * The fast way of checking if there are any vmstat diffs.
+ */
+ if (memchr_inv(pzstats->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS *
+ sizeof(pzstats->vm_stat_diff[0])))
+ return true;
+
+ if (last_pgdat == zone->zone_pgdat)
+ continue;
+ last_pgdat = zone->zone_pgdat;
+ n = per_cpu_ptr(zone->zone_pgdat->per_cpu_nodestats, cpu);
+ if (memchr_inv(n->vm_node_stat_diff, 0, NR_VM_NODE_STAT_ITEMS *
+ sizeof(n->vm_node_stat_diff[0])))
+ return true;
+ }
+ return false;
+}
+
#ifdef CONFIG_PROC_FS
static void refresh_vm_stats(struct work_struct *work)
{
@@ -1874,42 +1905,6 @@ static void vmstat_update(struct work_st
}
/*
- * Switch off vmstat processing and then fold all the remaining differentials
- * until the diffs stay at zero. The function is used by NOHZ and can only be
- * invoked when tick processing is not active.
- */
-/*
- * Check if the diffs for a certain cpu indicate that
- * an update is needed.
- */
-static bool need_update(int cpu)
-{
- pg_data_t *last_pgdat = NULL;
- struct zone *zone;
-
- for_each_populated_zone(zone) {
- struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
- struct per_cpu_nodestat *n;
-
- /*
- * The fast way of checking if there are any vmstat diffs.
- */
- if (memchr_inv(pzstats->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS *
- sizeof(pzstats->vm_stat_diff[0])))
- return true;
-
- if (last_pgdat == zone->zone_pgdat)
- continue;
- last_pgdat = zone->zone_pgdat;
- n = per_cpu_ptr(zone->zone_pgdat->per_cpu_nodestats, cpu);
- if (memchr_inv(n->vm_node_stat_diff, 0, NR_VM_NODE_STAT_ITEMS *
- sizeof(n->vm_node_stat_diff[0])))
- return true;
- }
- return false;
-}
-
-/*
* Switch off vmstat processing and then fold all the remaining differentials
* until the diffs stay at zero. The function is used by NOHZ and can only be
* invoked when tick processing is not active.
^ permalink raw reply [flat|nested] 26+ messages in thread
* [patch 4/4] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean
2021-07-27 10:38 [patch 0/4] prctl task isolation interface and vmstat sync Marcelo Tosatti
` (2 preceding siblings ...)
2021-07-27 10:38 ` [patch 3/4] mm: vmstat: move need_update Marcelo Tosatti
@ 2021-07-27 10:38 ` Marcelo Tosatti
3 siblings, 0 replies; 26+ messages in thread
From: Marcelo Tosatti @ 2021-07-27 10:38 UTC (permalink / raw)
To: linux-kernel
Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
Peter Xu, Marcelo Tosatti
It is not necessary to queue work item to run refresh_vm_stats
on a remote CPU if that CPU has no dirty stats and no per-CPU
allocations for remote nodes.
This fixes sosreport hang (which uses vmstat_refresh) with
spinning SCHED_FIFO process.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Index: linux-2.6-vmstat-update/mm/vmstat.c
===================================================================
--- linux-2.6-vmstat-update.orig/mm/vmstat.c
+++ linux-2.6-vmstat-update/mm/vmstat.c
@@ -1826,17 +1826,40 @@ static bool need_update(int cpu)
}
#ifdef CONFIG_PROC_FS
-static void refresh_vm_stats(struct work_struct *work)
+static bool need_drain_remote_zones(int cpu)
+{
+#ifdef CONFIG_NUMA
+ struct zone *zone;
+
+ for_each_populated_zone(zone) {
+ struct per_cpu_pages __percpu *pcp = zone->per_cpu_pageset;
+
+ if (!pcp->count)
+ continue;
+
+ if (!pcp->expire)
+ continue;
+ if (zone_to_nid(zone) == cpu_to_node(cpu))
+ continue;
+
+ return true;
+ }
+#endif
+
+ return false;
+}
+
+static long refresh_vm_stats(void *arg)
{
refresh_cpu_vm_stats(true);
+ return 0;
}
int vmstat_refresh(struct ctl_table *table, int write,
void *buffer, size_t *lenp, loff_t *ppos)
{
long val;
- int err;
- int i;
+ int i, cpu;
/*
* The regular update, every sysctl_stat_interval, may come later
@@ -1850,9 +1873,15 @@ int vmstat_refresh(struct ctl_table *tab
* transiently negative values, report an error here if any of
* the stats is negative, so we know to go looking for imbalance.
*/
- err = schedule_on_each_cpu(refresh_vm_stats);
- if (err)
- return err;
+ get_online_cpus();
+ for_each_online_cpu(cpu) {
+ if (need_update(cpu) || need_drain_remote_zones(cpu))
+ work_on_cpu(cpu, refresh_vm_stats, NULL);
+
+ cond_resched();
+ }
+ put_online_cpus();
+
for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
/*
* Skip checking stats known to go negative occasionally.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 1/4] add basic task isolation prctl interface
2021-07-27 10:38 ` [patch 1/4] add basic task isolation prctl interface Marcelo Tosatti
@ 2021-07-27 10:48 ` nsaenzju
2021-07-27 11:00 ` Marcelo Tosatti
0 siblings, 1 reply; 26+ messages in thread
From: nsaenzju @ 2021-07-27 10:48 UTC (permalink / raw)
To: Marcelo Tosatti, linux-kernel
Cc: Nitesh Lal, Frederic Weisbecker, Christoph Lameter, Juri Lelli,
Peter Zijlstra, Alex Belits, Peter Xu
On Tue, 2021-07-27 at 07:38 -0300, Marcelo Tosatti wrote:
> +Isolation mode (PR_ISOL_MODE):
> +------------------------------
> +
> +- PR_ISOL_MODE_NONE (arg4): no per-task isolation (default mode).
> + PR_ISOL_EXIT sets mode to PR_ISOL_MODE_NONE.
> +
> +- PR_ISOL_MODE_NORMAL (arg4): applications can perform system calls normally,
> + and in case of interruption events, the notifications can be collected
> + by BPF programs.
> + In this mode, if system calls are performed, deferred actions initiated
> + by the system call will be executed before return to userspace.
> +
> +Other modes, which for example send signals upon interruptions events,
> +can be implemented.
Shouldn't this be a set of flags that enable specific isolation features?
Something the likes of 'PR_ISOL_QUIESCE_ON_EXIT'. Modes seem more restrictive
and too much of a commitment. If we merge MODE_NORMAL as is, we won't be able
to tweak/extend its behaviour in the future.
--
Nicolás Sáenz
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 1/4] add basic task isolation prctl interface
2021-07-27 10:48 ` nsaenzju
@ 2021-07-27 11:00 ` Marcelo Tosatti
2021-07-27 12:38 ` nsaenzju
0 siblings, 1 reply; 26+ messages in thread
From: Marcelo Tosatti @ 2021-07-27 11:00 UTC (permalink / raw)
To: nsaenzju
Cc: linux-kernel, Nitesh Lal, Frederic Weisbecker, Christoph Lameter,
Juri Lelli, Peter Zijlstra, Alex Belits, Peter Xu
On Tue, Jul 27, 2021 at 12:48:33PM +0200, nsaenzju@redhat.com wrote:
> On Tue, 2021-07-27 at 07:38 -0300, Marcelo Tosatti wrote:
> > +Isolation mode (PR_ISOL_MODE):
> > +------------------------------
> > +
> > +- PR_ISOL_MODE_NONE (arg4): no per-task isolation (default mode).
> > + PR_ISOL_EXIT sets mode to PR_ISOL_MODE_NONE.
> > +
> > +- PR_ISOL_MODE_NORMAL (arg4): applications can perform system calls normally,
> > + and in case of interruption events, the notifications can be collected
> > + by BPF programs.
> > + In this mode, if system calls are performed, deferred actions initiated
> > + by the system call will be executed before return to userspace.
> > +
> > +Other modes, which for example send signals upon interruptions events,
> > +can be implemented.
>
> Shouldn't this be a set of flags that enable specific isolation features?
> Something the likes of 'PR_ISOL_QUIESCE_ON_EXIT'. Modes seem more restrictive
> and too much of a commitment. If we merge MODE_NORMAL as is, we won't be able
> to tweak/extend its behaviour in the future.
Hi Nicolas,
Well, its assuming PR_ISOL_MODE_NORMAL means "enable all isolation
features on return to userspace".
Later on, if desired, can add extend interface as follows (using
Christoph's idea to not perform automatic quiesce on return to
userspace, but expose which parts need quiescing
so userspace can do it on its own, as an example):
#define PR_ISOL_QUIESCE_ON_EXIT (1<<0)
#define PR_ISOL_VSYSCALL_PAGE (1<<1)
...
unsigned long bitmap = PR_ISOL_VSYSCALL_PAGE;
/* allow system calls */
prctl(PR_ISOL_SET, PR_ISOL_MODE, PR_ISOL_MODE_NORMAL, 0, 0, 0);
/*
* disable quiescing on exit, enable reporting through
* vsyscall page
*/
prctl(PR_ISOL_SET, PR_ISOL_FEATURES, &bitmap, 0, 0);
/*
* configure vsyscall page
*/
prctl(PR_ISOL_VSYSCALLS, params, ...);
So unless i am missing something, it is possible to tweak/extend the
interface. No?
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 1/4] add basic task isolation prctl interface
2021-07-27 11:00 ` Marcelo Tosatti
@ 2021-07-27 12:38 ` nsaenzju
2021-07-27 13:06 ` Marcelo Tosatti
2021-07-27 13:09 ` Frederic Weisbecker
0 siblings, 2 replies; 26+ messages in thread
From: nsaenzju @ 2021-07-27 12:38 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: linux-kernel, Nitesh Lal, Frederic Weisbecker, Christoph Lameter,
Juri Lelli, Peter Zijlstra, Alex Belits, Peter Xu
Hi Marcelo,
On Tue, 2021-07-27 at 08:00 -0300, Marcelo Tosatti wrote:
> On Tue, Jul 27, 2021 at 12:48:33PM +0200, nsaenzju@redhat.com wrote:
> > On Tue, 2021-07-27 at 07:38 -0300, Marcelo Tosatti wrote:
> > > +Isolation mode (PR_ISOL_MODE):
> > > +------------------------------
> > > +
> > > +- PR_ISOL_MODE_NONE (arg4): no per-task isolation (default mode).
> > > + PR_ISOL_EXIT sets mode to PR_ISOL_MODE_NONE.
> > > +
> > > +- PR_ISOL_MODE_NORMAL (arg4): applications can perform system calls normally,
> > > + and in case of interruption events, the notifications can be collected
> > > + by BPF programs.
> > > + In this mode, if system calls are performed, deferred actions initiated
> > > + by the system call will be executed before return to userspace.
> > > +
> > > +Other modes, which for example send signals upon interruptions events,
> > > +can be implemented.
> >
> > Shouldn't this be a set of flags that enable specific isolation features?
> > Something the likes of 'PR_ISOL_QUIESCE_ON_EXIT'. Modes seem more restrictive
> > and too much of a commitment. If we merge MODE_NORMAL as is, we won't be able
> > to tweak/extend its behaviour in the future.
>
> Hi Nicolas,
>
> Well, its assuming PR_ISOL_MODE_NORMAL means "enable all isolation
> features on return to userspace".
>
> Later on, if desired, can add extend interface as follows (using
> Christoph's idea to not perform automatic quiesce on return to
> userspace, but expose which parts need quiescing
> so userspace can do it on its own, as an example):
>
> #define PR_ISOL_QUIESCE_ON_EXIT (1<<0)
> #define PR_ISOL_VSYSCALL_PAGE (1<<1)
> ...
>
> unsigned long bitmap = PR_ISOL_VSYSCALL_PAGE;
>
> /* allow system calls */
> prctl(PR_ISOL_SET, PR_ISOL_MODE, PR_ISOL_MODE_NORMAL, 0, 0, 0);
>
> /*
> * disable quiescing on exit, enable reporting through
> * vsyscall page
> */
> prctl(PR_ISOL_SET, PR_ISOL_FEATURES, &bitmap, 0, 0);
> /*
> * configure vsyscall page
> */
> prctl(PR_ISOL_VSYSCALLS, params, ...);
>
> So unless i am missing something, it is possible to tweak/extend the
> interface. No?
OK, sorry if I'm being thick, but what is the benefit of having a distincnt
PR_ISOL_MODE instead expressing everything as PR_ISOL_FEATURES.
PR_ISOL_MODE_NONE == Empty PR_ISOL_FEATURES bitmap
PR_ISOL_MODE_NORMAL == Bitmap of commonly used PR_ISOL_FEATURES
(we could introduce a define)
PR_ISOL_MODE_NORMAL+PR_ISOL_VSYSCALLS == Custom bitmap
Other than that, my rationale is that if you extend PR_ISOL_MODE_NORMAL's
behaviour as new features are merged, wouldn't you be potentially breaking
userspace (i.e. older applications might not like the new default)?
--
Nicolás Sáenz
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 1/4] add basic task isolation prctl interface
2021-07-27 12:38 ` nsaenzju
@ 2021-07-27 13:06 ` Marcelo Tosatti
2021-07-27 13:08 ` Marcelo Tosatti
2021-07-27 13:09 ` Frederic Weisbecker
1 sibling, 1 reply; 26+ messages in thread
From: Marcelo Tosatti @ 2021-07-27 13:06 UTC (permalink / raw)
To: nsaenzju
Cc: linux-kernel, Nitesh Lal, Frederic Weisbecker, Christoph Lameter,
Juri Lelli, Peter Zijlstra, Alex Belits, Peter Xu
On Tue, Jul 27, 2021 at 02:38:15PM +0200, nsaenzju@redhat.com wrote:
> Hi Marcelo,
>
> On Tue, 2021-07-27 at 08:00 -0300, Marcelo Tosatti wrote:
> > On Tue, Jul 27, 2021 at 12:48:33PM +0200, nsaenzju@redhat.com wrote:
> > > On Tue, 2021-07-27 at 07:38 -0300, Marcelo Tosatti wrote:
> > > > +Isolation mode (PR_ISOL_MODE):
> > > > +------------------------------
> > > > +
> > > > +- PR_ISOL_MODE_NONE (arg4): no per-task isolation (default mode).
> > > > + PR_ISOL_EXIT sets mode to PR_ISOL_MODE_NONE.
> > > > +
> > > > +- PR_ISOL_MODE_NORMAL (arg4): applications can perform system calls normally,
> > > > + and in case of interruption events, the notifications can be collected
> > > > + by BPF programs.
> > > > + In this mode, if system calls are performed, deferred actions initiated
> > > > + by the system call will be executed before return to userspace.
> > > > +
> > > > +Other modes, which for example send signals upon interruptions events,
> > > > +can be implemented.
> > >
> > > Shouldn't this be a set of flags that enable specific isolation features?
> > > Something the likes of 'PR_ISOL_QUIESCE_ON_EXIT'. Modes seem more restrictive
> > > and too much of a commitment. If we merge MODE_NORMAL as is, we won't be able
> > > to tweak/extend its behaviour in the future.
> >
> > Hi Nicolas,
> >
> > Well, its assuming PR_ISOL_MODE_NORMAL means "enable all isolation
> > features on return to userspace".
> >
> > Later on, if desired, can add extend interface as follows (using
> > Christoph's idea to not perform automatic quiesce on return to
> > userspace, but expose which parts need quiescing
> > so userspace can do it on its own, as an example):
> >
> > #define PR_ISOL_QUIESCE_ON_EXIT (1<<0)
> > #define PR_ISOL_VSYSCALL_PAGE (1<<1)
> > ...
> >
> > unsigned long bitmap = PR_ISOL_VSYSCALL_PAGE;
> >
> > /* allow system calls */
> > prctl(PR_ISOL_SET, PR_ISOL_MODE, PR_ISOL_MODE_NORMAL, 0, 0, 0);
> >
> > /*
> > * disable quiescing on exit, enable reporting through
> > * vsyscall page
> > */
> > prctl(PR_ISOL_SET, PR_ISOL_FEATURES, &bitmap, 0, 0);
> > /*
> > * configure vsyscall page
> > */
> > prctl(PR_ISOL_VSYSCALLS, params, ...);
> >
> > So unless i am missing something, it is possible to tweak/extend the
> > interface. No?
>
> OK, sorry if I'm being thick, but what is the benefit of having a distincnt
> PR_ISOL_MODE instead expressing everything as PR_ISOL_FEATURES.
>
> PR_ISOL_MODE_NONE == Empty PR_ISOL_FEATURES bitmap
>
> PR_ISOL_MODE_NORMAL == Bitmap of commonly used PR_ISOL_FEATURES
> (we could introduce a define)
>
> PR_ISOL_MODE_NORMAL+PR_ISOL_VSYSCALLS == Custom bitmap
>
> Other than that, my rationale is that if you extend PR_ISOL_MODE_NORMAL's
> behaviour as new features are merged, wouldn't you be potentially breaking
> userspace (i.e. older applications might not like the new default)?
>
> --
> Nicolás Sáenz
The main reason is that PR_ISOL_MODE would allow for distinct
modes to be implemented (matching each use case). For example:
https://lwn.net/Articles/816298/
"When a task has finished its initialization, it can activate isolation
by using the PR_TASK_ISOLATION operation provided by the prctl()
system call. This operation may fail for either permanent or temporary
reasons. An example of a permanent error is when the task is set up
on a CPU without isolation; in this case, entering isolation mode
is not possible. Temporary errors are indicated by the EAGAIN error
code; examples include a time when the delayed workqueues could not be
stopped. In such cases, the task may retry the operation if it wants to
enter isolation, as it may succeed the next time.
In the prctl() call, the developer may also configure the signal to be
sent to the task when it loses isolation. The additional macro to use is
PR_TASK_ISOLATION_SET_SIG(), passing it the signal to send. The command
then becomes similar to the one in the example code:"
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 1/4] add basic task isolation prctl interface
2021-07-27 13:06 ` Marcelo Tosatti
@ 2021-07-27 13:08 ` Marcelo Tosatti
0 siblings, 0 replies; 26+ messages in thread
From: Marcelo Tosatti @ 2021-07-27 13:08 UTC (permalink / raw)
To: nsaenzju
Cc: linux-kernel, Nitesh Lal, Frederic Weisbecker, Christoph Lameter,
Juri Lelli, Peter Zijlstra, Alex Belits, Peter Xu
On Tue, Jul 27, 2021 at 10:06:41AM -0300, Marcelo Tosatti wrote:
> On Tue, Jul 27, 2021 at 02:38:15PM +0200, nsaenzju@redhat.com wrote:
> > Hi Marcelo,
> >
> > On Tue, 2021-07-27 at 08:00 -0300, Marcelo Tosatti wrote:
> > > On Tue, Jul 27, 2021 at 12:48:33PM +0200, nsaenzju@redhat.com wrote:
> > > > On Tue, 2021-07-27 at 07:38 -0300, Marcelo Tosatti wrote:
> > > > > +Isolation mode (PR_ISOL_MODE):
> > > > > +------------------------------
> > > > > +
> > > > > +- PR_ISOL_MODE_NONE (arg4): no per-task isolation (default mode).
> > > > > + PR_ISOL_EXIT sets mode to PR_ISOL_MODE_NONE.
> > > > > +
> > > > > +- PR_ISOL_MODE_NORMAL (arg4): applications can perform system calls normally,
> > > > > + and in case of interruption events, the notifications can be collected
> > > > > + by BPF programs.
> > > > > + In this mode, if system calls are performed, deferred actions initiated
> > > > > + by the system call will be executed before return to userspace.
> > > > > +
> > > > > +Other modes, which for example send signals upon interruptions events,
> > > > > +can be implemented.
> > > >
> > > > Shouldn't this be a set of flags that enable specific isolation features?
> > > > Something the likes of 'PR_ISOL_QUIESCE_ON_EXIT'. Modes seem more restrictive
> > > > and too much of a commitment. If we merge MODE_NORMAL as is, we won't be able
> > > > to tweak/extend its behaviour in the future.
> > >
> > > Hi Nicolas,
> > >
> > > Well, its assuming PR_ISOL_MODE_NORMAL means "enable all isolation
> > > features on return to userspace".
> > >
> > > Later on, if desired, can add extend interface as follows (using
> > > Christoph's idea to not perform automatic quiesce on return to
> > > userspace, but expose which parts need quiescing
> > > so userspace can do it on its own, as an example):
> > >
> > > #define PR_ISOL_QUIESCE_ON_EXIT (1<<0)
> > > #define PR_ISOL_VSYSCALL_PAGE (1<<1)
> > > ...
> > >
> > > unsigned long bitmap = PR_ISOL_VSYSCALL_PAGE;
> > >
> > > /* allow system calls */
> > > prctl(PR_ISOL_SET, PR_ISOL_MODE, PR_ISOL_MODE_NORMAL, 0, 0, 0);
> > >
> > > /*
> > > * disable quiescing on exit, enable reporting through
> > > * vsyscall page
> > > */
> > > prctl(PR_ISOL_SET, PR_ISOL_FEATURES, &bitmap, 0, 0);
> > > /*
> > > * configure vsyscall page
> > > */
> > > prctl(PR_ISOL_VSYSCALLS, params, ...);
> > >
> > > So unless i am missing something, it is possible to tweak/extend the
> > > interface. No?
> >
> > OK, sorry if I'm being thick, but what is the benefit of having a distincnt
> > PR_ISOL_MODE instead expressing everything as PR_ISOL_FEATURES.
> >
> > PR_ISOL_MODE_NONE == Empty PR_ISOL_FEATURES bitmap
> >
> > PR_ISOL_MODE_NORMAL == Bitmap of commonly used PR_ISOL_FEATURES
> > (we could introduce a define)
> >
> > PR_ISOL_MODE_NORMAL+PR_ISOL_VSYSCALLS == Custom bitmap
> >
> > Other than that, my rationale is that if you extend PR_ISOL_MODE_NORMAL's
> > behaviour as new features are merged, wouldn't you be potentially breaking
> > userspace (i.e. older applications might not like the new default)?
> >
> > --
> > Nicolás Sáenz
>
> The main reason is that PR_ISOL_MODE would allow for distinct
> modes to be implemented (matching each use case). For example:
>
> https://lwn.net/Articles/816298/
>
> "When a task has finished its initialization, it can activate isolation
> by using the PR_TASK_ISOLATION operation provided by the prctl()
> system call. This operation may fail for either permanent or temporary
> reasons. An example of a permanent error is when the task is set up
> on a CPU without isolation; in this case, entering isolation mode
> is not possible. Temporary errors are indicated by the EAGAIN error
> code; examples include a time when the delayed workqueues could not be
> stopped. In such cases, the task may retry the operation if it wants to
> enter isolation, as it may succeed the next time.
>
> In the prctl() call, the developer may also configure the signal to be
> sent to the task when it loses isolation. The additional macro to use is
> PR_TASK_ISOLATION_SET_SIG(), passing it the signal to send. The command
> then becomes similar to the one in the example code:"
But have no strong preference: fine with PR_ISOL_FEATURES as you
describe above, and if that is the consensus, can resubmit.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 1/4] add basic task isolation prctl interface
2021-07-27 12:38 ` nsaenzju
2021-07-27 13:06 ` Marcelo Tosatti
@ 2021-07-27 13:09 ` Frederic Weisbecker
2021-07-27 14:52 ` Marcelo Tosatti
1 sibling, 1 reply; 26+ messages in thread
From: Frederic Weisbecker @ 2021-07-27 13:09 UTC (permalink / raw)
To: nsaenzju, Marcelo Tosatti
Cc: linux-kernel, Nitesh Lal, Christoph Lameter, Juri Lelli,
Peter Zijlstra, Alex Belits, Peter Xu, Thomas Gleixner
On Tue, Jul 27, 2021 at 02:38:15PM +0200, nsaenzju@redhat.com wrote:
> Hi Marcelo,
>
> On Tue, 2021-07-27 at 08:00 -0300, Marcelo Tosatti wrote:
> OK, sorry if I'm being thick, but what is the benefit of having a distincnt
> PR_ISOL_MODE instead expressing everything as PR_ISOL_FEATURES.
>
> PR_ISOL_MODE_NONE == Empty PR_ISOL_FEATURES bitmap
>
> PR_ISOL_MODE_NORMAL == Bitmap of commonly used PR_ISOL_FEATURES
> (we could introduce a define)
>
> PR_ISOL_MODE_NORMAL+PR_ISOL_VSYSCALLS == Custom bitmap
>
> Other than that, my rationale is that if you extend PR_ISOL_MODE_NORMAL's
> behaviour as new features are merged, wouldn't you be potentially breaking
> userspace (i.e. older applications might not like the new default)?
I agree with Nicolas, and that was Thomas request too.
Let's leave policy implementation to userspace and take
only the individual isolation features to the kernel.
CPU/Task isolation is a relatively young feature and many users don't
communicate much about their needs. We don't know exactly how finegrained
the ABI will need to be so let's not make too many high level assumptions.
It's easy for userspace to set all isolation bits by itself.
Besides, those bits will be implemented one by one over time, this
means that a prctl() bit saying "isolate everything" will have a different
behaviour as those features get integrated. And we really want well defined
behaviours.
Thanks.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 1/4] add basic task isolation prctl interface
2021-07-27 13:09 ` Frederic Weisbecker
@ 2021-07-27 14:52 ` Marcelo Tosatti
2021-07-27 23:45 ` Frederic Weisbecker
0 siblings, 1 reply; 26+ messages in thread
From: Marcelo Tosatti @ 2021-07-27 14:52 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: nsaenzju, linux-kernel, Nitesh Lal, Christoph Lameter,
Juri Lelli, Peter Zijlstra, Alex Belits, Peter Xu,
Thomas Gleixner
On Tue, Jul 27, 2021 at 03:09:30PM +0200, Frederic Weisbecker wrote:
> On Tue, Jul 27, 2021 at 02:38:15PM +0200, nsaenzju@redhat.com wrote:
> > Hi Marcelo,
> >
> > On Tue, 2021-07-27 at 08:00 -0300, Marcelo Tosatti wrote:
> > OK, sorry if I'm being thick, but what is the benefit of having a distincnt
> > PR_ISOL_MODE instead expressing everything as PR_ISOL_FEATURES.
> >
> > PR_ISOL_MODE_NONE == Empty PR_ISOL_FEATURES bitmap
> >
> > PR_ISOL_MODE_NORMAL == Bitmap of commonly used PR_ISOL_FEATURES
> > (we could introduce a define)
> >
> > PR_ISOL_MODE_NORMAL+PR_ISOL_VSYSCALLS == Custom bitmap
> >
> > Other than that, my rationale is that if you extend PR_ISOL_MODE_NORMAL's
> > behaviour as new features are merged, wouldn't you be potentially breaking
> > userspace (i.e. older applications might not like the new default)?
>
> I agree with Nicolas, and that was Thomas request too.
> Let's leave policy implementation to userspace and take
> only the individual isolation features to the kernel.
>
> CPU/Task isolation is a relatively young feature and many users don't
> communicate much about their needs. We don't know exactly how finegrained
> the ABI will need to be so let's not make too many high level assumptions.
>
> It's easy for userspace to set all isolation bits by itself.
>
> Besides, those bits will be implemented one by one over time, this
> means that a prctl() bit saying "isolate everything" will have a different
> behaviour as those features get integrated. And we really want well defined
> behaviours.
>
> Thanks.
>
>
OK, how about this:
...
The meaning of isolated is specified as follows:
Isolation features
==================
- prctl(PR_ISOL_GET, ISOL_SUP_FEATURES, 0, 0, 0) returns the supported
features as a return value.
- prctl(PR_ISOL_SET, ISOL_FEATURES, bitmask, 0, 0) enables the features in
the bitmask.
- prctl(PR_ISOL_GET, ISOL_FEATURES, 0, 0, 0) returns the currently
enabled features.
The supported features are:
ISOL_F_QUIESCE_ON_URET: quiesce deferred actions on return to userspace.
----------------------
Quiescing of different actions can be performed on return to userspace.
- prctl(PR_ISOL_GET, PR_ISOL_SUP_QUIESCE_CFG, 0, 0, 0) returns
the supported actions to be quiesced.
- prctl(PR_ISOL_SET, PR_ISOL_QUIESCE_CFG, quiesce_bitmask, 0, 0) returns
the currently supported actions to be quiesced.
- prctl(PR_ISOL_GET, PR_ISOL_QUIESCE_CFG, 0, 0, 0) returns
the currently enabled actions to be quiesced.
#define ISOL_F_QUIESCE_VMSTAT_SYNC (1<<0)
#define ISOL_F_QUIESCE_NOHZ_FULL (1<<1)
#define ISOL_F_QUIESCE_DEFER_TLB_FLUSH (1<<2)
...
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 1/4] add basic task isolation prctl interface
2021-07-27 14:52 ` Marcelo Tosatti
@ 2021-07-27 23:45 ` Frederic Weisbecker
2021-07-28 9:37 ` Marcelo Tosatti
0 siblings, 1 reply; 26+ messages in thread
From: Frederic Weisbecker @ 2021-07-27 23:45 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: nsaenzju, linux-kernel, Nitesh Lal, Christoph Lameter,
Juri Lelli, Peter Zijlstra, Alex Belits, Peter Xu,
Thomas Gleixner
On Tue, Jul 27, 2021 at 11:52:09AM -0300, Marcelo Tosatti wrote:
> The meaning of isolated is specified as follows:
>
> Isolation features
> ==================
>
> - prctl(PR_ISOL_GET, ISOL_SUP_FEATURES, 0, 0, 0) returns the supported
> features as a return value.
>
> - prctl(PR_ISOL_SET, ISOL_FEATURES, bitmask, 0, 0) enables the features in
> the bitmask.
>
> - prctl(PR_ISOL_GET, ISOL_FEATURES, 0, 0, 0) returns the currently
> enabled features.
So what are the ISOL_FEATURES here? A mode that we enter such as flush
vmstat _everytime_ we resume to userpace after (and including) this prctl() ?
If so I'd rather call that ISOL_MODE because feature is too general.
>
> The supported features are:
>
> ISOL_F_QUIESCE_ON_URET: quiesce deferred actions on return to userspace.
> ----------------------
>
> Quiescing of different actions can be performed on return to userspace.
>
> - prctl(PR_ISOL_GET, PR_ISOL_SUP_QUIESCE_CFG, 0, 0, 0) returns
> the supported actions to be quiesced.
>
> - prctl(PR_ISOL_SET, PR_ISOL_QUIESCE_CFG, quiesce_bitmask, 0, 0) returns
> the currently supported actions to be quiesced.
>
> - prctl(PR_ISOL_GET, PR_ISOL_QUIESCE_CFG, 0, 0, 0) returns
> the currently enabled actions to be quiesced.
>
> #define ISOL_F_QUIESCE_VMSTAT_SYNC (1<<0)
> #define ISOL_F_QUIESCE_NOHZ_FULL (1<<1)
> #define ISOL_F_QUIESCE_DEFER_TLB_FLUSH (1<<2)
And then PR_ISOL_QUIESCE_CFG is a oneshot operation that applies only upon
return to this ctrl, right? If so perhaps this should be called just
ISOL_QUIESCE or ISOL_QUIESCE_ONCE or ISOL_REQ ?
But that's just naming debate because otherwise that prctl layout looks good
to me.
Thanks!
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 1/4] add basic task isolation prctl interface
2021-07-27 23:45 ` Frederic Weisbecker
@ 2021-07-28 9:37 ` Marcelo Tosatti
2021-07-28 11:45 ` Frederic Weisbecker
` (2 more replies)
0 siblings, 3 replies; 26+ messages in thread
From: Marcelo Tosatti @ 2021-07-28 9:37 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: nsaenzju, linux-kernel, Nitesh Lal, Christoph Lameter,
Juri Lelli, Peter Zijlstra, Alex Belits, Peter Xu,
Thomas Gleixner
On Wed, Jul 28, 2021 at 01:45:39AM +0200, Frederic Weisbecker wrote:
> On Tue, Jul 27, 2021 at 11:52:09AM -0300, Marcelo Tosatti wrote:
> > The meaning of isolated is specified as follows:
> >
> > Isolation features
> > ==================
> >
> > - prctl(PR_ISOL_GET, ISOL_SUP_FEATURES, 0, 0, 0) returns the supported
> > features as a return value.
> >
> > - prctl(PR_ISOL_SET, ISOL_FEATURES, bitmask, 0, 0) enables the features in
> > the bitmask.
> >
> > - prctl(PR_ISOL_GET, ISOL_FEATURES, 0, 0, 0) returns the currently
> > enabled features.
>
> So what are the ISOL_FEATURES here? A mode that we enter such as flush
> vmstat _everytime_ we resume to userpace after (and including) this prctl() ?
ISOL_FEATURES is just the "command" type (which you can get and set).
The bitmask would include ISOL_F_QUIESCE_ON_URET, so:
- bitmask = ISOL_F_QUIESCE_ON_URET;
- prctl(PR_ISOL_SET, ISOL_FEATURES, bitmask, 0, 0) enables the features in
the bitmask.
- quiesce_bitmap = prctl(PR_ISOL_GET, PR_ISOL_SUP_QUIESCE_CFG, 0, 0, 0)
(1)
(returns the supported actions to be quiesced).
- prctl(PR_ISOL_SET, PR_ISOL_QUIESCE_CFG, quiesce_bitmask, 0, 0) _sets_
the actions to be quiesced (2)
If an application does not modify "quiesce_bitmask" between
points (1) and (2) above, it will enable quiescing of all
"features" the kernel supports.
Application can, however, modify quiesce_bitmap to its preference.
Flushing vmstat _everytime_ you resume to userspace is enabled only
_after_ prctl(PR_ISOL_ENTER, 0, 0, 0, 0) is performed (which happens
only when isolation is fully configured with the PR_ISOL_SET calls).
OK, will better document that.
> If so I'd rather call that ISOL_MODE because feature is too general.
Well, in the first patchset, there was one "mode" implemented (but
it was possible to implement different modes in the future).
This would allow for example easier integration of "full task isolation"
patchset type of functionality, disallowing syscalls.
I think we'd like to keep that, so i'll keep the previous distinct modes
(but allow configuration of individual features on the bitmap).
> >
> > The supported features are:
> >
> > ISOL_F_QUIESCE_ON_URET: quiesce deferred actions on return to userspace.
> > ----------------------
> >
> > Quiescing of different actions can be performed on return to userspace.
> >
> > - prctl(PR_ISOL_GET, PR_ISOL_SUP_QUIESCE_CFG, 0, 0, 0) returns
> > the supported actions to be quiesced.
> >
> > - prctl(PR_ISOL_SET, PR_ISOL_QUIESCE_CFG, quiesce_bitmask, 0, 0) returns
s/returns/sets/
> > the currently supported actions to be quiesced.
> >
> > - prctl(PR_ISOL_GET, PR_ISOL_QUIESCE_CFG, 0, 0, 0) returns
> > the currently enabled actions to be quiesced.
> >
> > #define ISOL_F_QUIESCE_VMSTAT_SYNC (1<<0)
> > #define ISOL_F_QUIESCE_NOHZ_FULL (1<<1)
> > #define ISOL_F_QUIESCE_DEFER_TLB_FLUSH (1<<2)
>
> And then PR_ISOL_QUIESCE_CFG is a oneshot operation that applies only upon
> return to this ctrl, right? If so perhaps this should be called just
> ISOL_QUIESCE or ISOL_QUIESCE_ONCE or ISOL_REQ ?
There was no one-shot operation implemented in the first patchset. What
application would do to achieve that is:
1. Configure isolation with PR_ISOL_SET (say configure mode which
allows system calls, and when a system call happens, flush all deferred
actions on return to userspace).
2. prctl(PR_ISOL_ENTER, 0, 0, 0, 0) (this actually enables the flushing,
and tags the task_struct as isolated). Here we can transfer this information
from per-task to per-CPU data, for example, to be able to implement
other features such as deferred TLB flushing.
On return from this prctl(), deferrable actions are flushed.
3. latency sensitive loop, with no system calls.
4. some event which requires system calls is noticed:
prctl(PR_ISOL_EXIT, 0, 0, 0, 0)
(this would untag task_struct as isolated).
5. perform system calls A, B, C, D (with no flushing of vmstat,
for example).
6. jmp to 2.
So there is a problem with this logic, which is that one would like
certain isolation functionality to remain enabled between points 4
and 6 (for example, blocking CPU hotplug or other blockable activities
that would cause interruptions).
One way to achieve this would be to replace PR_ISOL_ENTER/PR_ISOL_EXIT
with PR_ISOL_ENABLE, which accepts a bitmask:
1. Configure isolation with PR_ISOL_SET (say configure mode which
allows system calls, and when a system call happens, flush all deferred
actions on return to userspace).
2. enabled_bitmask = ISOL_F_QUIESCE_ON_URET|ISOL_F_BLOCK_INTERRUPTORS;
prctl(PR_ISOL_ENABLE, enabled_bitmask, 0, 0, 0)
On return from this prctl(), deferrable actions are flushed.
3. latency sensitive loop, with no system calls.
4. some event which requires system calls is noticed:
prctl(PR_ISOL_ENABLE, ISOL_F_BLOCK_INTERRUPTORS, 0, 0, 0)
(this would clear ISOL_F_QUIESCE_ON_URET, so no flushing
is performed on return from system calls).
5. perform system calls A, B, C, D (with no flushing of vmstat).
6. jmp to 2.
...
On exit: prctl(PR_ISOL_ENABLE, 0, 0, 0, 0)
IOW: the one-shot operation does not allow the application
to inform the kernel when the latency sensitive loop has
begun or has ended.
>
> But that's just naming debate because otherwise that prctl layout looks good
> to me.
>
> Thanks!
Thank you for the input!
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 1/4] add basic task isolation prctl interface
2021-07-28 9:37 ` Marcelo Tosatti
@ 2021-07-28 11:45 ` Frederic Weisbecker
2021-07-28 13:21 ` Marcelo Tosatti
2021-07-28 11:55 ` nsaenzju
[not found] ` <CAFki+LmHeXmSFze8YEHFNbYA5hLEtnZyk37Yjf-eyOuKa8Os4w@mail.gmail.com>
2 siblings, 1 reply; 26+ messages in thread
From: Frederic Weisbecker @ 2021-07-28 11:45 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: nsaenzju, linux-kernel, Nitesh Lal, Christoph Lameter,
Juri Lelli, Peter Zijlstra, Alex Belits, Peter Xu,
Thomas Gleixner
On Wed, Jul 28, 2021 at 06:37:07AM -0300, Marcelo Tosatti wrote:
> On Wed, Jul 28, 2021 at 01:45:39AM +0200, Frederic Weisbecker wrote:
> > On Tue, Jul 27, 2021 at 11:52:09AM -0300, Marcelo Tosatti wrote:
> > > The meaning of isolated is specified as follows:
> > >
> > > Isolation features
> > > ==================
> > >
> > > - prctl(PR_ISOL_GET, ISOL_SUP_FEATURES, 0, 0, 0) returns the supported
> > > features as a return value.
> > >
> > > - prctl(PR_ISOL_SET, ISOL_FEATURES, bitmask, 0, 0) enables the features in
> > > the bitmask.
> > >
> > > - prctl(PR_ISOL_GET, ISOL_FEATURES, 0, 0, 0) returns the currently
> > > enabled features.
> >
> > So what are the ISOL_FEATURES here? A mode that we enter such as flush
> > vmstat _everytime_ we resume to userpace after (and including) this prctl() ?
>
> ISOL_FEATURES is just the "command" type (which you can get and set).
>
> The bitmask would include ISOL_F_QUIESCE_ON_URET, so:
>
> - bitmask = ISOL_F_QUIESCE_ON_URET;
> - prctl(PR_ISOL_SET, ISOL_FEATURES, bitmask, 0, 0) enables the features in
> the bitmask.
But does it quiesce once or for every further uret?
>
> - quiesce_bitmap = prctl(PR_ISOL_GET, PR_ISOL_SUP_QUIESCE_CFG, 0, 0, 0)
> (1)
>
> (returns the supported actions to be quiesced).
>
> - prctl(PR_ISOL_SET, PR_ISOL_QUIESCE_CFG, quiesce_bitmask, 0, 0) _sets_
> the actions to be quiesced (2)
>
> If an application does not modify "quiesce_bitmask" between
> points (1) and (2) above, it will enable quiescing of all
> "features" the kernel supports.
I don't get the difference between ISOL_FEATURES and PR_ISOL_QUIESCE_CFG.
>
> Application can, however, modify quiesce_bitmap to its preference.
>
> Flushing vmstat _everytime_ you resume to userspace is enabled only
> _after_ prctl(PR_ISOL_ENTER, 0, 0, 0, 0) is performed (which happens
> only when isolation is fully configured with the PR_ISOL_SET calls).
> OK, will better document that.
Yes please, I'm completely confused :o)
>
> > If so I'd rather call that ISOL_MODE because feature is too general.
>
> Well, in the first patchset, there was one "mode" implemented (but
> it was possible to implement different modes in the future).
>
> This would allow for example easier integration of "full task isolation"
> patchset type of functionality, disallowing syscalls.
>
> I think we'd like to keep that, so i'll keep the previous distinct modes
> (but allow configuration of individual features on the bitmap).
And I also don't see how such modes differ from configuration of individual
features on the bitmap.
> > > - prctl(PR_ISOL_GET, PR_ISOL_QUIESCE_CFG, 0, 0, 0) returns
> > > the currently enabled actions to be quiesced.
> > >
> > > #define ISOL_F_QUIESCE_VMSTAT_SYNC (1<<0)
> > > #define ISOL_F_QUIESCE_NOHZ_FULL (1<<1)
> > > #define ISOL_F_QUIESCE_DEFER_TLB_FLUSH (1<<2)
> >
> > And then PR_ISOL_QUIESCE_CFG is a oneshot operation that applies only upon
> > return to this ctrl, right? If so perhaps this should be called just
> > ISOL_QUIESCE or ISOL_QUIESCE_ONCE or ISOL_REQ ?
>
> There was no one-shot operation implemented in the first patchset. What
> application would do to achieve that is:
>
> 1. Configure isolation with PR_ISOL_SET (say configure mode which
> allows system calls, and when a system call happens, flush all deferred
> actions on return to userspace).
>
> 2. prctl(PR_ISOL_ENTER, 0, 0, 0, 0) (this actually enables the flushing,
> and tags the task_struct as isolated). Here we can transfer this information
> from per-task to per-CPU data, for example, to be able to implement
> other features such as deferred TLB flushing.
>
> On return from this prctl(), deferrable actions are flushed.
>
> 3. latency sensitive loop, with no system calls.
>
> 4. some event which requires system calls is noticed:
> prctl(PR_ISOL_EXIT, 0, 0, 0, 0)
> (this would untag task_struct as isolated).
>
> 5. perform system calls A, B, C, D (with no flushing of vmstat,
> for example).
>
> 6. jmp to 2.
>
> So there is a problem with this logic, which is that one would like
> certain isolation functionality to remain enabled between points 4
> and 6 (for example, blocking CPU hotplug or other blockable activities
> that would cause interruptions).
>
> One way to achieve this would be to replace PR_ISOL_ENTER/PR_ISOL_EXIT
> with PR_ISOL_ENABLE, which accepts a bitmask:
>
> 1. Configure isolation with PR_ISOL_SET (say configure mode which
> allows system calls, and when a system call happens, flush all deferred
> actions on return to userspace).
>
> 2. enabled_bitmask = ISOL_F_QUIESCE_ON_URET|ISOL_F_BLOCK_INTERRUPTORS;
> prctl(PR_ISOL_ENABLE, enabled_bitmask, 0, 0, 0)
>
> On return from this prctl(), deferrable actions are flushed.
>
> 3. latency sensitive loop, with no system calls.
>
> 4. some event which requires system calls is noticed:
>
> prctl(PR_ISOL_ENABLE, ISOL_F_BLOCK_INTERRUPTORS, 0, 0, 0)
> (this would clear ISOL_F_QUIESCE_ON_URET, so no flushing
> is performed on return from system calls).
So PR_ISOL_ENABLE is a way to perform action when some sort of kernel entry
happens. Then we take actions when that happens (signal, warn, etc...).
I guess we'll need to define what kind of kernel entry, and what kind of
response need to happen. Ok that's a whole issue of its own that we'll need
to handle seperately.
Thanks.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 1/4] add basic task isolation prctl interface
2021-07-28 9:37 ` Marcelo Tosatti
2021-07-28 11:45 ` Frederic Weisbecker
@ 2021-07-28 11:55 ` nsaenzju
2021-07-28 13:16 ` Marcelo Tosatti
[not found] ` <CAFki+LmHeXmSFze8YEHFNbYA5hLEtnZyk37Yjf-eyOuKa8Os4w@mail.gmail.com>
2 siblings, 1 reply; 26+ messages in thread
From: nsaenzju @ 2021-07-28 11:55 UTC (permalink / raw)
To: Marcelo Tosatti, Frederic Weisbecker
Cc: linux-kernel, Nitesh Lal, Christoph Lameter, Juri Lelli,
Peter Zijlstra, Alex Belits, Peter Xu, Thomas Gleixner
Hi Marcelo,
On Wed, 2021-07-28 at 06:37 -0300, Marcelo Tosatti wrote:
> On Wed, Jul 28, 2021 at 01:45:39AM +0200, Frederic Weisbecker wrote:
> > On Tue, Jul 27, 2021 at 11:52:09AM -0300, Marcelo Tosatti wrote:
> > > The meaning of isolated is specified as follows:
> > >
> > > Isolation features
> > > ==================
> > >
> > > - prctl(PR_ISOL_GET, ISOL_SUP_FEATURES, 0, 0, 0) returns the supported
> > > features as a return value.
> > >
> > > - prctl(PR_ISOL_SET, ISOL_FEATURES, bitmask, 0, 0) enables the features in
> > > the bitmask.
> > >
> > > - prctl(PR_ISOL_GET, ISOL_FEATURES, 0, 0, 0) returns the currently
> > > enabled features.
> >
> > So what are the ISOL_FEATURES here? A mode that we enter such as flush
> > vmstat _everytime_ we resume to userpace after (and including) this prctl() ?
>
> ISOL_FEATURES is just the "command" type (which you can get and set).
>
> The bitmask would include ISOL_F_QUIESCE_ON_URET, so:
>
> - bitmask = ISOL_F_QUIESCE_ON_URET;
> - prctl(PR_ISOL_SET, ISOL_FEATURES, bitmask, 0, 0) enables the features in
> the bitmask.
>
> - quiesce_bitmap = prctl(PR_ISOL_GET, PR_ISOL_SUP_QUIESCE_CFG, 0, 0, 0)
> (1)
>
> (returns the supported actions to be quiesced).
>
> - prctl(PR_ISOL_SET, PR_ISOL_QUIESCE_CFG, quiesce_bitmask, 0, 0) _sets_
> the actions to be quiesced (2)
>
> If an application does not modify "quiesce_bitmask" between
> points (1) and (2) above, it will enable quiescing of all
> "features" the kernel supports.
I think this pattern of enabling all by default might be prone to subtly
breaking things.
For example, let's say we introduce ISOL_F_QUIESCE_DEFER_TLB_FLUSH, this will
defer relatively short IPIs on isolated CPUs in exchange for a longer flush
whenever we enter the kernel (syscall, IRQs, NMI, etc...). A latency sensitive
application might be OK with the former but not with the latter.
--
Nicolás Sáenz
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 1/4] add basic task isolation prctl interface
2021-07-28 11:55 ` nsaenzju
@ 2021-07-28 13:16 ` Marcelo Tosatti
[not found] ` <CAFki+LkQwoqVTKmgnwLQQM8ua-ixbLp8i+jUT6xF15k6X=89mw@mail.gmail.com>
2021-07-28 17:08 ` nsaenzju
0 siblings, 2 replies; 26+ messages in thread
From: Marcelo Tosatti @ 2021-07-28 13:16 UTC (permalink / raw)
To: nsaenzju
Cc: Frederic Weisbecker, linux-kernel, Nitesh Lal, Christoph Lameter,
Juri Lelli, Peter Zijlstra, Alex Belits, Peter Xu,
Thomas Gleixner
On Wed, Jul 28, 2021 at 01:55:33PM +0200, nsaenzju@redhat.com wrote:
> Hi Marcelo,
>
> On Wed, 2021-07-28 at 06:37 -0300, Marcelo Tosatti wrote:
> > On Wed, Jul 28, 2021 at 01:45:39AM +0200, Frederic Weisbecker wrote:
> > > On Tue, Jul 27, 2021 at 11:52:09AM -0300, Marcelo Tosatti wrote:
> > > > The meaning of isolated is specified as follows:
> > > >
> > > > Isolation features
> > > > ==================
> > > >
> > > > - prctl(PR_ISOL_GET, ISOL_SUP_FEATURES, 0, 0, 0) returns the supported
> > > > features as a return value.
> > > >
> > > > - prctl(PR_ISOL_SET, ISOL_FEATURES, bitmask, 0, 0) enables the features in
> > > > the bitmask.
> > > >
> > > > - prctl(PR_ISOL_GET, ISOL_FEATURES, 0, 0, 0) returns the currently
> > > > enabled features.
> > >
> > > So what are the ISOL_FEATURES here? A mode that we enter such as flush
> > > vmstat _everytime_ we resume to userpace after (and including) this prctl() ?
> >
> > ISOL_FEATURES is just the "command" type (which you can get and set).
> >
> > The bitmask would include ISOL_F_QUIESCE_ON_URET, so:
> >
> > - bitmask = ISOL_F_QUIESCE_ON_URET;
> > - prctl(PR_ISOL_SET, ISOL_FEATURES, bitmask, 0, 0) enables the features in
> > the bitmask.
> >
> > - quiesce_bitmap = prctl(PR_ISOL_GET, PR_ISOL_SUP_QUIESCE_CFG, 0, 0, 0)
> > (1)
> >
> > (returns the supported actions to be quiesced).
> >
> > - prctl(PR_ISOL_SET, PR_ISOL_QUIESCE_CFG, quiesce_bitmask, 0, 0) _sets_
> > the actions to be quiesced (2)
> >
> > If an application does not modify "quiesce_bitmask" between
> > points (1) and (2) above, it will enable quiescing of all
> > "features" the kernel supports.
>
> I think this pattern of enabling all by default might be prone to subtly
> breaking things.
The reasoning behind this pattern is that many latency sensitive applications
(as far as i am aware) prefer "as few interruptions as possible, no
interruptions is preferred".
In that case, the pattern makes sense.
> For example, let's say we introduce ISOL_F_QUIESCE_DEFER_TLB_FLUSH, this will
> defer relatively short IPIs on isolated CPUs in exchange for a longer flush
> whenever we enter the kernel (syscall, IRQs, NMI, etc...).
Why the flush has to be longer when you enter the kernel?
ISOL_F_QUIESCE_DEFER_TLB_FLUSH might collapse multiple IPIs
into a single IPI, so the behaviour might be beneficial
for "standard" types of application as well.
> A latency sensitive
> application might be OK with the former but not with the latter.
Two alternatives:
1) The pattern above, where particular subsystems that might interrupt
the kernel are enabled automatically if the kernel supports it.
Pros:
Applications which implement this only need to be changed once,
and can benefit from new kernel features.
Applications can disable particular features if they turn
out to be problematic.
Cons:
New features might break applications.
2) Force applications to enable each new feature individually.
Pros: Won't cause regressions, kernel behaviour is explicitly
controlled by userspace.
Cons: Apps won't benefit from new features automatically.
---
It seems to me 1) is preferred. Can also add a sysfs control to
have a "default_isolation_feature" flag, which can be changed
by a sysadmin in case a new feature is undesired.
Thoughts?
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 1/4] add basic task isolation prctl interface
2021-07-28 11:45 ` Frederic Weisbecker
@ 2021-07-28 13:21 ` Marcelo Tosatti
2021-07-28 21:22 ` Frederic Weisbecker
0 siblings, 1 reply; 26+ messages in thread
From: Marcelo Tosatti @ 2021-07-28 13:21 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: nsaenzju, linux-kernel, Nitesh Lal, Christoph Lameter,
Juri Lelli, Peter Zijlstra, Alex Belits, Peter Xu,
Thomas Gleixner
On Wed, Jul 28, 2021 at 01:45:48PM +0200, Frederic Weisbecker wrote:
> On Wed, Jul 28, 2021 at 06:37:07AM -0300, Marcelo Tosatti wrote:
> > On Wed, Jul 28, 2021 at 01:45:39AM +0200, Frederic Weisbecker wrote:
> > > On Tue, Jul 27, 2021 at 11:52:09AM -0300, Marcelo Tosatti wrote:
> > > > The meaning of isolated is specified as follows:
> > > >
> > > > Isolation features
> > > > ==================
> > > >
> > > > - prctl(PR_ISOL_GET, ISOL_SUP_FEATURES, 0, 0, 0) returns the supported
> > > > features as a return value.
> > > >
> > > > - prctl(PR_ISOL_SET, ISOL_FEATURES, bitmask, 0, 0) enables the features in
> > > > the bitmask.
> > > >
> > > > - prctl(PR_ISOL_GET, ISOL_FEATURES, 0, 0, 0) returns the currently
> > > > enabled features.
> > >
> > > So what are the ISOL_FEATURES here? A mode that we enter such as flush
> > > vmstat _everytime_ we resume to userpace after (and including) this prctl() ?
> >
> > ISOL_FEATURES is just the "command" type (which you can get and set).
> >
> > The bitmask would include ISOL_F_QUIESCE_ON_URET, so:
> >
> > - bitmask = ISOL_F_QUIESCE_ON_URET;
> > - prctl(PR_ISOL_SET, ISOL_FEATURES, bitmask, 0, 0) enables the features in
> > the bitmask.
>
> But does it quiesce once or for every further uret?
For every uret, while ISOL_F_QUIESCE_ON_URET is enabled through
prctl(PR_ISOL_ENABLE, enabled_bitmask, 0, 0, 0).
> > - quiesce_bitmap = prctl(PR_ISOL_GET, PR_ISOL_SUP_QUIESCE_CFG, 0, 0, 0)
> > (1)
> >
> > (returns the supported actions to be quiesced).
> >
> > - prctl(PR_ISOL_SET, PR_ISOL_QUIESCE_CFG, quiesce_bitmask, 0, 0) _sets_
> > the actions to be quiesced (2)
> >
> > If an application does not modify "quiesce_bitmask" between
> > points (1) and (2) above, it will enable quiescing of all
> > "features" the kernel supports.
>
> I don't get the difference between ISOL_FEATURES and PR_ISOL_QUIESCE_CFG.
prctl(PR_ISOL_SET, cmd, ...) is intented to accept different types of "command"
variables (including ones for new features which are not known at this
time).
- prctl(PR_ISOL_SET, ISOL_FEATURES, bitmask, 0, 0) enables the features in
the bitmask
(which might now be superceded by
prctl(PR_ISOL_ENABLE, ISOL_F_QUIESCE_ON_URET, 0, 0, 0))
- prctl(PR_ISOL_SET, PR_ISOL_QUIESCE_CFG, bitmask, 0, 0) configures
quiescing of which subsystem/feature is performed:
#define ISOL_F_QUIESCE_VMSTAT_SYNC (1<<0)
#define ISOL_F_QUIESCE_NOHZ_FULL (1<<1)
#define ISOL_F_QUIESCE_DEFER_TLB_FLUSH (1<<2)
> > Application can, however, modify quiesce_bitmap to its preference.
> >
> > Flushing vmstat _everytime_ you resume to userspace is enabled only
> > _after_ prctl(PR_ISOL_ENTER, 0, 0, 0, 0) is performed (which happens
> > only when isolation is fully configured with the PR_ISOL_SET calls).
> > OK, will better document that.
>
> Yes please, I'm completely confused :o)
OK.
> > > If so I'd rather call that ISOL_MODE because feature is too general.
> >
> > Well, in the first patchset, there was one "mode" implemented (but
> > it was possible to implement different modes in the future).
> >
> > This would allow for example easier integration of "full task isolation"
> > patchset type of functionality, disallowing syscalls.
> >
> > I think we'd like to keep that, so i'll keep the previous distinct modes
> > (but allow configuration of individual features on the bitmap).
>
> And I also don't see how such modes differ from configuration of individual
> features on the bitmap.
Good point, they do not intersect, syscall disablement and notification of
"isolation breakage" are orthogonal to quiescing.
> > > > - prctl(PR_ISOL_GET, PR_ISOL_QUIESCE_CFG, 0, 0, 0) returns
> > > > the currently enabled actions to be quiesced.
> > > >
> > > > #define ISOL_F_QUIESCE_VMSTAT_SYNC (1<<0)
> > > > #define ISOL_F_QUIESCE_NOHZ_FULL (1<<1)
> > > > #define ISOL_F_QUIESCE_DEFER_TLB_FLUSH (1<<2)
> > >
> > > And then PR_ISOL_QUIESCE_CFG is a oneshot operation that applies only upon
> > > return to this ctrl, right? If so perhaps this should be called just
> > > ISOL_QUIESCE or ISOL_QUIESCE_ONCE or ISOL_REQ ?
> >
> > There was no one-shot operation implemented in the first patchset. What
> > application would do to achieve that is:
> >
> > 1. Configure isolation with PR_ISOL_SET (say configure mode which
> > allows system calls, and when a system call happens, flush all deferred
> > actions on return to userspace).
> >
> > 2. prctl(PR_ISOL_ENTER, 0, 0, 0, 0) (this actually enables the flushing,
> > and tags the task_struct as isolated). Here we can transfer this information
> > from per-task to per-CPU data, for example, to be able to implement
> > other features such as deferred TLB flushing.
> >
> > On return from this prctl(), deferrable actions are flushed.
> >
> > 3. latency sensitive loop, with no system calls.
> >
> > 4. some event which requires system calls is noticed:
> > prctl(PR_ISOL_EXIT, 0, 0, 0, 0)
> > (this would untag task_struct as isolated).
> >
> > 5. perform system calls A, B, C, D (with no flushing of vmstat,
> > for example).
> >
> > 6. jmp to 2.
> >
> > So there is a problem with this logic, which is that one would like
> > certain isolation functionality to remain enabled between points 4
> > and 6 (for example, blocking CPU hotplug or other blockable activities
> > that would cause interruptions).
> >
> > One way to achieve this would be to replace PR_ISOL_ENTER/PR_ISOL_EXIT
> > with PR_ISOL_ENABLE, which accepts a bitmask:
> >
> > 1. Configure isolation with PR_ISOL_SET (say configure mode which
> > allows system calls, and when a system call happens, flush all deferred
> > actions on return to userspace).
> >
> > 2. enabled_bitmask = ISOL_F_QUIESCE_ON_URET|ISOL_F_BLOCK_INTERRUPTORS;
> > prctl(PR_ISOL_ENABLE, enabled_bitmask, 0, 0, 0)
> >
> > On return from this prctl(), deferrable actions are flushed.
> >
> > 3. latency sensitive loop, with no system calls.
> >
> > 4. some event which requires system calls is noticed:
> >
> > prctl(PR_ISOL_ENABLE, ISOL_F_BLOCK_INTERRUPTORS, 0, 0, 0)
> > (this would clear ISOL_F_QUIESCE_ON_URET, so no flushing
> > is performed on return from system calls).
>
> So PR_ISOL_ENABLE is a way to perform action when some sort of kernel entry
> happens. Then we take actions when that happens (signal, warn, etc...).
>
> I guess we'll need to define what kind of kernel entry, and what kind of
> response need to happen. Ok that's a whole issue of its own that we'll need
> to handle seperately.
>
> Thanks.
In fact, why one can't use SECCOMP for syscall blocking?
Thanks.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 1/4] add basic task isolation prctl interface
[not found] ` <CAFki+LmHeXmSFze8YEHFNbYA5hLEtnZyk37Yjf-eyOuKa8Os4w@mail.gmail.com>
@ 2021-07-28 16:17 ` Marcelo Tosatti
0 siblings, 0 replies; 26+ messages in thread
From: Marcelo Tosatti @ 2021-07-28 16:17 UTC (permalink / raw)
To: Nitesh Lal
Cc: Frederic Weisbecker, Nicolas Saenz Julienne, linux-kernel,
Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
Peter Xu, Thomas Gleixner
On Wed, Jul 28, 2021 at 10:48:25AM -0400, Nitesh Lal wrote:
> On Wed, Jul 28, 2021 at 5:56 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> > On Wed, Jul 28, 2021 at 01:45:39AM +0200, Frederic Weisbecker wrote:
> > > On Tue, Jul 27, 2021 at 11:52:09AM -0300, Marcelo Tosatti wrote:
> > > > The meaning of isolated is specified as follows:
> > > >
> > > > Isolation features
> > > > ==================
> > > >
> > > > - prctl(PR_ISOL_GET, ISOL_SUP_FEATURES, 0, 0, 0) returns the supported
> > > > features as a return value.
> > > >
> > > > - prctl(PR_ISOL_SET, ISOL_FEATURES, bitmask, 0, 0) enables the
> > features in
> > > > the bitmask.
> > > >
> > > > - prctl(PR_ISOL_GET, ISOL_FEATURES, 0, 0, 0) returns the currently
> > > > enabled features.
> > >
> > > So what are the ISOL_FEATURES here? A mode that we enter such as flush
> > > vmstat _everytime_ we resume to userpace after (and including) this
> > prctl() ?
> >
> > ISOL_FEATURES is just the "command" type (which you can get and set).
> >
>
> So, ISOL_FEATURES is really defining when the operations are really going
> to take place for eg. on every uret?
ISOL_F_QUIESCE_ON_URET enables quiescing on userspace return.
> > The bitmask would include ISOL_F_QUIESCE_ON_URET, so:
> >
> >
> When we talk about full/complete isolation
https://lwn.net/Articles/816298/
Nohz and task isolation section
These features reduce interruptions on the isolated CPUs, but do not
fully eliminate them; task isolation is an attempt to finish the job by
removing all interruptions. A process that enters the isolation mode
will be able to run in user space with no interference from the kernel
or other processes.
> then does that translates to
> enabling all possible features supported by something like
> ISOL_F_QUIESCE_ON_URET?
Not necessarily.
If one controls what apps execute on the system (say the system
is completly idle), ISOL_F_QUIESCE_ON_URET with vmstat sync should
be sufficient for complete isolation (one can read events via
rt-trace-bpf.py).
> - bitmask = ISOL_F_QUIESCE_ON_URET;
> > - prctl(PR_ISOL_SET, ISOL_FEATURES, bitmask, 0, 0) enables the features in
> > the bitmask.
> >
> > - quiesce_bitmap = prctl(PR_ISOL_GET, PR_ISOL_SUP_QUIESCE_CFG, 0, 0, 0)
> > (1)
> >
> > (returns the supported actions to be quiesced).
> >
> > - prctl(PR_ISOL_SET, PR_ISOL_QUIESCE_CFG, quiesce_bitmask, 0, 0) _sets_
> > the actions to be quiesced (2)
> >
> > If an application does not modify "quiesce_bitmask" between
> > points (1) and (2) above, it will enable quiescing of all
> > "features" the kernel supports.
> >
> > Application can, however, modify quiesce_bitmap to its preference.
> >
> > Flushing vmstat _everytime_ you resume to userspace is enabled only
> > _after_ prctl(PR_ISOL_ENTER, 0, 0, 0, 0) is performed (which happens
> > only when isolation is fully configured with the PR_ISOL_SET calls).
> >
>
> Will this also happen if I disable ISOL_F_QUIESCE_VMSTAT_SYNC from the
> quiesce_bitmask?
Yes.
> > OK, will better document that.
> >
> > > If so I'd rather call that ISOL_MODE because feature is too general.
> >
> > Well, in the first patchset, there was one "mode" implemented (but
> > it was possible to implement different modes in the future).
> >
> > This would allow for example easier integration of "full task isolation"
> > patchset type of functionality, disallowing syscalls.
> >
> >
> Makes sense to go back to the usage of ISOL_MODE.
> After this change, the ISOL_FEATURES will be replaced with something like
> PR_ISOL_MODE_NORMAL/PR_MODE_ISOL?
>
> I think we'd like to keep that, so i'll keep the previous distinct modes
> > (but allow configuration of individual features on the bitmap).
> >
> > > >
> > > > The supported features are:
> > > >
> > > > ISOL_F_QUIESCE_ON_URET: quiesce deferred actions on return to
> > userspace.
> > > > ----------------------
> > > >
> > > > Quiescing of different actions can be performed on return to userspace.
> > > >
> > > > - prctl(PR_ISOL_GET, PR_ISOL_SUP_QUIESCE_CFG, 0, 0, 0) returns
> > > > the supported actions to be quiesced.
> > > >
> > > > - prctl(PR_ISOL_SET, PR_ISOL_QUIESCE_CFG, quiesce_bitmask, 0, 0)
> > returns
> >
> > s/returns/sets/
> >
> > > > the currently supported actions to be quiesced.
> > > >
> > > > - prctl(PR_ISOL_GET, PR_ISOL_QUIESCE_CFG, 0, 0, 0) returns
> > > > the currently enabled actions to be quiesced.
> > > >
> > > > #define ISOL_F_QUIESCE_VMSTAT_SYNC (1<<0)
> > > > #define ISOL_F_QUIESCE_NOHZ_FULL (1<<1)
> > > > #define ISOL_F_QUIESCE_DEFER_TLB_FLUSH (1<<2)
> > >
> > > And then PR_ISOL_QUIESCE_CFG is a oneshot operation that applies only
> > upon
> > > return to this ctrl, right? If so perhaps this should be called just
> > > ISOL_QUIESCE or ISOL_QUIESCE_ONCE or ISOL_REQ ?
> >
> > There was no one-shot operation implemented in the first patchset. What
> > application would do to achieve that is:
> >
> > 1. Configure isolation with PR_ISOL_SET (say configure mode which
> > allows system calls, and when a system call happens, flush all deferred
> > actions on return to userspace).
> >
> > 2. prctl(PR_ISOL_ENTER, 0, 0, 0, 0) (this actually enables the flushing,
> > and tags the task_struct as isolated). Here we can transfer this
> > information
> > from per-task to per-CPU data, for example, to be able to implement
> > other features such as deferred TLB flushing.
> >
> > On return from this prctl(), deferrable actions are flushed.
> >
> > 3. latency sensitive loop, with no system calls.
> >
> > 4. some event which requires system calls is noticed:
> > prctl(PR_ISOL_EXIT, 0, 0, 0, 0)
> > (this would untag task_struct as isolated).
> >
> > 5. perform system calls A, B, C, D (with no flushing of vmstat,
> > for example).
> >
> > 6. jmp to 2.
> >
> > So there is a problem with this logic, which is that one would like
> > certain isolation functionality to remain enabled between points 4
> > and 6 (for example, blocking CPU hotplug or other blockable activities
> > that would cause interruptions).
> >
> > One way to achieve this would be to replace PR_ISOL_ENTER/PR_ISOL_EXIT
> > with PR_ISOL_ENABLE, which accepts a bitmask:
> >
> > 1. Configure isolation with PR_ISOL_SET (say configure mode which
> > allows system calls, and when a system call happens, flush all deferred
> > actions on return to userspace).
> >
> > 2. enabled_bitmask = ISOL_F_QUIESCE_ON_URET|ISOL_F_BLOCK_INTERRUPTORS;
> > prctl(PR_ISOL_ENABLE, enabled_bitmask, 0, 0, 0)
> >
> > On return from this prctl(), deferrable actions are flushed.
> >
> > 3. latency sensitive loop, with no system calls.
> >
> > 4. some event which requires system calls is noticed:
> >
> > prctl(PR_ISOL_ENABLE, ISOL_F_BLOCK_INTERRUPTORS, 0, 0, 0)
> > (this would clear ISOL_F_QUIESCE_ON_URET, so no flushing
> > is performed on return from system calls).
> >
>
> FWIU we will still exit before this via prctl(PR_ISOL_EXIT, 0, 0, 0, 0)?
> Because if we are still in a latency-sensitive loop then not flushing
> while returning to the userspace can cause interruptions anyways.
No, PR_ISOL_ENABLE replaces PR_ISOL_ENTER/PR_ISOL_EXIT.
> > 5. perform system calls A, B, C, D (with no flushing of vmstat).
> >
> > 6. jmp to 2.
> >
> > ...
> >
> > On exit: prctl(PR_ISOL_ENABLE, 0, 0, 0, 0)
> >
> > IOW: the one-shot operation does not allow the application
> > to inform the kernel when the latency sensitive loop has
> > begun or has ended.
> >
> > >
> > > But that's just naming debate because otherwise that prctl layout looks
> > good
> > > to me.
> > >
> > > Thanks!
> >
> > Thank you for the input!
> >
> >
>
> --
> Thanks
> Nitesh
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 1/4] add basic task isolation prctl interface
[not found] ` <CAFki+LkQwoqVTKmgnwLQQM8ua-ixbLp8i+jUT6xF15k6X=89mw@mail.gmail.com>
@ 2021-07-28 16:21 ` Marcelo Tosatti
0 siblings, 0 replies; 26+ messages in thread
From: Marcelo Tosatti @ 2021-07-28 16:21 UTC (permalink / raw)
To: Nitesh Lal
Cc: Nicolas Saenz Julienne, Frederic Weisbecker, linux-kernel,
Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
Peter Xu, Thomas Gleixner
On Wed, Jul 28, 2021 at 11:00:01AM -0400, Nitesh Lal wrote:
> > > A latency sensitive
> > > application might be OK with the former but not with the latter.
> >
> > Two alternatives:
> >
> > 1) The pattern above, where particular subsystems that might interrupt
> > the kernel are enabled automatically if the kernel supports it.
> >
> > Pros:
> > Applications which implement this only need to be changed once,
> > and can benefit from new kernel features.
> >
> > Applications can disable particular features if they turn
> > out to be problematic.
> >
> > Cons:
> > New features might break applications.
> >
> > 2) Force applications to enable each new feature individually.
> >
> > Pros: Won't cause regressions, kernel behaviour is explicitly
> > controlled by userspace.
> >
> > Cons: Apps won't benefit from new features automatically.
> >
> > ---
> >
> > It seems to me 1) is preferred. Can also add a sysfs control to
> > have a "default_isolation_feature" flag, which can be changed
> > by a sysadmin in case a new feature is undesired.
> >
> > Thoughts?
> >
> >
> The first option may work specifically with the sysfs interface that you
> mentioned, however, IMHO (2) is safer than regressing the workloads. Also,
> if the previously implemented controls are good enough for the workload
> then there should not be a need to enable the new ones.
OK, can set default_isolation_feature as 0 then, which admin can
configure to a non-default value. This would enable the new
features only if the admin enables them.
Thanks.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 1/4] add basic task isolation prctl interface
2021-07-28 13:16 ` Marcelo Tosatti
[not found] ` <CAFki+LkQwoqVTKmgnwLQQM8ua-ixbLp8i+jUT6xF15k6X=89mw@mail.gmail.com>
@ 2021-07-28 17:08 ` nsaenzju
1 sibling, 0 replies; 26+ messages in thread
From: nsaenzju @ 2021-07-28 17:08 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Frederic Weisbecker, linux-kernel, Nitesh Lal, Christoph Lameter,
Juri Lelli, Peter Zijlstra, Alex Belits, Peter Xu,
Thomas Gleixner
On Wed, 2021-07-28 at 10:16 -0300, Marcelo Tosatti wrote:
> > For example, let's say we introduce ISOL_F_QUIESCE_DEFER_TLB_FLUSH, this will
> > defer relatively short IPIs on isolated CPUs in exchange for a longer flush
> > whenever we enter the kernel (syscall, IRQs, NMI, etc...).
>
> Why the flush has to be longer when you enter the kernel?
What I had in mind was cost of rapid partial flushes (IPIs) vs full flushes on
entry, although I haven't really measured anything so the extra latency cost
might as well be zero.
> ISOL_F_QUIESCE_DEFER_TLB_FLUSH might collapse multiple IPIs
> into a single IPI, so the behaviour might be beneficial
> for "standard" types of application as well.
>
> > A latency sensitive
> > application might be OK with the former but not with the latter.
>
> Two alternatives:
>
> 1) The pattern above, where particular subsystems that might interrupt
> the kernel are enabled automatically if the kernel supports it.
>
> Pros:
> Applications which implement this only need to be changed once,
> and can benefit from new kernel features.
>
> Applications can disable particular features if they turn
> out to be problematic.
>
> Cons:
> New features might break applications.
>
> 2) Force applications to enable each new feature individually.
>
> Pros: Won't cause regressions, kernel behaviour is explicitly
> controlled by userspace.
>
> Cons: Apps won't benefit from new features automatically.
>
> ---
>
> It seems to me 1) is preferred. Can also add a sysfs control to
> have a "default_isolation_feature" flag, which can be changed
> by a sysadmin in case a new feature is undesired.
>
> Thoughts?
I'd still take option 2. Nitesh has a very good point, latency requirements are
hit or miss. What's the benefit of enabling new features on an already valid
application vs the potential regression?
That said I see value in providing means for users that want all
features/modes, but it should be an through an explicit action on their part.
--
Nicolás Sáenz
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 1/4] add basic task isolation prctl interface
2021-07-28 13:21 ` Marcelo Tosatti
@ 2021-07-28 21:22 ` Frederic Weisbecker
0 siblings, 0 replies; 26+ messages in thread
From: Frederic Weisbecker @ 2021-07-28 21:22 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: nsaenzju, linux-kernel, Nitesh Lal, Christoph Lameter,
Juri Lelli, Peter Zijlstra, Alex Belits, Peter Xu,
Thomas Gleixner
On Wed, Jul 28, 2021 at 10:21:34AM -0300, Marcelo Tosatti wrote:
> > > ISOL_FEATURES is just the "command" type (which you can get and set).
> > >
> > > The bitmask would include ISOL_F_QUIESCE_ON_URET, so:
> > >
> > > - bitmask = ISOL_F_QUIESCE_ON_URET;
> > > - prctl(PR_ISOL_SET, ISOL_FEATURES, bitmask, 0, 0) enables the features in
> > > the bitmask.
> >
> > But does it quiesce once or for every further uret?
>
> For every uret, while ISOL_F_QUIESCE_ON_URET is enabled through
> prctl(PR_ISOL_ENABLE, enabled_bitmask, 0, 0, 0).
Ok.
>
> > > - quiesce_bitmap = prctl(PR_ISOL_GET, PR_ISOL_SUP_QUIESCE_CFG, 0, 0, 0)
> > > (1)
> > >
> > > (returns the supported actions to be quiesced).
> > >
> > > - prctl(PR_ISOL_SET, PR_ISOL_QUIESCE_CFG, quiesce_bitmask, 0, 0) _sets_
> > > the actions to be quiesced (2)
> > >
> > > If an application does not modify "quiesce_bitmask" between
> > > points (1) and (2) above, it will enable quiescing of all
> > > "features" the kernel supports.
> >
> > I don't get the difference between ISOL_FEATURES and PR_ISOL_QUIESCE_CFG.
>
> prctl(PR_ISOL_SET, cmd, ...) is intented to accept different types of "command"
> variables (including ones for new features which are not known at this
> time).
>
> - prctl(PR_ISOL_SET, ISOL_FEATURES, bitmask, 0, 0) enables the features in
> the bitmask
>
> (which might now be superceded by
>
> prctl(PR_ISOL_ENABLE, ISOL_F_QUIESCE_ON_URET, 0, 0, 0))
>
> - prctl(PR_ISOL_SET, PR_ISOL_QUIESCE_CFG, bitmask, 0, 0) configures
> quiescing of which subsystem/feature is performed:
>
> #define ISOL_F_QUIESCE_VMSTAT_SYNC (1<<0)
> #define ISOL_F_QUIESCE_NOHZ_FULL (1<<1)
> #define ISOL_F_QUIESCE_DEFER_TLB_FLUSH (1<<2)
Ok but...I still don't get the difference between ISOL_FEATURES and
PR_ISOL_QUIESCE_CFG :-)
> > So PR_ISOL_ENABLE is a way to perform action when some sort of kernel entry
> > happens. Then we take actions when that happens (signal, warn, etc...).
> >
> > I guess we'll need to define what kind of kernel entry, and what kind of
> > response need to happen. Ok that's a whole issue of its own that we'll need
> > to handle seperately.
> >
> > Thanks.
>
> In fact, why one can't use SECCOMP for syscall blocking?
Heh! Good point!
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 1/4] add basic task isolation prctl interface
[not found] ` <CAFki+LkQVQOe+5aNEKWDvLdnjWjxzKWOiqOvBZzeuPWX+G=XgA@mail.gmail.com>
@ 2021-08-02 14:16 ` Marcelo Tosatti
0 siblings, 0 replies; 26+ messages in thread
From: Marcelo Tosatti @ 2021-08-02 14:16 UTC (permalink / raw)
To: Nitesh Lal
Cc: linux-kernel, Nicolas Saenz Julienne, Frederic Weisbecker,
Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
Peter Xu
On Mon, Aug 02, 2021 at 10:02:03AM -0400, Nitesh Lal wrote:
> On Fri, Jul 30, 2021 at 4:21 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> > Add basic prctl task isolation interface, which allows
> > informing the kernel that application is executing
> > latency sensitive code (where interruptions are undesired).
> >
> > Interface is described by task_isolation.rst (added by this patch).
> >
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> >
> >
> [...]
>
> +extern void __tsk_isol_exit(struct task_struct *tsk);
> > +
> > +static inline void tsk_isol_exit(struct task_struct *tsk)
> > +{
> > + if (tsk->isol_info)
> > + __tsk_isol_exit(tsk);
> > +}
> > +
> > +
> >
>
> nit: we can get rid of this extra line.
>
>
> > +int prctl_task_isolation_feat(unsigned long arg2, unsigned long arg3,
> > + unsigned long arg4, unsigned long arg5);
> > +int prctl_task_isolation_get(unsigned long arg2, unsigned long arg3,
> > + unsigned long arg4, unsigned long arg5);
> > +int prctl_task_isolation_set(unsigned long arg2, unsigned long arg3,
> > + unsigned long arg4, unsigned long arg5);
> > +int prctl_task_isolation_ctrl_get(unsigned long arg2, unsigned long arg3,
> > + unsigned long arg4, unsigned long arg5);
> > +int prctl_task_isolation_ctrl_set(unsigned long arg2, unsigned long arg3,
> > + unsigned long arg4, unsigned long arg5);
> > +
> > +#else
> > +
> > +static inline void tsk_isol_exit(struct task_struct *tsk)
> > +{
> > +}
> > +
> > +static inline int prctl_task_isolation_feat(unsigned long arg2,
> > + unsigned long arg3,
> > + unsigned long arg4,
> > + unsigned long arg5)
> > +{
> > + return -EOPNOTSUPP;
> > +}
> > +
> > +static inline int prctl_task_isolation_get(unsigned long arg2,
> > + unsigned long arg3,
> > + unsigned long arg4,
> > + unsigned long arg5)
> > +{
> > + return -EOPNOTSUPP;
> > +}
> > +
> > +static inline int prctl_task_isolation_set(unsigned long arg2,
> > + unsigned long arg3,
> > + unsigned long arg4,
> > + unsigned long arg5)
> > +{
> > + return -EOPNOTSUPP;
> > +}
> > +
> > +static inline int prctl_task_isolation_ctrl_get(unsigned long arg2,
> > + unsigned long arg3,
> > + unsigned long arg4,
> > + unsigned long arg5)
> > +{
> > + return -EOPNOTSUPP;
> > +}
> > +
> > +static inline int prctl_task_isolation_ctrl_set(unsigned long arg2,
> > + unsigned long arg3,
> > + unsigned long arg4,
> > + unsigned long arg5)
> > +{
> > + return -EOPNOTSUPP;
> > +}
> > +
> > +#endif /* CONFIG_CPU_ISOLATION */
> > +
> > +#endif /* __LINUX_TASK_ISOL_H */
> > Index: linux-2.6/kernel/task_isolation.c
> > ===================================================================
> > --- /dev/null
> > +++ linux-2.6/kernel/task_isolation.c
> > @@ -0,0 +1,274 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Implementation of task isolation.
> > + *
> > + * Authors:
> > + * Chris Metcalf <cmetcalf@mellanox.com>
> > + * Alex Belits <abelits@belits.com>
> > + * Yuri Norov <ynorov@marvell.com>
> > + * Marcelo Tosatti <mtosatti@redhat.com>
> > + */
> > +
> > +#include <linux/sched.h>
> > +#include <linux/task_isolation.h>
> > +#include <linux/prctl.h>
> > +#include <linux/slab.h>
> > +#include <linux/kobject.h>
> > +#include <linux/string.h>
> > +#include <linux/sysfs.h>
> > +#include <linux/init.h>
> > +
> > +static unsigned long default_quiesce_mask;
> > +
> > +static int tsk_isol_alloc_context(struct task_struct *task)
> > +{
> > + struct isol_info *info;
> > +
> > + info = kzalloc(sizeof(*info), GFP_KERNEL);
> > + if (unlikely(!info))
> > + return -ENOMEM;
> > +
> > + task->isol_info = info;
> > + return 0;
> > +}
> > +
> > +void __tsk_isol_exit(struct task_struct *tsk)
> > +{
> > + kfree(tsk->isol_info);
> > + tsk->isol_info = NULL;
> > +}
> > +
> > +static int prctl_task_isolation_feat_quiesce(unsigned long type)
> > +{
> > + switch (type) {
> > + case 0:
> > + return ISOL_F_QUIESCE_VMSTATS;
> > + case ISOL_F_QUIESCE_DEFMASK:
> > + return default_quiesce_mask;
> > + default:
> > + break;
> > + }
> > +
> > + return -EINVAL;
> > +}
> > +
> > +static int task_isolation_get_quiesce(void)
> > +{
> > + if (current->isol_info != NULL)
> >
>
> Should replace the above with just 'if (current->isol_info)'.
>
> + return current->isol_info->quiesce_mask;
> > +
> > + return 0;
> > +}
> > +
> > +static int task_isolation_set_quiesce(unsigned long quiesce_mask)
> > +{
> > + if (quiesce_mask != ISOL_F_QUIESCE_VMSTATS && quiesce_mask != 0)
> > + return -EINVAL;
> > +
> > + current->isol_info->quiesce_mask = quiesce_mask;
> > + return 0;
> > +}
> > +
> > +int prctl_task_isolation_feat(unsigned long feat, unsigned long arg3,
> > + unsigned long arg4, unsigned long arg5)
> > +{
> > + switch (feat) {
> > + case 0:
> > + return ISOL_F_QUIESCE;
> > + case ISOL_F_QUIESCE:
> > + return prctl_task_isolation_feat_quiesce(arg3);
> > + default:
> > + break;
> > + }
> > + return -EINVAL;
> > +}
> > +
> > +int prctl_task_isolation_get(unsigned long feat, unsigned long arg3,
> > + unsigned long arg4, unsigned long arg5)
> > +{
> > + switch (feat) {
> > + case ISOL_F_QUIESCE:
> > + return task_isolation_get_quiesce();
> > + default:
> > + break;
> > + }
> > + return -EINVAL;
> > +}
> > +
> > +int prctl_task_isolation_set(unsigned long feat, unsigned long arg3,
> > + unsigned long arg4, unsigned long arg5)
> > +{
> > + int ret;
> > + bool err_free_ctx = false;
> > +
> > + if (current->isol_info == NULL)
> >
>
> Can replace this with 'if (!current->isol_info).
> There are other places below where similar improvement can be done.
>
>
> > + err_free_ctx = true;
> > +
> > + ret = tsk_isol_alloc_context(current);
> > + if (ret)
> > + return ret;
> > +
> > + switch (feat) {
> > + case ISOL_F_QUIESCE:
> > + ret = task_isolation_set_quiesce(arg3);
> > + if (ret)
> > + break;
> > + return 0;
> > + default:
> > + break;
> > + }
> > +
> > + if (err_free_ctx)
> > + __tsk_isol_exit(current);
> > + return -EINVAL;
> > +}
> > +
> > +int prctl_task_isolation_ctrl_set(unsigned long feat, unsigned long arg3,
> > + unsigned long arg4, unsigned long arg5)
> > +{
> > + if (current->isol_info == NULL)
> > + return -EINVAL;
> > +
> > + if (feat != ISOL_F_QUIESCE && feat != 0)
> > + return -EINVAL;
> > +
> > + current->isol_info->active_mask = feat;
> > + return 0;
> > +}
> > +
> > +int prctl_task_isolation_ctrl_get(unsigned long arg2, unsigned long arg3,
> > + unsigned long arg4, unsigned long arg5)
> > +{
> > + if (current->isol_info == NULL)
> > + return 0;
> > +
> > + return current->isol_info->active_mask;
> > +}
> > +
> > +struct qoptions {
> > + unsigned long mask;
> > + char *name;
> > +};
> > +
> > +static struct qoptions qopts[] = {
> > + {ISOL_F_QUIESCE_VMSTATS, "vmstat"},
> > +};
> > +
> > +#define QLEN (sizeof(qopts) / sizeof(struct qoptions))
> > +
> > +static ssize_t default_quiesce_store(struct kobject *kobj,
> > + struct kobj_attribute *attr,
> > + const char *buf, size_t count)
> > +{
> > + char *p, *s;
> > + unsigned long defmask = 0;
> > +
> > + s = (char *)buf;
> > + if (count == 1 && strlen(strim(s)) == 0) {
> > + default_quiesce_mask = 0;
> > + return count;
> > + }
> > +
> > + while ((p = strsep(&s, ",")) != NULL) {
> > + int i;
> > + bool found = false;
> > +
> > + if (!*p)
> > + continue;
> > +
> > + for (i = 0; i < QLEN; i++) {
> > + struct qoptions *opt = &qopts[i];
> > +
> > + if (strncmp(strim(p), opt->name,
> > strlen(opt->name)) == 0) {
> > + defmask |= opt->mask;
> > + found = true;
> > + break;
> > + }
> > + }
> > + if (found == true)
> > + continue;
> > + return -EINVAL;
> > + }
> > + default_quiesce_mask = defmask;
> > +
> > + return count;
> > +}
> > +
> > +#define MAXARRLEN 100
> > +
> > +static ssize_t default_quiesce_show(struct kobject *kobj,
> > + struct kobj_attribute *attr, char *buf)
> > +{
> > + int i;
> > + char tbuf[MAXARRLEN] = "";
> > +
> > + for (i = 0; i < QLEN; i++) {
> > + struct qoptions *opt = &qopts[i];
> > +
> > + if (default_quiesce_mask & opt->mask) {
> > + strlcat(tbuf, opt->name, MAXARRLEN);
> > + strlcat(tbuf, "\n", MAXARRLEN);
> > + }
> > + }
> > +
> > + return sprintf(buf, "%s", tbuf);
> > +}
> > +
> > +static struct kobj_attribute default_quiesce_attr =
> > + __ATTR_RW(default_quiesce);
> > +
> > +static ssize_t available_quiesce_show(struct kobject *kobj,
> > + struct kobj_attribute *attr, char
> > *buf)
> > +{
> > + int i;
> > + char tbuf[MAXARRLEN] = "";
> > +
> > + for (i = 0; i < QLEN; i++) {
> > + struct qoptions *opt = &qopts[i];
> > +
> > + strlcat(tbuf, opt->name, MAXARRLEN);
> > + strlcat(tbuf, "\n", MAXARRLEN);
> > + }
> > +
> > + return sprintf(buf, "%s", tbuf);
> > +}
> > +
> > +static struct kobj_attribute available_quiesce_attr =
> > + __ATTR_RO(available_quiesce);
> > +
> > +static struct attribute *task_isol_attrs[] = {
> > + &available_quiesce_attr.attr,
> > + &default_quiesce_attr.attr,
> > + NULL,
> > +};
> > +
> > +static const struct attribute_group task_isol_attr_group = {
> > + .attrs = task_isol_attrs,
> > + .bin_attrs = NULL,
> > +};
> > +
> > +static int __init task_isol_ksysfs_init(void)
> > +{
> > + int ret;
> > + struct kobject *task_isol_kobj;
> > +
> > + task_isol_kobj = kobject_create_and_add("task_isolation",
> > + kernel_kobj);
> > + if (!task_isol_kobj) {
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > +
> > + ret = sysfs_create_group(task_isol_kobj, &task_isol_attr_group);
> > + if (ret)
> > + goto out_task_isol_kobj;
> > +
> > + return 0;
> > +
> > +out_task_isol_kobj:
> > + kobject_put(task_isol_kobj);
> > +out:
> > + return ret;
> > +}
> > +
> > +arch_initcall(task_isol_ksysfs_init);
> > Index: linux-2.6/samples/Kconfig
> > ===================================================================
> > --- linux-2.6.orig/samples/Kconfig
> > +++ linux-2.6/samples/Kconfig
> > @@ -223,4 +223,11 @@ config SAMPLE_WATCH_QUEUE
> > Build example userspace program to use the new mount_notify(),
> > sb_notify() syscalls and the KEYCTL_WATCH_KEY keyctl() function.
> >
> > +config SAMPLE_TASK_ISOLATION
> > + bool "task isolation sample"
> > + depends on CC_CAN_LINK && HEADERS_INSTALL
> > + help
> > + Build example userspace program to use prctl task isolation
> > + interface.
> > +
> > endif # SAMPLES
> > Index: linux-2.6/samples/Makefile
> > ===================================================================
> > --- linux-2.6.orig/samples/Makefile
> > +++ linux-2.6/samples/Makefile
> > @@ -30,3 +30,4 @@ obj-$(CONFIG_SAMPLE_INTEL_MEI) += mei/
> > subdir-$(CONFIG_SAMPLE_WATCHDOG) += watchdog
> > subdir-$(CONFIG_SAMPLE_WATCH_QUEUE) += watch_queue
> > obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak/
> > +subdir-$(CONFIG_SAMPLE_TASK_ISOLATION) += task_isolation
> > Index: linux-2.6/samples/task_isolation/Makefile
> > ===================================================================
> > --- /dev/null
> > +++ linux-2.6/samples/task_isolation/Makefile
> > @@ -0,0 +1,4 @@
> > +# SPDX-License-Identifier: GPL-2.0
> > +userprogs-always-y += task_isolation
> > +
> > +userccflags += -I usr/include
> >
> >
> >
> I am wondering if it is possible to further split this patch into smaller
> ones?
OK, will try to split in smaller patches and fix the
style issues.
Thanks.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 1/4] add basic task isolation prctl interface
2021-07-30 20:18 ` [patch 1/4] add basic task isolation prctl interface Marcelo Tosatti
[not found] ` <CAFki+Lnf0cs62Se0aPubzYxP9wh7xjMXn7RXEPvrmtBdYBrsow@mail.gmail.com>
@ 2021-07-31 7:47 ` kernel test robot
[not found] ` <CAFki+LkQVQOe+5aNEKWDvLdnjWjxzKWOiqOvBZzeuPWX+G=XgA@mail.gmail.com>
2 siblings, 0 replies; 26+ messages in thread
From: kernel test robot @ 2021-07-31 7:47 UTC (permalink / raw)
To: Marcelo Tosatti, linux-kernel
Cc: kbuild-all, Nitesh Lal, Nicolas Saenz Julienne,
Frederic Weisbecker, Christoph Lameter, Juri Lelli,
Peter Zijlstra, Alex Belits, Peter Xu, Marcelo Tosatti
[-- Attachment #1: Type: text/plain, Size: 11061 bytes --]
Hi Marcelo,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on linus/master]
[also build test ERROR on v5.14-rc3]
[cannot apply to hnaz-linux-mm/master linux/master tip/sched/core tip/core/entry next-20210730]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Marcelo-Tosatti/extensible-prctl-task-isolation-interface-and-vmstat-sync-v2/20210731-042348
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 4669e13cd67f8532be12815ed3d37e775a9bdc16
config: s390-randconfig-r012-20210730 (attached as .config)
compiler: s390-linux-gcc (GCC) 10.3.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/c4a772b2c4f14959c65758feac89b3cd0e00a915
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Marcelo-Tosatti/extensible-prctl-task-isolation-interface-and-vmstat-sync-v2/20210731-042348
git checkout c4a772b2c4f14959c65758feac89b3cd0e00a915
# save the attached .config to linux build tree
mkdir build_dir
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-10.3.0 make.cross O=build_dir ARCH=s390 SHELL=/bin/bash
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All errors (new ones prefixed by >>):
kernel/sys.c: In function '__do_sys_prctl':
>> kernel/sys.c:2571:2: error: duplicate case value
2571 | case PR_ISOL_FEAT:
| ^~~~
kernel/sys.c:2567:2: note: previously used here
2567 | case PR_SCHED_CORE:
| ^~~~
vim +2571 kernel/sys.c
2301
2302 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
2303 unsigned long, arg4, unsigned long, arg5)
2304 {
2305 struct task_struct *me = current;
2306 unsigned char comm[sizeof(me->comm)];
2307 long error;
2308
2309 error = security_task_prctl(option, arg2, arg3, arg4, arg5);
2310 if (error != -ENOSYS)
2311 return error;
2312
2313 error = 0;
2314 switch (option) {
2315 case PR_SET_PDEATHSIG:
2316 if (!valid_signal(arg2)) {
2317 error = -EINVAL;
2318 break;
2319 }
2320 me->pdeath_signal = arg2;
2321 break;
2322 case PR_GET_PDEATHSIG:
2323 error = put_user(me->pdeath_signal, (int __user *)arg2);
2324 break;
2325 case PR_GET_DUMPABLE:
2326 error = get_dumpable(me->mm);
2327 break;
2328 case PR_SET_DUMPABLE:
2329 if (arg2 != SUID_DUMP_DISABLE && arg2 != SUID_DUMP_USER) {
2330 error = -EINVAL;
2331 break;
2332 }
2333 set_dumpable(me->mm, arg2);
2334 break;
2335
2336 case PR_SET_UNALIGN:
2337 error = SET_UNALIGN_CTL(me, arg2);
2338 break;
2339 case PR_GET_UNALIGN:
2340 error = GET_UNALIGN_CTL(me, arg2);
2341 break;
2342 case PR_SET_FPEMU:
2343 error = SET_FPEMU_CTL(me, arg2);
2344 break;
2345 case PR_GET_FPEMU:
2346 error = GET_FPEMU_CTL(me, arg2);
2347 break;
2348 case PR_SET_FPEXC:
2349 error = SET_FPEXC_CTL(me, arg2);
2350 break;
2351 case PR_GET_FPEXC:
2352 error = GET_FPEXC_CTL(me, arg2);
2353 break;
2354 case PR_GET_TIMING:
2355 error = PR_TIMING_STATISTICAL;
2356 break;
2357 case PR_SET_TIMING:
2358 if (arg2 != PR_TIMING_STATISTICAL)
2359 error = -EINVAL;
2360 break;
2361 case PR_SET_NAME:
2362 comm[sizeof(me->comm) - 1] = 0;
2363 if (strncpy_from_user(comm, (char __user *)arg2,
2364 sizeof(me->comm) - 1) < 0)
2365 return -EFAULT;
2366 set_task_comm(me, comm);
2367 proc_comm_connector(me);
2368 break;
2369 case PR_GET_NAME:
2370 get_task_comm(comm, me);
2371 if (copy_to_user((char __user *)arg2, comm, sizeof(comm)))
2372 return -EFAULT;
2373 break;
2374 case PR_GET_ENDIAN:
2375 error = GET_ENDIAN(me, arg2);
2376 break;
2377 case PR_SET_ENDIAN:
2378 error = SET_ENDIAN(me, arg2);
2379 break;
2380 case PR_GET_SECCOMP:
2381 error = prctl_get_seccomp();
2382 break;
2383 case PR_SET_SECCOMP:
2384 error = prctl_set_seccomp(arg2, (char __user *)arg3);
2385 break;
2386 case PR_GET_TSC:
2387 error = GET_TSC_CTL(arg2);
2388 break;
2389 case PR_SET_TSC:
2390 error = SET_TSC_CTL(arg2);
2391 break;
2392 case PR_TASK_PERF_EVENTS_DISABLE:
2393 error = perf_event_task_disable();
2394 break;
2395 case PR_TASK_PERF_EVENTS_ENABLE:
2396 error = perf_event_task_enable();
2397 break;
2398 case PR_GET_TIMERSLACK:
2399 if (current->timer_slack_ns > ULONG_MAX)
2400 error = ULONG_MAX;
2401 else
2402 error = current->timer_slack_ns;
2403 break;
2404 case PR_SET_TIMERSLACK:
2405 if (arg2 <= 0)
2406 current->timer_slack_ns =
2407 current->default_timer_slack_ns;
2408 else
2409 current->timer_slack_ns = arg2;
2410 break;
2411 case PR_MCE_KILL:
2412 if (arg4 | arg5)
2413 return -EINVAL;
2414 switch (arg2) {
2415 case PR_MCE_KILL_CLEAR:
2416 if (arg3 != 0)
2417 return -EINVAL;
2418 current->flags &= ~PF_MCE_PROCESS;
2419 break;
2420 case PR_MCE_KILL_SET:
2421 current->flags |= PF_MCE_PROCESS;
2422 if (arg3 == PR_MCE_KILL_EARLY)
2423 current->flags |= PF_MCE_EARLY;
2424 else if (arg3 == PR_MCE_KILL_LATE)
2425 current->flags &= ~PF_MCE_EARLY;
2426 else if (arg3 == PR_MCE_KILL_DEFAULT)
2427 current->flags &=
2428 ~(PF_MCE_EARLY|PF_MCE_PROCESS);
2429 else
2430 return -EINVAL;
2431 break;
2432 default:
2433 return -EINVAL;
2434 }
2435 break;
2436 case PR_MCE_KILL_GET:
2437 if (arg2 | arg3 | arg4 | arg5)
2438 return -EINVAL;
2439 if (current->flags & PF_MCE_PROCESS)
2440 error = (current->flags & PF_MCE_EARLY) ?
2441 PR_MCE_KILL_EARLY : PR_MCE_KILL_LATE;
2442 else
2443 error = PR_MCE_KILL_DEFAULT;
2444 break;
2445 case PR_SET_MM:
2446 error = prctl_set_mm(arg2, arg3, arg4, arg5);
2447 break;
2448 case PR_GET_TID_ADDRESS:
2449 error = prctl_get_tid_address(me, (int __user * __user *)arg2);
2450 break;
2451 case PR_SET_CHILD_SUBREAPER:
2452 me->signal->is_child_subreaper = !!arg2;
2453 if (!arg2)
2454 break;
2455
2456 walk_process_tree(me, propagate_has_child_subreaper, NULL);
2457 break;
2458 case PR_GET_CHILD_SUBREAPER:
2459 error = put_user(me->signal->is_child_subreaper,
2460 (int __user *)arg2);
2461 break;
2462 case PR_SET_NO_NEW_PRIVS:
2463 if (arg2 != 1 || arg3 || arg4 || arg5)
2464 return -EINVAL;
2465
2466 task_set_no_new_privs(current);
2467 break;
2468 case PR_GET_NO_NEW_PRIVS:
2469 if (arg2 || arg3 || arg4 || arg5)
2470 return -EINVAL;
2471 return task_no_new_privs(current) ? 1 : 0;
2472 case PR_GET_THP_DISABLE:
2473 if (arg2 || arg3 || arg4 || arg5)
2474 return -EINVAL;
2475 error = !!test_bit(MMF_DISABLE_THP, &me->mm->flags);
2476 break;
2477 case PR_SET_THP_DISABLE:
2478 if (arg3 || arg4 || arg5)
2479 return -EINVAL;
2480 if (mmap_write_lock_killable(me->mm))
2481 return -EINTR;
2482 if (arg2)
2483 set_bit(MMF_DISABLE_THP, &me->mm->flags);
2484 else
2485 clear_bit(MMF_DISABLE_THP, &me->mm->flags);
2486 mmap_write_unlock(me->mm);
2487 break;
2488 case PR_MPX_ENABLE_MANAGEMENT:
2489 case PR_MPX_DISABLE_MANAGEMENT:
2490 /* No longer implemented: */
2491 return -EINVAL;
2492 case PR_SET_FP_MODE:
2493 error = SET_FP_MODE(me, arg2);
2494 break;
2495 case PR_GET_FP_MODE:
2496 error = GET_FP_MODE(me);
2497 break;
2498 case PR_SVE_SET_VL:
2499 error = SVE_SET_VL(arg2);
2500 break;
2501 case PR_SVE_GET_VL:
2502 error = SVE_GET_VL();
2503 break;
2504 case PR_GET_SPECULATION_CTRL:
2505 if (arg3 || arg4 || arg5)
2506 return -EINVAL;
2507 error = arch_prctl_spec_ctrl_get(me, arg2);
2508 break;
2509 case PR_SET_SPECULATION_CTRL:
2510 if (arg4 || arg5)
2511 return -EINVAL;
2512 error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
2513 break;
2514 case PR_PAC_RESET_KEYS:
2515 if (arg3 || arg4 || arg5)
2516 return -EINVAL;
2517 error = PAC_RESET_KEYS(me, arg2);
2518 break;
2519 case PR_PAC_SET_ENABLED_KEYS:
2520 if (arg4 || arg5)
2521 return -EINVAL;
2522 error = PAC_SET_ENABLED_KEYS(me, arg2, arg3);
2523 break;
2524 case PR_PAC_GET_ENABLED_KEYS:
2525 if (arg2 || arg3 || arg4 || arg5)
2526 return -EINVAL;
2527 error = PAC_GET_ENABLED_KEYS(me);
2528 break;
2529 case PR_SET_TAGGED_ADDR_CTRL:
2530 if (arg3 || arg4 || arg5)
2531 return -EINVAL;
2532 error = SET_TAGGED_ADDR_CTRL(arg2);
2533 break;
2534 case PR_GET_TAGGED_ADDR_CTRL:
2535 if (arg2 || arg3 || arg4 || arg5)
2536 return -EINVAL;
2537 error = GET_TAGGED_ADDR_CTRL();
2538 break;
2539 case PR_SET_IO_FLUSHER:
2540 if (!capable(CAP_SYS_RESOURCE))
2541 return -EPERM;
2542
2543 if (arg3 || arg4 || arg5)
2544 return -EINVAL;
2545
2546 if (arg2 == 1)
2547 current->flags |= PR_IO_FLUSHER;
2548 else if (!arg2)
2549 current->flags &= ~PR_IO_FLUSHER;
2550 else
2551 return -EINVAL;
2552 break;
2553 case PR_GET_IO_FLUSHER:
2554 if (!capable(CAP_SYS_RESOURCE))
2555 return -EPERM;
2556
2557 if (arg2 || arg3 || arg4 || arg5)
2558 return -EINVAL;
2559
2560 error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
2561 break;
2562 case PR_SET_SYSCALL_USER_DISPATCH:
2563 error = set_syscall_user_dispatch(arg2, arg3, arg4,
2564 (char __user *) arg5);
2565 break;
2566 #ifdef CONFIG_SCHED_CORE
2567 case PR_SCHED_CORE:
2568 error = sched_core_share_pid(arg2, arg3, arg4, arg5);
2569 break;
2570 #endif
> 2571 case PR_ISOL_FEAT:
2572 error = prctl_task_isolation_feat(arg2, arg3, arg4, arg5);
2573 break;
2574 case PR_ISOL_GET:
2575 error = prctl_task_isolation_get(arg2, arg3, arg4, arg5);
2576 break;
2577 case PR_ISOL_SET:
2578 error = prctl_task_isolation_set(arg2, arg3, arg4, arg5);
2579 break;
2580 case PR_ISOL_CTRL_GET:
2581 error = prctl_task_isolation_ctrl_get(arg2, arg3, arg4, arg5);
2582 break;
2583 case PR_ISOL_CTRL_SET:
2584 error = prctl_task_isolation_ctrl_set(arg2, arg3, arg4, arg5);
2585 break;
2586 default:
2587 error = -EINVAL;
2588 break;
2589 }
2590 return error;
2591 }
2592
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 35453 bytes --]
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [patch 1/4] add basic task isolation prctl interface
[not found] ` <CAFki+Lnf0cs62Se0aPubzYxP9wh7xjMXn7RXEPvrmtBdYBrsow@mail.gmail.com>
@ 2021-07-31 0:49 ` Marcelo Tosatti
0 siblings, 0 replies; 26+ messages in thread
From: Marcelo Tosatti @ 2021-07-31 0:49 UTC (permalink / raw)
To: Nitesh Lal
Cc: linux-kernel, Nicolas Saenz Julienne, Frederic Weisbecker,
Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
Peter Xu
On Fri, Jul 30, 2021 at 07:36:31PM -0400, Nitesh Lal wrote:
> On Fri, Jul 30, 2021 at 4:21 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> > Add basic prctl task isolation interface, which allows
> > informing the kernel that application is executing
> > latency sensitive code (where interruptions are undesired).
> >
> > Interface is described by task_isolation.rst (added by this patch).
> >
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> >
> > Index: linux-2.6/Documentation/userspace-api/task_isolation.rst
> > ===================================================================
> > --- /dev/null
> > +++ linux-2.6/Documentation/userspace-api/task_isolation.rst
> > @@ -0,0 +1,187 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +===============================
> > +Task isolation prctl interface
> > +===============================
> > +
> > +Certain types of applications benefit from running uninterrupted by
> > +background OS activities. Realtime systems and high-bandwidth networking
> > +applications with user-space drivers can fall into the category.
> > +
> > +
> > +To create a OS noise free environment for the application, this
> > +interface allows userspace to inform the kernel the start and
> > +end of the latency sensitive application section (with configurable
> > +system behaviour for that section).
> > +
> > +The prctl options are:
> > +
> > +
> > + - PR_ISOL_FEAT: Retrieve supported features.
> > + - PR_ISOL_GET: Retrieve task isolation parameters.
> > + - PR_ISOL_SET: Set task isolation parameters.
> > + - PR_ISOL_CTRL_GET: Retrieve task isolation state.
> > + - PR_ISOL_CTRL_SET: Set task isolation state (enable/disable task
> > isolation).
> > +
> >
>
> Didn't we decide to replace FEAT/FEATURES with MODE?
Searching for the definition of mode:
mode: one of a series of ways that a machine can be made to work
in manual/automatic mode.
mode: a particular way of doing something.
mode: a way of operating, living, or behaving.
So "mode" seems to fit the case where one case can be chosen
between different choices (exclusively).
Now for this case it seems a composition of things is what is
happening, because quiescing might be functional with both
"syscalls allowed" and "syscalls not allowed" modes
(in that case, "mode" makes more sense).
> > +The isolation parameters and state are not inherited by
> > +children created by fork(2) and clone(2). The setting is
> > +preserved across execve(2).
> > +
> > +The sequence of steps to enable task isolation are:
> > +
> > +1. Retrieve supported task isolation features (PR_ISOL_FEAT).
> > +
> > +2. Configure task isolation features (PR_ISOL_SET/PR_ISOL_GET).
> > +
> > +3. Activate or deactivate task isolation features
> > + (PR_ISOL_CTRL_GET/PR_ISOL_CTRL_SET).
> > +
> > +This interface is based on ideas and code from the
> > +task isolation patchset from Alex Belits:
> > +https://lwn.net/Articles/816298/
> > +
> > +--------------------
> > +Feature description
> > +--------------------
> > +
> > + - ``ISOL_F_QUIESCE``
> > +
> > + This feature allows quiescing select kernel activities on
> > + return from system calls.
> > +
> > +---------------------
> > +Interface description
> > +---------------------
> > +
> > +**PR_ISOL_FEAT**:
> > +
> > + Returns the supported features and feature
> > + capabilities, as a bitmask. Features and its capabilities
> > + are defined at include/uapi/linux/task_isolation.h::
> > +
> > + prctl(PR_ISOL_FEAT, feat, arg3, arg4, arg5);
> > +
> > + The 'feat' argument specifies whether to return
> > + supported features (if zero), or feature capabilities
> > + (if not zero). Possible non-zero values for 'feat' are:
> >
>
> By feature capabilities you mean the kernel activities (vmstat, tlb_flush)?
Not necessarily, but in the case of ISOL_F_QUIESCE, yes, the different
kernel activities that might interrupt the task.
Feature capabilities is a generic term. For example, one might add
ISOL_F_NOTIFY with ISOL_F_NOTIFY_SIGNAL capabilities.
or
ISOL_F_NOTIFY with ISOL_F_NOTIFY_EVENTFD capabilities.
or
ISOL_F_future_feature with ISOL_F_future_feature_capability.
> +
> > + - ``ISOL_F_QUIESCE``:
> > +
> > + If arg3 is zero, returns a bitmask containing
> > + which kernel activities are supported for quiescing.
> > +
> > + If arg3 is ISOL_F_QUIESCE_DEFMASK, returns
> > + default_quiesce_mask, a system-wide configurable.
> > + See description of default_quiesce_mask below.
> > +
> > +**PR_ISOL_GET**:
> > +
> > + Retrieve task isolation feature configuration.
> > + The general format is::
> > +
> > + prctl(PR_ISOL_GET, feat, arg3, arg4, arg5);
> > +
> > + Possible values for feat are:
> > +
> > + - ``ISOL_F_QUIESCE``:
> > +
> > + Returns a bitmask containing which kernel
> > + activities are enabled for quiescing.
> > +
> > +
> > +**PR_ISOL_SET**:
> > +
> > + Configures task isolation features. The general format is::
> > +
> > + prctl(PR_ISOL_SET, feat, arg3, arg4, arg5);
> > +
> > + The 'feat' argument specifies which feature to configure.
> > + Possible values for feat are:
> >
>
> We should be able to enable multiple features as well via this? Something
> like ISOL_F_QUIESCE|ISOL_F_BLOCK_INTERRUPTORS as you have mentioned in the
> last posting.
One probably would do it separately (PR_ISOL_SET configures each
feature separately):
ret = prctl(PR_ISOL_FEAT, 0, 0, 0, 0);
if (ret == -1) {
perror("prctl PR_ISOL_FEAT");
return EXIT_FAILURE;
}
if (!(ret & ISOL_F_BLOCK_INTERRUPTORS)) {
printf("ISOL_F_BLOCK_INTERRUPTORS feature unsupported, quitting\n");
return EXIT_FAILURE;
}
ret = prctl(PR_ISOL_SET, ISOL_F_BLOCK_INTERRUPTORS, params...);
if (ret == -1) {
perror("prctl PR_ISOL_SET");
return EXIT_FAILURE;
}
/* configure ISOL_F_QUIESCE, ISOL_F_NOTIFY,
* ISOL_F_future_feature... */
ctrl_set_mask = ISOL_F_QUIESCE|ISOL_F_BLOCK_INTERRUPTORS|
ISOL_F_NOTIFY|ISOL_F_future_feature;
/*
* activate isolation mode with the features
* as configured above
*/
ret = prctl(PR_ISOL_CTRL_SET, ctrl_set_mask, 0, 0, 0);
if (ret == -1) {
perror("prctl PR_ISOL_CTRL_SET (ISOL_F_QUIESCE)");
return EXIT_FAILURE;
}
latency sensitive loop
> > +
> > + - ``ISOL_F_QUIESCE``:
> > +
> > + The 'arg3' argument is a bitmask specifying which
> > + kernel activities to quiesce. Possible bit sets are:
> > +
> > + - ``ISOL_F_QUIESCE_VMSTATS``
> > +
> > + VM statistics are maintained in per-CPU counters to
> > + improve performance. When a CPU modifies a VM statistic,
> > + this modification is kept in the per-CPU counter.
> > + Certain activities require a global count, which
> > + involves requesting each CPU to flush its local counters
> > + to the global VM counters.
> > +
> > + This flush is implemented via a workqueue item, which
> > + might schedule a workqueue on isolated CPUs.
> > +
> > + To avoid this interruption, task isolation can be
> > + configured to, upon return from system calls,
> > synchronize
> > + the per-CPU counters to global counters, thus avoiding
> > + the interruption.
> > +
> > + To ensure the application returns to userspace
> > + with no modified per-CPU counters, its necessary to
> > + use mlockall() in addition to this isolcpus flag.
> > +
> > +**PR_ISOL_CTRL_GET**:
> > +
> > + Retrieve task isolation control.
> > +
> > + prctl(PR_ISOL_CTRL_GET, 0, 0, 0, 0);
> > +
> > + Returns which isolation features are active.
> > +
> > +**PR_ISOL_CTRL_SET**:
> > +
> > + Activates/deactivates task isolation control.
> > +
> > + prctl(PR_ISOL_CTRL_SET, mask, 0, 0, 0);
> > +
> > + The 'mask' argument specifies which features
> > + to activate (bit set) or deactivate (bit clear).
> > +
> > + For ISOL_F_QUIESCE, quiescing of background activities
> > + happens on return to userspace from the
> > + prctl(PR_ISOL_CTRL_SET) call, and on return from
> > + subsequent system calls.
> > +
> > + Quiescing can be adjusted (while active) by
> > + prctl(PR_ISOL_SET, ISOL_F_QUIESCE, ...).
> >
>
> Why do we need this additional control? We should be able to enable or
> disable task isolation using the _GET_ and _SET_ calls, isn't it?
The distinction is so one is able to configure the features separately,
and then enter isolated mode with them activated.
> > +
> > +--------------------
> > +Default quiesce mask
> > +--------------------
> > +
> > +Applications can either explicitly specify individual
> > +background activities that should be quiesced, or
> > +obtain a system configurable value, which is to be
> > +configured by the system admin/mgmt system.
> > +
> > +/sys/kernel/task_isolation/available_quiesce lists, as
> > +one string per line, the activities which the kernel
> > +supports quiescing.
> >
>
> Probably replace 'quiesce' with 'quiesce_activities' because we are really
> controlling the kernel activities via this control and not the quiesce
> state/feature itself.
OK, makes sense.
> > +
> > +To configure the default quiesce mask, write a comma separated
> > +list of strings (from available_quiesce) to
> > +/sys/kernel/task_isolation/default_quiesce.
> > +
> > +echo > /sys/kernel/task_isolation/default_quiesce disables
> > +all quiescing via ISOL_F_QUIESCE_DEFMASK.
> > +
> > +Using ISOL_F_QUIESCE_DEFMASK allows for the application to
> > +take advantage of future quiescing capabilities without
> > +modification (provided default_quiesce is configured
> > +accordingly).
> >
>
> ISOL_F_QUIESCE_DEFMASK is really telling to quite all kernel
> activities including the one that is not currently supported or I am
> misinterpreting something?
Its telling to quiesce activities that are configured via
/sys/kernel/task_isolation/default_quiesce, including
ones that are not currently supported (in the future,
/sys/kernel/task_isolation/default_quiesce will have to contain the bit
for the new feature as 1).
So userspace can either:
quiesce_mask = value of /sys/kernel/task_isolation/default_quiesce
prctl(PR_ISOL_SET, ISOL_F_QUIESCE, quiesce_mask, 0, 0);
(so that new features might be automatically enabled by
a sysadmin).
or
quiesce_mask = application choice of bits
prctl(PR_ISOL_SET, ISOL_F_QUIESCE, quiesce_mask, 0, 0);
(so that new features might be automatically enabled by
a sysadmin).
^ permalink raw reply [flat|nested] 26+ messages in thread
* [patch 1/4] add basic task isolation prctl interface
2021-07-30 20:18 [patch 0/4] extensible prctl task isolation interface and vmstat sync (v2) Marcelo Tosatti
@ 2021-07-30 20:18 ` Marcelo Tosatti
[not found] ` <CAFki+Lnf0cs62Se0aPubzYxP9wh7xjMXn7RXEPvrmtBdYBrsow@mail.gmail.com>
` (2 more replies)
0 siblings, 3 replies; 26+ messages in thread
From: Marcelo Tosatti @ 2021-07-30 20:18 UTC (permalink / raw)
To: linux-kernel
Cc: Nitesh Lal, Nicolas Saenz Julienne, Frederic Weisbecker,
Christoph Lameter, Juri Lelli, Peter Zijlstra, Alex Belits,
Peter Xu, Marcelo Tosatti
Add basic prctl task isolation interface, which allows
informing the kernel that application is executing
latency sensitive code (where interruptions are undesired).
Interface is described by task_isolation.rst (added by this patch).
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Index: linux-2.6/Documentation/userspace-api/task_isolation.rst
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/userspace-api/task_isolation.rst
@@ -0,0 +1,187 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================
+Task isolation prctl interface
+===============================
+
+Certain types of applications benefit from running uninterrupted by
+background OS activities. Realtime systems and high-bandwidth networking
+applications with user-space drivers can fall into the category.
+
+
+To create a OS noise free environment for the application, this
+interface allows userspace to inform the kernel the start and
+end of the latency sensitive application section (with configurable
+system behaviour for that section).
+
+The prctl options are:
+
+
+ - PR_ISOL_FEAT: Retrieve supported features.
+ - PR_ISOL_GET: Retrieve task isolation parameters.
+ - PR_ISOL_SET: Set task isolation parameters.
+ - PR_ISOL_CTRL_GET: Retrieve task isolation state.
+ - PR_ISOL_CTRL_SET: Set task isolation state (enable/disable task isolation).
+
+The isolation parameters and state are not inherited by
+children created by fork(2) and clone(2). The setting is
+preserved across execve(2).
+
+The sequence of steps to enable task isolation are:
+
+1. Retrieve supported task isolation features (PR_ISOL_FEAT).
+
+2. Configure task isolation features (PR_ISOL_SET/PR_ISOL_GET).
+
+3. Activate or deactivate task isolation features
+ (PR_ISOL_CTRL_GET/PR_ISOL_CTRL_SET).
+
+This interface is based on ideas and code from the
+task isolation patchset from Alex Belits:
+https://lwn.net/Articles/816298/
+
+--------------------
+Feature description
+--------------------
+
+ - ``ISOL_F_QUIESCE``
+
+ This feature allows quiescing select kernel activities on
+ return from system calls.
+
+---------------------
+Interface description
+---------------------
+
+**PR_ISOL_FEAT**:
+
+ Returns the supported features and feature
+ capabilities, as a bitmask. Features and its capabilities
+ are defined at include/uapi/linux/task_isolation.h::
+
+ prctl(PR_ISOL_FEAT, feat, arg3, arg4, arg5);
+
+ The 'feat' argument specifies whether to return
+ supported features (if zero), or feature capabilities
+ (if not zero). Possible non-zero values for 'feat' are:
+
+ - ``ISOL_F_QUIESCE``:
+
+ If arg3 is zero, returns a bitmask containing
+ which kernel activities are supported for quiescing.
+
+ If arg3 is ISOL_F_QUIESCE_DEFMASK, returns
+ default_quiesce_mask, a system-wide configurable.
+ See description of default_quiesce_mask below.
+
+**PR_ISOL_GET**:
+
+ Retrieve task isolation feature configuration.
+ The general format is::
+
+ prctl(PR_ISOL_GET, feat, arg3, arg4, arg5);
+
+ Possible values for feat are:
+
+ - ``ISOL_F_QUIESCE``:
+
+ Returns a bitmask containing which kernel
+ activities are enabled for quiescing.
+
+
+**PR_ISOL_SET**:
+
+ Configures task isolation features. The general format is::
+
+ prctl(PR_ISOL_SET, feat, arg3, arg4, arg5);
+
+ The 'feat' argument specifies which feature to configure.
+ Possible values for feat are:
+
+ - ``ISOL_F_QUIESCE``:
+
+ The 'arg3' argument is a bitmask specifying which
+ kernel activities to quiesce. Possible bit sets are:
+
+ - ``ISOL_F_QUIESCE_VMSTATS``
+
+ VM statistics are maintained in per-CPU counters to
+ improve performance. When a CPU modifies a VM statistic,
+ this modification is kept in the per-CPU counter.
+ Certain activities require a global count, which
+ involves requesting each CPU to flush its local counters
+ to the global VM counters.
+
+ This flush is implemented via a workqueue item, which
+ might schedule a workqueue on isolated CPUs.
+
+ To avoid this interruption, task isolation can be
+ configured to, upon return from system calls, synchronize
+ the per-CPU counters to global counters, thus avoiding
+ the interruption.
+
+ To ensure the application returns to userspace
+ with no modified per-CPU counters, its necessary to
+ use mlockall() in addition to this isolcpus flag.
+
+**PR_ISOL_CTRL_GET**:
+
+ Retrieve task isolation control.
+
+ prctl(PR_ISOL_CTRL_GET, 0, 0, 0, 0);
+
+ Returns which isolation features are active.
+
+**PR_ISOL_CTRL_SET**:
+
+ Activates/deactivates task isolation control.
+
+ prctl(PR_ISOL_CTRL_SET, mask, 0, 0, 0);
+
+ The 'mask' argument specifies which features
+ to activate (bit set) or deactivate (bit clear).
+
+ For ISOL_F_QUIESCE, quiescing of background activities
+ happens on return to userspace from the
+ prctl(PR_ISOL_CTRL_SET) call, and on return from
+ subsequent system calls.
+
+ Quiescing can be adjusted (while active) by
+ prctl(PR_ISOL_SET, ISOL_F_QUIESCE, ...).
+
+--------------------
+Default quiesce mask
+--------------------
+
+Applications can either explicitly specify individual
+background activities that should be quiesced, or
+obtain a system configurable value, which is to be
+configured by the system admin/mgmt system.
+
+/sys/kernel/task_isolation/available_quiesce lists, as
+one string per line, the activities which the kernel
+supports quiescing.
+
+To configure the default quiesce mask, write a comma separated
+list of strings (from available_quiesce) to
+/sys/kernel/task_isolation/default_quiesce.
+
+echo > /sys/kernel/task_isolation/default_quiesce disables
+all quiescing via ISOL_F_QUIESCE_DEFMASK.
+
+Using ISOL_F_QUIESCE_DEFMASK allows for the application to
+take advantage of future quiescing capabilities without
+modification (provided default_quiesce is configured
+accordingly).
+
+See PR_ISOL_FEAT subsection of "Interface description" section
+for more details. samples/task_isolation/task_isolation.c
+contains an example.
+
+Examples
+========
+
+The ``samples/task_isolation/`` directory contains sample
+applications.
+
+
Index: linux-2.6/include/uapi/linux/prctl.h
===================================================================
--- linux-2.6.orig/include/uapi/linux/prctl.h
+++ linux-2.6/include/uapi/linux/prctl.h
@@ -267,4 +267,17 @@ struct prctl_mm_map {
# define PR_SCHED_CORE_SHARE_FROM 3 /* pull core_sched cookie to pid */
# define PR_SCHED_CORE_MAX 4
+/* Task isolation control */
+#define PR_ISOL_FEAT 62
+#define PR_ISOL_GET 63
+#define PR_ISOL_SET 64
+#define PR_ISOL_CTRL_GET 65
+#define PR_ISOL_CTRL_SET 66
+
+# define ISOL_F_QUIESCE (1UL << 0)
+# define ISOL_F_QUIESCE_VMSTATS (1UL << 0)
+
+# define ISOL_F_QUIESCE_DEFMASK (1UL << 0)
+
+
#endif /* _LINUX_PRCTL_H */
Index: linux-2.6/kernel/Makefile
===================================================================
--- linux-2.6.orig/kernel/Makefile
+++ linux-2.6/kernel/Makefile
@@ -132,6 +132,8 @@ obj-$(CONFIG_WATCH_QUEUE) += watch_queue
obj-$(CONFIG_RESOURCE_KUNIT_TEST) += resource_kunit.o
obj-$(CONFIG_SYSCTL_KUNIT_TEST) += sysctl-test.o
+obj-$(CONFIG_CPU_ISOLATION) += task_isolation.o
+
CFLAGS_stackleak.o += $(DISABLE_STACKLEAK_PLUGIN)
obj-$(CONFIG_GCC_PLUGIN_STACKLEAK) += stackleak.o
KASAN_SANITIZE_stackleak.o := n
Index: linux-2.6/kernel/sys.c
===================================================================
--- linux-2.6.orig/kernel/sys.c
+++ linux-2.6/kernel/sys.c
@@ -58,6 +58,7 @@
#include <linux/sched/coredump.h>
#include <linux/sched/task.h>
#include <linux/sched/cputime.h>
+#include <linux/task_isolation.h>
#include <linux/rcupdate.h>
#include <linux/uidgid.h>
#include <linux/cred.h>
@@ -2567,6 +2568,21 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
error = sched_core_share_pid(arg2, arg3, arg4, arg5);
break;
#endif
+ case PR_ISOL_FEAT:
+ error = prctl_task_isolation_feat(arg2, arg3, arg4, arg5);
+ break;
+ case PR_ISOL_GET:
+ error = prctl_task_isolation_get(arg2, arg3, arg4, arg5);
+ break;
+ case PR_ISOL_SET:
+ error = prctl_task_isolation_set(arg2, arg3, arg4, arg5);
+ break;
+ case PR_ISOL_CTRL_GET:
+ error = prctl_task_isolation_ctrl_get(arg2, arg3, arg4, arg5);
+ break;
+ case PR_ISOL_CTRL_SET:
+ error = prctl_task_isolation_ctrl_set(arg2, arg3, arg4, arg5);
+ break;
default:
error = -EINVAL;
break;
Index: linux-2.6/samples/task_isolation/task_isolation.c
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/task_isolation.c
@@ -0,0 +1,97 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/mman.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/prctl.h>
+#include <linux/prctl.h>
+
+int main(void)
+{
+ int ret, defmask;
+ void *buf = malloc(4096);
+
+ memset(buf, 1, 4096);
+ ret = mlock(buf, 4096);
+ if (ret) {
+ perror("mlock");
+ return EXIT_FAILURE;
+ }
+
+ ret = prctl(PR_ISOL_FEAT, 0, 0, 0, 0);
+ if (ret == -1) {
+ perror("prctl PR_ISOL_FEAT");
+ return EXIT_FAILURE;
+ }
+ printf("supported features bitmask: 0x%x\n", ret);
+
+ if (!(ret & ISOL_F_QUIESCE)) {
+ printf("quiesce feature unsupported, quitting\n");
+ return EXIT_FAILURE;
+ }
+
+ ret = prctl(PR_ISOL_FEAT, ISOL_F_QUIESCE, 0, 0, 0);
+ if (ret == -1) {
+ perror("prctl PR_ISOL_FEAT (ISOL_F_QUIESCE)");
+ return EXIT_FAILURE;
+ }
+ printf("supported ISOL_F_QUIESCE bits: 0x%x\n", ret);
+
+ ret = prctl(PR_ISOL_FEAT, ISOL_F_QUIESCE, ISOL_F_QUIESCE_DEFMASK,
+ 0, 0);
+ if (ret == -1) {
+ perror("prctl PR_ISOL_FEAT (ISOL_F_QUIESCE, DEFMASK)");
+ return EXIT_FAILURE;
+ }
+
+ defmask = ret;
+ printf("default ISOL_F_QUIESCE bits: 0x%x\n", defmask);
+
+ /*
+ * Application can either set the value from ISOL_F_QUIESCE_DEFMASK,
+ * which is configurable through
+ * /sys/kernel/task_isolation/default_quiesce, or specific values.
+ *
+ * Using ISOL_F_QUIESCE_DEFMASK allows for the application to
+ * take advantage of future quiescing capabilities without
+ * modification (provided default_quiesce is configured
+ * accordingly).
+ */
+ defmask = defmask | ISOL_F_QUIESCE_VMSTATS;
+
+ ret = prctl(PR_ISOL_SET, ISOL_F_QUIESCE, defmask,
+ 0, 0);
+ if (ret == -1) {
+ perror("prctl PR_ISOL_SET");
+ return EXIT_FAILURE;
+ }
+
+ ret = prctl(PR_ISOL_CTRL_SET, ISOL_F_QUIESCE, 0, 0, 0);
+ if (ret == -1) {
+ perror("prctl PR_ISOL_CTRL_SET (ISOL_F_QUIESCE)");
+ return EXIT_FAILURE;
+ }
+
+#define NR_LOOPS 999999999
+#define NR_PRINT 100000000
+ /* busy loop */
+ while (ret < NR_LOOPS) {
+ memset(buf, 0, 4096);
+ ret = ret+1;
+ if (!(ret % NR_PRINT))
+ printf("loops=%d of %d\n", ret, NR_LOOPS);
+ }
+
+ ret = prctl(PR_ISOL_CTRL_SET, 0, 0, 0, 0);
+ if (ret == -1) {
+ perror("prctl PR_ISOL_CTRL_SET (0)");
+ exit(0);
+ }
+
+ return EXIT_SUCCESS;
+}
+
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -66,6 +66,7 @@ struct sighand_struct;
struct signal_struct;
struct task_delay_info;
struct task_group;
+struct isol_info;
/*
* Task state bitmask. NOTE! These bits are also
@@ -1400,6 +1401,10 @@ struct task_struct {
struct llist_head kretprobe_instances;
#endif
+#ifdef CONFIG_CPU_ISOLATION
+ struct isol_info *isol_info;
+#endif
+
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
Index: linux-2.6/init/init_task.c
===================================================================
--- linux-2.6.orig/init/init_task.c
+++ linux-2.6/init/init_task.c
@@ -213,6 +213,9 @@ struct task_struct init_task
#ifdef CONFIG_SECCOMP_FILTER
.seccomp = { .filter_count = ATOMIC_INIT(0) },
#endif
+#ifdef CONFIG_CPU_ISOLATION
+ .isol_info = NULL,
+#endif
};
EXPORT_SYMBOL(init_task);
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -97,6 +97,7 @@
#include <linux/scs.h>
#include <linux/io_uring.h>
#include <linux/bpf.h>
+#include <linux/task_isolation.h>
#include <asm/pgalloc.h>
#include <linux/uaccess.h>
@@ -734,6 +735,7 @@ void __put_task_struct(struct task_struc
WARN_ON(refcount_read(&tsk->usage));
WARN_ON(tsk == current);
+ tsk_isol_exit(tsk);
io_uring_free(tsk);
cgroup_free(tsk);
task_numa_free(tsk, true);
@@ -2084,7 +2086,9 @@ static __latent_entropy struct task_stru
#ifdef CONFIG_BPF_SYSCALL
RCU_INIT_POINTER(p->bpf_storage, NULL);
#endif
-
+#ifdef CONFIG_CPU_ISOLATION
+ p->isol_info = NULL;
+#endif
/* Perform scheduler related setup. Assign this task to a CPU. */
retval = sched_fork(clone_flags, p);
if (retval)
Index: linux-2.6/include/linux/task_isolation.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/task_isolation.h
@@ -0,0 +1,83 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __LINUX_TASK_ISOL_H
+#define __LINUX_TASK_ISOL_H
+
+#ifdef CONFIG_CPU_ISOLATION
+
+struct isol_info {
+ /* Which features are active */
+ unsigned long active_mask;
+ /* Quiesce mask */
+ unsigned long quiesce_mask;
+};
+
+extern void __tsk_isol_exit(struct task_struct *tsk);
+
+static inline void tsk_isol_exit(struct task_struct *tsk)
+{
+ if (tsk->isol_info)
+ __tsk_isol_exit(tsk);
+}
+
+
+int prctl_task_isolation_feat(unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5);
+int prctl_task_isolation_get(unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5);
+int prctl_task_isolation_set(unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5);
+int prctl_task_isolation_ctrl_get(unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5);
+int prctl_task_isolation_ctrl_set(unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5);
+
+#else
+
+static inline void tsk_isol_exit(struct task_struct *tsk)
+{
+}
+
+static inline int prctl_task_isolation_feat(unsigned long arg2,
+ unsigned long arg3,
+ unsigned long arg4,
+ unsigned long arg5)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isolation_get(unsigned long arg2,
+ unsigned long arg3,
+ unsigned long arg4,
+ unsigned long arg5)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isolation_set(unsigned long arg2,
+ unsigned long arg3,
+ unsigned long arg4,
+ unsigned long arg5)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isolation_ctrl_get(unsigned long arg2,
+ unsigned long arg3,
+ unsigned long arg4,
+ unsigned long arg5)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline int prctl_task_isolation_ctrl_set(unsigned long arg2,
+ unsigned long arg3,
+ unsigned long arg4,
+ unsigned long arg5)
+{
+ return -EOPNOTSUPP;
+}
+
+#endif /* CONFIG_CPU_ISOLATION */
+
+#endif /* __LINUX_TASK_ISOL_H */
Index: linux-2.6/kernel/task_isolation.c
===================================================================
--- /dev/null
+++ linux-2.6/kernel/task_isolation.c
@@ -0,0 +1,274 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Implementation of task isolation.
+ *
+ * Authors:
+ * Chris Metcalf <cmetcalf@mellanox.com>
+ * Alex Belits <abelits@belits.com>
+ * Yuri Norov <ynorov@marvell.com>
+ * Marcelo Tosatti <mtosatti@redhat.com>
+ */
+
+#include <linux/sched.h>
+#include <linux/task_isolation.h>
+#include <linux/prctl.h>
+#include <linux/slab.h>
+#include <linux/kobject.h>
+#include <linux/string.h>
+#include <linux/sysfs.h>
+#include <linux/init.h>
+
+static unsigned long default_quiesce_mask;
+
+static int tsk_isol_alloc_context(struct task_struct *task)
+{
+ struct isol_info *info;
+
+ info = kzalloc(sizeof(*info), GFP_KERNEL);
+ if (unlikely(!info))
+ return -ENOMEM;
+
+ task->isol_info = info;
+ return 0;
+}
+
+void __tsk_isol_exit(struct task_struct *tsk)
+{
+ kfree(tsk->isol_info);
+ tsk->isol_info = NULL;
+}
+
+static int prctl_task_isolation_feat_quiesce(unsigned long type)
+{
+ switch (type) {
+ case 0:
+ return ISOL_F_QUIESCE_VMSTATS;
+ case ISOL_F_QUIESCE_DEFMASK:
+ return default_quiesce_mask;
+ default:
+ break;
+ }
+
+ return -EINVAL;
+}
+
+static int task_isolation_get_quiesce(void)
+{
+ if (current->isol_info != NULL)
+ return current->isol_info->quiesce_mask;
+
+ return 0;
+}
+
+static int task_isolation_set_quiesce(unsigned long quiesce_mask)
+{
+ if (quiesce_mask != ISOL_F_QUIESCE_VMSTATS && quiesce_mask != 0)
+ return -EINVAL;
+
+ current->isol_info->quiesce_mask = quiesce_mask;
+ return 0;
+}
+
+int prctl_task_isolation_feat(unsigned long feat, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5)
+{
+ switch (feat) {
+ case 0:
+ return ISOL_F_QUIESCE;
+ case ISOL_F_QUIESCE:
+ return prctl_task_isolation_feat_quiesce(arg3);
+ default:
+ break;
+ }
+ return -EINVAL;
+}
+
+int prctl_task_isolation_get(unsigned long feat, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5)
+{
+ switch (feat) {
+ case ISOL_F_QUIESCE:
+ return task_isolation_get_quiesce();
+ default:
+ break;
+ }
+ return -EINVAL;
+}
+
+int prctl_task_isolation_set(unsigned long feat, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5)
+{
+ int ret;
+ bool err_free_ctx = false;
+
+ if (current->isol_info == NULL)
+ err_free_ctx = true;
+
+ ret = tsk_isol_alloc_context(current);
+ if (ret)
+ return ret;
+
+ switch (feat) {
+ case ISOL_F_QUIESCE:
+ ret = task_isolation_set_quiesce(arg3);
+ if (ret)
+ break;
+ return 0;
+ default:
+ break;
+ }
+
+ if (err_free_ctx)
+ __tsk_isol_exit(current);
+ return -EINVAL;
+}
+
+int prctl_task_isolation_ctrl_set(unsigned long feat, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5)
+{
+ if (current->isol_info == NULL)
+ return -EINVAL;
+
+ if (feat != ISOL_F_QUIESCE && feat != 0)
+ return -EINVAL;
+
+ current->isol_info->active_mask = feat;
+ return 0;
+}
+
+int prctl_task_isolation_ctrl_get(unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5)
+{
+ if (current->isol_info == NULL)
+ return 0;
+
+ return current->isol_info->active_mask;
+}
+
+struct qoptions {
+ unsigned long mask;
+ char *name;
+};
+
+static struct qoptions qopts[] = {
+ {ISOL_F_QUIESCE_VMSTATS, "vmstat"},
+};
+
+#define QLEN (sizeof(qopts) / sizeof(struct qoptions))
+
+static ssize_t default_quiesce_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ char *p, *s;
+ unsigned long defmask = 0;
+
+ s = (char *)buf;
+ if (count == 1 && strlen(strim(s)) == 0) {
+ default_quiesce_mask = 0;
+ return count;
+ }
+
+ while ((p = strsep(&s, ",")) != NULL) {
+ int i;
+ bool found = false;
+
+ if (!*p)
+ continue;
+
+ for (i = 0; i < QLEN; i++) {
+ struct qoptions *opt = &qopts[i];
+
+ if (strncmp(strim(p), opt->name, strlen(opt->name)) == 0) {
+ defmask |= opt->mask;
+ found = true;
+ break;
+ }
+ }
+ if (found == true)
+ continue;
+ return -EINVAL;
+ }
+ default_quiesce_mask = defmask;
+
+ return count;
+}
+
+#define MAXARRLEN 100
+
+static ssize_t default_quiesce_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ int i;
+ char tbuf[MAXARRLEN] = "";
+
+ for (i = 0; i < QLEN; i++) {
+ struct qoptions *opt = &qopts[i];
+
+ if (default_quiesce_mask & opt->mask) {
+ strlcat(tbuf, opt->name, MAXARRLEN);
+ strlcat(tbuf, "\n", MAXARRLEN);
+ }
+ }
+
+ return sprintf(buf, "%s", tbuf);
+}
+
+static struct kobj_attribute default_quiesce_attr =
+ __ATTR_RW(default_quiesce);
+
+static ssize_t available_quiesce_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ int i;
+ char tbuf[MAXARRLEN] = "";
+
+ for (i = 0; i < QLEN; i++) {
+ struct qoptions *opt = &qopts[i];
+
+ strlcat(tbuf, opt->name, MAXARRLEN);
+ strlcat(tbuf, "\n", MAXARRLEN);
+ }
+
+ return sprintf(buf, "%s", tbuf);
+}
+
+static struct kobj_attribute available_quiesce_attr =
+ __ATTR_RO(available_quiesce);
+
+static struct attribute *task_isol_attrs[] = {
+ &available_quiesce_attr.attr,
+ &default_quiesce_attr.attr,
+ NULL,
+};
+
+static const struct attribute_group task_isol_attr_group = {
+ .attrs = task_isol_attrs,
+ .bin_attrs = NULL,
+};
+
+static int __init task_isol_ksysfs_init(void)
+{
+ int ret;
+ struct kobject *task_isol_kobj;
+
+ task_isol_kobj = kobject_create_and_add("task_isolation",
+ kernel_kobj);
+ if (!task_isol_kobj) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ ret = sysfs_create_group(task_isol_kobj, &task_isol_attr_group);
+ if (ret)
+ goto out_task_isol_kobj;
+
+ return 0;
+
+out_task_isol_kobj:
+ kobject_put(task_isol_kobj);
+out:
+ return ret;
+}
+
+arch_initcall(task_isol_ksysfs_init);
Index: linux-2.6/samples/Kconfig
===================================================================
--- linux-2.6.orig/samples/Kconfig
+++ linux-2.6/samples/Kconfig
@@ -223,4 +223,11 @@ config SAMPLE_WATCH_QUEUE
Build example userspace program to use the new mount_notify(),
sb_notify() syscalls and the KEYCTL_WATCH_KEY keyctl() function.
+config SAMPLE_TASK_ISOLATION
+ bool "task isolation sample"
+ depends on CC_CAN_LINK && HEADERS_INSTALL
+ help
+ Build example userspace program to use prctl task isolation
+ interface.
+
endif # SAMPLES
Index: linux-2.6/samples/Makefile
===================================================================
--- linux-2.6.orig/samples/Makefile
+++ linux-2.6/samples/Makefile
@@ -30,3 +30,4 @@ obj-$(CONFIG_SAMPLE_INTEL_MEI) += mei/
subdir-$(CONFIG_SAMPLE_WATCHDOG) += watchdog
subdir-$(CONFIG_SAMPLE_WATCH_QUEUE) += watch_queue
obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak/
+subdir-$(CONFIG_SAMPLE_TASK_ISOLATION) += task_isolation
Index: linux-2.6/samples/task_isolation/Makefile
===================================================================
--- /dev/null
+++ linux-2.6/samples/task_isolation/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
+userprogs-always-y += task_isolation
+
+userccflags += -I usr/include
^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2021-08-02 14:17 UTC | newest]
Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-27 10:38 [patch 0/4] prctl task isolation interface and vmstat sync Marcelo Tosatti
2021-07-27 10:38 ` [patch 1/4] add basic task isolation prctl interface Marcelo Tosatti
2021-07-27 10:48 ` nsaenzju
2021-07-27 11:00 ` Marcelo Tosatti
2021-07-27 12:38 ` nsaenzju
2021-07-27 13:06 ` Marcelo Tosatti
2021-07-27 13:08 ` Marcelo Tosatti
2021-07-27 13:09 ` Frederic Weisbecker
2021-07-27 14:52 ` Marcelo Tosatti
2021-07-27 23:45 ` Frederic Weisbecker
2021-07-28 9:37 ` Marcelo Tosatti
2021-07-28 11:45 ` Frederic Weisbecker
2021-07-28 13:21 ` Marcelo Tosatti
2021-07-28 21:22 ` Frederic Weisbecker
2021-07-28 11:55 ` nsaenzju
2021-07-28 13:16 ` Marcelo Tosatti
[not found] ` <CAFki+LkQwoqVTKmgnwLQQM8ua-ixbLp8i+jUT6xF15k6X=89mw@mail.gmail.com>
2021-07-28 16:21 ` Marcelo Tosatti
2021-07-28 17:08 ` nsaenzju
[not found] ` <CAFki+LmHeXmSFze8YEHFNbYA5hLEtnZyk37Yjf-eyOuKa8Os4w@mail.gmail.com>
2021-07-28 16:17 ` Marcelo Tosatti
2021-07-27 10:38 ` [patch 2/4] task isolation: sync vmstats on return to userspace Marcelo Tosatti
2021-07-27 10:38 ` [patch 3/4] mm: vmstat: move need_update Marcelo Tosatti
2021-07-27 10:38 ` [patch 4/4] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
2021-07-30 20:18 [patch 0/4] extensible prctl task isolation interface and vmstat sync (v2) Marcelo Tosatti
2021-07-30 20:18 ` [patch 1/4] add basic task isolation prctl interface Marcelo Tosatti
[not found] ` <CAFki+Lnf0cs62Se0aPubzYxP9wh7xjMXn7RXEPvrmtBdYBrsow@mail.gmail.com>
2021-07-31 0:49 ` Marcelo Tosatti
2021-07-31 7:47 ` kernel test robot
[not found] ` <CAFki+LkQVQOe+5aNEKWDvLdnjWjxzKWOiqOvBZzeuPWX+G=XgA@mail.gmail.com>
2021-08-02 14:16 ` Marcelo Tosatti
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).