LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [RESEND RFC PATCH 0/3] Provide fast access to thread specific data
@ 2021-09-09  0:23 Prakash Sangappa
  2021-09-09  0:23 ` [RESEND RFC PATCH 1/3] Introduce per thread user-kernel shared structure Prakash Sangappa
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Prakash Sangappa @ 2021-09-09  0:23 UTC (permalink / raw)
  To: linux-kernel, linux-api; +Cc: mingo, prakash.sangappa

Including liunx-kernel..

Resending RFC. This patchset is not final. I am looking for feedback on
this proposal to share thread specific data for us in latency sensitive
codepath.

(patchset based on v5.14-rc7)

Cover letter previously sent:
----------------------------

Some applications, like a Databases require reading thread specific stats
frequently from the kernel in latency sensitive codepath. The overhead of
reading stats from kernel using system call affects performance.
One use case is reading thread's scheduler stats from /proc schedstat file
(/proc/pid/schedstat) to collect time spent by a thread executing on the
cpu(sum_exec_runtime), time blocked waiting on runq(run_delay). These
scheduler stats, read several times per transaction in latency-sensitive
codepath, are used to measure time taken by DB operations.

This patch proposes to introduce a mechanism for kernel to share thread
stats thru a per thread shared structure shared between userspace and
kernel. The per thread shared structure is allocated on a page shared
mapped between user space and kernel, which will provide a way for fast
communication between user and kernel. Kernel publishes stats in this
shared structure. Application thread can read from it in user space
without requiring system calls.

Similarly, there can be other use cases for such shared structure
mechanism.

Introduce 'off cpu' time:

The time spent executing on a cpu(sum_exec_runtime) by a thread,
currently available thru thread's schedstat file, can be shared thru
the shared structure mentioned above. However, when a thread is running 
on the cpu, this time gets updated periodically, can take upto 1ms or
more as part of scheduler tick processing. If the application has to 
measure cpu time consumed across some DB operations, using
'sum_exec_runtime' will not be accurate. To address this the proposal
is to introduce a thread's 'off cpu' time, which is measured at context
switch, similar to time on runq(ie run_delay in schedstat file) is and
should be more accurate. With that the application can determine cpu time
consumed by taking the elapsed time and subtracting off cpu time. The
off cpu time will be made available thru the shared structure along with
the other schedstats from /proc/pid/schedstat file.

The elapsed time itself can be measured using clock_gettime, which is
vdso optimized and would be fast. The schedstats(runq time & off cpu time)
published in the shared structure will be accumulated time, same as what
is available thru schedstat file, all in units of nanoseconds. The
application would take the difference of the values from before and after
the operation for measurement.

Preliminary results from a simple cached read Database workload shows
performance benefit, when the database uses shared struct for reading
stats vs reading from /proc directly.

Implementation:

A new system call is added to request use of shared structure by a user
thread. Kernel will allocate page(s), shared mapped with user space in
which per-thread shared structures will be allocated. These structures
are padded to 128 bytes. This will contain struct members or nested
structures corresponding to supported stats, like the thread's schedstats,
published by the kernel for user space consumption. More struct members
can be added as new feature support is implemented. Multiple such shared
structures will be allocated from a page(upto 32 per 4k page) and avoid
having to allocate one page per thread of a process. Although, will need
optimizing for locality. Additional pages will be allocated as needed to
accommodate more threads requesting use of shared structures. Aim is to
not expose the layout of the shared structure itself to the application,
which will allow future enhancements/changes without affecting the
existing APIs.

The system call will return a pointer(user space mapped address) to the per
thread shared structure members. Application would save this per thread
pointer in a TLS variable and reference it.

The system call is of the form.
int task_getshared(int option, int flags, void __user *uaddr)

// Currently only TASK_SCHEDSTAT option is supported - returns pointer
// to struct task_schedstat. The struct task_schedstat is nested within
// the shared structure.

struct task_schedstat {
        volatile u64    sum_exec_runtime;
        volatile u64    run_delay;
        volatile u64    pcount;
        volatile u64    off_cpu;
};

Usage:

__thread struct task_schedstat *ts;
task_getshared(TASK_SCHEDSTAT, 0, &ts);

Subsequently the stats are accessed using the 'ts' pointer by the thread

Prakash Sangappa (3):
  Introduce per thread user-kernel shared structure
  Publish tasks's scheduler stats thru the shared structure
  Introduce task's 'off cpu' time

 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/mm_types.h               |   2 +
 include/linux/sched.h                  |   9 +
 include/linux/syscalls.h               |   2 +
 include/linux/task_shared.h            |  92 ++++++++++
 include/uapi/asm-generic/unistd.h      |   5 +-
 include/uapi/linux/task_shared.h       |  23 +++
 kernel/fork.c                          |   7 +
 kernel/sched/deadline.c                |   1 +
 kernel/sched/fair.c                    |   1 +
 kernel/sched/rt.c                      |   1 +
 kernel/sched/sched.h                   |   1 +
 kernel/sched/stats.h                   |  55 ++++--
 kernel/sched/stop_task.c               |   1 +
 kernel/sys_ni.c                        |   3 +
 mm/Makefile                            |   2 +-
 mm/task_shared.c                       | 314 +++++++++++++++++++++++++++++++++
 18 files changed, 501 insertions(+), 20 deletions(-)
 create mode 100644 include/linux/task_shared.h
 create mode 100644 include/uapi/linux/task_shared.h
 create mode 100644 mm/task_shared.c

-- 
2.7.4


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RESEND RFC PATCH 1/3] Introduce per thread user-kernel shared structure
  2021-09-09  0:23 [RESEND RFC PATCH 0/3] Provide fast access to thread specific data Prakash Sangappa
@ 2021-09-09  0:23 ` Prakash Sangappa
  2021-09-09  1:50   ` Jann Horn
  2021-09-09  0:23 ` [RESEND RFC PATCH 2/3] Publish tasks's scheduler stats thru the " Prakash Sangappa
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 19+ messages in thread
From: Prakash Sangappa @ 2021-09-09  0:23 UTC (permalink / raw)
  To: linux-kernel, linux-api; +Cc: mingo, prakash.sangappa

A structure per thread is allocated from a page that is shared mapped
between user space and kernel as means for faster communication. This will
facilitate sharing information, Ex: per thread stats shared between kernel
and user space, that can be read by applications without the need for
making frequent system calls in latency sensitive code path.

A new system call is added, which will allocate the shared structure and
return its mapped user address. Multiple such structures will be allocated
on a page to accommodate requests from different threads of a multithreaded
process. Available space on a page is managed using a bitmap. When a thread
exits, the shared structure is freed and can get reused for another thread
that requests the shared structure. More pages will be allocated and used
as needed based on the number of threads requesting use of shared
structures. These pages are all freed when the process exits.

Each of these shared structures are rounded to 128 bytes. Available space
in this structure can be used to accommodate additional per thread stats,
state etc as needed. In future, if more space beyond 128 bytes, is
needed, multiple such shared structures per thread could be allocated and
managed by the kernel. Although, space in shared structure for sharing any
kind of stats or state should be sparingly used. Therefore shared structure
layout is not exposed to user space. the system call will return the
mapped user address of a specific member or nested structure within the
shared structure corresponding to stats requested, This would allow future
enhancements/changes without breaking the API.

Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/mm_types.h               |   2 +
 include/linux/sched.h                  |   3 +
 include/linux/syscalls.h               |   2 +
 include/linux/task_shared.h            |  57 +++++++
 include/uapi/asm-generic/unistd.h      |   5 +-
 kernel/fork.c                          |   7 +
 kernel/sys_ni.c                        |   3 +
 mm/Makefile                            |   2 +-
 mm/task_shared.c                       | 301 +++++++++++++++++++++++++++++++++
 11 files changed, 382 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/task_shared.h
 create mode 100644 mm/task_shared.c

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ce763a1..a194581 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -452,3 +452,4 @@
 445	i386	landlock_add_rule	sys_landlock_add_rule
 446	i386	landlock_restrict_self	sys_landlock_restrict_self
 447	i386	memfd_secret		sys_memfd_secret
+448	i386	task_getshared		sys_task_getshared
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index f6b5779..9dda907 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -369,6 +369,7 @@
 445	common	landlock_add_rule	sys_landlock_add_rule
 446	common	landlock_restrict_self	sys_landlock_restrict_self
 447	common	memfd_secret		sys_memfd_secret
+448	common	task_getshared		sys_task_getshared
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 52bbd2b..5ec26ed 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -572,6 +572,8 @@ struct mm_struct {
 #ifdef CONFIG_IOMMU_SUPPORT
 		u32 pasid;
 #endif
+		/* user shared pages */
+		void *usharedpg;
 	} __randomize_layout;
 
 	/*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ec8d07d..237aa21 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1400,6 +1400,9 @@ struct task_struct {
 	struct llist_head               kretprobe_instances;
 #endif
 
+	/* user shared struct */
+	void *task_ushrd;
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 69c9a70..09680b7 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1052,6 +1052,8 @@ asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type ru
 asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
 asmlinkage long sys_memfd_secret(unsigned int flags);
 
+asmlinkage long sys_task_getshared(long opt, long flags, void __user *uaddr);
+
 /*
  * Architecture-specific system calls
  */
diff --git a/include/linux/task_shared.h b/include/linux/task_shared.h
new file mode 100644
index 0000000..de17849
--- /dev/null
+++ b/include/linux/task_shared.h
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef	__TASK_SHARED_H__
+#define	__TASK_SHARED_H__
+
+#include <linux/mm_types.h>
+
+/*
+ * Track user-kernel shared pages referred by mm_struct
+ */
+struct ushared_pages {
+	struct list_head plist;
+	struct list_head frlist;
+	unsigned long pcount;
+};
+
+/*
+ * Following is the per task struct shared with kernel for
+ * fast communication.
+ */
+struct task_ushared {
+	long version;
+};
+
+/*
+ * Following is used for cacheline aligned allocations in a page.
+ */
+union  task_shared {
+	struct task_ushared tu;
+	char    s[128];
+};
+
+/*
+ * Struct to track per page slots
+ */
+struct ushared_pg {
+	struct list_head list;
+	struct list_head fr_list;
+	struct page *pages[2];
+	u64 bitmap; /* free slots */
+	int slot_count;
+	unsigned long kaddr;
+	unsigned long vaddr; /* user address */
+	struct vm_special_mapping ushrd_mapping;
+};
+
+/*
+ * Following struct is referred by tast_struct
+ */
+struct task_ushrd_struct {
+	struct task_ushared *kaddr; /* kernel address */
+	struct task_ushared *uaddr; /* user address */
+	struct ushared_pg *upg;
+};
+
+extern void task_ushared_free(struct task_struct *t);
+extern void mm_ushared_clear(struct mm_struct *mm);
+#endif /* __TASK_SHARED_H__ */
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a9d6fcd..7c985b1 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -878,8 +878,11 @@ __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
 __SYSCALL(__NR_memfd_secret, sys_memfd_secret)
 #endif
 
+#define __NR_task_getshared 448
+__SYSCALL(__NR_task_getshared, sys_task_getshared)
+
 #undef __NR_syscalls
-#define __NR_syscalls 448
+#define __NR_syscalls 449
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/fork.c b/kernel/fork.c
index bc94b2c..f84bac0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -97,6 +97,7 @@
 #include <linux/scs.h>
 #include <linux/io_uring.h>
 #include <linux/bpf.h>
+#include <linux/task_shared.h>
 
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -903,6 +904,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (err)
 		goto free_stack;
 
+	/* task's ushared struct not inherited across fork */
+	tsk->task_ushrd =  NULL;
+
 #ifdef CONFIG_SECCOMP
 	/*
 	 * We must handle setting up seccomp filters once we're under
@@ -1049,6 +1053,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	mm->pmd_huge_pte = NULL;
 #endif
+	mm->usharedpg = NULL;
 	mm_init_uprobes_state(mm);
 
 	if (current->mm) {
@@ -1099,6 +1104,7 @@ static inline void __mmput(struct mm_struct *mm)
 	ksm_exit(mm);
 	khugepaged_exit(mm); /* must run before exit_mmap */
 	exit_mmap(mm);
+	mm_ushared_clear(mm);
 	mm_put_huge_zero_page(mm);
 	set_mm_exe_file(mm, NULL);
 	if (!list_empty(&mm->mmlist)) {
@@ -1308,6 +1314,7 @@ static int wait_for_vfork_done(struct task_struct *child,
 static void mm_release(struct task_struct *tsk, struct mm_struct *mm)
 {
 	uprobe_free_utask(tsk);
+	task_ushared_free(tsk);
 
 	/* Get rid of any cached register state */
 	deactivate_mm(tsk, mm);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 30971b1..8fbdc55 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -481,3 +481,6 @@ COND_SYSCALL(setuid16);
 
 /* restartable sequence */
 COND_SYSCALL(rseq);
+
+/* task shared */
+COND_SYSCALL(task_getshared);
diff --git a/mm/Makefile b/mm/Makefile
index e343674..03f88fe 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -52,7 +52,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   mm_init.o percpu.o slab_common.o \
 			   compaction.o vmacache.o \
 			   interval_tree.o list_lru.o workingset.o \
-			   debug.o gup.o mmap_lock.o $(mmu-y)
+			   debug.o gup.o mmap_lock.o task_shared.o $(mmu-y)
 
 # Give 'page_alloc' its own module-parameter namespace
 page-alloc-y := page_alloc.o
diff --git a/mm/task_shared.c b/mm/task_shared.c
new file mode 100644
index 0000000..3ec5eb6
--- /dev/null
+++ b/mm/task_shared.c
@@ -0,0 +1,301 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/mm.h>
+#include <linux/uio.h>
+#include <linux/sched.h>
+#include <linux/sched/mm.h>
+#include <linux/highmem.h>
+#include <linux/ptrace.h>
+#include <linux/slab.h>
+#include <linux/syscalls.h>
+#include <linux/task_shared.h>
+
+/* Shared page */
+
+#define TASK_USHARED_SLOTS (PAGE_SIZE/sizeof(union task_shared))
+
+/*
+ * Called once to init struct ushared_pages pointer.
+ */
+static int init_mm_ushared(struct mm_struct *mm)
+{
+	struct ushared_pages *usharedpg;
+
+	usharedpg = kmalloc(sizeof(struct ushared_pages), GFP_KERNEL);
+	if (usharedpg == NULL)
+		return 1;
+
+	INIT_LIST_HEAD(&usharedpg->plist);
+	INIT_LIST_HEAD(&usharedpg->frlist);
+	usharedpg->pcount = 0;
+	mmap_write_lock(mm);
+	if (mm->usharedpg == NULL) {
+		mm->usharedpg = usharedpg;
+		usharedpg = NULL;
+	}
+	mmap_write_unlock(mm);
+	if (usharedpg != NULL)
+		kfree(usharedpg);
+	return 0;
+}
+
+static int init_task_ushrd(struct task_struct *t)
+{
+	struct task_ushrd_struct *ushrd;
+
+	ushrd = kzalloc(sizeof(struct task_ushrd_struct), GFP_KERNEL);
+	if (ushrd == NULL)
+		return 1;
+
+	mmap_write_lock(t->mm);
+	if (t->task_ushrd == NULL) {
+		t->task_ushrd = ushrd;
+		ushrd = NULL;
+	}
+	mmap_write_unlock(t->mm);
+	if (ushrd != NULL)
+		kfree(ushrd);
+	return 0;
+}
+
+/*
+ * Called from __mmput(), mm is going away
+ */
+void mm_ushared_clear(struct mm_struct *mm)
+{
+	struct ushared_pg *upg;
+	struct ushared_pg *tmp;
+	struct ushared_pages *usharedpg;
+
+	if (mm == NULL || mm->usharedpg == NULL)
+		return;
+
+	usharedpg = mm->usharedpg;
+	if (list_empty(&usharedpg->frlist))
+		goto out;
+
+	list_for_each_entry_safe(upg, tmp, &usharedpg->frlist, fr_list) {
+		list_del(&upg->fr_list);
+		put_page(upg->pages[0]);
+		kfree(upg);
+	}
+out:
+	kfree(mm->usharedpg);
+	mm->usharedpg = NULL;
+
+}
+
+void task_ushared_free(struct task_struct *t)
+{
+	struct task_ushrd_struct *ushrd = t->task_ushrd;
+	struct mm_struct *mm = t->mm;
+	struct ushared_pages *usharedpg;
+	int slot;
+
+	if (mm == NULL || mm->usharedpg == NULL || ushrd == NULL)
+		return;
+
+	usharedpg = mm->usharedpg;
+	mmap_write_lock(mm);
+
+	if (ushrd->upg == NULL)
+		goto out;
+
+	slot = (unsigned long)((unsigned long)ushrd->uaddr
+		 - ushrd->upg->vaddr) / sizeof(union task_shared);
+	clear_bit(slot, (unsigned long *)(&ushrd->upg->bitmap));
+
+	/* move to head */
+	if (ushrd->upg->slot_count == 0) {
+		list_del(&ushrd->upg->fr_list);
+		list_add(&ushrd->upg->fr_list, &usharedpg->frlist);
+	}
+
+	ushrd->upg->slot_count++;
+
+	ushrd->uaddr = ushrd->kaddr = NULL;
+	ushrd->upg = NULL;
+
+out:
+	t->task_ushrd = NULL;
+	mmap_write_unlock(mm);
+	kfree(ushrd);
+}
+
+/* map shared page */
+static int task_shared_add_vma(struct ushared_pg *pg)
+{
+	struct vm_area_struct *vma;
+	struct mm_struct *mm =  current->mm;
+	unsigned long ret = 1;
+
+
+	if (!pg->vaddr) {
+		/* Try to map as high as possible, this is only a hint. */
+		pg->vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE,
+					PAGE_SIZE, 0, 0);
+		if (pg->vaddr & ~PAGE_MASK) {
+			ret = 0;
+			goto fail;
+		}
+	}
+
+	vma = _install_special_mapping(mm, pg->vaddr, PAGE_SIZE,
+			VM_SHARED|VM_READ|VM_MAYREAD|VM_DONTCOPY,
+			&pg->ushrd_mapping);
+	if (IS_ERR(vma)) {
+		ret = 0;
+		pg->vaddr = 0;
+		goto fail;
+	}
+
+	pg->kaddr = (unsigned long)page_address(pg->pages[0]);
+fail:
+	return ret;
+}
+
+/*
+ * Allocate a page, map user address and add to freelist
+ */
+static struct ushared_pg *ushared_allocpg(void)
+{
+
+	struct ushared_pg *pg;
+	struct mm_struct *mm = current->mm;
+	struct ushared_pages *usharedpg = mm->usharedpg;
+
+	if (usharedpg == NULL)
+		return NULL;
+	pg = kzalloc(sizeof(*pg), GFP_KERNEL);
+
+	if (unlikely(!pg))
+		return NULL;
+	pg->ushrd_mapping.name = "[task_shared]";
+	pg->ushrd_mapping.fault = NULL;
+	pg->ushrd_mapping.pages = pg->pages;
+	pg->pages[0] = alloc_page(GFP_KERNEL);
+	if (!pg->pages[0])
+		goto out;
+	pg->pages[1] = NULL;
+	pg->bitmap = 0;
+
+	/*
+	 * page size should be 4096 or 8192
+	 */
+	pg->slot_count = TASK_USHARED_SLOTS;
+
+	mmap_write_lock(mm);
+	if (task_shared_add_vma(pg)) {
+		list_add(&pg->fr_list, &usharedpg->frlist);
+		usharedpg->pcount++;
+		mmap_write_unlock(mm);
+		return pg;
+	}
+	mmap_write_unlock(mm);
+
+out:
+	__free_page(pg->pages[0]);
+	kfree(pg);
+	return NULL;
+}
+
+
+/*
+ * Allocate task_ushared struct for calling thread.
+ */
+static int task_ushared_alloc(void)
+{
+	struct mm_struct *mm = current->mm;
+	struct ushared_pg *ent = NULL;
+	struct task_ushrd_struct *ushrd;
+	struct ushared_pages *usharedpg;
+	int tryalloc = 0;
+	int slot = -1;
+	int ret = -ENOMEM;
+
+	if (mm->usharedpg == NULL && init_mm_ushared(mm))
+		return ret;
+
+	if (current->task_ushrd == NULL && init_task_ushrd(current))
+		return ret;
+
+	usharedpg = mm->usharedpg;
+	ushrd = current->task_ushrd;
+repeat:
+	if (mmap_write_lock_killable(mm))
+		return -EINTR;
+
+	ent = list_empty(&usharedpg->frlist) ? NULL :
+		list_entry(usharedpg->frlist.next,
+		struct ushared_pg, fr_list);
+
+	if (ent == NULL || ent->slot_count == 0) {
+		if (tryalloc == 0) {
+			mmap_write_unlock(mm);
+			(void)ushared_allocpg();
+			tryalloc = 1;
+			goto repeat;
+		} else {
+			ent = NULL;
+		}
+	}
+
+	if (ent) {
+		slot = find_first_zero_bit((unsigned long *)(&ent->bitmap),
+		  TASK_USHARED_SLOTS);
+		BUG_ON(slot >=  TASK_USHARED_SLOTS);
+
+		set_bit(slot, (unsigned long *)(&ent->bitmap));
+
+		ushrd->uaddr = (struct task_ushared *)(ent->vaddr +
+		  (slot * sizeof(union task_shared)));
+		ushrd->kaddr = (struct task_ushared *)(ent->kaddr +
+		  (slot * sizeof(union task_shared)));
+		ushrd->upg = ent;
+		ent->slot_count--;
+		/* move it to tail */
+		if (ent->slot_count == 0) {
+			list_del(&ent->fr_list);
+			list_add_tail(&ent->fr_list, &usharedpg->frlist);
+		}
+
+	       ret = 0;
+	}
+
+out:
+	mmap_write_unlock(mm);
+	return ret;
+}
+
+
+/*
+ * Task Shared : allocate if needed, and return address of shared struct for
+ * this thread/task.
+ */
+static long task_getshared(u64 opt, u64 flags, void __user *uaddr)
+{
+	struct task_ushrd_struct *ushrd = current->task_ushrd;
+
+	/* We have address, return. */
+	if (ushrd != NULL && ushrd->upg != NULL) {
+		if (copy_to_user(uaddr, &ushrd->uaddr,
+			sizeof(struct task_ushared *)))
+			return (-EFAULT);
+		return 0;
+	}
+
+	task_ushared_alloc();
+	ushrd = current->task_ushrd;
+	if (ushrd != NULL && ushrd->upg != NULL) {
+		if (copy_to_user(uaddr, &ushrd->uaddr,
+			sizeof(struct task_ushared *)))
+			return (-EFAULT);
+		return 0;
+	}
+	return (-ENOMEM);
+}
+
+
+SYSCALL_DEFINE3(task_getshared, u64, opt, u64, flags, void __user *, uaddr)
+{
+	return task_getshared(opt, flags, uaddr);
+}
-- 
2.7.4


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RESEND RFC PATCH 2/3] Publish tasks's scheduler stats thru the shared structure
  2021-09-09  0:23 [RESEND RFC PATCH 0/3] Provide fast access to thread specific data Prakash Sangappa
  2021-09-09  0:23 ` [RESEND RFC PATCH 1/3] Introduce per thread user-kernel shared structure Prakash Sangappa
@ 2021-09-09  0:23 ` Prakash Sangappa
  2021-09-09  0:23 ` [RESEND RFC PATCH 3/3] Introduce task's 'off cpu' time Prakash Sangappa
       [not found] ` <CAFTs51VDUPWu=r9d=ThABc-Z6wCwTOC+jKDCq=Jk8Pfid61xyQ@mail.gmail.com>
  3 siblings, 0 replies; 19+ messages in thread
From: Prakash Sangappa @ 2021-09-09  0:23 UTC (permalink / raw)
  To: linux-kernel, linux-api; +Cc: mingo, prakash.sangappa

Define a 'struct task_schedstat' which contains members corresponding to
scheduler stats that are currently available thru /proc/pid/tasks/pid
/schedstats. Update scheduler stats in this structure in kernel at the
same time stats in the 'struct task_struct' are updated. Add a
TASK_SCHEDSTAT option to task_getshared system call to request these per
thread scheduler stats thru the shared structure.

Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
 include/linux/task_shared.h      | 35 ++++++++++++++++++++++++++++++++++-
 include/uapi/linux/task_shared.h | 22 ++++++++++++++++++++++
 kernel/sched/deadline.c          |  1 +
 kernel/sched/fair.c              |  1 +
 kernel/sched/rt.c                |  1 +
 kernel/sched/sched.h             |  1 +
 kernel/sched/stats.h             |  3 +++
 kernel/sched/stop_task.c         |  1 +
 mm/task_shared.c                 | 13 +++++++++++++
 9 files changed, 77 insertions(+), 1 deletion(-)
 create mode 100644 include/uapi/linux/task_shared.h

diff --git a/include/linux/task_shared.h b/include/linux/task_shared.h
index de17849..62793e4 100644
--- a/include/linux/task_shared.h
+++ b/include/linux/task_shared.h
@@ -3,6 +3,7 @@
 #define	__TASK_SHARED_H__
 
 #include <linux/mm_types.h>
+#include <uapi/linux/task_shared.h>
 
 /*
  * Track user-kernel shared pages referred by mm_struct
@@ -18,7 +19,7 @@ struct ushared_pages {
  * fast communication.
  */
 struct task_ushared {
-	long version;
+	struct task_schedstat ts;
 };
 
 /*
@@ -52,6 +53,38 @@ struct task_ushrd_struct {
 	struct ushared_pg *upg;
 };
 
+
+#ifdef CONFIG_SCHED_INFO
+
+#define task_update_exec_runtime(t)					\
+	do {								\
+		struct task_ushrd_struct *shrdp = t->task_ushrd;	\
+		if (shrdp != NULL && shrdp->kaddr != NULL)		\
+			shrdp->kaddr->ts.sum_exec_runtime =		\
+				 t->se.sum_exec_runtime;		\
+	} while (0)
+
+#define task_update_runq_stat(t, p)					\
+	do {								\
+		struct task_ushrd_struct *shrdp = t->task_ushrd;	\
+		if (shrdp != NULL && shrdp->kaddr != NULL) {		\
+			shrdp->kaddr->ts.run_delay =			\
+				 t->sched_info.run_delay;		\
+			if (p) {					\
+				shrdp->kaddr->ts.pcount =		\
+					 t->sched_info.pcount;		\
+			}						\
+		}							\
+	} while (0)
+#else
+
+#define task_update_exec_runtime(t)	do { } while (0)
+#define task_update_runq_stat(t, p)	do { } while (0)
+
+#endif
+
+
+
 extern void task_ushared_free(struct task_struct *t);
 extern void mm_ushared_clear(struct mm_struct *mm);
 #endif /* __TASK_SHARED_H__ */
diff --git a/include/uapi/linux/task_shared.h b/include/uapi/linux/task_shared.h
new file mode 100644
index 0000000..06a8522
--- /dev/null
+++ b/include/uapi/linux/task_shared.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef LINUX_TASK_SHARED_H
+#define LINUX_TASK_SHARED_H
+
+/*
+ * Per task user-kernel mapped structure for faster communication.
+ */
+
+/*
+ * Following is the option to request struct task_schedstats shared structure,
+ * in which kernel shares the task's exec time and time on run queue & number
+ * of times it was scheduled to run on a cpu. Requires kernel with
+ * CONFIG_SCHED_INFO enabled.
+ */
+#define TASK_SCHEDSTAT 1
+
+struct task_schedstat {
+	volatile u64	sum_exec_runtime;
+	volatile u64	run_delay;
+	volatile u64	pcount;
+};
+#endif
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index aaacd6c..189c74c 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1270,6 +1270,7 @@ static void update_curr_dl(struct rq *rq)
 
 	curr->se.sum_exec_runtime += delta_exec;
 	account_group_exec_runtime(curr, delta_exec);
+	task_update_exec_runtime(curr);
 
 	curr->se.exec_start = now;
 	cgroup_account_cputime(curr, delta_exec);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 44c4520..cbd182b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -817,6 +817,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	if (entity_is_task(curr)) {
 		struct task_struct *curtask = task_of(curr);
 
+		task_update_exec_runtime(curtask);
 		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
 		cgroup_account_cputime(curtask, delta_exec);
 		account_group_exec_runtime(curtask, delta_exec);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 3daf42a..61082fc 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1014,6 +1014,7 @@ static void update_curr_rt(struct rq *rq)
 
 	curr->se.sum_exec_runtime += delta_exec;
 	account_group_exec_runtime(curr, delta_exec);
+	task_update_exec_runtime(curr);
 
 	curr->se.exec_start = now;
 	cgroup_account_cputime(curr, delta_exec);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 14a41a2..4ebbd8f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -67,6 +67,7 @@
 #include <linux/syscalls.h>
 #include <linux/task_work.h>
 #include <linux/tsacct_kern.h>
+#include <linux/task_shared.h>
 
 #include <asm/tlb.h>
 
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index d8f8eb0..6b2d69c 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
 #ifdef CONFIG_SCHEDSTATS
+#include <linux/task_shared.h>
 
 /*
  * Expects runqueue lock to be held for atomicity of update
@@ -166,6 +167,7 @@ static inline void sched_info_dequeue(struct rq *rq, struct task_struct *t)
 	delta = rq_clock(rq) - t->sched_info.last_queued;
 	t->sched_info.last_queued = 0;
 	t->sched_info.run_delay += delta;
+	task_update_runq_stat(t, 0);
 
 	rq_sched_info_dequeue(rq, delta);
 }
@@ -188,6 +190,7 @@ static void sched_info_arrive(struct rq *rq, struct task_struct *t)
 	t->sched_info.run_delay += delta;
 	t->sched_info.last_arrival = now;
 	t->sched_info.pcount++;
+	task_update_runq_stat(t, 1);
 
 	rq_sched_info_arrive(rq, delta);
 }
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index f988ebe..7b9b60f 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -82,6 +82,7 @@ static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
 			max(curr->se.statistics.exec_max, delta_exec));
 
 	curr->se.sum_exec_runtime += delta_exec;
+	task_update_exec_runtime(curr);
 	account_group_exec_runtime(curr, delta_exec);
 
 	curr->se.exec_start = rq_clock_task(rq);
diff --git a/mm/task_shared.c b/mm/task_shared.c
index 3ec5eb6..7169ccd 100644
--- a/mm/task_shared.c
+++ b/mm/task_shared.c
@@ -275,6 +275,14 @@ static long task_getshared(u64 opt, u64 flags, void __user *uaddr)
 {
 	struct task_ushrd_struct *ushrd = current->task_ushrd;
 
+	/* Currently only TASK_SCHEDSTAT supported */
+#ifdef CONFIG_SCHED_INFO
+	if (opt != TASK_SCHEDSTAT)
+		return (-EINVAL);
+#else
+	return (-EOPNOTSUPP);
+#endif
+
 	/* We have address, return. */
 	if (ushrd != NULL && ushrd->upg != NULL) {
 		if (copy_to_user(uaddr, &ushrd->uaddr,
@@ -286,6 +294,11 @@ static long task_getshared(u64 opt, u64 flags, void __user *uaddr)
 	task_ushared_alloc();
 	ushrd = current->task_ushrd;
 	if (ushrd != NULL && ushrd->upg != NULL) {
+		if (opt == TASK_SCHEDSTAT) {
+			/* init current values */
+			task_update_exec_runtime(current);
+			task_update_runq_stat(current, 1);
+		}
 		if (copy_to_user(uaddr, &ushrd->uaddr,
 			sizeof(struct task_ushared *)))
 			return (-EFAULT);
-- 
2.7.4


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RESEND RFC PATCH 3/3] Introduce task's 'off cpu' time
  2021-09-09  0:23 [RESEND RFC PATCH 0/3] Provide fast access to thread specific data Prakash Sangappa
  2021-09-09  0:23 ` [RESEND RFC PATCH 1/3] Introduce per thread user-kernel shared structure Prakash Sangappa
  2021-09-09  0:23 ` [RESEND RFC PATCH 2/3] Publish tasks's scheduler stats thru the " Prakash Sangappa
@ 2021-09-09  0:23 ` Prakash Sangappa
       [not found] ` <CAFTs51VDUPWu=r9d=ThABc-Z6wCwTOC+jKDCq=Jk8Pfid61xyQ@mail.gmail.com>
  3 siblings, 0 replies; 19+ messages in thread
From: Prakash Sangappa @ 2021-09-09  0:23 UTC (permalink / raw)
  To: linux-kernel, linux-api; +Cc: mingo, prakash.sangappa

Add a task's 'off cpu' time in nanoseconds to sched_info, that represents
accumulated time spent either on run queue or blocked in the kernel.
Publish the off cpu time thru the shared structure. This will be used by
an application to determine cpu time consumed(time executing on a cpu) as
accurately as possible, by taking elapsed time and subtracting off cpu
time.

Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
 include/linux/sched.h            |  6 +++++
 include/linux/task_shared.h      |  2 ++
 include/uapi/linux/task_shared.h |  1 +
 kernel/sched/stats.h             | 56 ++++++++++++++++++++++++++--------------
 4 files changed, 45 insertions(+), 20 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 237aa21..a63e447 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -311,6 +311,12 @@ struct sched_info {
 	/* When were we last queued to run? */
 	unsigned long long		last_queued;
 
+	/* When did we last leave cpu */
+	unsigned long long		last_depart;
+
+	/* Time spent off cpu */
+	unsigned long long		off_cpu;
+
 #endif /* CONFIG_SCHED_INFO */
 };
 
diff --git a/include/linux/task_shared.h b/include/linux/task_shared.h
index 62793e4..ce475c4 100644
--- a/include/linux/task_shared.h
+++ b/include/linux/task_shared.h
@@ -70,6 +70,8 @@ struct task_ushrd_struct {
 		if (shrdp != NULL && shrdp->kaddr != NULL) {		\
 			shrdp->kaddr->ts.run_delay =			\
 				 t->sched_info.run_delay;		\
+			shrdp->kaddr->ts.off_cpu =			\
+				 t->sched_info.off_cpu;			\
 			if (p) {					\
 				shrdp->kaddr->ts.pcount =		\
 					 t->sched_info.pcount;		\
diff --git a/include/uapi/linux/task_shared.h b/include/uapi/linux/task_shared.h
index 06a8522..c867c09 100644
--- a/include/uapi/linux/task_shared.h
+++ b/include/uapi/linux/task_shared.h
@@ -18,5 +18,6 @@ struct task_schedstat {
 	volatile u64	sum_exec_runtime;
 	volatile u64	run_delay;
 	volatile u64	pcount;
+	volatile u64	off_cpu;
 };
 #endif
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 6b2d69c..ee59994 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -159,17 +159,24 @@ static inline void psi_sched_switch(struct task_struct *prev,
  */
 static inline void sched_info_dequeue(struct rq *rq, struct task_struct *t)
 {
-	unsigned long long delta = 0;
+	unsigned long long now = rq_clock(rq), delta = 0, ddelta = 0;
 
-	if (!t->sched_info.last_queued)
-		return;
+	if (t->sched_info.last_queued) {
+		delta = now - t->sched_info.last_queued;
+		t->sched_info.last_queued = 0;
+		t->sched_info.run_delay += delta;
+
+		rq_sched_info_dequeue(rq, delta);
+	}
 
-	delta = rq_clock(rq) - t->sched_info.last_queued;
-	t->sched_info.last_queued = 0;
-	t->sched_info.run_delay += delta;
-	task_update_runq_stat(t, 0);
+	if (t->sched_info.last_depart) {
+		ddelta = now - t->sched_info.last_depart;
+		t->sched_info.last_depart = 0;
+		t->sched_info.off_cpu += ddelta;
+	}
 
-	rq_sched_info_dequeue(rq, delta);
+	if (delta || ddelta)
+		task_update_runq_stat(t, 0);
 }
 
 /*
@@ -179,20 +186,27 @@ static inline void sched_info_dequeue(struct rq *rq, struct task_struct *t)
  */
 static void sched_info_arrive(struct rq *rq, struct task_struct *t)
 {
-	unsigned long long now, delta = 0;
+	unsigned long long now = rq_clock(rq), delta = 0, ddelta = 0;
 
-	if (!t->sched_info.last_queued)
-		return;
+	if (t->sched_info.last_queued) {
+		delta = now - t->sched_info.last_queued;
+		t->sched_info.last_queued = 0;
+		t->sched_info.run_delay += delta;
+		t->sched_info.last_arrival = now;
+		t->sched_info.pcount++;
+
+		rq_sched_info_arrive(rq, delta);
+	}
+
+	if (t->sched_info.last_depart) {
+		ddelta = now - t->sched_info.last_depart;
+		t->sched_info.last_depart = 0;
+		t->sched_info.off_cpu += ddelta;
+	}
 
-	now = rq_clock(rq);
-	delta = now - t->sched_info.last_queued;
-	t->sched_info.last_queued = 0;
-	t->sched_info.run_delay += delta;
-	t->sched_info.last_arrival = now;
-	t->sched_info.pcount++;
-	task_update_runq_stat(t, 1);
+	if (delta || ddelta)
+		task_update_runq_stat(t, 1);
 
-	rq_sched_info_arrive(rq, delta);
 }
 
 /*
@@ -216,10 +230,12 @@ static inline void sched_info_enqueue(struct rq *rq, struct task_struct *t)
  */
 static inline void sched_info_depart(struct rq *rq, struct task_struct *t)
 {
-	unsigned long long delta = rq_clock(rq) - t->sched_info.last_arrival;
+	unsigned long long delta, now = rq_clock(rq);
 
+	delta = now - t->sched_info.last_arrival;
 	rq_sched_info_depart(rq, delta);
 
+	t->sched_info.last_depart = now;
 	if (task_is_running(t))
 		sched_info_enqueue(rq, t);
 }
-- 
2.7.4


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RESEND RFC PATCH 1/3] Introduce per thread user-kernel shared structure
  2021-09-09  0:23 ` [RESEND RFC PATCH 1/3] Introduce per thread user-kernel shared structure Prakash Sangappa
@ 2021-09-09  1:50   ` Jann Horn
  2021-09-09 17:30     ` Prakash Sangappa
  0 siblings, 1 reply; 19+ messages in thread
From: Jann Horn @ 2021-09-09  1:50 UTC (permalink / raw)
  To: Prakash Sangappa; +Cc: linux-kernel, linux-api, mingo

On Thu, Sep 9, 2021 at 2:16 AM Prakash Sangappa
<prakash.sangappa@oracle.com> wrote:
> A structure per thread is allocated from a page that is shared mapped
> between user space and kernel as means for faster communication. This will
> facilitate sharing information, Ex: per thread stats shared between kernel
> and user space, that can be read by applications without the need for
> making frequent system calls in latency sensitive code path.
[...]
> +               /* Try to map as high as possible, this is only a hint. */
> +               pg->vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE,
> +                                       PAGE_SIZE, 0, 0);

I don't have an opinion on the rest of this patch, but from a security
perspective, please don't map things at fixed addresses; let the mmap
code pick some random address that it wants to use. If an attacker
manages to overwrite some userspace pointer with chosen data, we don't
want them to know an address where they can be sure something is
mapped.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RESEND RFC PATCH 1/3] Introduce per thread user-kernel shared structure
  2021-09-09  1:50   ` Jann Horn
@ 2021-09-09 17:30     ` Prakash Sangappa
  0 siblings, 0 replies; 19+ messages in thread
From: Prakash Sangappa @ 2021-09-09 17:30 UTC (permalink / raw)
  To: Jann Horn; +Cc: linux-kernel, linux-api, mingo



> On Sep 8, 2021, at 6:50 PM, Jann Horn <jannh@google.com> wrote:
> 
> On Thu, Sep 9, 2021 at 2:16 AM Prakash Sangappa
> <prakash.sangappa@oracle.com> wrote:
>> A structure per thread is allocated from a page that is shared mapped
>> between user space and kernel as means for faster communication. This will
>> facilitate sharing information, Ex: per thread stats shared between kernel
>> and user space, that can be read by applications without the need for
>> making frequent system calls in latency sensitive code path.
> [...]
>> +               /* Try to map as high as possible, this is only a hint. */
>> +               pg->vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE,
>> +                                       PAGE_SIZE, 0, 0);
> 
> I don't have an opinion on the rest of this patch, but from a security
> perspective, please don't map things at fixed addresses; let the mmap
> code pick some random address that it wants to use. If an attacker
> manages to overwrite some userspace pointer with chosen data, we don't
> want them to know an address where they can be sure something is
> mapped.

Sure. I will address this. 
Thanks,
-Prakash

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Fwd: [RESEND RFC PATCH 0/3] Provide fast access to thread specific data
       [not found] ` <CAFTs51VDUPWu=r9d=ThABc-Z6wCwTOC+jKDCq=Jk8Pfid61xyQ@mail.gmail.com>
@ 2021-09-10 15:18   ` Peter Oskolkov
  2021-09-10 16:13     ` Prakash Sangappa
  2021-09-10 16:37     ` Fwd: " Florian Weimer
  0 siblings, 2 replies; 19+ messages in thread
From: Peter Oskolkov @ 2021-09-10 15:18 UTC (permalink / raw)
  To: Prakash Sangappa, Linux Kernel Mailing List, linux-api
  Cc: Ingo Molnar, Paul Turner, Jann Horn, Peter Oskolkov

On Wed, Sep 8, 2021 at 5:16 PM Prakash Sangappa
<prakash.sangappa@oracle.com> wrote:
>
> Including liunx-kernel..
>
> Resending RFC. This patchset is not final. I am looking for feedback on
> this proposal to share thread specific data for us in latency sensitive
> codepath.

Hi Prakash,

I'd like to add here that Jann and I have been discussing a similar
feature for my UMCG patchset:

https://lore.kernel.org/lkml/CAG48ez0mgCXpXnqAUsa0TcFBPjrid-74Gj=xG8HZqj2n+OPoKw@mail.gmail.com/

In short, due to the need to read/write to the userspace from
non-sleepable contexts in the kernel it seems that we need to have some
form of per task/thread kernel/userspace shared memory that is pinned,
similar to what your sys_task_getshared does.

Do you think your sys_task_getshared can be tweaked to return an
arbitrarily-sized block of memory (subject to overall constraints)
rather than a fixed number of "options"?

On a more general note, we have a kernel extension internally at
Google, named "kuchannel", that is similar to what you propose here:
per task/thread shared memory with counters and other stat fields that
the kernel populates and the userspace reads (and some additional
functionality that is not too relevant to the discussion).

Thanks,
Peter

[...]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RESEND RFC PATCH 0/3] Provide fast access to thread specific data
  2021-09-10 15:18   ` Fwd: [RESEND RFC PATCH 0/3] Provide fast access to thread specific data Peter Oskolkov
@ 2021-09-10 16:13     ` Prakash Sangappa
  2021-09-10 16:28       ` Peter Oskolkov
  2021-09-10 16:37     ` Fwd: " Florian Weimer
  1 sibling, 1 reply; 19+ messages in thread
From: Prakash Sangappa @ 2021-09-10 16:13 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Linux Kernel Mailing List, linux-api, Ingo Molnar, Paul Turner,
	Jann Horn, Peter Oskolkov



> On Sep 10, 2021, at 8:18 AM, Peter Oskolkov <posk@google.com> wrote:
> 
> On Wed, Sep 8, 2021 at 5:16 PM Prakash Sangappa
> <prakash.sangappa@oracle.com> wrote:
>> 
>> Including liunx-kernel..
>> 
>> Resending RFC. This patchset is not final. I am looking for feedback on
>> this proposal to share thread specific data for us in latency sensitive
>> codepath.
> 
> Hi Prakash,


> 
> I'd like to add here that Jann and I have been discussing a similar
> feature for my UMCG patchset:
> 
> https://lore.kernel.org/lkml/CAG48ez0mgCXpXnqAUsa0TcFBPjrid-74Gj=xG8HZqj2n+OPoKw@mail.gmail.com/

Hi Peter,

I will take  a look.

> 
> In short, due to the need to read/write to the userspace from
> non-sleepable contexts in the kernel it seems that we need to have some
> form of per task/thread kernel/userspace shared memory that is pinned,
> similar to what your sys_task_getshared does.

Exactly. For this reason wanted kernel to allocate the pinned memory.
Didn’t want to deal with files etc as a large number threads will be using
the shared structure mechanism. 

> 
> Do you think your sys_task_getshared can be tweaked to return an
> arbitrarily-sized block of memory (subject to overall constraints)
> rather than a fixed number of "options"?

I suppose it could. How big of a size? We don’t want to hold on to 
arbitrarily large  amount of pinned memory. The preference would 
be for the kernel to decide what is going to be shared based on
what functionality/data sharing is supported. In that sense the size 
is pre defined not something the userspace/application can ask. 

I have not looked at your use case.

> 
> On a more general note, we have a kernel extension internally at
> Google, named "kuchannel", that is similar to what you propose here:
> per task/thread shared memory with counters and other stat fields that
> the kernel populates and the userspace reads (and some additional
> functionality that is not too relevant to the discussion).

We have few other use cases for this we are looking at, which I can 
describe later. 

-Prakash

> 
> Thanks,
> Peter
> 
> [...]


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RESEND RFC PATCH 0/3] Provide fast access to thread specific data
  2021-09-10 16:13     ` Prakash Sangappa
@ 2021-09-10 16:28       ` Peter Oskolkov
  2021-09-10 19:12         ` Jann Horn
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Oskolkov @ 2021-09-10 16:28 UTC (permalink / raw)
  To: Prakash Sangappa
  Cc: Linux Kernel Mailing List, linux-api, Ingo Molnar, Paul Turner,
	Jann Horn, Peter Oskolkov, Peter Zijlstra

On Fri, Sep 10, 2021 at 9:13 AM Prakash Sangappa
<prakash.sangappa@oracle.com> wrote:
>
>
>
> > On Sep 10, 2021, at 8:18 AM, Peter Oskolkov <posk@google.com> wrote:
> >
> > On Wed, Sep 8, 2021 at 5:16 PM Prakash Sangappa
> > <prakash.sangappa@oracle.com> wrote:
> >>
> >> Including liunx-kernel..
> >>
> >> Resending RFC. This patchset is not final. I am looking for feedback on
> >> this proposal to share thread specific data for us in latency sensitive
> >> codepath.
> >
> > Hi Prakash,
>
>
> >
> > I'd like to add here that Jann and I have been discussing a similar
> > feature for my UMCG patchset:
> >
> > https://lore.kernel.org/lkml/CAG48ez0mgCXpXnqAUsa0TcFBPjrid-74Gj=xG8HZqj2n+OPoKw@mail.gmail.com/
>
> Hi Peter,
>
> I will take  a look.
>
> >
> > In short, due to the need to read/write to the userspace from
> > non-sleepable contexts in the kernel it seems that we need to have some
> > form of per task/thread kernel/userspace shared memory that is pinned,
> > similar to what your sys_task_getshared does.
>
> Exactly. For this reason wanted kernel to allocate the pinned memory.
> Didn’t want to deal with files etc as a large number threads will be using
> the shared structure mechanism.
>
> >
> > Do you think your sys_task_getshared can be tweaked to return an
> > arbitrarily-sized block of memory (subject to overall constraints)
> > rather than a fixed number of "options"?
>
> I suppose it could. How big of a size? We don’t want to hold on to
> arbitrarily large  amount of pinned memory. The preference would
> be for the kernel to decide what is going to be shared based on
> what functionality/data sharing is supported. In that sense the size
> is pre defined not something the userspace/application can ask.

There could be a sysctl or some other mechanism that limits the amount
of memory pinned per mm (or per task). Having "options" hardcoded for
such a generally useful feature seems limiting...

>
> I have not looked at your use case.
>
> >
> > On a more general note, we have a kernel extension internally at
> > Google, named "kuchannel", that is similar to what you propose here:
> > per task/thread shared memory with counters and other stat fields that
> > the kernel populates and the userspace reads (and some additional
> > functionality that is not too relevant to the discussion).
>
> We have few other use cases for this we are looking at, which I can
> describe later.
>
> -Prakash
>
> >
> > Thanks,
> > Peter
> >
> > [...]
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Fwd: [RESEND RFC PATCH 0/3] Provide fast access to thread specific data
  2021-09-10 15:18   ` Fwd: [RESEND RFC PATCH 0/3] Provide fast access to thread specific data Peter Oskolkov
  2021-09-10 16:13     ` Prakash Sangappa
@ 2021-09-10 16:37     ` Florian Weimer
  2021-09-10 17:33       ` Mathieu Desnoyers
  1 sibling, 1 reply; 19+ messages in thread
From: Florian Weimer @ 2021-09-10 16:37 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Prakash Sangappa, Linux Kernel Mailing List, linux-api,
	Ingo Molnar, Paul Turner, Jann Horn, Peter Oskolkov,
	Vincenzo Frascino, Mathieu Desnoyers

* Peter Oskolkov:

> In short, due to the need to read/write to the userspace from
> non-sleepable contexts in the kernel it seems that we need to have some
> form of per task/thread kernel/userspace shared memory that is pinned,
> similar to what your sys_task_getshared does.

In glibc, we'd also like to have this for PID and TID.  Eventually,
rt_sigprocmask without kernel roundtrip in most cases would be very nice
as well.  For performance and simplicity in userspace, it would be best
if the memory region could be at the same offset from the TCB for all
threads.

For KTLS, the idea was that the auxiliary vector would contain size and
alignment of the KTLS.  Userspace would reserve that memory, register it
with the kernel like rseq (or the robust list pointers), and pass its
address to the vDSO functions that need them.  The last part ensures
that the vDSO functions do not need non-global data to determine the
offset from the TCB.  Registration is still needed for the caches.

I think previous discussions (in the KTLS and rseq context) did not have
the pinning constraint.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RESEND RFC PATCH 0/3] Provide fast access to thread specific data
  2021-09-10 16:37     ` Fwd: " Florian Weimer
@ 2021-09-10 17:33       ` Mathieu Desnoyers
  2021-09-10 17:48         ` Peter Oskolkov
  0 siblings, 1 reply; 19+ messages in thread
From: Mathieu Desnoyers @ 2021-09-10 17:33 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Peter Oskolkov, Prakash Sangappa, linux-kernel, linux-api,
	Ingo Molnar, Paul Turner, Jann Horn, Peter Oskolkov,
	Vincenzo Frascino

----- On Sep 10, 2021, at 12:37 PM, Florian Weimer fweimer@redhat.com wrote:

> * Peter Oskolkov:
> 
>> In short, due to the need to read/write to the userspace from
>> non-sleepable contexts in the kernel it seems that we need to have some
>> form of per task/thread kernel/userspace shared memory that is pinned,
>> similar to what your sys_task_getshared does.
> 
> In glibc, we'd also like to have this for PID and TID.  Eventually,
> rt_sigprocmask without kernel roundtrip in most cases would be very nice
> as well.  For performance and simplicity in userspace, it would be best
> if the memory region could be at the same offset from the TCB for all
> threads.
> 
> For KTLS, the idea was that the auxiliary vector would contain size and
> alignment of the KTLS.  Userspace would reserve that memory, register it
> with the kernel like rseq (or the robust list pointers), and pass its
> address to the vDSO functions that need them.  The last part ensures
> that the vDSO functions do not need non-global data to determine the
> offset from the TCB.  Registration is still needed for the caches.
> 
> I think previous discussions (in the KTLS and rseq context) did not have
> the pinning constraint.

If this data is per-thread, and read from user-space, why is it relevant
to update this data from non-sleepable kernel context rather than update it as
needed on return-to-userspace ? When returning to userspace, sleeping due to a
page fault is entirely acceptable. This is what we currently do for rseq.

In short, the data could be accessible from the task struct. Flags in the
task struct can let return-to-userspace know that it has outdated ktls
data. So before returning to userspace, the kernel can copy the relevant data
from the task struct to the shared memory area, without requiring any pinning.

What am I missing ?

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RESEND RFC PATCH 0/3] Provide fast access to thread specific data
  2021-09-10 17:33       ` Mathieu Desnoyers
@ 2021-09-10 17:48         ` Peter Oskolkov
  2021-09-10 17:55           ` Mathieu Desnoyers
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Oskolkov @ 2021-09-10 17:48 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Florian Weimer, Prakash Sangappa, linux-kernel, linux-api,
	Ingo Molnar, Paul Turner, Jann Horn, Peter Oskolkov,
	Vincenzo Frascino, Peter Zijlstra

On Fri, Sep 10, 2021 at 10:33 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> ----- On Sep 10, 2021, at 12:37 PM, Florian Weimer fweimer@redhat.com wrote:
>
> > * Peter Oskolkov:
> >
> >> In short, due to the need to read/write to the userspace from
> >> non-sleepable contexts in the kernel it seems that we need to have some
> >> form of per task/thread kernel/userspace shared memory that is pinned,
> >> similar to what your sys_task_getshared does.
> >
> > In glibc, we'd also like to have this for PID and TID.  Eventually,
> > rt_sigprocmask without kernel roundtrip in most cases would be very nice
> > as well.  For performance and simplicity in userspace, it would be best
> > if the memory region could be at the same offset from the TCB for all
> > threads.
> >
> > For KTLS, the idea was that the auxiliary vector would contain size and
> > alignment of the KTLS.  Userspace would reserve that memory, register it
> > with the kernel like rseq (or the robust list pointers), and pass its
> > address to the vDSO functions that need them.  The last part ensures
> > that the vDSO functions do not need non-global data to determine the
> > offset from the TCB.  Registration is still needed for the caches.
> >
> > I think previous discussions (in the KTLS and rseq context) did not have
> > the pinning constraint.
>
> If this data is per-thread, and read from user-space, why is it relevant
> to update this data from non-sleepable kernel context rather than update it as
> needed on return-to-userspace ? When returning to userspace, sleeping due to a
> page fault is entirely acceptable. This is what we currently do for rseq.
>
> In short, the data could be accessible from the task struct. Flags in the
> task struct can let return-to-userspace know that it has outdated ktls
> data. So before returning to userspace, the kernel can copy the relevant data
> from the task struct to the shared memory area, without requiring any pinning.
>
> What am I missing ?

I can't speak about other use cases, but in the context of userspace
scheduling, the information that a task has blocked in the kernel and
is going to be removed from its runqueue cannot wait to be delivered
to the userspace until the task wakes up, as the userspace scheduler
needs to know of the even when it happened so that it can schedule
another task in place of the blocked one. See the discussion here:

https://lore.kernel.org/lkml/CAG48ez0mgCXpXnqAUsa0TcFBPjrid-74Gj=xG8HZqj2n+OPoKw@mail.gmail.com/

>
> Thanks,
>
> Mathieu
>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RESEND RFC PATCH 0/3] Provide fast access to thread specific data
  2021-09-10 17:48         ` Peter Oskolkov
@ 2021-09-10 17:55           ` Mathieu Desnoyers
  2021-09-10 18:00             ` Peter Oskolkov
  0 siblings, 1 reply; 19+ messages in thread
From: Mathieu Desnoyers @ 2021-09-10 17:55 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Florian Weimer, Prakash Sangappa, linux-kernel, linux-api,
	Ingo Molnar, Paul Turner, Jann Horn, Peter Oskolkov,
	Vincenzo Frascino, Peter Zijlstra

----- On Sep 10, 2021, at 1:48 PM, Peter Oskolkov posk@google.com wrote:

> On Fri, Sep 10, 2021 at 10:33 AM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> ----- On Sep 10, 2021, at 12:37 PM, Florian Weimer fweimer@redhat.com wrote:
>>
>> > * Peter Oskolkov:
>> >
>> >> In short, due to the need to read/write to the userspace from
>> >> non-sleepable contexts in the kernel it seems that we need to have some
>> >> form of per task/thread kernel/userspace shared memory that is pinned,
>> >> similar to what your sys_task_getshared does.
>> >
>> > In glibc, we'd also like to have this for PID and TID.  Eventually,
>> > rt_sigprocmask without kernel roundtrip in most cases would be very nice
>> > as well.  For performance and simplicity in userspace, it would be best
>> > if the memory region could be at the same offset from the TCB for all
>> > threads.
>> >
>> > For KTLS, the idea was that the auxiliary vector would contain size and
>> > alignment of the KTLS.  Userspace would reserve that memory, register it
>> > with the kernel like rseq (or the robust list pointers), and pass its
>> > address to the vDSO functions that need them.  The last part ensures
>> > that the vDSO functions do not need non-global data to determine the
>> > offset from the TCB.  Registration is still needed for the caches.
>> >
>> > I think previous discussions (in the KTLS and rseq context) did not have
>> > the pinning constraint.
>>
>> If this data is per-thread, and read from user-space, why is it relevant
>> to update this data from non-sleepable kernel context rather than update it as
>> needed on return-to-userspace ? When returning to userspace, sleeping due to a
>> page fault is entirely acceptable. This is what we currently do for rseq.
>>
>> In short, the data could be accessible from the task struct. Flags in the
>> task struct can let return-to-userspace know that it has outdated ktls
>> data. So before returning to userspace, the kernel can copy the relevant data
>> from the task struct to the shared memory area, without requiring any pinning.
>>
>> What am I missing ?
> 
> I can't speak about other use cases, but in the context of userspace
> scheduling, the information that a task has blocked in the kernel and
> is going to be removed from its runqueue cannot wait to be delivered
> to the userspace until the task wakes up, as the userspace scheduler
> needs to know of the even when it happened so that it can schedule
> another task in place of the blocked one. See the discussion here:
> 
> https://lore.kernel.org/lkml/CAG48ez0mgCXpXnqAUsa0TcFBPjrid-74Gj=xG8HZqj2n+OPoKw@mail.gmail.com/

OK, just to confirm my understanding, so the use-case here is per-thread
state which can be read by other threads (in this case the userspace scheduler) ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RESEND RFC PATCH 0/3] Provide fast access to thread specific data
  2021-09-10 17:55           ` Mathieu Desnoyers
@ 2021-09-10 18:00             ` Peter Oskolkov
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Oskolkov @ 2021-09-10 18:00 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Florian Weimer, Prakash Sangappa, linux-kernel, linux-api,
	Ingo Molnar, Paul Turner, Jann Horn, Peter Oskolkov,
	Vincenzo Frascino, Peter Zijlstra

On Fri, Sep 10, 2021 at 10:55 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> ----- On Sep 10, 2021, at 1:48 PM, Peter Oskolkov posk@google.com wrote:
>
> > On Fri, Sep 10, 2021 at 10:33 AM Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> >>
> >> ----- On Sep 10, 2021, at 12:37 PM, Florian Weimer fweimer@redhat.com wrote:
> >>
> >> > * Peter Oskolkov:
> >> >
> >> >> In short, due to the need to read/write to the userspace from
> >> >> non-sleepable contexts in the kernel it seems that we need to have some
> >> >> form of per task/thread kernel/userspace shared memory that is pinned,
> >> >> similar to what your sys_task_getshared does.
> >> >
> >> > In glibc, we'd also like to have this for PID and TID.  Eventually,
> >> > rt_sigprocmask without kernel roundtrip in most cases would be very nice
> >> > as well.  For performance and simplicity in userspace, it would be best
> >> > if the memory region could be at the same offset from the TCB for all
> >> > threads.
> >> >
> >> > For KTLS, the idea was that the auxiliary vector would contain size and
> >> > alignment of the KTLS.  Userspace would reserve that memory, register it
> >> > with the kernel like rseq (or the robust list pointers), and pass its
> >> > address to the vDSO functions that need them.  The last part ensures
> >> > that the vDSO functions do not need non-global data to determine the
> >> > offset from the TCB.  Registration is still needed for the caches.
> >> >
> >> > I think previous discussions (in the KTLS and rseq context) did not have
> >> > the pinning constraint.
> >>
> >> If this data is per-thread, and read from user-space, why is it relevant
> >> to update this data from non-sleepable kernel context rather than update it as
> >> needed on return-to-userspace ? When returning to userspace, sleeping due to a
> >> page fault is entirely acceptable. This is what we currently do for rseq.
> >>
> >> In short, the data could be accessible from the task struct. Flags in the
> >> task struct can let return-to-userspace know that it has outdated ktls
> >> data. So before returning to userspace, the kernel can copy the relevant data
> >> from the task struct to the shared memory area, without requiring any pinning.
> >>
> >> What am I missing ?
> >
> > I can't speak about other use cases, but in the context of userspace
> > scheduling, the information that a task has blocked in the kernel and
> > is going to be removed from its runqueue cannot wait to be delivered
> > to the userspace until the task wakes up, as the userspace scheduler
> > needs to know of the even when it happened so that it can schedule
> > another task in place of the blocked one. See the discussion here:
> >
> > https://lore.kernel.org/lkml/CAG48ez0mgCXpXnqAUsa0TcFBPjrid-74Gj=xG8HZqj2n+OPoKw@mail.gmail.com/
>
> OK, just to confirm my understanding, so the use-case here is per-thread
> state which can be read by other threads (in this case the userspace scheduler) ?

Yes, exactly! And sometimes these other threads have to read/write the
state while they are themselves in preempt_disabled regions in the
kernel. There could be a way to do that asynchronously (e.g. via
workpools), but this will add latency and complexity.

>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RESEND RFC PATCH 0/3] Provide fast access to thread specific data
  2021-09-10 16:28       ` Peter Oskolkov
@ 2021-09-10 19:12         ` Jann Horn
  2021-09-10 19:36           ` Peter Oskolkov
  0 siblings, 1 reply; 19+ messages in thread
From: Jann Horn @ 2021-09-10 19:12 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Prakash Sangappa, Linux Kernel Mailing List, linux-api,
	Ingo Molnar, Paul Turner, Peter Oskolkov, Peter Zijlstra

On Fri, Sep 10, 2021 at 6:28 PM Peter Oskolkov <posk@google.com> wrote:
> On Fri, Sep 10, 2021 at 9:13 AM Prakash Sangappa
> <prakash.sangappa@oracle.com> wrote:
> > > Do you think your sys_task_getshared can be tweaked to return an
> > > arbitrarily-sized block of memory (subject to overall constraints)
> > > rather than a fixed number of "options"?
> >
> > I suppose it could. How big of a size? We don’t want to hold on to
> > arbitrarily large  amount of pinned memory. The preference would
> > be for the kernel to decide what is going to be shared based on
> > what functionality/data sharing is supported. In that sense the size
> > is pre defined not something the userspace/application can ask.
>
> There could be a sysctl or some other mechanism that limits the amount
> of memory pinned per mm (or per task). Having "options" hardcoded for
> such a generally useful feature seems limiting...

That seems like it'll just create trouble a few years down the line
when the arbitrarily-chosen limit that nobody is monitoring blows up
in someone's production environment.

If this area is used for specific per-thread items, then the kernel
should be able to enforce that you only allocate as much space as is
needed for all threads of the process (based on the maximum number
that have ever been running in parallel in the process), right? Which
would probably work best if the kernel managed those allocations.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RESEND RFC PATCH 0/3] Provide fast access to thread specific data
  2021-09-10 19:12         ` Jann Horn
@ 2021-09-10 19:36           ` Peter Oskolkov
  2021-09-13 17:36             ` Prakash Sangappa
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Oskolkov @ 2021-09-10 19:36 UTC (permalink / raw)
  To: Jann Horn
  Cc: Prakash Sangappa, Linux Kernel Mailing List, linux-api,
	Ingo Molnar, Paul Turner, Peter Oskolkov, Peter Zijlstra

On Fri, Sep 10, 2021 at 12:12 PM Jann Horn <jannh@google.com> wrote:
>
> On Fri, Sep 10, 2021 at 6:28 PM Peter Oskolkov <posk@google.com> wrote:
> > On Fri, Sep 10, 2021 at 9:13 AM Prakash Sangappa
> > <prakash.sangappa@oracle.com> wrote:
> > > > Do you think your sys_task_getshared can be tweaked to return an
> > > > arbitrarily-sized block of memory (subject to overall constraints)
> > > > rather than a fixed number of "options"?
> > >
> > > I suppose it could. How big of a size? We don’t want to hold on to
> > > arbitrarily large  amount of pinned memory. The preference would
> > > be for the kernel to decide what is going to be shared based on
> > > what functionality/data sharing is supported. In that sense the size
> > > is pre defined not something the userspace/application can ask.
> >
> > There could be a sysctl or some other mechanism that limits the amount
> > of memory pinned per mm (or per task). Having "options" hardcoded for
> > such a generally useful feature seems limiting...
>
> That seems like it'll just create trouble a few years down the line
> when the arbitrarily-chosen limit that nobody is monitoring blows up
> in someone's production environment.
>
> If this area is used for specific per-thread items, then the kernel
> should be able to enforce that you only allocate as much space as is
> needed for all threads of the process (based on the maximum number
> that have ever been running in parallel in the process), right? Which
> would probably work best if the kernel managed those allocations.

This sounds, again, as if the kernel should be aware of the kind of
items being allocated; having a more generic mechanism of allocating
pinned memory for the userspace to use at its discretion would be more
generally useful, I think. But how then the kernel/system should be
protected from a buggy or malicious process trying to grab too much?

One option would be to have a generic in-kernel mechanism for this,
but expose it to the userspace via domain-specific syscalls that do
the accounting you hint at. This sounds a bit like an over-engineered
solution, though...

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RESEND RFC PATCH 0/3] Provide fast access to thread specific data
  2021-09-10 19:36           ` Peter Oskolkov
@ 2021-09-13 17:36             ` Prakash Sangappa
  2021-09-13 18:00               ` Peter Oskolkov
  0 siblings, 1 reply; 19+ messages in thread
From: Prakash Sangappa @ 2021-09-13 17:36 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Jann Horn, Linux Kernel Mailing List, linux-api, Ingo Molnar,
	Paul Turner, Peter Oskolkov, Peter Zijlstra



> On Sep 10, 2021, at 12:36 PM, Peter Oskolkov <posk@google.com> wrote:
> 
> On Fri, Sep 10, 2021 at 12:12 PM Jann Horn <jannh@google.com> wrote:
>> 
>> On Fri, Sep 10, 2021 at 6:28 PM Peter Oskolkov <posk@google.com> wrote:
>>> On Fri, Sep 10, 2021 at 9:13 AM Prakash Sangappa
>>> <prakash.sangappa@oracle.com> wrote:
>>>>> Do you think your sys_task_getshared can be tweaked to return an
>>>>> arbitrarily-sized block of memory (subject to overall constraints)
>>>>> rather than a fixed number of "options"?
>>>> 
>>>> I suppose it could. How big of a size? We don’t want to hold on to
>>>> arbitrarily large  amount of pinned memory. The preference would
>>>> be for the kernel to decide what is going to be shared based on
>>>> what functionality/data sharing is supported. In that sense the size
>>>> is pre defined not something the userspace/application can ask.
>>> 
>>> There could be a sysctl or some other mechanism that limits the amount
>>> of memory pinned per mm (or per task). Having "options" hardcoded for
>>> such a generally useful feature seems limiting...
>> 
>> That seems like it'll just create trouble a few years down the line
>> when the arbitrarily-chosen limit that nobody is monitoring blows up
>> in someone's production environment.
>> 
>> If this area is used for specific per-thread items, then the kernel
>> should be able to enforce that you only allocate as much space as is
>> needed for all threads of the process (based on the maximum number
>> that have ever been running in parallel in the process), right? Which
>> would probably work best if the kernel managed those allocations.
> 
> This sounds, again, as if the kernel should be aware of the kind of
> items being allocated; having a more generic mechanism of allocating
> pinned memory for the userspace to use at its discretion would be more
> generally useful, I think. But how then the kernel/system should be
> protected from a buggy or malicious process trying to grab too much?
> 
> One option would be to have a generic in-kernel mechanism for this,
> but expose it to the userspace via domain-specific syscalls that do
> the accounting you hint at. This sounds a bit like an over-engineered
> solution, though…


What will this pinned memory be used for in your use case,
can you explain?


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RESEND RFC PATCH 0/3] Provide fast access to thread specific data
  2021-09-13 17:36             ` Prakash Sangappa
@ 2021-09-13 18:00               ` Peter Oskolkov
  2021-09-14 16:10                 ` Prakash Sangappa
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Oskolkov @ 2021-09-13 18:00 UTC (permalink / raw)
  To: Prakash Sangappa
  Cc: Jann Horn, Linux Kernel Mailing List, linux-api, Ingo Molnar,
	Paul Turner, Peter Oskolkov, Peter Zijlstra

On Mon, Sep 13, 2021 at 10:36 AM Prakash Sangappa
<prakash.sangappa@oracle.com> wrote:

[...]

> > This sounds, again, as if the kernel should be aware of the kind of
> > items being allocated; having a more generic mechanism of allocating
> > pinned memory for the userspace to use at its discretion would be more
> > generally useful, I think. But how then the kernel/system should be
> > protected from a buggy or malicious process trying to grab too much?
> >
> > One option would be to have a generic in-kernel mechanism for this,
> > but expose it to the userspace via domain-specific syscalls that do
> > the accounting you hint at. This sounds a bit like an over-engineered
> > solution, though…
>
>
> What will this pinned memory be used for in your use case,
> can you explain?

For userspace scheduling, to share thread/task state information
between the kernel and the userspace. This memory will be allocated
per task/thread; both the kernel and the userspace will write to the
shared memory, and these reads/writes will happen not only in the
memory regions belonging to the "current" task/thread, but also to
remote tasks/threads.

Somewhat detailed doc/rst is here:
https://lore.kernel.org/lkml/20210908184905.163787-5-posk@google.com/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RESEND RFC PATCH 0/3] Provide fast access to thread specific data
  2021-09-13 18:00               ` Peter Oskolkov
@ 2021-09-14 16:10                 ` Prakash Sangappa
  0 siblings, 0 replies; 19+ messages in thread
From: Prakash Sangappa @ 2021-09-14 16:10 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Jann Horn, Linux Kernel Mailing List, linux-api, Ingo Molnar,
	Paul Turner, Peter Oskolkov, Peter Zijlstra



> On Sep 13, 2021, at 11:00 AM, Peter Oskolkov <posk@google.com> wrote:
> 
> On Mon, Sep 13, 2021 at 10:36 AM Prakash Sangappa
> <prakash.sangappa@oracle.com> wrote:
> 
> [...]
> 
>>> This sounds, again, as if the kernel should be aware of the kind of
>>> items being allocated; having a more generic mechanism of allocating
>>> pinned memory for the userspace to use at its discretion would be more
>>> generally useful, I think. But how then the kernel/system should be
>>> protected from a buggy or malicious process trying to grab too much?
>>> 
>>> One option would be to have a generic in-kernel mechanism for this,
>>> but expose it to the userspace via domain-specific syscalls that do
>>> the accounting you hint at. This sounds a bit like an over-engineered
>>> solution, though…
>> 
>> 
>> What will this pinned memory be used for in your use case,
>> can you explain?
> 
> For userspace scheduling, to share thread/task state information
> between the kernel and the userspace. This memory will be allocated
> per task/thread; both the kernel and the userspace will write to the
> shared memory, and these reads/writes will happen not only in the
> memory regions belonging to the "current" task/thread, but also to
> remote tasks/threads.
> 
> Somewhat detailed doc/rst is here:
> https://lore.kernel.org/lkml/20210908184905.163787-5-posk@google.com/

(Resending reply)

From what I could glean from the link above, looks like you will need the 
entire 'struct umcg_task’(which is 24 bytes in size) in the per thread shared
mapped space(pinned memory?) Accessed/updated both in  user space 
and kernel. Appears the state transitions here are specific to umcg.  So, 
may not be usable in other use cases that are interested in just checking 
if a thread is executing on cpu or blocked.

We have a requirement to share thread state as well(on or off cpu) in the 
shared structure, which also will be accessed by other threads in the user 
space. Kernel updates the state when the thread blocks or resumes execution.
Need to see if may be the task state you have could be repurposed when 
not used by umcg threads.

Regarding use of pinned memory, it is not arbitrary amount per thread then
right? Basically you need 24 bytes per thread. The proposed task_getshared() 
allocates pinned memory pages to accommodate  requests from as many 
threads in a process that need to use the shared structure
(padded to 128 bytes). The  amount of memory/pages consumed will be
bound by the number threads a process can create. As I mentioned in the
cover letter multiple shared structures are fit/allocated from a page.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-09-14 16:10 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-09  0:23 [RESEND RFC PATCH 0/3] Provide fast access to thread specific data Prakash Sangappa
2021-09-09  0:23 ` [RESEND RFC PATCH 1/3] Introduce per thread user-kernel shared structure Prakash Sangappa
2021-09-09  1:50   ` Jann Horn
2021-09-09 17:30     ` Prakash Sangappa
2021-09-09  0:23 ` [RESEND RFC PATCH 2/3] Publish tasks's scheduler stats thru the " Prakash Sangappa
2021-09-09  0:23 ` [RESEND RFC PATCH 3/3] Introduce task's 'off cpu' time Prakash Sangappa
     [not found] ` <CAFTs51VDUPWu=r9d=ThABc-Z6wCwTOC+jKDCq=Jk8Pfid61xyQ@mail.gmail.com>
2021-09-10 15:18   ` Fwd: [RESEND RFC PATCH 0/3] Provide fast access to thread specific data Peter Oskolkov
2021-09-10 16:13     ` Prakash Sangappa
2021-09-10 16:28       ` Peter Oskolkov
2021-09-10 19:12         ` Jann Horn
2021-09-10 19:36           ` Peter Oskolkov
2021-09-13 17:36             ` Prakash Sangappa
2021-09-13 18:00               ` Peter Oskolkov
2021-09-14 16:10                 ` Prakash Sangappa
2021-09-10 16:37     ` Fwd: " Florian Weimer
2021-09-10 17:33       ` Mathieu Desnoyers
2021-09-10 17:48         ` Peter Oskolkov
2021-09-10 17:55           ` Mathieu Desnoyers
2021-09-10 18:00             ` Peter Oskolkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).