LKML Archive on lore.kernel.org
* [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks
@ 2018-04-24  2:21 jeffm
  2018-04-24  2:21 ` [PATCH 1/5] procfs: factor out a few helpers jeffm
                   ` (6 more replies)
  0 siblings, 7 replies; 15+ messages in thread
From: jeffm @ 2018-04-24  2:21 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: Al Viro, Eric W . Biederman, Alexey Dobriyan, Oleg Nesterov,
	Jeff Mahoney

From: Jeff Mahoney <jeffm@suse.com>

Hi all -

I recently encountered a customer issue where, on a machine with many TiB
of memory and a few hundred cores, after a task with a few thousand threads
and hundreds of files open exited, the system would softlockup.  That
issue was (is still) being addressed by Nik Borisov's patch to add a
cond_resched call to shrink_dentry_list.  The underlying issue is still
there, though; we just don't complain as loudly.  When a huge task
exits, the system is now more or less unresponsive for about eight
minutes.  All CPUs are pinned and every one of them is going through
dentry and inode eviction for the procfs files associated with each
thread.  It's made worse by every CPU contending on the super's
inode list lock.

The numbers get big.  My test case was 4096 threads with 16384 files
open.  It's a contrived example, but not that far off from the actual
customer case.  In this case, a simple "find /proc" would create around
300 million dentry/inode pairs.  More practically, lsof(1) does it too;
it just takes longer.  On smaller systems, memory pressure starts pushing
them out. Memory pressure isn't really an issue on this machine, so we
end up using well over 100GB for proc files.  It's the combination of
the wasted CPU cycles in teardown and the wasted memory at runtime that
pushed me to take this approach.

The biggest culprits are the "fd" and "fdinfo" directories, and the
problem is made worse by there being multiple copies of them even for
the same task, before threads get involved:

- /proc/pid/fd and /proc/pid/task/pid/fd are identical but share no
  resources.

- Every /proc/pid/task/*/fd directory in a thread group has identical
  contents (unless unshare(CLONE_FILES) has been called), but shares no
  resources.

- If we do a lookup like /proc/pid/fd on a non-leader member of a
  thread group, we'll get a valid directory.  Inside it, there will be
  a complete copy of /proc/pid/task/* just like in /proc/tgid/task.
  Again, nothing is shared.

This patch set reduces some (most) of the duplication by conditionally
replacing some of the directories with symbolic links to identical
copies.

1) Eliminate the duplication of the task directories between threads.
   The task directory belongs to the thread leader and the threads
   link to it: e.g. /proc/915/task -> ../910/task.  This mainly
   reduces duplication when individual threads are looked up directly
   at the tgid level.  The impact varies based on the number of threads.
   The user has to go out of their way in order to mess up their system
   in this way.  But if they were so inclined, they could create ~550
   billion inodes and dentries using the test case.

2) Eliminate the duplication of directories that are created identically
   between the tgid-level pid directory and its task directory: fd,
   fdinfo, ns, net, attr.  There is obviously more duplication between
   the two directories, but replacing a file with a symbolic link
   doesn't get us anything.  This reduces the number of files associated
   with fd and fdinfo by half if threads aren't involved.

3) Eliminate the duplication of fd and fdinfo directories among threads
   that share a files_struct.  We check at directory creation time if
   the task is a group leader and if not, whether it shares ->files with
   the group leader.  If so, we create a symbolic link to ../tgid/fd*.
   We use a d_revalidate callback to check whether the thread has called
   unshare(CLONE_FILES) and, if so, fail the revalidation for the symlink.
   Upon re-lookup, a directory will be created in its place.  This is
   pretty simple, so if the thread group leader calls unshare, all threads
   get directories.

With these patches applied, running the same test case leaves the
proc_inode cache at only about 600k objects, which is about 99.7% fewer.  I
get that procfs isn't supposed to be scalable, but this is kind of
extreme. :)

Finally, I'm not a procfs expert.  I'm posting this as an RFC for folks
with more knowledge of the details to pick it apart.  The biggest question
is whether any tools depend on any of these things being directories
instead of symlinks.  I'd hope not, but I don't have the answer.  I'm
sure there are corner cases I'm missing.  Hopefully, it's not just flat
out broken since this is a problem that does need solving.

Now I'll go put on the fireproof suit.

Thanks,

-Jeff

---

Jeff Mahoney (5):
  procfs: factor out a few helpers
  procfs: factor out inode revalidation work from pid_revalidation
  procfs: use symlinks for /proc/<pid>/task when not thread group leader
  procfs: share common directories between /proc/tgid and
    /proc/tgid/task/tgid
  procfs: share fd/fdinfo with thread group leader when files are shared

 fs/proc/base.c | 487 +++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 437 insertions(+), 50 deletions(-)

-- 
2.12.3

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/5] procfs: factor out a few helpers
  2018-04-24  2:21 [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks jeffm
@ 2018-04-24  2:21 ` jeffm
  2018-04-24  2:21 ` [PATCH 2/5] procfs: factor out inode revalidation work from pid_revalidation jeffm
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: jeffm @ 2018-04-24  2:21 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: Al Viro, Eric W . Biederman, Alexey Dobriyan, Oleg Nesterov,
	Jeff Mahoney

From: Jeff Mahoney <jeffm@suse.com>

In order to make the following series easier to review, this patch
factors out a few helpers:
- proc_fill_cache_entry: a wrapper around proc_fill_cache that takes a
  pid_entry instead of separate entry, name, and name-length arguments
- proc_pident_lookup_task: a version of proc_pident_lookup that takes a
  task, allowing the caller to do the task lookup and validation
- pid_entry_match_dentry: the comparison between a dentry's name and
  a pid_entry

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
---
 fs/proc/base.c | 60 ++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 40 insertions(+), 20 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9298324325ed..e9876a89a5ad 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2453,19 +2453,28 @@ static int proc_pident_instantiate(struct inode *dir,
 	return -ENOENT;
 }
 
-static struct dentry *proc_pident_lookup(struct inode *dir, 
-					 struct dentry *dentry,
-					 const struct pid_entry *ents,
-					 unsigned int nents)
+static bool proc_fill_cache_entry(struct file *file, struct dir_context *ctx,
+				  const struct pid_entry *entry,
+				  struct task_struct *task)
 {
-	int error;
-	struct task_struct *task = get_proc_task(dir);
-	const struct pid_entry *p, *last;
+	return proc_fill_cache(file, ctx, entry->name, entry->len,
+			       proc_pident_instantiate, task, entry);
+}
 
-	error = -ENOENT;
+static bool pid_entry_match_dentry(const struct pid_entry *entry,
+				   const struct dentry *dentry)
+{
+	if (entry->len != dentry->d_name.len)
+		return false;
+	return !memcmp(dentry->d_name.name, entry->name, entry->len);
+}
 
-	if (!task)
-		goto out_no_task;
+static int proc_pident_lookup_task(struct inode *dir, struct dentry *dentry,
+				   const struct pid_entry *ents,
+				   unsigned int nents,
+				   struct task_struct *task)
+{
+	const struct pid_entry *p, *last;
 
 	/*
 	 * Yes, it does not scale. And it should not. Don't add
@@ -2473,18 +2482,30 @@ static struct dentry *proc_pident_lookup(struct inode *dir,
 	 */
 	last = &ents[nents];
 	for (p = ents; p < last; p++) {
-		if (p->len != dentry->d_name.len)
-			continue;
-		if (!memcmp(dentry->d_name.name, p->name, p->len))
+		if (pid_entry_match_dentry(p, dentry))
 			break;
 	}
 	if (p >= last)
-		goto out;
+		return -ENOENT;
+
+	return proc_pident_instantiate(dir, dentry, task, p);
+}
+
+static struct dentry *proc_pident_lookup(struct inode *dir,
+					 struct dentry *dentry,
+					 const struct pid_entry *ents,
+					 unsigned int nents)
+{
+	struct task_struct *task;
+	int error = -ENOENT;
+
+	task = get_proc_task(dir);
+	if (task) {
+		error = proc_pident_lookup_task(dir, dentry, ents,
+						nents, task);
+		put_task_struct(task);
+	}
 
-	error = proc_pident_instantiate(dir, dentry, task, p);
-out:
-	put_task_struct(task);
-out_no_task:
 	return ERR_PTR(error);
 }
 
@@ -2504,8 +2525,7 @@ static int proc_pident_readdir(struct file *file, struct dir_context *ctx,
 		goto out;
 
 	for (p = ents + (ctx->pos - 2); p < ents + nents; p++) {
-		if (!proc_fill_cache(file, ctx, p->name, p->len,
-				proc_pident_instantiate, task, p))
+		if (!proc_fill_cache_entry(file, ctx, p, task))
 			break;
 		ctx->pos++;
 	}
-- 
2.12.3


* [PATCH 2/5] procfs: factor out inode revalidation work from pid_revalidation
  2018-04-24  2:21 [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks jeffm
  2018-04-24  2:21 ` [PATCH 1/5] procfs: factor out a few helpers jeffm
@ 2018-04-24  2:21 ` jeffm
  2018-04-24  2:21 ` [PATCH 3/5] procfs: use symlinks for /proc/<pid>/task when not thread group leader jeffm
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: jeffm @ 2018-04-24  2:21 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: Al Viro, Eric W . Biederman, Alexey Dobriyan, Oleg Nesterov,
	Jeff Mahoney

From: Jeff Mahoney <jeffm@suse.com>

Subsequent patches will introduce their own d_revalidate callbacks, and
we won't want to look up the task multiple times.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
---
 fs/proc/base.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index e9876a89a5ad..e7ca45504a5f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1802,14 +1802,24 @@ int pid_getattr(const struct path *path, struct kstat *stat,
 
 /* dentry stuff */
 
+
+/*
+ * Rewrite the inode's ownerships here because the owning task may have
+ * performed a setuid(), etc.
+ */
+static void pid_revalidate_inode(struct inode *inode, struct task_struct *task)
+{
+	task_dump_owner(task, inode->i_mode, &inode->i_uid, &inode->i_gid);
+
+	inode->i_mode &= ~(S_ISUID | S_ISGID);
+	security_task_to_inode(task, inode);
+}
+
 /*
  *	Exceptional case: normally we are not allowed to unhash a busy
  * directory. In this case, however, we can do it - no aliasing problems
  * due to the way we treat inodes.
  *
- * Rewrite the inode's ownerships here because the owning task may have
- * performed a setuid(), etc.
- *
  */
 int pid_revalidate(struct dentry *dentry, unsigned int flags)
 {
@@ -1823,10 +1833,7 @@ int pid_revalidate(struct dentry *dentry, unsigned int flags)
 	task = get_proc_task(inode);
 
 	if (task) {
-		task_dump_owner(task, inode->i_mode, &inode->i_uid, &inode->i_gid);
-
-		inode->i_mode &= ~(S_ISUID | S_ISGID);
-		security_task_to_inode(task, inode);
+		pid_revalidate_inode(inode, task);
 		put_task_struct(task);
 		return 1;
 	}
-- 
2.12.3


* [PATCH 3/5] procfs: use symlinks for /proc/<pid>/task when not thread group leader
  2018-04-24  2:21 [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks jeffm
  2018-04-24  2:21 ` [PATCH 1/5] procfs: factor out a few helpers jeffm
  2018-04-24  2:21 ` [PATCH 2/5] procfs: factor out inode revalidation work from pid_revalidation jeffm
@ 2018-04-24  2:21 ` jeffm
  2018-04-24  2:21 ` [PATCH 4/5] procfs: share common directories between /proc/tgid and /proc/tgid/task/tgid jeffm
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: jeffm @ 2018-04-24  2:21 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: Al Viro, Eric W . Biederman, Alexey Dobriyan, Oleg Nesterov,
	Jeff Mahoney

From: Jeff Mahoney <jeffm@suse.com>

Although readdir only lists thread group leaders at the tgid-level of
/proc, it's possible to do a lookup to get individual threads back.  The
directory contains all of the usual tgid-level files and directories,
including task.  The task directory contains directories for every sibling
thread populated with the usual complement of files, all of which are
identical to the files contained under the tgid's own task directory.
If every thread is looked up, we'll create n^2 directories and there
is no sharing among them.  For a 3000-thread task, that's 9 million
task directories.

This patch avoids the duplication by retaining the tgid's copy of
the task directory and converting the other threads' task directory
to a symbolic link to the tgid's copy.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
---
 fs/proc/base.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 116 insertions(+), 7 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index e7ca45504a5f..de12bd2137ac 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2948,7 +2948,6 @@ static const struct file_operations proc_task_operations;
 static const struct inode_operations proc_task_inode_operations;
 
 static const struct pid_entry tgid_base_stuff[] = {
-	DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
 	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
 	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
@@ -3047,10 +3046,96 @@ static const struct pid_entry tgid_base_stuff[] = {
 #endif
 };
 
+/*
+ * Don't instantiate a full duplicate of the thread leader's task
+ * directory for every member of the task group.  Just symlink to the
+ * thread leader's copy.
+ */
+static const char *proc_tgid_task_symlink_get_link(struct dentry *dentry,
+						   struct inode *inode,
+						   struct delayed_call *done)
+{
+	struct task_struct *task;
+	char *link = ERR_PTR(-ENOENT);
+
+	if (!dentry)
+		return ERR_PTR(-ECHILD);
+
+	task = get_proc_task(inode);
+	if (task) {
+		struct pid_namespace *ns = inode->i_sb->s_fs_info;
+
+		link = kasprintf(GFP_KERNEL, "../%u/task",
+				 pid_nr_ns(task_tgid(task), ns));
+		if (link)
+			set_delayed_call(done, kfree_link, link);
+		else
+			link = ERR_PTR(-ENOMEM);
+		put_task_struct(task);
+	}
+	return link;
+}
+
+static const struct inode_operations proc_task_symlink_inode_operations = {
+	.get_link	= proc_tgid_task_symlink_get_link,
+	.setattr	= proc_setattr,
+};
+
+static const struct pid_entry proc_tgid_task_symlink_entry = {
+	.name		= "task",
+	.len		= sizeof("task") - 1,
+	.mode		= S_IFLNK|S_IRWXUGO,
+	.iop		= &proc_task_symlink_inode_operations,
+};
+
+static const struct pid_entry proc_tgid_task_dir_entry = {
+	.name		= "task",
+	.len		= sizeof("task") - 1,
+	.mode		= S_IFDIR|S_IRUGO|S_IXUGO,
+	.iop		= &proc_task_inode_operations,
+	.fop		= &proc_task_operations,
+};
+
+static const struct pid_entry *proc_tgid_task_entry(struct task_struct *task)
+{
+	if (thread_group_leader(task))
+		return &proc_tgid_task_dir_entry;
+	else
+		return &proc_tgid_task_symlink_entry;
+}
+
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
 {
-	return proc_pident_readdir(file, ctx,
-				   tgid_base_stuff, ARRAY_SIZE(tgid_base_stuff));
+	const struct pid_entry *entry;
+	struct task_struct *task;
+	int i;
+
+	task = get_proc_task(file_inode(file));
+	if (!task)
+		return -ENOENT;
+
+	if (!dir_emit_dots(file, ctx))
+		goto out;
+
+	/* Add /proc/pid/task entry */
+	if (ctx->pos == 2) {
+		entry = proc_tgid_task_entry(task);
+
+		if (!proc_fill_cache_entry(file, ctx, entry, task))
+			goto out;
+		ctx->pos++;
+	}
+
+	for (i = ctx->pos - 3; i < ARRAY_SIZE(tgid_base_stuff); i++) {
+		entry = &tgid_base_stuff[i];
+
+		if (!proc_fill_cache_entry(file, ctx, entry, task))
+			goto out;
+		ctx->pos++;
+	}
+out:
+	put_task_struct(task);
+	return 0;
 }
 
 static const struct file_operations proc_tgid_base_operations = {
@@ -3059,10 +3144,29 @@ static const struct file_operations proc_tgid_base_operations = {
 	.llseek		= generic_file_llseek,
 };
 
-static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
+
+static struct dentry *proc_tgid_base_lookup(struct inode *dir,
+					    struct dentry *dentry,
+					    unsigned int flags)
 {
-	return proc_pident_lookup(dir, dentry,
-				  tgid_base_stuff, ARRAY_SIZE(tgid_base_stuff));
+	struct task_struct *task;
+	int error = -ENOENT;
+
+	task = get_proc_task(dir);
+	if (!task)
+		goto out;
+
+	/* Handle /proc/pid/task separately */
+	if (pid_entry_match_dentry(&proc_tgid_task_dir_entry, dentry))
+		error = proc_pident_instantiate(dir, dentry, task,
+						proc_tgid_task_entry(task));
+	else
+		error = proc_pident_lookup_task(dir, dentry, tgid_base_stuff,
+						ARRAY_SIZE(tgid_base_stuff),
+						task);
+	put_task_struct(task);
+out:
+	return ERR_PTR(error);
 }
 
 static const struct inode_operations proc_tgid_base_inode_operations = {
@@ -3163,6 +3267,7 @@ static int proc_pid_instantiate(struct inode *dir,
 				   struct task_struct *task, const void *ptr)
 {
 	struct inode *inode;
+	int nlinks = nlink_tgid;
 
 	inode = proc_pid_make_inode(dir->i_sb, task, S_IFDIR | S_IRUGO | S_IXUGO);
 	if (!inode)
@@ -3172,7 +3277,11 @@ static int proc_pid_instantiate(struct inode *dir,
 	inode->i_fop = &proc_tgid_base_operations;
 	inode->i_flags|=S_IMMUTABLE;
 
-	set_nlink(inode, nlink_tgid);
+	/* The group leader has a directory */
+	if (thread_group_leader(task))
+		nlinks++;
+
+	set_nlink(inode, nlinks);
 
 	d_set_d_op(dentry, &pid_dentry_operations);
 
-- 
2.12.3


* [PATCH 4/5] procfs: share common directories between /proc/tgid and /proc/tgid/task/tgid
  2018-04-24  2:21 [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks jeffm
                   ` (2 preceding siblings ...)
  2018-04-24  2:21 ` [PATCH 3/5] procfs: use symlinks for /proc/<pid>/task when not thread group leader jeffm
@ 2018-04-24  2:21 ` jeffm
  2018-04-24  2:21 ` [PATCH 5/5] procfs: share fd/fdinfo with thread group leader when files are shared jeffm
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: jeffm @ 2018-04-24  2:21 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: Al Viro, Eric W . Biederman, Alexey Dobriyan, Oleg Nesterov,
	Jeff Mahoney

From: Jeff Mahoney <jeffm@suse.com>

The contents of /proc/tgid and /proc/tgid/task/tgid are nearly, but not
completely, identical.  Some files (and one directory) exist in only one
or the other.  Therefore, we can't replace one with a symbolic link.
Creating links for the common files doesn't gain us anything since we'll
still need the dentries and inodes for them.  What we can do is create
links for the common subdirectories so the files below them are shared.

This patch converts the fd, fdinfo, ns, net, and attr directories at the
/proc/tgid level to symbolic links that point to the directories in
/proc/tgid/task/tgid.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
---
 fs/proc/base.c | 44 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 39 insertions(+), 5 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index de12bd2137ac..005b4f8a19c2 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -140,6 +140,9 @@ struct pid_entry {
 	NOD(NAME, (S_IFLNK|S_IRWXUGO),				\
 		&proc_pid_link_inode_operations, NULL,		\
 		{ .proc_get_link = get_link } )
+#define TASKLNK(NAME)						\
+	NOD(NAME, (S_IFLNK|S_IRWXUGO),				\
+		&proc_tgid_to_pid_link_inode_operations, NULL, {})
 #define REG(NAME, MODE, fops)				\
 	NOD(NAME, (S_IFREG|(MODE)), NULL, &fops, {})
 #define ONE(NAME, MODE, show)				\
@@ -2941,6 +2944,37 @@ static int proc_pid_patch_state(struct seq_file *m, struct pid_namespace *ns,
 }
 #endif /* CONFIG_LIVEPATCH */
 
+static const char *proc_tgid_to_pid_symlink_get_link(struct dentry *dentry,
+						     struct inode *inode,
+						     struct delayed_call *done)
+{
+	struct task_struct *task;
+	char *link = ERR_PTR(-ENOENT);
+
+	if (!dentry)
+		return ERR_PTR(-ECHILD);
+
+	task = get_proc_task(inode);
+	if (task) {
+		struct pid_namespace *ns = inode->i_sb->s_fs_info;
+
+		link = kasprintf(GFP_KERNEL, "task/%u/%.*s",
+				 pid_nr_ns(PROC_I(inode)->pid, ns),
+				 dentry->d_name.len, dentry->d_name.name);
+		if (link)
+			set_delayed_call(done, kfree_link, link);
+		else
+			link = ERR_PTR(-ENOMEM);
+		put_task_struct(task);
+	}
+	return link;
+}
+
+static const struct inode_operations proc_tgid_to_pid_link_inode_operations = {
+	.get_link	= proc_tgid_to_pid_symlink_get_link,
+	.setattr	= proc_setattr,
+};
+
 /*
  * Thread groups
  */
@@ -2948,12 +2982,12 @@ static const struct file_operations proc_task_operations;
 static const struct inode_operations proc_task_inode_operations;
 
 static const struct pid_entry tgid_base_stuff[] = {
-	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
+	TASKLNK("fd"),
 	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
-	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
-	DIR("ns",	  S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
+	TASKLNK("fdinfo"),
+	TASKLNK("ns"),
 #ifdef CONFIG_NET
-	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
+	TASKLNK("net"),
 #endif
 	REG("environ",    S_IRUSR, proc_environ_operations),
 	REG("auxv",       S_IRUSR, proc_auxv_operations),
@@ -2991,7 +3025,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 	REG("pagemap",    S_IRUSR, proc_pagemap_operations),
 #endif
 #ifdef CONFIG_SECURITY
-	DIR("attr",       S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
+	TASKLNK("attr"),
 #endif
 #ifdef CONFIG_KALLSYMS
 	ONE("wchan",      S_IRUGO, proc_pid_wchan),
-- 
2.12.3


* [PATCH 5/5] procfs: share fd/fdinfo with thread group leader when files are shared
  2018-04-24  2:21 [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks jeffm
                   ` (3 preceding siblings ...)
  2018-04-24  2:21 ` [PATCH 4/5] procfs: share common directories between /proc/tgid and /proc/tgid/task/tgid jeffm
@ 2018-04-24  2:21 ` jeffm
  2018-04-24 15:41   ` kbuild test robot
  2018-04-24 15:41   ` [RFC PATCH] procfs: proc_pid_files_link_dentry_operations can be static kbuild test robot
  2018-04-24  6:17 ` [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks Alexey Dobriyan
  2018-04-24 14:14 ` Eric W. Biederman
  6 siblings, 2 replies; 15+ messages in thread
From: jeffm @ 2018-04-24  2:21 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: Al Viro, Eric W . Biederman, Alexey Dobriyan, Oleg Nesterov,
	Jeff Mahoney

From: Jeff Mahoney <jeffm@suse.com>

When we have a single task with e.g. 4096 threads and 16k files open,
we can create over 134 million inode and dentry pairs just to back the fd
and fdinfo directories.  On smaller systems, memory pressure keeps the
number relatively contained.  On huge systems, all of these can fit
in memory.  The wasted memory is a problem, but the real problem is
what happens when that task exits.  Every task attempts to free its
own proc files, and we end up with a system that becomes unresponsive
for several minutes due to contention on the super's inode list lock.

The thing is, except for threads that have called unshare(CLONE_FILES),
the contents of every one of these directories are identical and come
from the same files_struct.

This patch uses symbolic links to the thread group leader's fd and
fdinfo directories if the thread and the group leader have the same
files_struct.  If the thread calls unshare(CLONE_FILES), the
d_revalidate callback will bounce the symlink and the lookup will
create a directory.  If it's the thread group leader that calls
unshare, no symlinks will be used.

In the 4096 threads * 16k files case, the total procfs load is
about 600k files instead of 134M.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
---
 fs/proc/base.c | 242 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 231 insertions(+), 11 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 005b4f8a19c2..dbdc2f9b2c58 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -122,6 +122,7 @@ struct pid_entry {
 	umode_t mode;
 	const struct inode_operations *iop;
 	const struct file_operations *fop;
+	const struct dentry_operations *dop;
 	union proc_op op;
 };
 
@@ -2438,6 +2439,7 @@ static const struct file_operations proc_pid_set_timerslack_ns_operations = {
 static int proc_pident_instantiate(struct inode *dir,
 	struct dentry *dentry, struct task_struct *task, const void *ptr)
 {
+	const struct dentry_operations *dops = &pid_dentry_operations;
 	const struct pid_entry *p = ptr;
 	struct inode *inode;
 	struct proc_inode *ei;
@@ -2454,7 +2456,9 @@ static int proc_pident_instantiate(struct inode *dir,
 	if (p->fop)
 		inode->i_fop = p->fop;
 	ei->op = p->op;
-	d_set_d_op(dentry, &pid_dentry_operations);
+	if (p->dop)
+		dops = p->dop;
+	d_set_d_op(dentry, dops);
 	d_add(dentry, inode);
 	/* Close the race of the process dying before we return the dentry */
 	if (pid_revalidate(dentry, 0))
@@ -3482,12 +3486,136 @@ static const struct inode_operations proc_tid_comm_inode_operations = {
 		.permission = proc_tid_comm_permission,
 };
 
+static const char *proc_pid_sibling_get_link(struct dentry *dentry,
+					     struct inode *inode,
+					     struct delayed_call *done)
+{
+	struct task_struct *task;
+	char *link = ERR_PTR(-ENOENT);
+
+	if (!dentry)
+		return ERR_PTR(-ECHILD);
+
+	task = get_proc_task(inode);
+	if (task) {
+		struct pid_namespace *ns = inode->i_sb->s_fs_info;
+
+		link = kasprintf(GFP_KERNEL, "../%u/%.*s",
+				 pid_nr_ns(task_tgid(task), ns),
+				 dentry->d_name.len,
+				 dentry->d_name.name);
+		if (link)
+			set_delayed_call(done, kfree_link, link);
+		else
+			link = ERR_PTR(-ENOMEM);
+
+		put_task_struct(task);
+	}
+
+	return link;
+}
+
+static const struct inode_operations proc_pid_sibling_symlink_inode_operations = {
+	.get_link	= proc_pid_sibling_get_link,
+	.setattr	= proc_setattr,
+};
+
+static bool tasks_share_files(const struct task_struct *task)
+{
+	return task->files == task->group_leader->files;
+}
+
+static int proc_pid_files_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct task_struct *task;
+	struct inode *inode;
+	int ret = 1;
+
+	if (flags & LOOKUP_RCU)
+		return -ECHILD;
+
+	inode = d_inode(dentry);
+	task = get_proc_task(inode);
+	if (!task)
+		return -ENOENT;
+
+	pid_revalidate_inode(inode, task);
+
+	/*
+	 * This thread called unshare(CLONE_FILES).
+	 * We need to turn it into a directory.
+	 */
+	if (!thread_group_leader(task) && (inode->i_mode & S_IFLNK) &&
+	    !tasks_share_files(task))
+		ret = 0;
+
+	put_task_struct(task);
+	return ret;
+}
+
+/*
+ * This only gets used with the symbolic links.  Once converted to a
+ * directory, there's no more work to do beyond pid_revalidate_inode, so
+ * we just use the regular pid_dentry_operations.
+ */
+const struct dentry_operations proc_pid_files_link_dentry_operations = {
+	.d_revalidate	= proc_pid_files_revalidate,
+	.d_delete	= pid_delete_dentry,
+};
+
+static const struct pid_entry proc_pid_fd_dir_entry = {
+	.name	= "fd",
+	.len	= sizeof("fd") - 1,
+	.mode	= S_IFDIR|S_IRUSR|S_IXUSR,
+	.iop	= &proc_fd_inode_operations,
+	.fop	= &proc_fd_operations,
+};
+
+static const struct pid_entry proc_pid_fd_link_entry = {
+	.name	= "fd",
+	.len	= sizeof("fd") - 1,
+	.mode	= S_IFLNK|S_IRWXUGO,
+	.iop	= &proc_pid_sibling_symlink_inode_operations,
+	.dop	= &proc_pid_files_link_dentry_operations
+};
+
+static const struct pid_entry *proc_pid_fd_pid_entry(struct task_struct *task)
+{
+	if (thread_group_leader(task) || !tasks_share_files(task))
+		return &proc_pid_fd_dir_entry;
+	else
+		return &proc_pid_fd_link_entry;
+}
+
+static const struct pid_entry proc_pid_fdinfo_dir_entry = {
+	.name	= "fdinfo",
+	.len	= sizeof("fdinfo") - 1,
+	.mode	= S_IFDIR|S_IRUSR|S_IXUSR,
+	.iop	= &proc_fdinfo_inode_operations,
+	.fop	= &proc_fdinfo_operations,
+};
+
+static const struct pid_entry proc_pid_fdinfo_link_entry = {
+	.name	= "fdinfo",
+	.len	= sizeof("fdinfo") - 1,
+	.mode	= S_IFLNK|S_IRWXUGO,
+	.iop	= &proc_pid_sibling_symlink_inode_operations,
+	.dop	= &proc_pid_files_link_dentry_operations
+};
+
+static const struct pid_entry *proc_pid_fdinfo_pid_entry(
+						struct task_struct *task)
+{
+	if (thread_group_leader(task) || !tasks_share_files(task))
+		return &proc_pid_fdinfo_dir_entry;
+	else
+		return &proc_pid_fdinfo_link_entry;
+}
+
 /*
  * Tasks
  */
 static const struct pid_entry tid_base_stuff[] = {
-	DIR("fd",        S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
-	DIR("fdinfo",    S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
 	DIR("ns",	 S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
 	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
@@ -3579,14 +3707,71 @@ static const struct pid_entry tid_base_stuff[] = {
 
 static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
 {
-	return proc_pident_readdir(file, ctx,
-				   tid_base_stuff, ARRAY_SIZE(tid_base_stuff));
+	const struct pid_entry *entry;
+	struct task_struct *task = get_proc_task(file_inode(file));
+	int i;
+
+	if (!task)
+		return -ENOENT;
+
+	if (!dir_emit_dots(file, ctx))
+		goto out;
+
+	if (ctx->pos == 2) {
+		entry = proc_pid_fd_pid_entry(task);
+
+		if (!proc_fill_cache_entry(file, ctx, entry, task))
+			goto out;
+		ctx->pos++;
+	}
+
+	if (ctx->pos == 3) {
+		entry = proc_pid_fdinfo_pid_entry(task);
+
+		if (!proc_fill_cache_entry(file, ctx, entry, task))
+			goto out;
+		ctx->pos++;
+	}
+
+	for (i = ctx->pos - 4; i < ARRAY_SIZE(tid_base_stuff); i++) {
+		entry = &tid_base_stuff[i];
+
+		if (!proc_fill_cache_entry(file, ctx, entry, task))
+			goto out;
+		ctx->pos++;
+	}
+
+out:
+	put_task_struct(task);
+	return 0;
 }
 
-static struct dentry *proc_tid_base_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
+static struct dentry *proc_tid_base_lookup(struct inode *dir,
+					   struct dentry *dentry,
+					   unsigned int flags)
 {
-	return proc_pident_lookup(dir, dentry,
-				  tid_base_stuff, ARRAY_SIZE(tid_base_stuff));
+	struct task_struct *task;
+	int error;
+
+	task = get_proc_task(dir);
+	if (!task)
+		return ERR_PTR(-ENOENT);
+
+	/* /proc/pid/task/pid/fd */
+	if (pid_entry_match_dentry(&proc_pid_fd_dir_entry, dentry))
+		error = proc_pident_instantiate(dir, dentry, task,
+						proc_pid_fd_pid_entry(task));
+	/* /proc/pid/task/pid/fdinfo */
+	else if (pid_entry_match_dentry(&proc_pid_fdinfo_dir_entry, dentry))
+		error = proc_pident_instantiate(dir, dentry, task,
+					proc_pid_fdinfo_pid_entry(task));
+	else
+		error = proc_pident_lookup_task(dir, dentry, tid_base_stuff,
+						ARRAY_SIZE(tid_base_stuff),
+						task);
+
+	put_task_struct(task);
+	return ERR_PTR(error);
 }
 
 static const struct file_operations proc_tid_base_operations = {
@@ -3601,6 +3786,42 @@ static const struct inode_operations proc_tid_base_inode_operations = {
 	.setattr	= proc_setattr,
 };
 
+static int proc_task_count_links(struct task_struct *task)
+{
+	int nlinks = nlink_tid;
+
+	/* Shared files: symlinks for fd and fdinfo */
+	if (!thread_group_leader(task) && tasks_share_files(task))
+		nlinks++;
+
+	return nlinks;
+}
+
+static int tid_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct inode *inode;
+	struct task_struct *task;
+
+	if (flags & LOOKUP_RCU)
+		return -ECHILD;
+
+	inode = d_inode(dentry);
+	task = get_proc_task(inode);
+
+	if (task) {
+		pid_revalidate_inode(inode, task);
+		set_nlink(inode, proc_task_count_links(task));
+		put_task_struct(task);
+		return 1;
+	}
+	return 0;
+}
+
+static const struct dentry_operations proc_tid_dentry_operations = {
+	.d_revalidate	= tid_revalidate,
+	.d_delete	= pid_delete_dentry,
+};
+
 static int proc_task_instantiate(struct inode *dir,
 	struct dentry *dentry, struct task_struct *task, const void *ptr)
 {
@@ -3613,9 +3834,8 @@ static int proc_task_instantiate(struct inode *dir,
 	inode->i_fop = &proc_tid_base_operations;
 	inode->i_flags|=S_IMMUTABLE;
 
-	set_nlink(inode, nlink_tid);
-
-	d_set_d_op(dentry, &pid_dentry_operations);
+	set_nlink(inode, proc_task_count_links(task));
+	d_set_d_op(dentry, &proc_tid_dentry_operations);
 
 	d_add(dentry, inode);
 	/* Close the race of the process dying before we return the dentry */
-- 
2.12.3

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks
  2018-04-24  2:21 [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks jeffm
                   ` (4 preceding siblings ...)
  2018-04-24  2:21 ` [PATCH 5/5] procfs: share fd/fdinfo with thread group leader when files are shared jeffm
@ 2018-04-24  6:17 ` Alexey Dobriyan
  2018-04-25 18:04   ` Jeff Mahoney
  2018-04-24 14:14 ` Eric W. Biederman
  6 siblings, 1 reply; 15+ messages in thread
From: Alexey Dobriyan @ 2018-04-24  6:17 UTC (permalink / raw)
  To: jeffm
  Cc: linux-kernel, linux-fsdevel, Al Viro, Eric W . Biederman, Oleg Nesterov

On Mon, Apr 23, 2018 at 10:21:01PM -0400, jeffm@suse.com wrote:
> Memory pressure isn't really an issue on this machine, so we
> end up using well over 100GB for proc files.

Text files at scale!

> With these patches applied, running the same testcase, the proc_inode
> cache only gets to about 600k objects, which is about 99.7% fewer.  I
> get that procfs isn't supposed to be scalable, but this is kind of
> extreme. :)

Easy stuff:
* all ->get_link hooks are broken in RCU lookup (use GFP_KERNEL),
* "%.*s" for dentry names is probably unnecessary,
  they're always NUL terminated
* kasprintf() does the printing twice; since we kind of care about /proc
  performance, allocate for the worst case instead.

* "int nlinks = nlink_tgid;"
  Unsigned police.

* (inode->i_mode & S_IFLNK)
	this is sketchy, S_ISLNK exists.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks
  2018-04-24  2:21 [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks jeffm
                   ` (5 preceding siblings ...)
  2018-04-24  6:17 ` [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks Alexey Dobriyan
@ 2018-04-24 14:14 ` Eric W. Biederman
  2018-04-26 21:03   ` Jeff Mahoney
  2019-03-21 18:30   ` Jeff Mahoney
  6 siblings, 2 replies; 15+ messages in thread
From: Eric W. Biederman @ 2018-04-24 14:14 UTC (permalink / raw)
  To: jeffm
  Cc: linux-kernel, linux-fsdevel, Al Viro, Alexey Dobriyan, Oleg Nesterov

jeffm@suse.com writes:

> From: Jeff Mahoney <jeffm@suse.com>
>
> Hi all -
>
> I recently encountered a customer issue where, on a machine with many TiB
> of memory and a few hundred cores, after a task with a few thousand threads
> and hundreds of files open exited, the system would softlockup.  That
> issue was (is still) being addressed by Nik Borisov's patch to add a
> cond_resched call to shrink_dentry_list.  The underlying issue is still
> there, though.  We just don't complain as loudly.  When a huge task
> exits, now the system is more or less unresponsive for about eight
> minutes.  All CPUs are pinned and every one of them is going through
> dentry and inode eviction for the procfs files associated with each
> thread.  It's made worse by every CPU contending on the super's
> inode list lock.
>
> The numbers get big.  My test case was 4096 threads with 16384 files
> open.  It's a contrived example, but not that far off from the actual
> customer case.  In this case, a simple "find /proc" would create around
> 300 million dentry/inode pairs.  More practically, lsof(1) does it too,
> it just takes longer.  On smaller systems, memory pressure starts pushing
> them out. Memory pressure isn't really an issue on this machine, so we
> end up using well over 100GB for proc files.  It's the combination of
> the wasted CPU cycles in teardown and the wasted memory at runtime that
> pushed me to take this approach.
>
> The biggest culprit is the "fd" and "fdinfo" directories, but those are
> made worse by there being multiple copies of them even for the same
> task without threads getting involved:
>
> - /proc/pid/fd and /proc/pid/task/pid/fd are identical but share no
>   resources.
>
> - Every /proc/pid/task/*/fd directory in a thread group has identical
>   contents (unless unshare(CLONE_FILES) was called), but share no
>   resources.
>
> - If we do a lookup like /proc/pid/fd on a member of a thread group,
>   we'll get a valid directory.  Inside, there will be a complete
>   copy of /proc/pid/task/* just like in /proc/tgid/task.  Again,
>   nothing is shared.
>
> This patch set reduces some (most) of the duplication by conditionally
> replacing some of the directories with symbolic links to copies that are
> identical.
>
> 1) Eliminate the duplication of the task directories between threads.
>    The task directory belongs to the thread leader and the threads
>    link to it: e.g. /proc/915/task -> ../910/task  This mainly
>    reduces duplication when individual threads are looked up directly
>    at the tgid level.  The impact varies based on the number of threads.
>    The user has to go out of their way in order to mess up their system
>    in this way.  But if they were so inclined, they could create ~550
>    billion inodes and dentries using the test case.
>
> 2) Eliminate the duplication of directories that are created identically
>    between the tgid-level pid directory and its task directory: fd,
>    fdinfo, ns, net, attr.  There is obviously more duplication between
>    the two directories, but replacing a file with a symbolic link
>    doesn't get us anything.  This reduces the number of files associated
>    with fd and fdinfo by half if threads aren't involved.
>
> 3) Eliminate the duplication of fd and fdinfo directories among threads
>    that share a files_struct.  We check at directory creation time if
>    the task is a group leader and if not, whether it shares ->files with
>    the group leader.  If so, we create a symbolic link to ../tgid/fd*.
>    We use a d_revalidate callback to check whether the thread has called
>    unshare(CLONE_FILES) and, if so, fail the revalidation for the symlink.
>    Upon re-lookup, a directory will be created in its place.  This is
>    pretty simple, so if the thread group leader calls unshare, all threads
>    get directories.
>
> With these patches applied, running the same testcase, the proc_inode
> cache only gets to about 600k objects, which is about 99.7% fewer.  I
> get that procfs isn't supposed to be scalable, but this is kind of
> extreme. :)
>
> Finally, I'm not a procfs expert.  I'm posting this as an RFC for folks
> with more knowledge of the details to pick it apart.  The biggest is that
> I'm not sure if any tools depend on any of these things being directories
> instead of symlinks.  I'd hope not, but I don't have the answer.  I'm
> sure there are corner cases I'm missing.  Hopefully, it's not just flat
> out broken since this is a problem that does need solving.
>
> Now I'll go put on the fireproof suit.

This needs to be tested against at least apparmor to see if this breaks
common policies.  Changing files to symlinks in proc has a bad habit of
either breaking apparmor policies or userspace assumptions.   Symbolic
links are unfortunately visible to userspace.

Further the proc structure is tgid/task/tid where the leaf directories
are per thread.

We more likely could get away with some magic symlinks (that would not
be user visible) rather than actual symlinks.

So I think you are probably on the right track to reduce the memory
usage but I think some more work will be needed to make it transparently
backwards compatible.

Eric

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC PATCH] procfs: proc_pid_files_link_dentry_operations can be static
  2018-04-24  2:21 ` [PATCH 5/5] procfs: share fd/fdinfo with thread group leader when files are shared jeffm
  2018-04-24 15:41   ` kbuild test robot
@ 2018-04-24 15:41   ` kbuild test robot
  1 sibling, 0 replies; 15+ messages in thread
From: kbuild test robot @ 2018-04-24 15:41 UTC (permalink / raw)
  To: jeffm
  Cc: kbuild-all, linux-kernel, linux-fsdevel, Al Viro,
	Eric W . Biederman, Alexey Dobriyan, Oleg Nesterov, Jeff Mahoney


Fixes: fc3babee0341 ("procfs: share fd/fdinfo with thread group leader when files are shared")
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 base.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 98a847b..deb0950 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3565,7 +3565,7 @@ static int proc_pid_files_revalidate(struct dentry *dentry, unsigned int flags)
  * directory, there's no more work to do beyond pid_revalidate_inode, so
  * we just use the regular pid_dentry_operations.
  */
-const struct dentry_operations proc_pid_files_link_dentry_operations = {
+static const struct dentry_operations proc_pid_files_link_dentry_operations = {
 	.d_revalidate	= proc_pid_files_revalidate,
 	.d_delete	= pid_delete_dentry,
 };

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 5/5] procfs: share fd/fdinfo with thread group leader when files are shared
  2018-04-24  2:21 ` [PATCH 5/5] procfs: share fd/fdinfo with thread group leader when files are shared jeffm
@ 2018-04-24 15:41   ` kbuild test robot
  2018-04-24 15:41   ` [RFC PATCH] procfs: proc_pid_files_link_dentry_operations can be static kbuild test robot
  1 sibling, 0 replies; 15+ messages in thread
From: kbuild test robot @ 2018-04-24 15:41 UTC (permalink / raw)
  To: jeffm
  Cc: kbuild-all, linux-kernel, linux-fsdevel, Al Viro,
	Eric W . Biederman, Alexey Dobriyan, Oleg Nesterov, Jeff Mahoney

Hi Jeff,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.17-rc2 next-20180424]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/jeffm-suse-com/procfs-reduce-duplication-by-using-symlinks/20180424-203620
reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

   fs/proc/base.c:285:34: sparse: expression using sizeof(void)
   fs/proc/base.c:285:34: sparse: expression using sizeof(void)
   fs/proc/base.c:285:34: sparse: expression using sizeof(void)
   fs/proc/base.c:285:34: sparse: expression using sizeof(void)
   fs/proc/base.c:285:34: sparse: expression using sizeof(void)
   fs/proc/base.c:285:34: sparse: expression using sizeof(void)
   fs/proc/base.c:285:34: sparse: expression using sizeof(void)
   fs/proc/base.c:285:34: sparse: expression using sizeof(void)
   fs/proc/base.c:285:34: sparse: expression using sizeof(void)
   fs/proc/base.c:285:34: sparse: expression using sizeof(void)
   fs/proc/base.c:285:34: sparse: expression using sizeof(void)
   fs/proc/base.c:285:34: sparse: expression using sizeof(void)
   fs/proc/base.c:285:34: sparse: expression using sizeof(void)
   fs/proc/base.c:331:42: sparse: expression using sizeof(void)
   fs/proc/base.c:331:42: sparse: expression using sizeof(void)
   fs/proc/base.c:331:42: sparse: expression using sizeof(void)
   fs/proc/base.c:331:42: sparse: expression using sizeof(void)
   fs/proc/base.c:331:42: sparse: expression using sizeof(void)
   fs/proc/base.c:331:42: sparse: expression using sizeof(void)
   fs/proc/base.c:331:42: sparse: expression using sizeof(void)
   fs/proc/base.c:331:42: sparse: expression using sizeof(void)
   fs/proc/base.c:331:42: sparse: expression using sizeof(void)
   fs/proc/base.c:331:42: sparse: expression using sizeof(void)
   fs/proc/base.c:331:42: sparse: expression using sizeof(void)
   fs/proc/base.c:331:42: sparse: expression using sizeof(void)
   fs/proc/base.c:331:42: sparse: expression using sizeof(void)
   fs/proc/base.c:834:32: sparse: expression using sizeof(void)
   fs/proc/base.c:950:27: sparse: expression using sizeof(void)
   fs/proc/base.c:951:28: sparse: expression using sizeof(void)
   fs/proc/base.c:951:28: sparse: expression using sizeof(void)
   fs/proc/base.c:2085:25: sparse: cast to restricted fmode_t
   fs/proc/base.c:2140:42: sparse: cast from restricted fmode_t
   fs/proc/base.c:2243:48: sparse: cast from restricted fmode_t
>> fs/proc/base.c:3568:32: sparse: symbol 'proc_pid_files_link_dentry_operations' was not declared. Should it be static?
   fs/proc/base.c:1078:36: sparse: context imbalance in '__set_oom_adj' - unexpected unlock
   fs/proc/base.c:2271:13: sparse: context imbalance in 'timers_start' - wrong count at exit
   fs/proc/base.c:2297:36: sparse: context imbalance in 'timers_stop' - unexpected unlock

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks
  2018-04-24  6:17 ` [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks Alexey Dobriyan
@ 2018-04-25 18:04   ` Jeff Mahoney
  0 siblings, 0 replies; 15+ messages in thread
From: Jeff Mahoney @ 2018-04-25 18:04 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: linux-kernel, linux-fsdevel, Al Viro, Eric W . Biederman, Oleg Nesterov

On 4/24/18 2:17 AM, Alexey Dobriyan wrote:
> On Mon, Apr 23, 2018 at 10:21:01PM -0400, jeffm@suse.com wrote:
>> Memory pressure isn't really an issue on this machine, so we
>> end up using well over 100GB for proc files.
> 
> Text files at scale!
> 
>> With these patches applied, running the same testcase, the proc_inode
>> cache only gets to about 600k objects, which is about 99.7% fewer.  I
>> get that procfs isn't supposed to be scalable, but this is kind of
>> extreme. :)
> 
> Easy stuff:
> * all ->get_link hooks are broken in RCU lookup (use GFP_KERNEL),

It's a pretty common pattern in the kernel, but it's just as easy to set
inode->i_link during instantiation and keep RCU lookup.  There aren't so
many of these that it would be a real burden on memory.

> * "%.*s" for dentry names is probably unnecessary,
>   they're always NUL terminated

Ack.

> * kasprintf() does printing twice, since we're kind of care about /proc
>   performance, allocate for the worst case.

Ack, integrated with ->get_link fix.

> * "int nlinks = nlink_tgid;"
>   Unsigned police.

Ack.  nlink_t{,g}id are both u8, but it's easy to make it consistent.

> * (inode->i_mode & S_IFLNK)
> 	this is sketchy, S_ISLNK exists.
> 

Ack.

Notes of my own:

proc_task_count_links also had the logic backward.  It would add an
extra link to the count for the symlink rather than the dir.

proc_pid_files_revalidate only needs to check if the tasks share files
since it won't be called if it's not a symlink.

Thanks for the review,

-Jeff

-- 
Jeff Mahoney
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks
  2018-04-24 14:14 ` Eric W. Biederman
@ 2018-04-26 21:03   ` Jeff Mahoney
  2019-03-21 18:30   ` Jeff Mahoney
  1 sibling, 0 replies; 15+ messages in thread
From: Jeff Mahoney @ 2018-04-26 21:03 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, linux-fsdevel, Al Viro, Alexey Dobriyan, Oleg Nesterov

On 4/24/18 10:14 AM, Eric W. Biederman wrote:
> jeffm@suse.com writes:
> 
>> [...]
> 
> This needs to be tested against at least apparmor to see if this breaks
> common policies.  Changing files to symlinks in proc has a bad habit of
> either breaking apparmor policies or userspace assumptions.   Symbolic
> links are unfortunately visible to userspace.

That's my biggest concern as well.

> Further the proc structure is tgid/task/tid where the leaf directories
> are per thread.
> 
> We more likely could get away with some magic symlinks (that would not
> be user visible) rather than actual symlinks.

I'd considered that, but we'll need to instantiate other portions of the
tree in order to use those dentries as magic symlink targets.  That
seemed like it might not fly, so I went with the regular symlink approach.

> So I think you are probably on the right track to reduce the memory
> usage but I think some more work will be needed to make it transparently
> backwards compatible.

Thanks for the review,

-Jeff

-- 
Jeff Mahoney
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks
  2018-04-24 14:14 ` Eric W. Biederman
  2018-04-26 21:03   ` Jeff Mahoney
@ 2019-03-21 18:30   ` Jeff Mahoney
  2019-03-23 15:56     ` Eric W. Biederman
  1 sibling, 1 reply; 15+ messages in thread
From: Jeff Mahoney @ 2019-03-21 18:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, linux-fsdevel, Al Viro, Alexey Dobriyan, Oleg Nesterov


[-- Attachment #1.1: Type: text/plain, Size: 6735 bytes --]

On 4/24/18 10:14 AM, Eric W. Biederman wrote:
> jeffm@suse.com writes:
> 
>> [...]

Thanks for your comments.  This ended up having to get back-burnered but
I've finally found some time to get back to it.  I have new patches that
don't treat each entry as a special case and makes more sense, IMO.
They're not worth posting yet since some of the issues below remain.

> This needs to be tested against at least apparmor to see if this breaks
> common policies.  Changing files to symlinks in proc has a bad habit of
> either breaking apparmor policies or userspace assumptions.   Symbolic
> links are unfortunately visible to userspace.

AppArmor uses the @{pids} var in profiles that translates to a numeric
regex.  That means that /proc/pid/task -> /proc/tgid/task won't break
profiles but /proc/pid/fdinfo -> /proc/pid/task/tgid/fdinfo will break.
Apparmor doesn't have a follow_link hook at all, so all that matters is
the final path.  SELinux does have a follow_link hook, but I'm not
familiar enough with it to know whether introducing a symlink in proc
will make a difference.

I've dropped the /proc/pid/{dirs} -> /proc/pid/task/pid/{dirs} part
since that clearly won't work.

> Further the proc structure is tgid/task/tid where the leaf directories
> are per thread.

Yes, but threads are still in /proc for lookup at the tgid level even if
they don't show up in readdir.

> We more likely could get away with some magic symlinks (that would not
> be user visible) rather than actual symlinks.

I think I'm missing something here.  Aren't magic symlinks still
represented to the user as symlinks?

> So I think you are probably on the right track to reduce the memory
> usage but I think some more work will be needed to make it transparently
> backwards compatible.

Yeah, that's going to be the big hiccup.  I think I've resolved the
biggest issue with AppArmor, but I don't think the problem is solvable
without introducing symlinks.

-Jeff

-- 
Jeff Mahoney
SUSE Labs


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks
  2019-03-21 18:30   ` Jeff Mahoney
@ 2019-03-23 15:56     ` Eric W. Biederman
  2019-03-24  3:01       ` Jeff Mahoney
  0 siblings, 1 reply; 15+ messages in thread
From: Eric W. Biederman @ 2019-03-23 15:56 UTC (permalink / raw)
  To: Jeff Mahoney
  Cc: linux-kernel, linux-fsdevel, Al Viro, Alexey Dobriyan, Oleg Nesterov

Jeff Mahoney <jeffm@suse.com> writes:

> On 4/24/18 10:14 AM, Eric W. Biederman wrote:
>> jeffm@suse.com writes:
>> 
>>> From: Jeff Mahoney <jeffm@suse.com>
>>>
>>> Hi all -
>>>
>>> I recently encountered a customer issue where, on a machine with many TiB
>>> of memory and a few hundred cores, after a task with a few thousand threads
>>> and hundreds of files open exited, the system would softlockup.  That
>>> issue was (is still) being addressed by Nik Borisov's patch to add a
>>> cond_resched call to shrink_dentry_list.  The underlying issue is still
>>> there, though.  We just don't complain as loudly.  When a huge task
>>> exits, now the system is more or less unresponsive for about eight
>>> minutes.  All CPUs are pinned and every one of them is going through
>>> dentry and inode eviction for the procfs files associated with each
>>> thread.  It's made worse by every CPU contending on the super's
>>> inode list lock.
>>>
>>> The numbers get big.  My test case was 4096 threads with 16384 files
>>> open.  It's a contrived example, but not that far off from the actual
>>> customer case.  In this case, a simple "find /proc" would create around
>>> 300 million dentry/inode pairs.  More practically, lsof(1) does it too,
>>> it just takes longer.  On smaller systems, memory pressure starts pushing
>>> them out. Memory pressure isn't really an issue on this machine, so we
>>> end up using well over 100GB for proc files.  It's the combination of
>>> the wasted CPU cycles in teardown and the wasted memory at runtime that
>>> pushed me to take this approach.
>>>
>>> The biggest culprit is the "fd" and "fdinfo" directories, but those are
>>> made worse by there being multiple copies of them even for the same
>>> task without threads getting involved:
>>>
>>> - /proc/pid/fd and /proc/pid/task/pid/fd are identical but share no
>>>   resources.
>>>
>>> - Every /proc/pid/task/*/fd directory in a thread group has identical
>>>   contents (unless unshare(CLONE_FILES) was called), but share no
>>>   resources.
>>>
>>> - If we do a lookup like /proc/pid/fd on a member of a thread group,
>>>   we'll get a valid directory.  Inside, there will be a complete
>>>   copy of /proc/pid/task/* just like in /proc/tgid/task.  Again,
>>>   nothing is shared.
>>>
>>> This patch set reduces some (most) of the duplication by conditionally
>>> replacing some of the directories with symbolic links to copies that are
>>> identical.
>>>
>>> 1) Eliminate the duplication of the task directories between threads.
>>>    The task directory belongs to the thread leader and the threads
>>>    link to it, e.g. /proc/915/task -> ../910/task.  This mainly
>>>    reduces duplication when individual threads are looked up directly
>>>    at the tgid level.  The impact varies based on the number of threads.
>>>    The user has to go out of their way in order to mess up their system
>>>    in this way.  But if they were so inclined, they could create ~550
>>>    billion inodes and dentries using the test case.
>>>
>>> 2) Eliminate the duplication of directories that are created identically
>>>    between the tgid-level pid directory and its task directory: fd,
>>>    fdinfo, ns, net, attr.  There is obviously more duplication between
>>>    the two directories, but replacing a file with a symbolic link
>>>    doesn't get us anything.  This reduces the number of files associated
>>>    with fd and fdinfo by half if threads aren't involved.
>>>
>>> 3) Eliminate the duplication of fd and fdinfo directories among threads
>>>    that share a files_struct.  We check at directory creation time if
>>>    the task is a group leader and if not, whether it shares ->files with
>>>    the group leader.  If so, we create a symbolic link to ../tgid/fd*.
>>>    We use a d_revalidate callback to check whether the thread has called
>>>    unshare(CLONE_FILES) and, if so, fail the revalidation for the symlink.
>>>    Upon re-lookup, a directory will be created in its place.  This is
>>>    pretty simple, so if the thread group leader calls unshare, all threads
>>>    get directories.
>>>
>>> With these patches applied, running the same testcase, the proc_inode
>>> cache only gets to about 600k objects, which is about 99.7% fewer.  I
>>> get that procfs isn't supposed to be scalable, but this is kind of
>>> extreme. :)
>>>
>>> Finally, I'm not a procfs expert.  I'm posting this as an RFC for folks
>>> with more knowledge of the details to pick it apart.  The biggest concern
>>> is whether any tools depend on these things being directories instead
>>> of symlinks.  I'd hope not, but I don't have the answer.  I'm
>>> sure there are corner cases I'm missing.  Hopefully, it's not just flat
>>> out broken since this is a problem that does need solving.
>>>
>>> Now I'll go put on the fireproof suit.
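The figures quoted above can be sanity-checked with some quick arithmetic. This is only a sketch: it models the dominant fd/fdinfo terms, while the exact per-task entry counts vary by kernel version, which is why the "find /proc" floor below is smaller than the ~300 million observed:

```python
# Test case from the cover letter: 4096 threads, 16384 open files.
threads = 4096
open_files = 16384

# Worst case from point 1: every thread looked up at the tgid level
# exposes its own full copy of task/, so the fd and fdinfo entries
# multiply by the thread count twice.
worst_case = threads * threads * open_files * 2   # fd + fdinfo
assert worst_case == 549_755_813_888              # ~550 billion

# A plain "find /proc" only descends the tgid's task/ directory, so
# the dominant term is threads * files * 2 -- already over 134 million
# dentry/inode pairs before counting the other per-task entries that
# push the observed total toward ~300 million.
find_proc_floor = threads * open_files * 2
assert find_proc_floor == 134_217_728
```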
>
> Thanks for your comments.  This ended up having to get back-burnered but
> I've finally found some time to get back to it.  I have new patches that
> don't treat each entry as a special case and make more sense, IMO.
> They're not worth posting yet since some of the issues below remain.
>
>> This needs to be tested against at least apparmor to see if this breaks
>> common policies.  Changing files to symlinks in proc has a bad habit of
>> either breaking apparmor policies or userspace assumptions.   Symbolic
>> links are unfortunately visible to userspace.
>
> AppArmor uses the @{pids} var in profiles that translates to a numeric
> regex.  That means that /proc/pid/task -> /proc/tgid/task won't break
> profiles but /proc/pid/fdinfo -> /proc/pid/task/tgid/fdinfo will break.
>  Apparmor doesn't have a follow_link hook at all, so all that matters is
> the final path.  SELinux does have a follow_link hook, but I'm not
> familiar enough with it to know whether introducing a symlink in proc
> will make a difference.
>
> I've dropped the /proc/pid/{dirs} -> /proc/pid/task/pid/{dirs} part
> since that clearly won't work.
>
>> Further the proc structure is tgid/task/tid where the leaf directories
>> are per thread.
>
> Yes, but threads are still in /proc for lookup at the tgid level even if
> they don't show up in readdir.
>
>> We more likely could get away with some magic symlinks (that would not
>> be user visible) rather than actual symlinks.
>
> I think I'm missing something here.  Aren't magic symlinks still
> represented to the user as symlinks?
>
>> So I think you are probably on the right track to reduce the memory
>> usage but I think some more work will be needed to make it transparently
>> backwards compatible.
>
> Yeah, that's going to be the big hiccup.  I think I've resolved the
> biggest issue with AppArmor, but I don't think the problem is solvable
> without introducing symlinks.

Has anyone looked at making the fd and fdinfo files hard links?

Alternatively it may make sense to see if there is something that we can
do with the locking to reduce the thundering herd problem that is being
seen.

Eric



* Re: [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks
  2019-03-23 15:56     ` Eric W. Biederman
@ 2019-03-24  3:01       ` Jeff Mahoney
  0 siblings, 0 replies; 15+ messages in thread
From: Jeff Mahoney @ 2019-03-24  3:01 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, linux-fsdevel, Al Viro, Alexey Dobriyan, Oleg Nesterov



On 3/23/19 11:56 AM, Eric W. Biederman wrote:
> 
> Has anyone looked at making the fd and fdinfo files hard links.

That could work to a certain degree.  It would certainly reduce the
inode count.  It would still create all the dentries, though.  That's
still an n^2 problem where n is the number of threads in the group.

> Alternatively it may make sense to see if there is something that we can
> do with the locking to reduce the thundering herd problem that is being
> seen.

Yeah, that could still use some attention.  The thundering herd problem
is more of a tap when you reduce the contention by 99% though.

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Thread overview: 15+ messages
2018-04-24  2:21 [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks jeffm
2018-04-24  2:21 ` [PATCH 1/5] procfs: factor out a few helpers jeffm
2018-04-24  2:21 ` [PATCH 2/5] procfs: factor out inode revalidation work from pid_revalidation jeffm
2018-04-24  2:21 ` [PATCH 3/5] procfs: use symlinks for /proc/<pid>/task when not thread group leader jeffm
2018-04-24  2:21 ` [PATCH 4/5] procfs: share common directories between /proc/tgid and /proc/tgid/task/tgid jeffm
2018-04-24  2:21 ` [PATCH 5/5] procfs: share fd/fdinfo with thread group leader when files are shared jeffm
2018-04-24 15:41   ` kbuild test robot
2018-04-24 15:41   ` [RFC PATCH] procfs: proc_pid_files_link_dentry_operations can be static kbuild test robot
2018-04-24  6:17 ` [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks Alexey Dobriyan
2018-04-25 18:04   ` Jeff Mahoney
2018-04-24 14:14 ` Eric W. Biederman
2018-04-26 21:03   ` Jeff Mahoney
2019-03-21 18:30   ` Jeff Mahoney
2019-03-23 15:56     ` Eric W. Biederman
2019-03-24  3:01       ` Jeff Mahoney
