LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH 1/7] vfs: Fix shrink_submounts
@ 2008-11-06 10:38 Eric W. Biederman
  2008-11-06 10:48 ` [PATCH 2/7] proc: Implement support for automounts in task directories Eric W. Biederman
  2008-11-07  1:22 ` [PATCH 1/7] vfs: Fix shrink_submounts Andrew Morton
  0 siblings, 2 replies; 25+ messages in thread
From: Eric W. Biederman @ 2008-11-06 10:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Alexey Dobriyan, Al Viro, Linux Containers


In the last refactoring of shrink_submounts a variable was
not completely renamed.  So finish the renaming of mnt to m
now.

Without this if you attempt to mount an nfs mount that has
both automatic nfs sub mounts on it, and has normal mounts
on it.  The unmount will succeed when it should not.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/namespace.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index cce4670..65b3dc8 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1815,8 +1815,8 @@ static void shrink_submounts(struct vfsmount *mnt, struct list_head *umounts)
 		while (!list_empty(&graveyard)) {
 			m = list_first_entry(&graveyard, struct vfsmount,
 						mnt_expire);
-			touch_mnt_namespace(mnt->mnt_ns);
-			umount_tree(mnt, 1, umounts);
+			touch_mnt_namespace(m->mnt_ns);
+			umount_tree(m, 1, umounts);
 		}
 	}
 }
-- 
1.5.3.rc6.17.g1911


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 2/7] proc: Implement support for automounts in task directories
  2008-11-06 10:38 [PATCH 1/7] vfs: Fix shrink_submounts Eric W. Biederman
@ 2008-11-06 10:48 ` Eric W. Biederman
  2008-11-06 10:49   ` [PATCH 3/7] proc: Support multiple filesystems using the proc generic infrastructure Eric W. Biederman
                     ` (3 more replies)
  2008-11-07  1:22 ` [PATCH 1/7] vfs: Fix shrink_submounts Andrew Morton
  1 sibling, 4 replies; 25+ messages in thread
From: Eric W. Biederman @ 2008-11-06 10:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Alexey Dobriyan, Al Viro, Linux Containers


This is a genearl mechanism that is capable of removing
any unused mounts on /proc in any directory.  As we flush
the mounts when a processes dies this mechanism is tailored
for flushing mounts in the per task and per task group
directories.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/proc/Makefile        |    1 +
 fs/proc/automount.c     |   29 +++++++++++++++++++++++++++++
 fs/proc/internal.h      |    3 +++
 include/linux/proc_fs.h |    1 +
 kernel/exit.c           |    1 +
 5 files changed, 35 insertions(+), 0 deletions(-)
 create mode 100644 fs/proc/automount.c

diff --git a/fs/proc/Makefile b/fs/proc/Makefile
index 63d9651..862a8fe 100644
--- a/fs/proc/Makefile
+++ b/fs/proc/Makefile
@@ -18,6 +18,7 @@ proc-y	+= meminfo.o
 proc-y	+= stat.o
 proc-y	+= uptime.o
 proc-y	+= version.o
+proc-y	+= automount.o
 proc-$(CONFIG_PROC_SYSCTL)	+= proc_sysctl.o
 proc-$(CONFIG_NET)		+= proc_net.o
 proc-$(CONFIG_PROC_KCORE)	+= kcore.o
diff --git a/fs/proc/automount.c b/fs/proc/automount.c
new file mode 100644
index 0000000..a1fabf2
--- /dev/null
+++ b/fs/proc/automount.c
@@ -0,0 +1,29 @@
+#include <linux/workqueue.h>
+#include <linux/mount.h>
+#include <linux/vfs.h>
+#include "internal.h"
+
+LIST_HEAD(proc_automounts);
+
+static void proc_expire_automounts(struct work_struct *work);
+
+static DECLARE_DELAYED_WORK(proc_automount_task, proc_expire_automounts);
+static int proc_automount_timeout = 500 * HZ;
+
+void proc_shrink_automounts(void)
+{
+	struct list_head *list = &proc_automounts;
+
+	mark_mounts_for_expiry(list);
+	mark_mounts_for_expiry(list);
+	if (list_empty(list))
+		return;
+
+	schedule_delayed_work(&proc_automount_task, proc_automount_timeout);
+}
+
+static void proc_expire_automounts(struct work_struct *work)
+{
+	proc_shrink_automounts();
+}
+
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 3e8aeb8..2a8eabb 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -93,3 +93,6 @@ struct pde_opener {
 	int (*release)(struct inode *, struct file *);
 	struct list_head lh;
 };
+
+extern struct list_head proc_automounts;
+
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index b8bdb96..3505a72 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -101,6 +101,7 @@ extern spinlock_t proc_subdir_lock;
 
 extern void proc_root_init(void);
 
+void proc_shrink_automounts(void);
 void proc_flush_task(struct task_struct *task);
 struct dentry *proc_pid_lookup(struct inode *dir, struct dentry * dentry, struct nameidata *);
 int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir);
diff --git a/kernel/exit.c b/kernel/exit.c
index 80137a5..8a8badb 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -161,6 +161,7 @@ void release_task(struct task_struct * p)
 repeat:
 	tracehook_prepare_release_task(p);
 	atomic_dec(&p->user->processes);
+	proc_shrink_automounts();
 	proc_flush_task(p);
 	write_lock_irq(&tasklist_lock);
 	tracehook_finish_release_task(p);
-- 
1.5.3.rc6.17.g1911


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 3/7] proc: Support multiple filesystems using the proc generic infrastructure
  2008-11-06 10:48 ` [PATCH 2/7] proc: Implement support for automounts in task directories Eric W. Biederman
@ 2008-11-06 10:49   ` Eric W. Biederman
  2008-11-06 10:53     ` [PATCH 4/7] proc: Make /proc/net it's own filesystem Eric W. Biederman
  2008-11-07  1:25   ` [PATCH 2/7] proc: Implement support for automounts in task directories Andrew Morton
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 25+ messages in thread
From: Eric W. Biederman @ 2008-11-06 10:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Alexey Dobriyan, Al Viro, Linux Containers


- Implement proc_create_root to create a generic root directory.
- Factor out release_proc_entry so that we can free a generic root directory.
- Remove static from proc_sops so that we can access it in other files
  implementing a proc generic filesystem.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/proc/generic.c  |   30 +++++++++++++++++++++++++++---
 fs/proc/inode.c    |    2 +-
 fs/proc/internal.h |    3 +++
 3 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 60a359b..8669cf6 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -532,7 +532,6 @@ static int proc_register(struct proc_dir_entry * dir, struct proc_dir_entry * dp
 			dp->proc_fops = &proc_dir_operations;
 			dp->proc_iops = &proc_dir_inode_operations;
 		}
-		dir->nlink++;
 	} else if (S_ISLNK(dp->mode)) {
 		if (dp->proc_iops == NULL)
 			dp->proc_iops = &proc_link_inode_operations;
@@ -555,6 +554,8 @@ static int proc_register(struct proc_dir_entry * dir, struct proc_dir_entry * dp
 	dp->next = dir->subdir;
 	dp->parent = dir;
 	dir->subdir = dp;
+	if (S_ISDIR(dp->mode))
+		dir->nlink++;
 	spin_unlock(&proc_subdir_lock);
 
 	return 0;
@@ -599,6 +600,24 @@ static struct proc_dir_entry *__proc_create(struct proc_dir_entry **parent,
 	return ent;
 }
 
+struct proc_dir_entry *proc_create_root(void)
+{
+	struct proc_dir_entry *ent, *parent = NULL;
+
+	ent = __proc_create(&parent, "..", S_IFDIR | S_IRUGO | S_IXUGO, 2);
+	if (ent) {
+		ent->proc_fops = &proc_dir_operations;
+		ent->proc_iops = &proc_dir_inode_operations;
+		ent->low_ino = get_inode_number();
+		ent->parent = ent;
+		if (!ent->low_ino) {
+			kfree(ent);
+			ent = NULL;
+		}
+	}
+	return ent;
+}
+
 struct proc_dir_entry *proc_symlink(const char *name,
 		struct proc_dir_entry *parent, const char *dest)
 {
@@ -758,6 +777,8 @@ void remove_proc_entry(const char *name, struct proc_dir_entry *parent)
 			de = *p;
 			*p = de->next;
 			de->next = NULL;
+			if (S_ISDIR(de->mode))
+				parent->nlink--;
 			break;
 		}
 	}
@@ -765,6 +786,11 @@ void remove_proc_entry(const char *name, struct proc_dir_entry *parent)
 	if (!de)
 		return;
 
+	release_proc_entry(de);
+}
+
+void release_proc_entry(struct proc_dir_entry *de)
+{
 	spin_lock(&de->pde_unload_lock);
 	/*
 	 * Stop accepting new callers into module. If you're
@@ -800,8 +826,6 @@ continue_removing:
 	}
 	spin_unlock(&de->pde_unload_lock);
 
-	if (S_ISDIR(de->mode))
-		parent->nlink--;
 	de->nlink = 0;
 	WARN(de->subdir, KERN_WARNING "%s: removing non-empty directory "
 			"'%s/%s', leaking at least '%s'\n", __func__,
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 2543fd0..f3f81e2 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -115,7 +115,7 @@ void __init proc_init_inodecache(void)
 					     init_once);
 }
 
-static const struct super_operations proc_sops = {
+const struct super_operations proc_sops = {
 	.alloc_inode	= proc_alloc_inode,
 	.destroy_inode	= proc_destroy_inode,
 	.drop_inode	= generic_delete_inode,
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 2a8eabb..6fe00ff 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -86,6 +86,9 @@ struct dentry *proc_lookup_de(struct proc_dir_entry *de, struct inode *ino,
 		struct dentry *dentry);
 int proc_readdir_de(struct proc_dir_entry *de, struct file *filp, void *dirent,
 		filldir_t filldir);
+struct proc_dir_entry *proc_create_root(void);
+void release_proc_entry(struct proc_dir_entry *de);
+extern const struct super_operations proc_sops;
 
 struct pde_opener {
 	struct inode *inode;
-- 
1.5.3.rc6.17.g1911


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 4/7] proc: Make /proc/net it's own filesystem
  2008-11-06 10:49   ` [PATCH 3/7] proc: Support multiple filesystems using the proc generic infrastructure Eric W. Biederman
@ 2008-11-06 10:53     ` Eric W. Biederman
  2008-11-06 10:56       ` [PATCH 5/7] proc_net: Don't show the wrong /proc/net after unshare Eric W. Biederman
  0 siblings, 1 reply; 25+ messages in thread
From: Eric W. Biederman @ 2008-11-06 10:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Alexey Dobriyan, Al Viro, Linux Containers


Make the VFS happy with /proc/net by making it it's own
filesystem avoiding issues with hard links to directories
and other silliness that confuse the vfs today.

We preserve backwards compatibility by automatically
mounting /proc/self/net and marking it as a shrinkable
mount so userspace doesn't need to care about it.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/proc/base.c              |    6 +-
 fs/proc/proc_net.c          |  212 +++++++++++++++++++++++++++++++------------
 include/linux/magic.h       |    1 +
 include/net/net_namespace.h |    1 +
 security/selinux/hooks.c    |   28 +++++-
 5 files changed, 183 insertions(+), 65 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 486cf3f..9a68fa4 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -128,6 +128,10 @@ struct pid_entry {
 	NOD(NAME, (S_IFREG|(MODE)), 			\
 		NULL, &proc_single_file_operations,	\
 		{ .proc_show = &proc_##OTYPE } )
+#define MNT(NAME, MODE, OTYPE) 				\
+	NOD(NAME, (S_IFDIR|(MODE)), 			\
+		&proc_##OTYPE##_inode_operations, NULL,	\
+		{} )
 
 /*
  * Count the number of hardlinks for the pid_entry table, excluding the .
@@ -2453,7 +2457,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 	DIR("fd",         S_IRUSR|S_IXUSR, fd),
 	DIR("fdinfo",     S_IRUSR|S_IXUSR, fdinfo),
 #ifdef CONFIG_NET
-	DIR("net",        S_IRUGO|S_IXUGO, net),
+	MNT("net",        S_IRUGO|S_IXUGO, net),
 #endif
 	REG("environ",    S_IRUSR, environ),
 	INF("auxv",       S_IRUSR, pid_auxv),
diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 7bc296f..57e0f22 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -21,11 +21,13 @@
 #include <linux/smp_lock.h>
 #include <linux/mount.h>
 #include <linux/nsproxy.h>
+#include <linux/namei.h>
 #include <net/net_namespace.h>
 #include <linux/seq_file.h>
 
 #include "internal.h"
 
+static struct file_system_type proc_net_fs_type;
 
 static struct net *get_proc_net(const struct inode *inode)
 {
@@ -118,65 +120,60 @@ static struct net *get_proc_task_net(struct inode *dir)
 	return net;
 }
 
-static struct dentry *proc_tgid_net_lookup(struct inode *dir,
-		struct dentry *dentry, struct nameidata *nd)
+void *proc_net_follow_link(struct dentry *dentry, struct nameidata *nd)
 {
-	struct dentry *de;
+	/* Follow to a mount point of the proper network namespace.
+	 */
+	struct vfsmount *mnt;
 	struct net *net;
-
-	de = ERR_PTR(-ENOENT);
-	net = get_proc_task_net(dir);
-	if (net != NULL) {
-		de = proc_lookup_de(net->proc_net, dir, dentry);
-		put_net(net);
+	int err = -ENOENT;
+
+	/* Which network namespace? */
+	net = get_proc_task_net(dentry->d_inode);
+	if (!net)
+		goto out_err;
+
+	/* Create a new mount. */
+	mnt = kern_mount_data(&proc_net_fs_type, net);
+	if (IS_ERR(mnt))
+		goto out_err;
+
+	dput(nd->path.dentry);
+	nd->path.dentry = dget(dentry);
+
+	/* Add mnt the mount namespace */
+	err = do_add_mount(mntget(mnt), &nd->path, MNT_SHRINKABLE,
+			   &proc_automounts);
+	if (err < 0) {
+		mntput(mnt);
+		if (err == -EBUSY)
+			goto out_follow;
+		goto out_err;
 	}
-	return de;
-}
-
-static int proc_tgid_net_getattr(struct vfsmount *mnt, struct dentry *dentry,
-		struct kstat *stat)
-{
-	struct inode *inode = dentry->d_inode;
-	struct net *net;
-
-	net = get_proc_task_net(inode);
-
-	generic_fillattr(inode, stat);
-
-	if (net != NULL) {
-		stat->nlink = net->proc_net->nlink;
-		put_net(net);
-	}
-
-	return 0;
+	/* Place the mnt on path and return it to the caller */
+	err = 0;
+	path_put(&nd->path);
+	nd->path.mnt = mnt;
+	nd->path.dentry = dget(mnt->mnt_root);
+	put_net(net);
+out:
+	return ERR_PTR(err);
+out_err:
+	path_put(&nd->path);
+	goto out;
+out_follow:
+	/* We raced with ourselves so just walk the mounts */
+	while (d_mountpoint(nd->path.dentry) &&
+		follow_down(&nd->path.mnt, &nd->path.dentry))
+		;
+	err = 0;
+	goto out;
 }
 
 const struct inode_operations proc_net_inode_operations = {
-	.lookup		= proc_tgid_net_lookup,
-	.getattr	= proc_tgid_net_getattr,
-};
-
-static int proc_tgid_net_readdir(struct file *filp, void *dirent,
-		filldir_t filldir)
-{
-	int ret;
-	struct net *net;
-
-	ret = -EINVAL;
-	net = get_proc_task_net(filp->f_path.dentry->d_inode);
-	if (net != NULL) {
-		ret = proc_readdir_de(net->proc_net, filp, dirent, filldir);
-		put_net(net);
-	}
-	return ret;
-}
-
-const struct file_operations proc_net_operations = {
-	.read		= generic_read_dir,
-	.readdir	= proc_tgid_net_readdir,
+	.follow_link	= proc_net_follow_link,
 };
 
-
 struct proc_dir_entry *proc_net_fops_create(struct net *net,
 	const char *name, mode_t mode, const struct file_operations *fops)
 {
@@ -190,21 +187,95 @@ void proc_net_remove(struct net *net, const char *name)
 }
 EXPORT_SYMBOL_GPL(proc_net_remove);
 
+
+static int proc_net_fill_super(struct super_block *sb)
+{
+	struct net *net = sb->s_fs_info;
+	struct proc_dir_entry *netd = net->proc_net;
+	struct inode *root_inode = NULL;
+
+	sb->s_flags |= MS_NODIRATIME | MS_NOSUID | MS_NOEXEC;
+	sb->s_blocksize = PAGE_SIZE;
+	sb->s_blocksize_bits = PAGE_SHIFT;
+	sb->s_magic = PROC_NET_SUPER_MAGIC;
+	sb->s_op = &proc_sops;
+	sb->s_time_gran = 1;
+
+	de_get(netd);
+	root_inode = proc_get_inode(sb, netd->low_ino, netd);
+	if (!root_inode)
+		goto out_no_root;
+	root_inode->i_uid = 0;
+	root_inode->i_gid = 0;
+	sb->s_root = d_alloc_root(root_inode);
+	if (!sb->s_root)
+		goto out_no_root;
+	return 0;
+
+out_no_root:
+	printk("%s: get root inode failed\n", __func__);
+	iput(root_inode);
+	de_put(netd);
+	return -ENOMEM;
+}
+
+static int proc_net_test_super(struct super_block *sb, void *data)
+{
+	return sb->s_fs_info == data;
+}
+
+static int proc_net_set_super(struct super_block *sb, void *data)
+{
+	sb->s_fs_info = data;
+	return set_anon_super(sb, NULL);
+}
+
+static int proc_net_get_sb(struct file_system_type *fs_type,
+	int flags, const char *dev_name, void *data, struct vfsmount *mnt)
+{
+	struct super_block *sb;
+
+	if (!(flags & MS_KERNMOUNT))
+		data = current->nsproxy->net_ns;
+
+	sb = sget(fs_type, proc_net_test_super, proc_net_set_super, data);
+	if (IS_ERR(sb))
+		return PTR_ERR(sb);
+
+	if (!sb->s_root) {
+		int err;
+		sb->s_flags = flags;
+		err = proc_net_fill_super(sb);
+		if (err) {
+			up_write(&sb->s_umount);
+			deactivate_super(sb);
+			return err;
+		}
+
+		sb->s_flags |= MS_ACTIVE;
+	}
+
+	return simple_set_mnt(mnt, sb);
+}
+
+static struct file_system_type proc_net_fs_type = {
+	.name		= "proc/net",
+	.get_sb		= proc_net_get_sb,
+	.kill_sb	= kill_litter_super,
+};
+
 static __net_init int proc_net_ns_init(struct net *net)
 {
 	struct proc_dir_entry *netd, *net_statd;
+	struct vfsmount *mnt;
 	int err;
 
 	err = -ENOMEM;
-	netd = kzalloc(sizeof(*netd), GFP_KERNEL);
+	netd = proc_create_root();
 	if (!netd)
 		goto out;
 
 	netd->data = net;
-	netd->nlink = 2;
-	netd->name = "net";
-	netd->namelen = 3;
-	netd->parent = &proc_root;
 
 	err = -EEXIST;
 	net_statd = proc_net_mkdir(net, "stat", netd);
@@ -213,8 +284,17 @@ static __net_init int proc_net_ns_init(struct net *net)
 
 	net->proc_net = netd;
 	net->proc_net_stat = net_statd;
+
+	mnt = kern_mount_data(&proc_net_fs_type, net);
+	if (IS_ERR(mnt))
+		goto free_stat;
+
+	net->proc_mnt = mnt;
+
 	return 0;
 
+free_stat:
+	remove_proc_entry("stat", netd);
 free_net:
 	kfree(netd);
 out:
@@ -224,7 +304,14 @@ out:
 static __net_exit void proc_net_ns_exit(struct net *net)
 {
 	remove_proc_entry("stat", net->proc_net);
-	kfree(net->proc_net);
+	release_proc_entry(net->proc_net);
+	/* We won't be looking up this super block
+	 * any more so set s_fs_info to NULL to ensure
+	 * it doesn't conflict with network namespaces
+	 * allocated in the future at the same address.
+	 */
+	net->proc_mnt->mnt_sb->s_fs_info = NULL;
+	mntput(net->proc_mnt);
 }
 
 static struct pernet_operations __net_initdata proc_net_ns_ops = {
@@ -234,7 +321,16 @@ static struct pernet_operations __net_initdata proc_net_ns_ops = {
 
 int __init proc_net_init(void)
 {
-	proc_symlink("net", NULL, "self/net");
+	struct proc_dir_entry *ent;
+	int err;
+
+	ent = proc_symlink("net", NULL, "self/net");
+	if (!ent)
+		return -EEXIST;
+
+	err = register_filesystem(&proc_net_fs_type);
+	if (err)
+		return err;
 
 	return register_pernet_subsys(&proc_net_ns_ops);
 }
diff --git a/include/linux/magic.h b/include/linux/magic.h
index f7f3fdd..2b31c02 100644
--- a/include/linux/magic.h
+++ b/include/linux/magic.h
@@ -30,6 +30,7 @@
 #define NFS_SUPER_MAGIC		0x6969
 #define OPENPROM_SUPER_MAGIC	0x9fa1
 #define PROC_SUPER_MAGIC	0x9fa0
+#define PROC_NET_SUPER_MAGIC	0x706e6574
 #define QNX4_SUPER_MAGIC	0x002f		/* qnx4 fs detection */
 
 #define REISERFS_SUPER_MAGIC	0x52654973	/* used by gcc */
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 700c53a..77aba2b 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -40,6 +40,7 @@ struct net {
 
 	struct proc_dir_entry 	*proc_net;
 	struct proc_dir_entry 	*proc_net_stat;
+	struct vfsmount		*proc_mnt;
 
 #ifdef CONFIG_SYSCTL
 	struct ctl_table_set	sysctls;
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index f85597a..b38a2df 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -667,7 +667,7 @@ static int selinux_set_mnt_opts(struct super_block *sb,
 		goto out;
 	}
 
-	if (strcmp(sb->s_type->name, "proc") == 0)
+	if (strncmp(sb->s_type->name, "proc", 4) == 0)
 		sbsec->proc = 1;
 
 	/* Determine the labeling behavior to use for this filesystem type. */
@@ -1116,16 +1116,18 @@ static inline u16 socket_type_to_security_class(int family, int type, int protoc
 }
 
 #ifdef CONFIG_PROC_FS
-static int selinux_proc_get_sid(struct proc_dir_entry *de,
+static int selinux_proc_get_sid(struct super_block *sb,
+				struct proc_dir_entry *de,
 				u16 tclass,
 				u32 *sid)
 {
 	int buflen, rc;
 	char *buffer, *path, *end;
 
+	rc = -ENOMEM;
 	buffer = (char *)__get_free_page(GFP_KERNEL);
 	if (!buffer)
-		return -ENOMEM;
+		goto out;
 
 	buflen = PAGE_SIZE;
 	end = buffer+buflen;
@@ -1136,19 +1138,32 @@ static int selinux_proc_get_sid(struct proc_dir_entry *de,
 	while (de && de != de->parent) {
 		buflen -= de->namelen + 1;
 		if (buflen < 0)
-			break;
+			goto out_free;
 		end -= de->namelen;
 		memcpy(end, de->name, de->namelen);
 		*--end = '/';
 		path = end;
 		de = de->parent;
 	}
+	if (strcmp(sb->type->name, "proc") != 0) {
+		const char *name = sb->type->name + 4;
+		int namelen = strlen(name);
+		buflen -= namelen;
+		if (buflen < 0)
+			goto out_free;
+		end -= namelen;
+		memcpy(end, name);
+		path = end;
+	}
 	rc = security_genfs_sid("proc", path, tclass, sid);
+out_free:
 	free_page((unsigned long)buffer);
+out:
 	return rc;
 }
 #else
-static int selinux_proc_get_sid(struct proc_dir_entry *de,
+static int selinux_proc_get_sid(struct super_block *sb,
+				struct proc_dir_entry *de,
 				u16 tclass,
 				u32 *sid)
 {
@@ -1297,7 +1312,8 @@ static int inode_doinit_with_dentry(struct inode *inode, struct dentry *opt_dent
 			struct proc_inode *proci = PROC_I(inode);
 			if (proci->pde) {
 				isec->sclass = inode_mode_to_security_class(inode->i_mode);
-				rc = selinux_proc_get_sid(proci->pde,
+				rc = selinux_proc_get_sid(inode->i_sb,
+							  proci->pde,
 							  isec->sclass,
 							  &sid);
 				if (rc)
-- 
1.5.3.rc6.17.g1911


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 5/7] proc_net: Don't show the wrong /proc/net after unshare.
  2008-11-06 10:53     ` [PATCH 4/7] proc: Make /proc/net it's own filesystem Eric W. Biederman
@ 2008-11-06 10:56       ` Eric W. Biederman
  2008-11-06 10:57         ` [PATCH 6/7] proc_net: Simplify network namespace lookup Eric W. Biederman
  0 siblings, 1 reply; 25+ messages in thread
From: Eric W. Biederman @ 2008-11-06 10:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Alexey Dobriyan, Al Viro, Linux Containers


This is accomplished by dropping the /proc/<pid>/net
dentry when we discover an older version of /proc/net
is mounted upon it.  This prevents new lookups from
using the mount and ultimately proc_shrink_automounts
will catch up with it and remove the old mount point.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/proc/base.c     |   11 +++++++----
 fs/proc/internal.h |   11 +++++++++++
 fs/proc/proc_net.c |   33 +++++++++++++++++++++++++++++++++
 3 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9a68fa4..8b0d066 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1487,6 +1487,7 @@ static int pid_revalidate(struct dentry *dentry, struct nameidata *nd)
 {
 	struct inode *inode = dentry->d_inode;
 	struct task_struct *task = get_proc_task(inode);
+	int ret = 0;
 	if (task) {
 		if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) ||
 		    task_dumpable(task)) {
@@ -1497,12 +1498,14 @@ static int pid_revalidate(struct dentry *dentry, struct nameidata *nd)
 			inode->i_gid = 0;
 		}
 		inode->i_mode &= ~(S_ISUID | S_ISGID);
-		security_task_to_inode(task, inode);
+		ret = proc_net_revalidate(task, dentry, nd);
+		if (ret == 1)
+			security_task_to_inode(task, inode);
 		put_task_struct(task);
-		return 1;
 	}
-	d_drop(dentry);
-	return 0;
+	if (ret == 0)
+		d_drop(dentry);
+	return ret;
 }
 
 static int pid_delete_dentry(struct dentry * dentry)
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index ffa285e..f9f8de6 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -64,6 +64,17 @@ extern const struct file_operations proc_kmsg_operations;
 extern const struct file_operations proc_net_operations;
 extern const struct inode_operations proc_net_inode_operations;
 
+#ifdef CONFIG_NET
+extern int proc_net_revalidate(struct task_struct *task, struct dentry *dentry,
+				struct nameidata *nd);
+#else
+static inline int proc_net_revalidate(struct task_struct *t, struct dentry *d,
+					struct nameidata *nd)
+{
+	return 1;
+}
+#endif
+
 void free_proc_entry(struct proc_dir_entry *de);
 
 void proc_init_inodecache(void);
diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 57e0f22..4a7551a 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -174,6 +174,39 @@ const struct inode_operations proc_net_inode_operations = {
 	.follow_link	= proc_net_follow_link,
 };
 
+int proc_net_revalidate(struct task_struct *task, struct dentry *dentry,
+			struct nameidata *nd)
+{
+	struct inode *inode = dentry->d_inode;
+	struct dentry *tdentry;
+	struct vfsmount *tmnt;
+	int ret = 1;
+
+	/* Are we talking about a proc/net mount point? */
+	if (!nd || (inode->i_op != &proc_net_inode_operations))
+		goto out;
+
+	/* If the wrong filesystem is mounted on
+	 * /proc/<pid>/net report the dentry is invalid.
+	 */
+	tdentry = dget(dentry);
+	tmnt = mntget(nd->path.mnt);
+	if (follow_down(&tmnt, &tdentry)) {
+		struct nsproxy *ns;
+		rcu_read_lock();
+		ns = task_nsproxy(task);
+		if ((ns == NULL) ||
+		     (tmnt->mnt_sb->s_magic != PROC_NET_SUPER_MAGIC) ||
+		     (tmnt->mnt_sb->s_fs_info != ns->net_ns))
+			ret = 0;
+		rcu_read_unlock();
+	}
+	mntput(tmnt);
+	dput(tdentry);
+out:
+	return ret;
+}
+
 struct proc_dir_entry *proc_net_fops_create(struct net *net,
 	const char *name, mode_t mode, const struct file_operations *fops)
 {
-- 
1.5.3.rc6.17.g1911


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 6/7] proc_net: Simplify network namespace lookup.
  2008-11-06 10:56       ` [PATCH 5/7] proc_net: Don't show the wrong /proc/net after unshare Eric W. Biederman
@ 2008-11-06 10:57         ` Eric W. Biederman
  2008-11-06 10:58           ` [PATCH 7/7] proc: Cleanup proc_flush_task Eric W. Biederman
  0 siblings, 1 reply; 25+ messages in thread
From: Eric W. Biederman @ 2008-11-06 10:57 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Alexey Dobriyan, Al Viro, Linux Containers


Since the network namespace is recorded in the
superblock we don't need to remember it on
each directory under /proc/net.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/proc/generic.c       |   17 -----------------
 fs/proc/proc_net.c      |   13 ++++++++++---
 include/linux/proc_fs.h |    5 -----
 3 files changed, 10 insertions(+), 25 deletions(-)

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 8669cf6..6db30a8 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -658,23 +658,6 @@ struct proc_dir_entry *proc_mkdir_mode(const char *name, mode_t mode,
 	return ent;
 }
 
-struct proc_dir_entry *proc_net_mkdir(struct net *net, const char *name,
-		struct proc_dir_entry *parent)
-{
-	struct proc_dir_entry *ent;
-
-	ent = __proc_create(&parent, name, S_IFDIR | S_IRUGO | S_IXUGO, 2);
-	if (ent) {
-		ent->data = net;
-		if (proc_register(parent, ent) < 0) {
-			kfree(ent);
-			ent = NULL;
-		}
-	}
-	return ent;
-}
-EXPORT_SYMBOL_GPL(proc_net_mkdir);
-
 struct proc_dir_entry *proc_mkdir(const char *name,
 		struct proc_dir_entry *parent)
 {
diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 4a7551a..baabb9c 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -31,7 +31,7 @@ static struct file_system_type proc_net_fs_type;
 
 static struct net *get_proc_net(const struct inode *inode)
 {
-	return maybe_get_net(PDE_NET(PDE(inode)));
+	return maybe_get_net(inode->i_sb->s_fs_info);
 }
 
 int seq_open_net(struct inode *ino, struct file *f,
@@ -214,6 +214,15 @@ struct proc_dir_entry *proc_net_fops_create(struct net *net,
 }
 EXPORT_SYMBOL_GPL(proc_net_fops_create);
 
+struct proc_dir_entry *proc_net_mkdir(struct net *net, const char *name,
+		struct proc_dir_entry *parent)
+{
+	if (!parent)
+		parent = net->proc_net;
+	return proc_mkdir(name, parent);
+}
+EXPORT_SYMBOL_GPL(proc_net_mkdir);
+
 void proc_net_remove(struct net *net, const char *name)
 {
 	remove_proc_entry(name, net->proc_net);
@@ -308,8 +317,6 @@ static __net_init int proc_net_ns_init(struct net *net)
 	if (!netd)
 		goto out;
 
-	netd->data = net;
-
 	err = -EEXIST;
 	net_statd = proc_net_mkdir(net, "stat", netd);
 	if (!net_statd)
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 3505a72..52cc6bd 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -300,11 +300,6 @@ static inline struct proc_dir_entry *PDE(const struct inode *inode)
 	return PROC_I(inode)->pde;
 }
 
-static inline struct net *PDE_NET(struct proc_dir_entry *pde)
-{
-	return pde->parent->data;
-}
-
 struct proc_maps_private {
 	struct pid *pid;
 	struct task_struct *task;
-- 
1.5.3.rc6.17.g1911


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 7/7] proc: Cleanup proc_flush_task.
  2008-11-06 10:57         ` [PATCH 6/7] proc_net: Simplify network namespace lookup Eric W. Biederman
@ 2008-11-06 10:58           ` Eric W. Biederman
  0 siblings, 0 replies; 25+ messages in thread
From: Eric W. Biederman @ 2008-11-06 10:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Alexey Dobriyan, Al Viro, Linux Containers


- shrink_dcache_parent is guarnateed to only flush dcache entries
  from the specified superblock.  So the the insanely subtle
  PF_EXITING check desgined to avoid flushing filesystems inodes
  that are not safe to call during exit is unecessary.

- The test to avoid excess flushing of dcache entries in proc
  got inverted and generally broken for threads, and frankly
  all it saved was a single d_hash_and_lookup so there was
  very little point to it.  So remove the check and keep
  the code stupid and correct.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/proc/base.c |   11 +++--------
 1 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 8b0d066..cb2222a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2560,15 +2560,11 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
 	name.len = snprintf(buf, sizeof(buf), "%d", pid);
 	dentry = d_hash_and_lookup(mnt->mnt_root, &name);
 	if (dentry) {
-		if (!(current->flags & PF_EXITING))
-			shrink_dcache_parent(dentry);
+		shrink_dcache_parent(dentry);
 		d_drop(dentry);
 		dput(dentry);
 	}
 
-	if (tgid == 0)
-		goto out;
-
 	name.name = buf;
 	name.len = snprintf(buf, sizeof(buf), "%d", tgid);
 	leader = d_hash_and_lookup(mnt->mnt_root, &name);
@@ -2629,13 +2625,12 @@ void proc_flush_task(struct task_struct *task)
 	struct upid *upid;
 
 	pid = task_pid(task);
-	if (thread_group_leader(task))
-		tgid = task_tgid(task);
+	tgid = task_tgid(task);
 
 	for (i = 0; i <= pid->level; i++) {
 		upid = &pid->numbers[i];
 		proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr,
-			tgid ? tgid->numbers[i].nr : 0);
+				    tgid->numbers[i].nr);
 	}
 
 	upid = &pid->numbers[pid->level];
-- 
1.5.3.rc6.17.g1911


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/7] vfs: Fix shrink_submounts
  2008-11-06 10:38 [PATCH 1/7] vfs: Fix shrink_submounts Eric W. Biederman
  2008-11-06 10:48 ` [PATCH 2/7] proc: Implement support for automounts in task directories Eric W. Biederman
@ 2008-11-07  1:22 ` Andrew Morton
  2008-11-07  2:06   ` Eric W. Biederman
  1 sibling, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2008-11-07  1:22 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-kernel, adobriyan, viro, containers

On Thu, 06 Nov 2008 02:38:49 -0800
ebiederm@xmission.com (Eric W. Biederman) wrote:

> 
> In the last refactoring of shrink_submounts a variable was
> not completely renamed.  So finish the renaming of mnt to m
> now.
> 
> Without this if you attempt to mount an nfs mount that has
> both automatic nfs sub mounts on it, and has normal mounts
> on it.  The unmount will succeed when it should not.
> 
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---
>  fs/namespace.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index cce4670..65b3dc8 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1815,8 +1815,8 @@ static void shrink_submounts(struct vfsmount *mnt, struct list_head *umounts)
>  		while (!list_empty(&graveyard)) {
>  			m = list_first_entry(&graveyard, struct vfsmount,
>  						mnt_expire);
> -			touch_mnt_namespace(mnt->mnt_ns);
> -			umount_tree(mnt, 1, umounts);
> +			touch_mnt_namespace(m->mnt_ns);
> +			umount_tree(m, 1, umounts);
>  		}
>  	}
>  }

OK, so that's a for-2.6.28 bugfix, yes?

The other six patches will fall under Alexey's procfs tree.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/7] proc: Implement support for automounts in task directories
  2008-11-06 10:48 ` [PATCH 2/7] proc: Implement support for automounts in task directories Eric W. Biederman
  2008-11-06 10:49   ` [PATCH 3/7] proc: Support multiple filesystems using the proc generic infrastructure Eric W. Biederman
@ 2008-11-07  1:25   ` Andrew Morton
  2008-11-07  2:02     ` Eric W. Biederman
  2008-11-07  1:26   ` Andrew Morton
  2008-11-07  4:41   ` Alexey Dobriyan
  3 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2008-11-07  1:25 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-kernel, adobriyan, viro, containers

On Thu, 06 Nov 2008 02:48:35 -0800
ebiederm@xmission.com (Eric W. Biederman) wrote:

> This is a genearl mechanism that is capable of removing
> any unused mounts on /proc in any directory.  As we flush
> the mounts when a processes dies this mechanism is tailored
> for flushing mounts in the per task and per task group
> directories.

What I'm missing here is any sense of what these patches are for,
where they're headed, what the big picture is, etc?

My vague guess is that perhaps it has something to do with mounting
procfs multiple times in separate containers.  How did I do?


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/7] proc: Implement support for automounts in task directories
  2008-11-06 10:48 ` [PATCH 2/7] proc: Implement support for automounts in task directories Eric W. Biederman
  2008-11-06 10:49   ` [PATCH 3/7] proc: Support multiple filesystems using the proc generic infrastructure Eric W. Biederman
  2008-11-07  1:25   ` [PATCH 2/7] proc: Implement support for automounts in task directories Andrew Morton
@ 2008-11-07  1:26   ` Andrew Morton
  2008-11-07  2:05     ` Eric W. Biederman
  2008-11-07  4:41   ` Alexey Dobriyan
  3 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2008-11-07  1:26 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-kernel, adobriyan, viro, containers

On Thu, 06 Nov 2008 02:48:35 -0800
ebiederm@xmission.com (Eric W. Biederman) wrote:

> +void proc_shrink_automounts(void)
> +{
> +	struct list_head *list = &proc_automounts;
> +
> +	mark_mounts_for_expiry(list);
> +	mark_mounts_for_expiry(list);

Strange.  In case the first attempt didn't work?

> +	if (list_empty(list))
> +		return;
> +
> +	schedule_delayed_work(&proc_automount_task, proc_automount_timeout);
> +}

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/7] proc: Implement support for automounts in task directories
  2008-11-07  1:25   ` [PATCH 2/7] proc: Implement support for automounts in task directories Andrew Morton
@ 2008-11-07  2:02     ` Eric W. Biederman
  0 siblings, 0 replies; 25+ messages in thread
From: Eric W. Biederman @ 2008-11-07  2:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, adobriyan, viro, containers

Andrew Morton <akpm@linux-foundation.org> writes:

> On Thu, 06 Nov 2008 02:48:35 -0800
> ebiederm@xmission.com (Eric W. Biederman) wrote:
>
>> This is a genearl mechanism that is capable of removing
>> any unused mounts on /proc in any directory.  As we flush
>> the mounts when a processes dies this mechanism is tailored
>> for flushing mounts in the per task and per task group
>> directories.
>
> What I'm missing here is any sense of what these patches are for,
> where they're headed, what the big picture is, etc?

Sorry.

> My vague guess is that perhaps it has something to do with mounting
> procfs multiple times in separate containers.  How did I do?

The big picture is that right now /proc/<pid>/net/stat
is a directory that is hard linked in different locations.

Which means you can deadlock rename at the vfs level
(despite the fact that proc doesn't support rename).

So this patchset splits /proc/net out into it's own filesystem
so we don't have multiple hard links.

It uses the vfs level automounts  to preserve backwards compatibility
so user space does not need to explicitly mount /proc/<pid>/net.

When Al noticed the problem there was some security drama, and
people were privately cc'd etc.  And however it works I am incompetent
at getting patches merged in that kind of environment.  So these
patches have languished since the middle of September.

On one level these patches constitute a bug fix for the bug
of having multiple hard links in /proc/net.  At another level
these patches are a clean up and a nice to have feature.  Allowing
a network namespace to be monitored in the weird interval between when
the last processes goes away and when the network namespace is destroyed.
Because you can mount /proc/net independently.

Eric


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/7] proc: Implement support for automounts in task directories
  2008-11-07  1:26   ` Andrew Morton
@ 2008-11-07  2:05     ` Eric W. Biederman
  2008-11-07  2:49       ` Andrew Morton
  0 siblings, 1 reply; 25+ messages in thread
From: Eric W. Biederman @ 2008-11-07  2:05 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, adobriyan, viro, containers

Andrew Morton <akpm@linux-foundation.org> writes:

> On Thu, 06 Nov 2008 02:48:35 -0800
> ebiederm@xmission.com (Eric W. Biederman) wrote:
>
>> +void proc_shrink_automounts(void)
>> +{
>> +	struct list_head *list = &proc_automounts;
>> +
>> +	mark_mounts_for_expiry(list);
>> +	mark_mounts_for_expiry(list);
>
> Strange.  In case the first attempt didn't work?

Yes.  I'd like to say.  Mount just go away but it takes two passes before
a mount is actually removed.

For NFS which does the whole expiry of all inodes where it comes from it
is a good fit.  For /proc where we don't have to guess it isn't the best
fit but it isn't shabby either.

>
>> +	if (list_empty(list))
>> +		return;
>> +
>> +	schedule_delayed_work(&proc_automount_task, proc_automount_timeout);
>> +}


Eric

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/7] vfs: Fix shrink_submounts
  2008-11-07  1:22 ` [PATCH 1/7] vfs: Fix shrink_submounts Andrew Morton
@ 2008-11-07  2:06   ` Eric W. Biederman
  0 siblings, 0 replies; 25+ messages in thread
From: Eric W. Biederman @ 2008-11-07  2:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, adobriyan, viro, containers

Andrew Morton <akpm@linux-foundation.org> writes:

> OK, so that's a for-2.6.28 bugfix, yes?
>
> The other six patches will fall under Alexey's procfs tree.

Fundamentally they are all bug fixes, but whatever works.

Eric


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/7] proc: Implement support for automounts in task directories
  2008-11-07  2:05     ` Eric W. Biederman
@ 2008-11-07  2:49       ` Andrew Morton
  2008-11-07  3:51         ` Eric W. Biederman
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2008-11-07  2:49 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-kernel, adobriyan, viro, containers

On Thu, 06 Nov 2008 18:05:46 -0800 ebiederm@xmission.com (Eric W. Biederman) wrote:

> Andrew Morton <akpm@linux-foundation.org> writes:
> 
> > On Thu, 06 Nov 2008 02:48:35 -0800
> > ebiederm@xmission.com (Eric W. Biederman) wrote:
> >
> >> +void proc_shrink_automounts(void)
> >> +{
> >> +	struct list_head *list = &proc_automounts;
> >> +
> >> +	mark_mounts_for_expiry(list);
> >> +	mark_mounts_for_expiry(list);
> >
> > Strange.  In case the first attempt didn't work?
> 
> Yes.  I'd like to say.  Mount just go away but it takes two passes before
> a mount is actually removed.

hm.  I stared at mark_mounts_for_expiry() for a while trying to work
out how that can happen and what it semantically *means*, and failed.

I guess I'm just not smart/experienced enough.

> For NFS which does the whole expiry of all inodes where it comes from it
> is a good fit.  For /proc where we don't have to guess it isn't the best
> fit but it isn't shabby either.
> 
> >
> >> +	if (list_empty(list))

So even after two passes through mark_mounts_for_expiry(), there can
still be mounts on our list.

> >> +		return;
> >> +
> >> +	schedule_delayed_work(&proc_automount_task, proc_automount_timeout);

And this causes proc_shrink_automounts() to be called every 500 seconds
for ever and ever, until proc_automounts is empty.

Again, I just don't know how the reader of this file is to understand
why this is done this way.  What is the thinking behind it?  What is
the expected dynamic behaviour?  Under what circumstances will this
very unusual repeated polling activity be triggered?

Obviously, that becomes clearer as one spends more time with the code,
but I wonder whether this has all been made as maintainble as it
possibly could be.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/7] proc: Implement support for automounts in task directories
  2008-11-07  2:49       ` Andrew Morton
@ 2008-11-07  3:51         ` Eric W. Biederman
  2008-11-07  4:28           ` Andrew Morton
  0 siblings, 1 reply; 25+ messages in thread
From: Eric W. Biederman @ 2008-11-07  3:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, adobriyan, viro, containers

Andrew Morton <akpm@linux-foundation.org> writes:

> On Thu, 06 Nov 2008 18:05:46 -0800 ebiederm@xmission.com (Eric W. Biederman)
> wrote:
>
>> Andrew Morton <akpm@linux-foundation.org> writes:
>> 
>> > On Thu, 06 Nov 2008 02:48:35 -0800
>> > ebiederm@xmission.com (Eric W. Biederman) wrote:
>> >
>> >> +void proc_shrink_automounts(void)
>> >> +{
>> >> +	struct list_head *list = &proc_automounts;
>> >> +
>> >> +	mark_mounts_for_expiry(list);
>> >> +	mark_mounts_for_expiry(list);
>> >
>> > Strange.  In case the first attempt didn't work?
>> 
>> Yes.  I'd like to say.  Mount just go away but it takes two passes before
>> a mount is actually removed.
>
> hm.  I stared at mark_mounts_for_expiry() for a while trying to work
> out how that can happen and what it semantically *means*, and failed.
>
> I guess I'm just not smart/experienced enough.

The semantics are a classic two pass exiry.  The mount is not in use
and it was not in use last time we came by remove the mount.

The code is fs/namespace.c is especially clever.  It took me a long
time starring at it to figure out what is going on.  The bug fix comes
from when I was working out what that code does.

>> For NFS which does the whole expiry of all inodes where it comes from it
>> is a good fit.  For /proc where we don't have to guess it isn't the best
>> fit but it isn't shabby either.
>> 
>> >
>> >> +	if (list_empty(list))
>
> So even after two passes through mark_mounts_for_expiry(), there can
> still be mounts on our list.

Only if the mounts are actually in use.

>> >> +		return;
>> >> +
>> >> +	schedule_delayed_work(&proc_automount_task, proc_automount_timeout);
>
> And this causes proc_shrink_automounts() to be called every 500 seconds
> for ever and ever, until proc_automounts is empty.

Yes.

> Again, I just don't know how the reader of this file is to understand
> why this is done this way.  What is the thinking behind it?  What is
> the expected dynamic behaviour?  Under what circumstances will this
> very unusual repeated polling activity be triggered?


Basically if someone does:

cd /proc/<pid>/net/stat  or something like that and holds one of the dentries
alive.  The mount is not allowed to go away.

When a process exits release_task is called which in turn calls
proc_shrink_automounts().  In a normal case it will remove mounts that
are under that process.  

Unfortunately if one of these mounts is busy because the path to a dentry
references it we can not free it (although d_revalidate will cause the
dentry to become unhashed so we can't find the mount going in the other
direction).

So we have to periodically check, is the mount free yet.

If we could do all of this with reference counting so that the
mount would persist exactly until the last user of it has gone
away without a periodic poll I would love it.  But the infrastructure
doesn't support that today, and where this is at least partially
a bug fix I would rather not have the change depend on enhancing
the VFS.

The algorithm is actually very aggressive and in practice you don't
see any /proc/<pid>/net showing up as a mount point.

> Obviously, that becomes clearer as one spends more time with the code,
> but I wonder whether this has all been made as maintainble as it
> possibly could be.

Good question.

In the sense of will we have to go through and futz with the code all
of the time.  The abstraction seems good.   You put a mount on
the proc_automounts list with do_add_mounts and it goes away eventually
with all of the vfs rules maintained.

In the sense of can the code be read?    Perhaps it could be better.
I expect it helps to have run the code and see /proc/net as a filesystem.
that is magically mounted.

In the long term it also seems to make sense maintenance wise to split
/proc into separate parts that can go different directions, and having
the automounting of the old pieces, helps a lot with that.

Unless there are some real bugs my preference is to evolve the code from
where it is today.  The code is at least simple and stupid.

Eric

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/7] proc: Implement support for automounts in task directories
  2008-11-07  3:51         ` Eric W. Biederman
@ 2008-11-07  4:28           ` Andrew Morton
  2008-11-07 15:51             ` Eric W. Biederman
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2008-11-07  4:28 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-kernel, adobriyan, viro, containers

On Thu, 06 Nov 2008 19:51:23 -0800 ebiederm@xmission.com (Eric W. Biederman) wrote:

> If we could do all of this with reference counting so that the
> mount would persist exactly until the last user of it has gone
> away without a periodic poll I would love it.  But the infrastructure
> doesn't support that today,

Well that sucks.  The free-on-last-put idiom occurs in so many places
and serves us so well.  I wonder what went wrong here?

I guess it has interactions with dentry and inode cache aging which
could get tricky.

> and where this is at least partially
> a bug fix I would rather not have the change depend on enhancing
> the VFS.
> 
> The algorithm is actually very aggressive and in practice you don't
> see any /proc/<pid>/net showing up as a mount point.

Do you think it has failure modes?  Most particularly: obscure usage
patterns which can cause memory exhaustion?

> > Obviously, that becomes clearer as one spends more time with the code,
> > but I wonder whether this has all been made as maintainble as it
> > possibly could be.
> 
> Good question.
> 
> In the sense of will we have to go through and futz with the code all
> of the time.  The abstraction seems good.   You put a mount on
> the proc_automounts list with do_add_mounts and it goes away eventually
> with all of the vfs rules maintained.
> 
> In the sense of can the code be read?    Perhaps it could be better.
> I expect it helps to have run the code and see /proc/net as a filesystem.
> that is magically mounted.

'twould be a useful contribution if you were to enshrine your
discoveries in /*these things*/.  You knew I was working up to that :)


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/7] proc: Implement support for automounts in task directories
  2008-11-06 10:48 ` [PATCH 2/7] proc: Implement support for automounts in task directories Eric W. Biederman
                     ` (2 preceding siblings ...)
  2008-11-07  1:26   ` Andrew Morton
@ 2008-11-07  4:41   ` Alexey Dobriyan
  2008-11-07 16:04     ` [PATCH] proc: Supply proc_shrink_automounts when CONFIG_PROC_FS=N Eric W. Biederman
  3 siblings, 1 reply; 25+ messages in thread
From: Alexey Dobriyan @ 2008-11-07  4:41 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Andrew Morton, linux-kernel, Al Viro, Linux Containers

On Thu, Nov 06, 2008 at 02:48:35AM -0800, Eric W. Biederman wrote:
> --- a/fs/proc/Makefile
> +++ b/fs/proc/Makefile
> @@ -18,6 +18,7 @@ proc-y	+= meminfo.o
>  proc-y	+= stat.o
>  proc-y	+= uptime.o
>  proc-y	+= version.o
> +proc-y	+= automount.o
>  proc-$(CONFIG_PROC_SYSCTL)	+= proc_sysctl.o
>  proc-$(CONFIG_NET)		+= proc_net.o
>  proc-$(CONFIG_PROC_KCORE)	+= kcore.o

> --- a/include/linux/proc_fs.h
> +++ b/include/linux/proc_fs.h
> @@ -101,6 +101,7 @@ extern spinlock_t proc_subdir_lock;
>  
>  extern void proc_root_init(void);
>  
> +void proc_shrink_automounts(void);
>  void proc_flush_task(struct task_struct *task);
>  struct dentry *proc_pid_lookup(struct inode *dir, struct dentry * dentry, struct nameidata *);
>  int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir);
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 80137a5..8a8badb 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -161,6 +161,7 @@ void release_task(struct task_struct * p)
>  repeat:
>  	tracehook_prepare_release_task(p);
>  	atomic_dec(&p->user->processes);
> +	proc_shrink_automounts();

This, of course, won't compile when PROC_FS=n.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/7] proc: Implement support for automounts in task directories
  2008-11-07  4:28           ` Andrew Morton
@ 2008-11-07 15:51             ` Eric W. Biederman
  2008-11-07 16:05               ` Andrew Morton
  0 siblings, 1 reply; 25+ messages in thread
From: Eric W. Biederman @ 2008-11-07 15:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, adobriyan, viro, containers

Andrew Morton <akpm@linux-foundation.org> writes:

> On Thu, 06 Nov 2008 19:51:23 -0800 ebiederm@xmission.com (Eric W. Biederman)
> wrote:
>
>> If we could do all of this with reference counting so that the
>> mount would persist exactly until the last user of it has gone
>> away without a periodic poll I would love it.  But the infrastructure
>> doesn't support that today,
>
> Well that sucks.  The free-on-last-put idiom occurs in so many places
> and serves us so well.  I wonder what went wrong here?

> I guess it has interactions with dentry and inode cache aging which
> could get tricky.

At least in part.  If you just have the dentry you can't  easily
find what is mounted on it.  

>> and where this is at least partially
>> a bug fix I would rather not have the change depend on enhancing
>> the VFS.
>> 
>> The algorithm is actually very aggressive and in practice you don't
>> see any /proc/<pid>/net showing up as a mount point.
>
> Do you think it has failure modes?  Most particularly: obscure usage
> patterns which can cause memory exhaustion?

I don't think we can pin anything that way that we can't
pin right now.

You might be able to pin more if you happen to mount something
on top of /proc/<pid>/net/  but that is an unprivileged operation.

>> > Obviously, that becomes clearer as one spends more time with the code,
>> > but I wonder whether this has all been made as maintainble as it
>> > possibly could be.
>> 
>> Good question.
>> 
>> In the sense of will we have to go through and futz with the code all
>> of the time.  The abstraction seems good.   You put a mount on
>> the proc_automounts list with do_add_mounts and it goes away eventually
>> with all of the vfs rules maintained.
>> 
>> In the sense of can the code be read?    Perhaps it could be better.
>> I expect it helps to have run the code and see /proc/net as a filesystem.
>> that is magically mounted.
>
> 'twould be a useful contribution if you were to enshrine your
> discoveries in /*these things*/.  You knew I was working up to that :)

Short of a big fat comment I'm not certain if there is something I can do
better.

Eric


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] proc: Supply proc_shrink_automounts when CONFIG_PROC_FS=N
  2008-11-07  4:41   ` Alexey Dobriyan
@ 2008-11-07 16:04     ` Eric W. Biederman
  0 siblings, 0 replies; 25+ messages in thread
From: Eric W. Biederman @ 2008-11-07 16:04 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: Andrew Morton, linux-kernel, Al Viro, Linux Containers


Alexey Dobriyan <adobriyan@gmail.com> writes:

>>  	tracehook_prepare_release_task(p);
>>  	atomic_dec(&p->user->processes);
>> +	proc_shrink_automounts();
>
> This, of course, won't compile when PROC_FS=n.

Which of course leads to this incremental patch.

Oops.  I forgot to account for the rare but possible
case of someone not compiling procfs into their kernel.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 include/linux/proc_fs.h |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 52cc6bd..394593d 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -207,6 +207,9 @@ static inline void proc_net_remove(struct net *net, const char *name) {}
 static inline void proc_flush_task(struct task_struct *task)
 {
 }
+static inline void proc_shrink_automounts(void)
+{
+}
 
 static inline struct proc_dir_entry *create_proc_entry(const char *name,
 	mode_t mode, struct proc_dir_entry *parent) { return NULL; }
-- 
1.5.3.rc6.17.g1911

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/7] proc: Implement support for automounts in task directories
  2008-11-07 15:51             ` Eric W. Biederman
@ 2008-11-07 16:05               ` Andrew Morton
  2008-11-07 16:58                 ` Eric W. Biederman
  2008-11-13 23:39                 ` Eric W. Biederman
  0 siblings, 2 replies; 25+ messages in thread
From: Andrew Morton @ 2008-11-07 16:05 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-kernel, adobriyan, viro, containers

On Fri, 07 Nov 2008 07:51:32 -0800 ebiederm@xmission.com (Eric W. Biederman) wrote:

> >> and where this is at least partially
> >> a bug fix I would rather not have the change depend on enhancing
> >> the VFS.
> >> 
> >> The algorithm is actually very aggressive and in practice you don't
> >> see any /proc/<pid>/net showing up as a mount point.
> >
> > Do you think it has failure modes?  Most particularly: obscure usage
> > patterns which can cause memory exhaustion?
> 
> I don't think we can pin anything that way that we can't
> pin right now.
> 
> You might be able to pin more if you happen to mount something
> on top of /proc/<pid>/net/  but that is an unprivileged operation.

I was thinking more along the lines of some repeated operation which
generates reclaimable storage, only nothing reclaims that storage
sufficiently promptly (the 500 second delay, perhaps).

Like the problem we had with SLAB_DESTROY_BY_RCU and, umm, I think it
was route cache entries increasing like mad and not getting reaped.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/7] proc: Implement support for automounts in task directories
  2008-11-07 16:05               ` Andrew Morton
@ 2008-11-07 16:58                 ` Eric W. Biederman
  2008-11-13 23:39                 ` Eric W. Biederman
  1 sibling, 0 replies; 25+ messages in thread
From: Eric W. Biederman @ 2008-11-07 16:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, adobriyan, viro, containers

Andrew Morton <akpm@linux-foundation.org> writes:

> I was thinking more along the lines of some repeated operation which
> generates reclaimable storage, only nothing reclaims that storage
> sufficiently promptly (the 500 second delay, perhaps).
>
> Like the problem we had with SLAB_DESTROY_BY_RCU and, umm, I think it
> was route cache entries increasing like mad and not getting reaped.

Well what happens is I run reclaim that chain whenever a process
exits or every 500s.  The 500s is really the backup in case
we don't have any processes exiting, which is a really
strange workload.

Now maybe there is an extremely perverse case out there that
can trigger some bad behavior.  But I can't think of anything
like that at the moment.  It would require accessing a lot
of /proc/<pid>/net directories and holding them open even after
their files were gone, pinning a lot of mounts, and then
use a lot of memory elsewhere.  Where the problem would
be that the code is not well tied in with memory reclaim.

A mount is not a big or expensive structure and it requires
a fd somewhere to keep it alive past exit.  Thinking back
to my analysis on /proc a few years ago where I introduced
struct pid to prevent lowmem exhaustion.  I can't think of
a case where it would be a problem or user triggerable.
Basically you have to have a fd open for every proc inode
that lives past process exit.  And we have hard limits on
the number of fds a process can open and limits on the
number of processes we can have.

One of the things on my todo list to look at sometime is the issue
that mounts can deny you the permission to delete a file or directory
when the mount is in another mount namespace.  It is a pretty nasty
DOS from my opinion.  Especially when the last process holding open
the mount namespace oopses and there is no way to remove the mount.
However that DOS is only available to root today so it doesn't feel
like a huge danger.  To fix that we would need to introduce some
better mount reaping logic.  Which I expect would remove the need for
the proc_automounts.   That is tricky subtle vfs logic and I don't
plan to rush into it.

Eric

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/7] proc: Implement support for automounts in task directories
  2008-11-07 16:05               ` Andrew Morton
  2008-11-07 16:58                 ` Eric W. Biederman
@ 2008-11-13 23:39                 ` Eric W. Biederman
  2008-11-19  0:07                   ` Alexey Dobriyan
  1 sibling, 1 reply; 25+ messages in thread
From: Eric W. Biederman @ 2008-11-13 23:39 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, adobriyan, viro, containers


On this patchset it looks like the conversation has wound down with no
real objects except that some very simple pieces of the code appear
magic.


What looks to be the best path for getting this set of patches merged
into /proc?

Alexy will you pick this up in your /proc tree as Andrew suggested?
Do I need to keep my own tree?

Can I have some clear guidance on which hurdles to jump through to
make forward progress?

Eric


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/7] proc: Implement support for automounts in task directories
  2008-11-13 23:39                 ` Eric W. Biederman
@ 2008-11-19  0:07                   ` Alexey Dobriyan
  2008-11-19  2:35                     ` Alexey Dobriyan
  0 siblings, 1 reply; 25+ messages in thread
From: Alexey Dobriyan @ 2008-11-19  0:07 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Andrew Morton, linux-kernel, viro, containers

On Thu, Nov 13, 2008 at 03:39:44PM -0800, Eric W. Biederman wrote:
> On this patchset it looks like the conversation has wound down with no
> real objects except that some very simple pieces of the code appear
> magic.
> 
> 
> What looks to be the best path for getting this set of patches merged
> into /proc?

Al's ACK minimum.

> Alexy will you pick this up in your /proc tree as Andrew suggested?

Yep, doing it right now.

> Do I need to keep my own tree?

No need.

> Can I have some clear guidance on which hurdles to jump through to
> make forward progress?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/7] proc: Implement support for automounts in task directories
  2008-11-19  0:07                   ` Alexey Dobriyan
@ 2008-11-19  2:35                     ` Alexey Dobriyan
  2008-11-19 13:20                       ` Eric W. Biederman
  0 siblings, 1 reply; 25+ messages in thread
From: Alexey Dobriyan @ 2008-11-19  2:35 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Andrew Morton, linux-kernel, viro, containers

On Wed, Nov 19, 2008 at 03:07:49AM +0300, Alexey Dobriyan wrote:
> On Thu, Nov 13, 2008 at 03:39:44PM -0800, Eric W. Biederman wrote:
> > On this patchset it looks like the conversation has wound down with no
> > real objects except that some very simple pieces of the code appear
> > magic.
> > 
> > 
> > What looks to be the best path for getting this set of patches merged
> > into /proc?
> 
> Al's ACK minimum.
> 
> > Alexy will you pick this up in your /proc tree as Andrew suggested?
> 
> Yep, doing it right now.
> 
> > Do I need to keep my own tree?
> 
> No need.
> 
> > Can I have some clear guidance on which hurdles to jump through to
> > make forward progress?

OK, I fixed some small issues and pushed out to proc.git so it starts
propagate to -next.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/7] proc: Implement support for automounts in task directories
  2008-11-19  2:35                     ` Alexey Dobriyan
@ 2008-11-19 13:20                       ` Eric W. Biederman
  0 siblings, 0 replies; 25+ messages in thread
From: Eric W. Biederman @ 2008-11-19 13:20 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: Andrew Morton, linux-kernel, viro, containers

Alexey Dobriyan <adobriyan@gmail.com> writes:

> OK, I fixed some small issues and pushed out to proc.git so it starts
> propagate to -next.

Thanks.

Eric


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2008-11-19 13:26 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-11-06 10:38 [PATCH 1/7] vfs: Fix shrink_submounts Eric W. Biederman
2008-11-06 10:48 ` [PATCH 2/7] proc: Implement support for automounts in task directories Eric W. Biederman
2008-11-06 10:49   ` [PATCH 3/7] proc: Support multiple filesystems using the proc generic infrastructure Eric W. Biederman
2008-11-06 10:53     ` [PATCH 4/7] proc: Make /proc/net it's own filesystem Eric W. Biederman
2008-11-06 10:56       ` [PATCH 5/7] proc_net: Don't show the wrong /proc/net after unshare Eric W. Biederman
2008-11-06 10:57         ` [PATCH 6/7] proc_net: Simplify network namespace lookup Eric W. Biederman
2008-11-06 10:58           ` [PATCH 7/7] proc: Cleanup proc_flush_task Eric W. Biederman
2008-11-07  1:25   ` [PATCH 2/7] proc: Implement support for automounts in task directories Andrew Morton
2008-11-07  2:02     ` Eric W. Biederman
2008-11-07  1:26   ` Andrew Morton
2008-11-07  2:05     ` Eric W. Biederman
2008-11-07  2:49       ` Andrew Morton
2008-11-07  3:51         ` Eric W. Biederman
2008-11-07  4:28           ` Andrew Morton
2008-11-07 15:51             ` Eric W. Biederman
2008-11-07 16:05               ` Andrew Morton
2008-11-07 16:58                 ` Eric W. Biederman
2008-11-13 23:39                 ` Eric W. Biederman
2008-11-19  0:07                   ` Alexey Dobriyan
2008-11-19  2:35                     ` Alexey Dobriyan
2008-11-19 13:20                       ` Eric W. Biederman
2008-11-07  4:41   ` Alexey Dobriyan
2008-11-07 16:04     ` [PATCH] proc: Supply proc_shrink_automounts when CONFIG_PROC_FS=N Eric W. Biederman
2008-11-07  1:22 ` [PATCH 1/7] vfs: Fix shrink_submounts Andrew Morton
2008-11-07  2:06   ` Eric W. Biederman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).