Linux-Fsdevel Archive on lore.kernel.org
* [PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev()
       [not found] <cover.1512741134.git.dongsu@kinvolk.io>
@ 2017-12-22 14:32 ` Dongsu Park
  2017-12-22 18:59   ` Coly Li
  2017-12-23  3:03   ` Serge E. Hallyn
  2017-12-22 14:32 ` [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes Dongsu Park
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 89+ messages in thread
From: Dongsu Park @ 2017-12-22 14:32 UTC
  To: linux-kernel
  Cc: containers, Alban Crequy, Eric W . Biederman, Miklos Szeredi,
	Seth Forshee, Sargun Dhillon, Dongsu Park, dm-devel,
	linux-bcache, linux-fsdevel, linux-mtd, Alexander Viro, Jan Kara,
	Serge Hallyn

From: Seth Forshee <seth.forshee@canonical.com>

When looking up a block device by path, no permission check is
done to verify that the user has access to the block device inode
at the specified path. In some cases it may be necessary to
check permissions on the inode, for example to allow
unprivileged users to mount block devices in user namespaces.

Add an argument to lookup_bdev() to optionally perform this
permission check. A value of 0 skips the permission check and
behaves the same as before. A non-zero value specifies the mask
of access rights (MAY_READ, MAY_WRITE, MAY_EXEC) required on the
inode at the specified path. The check is always skipped if the
caller has global CAP_SYS_ADMIN.

All callers of lookup_bdev() currently pass a mask of 0, so this
patch results in no functional change. Subsequent patches will
add permission checks where appropriate.

Patch v4 is available: https://patchwork.kernel.org/patch/8943601/

Cc: dm-devel@redhat.com
Cc: linux-bcache@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mtd@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jan Kara <jack@suse.com>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
---
 drivers/md/bcache/super.c |  2 +-
 drivers/md/dm-table.c     |  2 +-
 drivers/mtd/mtdsuper.c    |  2 +-
 fs/block_dev.c            | 13 ++++++++++---
 fs/quota/quota.c          |  2 +-
 include/linux/fs.h        |  2 +-
 6 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index b4d28928..acc9d56c 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1967,7 +1967,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 				  sb);
 	if (IS_ERR(bdev)) {
 		if (bdev == ERR_PTR(-EBUSY)) {
-			bdev = lookup_bdev(strim(path));
+			bdev = lookup_bdev(strim(path), 0);
 			mutex_lock(&bch_register_lock);
 			if (!IS_ERR(bdev) && bch_is_open(bdev))
 				err = "device already registered";
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 88130b5d..bca5eaf4 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -410,7 +410,7 @@ dev_t dm_get_dev_t(const char *path)
 	dev_t dev;
 	struct block_device *bdev;
 
-	bdev = lookup_bdev(path);
+	bdev = lookup_bdev(path, 0);
 	if (IS_ERR(bdev))
 		dev = name_to_dev_t(path);
 	else {
diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
index e43fea89..4a4d40c0 100644
--- a/drivers/mtd/mtdsuper.c
+++ b/drivers/mtd/mtdsuper.c
@@ -180,7 +180,7 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
 	/* try the old way - the hack where we allowed users to mount
 	 * /dev/mtdblock$(n) but didn't actually _use_ the blockdev
 	 */
-	bdev = lookup_bdev(dev_name);
+	bdev = lookup_bdev(dev_name, 0);
 	if (IS_ERR(bdev)) {
 		ret = PTR_ERR(bdev);
 		pr_debug("MTDSB: lookup_bdev() returned %d\n", ret);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4a181fcb..5ca06095 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1662,7 +1662,7 @@ struct block_device *blkdev_get_by_path(const char *path, fmode_t mode,
 	struct block_device *bdev;
 	int err;
 
-	bdev = lookup_bdev(path);
+	bdev = lookup_bdev(path, 0);
 	if (IS_ERR(bdev))
 		return bdev;
 
@@ -2052,12 +2052,14 @@ EXPORT_SYMBOL(ioctl_by_bdev);
 /**
  * lookup_bdev  - lookup a struct block_device by name
  * @pathname:	special file representing the block device
+ * @mask:	rights to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
  *
  * Get a reference to the blockdevice at @pathname in the current
  * namespace if possible and return it.  Return ERR_PTR(error)
- * otherwise.
+ * otherwise.  If @mask is non-zero, check for access rights to the
+ * inode at @pathname.
  */
-struct block_device *lookup_bdev(const char *pathname)
+struct block_device *lookup_bdev(const char *pathname, int mask)
 {
 	struct block_device *bdev;
 	struct inode *inode;
@@ -2072,6 +2074,11 @@ struct block_device *lookup_bdev(const char *pathname)
 		return ERR_PTR(error);
 
 	inode = d_backing_inode(path.dentry);
+	if (mask != 0 && !capable(CAP_SYS_ADMIN)) {
+		error = __inode_permission(inode, mask);
+		if (error)
+			goto fail;
+	}
 	error = -ENOTBLK;
 	if (!S_ISBLK(inode->i_mode))
 		goto fail;
diff --git a/fs/quota/quota.c b/fs/quota/quota.c
index 43612e2a..e5d47955 100644
--- a/fs/quota/quota.c
+++ b/fs/quota/quota.c
@@ -807,7 +807,7 @@ static struct super_block *quotactl_block(const char __user *special, int cmd)
 
 	if (IS_ERR(tmp))
 		return ERR_CAST(tmp);
-	bdev = lookup_bdev(tmp->name);
+	bdev = lookup_bdev(tmp->name, 0);
 	putname(tmp);
 	if (IS_ERR(bdev))
 		return ERR_CAST(bdev);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2995a271..fce19c49 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2551,7 +2551,7 @@ static inline void unregister_chrdev(unsigned int major, const char *name)
 #define BLKDEV_MAJOR_MAX	512
 extern const char *__bdevname(dev_t, char *buffer);
 extern const char *bdevname(struct block_device *bdev, char *buffer);
-extern struct block_device *lookup_bdev(const char *);
+extern struct block_device *lookup_bdev(const char *, int mask);
 extern void blkdev_show(struct seq_file *,off_t);
 
 #else
-- 
2.13.6

* [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes
       [not found] <cover.1512741134.git.dongsu@kinvolk.io>
  2017-12-22 14:32 ` [PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev() Dongsu Park
@ 2017-12-22 14:32 ` Dongsu Park
  2017-12-23  3:17   ` Serge E. Hallyn
                     ` (2 more replies)
  2017-12-22 14:32 ` [PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root Dongsu Park
                   ` (6 subsequent siblings)
  8 siblings, 3 replies; 89+ messages in thread
From: Dongsu Park @ 2017-12-22 14:32 UTC
  To: linux-kernel
  Cc: containers, Alban Crequy, Eric W . Biederman, Miklos Szeredi,
	Seth Forshee, Sargun Dhillon, Dongsu Park, linux-fsdevel,
	Alexander Viro, Luis R. Rodriguez, Kees Cook

From: Eric W. Biederman <ebiederm@xmission.com>

Allow users with CAP_CHOWN over the superblock of a filesystem to
chown files.  Ordinarily the capable_wrt_inode_uidgid check is
sufficient to allow access to files, but when the underlying filesystem
has uids or gids that don't map to the current user namespace it is
not enough, so the chown permission checks need to be extended to
allow this case.

Calling chown on filesystem nodes whose uid or gid don't map is
necessary if those nodes are going to be modified, because writing back
inodes which contain uids or gids that don't map is likely to corrupt
the uid or gid fields in the filesystem.

Once chown has been called, the existing capable_wrt_inode_uidgid
checks are sufficient to allow the owner of a superblock to do anything
the global root user can do with an appropriate set of capabilities.
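A userspace model of the new chown_ok() decision order, under stated assumptions: ids are plain uint32_t values, a user namespace is reduced to one hypothetical uid_map extent (struct model_userns), MODEL_INVALID_UID stands in for INVALID_UID, and the two capability checks are passed in as booleans. Only the uid half is modelled; the real capable_wrt_inode_uidgid() also requires the inode's gid to map.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_INVALID_UID ((uint32_t)-1) /* stands in for INVALID_UID */

/* Hypothetical single-extent uid_map: namespace ids [first, first + count)
 * map to kernel ids [lower_first, lower_first + count). */
struct model_userns {
	uint32_t first, lower_first, count;
};

/* Like kuid_has_mapping(): is this kernel uid visible in the namespace? */
static bool model_kuid_has_mapping(const struct model_userns *ns, uint32_t kuid)
{
	return kuid >= ns->lower_first && kuid - ns->lower_first < ns->count;
}

/* cap_in_cur_ns: CAP_CHOWN in current_user_ns()
 * cap_in_sb_ns:  ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN) */
static bool model_chown_ok(uint32_t fsuid, uint32_t inode_uid, uint32_t new_uid,
			   const struct model_userns *cur_ns,
			   bool cap_in_cur_ns, bool cap_in_sb_ns)
{
	/* The owner "changing" ownership to itself. */
	if (fsuid == inode_uid && new_uid == inode_uid)
		return true;
	/* capable_wrt_inode_uidgid(): fails when the inode owner does not map. */
	if (cap_in_cur_ns && model_kuid_has_mapping(cur_ns, inode_uid))
		return true;
	/* Added by this patch: CAP_CHOWN over the superblock's user namespace. */
	if (cap_in_sb_ns)
		return true;
	return false;
}
```

For a container mapping 0-65535 to 100000-165535, container root cannot chown an inode whose on-disk uid did not map (i_uid is INVALID_UID) through capable_wrt_inode_uidgid alone; the new ns_capable() branch is what allows it.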

For the proc filesystem this relaxation of permissions is not safe, as
some files are owned by users (particularly GLOBAL_ROOT_UID) outside
of the control of the mounter of proc, and it would be unsafe to
grant chown access to those files.  So update setattr on proc to
disallow changing files whose uids or gids do not map into proc's
s_user_ns.
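The guard added to proc_setattr(), proc_notify_change() and proc_sys_setattr() can be sketched in userspace as follows. struct model_idmap (one hypothetical uid_map/gid_map extent) and model_proc_setattr_check() are illustrative names, not kernel symbols; the logic mirrors the kuid_has_mapping()/kgid_has_mapping() test in the diff.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-extent id map, like one line of uid_map or gid_map. */
struct model_idmap {
	uint32_t first, lower_first, count;
};

static bool model_has_mapping(const struct model_idmap *m, uint32_t kid)
{
	return kid >= m->lower_first && kid - m->lower_first < m->count;
}

/* Refuse any setattr on a proc inode whose owner or group does not map
 * into the superblock's user namespace (proc's s_user_ns). */
static int model_proc_setattr_check(const struct model_idmap *s_uid_map,
				    const struct model_idmap *s_gid_map,
				    uint32_t i_uid, uint32_t i_gid)
{
	if (!model_has_mapping(s_uid_map, i_uid) ||
	    !model_has_mapping(s_gid_map, i_gid))
		return -EPERM;
	return 0;
}
```

With proc mounted in a container that maps 0-65535 to 100000-165535, a file owned by GLOBAL_ROOT_UID (kernel uid 0) is rejected before setattr_prepare() is even consulted.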

The original version of this patch was written by Seth Forshee.  I
have rewritten and rethought this patch enough that it's really not the
same thing (certainly it needs a different description), but he
deserves credit for getting out there and getting the conversation
started, finding the potential gotchas, and putting up with my
semi-paranoid feedback.

Patch v4 is available: https://patchwork.kernel.org/patch/8944611/

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Inspired-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
[saf: Resolve conflicts caused by s/inode_change_ok/setattr_prepare/]
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
---
 fs/attr.c             | 34 ++++++++++++++++++++++++++--------
 fs/proc/base.c        |  7 +++++++
 fs/proc/generic.c     |  7 +++++++
 fs/proc/proc_sysctl.c |  7 +++++++
 4 files changed, 47 insertions(+), 8 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index 12ffdb6f..bf8e94f3 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -18,6 +18,30 @@
 #include <linux/evm.h>
 #include <linux/ima.h>
 
+static bool chown_ok(const struct inode *inode, kuid_t uid)
+{
+	if (uid_eq(current_fsuid(), inode->i_uid) &&
+	    uid_eq(uid, inode->i_uid))
+		return true;
+	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+		return true;
+	if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
+		return true;
+	return false;
+}
+
+static bool chgrp_ok(const struct inode *inode, kgid_t gid)
+{
+	if (uid_eq(current_fsuid(), inode->i_uid) &&
+	    (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
+		return true;
+	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+		return true;
+	if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
+		return true;
+	return false;
+}
+
 /**
  * setattr_prepare - check if attribute changes to a dentry are allowed
  * @dentry:	dentry to check
@@ -52,17 +76,11 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
 		goto kill_priv;
 
 	/* Make sure a caller can chown. */
-	if ((ia_valid & ATTR_UID) &&
-	    (!uid_eq(current_fsuid(), inode->i_uid) ||
-	     !uid_eq(attr->ia_uid, inode->i_uid)) &&
-	    !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+	if ((ia_valid & ATTR_UID) && !chown_ok(inode, attr->ia_uid))
 		return -EPERM;
 
 	/* Make sure caller can chgrp. */
-	if ((ia_valid & ATTR_GID) &&
-	    (!uid_eq(current_fsuid(), inode->i_uid) ||
-	    (!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) &&
-	    !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+	if ((ia_valid & ATTR_GID) && !chgrp_ok(inode, attr->ia_gid))
 		return -EPERM;
 
 	/* Make sure a caller can chmod. */
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 31934cb9..9d50ec92 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -665,10 +665,17 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
 {
 	int error;
 	struct inode *inode = d_inode(dentry);
+	struct user_namespace *s_user_ns;
 
 	if (attr->ia_valid & ATTR_MODE)
 		return -EPERM;
 
+	/* Don't let anyone mess with weird proc files */
+	s_user_ns = inode->i_sb->s_user_ns;
+	if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
+	    !kgid_has_mapping(s_user_ns, inode->i_gid))
+		return -EPERM;
+
 	error = setattr_prepare(dentry, attr);
 	if (error)
 		return error;
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 793a6757..527d46c8 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -106,8 +106,15 @@ static int proc_notify_change(struct dentry *dentry, struct iattr *iattr)
 {
 	struct inode *inode = d_inode(dentry);
 	struct proc_dir_entry *de = PDE(inode);
+	struct user_namespace *s_user_ns;
 	int error;
 
+	/* Don't let anyone mess with weird proc files */
+	s_user_ns = inode->i_sb->s_user_ns;
+	if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
+	    !kgid_has_mapping(s_user_ns, inode->i_gid))
+		return -EPERM;
+
 	error = setattr_prepare(dentry, iattr);
 	if (error)
 		return error;
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index c5cbbdff..0f9562d1 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -802,11 +802,18 @@ static int proc_sys_permission(struct inode *inode, int mask)
 static int proc_sys_setattr(struct dentry *dentry, struct iattr *attr)
 {
 	struct inode *inode = d_inode(dentry);
+	struct user_namespace *s_user_ns;
 	int error;
 
 	if (attr->ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID))
 		return -EPERM;
 
+	/* Don't let anyone mess with weird proc files */
+	s_user_ns = inode->i_sb->s_user_ns;
+	if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
+	    !kgid_has_mapping(s_user_ns, inode->i_gid))
+		return -EPERM;
+
 	error = setattr_prepare(dentry, attr);
 	if (error)
 		return error;
-- 
2.13.6

* [PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root
       [not found] <cover.1512741134.git.dongsu@kinvolk.io>
  2017-12-22 14:32 ` [PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev() Dongsu Park
  2017-12-22 14:32 ` [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes Dongsu Park
@ 2017-12-22 14:32 ` Dongsu Park
  2017-12-23  3:26   ` Serge E. Hallyn
  2017-12-22 14:32 ` [PATCH 05/11] fs: Allow superblock owner to access do_remount_sb() Dongsu Park
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 89+ messages in thread
From: Dongsu Park @ 2017-12-22 14:32 UTC
  To: linux-kernel
  Cc: containers, Alban Crequy, Eric W . Biederman, Miklos Szeredi,
	Seth Forshee, Sargun Dhillon, Dongsu Park, linux-fsdevel,
	Alexander Viro, Serge Hallyn

From: Seth Forshee <seth.forshee@canonical.com>

Expand the check in should_remove_suid() to keep privileges for
CAP_FSETID in s_user_ns rather than init_user_ns.

Patch v4 is available: https://patchwork.kernel.org/patch/8944621/

--EWB Changed from ns_capable(sb->s_user_ns, ) to capable_wrt_inode_uidgid

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
---
 fs/inode.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index fd401028..6459a437 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1749,7 +1749,8 @@ EXPORT_SYMBOL(touch_atime);
  */
 int should_remove_suid(struct dentry *dentry)
 {
-	umode_t mode = d_inode(dentry)->i_mode;
+	struct inode *inode = d_inode(dentry);
+	umode_t mode = inode->i_mode;
 	int kill = 0;
 
 	/* suid always must be killed */
@@ -1763,7 +1764,8 @@ int should_remove_suid(struct dentry *dentry)
 	if (unlikely((mode & S_ISGID) && (mode & S_IXGRP)))
 		kill |= ATTR_KILL_SGID;
 
-	if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode)))
+	if (unlikely(kill && !capable_wrt_inode_uidgid(inode, CAP_FSETID) &&
+		     S_ISREG(mode)))
 		return kill;
 
 	return 0;
-- 
2.13.6

* [PATCH 05/11] fs: Allow superblock owner to access do_remount_sb()
       [not found] <cover.1512741134.git.dongsu@kinvolk.io>
                   ` (2 preceding siblings ...)
  2017-12-22 14:32 ` [PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root Dongsu Park
@ 2017-12-22 14:32 ` Dongsu Park
  2017-12-23  3:30   ` Serge E. Hallyn
  2017-12-22 14:32 ` [PATCH 07/11] fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems Dongsu Park
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 89+ messages in thread
From: Dongsu Park @ 2017-12-22 14:32 UTC
  To: linux-kernel
  Cc: containers, Alban Crequy, Eric W . Biederman, Miklos Szeredi,
	Seth Forshee, Sargun Dhillon, Dongsu Park, linux-fsdevel,
	Alexander Viro, Serge Hallyn

From: Seth Forshee <seth.forshee@canonical.com>

Superblock-level remounts are currently restricted to global
CAP_SYS_ADMIN, as is the path for changing the root mount to
read-only on umount. Loosen both of these permission checks to
also allow CAP_SYS_ADMIN in any namespace which is privileged
towards the userns which originally mounted the filesystem.
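Both relaxed checks now call ns_capable(sb->s_user_ns, CAP_SYS_ADMIN). The sketch below is a simplified userspace model of how such a check is resolved, patterned on the walk in cap_capable(): a capability held in a namespace also covers its descendants, and the owner of a child namespace has all capabilities in it. struct model_userns and model_ns_capable() are hypothetical, and the caller's effective capability set is reduced to one boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct user_namespace. */
struct model_userns {
	const struct model_userns *parent; /* NULL for init_user_ns */
	uint32_t owner;                    /* euid of the namespace creator */
};

/* target:  the namespace the capability must be held over (sb->s_user_ns)
 * cred_ns: the caller's user namespace
 * has_cap: whether the capability is in the caller's effective set */
static bool model_ns_capable(const struct model_userns *target,
			     const struct model_userns *cred_ns,
			     uint32_t euid, bool has_cap)
{
	for (const struct model_userns *ns = target; ns; ns = ns->parent) {
		if (ns == cred_ns)
			return has_cap;
		/* The creator of a child namespace has all caps in it. */
		if (ns->parent == cred_ns && ns->owner == euid)
			return true;
	}
	return false; /* caller's namespace is not an ancestor of target */
}
```

For a filesystem mounted in a container namespace, container root, global root and the namespace creator pass; container root still cannot remount a superblock whose s_user_ns is init_user_ns.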

Patch v4 is available: https://patchwork.kernel.org/patch/8944631/

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
---
 fs/namespace.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index e158ec6b..830040d7 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1589,7 +1589,7 @@ static int do_umount(struct mount *mnt, int flags)
 		 * Special case for "unmounting" root ...
 		 * we just try to remount it readonly.
 		 */
-		if (!capable(CAP_SYS_ADMIN))
+		if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
 			return -EPERM;
 		down_write(&sb->s_umount);
 		if (!sb_rdonly(sb))
@@ -2327,7 +2327,7 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
 	down_write(&sb->s_umount);
 	if (ms_flags & MS_BIND)
 		err = change_mount_flags(path->mnt, ms_flags);
-	else if (!capable(CAP_SYS_ADMIN))
+	else if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
 		err = -EPERM;
 	else
 		err = do_remount_sb(sb, sb_flags, data, 0);
-- 
2.13.6

* [PATCH 07/11] fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
       [not found] <cover.1512741134.git.dongsu@kinvolk.io>
                   ` (3 preceding siblings ...)
  2017-12-22 14:32 ` [PATCH 05/11] fs: Allow superblock owner to access do_remount_sb() Dongsu Park
@ 2017-12-22 14:32 ` Dongsu Park
  2017-12-23  3:39   ` Serge E. Hallyn
  2018-02-14 12:28   ` Miklos Szeredi
  2017-12-22 14:32 ` [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns Dongsu Park
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 89+ messages in thread
From: Dongsu Park @ 2017-12-22 14:32 UTC
  To: linux-kernel
  Cc: containers, Alban Crequy, Eric W . Biederman, Miklos Szeredi,
	Seth Forshee, Sargun Dhillon, Dongsu Park, linux-fsdevel,
	Alexander Viro

From: Seth Forshee <seth.forshee@canonical.com>

The user in control of a superblock should be allowed to freeze
and thaw it. Relax the restrictions on the FIFREEZE and FITHAW
ioctls to require CAP_SYS_ADMIN in s_user_ns.
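From userspace these checks are reached through the FIFREEZE and FITHAW ioctls defined in <linux/fs.h>. A minimal sketch; the helper name freeze_and_thaw() is hypothetical, and it thaws immediately after freezing so the filesystem is not left frozen.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>   /* FIFREEZE, FITHAW */
#include <sys/ioctl.h>
#include <unistd.h>

/* Freeze, then immediately thaw, the filesystem containing `path`.
 * Returns 0 on success or a negative errno value. */
static int freeze_and_thaw(const char *path)
{
	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	int ret = 0;
	if (ioctl(fd, FIFREEZE, 0) < 0)
		ret = -errno;            /* e.g. -EPERM without the capability */
	else if (ioctl(fd, FITHAW, 0) < 0)
		ret = -errno;

	close(fd);
	return ret;
}
```

With this patch applied, -EPERM would be expected only when the caller lacks CAP_SYS_ADMIN in the superblock's s_user_ns; -EOPNOTSUPP is returned if the filesystem has no freeze support.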

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
---
 fs/ioctl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 5ace7efb..8c628a8d 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -549,7 +549,7 @@ static int ioctl_fsfreeze(struct file *filp)
 {
 	struct super_block *sb = file_inode(filp)->i_sb;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 
 	/* If filesystem doesn't support freeze feature, return. */
@@ -566,7 +566,7 @@ static int ioctl_fsthaw(struct file *filp)
 {
 	struct super_block *sb = file_inode(filp)->i_sb;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 
 	/* Thaw */
-- 
2.13.6

* [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
       [not found] <cover.1512741134.git.dongsu@kinvolk.io>
                   ` (4 preceding siblings ...)
  2017-12-22 14:32 ` [PATCH 07/11] fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems Dongsu Park
@ 2017-12-22 14:32 ` Dongsu Park
  2017-12-23  3:46   ` Serge E. Hallyn
                     ` (3 more replies)
  2017-12-22 14:32 ` [PATCH 09/11] fuse: Restrict allow_other to the superblock's namespace or a descendant Dongsu Park
                   ` (2 subsequent siblings)
  8 siblings, 4 replies; 89+ messages in thread
From: Dongsu Park @ 2017-12-22 14:32 UTC
  To: linux-kernel
  Cc: containers, Alban Crequy, Eric W . Biederman, Miklos Szeredi,
	Seth Forshee, Sargun Dhillon, Dongsu Park, linux-fsdevel

From: Seth Forshee <seth.forshee@canonical.com>

In order to support mounts from namespaces other than
init_user_ns, fuse must translate uids and gids to/from the
userns of the process servicing requests on /dev/fuse. This
patch does that, with a couple of restrictions on the namespace:

 - The userns for the fuse connection is fixed to the namespace
   from which /dev/fuse is opened.

 - The namespace must be the same as s_user_ns.

These restrictions simplify the implementation by avoiding the
need to pass around userns references and by allowing fuse to
rely on the ownership-change checks in setattr_prepare() (formerly
inode_change_ok()). Either restriction could be relaxed in the
future if needed.

For cuse, the namespace used for the connection is also simply
current_user_ns() at the time /dev/cuse is opened.
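The uid translation in fuse_req_init_context() and the new -EOVERFLOW check in __fuse_get_req() can be modelled in userspace as below. struct model_idmap is a hypothetical single uid_map extent, and model_make_kuid()/model_from_kuid() only imitate make_kuid()/from_kuid(); the gid path works the same way and is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MODEL_INVALID_ID ((uint32_t)-1) /* INVALID_UID / (uid_t)-1 */

/* Hypothetical single-extent id map: namespace ids [first, first + count)
 * <-> kernel ids [lower_first, lower_first + count). */
struct model_idmap {
	uint32_t first, lower_first, count;
};

/* Like make_kuid(): namespace id -> kernel id. */
static uint32_t model_make_kuid(const struct model_idmap *m, uint32_t uid)
{
	if (uid >= m->first && uid - m->first < m->count)
		return m->lower_first + (uid - m->first);
	return MODEL_INVALID_ID;
}

/* Like from_kuid(): kernel id -> namespace id, (uid_t)-1 if unmapped. */
static uint32_t model_from_kuid(const struct model_idmap *m, uint32_t kuid)
{
	if (kuid >= m->lower_first && kuid - m->lower_first < m->count)
		return m->first + (kuid - m->lower_first);
	return MODEL_INVALID_ID;
}

/* Fill the request header uid as seen from fc->user_ns, failing with
 * -EOVERFLOW when the caller's fsuid has no mapping there. */
static int model_fill_req_uid(const struct model_idmap *conn_ns,
			      uint32_t fsuid_kernel, uint32_t *hdr_uid)
{
	*hdr_uid = model_from_kuid(conn_ns, fsuid_kernel);
	return *hdr_uid == MODEL_INVALID_ID ? -EOVERFLOW : 0;
}
```

For a daemon whose namespace maps 0-65535 to 100000-165535, a request from kernel uid 100005 reaches the daemon as uid 5, while a request from host root (kernel uid 0) is refused instead of being sent with an unmapped id.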

Patch v4 is available: https://patchwork.kernel.org/patch/8944661/

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
---
 fs/fuse/cuse.c   |  3 ++-
 fs/fuse/dev.c    | 11 ++++++++---
 fs/fuse/dir.c    | 14 +++++++-------
 fs/fuse/fuse_i.h |  6 +++++-
 fs/fuse/inode.c  | 31 +++++++++++++++++++------------
 5 files changed, 41 insertions(+), 24 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index e9e97803..b1b83259 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -48,6 +48,7 @@
 #include <linux/stat.h>
 #include <linux/module.h>
 #include <linux/uio.h>
+#include <linux/user_namespace.h>
 
 #include "fuse_i.h"
 
@@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
 	if (!cc)
 		return -ENOMEM;
 
-	fuse_conn_init(&cc->fc);
+	fuse_conn_init(&cc->fc, current_user_ns());
 
 	fud = fuse_dev_alloc(&cc->fc);
 	if (!fud) {
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 17f0d05b..0f780e16 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
 
 static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
 {
-	req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
-	req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
+	req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
+	req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
 	req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
 }
 
@@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
 	__set_bit(FR_WAITING, &req->flags);
 	if (for_background)
 		__set_bit(FR_BACKGROUND, &req->flags);
+	if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) {
+		fuse_put_request(fc, req);
+		return ERR_PTR(-EOVERFLOW);
+	}
 
 	return req;
 
@@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
 	in = &req->in;
 	reqsize = in->h.len;
 
-	if (task_active_pid_ns(current) != fc->pid_ns) {
+	if (task_active_pid_ns(current) != fc->pid_ns ||
+	    current_user_ns() != fc->user_ns) {
 		rcu_read_lock();
 		in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
 		rcu_read_unlock();
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 24967382..ad1cfac1 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 	stat->ino = attr->ino;
 	stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
 	stat->nlink = attr->nlink;
-	stat->uid = make_kuid(&init_user_ns, attr->uid);
-	stat->gid = make_kgid(&init_user_ns, attr->gid);
+	stat->uid = make_kuid(fc->user_ns, attr->uid);
+	stat->gid = make_kgid(fc->user_ns, attr->gid);
 	stat->rdev = inode->i_rdev;
 	stat->atime.tv_sec = attr->atime;
 	stat->atime.tv_nsec = attr->atimensec;
@@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
 	return true;
 }
 
-static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
-			   bool trust_local_cmtime)
+static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
+			   struct fuse_setattr_in *arg, bool trust_local_cmtime)
 {
 	unsigned ivalid = iattr->ia_valid;
 
 	if (ivalid & ATTR_MODE)
 		arg->valid |= FATTR_MODE,   arg->mode = iattr->ia_mode;
 	if (ivalid & ATTR_UID)
-		arg->valid |= FATTR_UID,    arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
+		arg->valid |= FATTR_UID,    arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
 	if (ivalid & ATTR_GID)
-		arg->valid |= FATTR_GID,    arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
+		arg->valid |= FATTR_GID,    arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
 	if (ivalid & ATTR_SIZE)
 		arg->valid |= FATTR_SIZE,   arg->size = iattr->ia_size;
 	if (ivalid & ATTR_ATIME) {
@@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
 
 	memset(&inarg, 0, sizeof(inarg));
 	memset(&outarg, 0, sizeof(outarg));
-	iattr_to_fattr(attr, &inarg, trust_local_cmtime);
+	iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
 	if (file) {
 		struct fuse_file *ff = file->private_data;
 		inarg.valid |= FATTR_FH;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index d5773ca6..364e65c8 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -26,6 +26,7 @@
 #include <linux/xattr.h>
 #include <linux/pid_namespace.h>
 #include <linux/refcount.h>
+#include <linux/user_namespace.h>
 
 /** Max number of pages that can be used in a single read request */
 #define FUSE_MAX_PAGES_PER_REQ 32
@@ -466,6 +467,9 @@ struct fuse_conn {
 	/** The pid namespace for this mount */
 	struct pid_namespace *pid_ns;
 
+	/** The user namespace for this mount */
+	struct user_namespace *user_ns;
+
 	/** Maximum read size */
 	unsigned max_read;
 
@@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
 /**
  * Initialize fuse_conn
  */
-void fuse_conn_init(struct fuse_conn *fc);
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
 
 /**
  * Release reference to fuse_conn
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 2f504d61..7f6b2e55 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 	inode->i_ino     = fuse_squash_ino(attr->ino);
 	inode->i_mode    = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
 	set_nlink(inode, attr->nlink);
-	inode->i_uid     = make_kuid(&init_user_ns, attr->uid);
-	inode->i_gid     = make_kgid(&init_user_ns, attr->gid);
+	inode->i_uid     = make_kuid(fc->user_ns, attr->uid);
+	inode->i_gid     = make_kgid(fc->user_ns, attr->gid);
 	inode->i_blocks  = attr->blocks;
 	inode->i_atime.tv_sec   = attr->atime;
 	inode->i_atime.tv_nsec  = attr->atimensec;
@@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
 	return err;
 }
 
-static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
+static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
+			  struct user_namespace *user_ns)
 {
 	char *p;
 	memset(d, 0, sizeof(struct fuse_mount_data));
@@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
 		case OPT_USER_ID:
 			if (fuse_match_uint(&args[0], &uv))
 				return 0;
-			d->user_id = make_kuid(current_user_ns(), uv);
+			d->user_id = make_kuid(user_ns, uv);
 			if (!uid_valid(d->user_id))
 				return 0;
 			d->user_id_present = 1;
@@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
 		case OPT_GROUP_ID:
 			if (fuse_match_uint(&args[0], &uv))
 				return 0;
-			d->group_id = make_kgid(current_user_ns(), uv);
+			d->group_id = make_kgid(user_ns, uv);
 			if (!gid_valid(d->group_id))
 				return 0;
 			d->group_id_present = 1;
@@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
 	struct super_block *sb = root->d_sb;
 	struct fuse_conn *fc = get_fuse_conn_super(sb);
 
-	seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
-	seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
+	seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
+	seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
 	if (fc->default_permissions)
 		seq_puts(m, ",default_permissions");
 	if (fc->allow_other)
@@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
 	fpq->connected = 1;
 }
 
-void fuse_conn_init(struct fuse_conn *fc)
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
 {
 	memset(fc, 0, sizeof(*fc));
 	spin_lock_init(&fc->lock);
@@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc)
 	fc->attr_version = 1;
 	get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
 	fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
+	fc->user_ns = get_user_ns(user_ns);
 }
 EXPORT_SYMBOL_GPL(fuse_conn_init);
 
@@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc)
 		if (fc->destroy_req)
 			fuse_request_free(fc->destroy_req);
 		put_pid_ns(fc->pid_ns);
+		put_user_ns(fc->user_ns);
 		fc->release(fc);
 	}
 }
@@ -1061,7 +1064,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 
 	sb->s_flags &= ~(MS_NOSEC | SB_I_VERSION);
 
-	if (!parse_fuse_opt(data, &d, is_bdev))
+	if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
 		goto err;
 
 	if (is_bdev) {
@@ -1086,8 +1089,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	if (!file)
 		goto err;
 
-	if ((file->f_op != &fuse_dev_operations) ||
-	    (file->f_cred->user_ns != &init_user_ns))
+	/*
+	 * Require mount to happen from the same user namespace which
+	 * opened /dev/fuse to prevent potential attacks.
+	 */
+	if (file->f_op != &fuse_dev_operations ||
+	    file->f_cred->user_ns != sb->s_user_ns)
 		goto err_fput;
 
 	fc = kmalloc(sizeof(*fc), GFP_KERNEL);
@@ -1095,7 +1102,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	if (!fc)
 		goto err_fput;
 
-	fuse_conn_init(fc);
+	fuse_conn_init(fc, sb->s_user_ns);
 	fc->release = fuse_free_conn;
 
 	fud = fuse_dev_alloc(fc);
-- 
2.13.6

* [PATCH 09/11] fuse: Restrict allow_other to the superblock's namespace or a descendant
       [not found] <cover.1512741134.git.dongsu@kinvolk.io>
                   ` (5 preceding siblings ...)
  2017-12-22 14:32 ` [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns Dongsu Park
@ 2017-12-22 14:32 ` Dongsu Park
  2017-12-23  3:50   ` Serge E. Hallyn
  2018-02-19 23:16   ` Eric W. Biederman
  2017-12-22 14:32 ` [PATCH 10/11] fuse: Allow user namespace mounts Dongsu Park
  2018-02-21 20:24 ` [PATCH v6 0/6] fuse: mounts from non-init user namespaces Eric W. Biederman
  8 siblings, 2 replies; 89+ messages in thread
From: Dongsu Park @ 2017-12-22 14:32 UTC
  To: linux-kernel
  Cc: containers, Alban Crequy, Eric W . Biederman, Miklos Szeredi,
	Seth Forshee, Sargun Dhillon, Dongsu Park, linux-fsdevel,
	Serge Hallyn

From: Seth Forshee <seth.forshee@canonical.com>

Unprivileged users are normally restricted from mounting with the
allow_other option by system policy, but this could be bypassed
for a mount done with user namespace root permissions. In such
cases allow_other should not allow users outside the userns
to access the mount, as doing so would give the unprivileged user
the ability to manipulate processes it would otherwise be unable
to manipulate. Restrict allow_other to apply to users in the same
userns used at mount or a descendant of that namespace. Also
export current_in_userns() for use by fuse when built as a
module.

Patch v4 is available: https://patchwork.kernel.org/patch/8944671/

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
---
 fs/fuse/dir.c           | 2 +-
 kernel/user_namespace.c | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index ad1cfac1..d41559a0 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
 	const struct cred *cred;
 
 	if (fc->allow_other)
-		return 1;
+		return current_in_userns(fc->user_ns);
 
 	cred = current_cred();
 	if (uid_eq(cred->euid, fc->user_id) &&
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 246d4d4c..492c255e 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace *target_ns)
 {
 	return in_userns(target_ns, current_user_ns());
 }
+EXPORT_SYMBOL(current_in_userns);
 
 static inline struct user_namespace *to_user_ns(struct ns_common *ns)
 {
-- 
2.13.6
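A minimal userspace sketch of the namespace-ancestry test that the allow_other change relies on. It mirrors the level-based walk that in_userns() performs (shown being called by current_in_userns() in the diff above): climb from the caller's namespace until reaching the connection namespace's nesting level, then compare. `struct model_userns`, `model_in_userns()` and `model_allow_other()` are hypothetical stand-ins, not kernel types or functions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct user_namespace: only the fields
 * needed to model ancestry are kept. */
struct model_userns {
	const struct model_userns *parent; /* NULL for the initial namespace */
	int level;                         /* 0 for the initial namespace */
};

/* Model of in_userns(): is @child equal to @ancestor or nested below it?
 * Walk up from @child until we reach @ancestor's level, then compare. */
static bool model_in_userns(const struct model_userns *ancestor,
			    const struct model_userns *child)
{
	const struct model_userns *ns = child;

	while (ns->level > ancestor->level)
		ns = ns->parent;
	return ns == ancestor;
}

/* Model of the patched allow_other branch of fuse_allow_current_process():
 * access is granted only to callers whose namespace is the connection's
 * namespace or a descendant of it. */
static bool model_allow_other(const struct model_userns *conn_ns,
			      const struct model_userns *caller_ns)
{
	return model_in_userns(conn_ns, caller_ns);
}
```

Under this model, a caller in a sibling namespace, or in the initial namespace, is no longer admitted merely because allow_other was set on a mount owned by a child namespace.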

* [PATCH 10/11] fuse: Allow user namespace mounts
       [not found] <cover.1512741134.git.dongsu@kinvolk.io>
                   ` (6 preceding siblings ...)
  2017-12-22 14:32 ` [PATCH 09/11] fuse: Restrict allow_other to the superblock's namespace or a descendant Dongsu Park
@ 2017-12-22 14:32 ` Dongsu Park
  2017-12-23  3:51   ` Serge E. Hallyn
  2018-02-14 13:44   ` Miklos Szeredi
  2018-02-21 20:24 ` [PATCH v6 0/6] fuse: mounts from non-init user namespaces Eric W. Biederman
  8 siblings, 2 replies; 89+ messages in thread
From: Dongsu Park @ 2017-12-22 14:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: containers, Alban Crequy, Eric W . Biederman, Miklos Szeredi,
	Seth Forshee, Sargun Dhillon, Dongsu Park, linux-fsdevel

From: Seth Forshee <seth.forshee@canonical.com>

To allow fuse to be mounted from non-init user namespaces, the
FS_USERNS_MOUNT flag must be set in fs_flags of both the fuse and
fuseblk filesystem types.

Patch v4 is available: https://patchwork.kernel.org/patch/8944681/

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
[dongsu: add a simple commit message]
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
---
 fs/fuse/inode.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 7f6b2e55..8c98edee 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1212,7 +1212,7 @@ static void fuse_kill_sb_anon(struct super_block *sb)
 static struct file_system_type fuse_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "fuse",
-	.fs_flags	= FS_HAS_SUBTYPE,
+	.fs_flags	= FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
 	.mount		= fuse_mount,
 	.kill_sb	= fuse_kill_sb_anon,
 };
@@ -1244,7 +1244,7 @@ static struct file_system_type fuseblk_fs_type = {
 	.name		= "fuseblk",
 	.mount		= fuse_mount_blk,
 	.kill_sb	= fuse_kill_sb_blk,
-	.fs_flags	= FS_REQUIRES_DEV | FS_HAS_SUBTYPE,
+	.fs_flags	= FS_REQUIRES_DEV | FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
 };
 MODULE_ALIAS_FS("fuseblk");
 
-- 
2.13.6
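The flag matters because the VFS consults it when deciding whether a mount may proceed from outside the initial user namespace. The sketch below is a simplified userspace model, loosely following the checks in fs/namespace.c: a mount from a non-initial user namespace is refused unless the filesystem type sets FS_USERNS_MOUNT, and the caller always needs CAP_SYS_ADMIN over its own namespace. `model_may_mount()` and `struct model_fs_type` are hypothetical, and the flag values are illustrative (only their distinctness matters here).

```c
#include <stdbool.h>

/* Illustrative flag bits modelled on the fs_flags used in the patch. */
#define MODEL_FS_REQUIRES_DEV  0x1
#define MODEL_FS_HAS_SUBTYPE   0x4
#define MODEL_FS_USERNS_MOUNT  0x8

struct model_fs_type {
	const char *name;
	int fs_flags;
};

/* Simplified model: without CAP_SYS_ADMIN over its own namespace the
 * caller cannot mount at all; from a non-initial namespace the filesystem
 * type must additionally opt in with FS_USERNS_MOUNT. */
static bool model_may_mount(const struct model_fs_type *type,
			    bool caller_in_init_userns,
			    bool caller_has_sys_admin_in_own_ns)
{
	if (!caller_has_sys_admin_in_own_ns)
		return false;
	if (!caller_in_init_userns && !(type->fs_flags & MODEL_FS_USERNS_MOUNT))
		return false;
	return true;
}
```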

* Re: [PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev()
  2017-12-22 14:32 ` [PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev() Dongsu Park
@ 2017-12-22 18:59   ` Coly Li
  2017-12-23 12:00     ` Dongsu Park
  2017-12-23  3:03   ` Serge E. Hallyn
  1 sibling, 1 reply; 89+ messages in thread
From: Coly Li @ 2017-12-22 18:59 UTC (permalink / raw)
  To: Dongsu Park, linux-kernel
  Cc: containers, Alban Crequy, Eric W . Biederman, Miklos Szeredi,
	Seth Forshee, Sargun Dhillon, dm-devel, linux-bcache,
	linux-fsdevel, linux-mtd, Alexander Viro, Jan Kara, Serge Hallyn

On 22/12/2017 10:32 PM, Dongsu Park wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
> 
> When looking up a block device by path no permission check is
> done to verify that the user has access to the block device inode
> at the specified path. In some cases it may be necessary to
> check permissions towards the inode, such as allowing
> unprivileged users to mount block devices in user namespaces.
> 
> Add an argument to lookup_bdev() to optionally perform this
> permission check. A value of 0 skips the permission check and
> behaves the same as before. A non-zero value specifies the mask
> of access rights required towards the inode at the specified
> path. The check is always skipped if the user has CAP_SYS_ADMIN.
> 
> All callers of lookup_bdev() currently pass a mask of 0, so this
> patch results in no functional change. Subsequent patches will
> add permission checks where appropriate.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8943601/
> 
> Cc: dm-devel@redhat.com
> Cc: linux-bcache@vger.kernel.org
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-mtd@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Jan Kara <jack@suse.com>
> Cc: Serge Hallyn <serge@hallyn.com>
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>

Hi Dongsu,

Could you please use a macro like NO_PERMISSION_CHECK to replace the
hard-coded 0? That way readers like me would not need to look up what
0 means in the new lookup_bdev().

Thanks.

Coly Li

> ---
>  drivers/md/bcache/super.c |  2 +-
>  drivers/md/dm-table.c     |  2 +-
>  drivers/mtd/mtdsuper.c    |  2 +-
>  fs/block_dev.c            | 13 ++++++++++---
>  fs/quota/quota.c          |  2 +-
>  include/linux/fs.h        |  2 +-
>  6 files changed, 15 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index b4d28928..acc9d56c 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1967,7 +1967,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>  				  sb);
>  	if (IS_ERR(bdev)) {
>  		if (bdev == ERR_PTR(-EBUSY)) {
> -			bdev = lookup_bdev(strim(path));
> +			bdev = lookup_bdev(strim(path), 0);
>  			mutex_lock(&bch_register_lock);
>  			if (!IS_ERR(bdev) && bch_is_open(bdev))
>  				err = "device already registered";
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 88130b5d..bca5eaf4 100644
[snip]


-- 
Coly Li
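A minimal userspace sketch of what the suggestion could look like, combined with the mask semantics from the commit message (a zero mask skips the check, CAP_SYS_ADMIN bypasses it, otherwise every requested right must be granted). `LOOKUP_BDEV_NO_PERM_CHECK` and `model_lookup_bdev_allowed()` are hypothetical names; the MODEL_MAY_* bits are illustrative stand-ins for the kernel's MAY_* flags, and the inode permission check is reduced to a bitmask of granted rights.

```c
#include <stdbool.h>

/* Illustrative access-mask bits, modelled on the kernel's MAY_* flags. */
#define MODEL_MAY_EXEC  0x1
#define MODEL_MAY_WRITE 0x2
#define MODEL_MAY_READ  0x4

/* Hypothetical named constant replacing the bare 0 argument. */
#define LOOKUP_BDEV_NO_PERM_CHECK 0

/* Model of the new check in lookup_bdev(). @granted stands in for the
 * rights that __inode_permission() would allow on the device inode. */
static bool model_lookup_bdev_allowed(int mask, bool cap_sys_admin, int granted)
{
	if (mask == LOOKUP_BDEV_NO_PERM_CHECK || cap_sys_admin)
		return true;
	return (mask & ~granted) == 0;
}
```

With such a constant, a call site like the bcache one would read `lookup_bdev(strim(path), LOOKUP_BDEV_NO_PERM_CHECK)`, which documents itself without a trip to fs/block_dev.c.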

* Re: [PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev()
  2017-12-22 14:32 ` [PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev() Dongsu Park
  2017-12-22 18:59   ` Coly Li
@ 2017-12-23  3:03   ` Serge E. Hallyn
  1 sibling, 0 replies; 89+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  3:03 UTC (permalink / raw)
  To: Dongsu Park
  Cc: linux-kernel, containers, Alban Crequy, Eric W . Biederman,
	Miklos Szeredi, Seth Forshee, Sargun Dhillon, dm-devel,
	linux-bcache, linux-fsdevel, linux-mtd, Alexander Viro, Jan Kara,
	Serge Hallyn

On Fri, Dec 22, 2017 at 03:32:25PM +0100, Dongsu Park wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
> 
> When looking up a block device by path no permission check is
> done to verify that the user has access to the block device inode
> at the specified path. In some cases it may be necessary to
> check permissions towards the inode, such as allowing
> unprivileged users to mount block devices in user namespaces.
> 
> Add an argument to lookup_bdev() to optionally perform this
> permission check. A value of 0 skips the permission check and
> behaves the same as before. A non-zero value specifies the mask
> of access rights required towards the inode at the specified
> path. The check is always skipped if the user has CAP_SYS_ADMIN.
> 
> All callers of lookup_bdev() currently pass a mask of 0, so this
> patch results in no functional change. Subsequent patches will
> add permission checks where appropriate.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8943601/
> 
> Cc: dm-devel@redhat.com
> Cc: linux-bcache@vger.kernel.org
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-mtd@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Jan Kara <jack@suse.com>
> Cc: Serge Hallyn <serge@hallyn.com>

Acked-by: Serge Hallyn <serge@hallyn.com>

> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
> ---
>  drivers/md/bcache/super.c |  2 +-
>  drivers/md/dm-table.c     |  2 +-
>  drivers/mtd/mtdsuper.c    |  2 +-
>  fs/block_dev.c            | 13 ++++++++++---
>  fs/quota/quota.c          |  2 +-
>  include/linux/fs.h        |  2 +-
>  6 files changed, 15 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index b4d28928..acc9d56c 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1967,7 +1967,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>  				  sb);
>  	if (IS_ERR(bdev)) {
>  		if (bdev == ERR_PTR(-EBUSY)) {
> -			bdev = lookup_bdev(strim(path));
> +			bdev = lookup_bdev(strim(path), 0);
>  			mutex_lock(&bch_register_lock);
>  			if (!IS_ERR(bdev) && bch_is_open(bdev))
>  				err = "device already registered";
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 88130b5d..bca5eaf4 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -410,7 +410,7 @@ dev_t dm_get_dev_t(const char *path)
>  	dev_t dev;
>  	struct block_device *bdev;
>  
> -	bdev = lookup_bdev(path);
> +	bdev = lookup_bdev(path, 0);
>  	if (IS_ERR(bdev))
>  		dev = name_to_dev_t(path);
>  	else {
> diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
> index e43fea89..4a4d40c0 100644
> --- a/drivers/mtd/mtdsuper.c
> +++ b/drivers/mtd/mtdsuper.c
> @@ -180,7 +180,7 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
>  	/* try the old way - the hack where we allowed users to mount
>  	 * /dev/mtdblock$(n) but didn't actually _use_ the blockdev
>  	 */
> -	bdev = lookup_bdev(dev_name);
> +	bdev = lookup_bdev(dev_name, 0);
>  	if (IS_ERR(bdev)) {
>  		ret = PTR_ERR(bdev);
>  		pr_debug("MTDSB: lookup_bdev() returned %d\n", ret);
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 4a181fcb..5ca06095 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1662,7 +1662,7 @@ struct block_device *blkdev_get_by_path(const char *path, fmode_t mode,
>  	struct block_device *bdev;
>  	int err;
>  
> -	bdev = lookup_bdev(path);
> +	bdev = lookup_bdev(path, 0);
>  	if (IS_ERR(bdev))
>  		return bdev;
>  
> @@ -2052,12 +2052,14 @@ EXPORT_SYMBOL(ioctl_by_bdev);
>  /**
>   * lookup_bdev  - lookup a struct block_device by name
>   * @pathname:	special file representing the block device
> + * @mask:	rights to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
>   *
>   * Get a reference to the blockdevice at @pathname in the current
>   * namespace if possible and return it.  Return ERR_PTR(error)
> - * otherwise.
> + * otherwise.  If @mask is non-zero, check for access rights to the
> + * inode at @pathname.
>   */
> -struct block_device *lookup_bdev(const char *pathname)
> +struct block_device *lookup_bdev(const char *pathname, int mask)
>  {
>  	struct block_device *bdev;
>  	struct inode *inode;
> @@ -2072,6 +2074,11 @@ struct block_device *lookup_bdev(const char *pathname)
>  		return ERR_PTR(error);
>  
>  	inode = d_backing_inode(path.dentry);
> +	if (mask != 0 && !capable(CAP_SYS_ADMIN)) {
> +		error = __inode_permission(inode, mask);
> +		if (error)
> +			goto fail;
> +	}
>  	error = -ENOTBLK;
>  	if (!S_ISBLK(inode->i_mode))
>  		goto fail;
> diff --git a/fs/quota/quota.c b/fs/quota/quota.c
> index 43612e2a..e5d47955 100644
> --- a/fs/quota/quota.c
> +++ b/fs/quota/quota.c
> @@ -807,7 +807,7 @@ static struct super_block *quotactl_block(const char __user *special, int cmd)
>  
>  	if (IS_ERR(tmp))
>  		return ERR_CAST(tmp);
> -	bdev = lookup_bdev(tmp->name);
> +	bdev = lookup_bdev(tmp->name, 0);
>  	putname(tmp);
>  	if (IS_ERR(bdev))
>  		return ERR_CAST(bdev);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 2995a271..fce19c49 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2551,7 +2551,7 @@ static inline void unregister_chrdev(unsigned int major, const char *name)
>  #define BLKDEV_MAJOR_MAX	512
>  extern const char *__bdevname(dev_t, char *buffer);
>  extern const char *bdevname(struct block_device *bdev, char *buffer);
> -extern struct block_device *lookup_bdev(const char *);
> +extern struct block_device *lookup_bdev(const char *, int mask);
>  extern void blkdev_show(struct seq_file *,off_t);
>  
>  #else
> -- 
> 2.13.6

* Re: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes
  2017-12-22 14:32 ` [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes Dongsu Park
@ 2017-12-23  3:17   ` Serge E. Hallyn
  2018-01-05 19:24   ` Luis R. Rodriguez
  2018-02-13 13:18   ` Miklos Szeredi
  2 siblings, 0 replies; 89+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  3:17 UTC (permalink / raw)
  To: Dongsu Park
  Cc: linux-kernel, containers, Alban Crequy, Eric W . Biederman,
	Miklos Szeredi, Seth Forshee, Sargun Dhillon, linux-fsdevel,
	Alexander Viro, Luis R. Rodriguez, Kees Cook

On Fri, Dec 22, 2017 at 03:32:27PM +0100, Dongsu Park wrote:
> From: Eric W. Biederman <ebiederm@xmission.com>
> 
> Allow users with CAP_SYS_CHOWN over the superblock of a filesystem to

Note: it is CAP_CHOWN, not CAP_SYS_CHOWN.

> chown files.  Ordinarily the capable_wrt_inode_uidgid check is
> sufficient to allow access to files but when the underlying filesystem
> has uids or gids that don't map to the current user namespace it is
> not enough, so the chown permission checks need to be extended to
> allow this case.
> 
> Calling chown on filesystem nodes whose uid or gid don't map is
> necessary if those nodes are going to be modified, as writing back
> inodes which contain uids or gids that don't map is likely to cause
> filesystem corruption of the uid or gid fields.
> 
> Once chown has been called, the existing capable_wrt_inode_uidgid
> checks are sufficient to allow the owner of a superblock to do anything
> the global root user can do with an appropriate set of capabilities.
> 
> For the proc filesystem this relaxation of permissions is not safe, as
> some files are owned by users (particularly GLOBAL_ROOT_UID) outside
> of the control of the mounter of the proc and that would be unsafe to
> grant chown access to.  So update setattr on proc to disallow changing
> files whose uids or gids are outside of proc's s_user_ns.
> 
> The original version of this patch was written by: Seth Forshee.  I
> have rewritten and rethought this patch enough so it's really not the
> same thing (certainly it needs a different description), but he
> deserves credit for getting out there and getting the conversation
> started, and finding the potential gotchas and putting up with my
> semi-paranoid feedback.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944611/
> 
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
> Cc: Kees Cook <keescook@chromium.org>
> Inspired-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> [saf: Resolve conflicts caused by s/inode_change_ok/setattr_prepare/]
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>

Reviewed-by: Serge Hallyn <serge@hallyn.com>

> ---
>  fs/attr.c             | 34 ++++++++++++++++++++++++++--------
>  fs/proc/base.c        |  7 +++++++
>  fs/proc/generic.c     |  7 +++++++
>  fs/proc/proc_sysctl.c |  7 +++++++
>  4 files changed, 47 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/attr.c b/fs/attr.c
> index 12ffdb6f..bf8e94f3 100644
> --- a/fs/attr.c
> +++ b/fs/attr.c
> @@ -18,6 +18,30 @@
>  #include <linux/evm.h>
>  #include <linux/ima.h>
>  
> +static bool chown_ok(const struct inode *inode, kuid_t uid)
> +{
> +	if (uid_eq(current_fsuid(), inode->i_uid) &&
> +	    uid_eq(uid, inode->i_uid))
> +		return true;
> +	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> +		return true;
> +	if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
> +		return true;
> +	return false;
> +}
> +
> +static bool chgrp_ok(const struct inode *inode, kgid_t gid)
> +{
> +	if (uid_eq(current_fsuid(), inode->i_uid) &&
> +	    (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
> +		return true;
> +	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> +		return true;
> +	if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
> +		return true;
> +	return false;
> +}
> +
>  /**
>   * setattr_prepare - check if attribute changes to a dentry are allowed
>   * @dentry:	dentry to check
> @@ -52,17 +76,11 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
>  		goto kill_priv;
>  
>  	/* Make sure a caller can chown. */
> -	if ((ia_valid & ATTR_UID) &&
> -	    (!uid_eq(current_fsuid(), inode->i_uid) ||
> -	     !uid_eq(attr->ia_uid, inode->i_uid)) &&
> -	    !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> +	if ((ia_valid & ATTR_UID) && !chown_ok(inode, attr->ia_uid))
>  		return -EPERM;
>  
>  	/* Make sure caller can chgrp. */
> -	if ((ia_valid & ATTR_GID) &&
> -	    (!uid_eq(current_fsuid(), inode->i_uid) ||
> -	    (!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) &&
> -	    !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> +	if ((ia_valid & ATTR_GID) && !chgrp_ok(inode, attr->ia_gid))
>  		return -EPERM;
>  
>  	/* Make sure a caller can chmod. */
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 31934cb9..9d50ec92 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -665,10 +665,17 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
>  {
>  	int error;
>  	struct inode *inode = d_inode(dentry);
> +	struct user_namespace *s_user_ns;
>  
>  	if (attr->ia_valid & ATTR_MODE)
>  		return -EPERM;
>  
> +	/* Don't let anyone mess with weird proc files */
> +	s_user_ns = inode->i_sb->s_user_ns;
> +	if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
> +	    !kgid_has_mapping(s_user_ns, inode->i_gid))
> +		return -EPERM;
> +
>  	error = setattr_prepare(dentry, attr);
>  	if (error)
>  		return error;
> diff --git a/fs/proc/generic.c b/fs/proc/generic.c
> index 793a6757..527d46c8 100644
> --- a/fs/proc/generic.c
> +++ b/fs/proc/generic.c
> @@ -106,8 +106,15 @@ static int proc_notify_change(struct dentry *dentry, struct iattr *iattr)
>  {
>  	struct inode *inode = d_inode(dentry);
>  	struct proc_dir_entry *de = PDE(inode);
> +	struct user_namespace *s_user_ns;
>  	int error;
>  
> +	/* Don't let anyone mess with weird proc files */
> +	s_user_ns = inode->i_sb->s_user_ns;
> +	if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
> +	    !kgid_has_mapping(s_user_ns, inode->i_gid))
> +		return -EPERM;
> +
>  	error = setattr_prepare(dentry, iattr);
>  	if (error)
>  		return error;
> diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> index c5cbbdff..0f9562d1 100644
> --- a/fs/proc/proc_sysctl.c
> +++ b/fs/proc/proc_sysctl.c
> @@ -802,11 +802,18 @@ static int proc_sys_permission(struct inode *inode, int mask)
>  static int proc_sys_setattr(struct dentry *dentry, struct iattr *attr)
>  {
>  	struct inode *inode = d_inode(dentry);
> +	struct user_namespace *s_user_ns;
>  	int error;
>  
>  	if (attr->ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID))
>  		return -EPERM;
>  
> +	/* Don't let anyone mess with weird proc files */
> +	s_user_ns = inode->i_sb->s_user_ns;
> +	if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
> +	    !kgid_has_mapping(s_user_ns, inode->i_gid))
> +		return -EPERM;
> +
>  	error = setattr_prepare(dentry, attr);
>  	if (error)
>  		return error;
> -- 
> 2.13.6
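As a reading aid, here is a minimal userspace model of the two helpers the patch introduces. `struct model_chown_ctx` and its fields are hypothetical: each boolean stands for the result of the kernel check named in its comment, so the sketch only captures how the conditions combine (the positive form of the old open-coded tests in setattr_prepare(), plus the new s_user_ns case).

```c
#include <stdbool.h>

/* Hypothetical, flattened view of the facts chown_ok()/chgrp_ok() consult.
 * Capability fields are precomputed results, not real capability lookups. */
struct model_chown_ctx {
	unsigned fsuid;             /* current_fsuid() */
	unsigned inode_uid;         /* inode->i_uid */
	unsigned inode_gid;         /* inode->i_gid */
	bool cap_wrt_inode_uidgid;  /* capable_wrt_inode_uidgid(inode, CAP_CHOWN) */
	bool cap_in_sb_userns;      /* ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN) */
};

static bool model_chown_ok(const struct model_chown_ctx *c, unsigned uid)
{
	/* The owner may only "chown" to the uid the file already has. */
	if (c->fsuid == c->inode_uid && uid == c->inode_uid)
		return true;
	if (c->cap_wrt_inode_uidgid)
		return true;
	/* New in this patch: CAP_CHOWN over the superblock's namespace. */
	return c->cap_in_sb_userns;
}

static bool model_chgrp_ok(const struct model_chown_ctx *c, unsigned gid,
			   bool caller_in_group)
{
	/* The owner may switch to a group it is in, or keep the current one. */
	if (c->fsuid == c->inode_uid &&
	    (caller_in_group || gid == c->inode_gid))
		return true;
	if (c->cap_wrt_inode_uidgid)
		return true;
	return c->cap_in_sb_userns;
}
```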

* Re: [PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root
  2017-12-22 14:32 ` [PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root Dongsu Park
@ 2017-12-23  3:26   ` Serge E. Hallyn
  2017-12-23 12:38     ` Dongsu Park
  0 siblings, 1 reply; 89+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  3:26 UTC (permalink / raw)
  To: Dongsu Park
  Cc: linux-kernel, containers, Alban Crequy, Eric W . Biederman,
	Miklos Szeredi, Seth Forshee, Sargun Dhillon, linux-fsdevel,
	Alexander Viro, Serge Hallyn

On Fri, Dec 22, 2017 at 03:32:28PM +0100, Dongsu Park wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
> 
> Expand the check in should_remove_suid() to keep privileges for

I realize this description came from Seth, but reading it now,
'Expand' seems wrong.  Expanding a check brings to my mind making
it stricter, not looser.  How about 'Relax the check' ?

> CAP_FSETID in s_user_ns rather than init_user_ns.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944621/
> 
> --EWB Changed from ns_capable(sb->s_user_ns, ) to capable_wrt_inode_uidgid

Why exactly?

This is wrong, because capable_wrt_inode_uidgid() does a check
against current_user_ns(), not against inode->i_sb->s_user_ns.

> 
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Serge Hallyn <serge@hallyn.com>
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
> ---
>  fs/inode.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index fd401028..6459a437 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1749,7 +1749,8 @@ EXPORT_SYMBOL(touch_atime);
>   */
>  int should_remove_suid(struct dentry *dentry)
>  {
> -	umode_t mode = d_inode(dentry)->i_mode;
> +	struct inode *inode = d_inode(dentry);
> +	umode_t mode = inode->i_mode;
>  	int kill = 0;
>  
>  	/* suid always must be killed */
> @@ -1763,7 +1764,8 @@ int should_remove_suid(struct dentry *dentry)
>  	if (unlikely((mode & S_ISGID) && (mode & S_IXGRP)))
>  		kill |= ATTR_KILL_SGID;
>  
> -	if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode)))
> +	if (unlikely(kill && !capable_wrt_inode_uidgid(inode, CAP_FSETID) &&
> +		     S_ISREG(mode)))
>  		return kill;
>  
>  	return 0;
> -- 
> 2.13.6
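A minimal userspace model of should_remove_suid() as patched. The sketch deliberately takes the CAP_FSETID result as a parameter, because which namespace that check is evaluated against (current_user_ns() via capable_wrt_inode_uidgid(), or inode->i_sb->s_user_ns via ns_capable()) is exactly the point raised in the review above. The MODEL_S_* constants use the usual Linux mode-bit values; the MODEL_KILL_* values are illustrative stand-ins for ATTR_KILL_SUID/ATTR_KILL_SGID, and all names are hypothetical.

```c
#include <stdbool.h>

/* Mode bits with their usual Linux values; prefixed so the sketch does not
 * depend on which feature macros <sys/stat.h> honours. */
#define MODEL_S_IFMT  0170000u
#define MODEL_S_IFREG 0100000u
#define MODEL_S_IFDIR 0040000u
#define MODEL_S_ISUID 0004000u
#define MODEL_S_ISGID 0002000u
#define MODEL_S_IXGRP 0000010u

/* Illustrative stand-ins for ATTR_KILL_SUID / ATTR_KILL_SGID. */
#define MODEL_KILL_SUID 0x1
#define MODEL_KILL_SGID 0x2

/* Model of should_remove_suid(); @has_fsetid is the CAP_FSETID outcome. */
static int model_should_remove_suid(unsigned int mode, bool has_fsetid)
{
	int kill = 0;

	/* setuid is always dropped when the file is modified. */
	if (mode & MODEL_S_ISUID)
		kill = MODEL_KILL_SUID;
	/* setgid without group-exec marks mandatory locking; leave it alone. */
	if ((mode & MODEL_S_ISGID) && (mode & MODEL_S_IXGRP))
		kill |= MODEL_KILL_SGID;

	/* Only regular files are affected, and CAP_FSETID exempts the caller. */
	if (kill && !has_fsetid && (mode & MODEL_S_IFMT) == MODEL_S_IFREG)
		return kill;
	return 0;
}
```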

* Re: [PATCH 05/11] fs: Allow superblock owner to access do_remount_sb()
  2017-12-22 14:32 ` [PATCH 05/11] fs: Allow superblock owner to access do_remount_sb() Dongsu Park
@ 2017-12-23  3:30   ` Serge E. Hallyn
  0 siblings, 0 replies; 89+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  3:30 UTC (permalink / raw)
  To: Dongsu Park
  Cc: linux-kernel, containers, Alban Crequy, Eric W . Biederman,
	Miklos Szeredi, Seth Forshee, Sargun Dhillon, linux-fsdevel,
	Alexander Viro, Serge Hallyn

On Fri, Dec 22, 2017 at 03:32:29PM +0100, Dongsu Park wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
> 
> Superblock level remounts are currently restricted to global
> CAP_SYS_ADMIN, as is the path for changing the root mount to
> read only on umount. Loosen both of these permission checks to
> also allow CAP_SYS_ADMIN in any namespace which is privileged
> towards the userns which originally mounted the filesystem.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944631/
> 
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Serge Hallyn <serge@hallyn.com>

Acked-by: Serge Hallyn <serge@hallyn.com>

> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
> ---
>  fs/namespace.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index e158ec6b..830040d7 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1589,7 +1589,7 @@ static int do_umount(struct mount *mnt, int flags)
>  		 * Special case for "unmounting" root ...
>  		 * we just try to remount it readonly.
>  		 */
> -		if (!capable(CAP_SYS_ADMIN))
> +		if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
>  			return -EPERM;
>  		down_write(&sb->s_umount);
>  		if (!sb_rdonly(sb))
> @@ -2327,7 +2327,7 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
>  	down_write(&sb->s_umount);
>  	if (ms_flags & MS_BIND)
>  		err = change_mount_flags(path->mnt, ms_flags);
> -	else if (!capable(CAP_SYS_ADMIN))
> +	else if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
>  		err = -EPERM;
>  	else
>  		err = do_remount_sb(sb, sb_flags, data, 0);
> -- 
> 2.13.6
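To make "privileged towards the userns which originally mounted the filesystem" concrete, here is a minimal userspace model of how a check like ns_capable(sb->s_user_ns, CAP_SYS_ADMIN) resolves, assuming the walk performed by cap_capable() in security/commoncap.c: the capability counts if the caller holds it in the target namespace or in an ancestor namespace the caller lives in, or if the caller's euid owns the child of its own namespace on the path to the target. All types and names are hypothetical, and securebits, LSM hooks and similar details are omitted.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct user_namespace. */
struct model_ns {
	const struct model_ns *parent; /* NULL for the initial namespace */
	int level;                     /* nesting depth, 0 for the initial one */
	unsigned owner;                /* global euid of the namespace creator */
};

/* Hypothetical stand-in for the relevant parts of struct cred. */
struct model_cred {
	const struct model_ns *user_ns; /* namespace the task lives in */
	unsigned euid;                  /* global euid */
	bool cap_sys_admin;             /* CAP_SYS_ADMIN in its own namespace */
};

/* Model of ns_capable(target, CAP_SYS_ADMIN). */
static bool model_ns_capable(const struct model_cred *cred,
			     const struct model_ns *target)
{
	const struct model_ns *ns = target;

	for (;;) {
		/* Caller lives in this namespace: its own capability set decides. */
		if (ns == cred->user_ns)
			return cred->cap_sys_admin;
		/* Reached the caller's depth without meeting it: not an ancestor. */
		if (ns->level <= cred->user_ns->level)
			return false;
		/* The owner of a child of the caller's namespace has full rights. */
		if (ns->parent == cred->user_ns && ns->owner == cred->euid)
			return true;
		ns = ns->parent;
	}
}
```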

* Re: [PATCH 07/11] fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
  2017-12-22 14:32 ` [PATCH 07/11] fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems Dongsu Park
@ 2017-12-23  3:39   ` Serge E. Hallyn
  2018-02-14 12:28   ` Miklos Szeredi
  1 sibling, 0 replies; 89+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  3:39 UTC (permalink / raw)
  To: Dongsu Park
  Cc: linux-kernel, Miklos Szeredi, containers, Seth Forshee,
	Alban Crequy, Eric W . Biederman, Sargun Dhillon, linux-fsdevel,
	Alexander Viro

On Fri, Dec 22, 2017 at 03:32:31PM +0100, Dongsu Park wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
> 
> The user in control of a super block should be allowed to freeze
> and thaw it. Relax the restrictions on the FIFREEZE and FITHAW
> ioctls to require CAP_SYS_ADMIN in s_user_ns.
> 
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>

Reviewed-by: Serge Hallyn <serge@hallyn.com>

> ---
>  fs/ioctl.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index 5ace7efb..8c628a8d 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -549,7 +549,7 @@ static int ioctl_fsfreeze(struct file *filp)
>  {
>  	struct super_block *sb = file_inode(filp)->i_sb;
>  
> -	if (!capable(CAP_SYS_ADMIN))
> +	if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
>  		return -EPERM;
>  
>  	/* If filesystem doesn't support freeze feature, return. */
> @@ -566,7 +566,7 @@ static int ioctl_fsthaw(struct file *filp)
>  {
>  	struct super_block *sb = file_inode(filp)->i_sb;
>  
> -	if (!capable(CAP_SYS_ADMIN))
> +	if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
>  		return -EPERM;
>  
>  	/* Thaw */
> -- 
> 2.13.6

* Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
  2017-12-22 14:32 ` [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns Dongsu Park
@ 2017-12-23  3:46   ` Serge E. Hallyn
  2018-01-17 10:59   ` Alban Crequy
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 89+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  3:46 UTC (permalink / raw)
  To: Dongsu Park
  Cc: linux-kernel, Miklos Szeredi, containers, Seth Forshee,
	Alban Crequy, Eric W . Biederman, Sargun Dhillon, linux-fsdevel

On Fri, Dec 22, 2017 at 03:32:32PM +0100, Dongsu Park wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
> 
> In order to support mounts from namespaces other than
> init_user_ns, fuse must translate uids and gids to/from the
> userns of the process servicing requests on /dev/fuse. This
> patch does that, with a couple of restrictions on the namespace:
> 
>  - The userns for the fuse connection is fixed to the namespace
>    from which /dev/fuse is opened.
> 
>  - The namespace must be the same as s_user_ns.
> 
> These restrictions simplify the implementation by avoiding the
> need to pass around userns references and by allowing fuse to
> rely on the checks in inode_change_ok for ownership changes.
> Either restriction could be relaxed in the future if needed.
> 
> For cuse the namespace used for the connection is also simply
> current_user_ns() at the time /dev/cuse is opened.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944661/
> 
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Miklos Szeredi <mszeredi@redhat.com>
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>

Acked-by: Serge Hallyn <serge@hallyn.com>
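The core of this patch is replacing init_user_ns with fc->user_ns in the uid/gid conversions, and switching from from_kuid_munged() to from_kuid() so that an unmappable id is detected (the new -EOVERFLOW path in __fuse_get_req()) instead of silently becoming the overflow uid. A minimal userspace model of a single-extent uid_map illustrates the difference. `struct model_idmap` and the model_* functions are hypothetical, and 65534 is assumed as the overflow uid (the usual default of /proc/sys/kernel/overflowuid).

```c
#include <stdint.h>

typedef uint32_t model_uid_t;
#define MODEL_INVALID_UID  ((model_uid_t)-1)
#define MODEL_OVERFLOW_UID ((model_uid_t)65534) /* assumed overflowuid default */

/* One line of /proc/<pid>/uid_map: "first lower_first count". */
struct model_idmap {
	model_uid_t first;       /* first id inside the namespace */
	model_uid_t lower_first; /* corresponding kernel (global) id */
	model_uid_t count;       /* length of the range */
};

/* make_kuid(): namespace-local id -> kernel id, or INVALID if unmapped. */
static model_uid_t model_make_kuid(const struct model_idmap *m, model_uid_t uid)
{
	/* Unsigned subtraction wraps, so uid < first is also rejected. */
	if (uid - m->first < m->count)
		return m->lower_first + (uid - m->first);
	return MODEL_INVALID_UID;
}

/* from_kuid(): kernel id -> namespace-local id, or (uid_t)-1 if unmapped. */
static model_uid_t model_from_kuid(const struct model_idmap *m, model_uid_t kuid)
{
	if (kuid - m->lower_first < m->count)
		return m->first + (kuid - m->lower_first);
	return MODEL_INVALID_UID;
}

/* from_kuid_munged(): like from_kuid(), but never fails. */
static model_uid_t model_from_kuid_munged(const struct model_idmap *m,
					  model_uid_t kuid)
{
	model_uid_t uid = model_from_kuid(m, kuid);

	return uid == MODEL_INVALID_UID ? MODEL_OVERFLOW_UID : uid;
}
```

With a map of `0 100000 65536`, a request from global root (kuid 0) would previously have reached the daemon as uid 65534; with from_kuid() the (uid_t)-1 result lets the request fail with -EOVERFLOW instead.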

> ---
>  fs/fuse/cuse.c   |  3 ++-
>  fs/fuse/dev.c    | 11 ++++++++---
>  fs/fuse/dir.c    | 14 +++++++-------
>  fs/fuse/fuse_i.h |  6 +++++-
>  fs/fuse/inode.c  | 31 +++++++++++++++++++------------
>  5 files changed, 41 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
> index e9e97803..b1b83259 100644
> --- a/fs/fuse/cuse.c
> +++ b/fs/fuse/cuse.c
> @@ -48,6 +48,7 @@
>  #include <linux/stat.h>
>  #include <linux/module.h>
>  #include <linux/uio.h>
> +#include <linux/user_namespace.h>
>  
>  #include "fuse_i.h"
>  
> @@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
>  	if (!cc)
>  		return -ENOMEM;
>  
> -	fuse_conn_init(&cc->fc);
> +	fuse_conn_init(&cc->fc, current_user_ns());
>  
>  	fud = fuse_dev_alloc(&cc->fc);
>  	if (!fud) {
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 17f0d05b..0f780e16 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
>  
>  static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
>  {
> -	req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
> -	req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
> +	req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
> +	req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
>  	req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
>  }
>  
> @@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
>  	__set_bit(FR_WAITING, &req->flags);
>  	if (for_background)
>  		__set_bit(FR_BACKGROUND, &req->flags);
> +	if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) {
> +		fuse_put_request(fc, req);
> +		return ERR_PTR(-EOVERFLOW);
> +	}
>  
>  	return req;
>  
> @@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
>  	in = &req->in;
>  	reqsize = in->h.len;
>  
> -	if (task_active_pid_ns(current) != fc->pid_ns) {
> +	if (task_active_pid_ns(current) != fc->pid_ns ||
> +	    current_user_ns() != fc->user_ns) {
>  		rcu_read_lock();
>  		in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
>  		rcu_read_unlock();
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 24967382..ad1cfac1 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
>  	stat->ino = attr->ino;
>  	stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
>  	stat->nlink = attr->nlink;
> -	stat->uid = make_kuid(&init_user_ns, attr->uid);
> -	stat->gid = make_kgid(&init_user_ns, attr->gid);
> +	stat->uid = make_kuid(fc->user_ns, attr->uid);
> +	stat->gid = make_kgid(fc->user_ns, attr->gid);
>  	stat->rdev = inode->i_rdev;
>  	stat->atime.tv_sec = attr->atime;
>  	stat->atime.tv_nsec = attr->atimensec;
> @@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
>  	return true;
>  }
>  
> -static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
> -			   bool trust_local_cmtime)
> +static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
> +			   struct fuse_setattr_in *arg, bool trust_local_cmtime)
>  {
>  	unsigned ivalid = iattr->ia_valid;
>  
>  	if (ivalid & ATTR_MODE)
>  		arg->valid |= FATTR_MODE,   arg->mode = iattr->ia_mode;
>  	if (ivalid & ATTR_UID)
> -		arg->valid |= FATTR_UID,    arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
> +		arg->valid |= FATTR_UID,    arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
>  	if (ivalid & ATTR_GID)
> -		arg->valid |= FATTR_GID,    arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
> +		arg->valid |= FATTR_GID,    arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
>  	if (ivalid & ATTR_SIZE)
>  		arg->valid |= FATTR_SIZE,   arg->size = iattr->ia_size;
>  	if (ivalid & ATTR_ATIME) {
> @@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
>  
>  	memset(&inarg, 0, sizeof(inarg));
>  	memset(&outarg, 0, sizeof(outarg));
> -	iattr_to_fattr(attr, &inarg, trust_local_cmtime);
> +	iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
>  	if (file) {
>  		struct fuse_file *ff = file->private_data;
>  		inarg.valid |= FATTR_FH;
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index d5773ca6..364e65c8 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -26,6 +26,7 @@
>  #include <linux/xattr.h>
>  #include <linux/pid_namespace.h>
>  #include <linux/refcount.h>
> +#include <linux/user_namespace.h>
>  
>  /** Max number of pages that can be used in a single read request */
>  #define FUSE_MAX_PAGES_PER_REQ 32
> @@ -466,6 +467,9 @@ struct fuse_conn {
>  	/** The pid namespace for this mount */
>  	struct pid_namespace *pid_ns;
>  
> +	/** The user namespace for this mount */
> +	struct user_namespace *user_ns;
> +
>  	/** Maximum read size */
>  	unsigned max_read;
>  
> @@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
>  /**
>   * Initialize fuse_conn
>   */
> -void fuse_conn_init(struct fuse_conn *fc);
> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
>  
>  /**
>   * Release reference to fuse_conn
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 2f504d61..7f6b2e55 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
>  	inode->i_ino     = fuse_squash_ino(attr->ino);
>  	inode->i_mode    = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
>  	set_nlink(inode, attr->nlink);
> -	inode->i_uid     = make_kuid(&init_user_ns, attr->uid);
> -	inode->i_gid     = make_kgid(&init_user_ns, attr->gid);
> +	inode->i_uid     = make_kuid(fc->user_ns, attr->uid);
> +	inode->i_gid     = make_kgid(fc->user_ns, attr->gid);
>  	inode->i_blocks  = attr->blocks;
>  	inode->i_atime.tv_sec   = attr->atime;
>  	inode->i_atime.tv_nsec  = attr->atimensec;
> @@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
>  	return err;
>  }
>  
> -static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
> +static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
> +			  struct user_namespace *user_ns)
>  {
>  	char *p;
>  	memset(d, 0, sizeof(struct fuse_mount_data));
> @@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
>  		case OPT_USER_ID:
>  			if (fuse_match_uint(&args[0], &uv))
>  				return 0;
> -			d->user_id = make_kuid(current_user_ns(), uv);
> +			d->user_id = make_kuid(user_ns, uv);
>  			if (!uid_valid(d->user_id))
>  				return 0;
>  			d->user_id_present = 1;
> @@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
>  		case OPT_GROUP_ID:
>  			if (fuse_match_uint(&args[0], &uv))
>  				return 0;
> -			d->group_id = make_kgid(current_user_ns(), uv);
> +			d->group_id = make_kgid(user_ns, uv);
>  			if (!gid_valid(d->group_id))
>  				return 0;
>  			d->group_id_present = 1;
> @@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
>  	struct super_block *sb = root->d_sb;
>  	struct fuse_conn *fc = get_fuse_conn_super(sb);
>  
> -	seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
> -	seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
> +	seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
> +	seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
>  	if (fc->default_permissions)
>  		seq_puts(m, ",default_permissions");
>  	if (fc->allow_other)
> @@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
>  	fpq->connected = 1;
>  }
>  
> -void fuse_conn_init(struct fuse_conn *fc)
> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
>  {
>  	memset(fc, 0, sizeof(*fc));
>  	spin_lock_init(&fc->lock);
> @@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc)
>  	fc->attr_version = 1;
>  	get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
>  	fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
> +	fc->user_ns = get_user_ns(user_ns);
>  }
>  EXPORT_SYMBOL_GPL(fuse_conn_init);
>  
> @@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc)
>  		if (fc->destroy_req)
>  			fuse_request_free(fc->destroy_req);
>  		put_pid_ns(fc->pid_ns);
> +		put_user_ns(fc->user_ns);
>  		fc->release(fc);
>  	}
>  }
> @@ -1061,7 +1064,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>  
>  	sb->s_flags &= ~(MS_NOSEC | SB_I_VERSION);
>  
> -	if (!parse_fuse_opt(data, &d, is_bdev))
> +	if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
>  		goto err;
>  
>  	if (is_bdev) {
> @@ -1086,8 +1089,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>  	if (!file)
>  		goto err;
>  
> -	if ((file->f_op != &fuse_dev_operations) ||
> -	    (file->f_cred->user_ns != &init_user_ns))
> +	/*
> +	 * Require mount to happen from the same user namespace which
> +	 * opened /dev/fuse to prevent potential attacks.
> +	 */
> +	if (file->f_op != &fuse_dev_operations ||
> +	    file->f_cred->user_ns != sb->s_user_ns)
>  		goto err_fput;
>  
>  	fc = kmalloc(sizeof(*fc), GFP_KERNEL);
> @@ -1095,7 +1102,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>  	if (!fc)
>  		goto err_fput;
>  
> -	fuse_conn_init(fc);
> +	fuse_conn_init(fc, sb->s_user_ns);
>  	fc->release = fuse_free_conn;
>  
>  	fud = fuse_dev_alloc(fc);
> -- 
> 2.13.6
> 
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
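
A rough userspace model of the id translation this patch depends on may help when reviewing the new -EOVERFLOW check. It is a sketch, not the kernel implementation. It assumes a single extent table in the style of /proc/<pid>/uid_map that stands in for both the uid and gid maps. `struct id_extent`, `map_id_up()`, `map_id_down()` and `fuse_req_ids()` are hypothetical names, and `INVALID_ID` plays the role of the (uid_t)-1 that from_kuid() returns for an id with no mapping.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* One line of a uid_map/gid_map: "first lower_first count". */
struct id_extent {
	uint32_t first;        /* id as seen inside the user namespace */
	uint32_t lower_first;  /* matching id on the host (kernel side) */
	uint32_t count;        /* number of consecutive ids mapped */
};

struct id_map {
	const struct id_extent *extents;
	size_t nr_extents;
};

/* Stand-in for the (uid_t)-1 returned by from_kuid() for unmapped ids. */
#define INVALID_ID ((uint32_t)-1)

/* Host id -> namespace id, roughly what from_kuid()/from_kgid() do. */
static uint32_t map_id_up(const struct id_map *map, uint32_t kid)
{
	for (size_t i = 0; i < map->nr_extents; i++) {
		const struct id_extent *e = &map->extents[i];

		if (kid >= e->lower_first && kid - e->lower_first < e->count)
			return e->first + (kid - e->lower_first);
	}
	return INVALID_ID;
}

/* Namespace id -> host id, roughly what make_kuid()/make_kgid() do. */
static uint32_t map_id_down(const struct id_map *map, uint32_t id)
{
	for (size_t i = 0; i < map->nr_extents; i++) {
		const struct id_extent *e = &map->extents[i];

		if (id >= e->first && id - e->first < e->count)
			return e->lower_first + (id - e->first);
	}
	return INVALID_ID;
}

/*
 * Fill in the uid/gid sent to the FUSE server and refuse the request
 * if either id has no mapping, like the new check in __fuse_get_req().
 * The same map is used for uids and gids to keep the model short.
 */
static int fuse_req_ids(const struct id_map *map, uint32_t fsuid,
			uint32_t fsgid, uint32_t *uid, uint32_t *gid)
{
	*uid = map_id_up(map, fsuid);
	*gid = map_id_up(map, fsgid);
	if (*uid == INVALID_ID || *gid == INVALID_ID)
		return -EOVERFLOW;
	return 0;
}
```

With the map "0 100000 65536", host uid 101000 reaches the FUSE server as uid 1000, while a request from host root (uid 0) has no mapping and is refused with -EOVERFLOW instead of being sent with a munged id.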

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 09/11] fuse: Restrict allow_other to the superblock's namespace or a descendant
  2017-12-22 14:32 ` [PATCH 09/11] fuse: Restrict allow_other to the superblock's namespace or a descendant Dongsu Park
@ 2017-12-23  3:50   ` Serge E. Hallyn
  2018-02-19 23:16   ` Eric W. Biederman
  1 sibling, 0 replies; 89+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  3:50 UTC (permalink / raw)
  To: Dongsu Park
  Cc: linux-kernel, containers, Alban Crequy, Eric W . Biederman,
	Miklos Szeredi, Seth Forshee, Sargun Dhillon, linux-fsdevel,
	Serge Hallyn

On Fri, Dec 22, 2017 at 03:32:33PM +0100, Dongsu Park wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
> 
> Unprivileged users are normally restricted from mounting with the
> allow_other option by system policy, but this could be bypassed
> for a mount done with user namespace root permissions. In such
> cases allow_other should not allow users outside the userns
> to access the mount as doing so would give the unprivileged user
> the ability to manipulate processes it would otherwise be unable
> to manipulate. Restrict allow_other to apply to users in the same
> userns used at mount or a descendant of that namespace. Also
> export current_in_userns() for use by fuse when built as a
> module.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944671/
> 
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Serge Hallyn <serge@hallyn.com>
> Cc: Miklos Szeredi <mszeredi@redhat.com>
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>

Reviewed-by: Serge Hallyn <serge@hallyn.com>

> ---
>  fs/fuse/dir.c           | 2 +-
>  kernel/user_namespace.c | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index ad1cfac1..d41559a0 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
>  	const struct cred *cred;
>  
>  	if (fc->allow_other)
> -		return 1;
> +		return current_in_userns(fc->user_ns);
>  
>  	cred = current_cred();
>  	if (uid_eq(cred->euid, fc->user_id) &&
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 246d4d4c..492c255e 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace *target_ns)
>  {
>  	return in_userns(target_ns, current_user_ns());
>  }
> +EXPORT_SYMBOL(current_in_userns);

I have to say I'm not happy with this name.  I wish it had been
called current_under_userns or something to indicate it may also
be in a child.

>  
>  static inline struct user_namespace *to_user_ns(struct ns_common *ns)
>  {
> -- 
> 2.13.6
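
Serge's point about the name is easier to see with a model of the ancestry walk. This sketch assumes the parent/level walk that in_userns() used at the time; `struct userns` and `allow_other_permits()` are hypothetical userspace stand-ins, not kernel API. A caller in a child of the mount's namespace passes the check, while a caller in a sibling namespace does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for struct user_namespace. */
struct userns {
	const struct userns *parent;  /* NULL for the initial namespace */
	int level;                    /* nesting depth, 0 for the initial one */
};

/*
 * True if @child is @ancestor itself or any namespace nested below it:
 * climb from @child until we are at @ancestor's depth, then compare.
 */
static bool in_userns(const struct userns *ancestor,
		      const struct userns *child)
{
	const struct userns *ns = child;

	while (ns->level > ancestor->level)
		ns = ns->parent;
	return ns == ancestor;
}

/* allow_other after the patch: mount's namespace or a descendant only. */
static bool allow_other_permits(const struct userns *mount_ns,
				const struct userns *caller_ns)
{
	return in_userns(mount_ns, caller_ns);
}
```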

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 10/11] fuse: Allow user namespace mounts
  2017-12-22 14:32 ` [PATCH 10/11] fuse: Allow user namespace mounts Dongsu Park
@ 2017-12-23  3:51   ` Serge E. Hallyn
  2018-02-14 13:44   ` Miklos Szeredi
  1 sibling, 0 replies; 89+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  3:51 UTC (permalink / raw)
  To: Dongsu Park
  Cc: linux-kernel, Miklos Szeredi, containers, Seth Forshee,
	Alban Crequy, Eric W . Biederman, Sargun Dhillon, linux-fsdevel

On Fri, Dec 22, 2017 at 03:32:34PM +0100, Dongsu Park wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
> 
> To be able to mount fuse from non-init user namespaces, it's necessary
> to set the FS_USERNS_MOUNT flag in fs_flags.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944681/
> 
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Miklos Szeredi <mszeredi@redhat.com>
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> [dongsu: add a simple commit message]
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>

Reviewed-by: Serge Hallyn <serge@hallyn.com>

> ---
>  fs/fuse/inode.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 7f6b2e55..8c98edee 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1212,7 +1212,7 @@ static void fuse_kill_sb_anon(struct super_block *sb)
>  static struct file_system_type fuse_fs_type = {
>  	.owner		= THIS_MODULE,
>  	.name		= "fuse",
> -	.fs_flags	= FS_HAS_SUBTYPE,
> +	.fs_flags	= FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
>  	.mount		= fuse_mount,
>  	.kill_sb	= fuse_kill_sb_anon,
>  };
> @@ -1244,7 +1244,7 @@ static struct file_system_type fuseblk_fs_type = {
>  	.name		= "fuseblk",
>  	.mount		= fuse_mount_blk,
>  	.kill_sb	= fuse_kill_sb_blk,
> -	.fs_flags	= FS_REQUIRES_DEV | FS_HAS_SUBTYPE,
> +	.fs_flags	= FS_REQUIRES_DEV | FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
>  };
>  MODULE_ALIAS_FS("fuseblk");
>  
> -- 
> 2.13.6
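
The effect of adding FS_USERNS_MOUNT can be approximated by the gate below. It is a simplified sketch: the flag values are illustrative (include/linux/fs.h has the real ones), `struct fs_type_model` and `userns_mount_allowed()` are hypothetical, and the real mount path applies further checks that are not modelled here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Illustrative flag bits only. */
#define FS_REQUIRES_DEV  0x1
#define FS_HAS_SUBTYPE   0x4
#define FS_USERNS_MOUNT  0x8

struct fs_type_model {
	const char *name;
	unsigned int fs_flags;
};

/* May a mount requested from a non-initial user namespace proceed? */
static int userns_mount_allowed(const struct fs_type_model *type,
				bool caller_in_init_userns)
{
	if (caller_in_init_userns)
		return 0;  /* this gate does not apply; CAP_SYS_ADMIN rules do */
	if (!(type->fs_flags & FS_USERNS_MOUNT))
		return -EPERM;
	return 0;
}
```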

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev()
  2017-12-22 18:59   ` Coly Li
@ 2017-12-23 12:00     ` Dongsu Park
  0 siblings, 0 replies; 89+ messages in thread
From: Dongsu Park @ 2017-12-23 12:00 UTC (permalink / raw)
  To: Coly Li
  Cc: linux-kernel, containers, Alban Crequy, Eric W . Biederman,
	Miklos Szeredi, Seth Forshee, Sargun Dhillon, dm-devel,
	linux-bcache, linux-fsdevel, linux-mtd, Alexander Viro, Jan Kara,
	Serge Hallyn

Hi,

On Fri, Dec 22, 2017 at 7:59 PM, Coly Li <i@coly.li> wrote:
> On 22/12/2017 10:32 PM, Dongsu Park wrote:
> Hi Dongsu,
>
> Could you please use a macro like NO_PERMISSION_CHECK to replace the
> hard-coded 0? That way, at least I would not need to check what 0 means
> in the new lookup_bdev().

I see. I'll do that.

Thanks,
Dongsu

> Thanks.
>
> Coly Li
>
>> ---
>>  drivers/md/bcache/super.c |  2 +-
>>  drivers/md/dm-table.c     |  2 +-
>>  drivers/mtd/mtdsuper.c    |  2 +-
>>  fs/block_dev.c            | 13 ++++++++++---
>>  fs/quota/quota.c          |  2 +-
>>  include/linux/fs.h        |  2 +-
>>  6 files changed, 15 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index b4d28928..acc9d56c 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -1967,7 +1967,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>>                                 sb);
>>       if (IS_ERR(bdev)) {
>>               if (bdev == ERR_PTR(-EBUSY)) {
>> -                     bdev = lookup_bdev(strim(path));
>> +                     bdev = lookup_bdev(strim(path), 0);
>>                       mutex_lock(&bch_register_lock);
>>                       if (!IS_ERR(bdev) && bch_is_open(bdev))
>>                               err = "device already registered";
>> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
>> index 88130b5d..bca5eaf4 100644
> [snip]
>
>
> --
> Coly Li
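
Coly's suggestion could look roughly like this. `NO_PERMISSION_CHECK` is the name proposed in review and does not exist in the tree; the MAY_* values are illustrative, group permissions are ignored, and `check_bdev_access()` is a hypothetical userspace model of the semantics in the commit message (a mask of 0 skips the check, and CAP_SYS_ADMIN always bypasses it).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Name suggested in review; hypothetical, not in the kernel tree. */
#define NO_PERMISSION_CHECK 0

/* Access-mask bits in the style of the kernel's MAY_* (illustrative). */
#define MAY_WRITE 0x2
#define MAY_READ  0x4

/* Hypothetical stand-in for the block device inode found at a path. */
struct bdev_inode_model {
	unsigned int mode;  /* permission bits, e.g. 0660 */
	unsigned int uid;   /* owner */
};

/* Owner bits for the owner, "other" bits for everyone else (no groups). */
static bool inode_permits(const struct bdev_inode_model *inode,
			  unsigned int uid, int mask)
{
	unsigned int bits = (uid == inode->uid) ? (inode->mode >> 6) & 7
						: inode->mode & 7;

	return (bits & (unsigned int)mask) == (unsigned int)mask;
}

/* Semantics from the commit message: 0 skips, CAP_SYS_ADMIN bypasses. */
static int check_bdev_access(const struct bdev_inode_model *inode,
			     unsigned int uid, bool cap_sys_admin, int mask)
{
	if (mask == NO_PERMISSION_CHECK || cap_sys_admin)
		return 0;
	return inode_permits(inode, uid, mask) ? 0 : -EACCES;
}
```

Under this naming, existing callers such as bcache would read lookup_bdev(path, NO_PERMISSION_CHECK) rather than lookup_bdev(path, 0), which makes the "no functional change" intent visible at each call site.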

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root
  2017-12-23  3:26   ` Serge E. Hallyn
@ 2017-12-23 12:38     ` Dongsu Park
  2018-02-13 13:37       ` Miklos Szeredi
  0 siblings, 1 reply; 89+ messages in thread
From: Dongsu Park @ 2017-12-23 12:38 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: LKML, Linux Containers, Alban Crequy, Eric W . Biederman,
	Miklos Szeredi, Seth Forshee, Sargun Dhillon, linux-fsdevel,
	Alexander Viro

Hi,

On Sat, Dec 23, 2017 at 4:26 AM, Serge E. Hallyn <serge@hallyn.com> wrote:
> On Fri, Dec 22, 2017 at 03:32:28PM +0100, Dongsu Park wrote:
>> From: Seth Forshee <seth.forshee@canonical.com>
>>
>> Expand the check in should_remove_suid() to keep privileges for
>
> I realize this description came from Seth, but reading it now,
> 'Expand' seems wrong.  Expanding a check brings to my mind making
> it stricter, not looser.  How about 'Relax the check' ?

Makes sense. Will do.

>> CAP_FSETID in s_user_ns rather than init_user_ns.
>>
>> Patch v4 is available: https://patchwork.kernel.org/patch/8944621/
>>
>> --EWB Changed from ns_capable(sb->s_user_ns, ) to capable_wrt_inode_uidgid
>
> Why exactly?
>
> This is wrong, because capable_wrt_inode_uidgid() does a check
> against current_user_ns, not inode->i_sb->s_user_ns.

Ah, I see.
I suppose it was changed to pick up the privileged_wrt_inode_uidgid() check
that capable_wrt_inode_uidgid() performs. But as you pointed out, that check
is done against current_user_ns, which is wrong. I would instead create
another wrapper, such as capable_userns_wrt_inode_uidgid(), which takes an
additional struct user_namespace * parameter so that it can do both the
ns_capable() and the privileged_wrt_inode_uidgid() checks against that
namespace.

Thanks,
Dongsu

>> Cc: linux-fsdevel@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Cc: Serge Hallyn <serge@hallyn.com>
>> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
>> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
>> ---
>>  fs/inode.c | 6 ++++--
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/inode.c b/fs/inode.c
>> index fd401028..6459a437 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -1749,7 +1749,8 @@ EXPORT_SYMBOL(touch_atime);
>>   */
>>  int should_remove_suid(struct dentry *dentry)
>>  {
>> -     umode_t mode = d_inode(dentry)->i_mode;
>> +     struct inode *inode = d_inode(dentry);
>> +     umode_t mode = inode->i_mode;
>>       int kill = 0;
>>
>>       /* suid always must be killed */
>> @@ -1763,7 +1764,8 @@ int should_remove_suid(struct dentry *dentry)
>>       if (unlikely((mode & S_ISGID) && (mode & S_IXGRP)))
>>               kill |= ATTR_KILL_SGID;
>>
>> -     if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode)))
>> +     if (unlikely(kill && !capable_wrt_inode_uidgid(inode, CAP_FSETID) &&
>> +                  S_ISREG(mode)))
>>               return kill;
>>
>>       return 0;
>> --
>> 2.13.6

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes
  2017-12-22 14:32 ` [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes Dongsu Park
  2017-12-23  3:17   ` Serge E. Hallyn
@ 2018-01-05 19:24   ` Luis R. Rodriguez
  2018-01-09 15:10     ` Dongsu Park
  2018-02-13 13:18   ` Miklos Szeredi
  2 siblings, 1 reply; 89+ messages in thread
From: Luis R. Rodriguez @ 2018-01-05 19:24 UTC (permalink / raw)
  To: Dongsu Park
  Cc: linux-kernel, containers, Alban Crequy, Eric W . Biederman,
	Miklos Szeredi, Seth Forshee, Sargun Dhillon, linux-fsdevel,
	Alexander Viro, Luis R. Rodriguez, Kees Cook

On Fri, Dec 22, 2017 at 03:32:27PM +0100, Dongsu Park wrote:
> diff --git a/fs/attr.c b/fs/attr.c
> index 12ffdb6f..bf8e94f3 100644
> --- a/fs/attr.c
> +++ b/fs/attr.c
> @@ -18,6 +18,30 @@
>  #include <linux/evm.h>
>  #include <linux/ima.h>
>  
> +static bool chown_ok(const struct inode *inode, kuid_t uid)
> +{
> +	if (uid_eq(current_fsuid(), inode->i_uid) &&
> +	    uid_eq(uid, inode->i_uid))
> +		return true;
> +	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> +		return true;
> +	if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
> +		return true;
> +	return false;
> +}
> +
> +static bool chgrp_ok(const struct inode *inode, kgid_t gid)
> +{
> +	if (uid_eq(current_fsuid(), inode->i_uid) &&
> +	    (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
> +		return true;
> +	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> +		return true;
> +	if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
> +		return true;
> +	return false;
> +}
> +
>  /**
>   * setattr_prepare - check if attribute changes to a dentry are allowed
>   * @dentry:	dentry to check
> @@ -52,17 +76,11 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
>  		goto kill_priv;
>  
>  	/* Make sure a caller can chown. */
> -	if ((ia_valid & ATTR_UID) &&
> -	    (!uid_eq(current_fsuid(), inode->i_uid) ||
> -	     !uid_eq(attr->ia_uid, inode->i_uid)) &&
> -	    !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> +	if ((ia_valid & ATTR_UID) && !chown_ok(inode, attr->ia_uid))
>  		return -EPERM;

I think this patch would read much better and easier to review if it was
split up by first adding the helpers, and then extending them afterwards.
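
To illustrate the split suggested here, the helpers can be modelled with the capability checks reduced to booleans. This is a hypothetical userspace sketch (`struct chown_ctx` and `caller_in_new_group` are invented stand-ins for the kernel calls named in the comments). The first two conditions in each helper are the existing setattr_prepare() logic and only the last one is new, so a first patch could add the helpers without it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: the inputs chown_ok()/chgrp_ok() consult, as plain values. */
struct chown_ctx {
	unsigned int fsuid;           /* current_fsuid() */
	bool cap_chown_wrt_inode;     /* capable_wrt_inode_uidgid(inode, CAP_CHOWN) */
	bool cap_chown_in_sb_userns;  /* ns_capable(i_sb->s_user_ns, CAP_CHOWN) */
};

static bool chown_ok(const struct chown_ctx *c, unsigned int i_uid,
		     unsigned int new_uid)
{
	/* Existing rule: the owner may only "chown" to the current owner. */
	if (c->fsuid == i_uid && new_uid == i_uid)
		return true;
	if (c->cap_chown_wrt_inode)
		return true;
	/* New rule added by the patch. */
	return c->cap_chown_in_sb_userns;
}

static bool chgrp_ok(const struct chown_ctx *c, unsigned int i_uid,
		     unsigned int i_gid, unsigned int new_gid,
		     bool caller_in_new_group)
{
	/* Existing rule: the owner may pick a group it belongs to. */
	if (c->fsuid == i_uid && (caller_in_new_group || new_gid == i_gid))
		return true;
	if (c->cap_chown_wrt_inode)
		return true;
	/* New rule added by the patch. */
	return c->cap_chown_in_sb_userns;
}
```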

>  
>  	/* Make sure caller can chgrp. */
> -	if ((ia_valid & ATTR_GID) &&
> -	    (!uid_eq(current_fsuid(), inode->i_uid) ||
> -	    (!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) &&
> -	    !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> +	if ((ia_valid & ATTR_GID) && !chgrp_ok(inode, attr->ia_gid))
>  		return -EPERM;
>  
>  	/* Make sure a caller can chmod. */
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 31934cb9..9d50ec92 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -665,10 +665,17 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
>  {
>  	int error;
>  	struct inode *inode = d_inode(dentry);
> +	struct user_namespace *s_user_ns;
>  
>  	if (attr->ia_valid & ATTR_MODE)
>  		return -EPERM;
>  
> +	/* Don't let anyone mess with weird proc files */
> +	s_user_ns = inode->i_sb->s_user_ns;
> +	if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
> +	    !kgid_has_mapping(s_user_ns, inode->i_gid))
> +		return -EPERM;
> +
>  	error = setattr_prepare(dentry, attr);
>  	if (error)
>  		return error;

Are we sure proc is the only special one? How was it observed first that this was
required for proc? Has anyone tried fuzzing by trying this op with a slew of other
filesystems on all files?

  Luis

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes
  2018-01-05 19:24   ` Luis R. Rodriguez
@ 2018-01-09 15:10     ` Dongsu Park
  2018-01-09 17:23       ` Luis R. Rodriguez
  0 siblings, 1 reply; 89+ messages in thread
From: Dongsu Park @ 2018-01-09 15:10 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: LKML, Linux Containers, Alban Crequy, Eric W . Biederman,
	Miklos Szeredi, Seth Forshee, Sargun Dhillon, linux-fsdevel,
	Alexander Viro, Kees Cook

Hi,

On Fri, Jan 5, 2018 at 8:24 PM, Luis R. Rodriguez <mcgrof@kernel.org> wrote:
> On Fri, Dec 22, 2017 at 03:32:27PM +0100, Dongsu Park wrote:
>> diff --git a/fs/attr.c b/fs/attr.c
>> index 12ffdb6f..bf8e94f3 100644
>> --- a/fs/attr.c
>> +++ b/fs/attr.c
>> @@ -18,6 +18,30 @@
>>  #include <linux/evm.h>
>>  #include <linux/ima.h>
>>
>> +static bool chown_ok(const struct inode *inode, kuid_t uid)
>> +{
>> +     if (uid_eq(current_fsuid(), inode->i_uid) &&
>> +         uid_eq(uid, inode->i_uid))
>> +             return true;
>> +     if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
>> +             return true;
>> +     if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
>> +             return true;
>> +     return false;
>> +}
>> +
>> +static bool chgrp_ok(const struct inode *inode, kgid_t gid)
>> +{
>> +     if (uid_eq(current_fsuid(), inode->i_uid) &&
>> +         (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
>> +             return true;
>> +     if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
>> +             return true;
>> +     if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
>> +             return true;
>> +     return false;
>> +}
>> +
>>  /**
>>   * setattr_prepare - check if attribute changes to a dentry are allowed
>>   * @dentry:  dentry to check
>> @@ -52,17 +76,11 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
>>               goto kill_priv;
>>
>>       /* Make sure a caller can chown. */
>> -     if ((ia_valid & ATTR_UID) &&
>> -         (!uid_eq(current_fsuid(), inode->i_uid) ||
>> -          !uid_eq(attr->ia_uid, inode->i_uid)) &&
>> -         !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
>> +     if ((ia_valid & ATTR_UID) && !chown_ok(inode, attr->ia_uid))
>>               return -EPERM;
>
> I think this patch would read much better and easier to review if it was
> split up by first adding the helpers, and then extending them afterwards.

I'm fine with splitting it up into multiple patches, if the original author
Eric agrees.

>>       /* Make sure caller can chgrp. */
>> -     if ((ia_valid & ATTR_GID) &&
>> -         (!uid_eq(current_fsuid(), inode->i_uid) ||
>> -         (!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) &&
>> -         !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
>> +     if ((ia_valid & ATTR_GID) && !chgrp_ok(inode, attr->ia_gid))
>>               return -EPERM;
>>
>>       /* Make sure a caller can chmod. */
>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>> index 31934cb9..9d50ec92 100644
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>> @@ -665,10 +665,17 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
>>  {
>>       int error;
>>       struct inode *inode = d_inode(dentry);
>> +     struct user_namespace *s_user_ns;
>>
>>       if (attr->ia_valid & ATTR_MODE)
>>               return -EPERM;
>>
>> +     /* Don't let anyone mess with weird proc files */
>> +     s_user_ns = inode->i_sb->s_user_ns;
>> +     if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
>> +         !kgid_has_mapping(s_user_ns, inode->i_gid))
>> +             return -EPERM;
>> +
>>       error = setattr_prepare(dentry, attr);
>>       if (error)
>>               return error;
>
> Are we sure proc is the only special one? How was it observed first that this was
> required for proc? Has anyone tried fuzzing by trying this op with a slew of other
> filesystems on all files?

From my limited knowledge of procfs, it is a little different from
ordinary filesystems. Procfs is not fully namespaced, and it has many
inconsistencies. Some files under /proc should be owned by the global
root, regardless of user namespaces. That's why we need to handle such
special cases for proc. Since it has behaved like this from the
beginning, it's hard to change fundamentally.
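
The proc_setattr() check under discussion reduces to a mapping test on the inode's owner and group. In this hypothetical sketch a single `struct id_range` stands for the ids mapped into s_user_ns. A proc inode owned by host root (uid 0), which has no mapping in a namespace that maps only 100000-165535, is refused with -EPERM before setattr_prepare() is reached.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical: the host ids mapped into the superblock's s_user_ns. */
struct id_range {
	unsigned int base;
	unsigned int count;
};

static bool id_has_mapping(const struct id_range *r, unsigned int kid)
{
	return kid >= r->base && kid - r->base < r->count;
}

/* Model of the early checks in proc_setattr() after the patch. */
static int proc_setattr_check(const struct id_range *uids,
			      const struct id_range *gids,
			      unsigned int i_uid, unsigned int i_gid,
			      bool changes_mode)
{
	if (changes_mode)
		return -EPERM;
	/* Owner or group not visible in s_user_ns: leave the file alone. */
	if (!id_has_mapping(uids, i_uid) || !id_has_mapping(gids, i_gid))
		return -EPERM;
	return 0;	/* setattr_prepare() and the rest would follow */
}
```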

However, you make good points. Besides procfs, other filesystems could
also have issues once privileges are relaxed. The question is how we can
be sure that there are no hidden issues. As I understand it, we would
usually run test suites like LTP
(https://github.com/linux-test-project/ltp.git) to catch such regressions.
Today I ran the LTP fs and containers tests with the patchset applied,
and they appeared to pass without failures. Of course, that does not mean
the patchset is free of unknown issues.
Please let me know if there are other good ways to figure out potential issues.

Thanks,
Dongsu

>   Luis

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes
  2018-01-09 15:10     ` Dongsu Park
@ 2018-01-09 17:23       ` Luis R. Rodriguez
  0 siblings, 0 replies; 89+ messages in thread
From: Luis R. Rodriguez @ 2018-01-09 17:23 UTC (permalink / raw)
  To: Dongsu Park
  Cc: Luis R. Rodriguez, LKML, Linux Containers, Alban Crequy,
	Eric W . Biederman, Miklos Szeredi, Seth Forshee, Sargun Dhillon,
	linux-fsdevel, Alexander Viro, Kees Cook

On Tue, Jan 09, 2018 at 04:10:54PM +0100, Dongsu Park wrote:
> On Fri, Jan 5, 2018 at 8:24 PM, Luis R. Rodriguez <mcgrof@kernel.org> wrote:
> > On Fri, Dec 22, 2017 at 03:32:27PM +0100, Dongsu Park wrote:
> > I think this patch would read much better and easier to review if it was
> > split up by first adding the helpers, and then extending them afterwards.
> 
> I'm fine with splitting it up into multiple patches, if the original author
> Eric agrees.

Great.

> > Are we sure proc is the only special one? How was it observed first that this was
> > required for proc? Has anyone tried fuzzing by trying this op with a slew of other
> > filesystems on all files?
>
> Please let me know if there are other good ways to figure out potential issues.

I think the trick would be to create a test which mimics the issue and then try to
mount and run the test against as many filesystems as we support. So would developing
a test be possible here?

  Luis

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
  2017-12-22 14:32 ` [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns Dongsu Park
  2017-12-23  3:46   ` Serge E. Hallyn
@ 2018-01-17 10:59   ` Alban Crequy
  2018-01-17 14:29     ` Seth Forshee
  2018-02-12 15:57   ` Miklos Szeredi
  2018-02-20  2:12   ` Eric W. Biederman
  3 siblings, 1 reply; 89+ messages in thread
From: Alban Crequy @ 2018-01-17 10:59 UTC (permalink / raw)
  To: Dongsu Park
  Cc: linux-kernel, containers, Eric W . Biederman, Miklos Szeredi,
	Seth Forshee, Sargun Dhillon, linux-fsdevel, Tejun Heo,
	David Herrmann, Tom Gundersen

[Adding Tejun, David, Tom for a question about cuse]

On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
>
> In order to support mounts from namespaces other than
> init_user_ns, fuse must translate uids and gids to/from the
> userns of the process servicing requests on /dev/fuse. This
> patch does that, with a couple of restrictions on the namespace:
>
>  - The userns for the fuse connection is fixed to the namespace
>    from which /dev/fuse is opened.
>
>  - The namespace must be the same as s_user_ns.
>
> These restrictions simplify the implementation by avoiding the
> need to pass around userns references and by allowing fuse to
> rely on the checks in setattr_prepare() for ownership changes.
> Either restriction could be relaxed in the future if needed.
>
> For cuse the namespace used for the connection is also simply
> current_user_ns() at the time /dev/cuse is opened.

Was a use case discussed for using cuse in a new unprivileged userns?

I ran some tests yesterday with cusexmp [1] and I could add a new char
device as an unprivileged user with:

$ unshare -U -r -m sh -c 'mount --bind /mnt/cuse /dev/cuse ; cusexmp --maj=99 --min=30 --name=foo'

where /mnt/cuse had previously been created correctly with mknod and chmod'ed 777.
Then, I could see the new device:

$ cat /proc/devices | grep foo
 99 foo

Normal distros don't have a /mnt/cuse chmod'ed 777, but this still
seems dangerous if the device node can be provided some other way and
we have no use case for it.

Thoughts?

[1] https://github.com/fuse4x/fuse/blob/master/example/cusexmp.c#L9

Cheers,
Alban


> Patch v4 is available: https://patchwork.kernel.org/patch/8944661/
>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Miklos Szeredi <mszeredi@redhat.com>
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
> ---
>  fs/fuse/cuse.c   |  3 ++-
>  fs/fuse/dev.c    | 11 ++++++++---
>  fs/fuse/dir.c    | 14 +++++++-------
>  fs/fuse/fuse_i.h |  6 +++++-
>  fs/fuse/inode.c  | 31 +++++++++++++++++++------------
>  5 files changed, 41 insertions(+), 24 deletions(-)
>
> diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
> index e9e97803..b1b83259 100644
> --- a/fs/fuse/cuse.c
> +++ b/fs/fuse/cuse.c
> @@ -48,6 +48,7 @@
>  #include <linux/stat.h>
>  #include <linux/module.h>
>  #include <linux/uio.h>
> +#include <linux/user_namespace.h>
>
>  #include "fuse_i.h"
>
> @@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
>         if (!cc)
>                 return -ENOMEM;
>
> -       fuse_conn_init(&cc->fc);
> +       fuse_conn_init(&cc->fc, current_user_ns());
>
>         fud = fuse_dev_alloc(&cc->fc);
>         if (!fud) {
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 17f0d05b..0f780e16 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
>
>  static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
>  {
> -       req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
> -       req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
> +       req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
> +       req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
>         req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
>  }
>
> @@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
>         __set_bit(FR_WAITING, &req->flags);
>         if (for_background)
>                 __set_bit(FR_BACKGROUND, &req->flags);
> +       if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) {
> +               fuse_put_request(fc, req);
> +               return ERR_PTR(-EOVERFLOW);
> +       }
>
>         return req;
>
> @@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
>         in = &req->in;
>         reqsize = in->h.len;
>
> -       if (task_active_pid_ns(current) != fc->pid_ns) {
> +       if (task_active_pid_ns(current) != fc->pid_ns ||
> +           current_user_ns() != fc->user_ns) {
>                 rcu_read_lock();
>                 in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
>                 rcu_read_unlock();
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 24967382..ad1cfac1 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
>         stat->ino = attr->ino;
>         stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
>         stat->nlink = attr->nlink;
> -       stat->uid = make_kuid(&init_user_ns, attr->uid);
> -       stat->gid = make_kgid(&init_user_ns, attr->gid);
> +       stat->uid = make_kuid(fc->user_ns, attr->uid);
> +       stat->gid = make_kgid(fc->user_ns, attr->gid);
>         stat->rdev = inode->i_rdev;
>         stat->atime.tv_sec = attr->atime;
>         stat->atime.tv_nsec = attr->atimensec;
> @@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
>         return true;
>  }
>
> -static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
> -                          bool trust_local_cmtime)
> +static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
> +                          struct fuse_setattr_in *arg, bool trust_local_cmtime)
>  {
>         unsigned ivalid = iattr->ia_valid;
>
>         if (ivalid & ATTR_MODE)
>                 arg->valid |= FATTR_MODE,   arg->mode = iattr->ia_mode;
>         if (ivalid & ATTR_UID)
> -               arg->valid |= FATTR_UID,    arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
> +               arg->valid |= FATTR_UID,    arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
>         if (ivalid & ATTR_GID)
> -               arg->valid |= FATTR_GID,    arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
> +               arg->valid |= FATTR_GID,    arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
>         if (ivalid & ATTR_SIZE)
>                 arg->valid |= FATTR_SIZE,   arg->size = iattr->ia_size;
>         if (ivalid & ATTR_ATIME) {
> @@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
>
>         memset(&inarg, 0, sizeof(inarg));
>         memset(&outarg, 0, sizeof(outarg));
> -       iattr_to_fattr(attr, &inarg, trust_local_cmtime);
> +       iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
>         if (file) {
>                 struct fuse_file *ff = file->private_data;
>                 inarg.valid |= FATTR_FH;
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index d5773ca6..364e65c8 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -26,6 +26,7 @@
>  #include <linux/xattr.h>
>  #include <linux/pid_namespace.h>
>  #include <linux/refcount.h>
> +#include <linux/user_namespace.h>
>
>  /** Max number of pages that can be used in a single read request */
>  #define FUSE_MAX_PAGES_PER_REQ 32
> @@ -466,6 +467,9 @@ struct fuse_conn {
>         /** The pid namespace for this mount */
>         struct pid_namespace *pid_ns;
>
> +       /** The user namespace for this mount */
> +       struct user_namespace *user_ns;
> +
>         /** Maximum read size */
>         unsigned max_read;
>
> @@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
>  /**
>   * Initialize fuse_conn
>   */
> -void fuse_conn_init(struct fuse_conn *fc);
> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
>
>  /**
>   * Release reference to fuse_conn
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 2f504d61..7f6b2e55 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
>         inode->i_ino     = fuse_squash_ino(attr->ino);
>         inode->i_mode    = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
>         set_nlink(inode, attr->nlink);
> -       inode->i_uid     = make_kuid(&init_user_ns, attr->uid);
> -       inode->i_gid     = make_kgid(&init_user_ns, attr->gid);
> +       inode->i_uid     = make_kuid(fc->user_ns, attr->uid);
> +       inode->i_gid     = make_kgid(fc->user_ns, attr->gid);
>         inode->i_blocks  = attr->blocks;
>         inode->i_atime.tv_sec   = attr->atime;
>         inode->i_atime.tv_nsec  = attr->atimensec;
> @@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
>         return err;
>  }
>
> -static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
> +static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
> +                         struct user_namespace *user_ns)
>  {
>         char *p;
>         memset(d, 0, sizeof(struct fuse_mount_data));
> @@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
>                 case OPT_USER_ID:
>                         if (fuse_match_uint(&args[0], &uv))
>                                 return 0;
> -                       d->user_id = make_kuid(current_user_ns(), uv);
> +                       d->user_id = make_kuid(user_ns, uv);
>                         if (!uid_valid(d->user_id))
>                                 return 0;
>                         d->user_id_present = 1;
> @@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
>                 case OPT_GROUP_ID:
>                         if (fuse_match_uint(&args[0], &uv))
>                                 return 0;
> -                       d->group_id = make_kgid(current_user_ns(), uv);
> +                       d->group_id = make_kgid(user_ns, uv);
>                         if (!gid_valid(d->group_id))
>                                 return 0;
>                         d->group_id_present = 1;
> @@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
>         struct super_block *sb = root->d_sb;
>         struct fuse_conn *fc = get_fuse_conn_super(sb);
>
> -       seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
> -       seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
> +       seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
> +       seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
>         if (fc->default_permissions)
>                 seq_puts(m, ",default_permissions");
>         if (fc->allow_other)
> @@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
>         fpq->connected = 1;
>  }
>
> -void fuse_conn_init(struct fuse_conn *fc)
> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
>  {
>         memset(fc, 0, sizeof(*fc));
>         spin_lock_init(&fc->lock);
> @@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc)
>         fc->attr_version = 1;
>         get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
>         fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
> +       fc->user_ns = get_user_ns(user_ns);
>  }
>  EXPORT_SYMBOL_GPL(fuse_conn_init);
>
> @@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc)
>                 if (fc->destroy_req)
>                         fuse_request_free(fc->destroy_req);
>                 put_pid_ns(fc->pid_ns);
> +               put_user_ns(fc->user_ns);
>                 fc->release(fc);
>         }
>  }
> @@ -1061,7 +1064,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>
>         sb->s_flags &= ~(MS_NOSEC | SB_I_VERSION);
>
> -       if (!parse_fuse_opt(data, &d, is_bdev))
> +       if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
>                 goto err;
>
>         if (is_bdev) {
> @@ -1086,8 +1089,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>         if (!file)
>                 goto err;
>
> -       if ((file->f_op != &fuse_dev_operations) ||
> -           (file->f_cred->user_ns != &init_user_ns))
> +       /*
> +        * Require mount to happen from the same user namespace which
> +        * opened /dev/fuse to prevent potential attacks.
> +        */
> +       if (file->f_op != &fuse_dev_operations ||
> +           file->f_cred->user_ns != sb->s_user_ns)
>                 goto err_fput;
>
>         fc = kmalloc(sizeof(*fc), GFP_KERNEL);
> @@ -1095,7 +1102,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>         if (!fc)
>                 goto err_fput;
>
> -       fuse_conn_init(fc);
> +       fuse_conn_init(fc, sb->s_user_ns);
>         fc->release = fuse_free_conn;
>
>         fud = fuse_dev_alloc(fc);
> --
> 2.13.6
>
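The hunks above replace the hard-coded &init_user_ns with fc->user_ns in the uid/gid conversions, make __fuse_get_req() fail with -EOVERFLOW when the caller's fsuid or fsgid has no mapping in that namespace, and make fuse_fill_super() require that /dev/fuse was opened from sb->s_user_ns. A minimal userspace C sketch of that logic follows. struct id_extent, struct user_ns_model and the model_* functions are hypothetical stand-ins, not kernel APIs. The real mapping code in kernel/user_namespace.c also handles gid maps and concurrent map updates, which are omitted here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one uid_map line: "first lower_first count".
 * lower_first is kept as a kernel (init_user_ns) id. */
struct id_extent {
	uint32_t first;       /* first id as seen inside the namespace */
	uint32_t lower_first; /* matching kernel id */
	uint32_t count;       /* number of consecutive ids mapped */
};

struct user_ns_model {
	const struct id_extent *extents;
	size_t nr_extents;
};

#define MODEL_INVALID_ID ((uint32_t)-1)

/* Identity map, like init_user_ns ("0 0 4294967295"). */
static const struct id_extent init_map[] = { { 0, 0, 4294967295u } };
const struct user_ns_model init_ns_model = { init_map, 1 };

/* A typical rootless container: ids 0..65535 map to kernel ids 100000..165535. */
static const struct id_extent rootless_map[] = { { 0, 100000, 65536 } };
const struct user_ns_model rootless_ns = { rootless_map, 1 };

/* Namespace id -> kernel id, in the spirit of make_kuid(). */
uint32_t model_make_kuid(const struct user_ns_model *ns, uint32_t uid)
{
	for (size_t i = 0; i < ns->nr_extents; i++) {
		const struct id_extent *e = &ns->extents[i];
		/* Unsigned subtraction wraps when uid < first, so one compare suffices. */
		if (uid - e->first < e->count)
			return e->lower_first + (uid - e->first);
	}
	return MODEL_INVALID_ID;
}

/* Kernel id -> namespace id, in the spirit of from_kuid(). */
uint32_t model_from_kuid(const struct user_ns_model *ns, uint32_t kuid)
{
	for (size_t i = 0; i < ns->nr_extents; i++) {
		const struct id_extent *e = &ns->extents[i];
		if (kuid - e->lower_first < e->count)
			return e->first + (kuid - e->lower_first);
	}
	return MODEL_INVALID_ID;
}

/* Mirrors the new fuse_fill_super() check: the namespace that opened
 * /dev/fuse must be the superblock's s_user_ns (the model reports -EINVAL). */
int model_check_mount(const struct user_ns_model *opener_ns,
		      const struct user_ns_model *sb_user_ns)
{
	return opener_ns == sb_user_ns ? 0 : -EINVAL;
}

/* Mirrors fuse_req_init_context() plus the new __fuse_get_req() check:
 * a caller whose fsuid is unmapped in the connection's namespace gets
 * -EOVERFLOW instead of a request carrying (uid_t)-1. */
int model_request_uid(const struct user_ns_model *conn_ns,
		      uint32_t caller_kuid, uint32_t *out_uid)
{
	uint32_t uid;

	assert(out_uid != NULL);
	uid = model_from_kuid(conn_ns, caller_kuid);
	if (uid == MODEL_INVALID_ID)
		return -EOVERFLOW;
	*out_uid = uid;
	return 0;
}
```

With rootless_ns, a request from kernel uid 0 (host root, unmapped there) fails with -EOVERFLOW, while kernel uid 100500 reaches the server as uid 500.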

* Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
  2018-01-17 10:59   ` Alban Crequy
@ 2018-01-17 14:29     ` Seth Forshee
  2018-01-17 18:56       ` Alban Crequy
From: Seth Forshee @ 2018-01-17 14:29 UTC
  To: Alban Crequy
  Cc: Dongsu Park, linux-kernel, containers, Eric W . Biederman,
	Miklos Szeredi, Sargun Dhillon, linux-fsdevel, Tejun Heo,
	David Herrmann, Tom Gundersen

On Wed, Jan 17, 2018 at 11:59:06AM +0100, Alban Crequy wrote:
> [Adding Tejun, David, Tom for question about cuse]
> 
> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
> > From: Seth Forshee <seth.forshee@canonical.com>
> >
> > In order to support mounts from namespaces other than
> > init_user_ns, fuse must translate uids and gids to/from the
> > userns of the process servicing requests on /dev/fuse. This
> > patch does that, with a couple of restrictions on the namespace:
> >
> >  - The userns for the fuse connection is fixed to the namespace
> >    from which /dev/fuse is opened.
> >
> >  - The namespace must be the same as s_user_ns.
> >
> > These restrictions simplify the implementation by avoiding the
> > need to pass around userns references and by allowing fuse to
> > rely on the checks in inode_change_ok for ownership changes.
> > Either restriction could be relaxed in the future if needed.
> >
> > For cuse the namespace used for the connection is also simply
> > current_user_ns() at the time /dev/cuse is opened.
> 
> Was a use case discussed for using cuse in a new unprivileged userns?
> 
> I ran some tests yesterday with cusexmp [1] and I could add a new char
> device as an unprivileged user with:
> 
> $ unshare -U -r -m sh -c 'mount --bind /mnt/cuse /dev/cuse ; cusexmp --maj=99 --min=30 --name=foo'
> 
> where /mnt/cuse is previously mknod'ed correctly and chmod'ed 777.
> Then, I could see the new device:
> 
> $ cat /proc/devices | grep foo
>  99 foo
> 
> On normal distros, we don't have a /mnt/cuse chmod'ed 777 but still it
> seems dangerous if the dev node can be provided otherwise and if we
> don't have a use case for it.
> 
> Thoughts?

I can't remember the specific reasons, but I had concluded that letting
unprivileged users use cuse within a user namespace isn't safe. But
having a cuse device node usable by regular users at all is equally
unsafe I suspect, so I don't think your example demonstrates any problem
specific to user namespaces. There shouldn't be any way to use a user
namespace to gain access permissions towards /dev/cuse, otherwise we
have bigger problems than cuse to worry about.

Seth

* Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
  2018-01-17 14:29     ` Seth Forshee
@ 2018-01-17 18:56       ` Alban Crequy
  2018-01-17 19:31         ` Seth Forshee
From: Alban Crequy @ 2018-01-17 18:56 UTC
  To: Seth Forshee
  Cc: Dongsu Park, linux-kernel, containers, Eric W . Biederman,
	Miklos Szeredi, Sargun Dhillon, linux-fsdevel, Tejun Heo,
	David Herrmann, Tom Gundersen

On Wed, Jan 17, 2018 at 3:29 PM, Seth Forshee
<seth.forshee@canonical.com> wrote:
> I can't remember the specific reasons, but I had concluded that letting
> unprivileged users use cuse within a user namespace isn't safe. But
> having a cuse device node usable by regular users at all is equally
> unsafe I suspect,

This makes sense.

> so I don't think your example demonstrates any problem
> specific to user namespaces. There shouldn't be any way to use a user
> namespace to gain access permissions towards /dev/cuse, otherwise we
> have bigger problems than cuse to worry about.

From my tests, the patch seems safe, but I don't fully understand why.

I am not trying to gain more permissions towards /dev/cuse, but to
create another cuse char device from within the unprivileged userns. I
tested the scenario by patching the memfs userspace FUSE driver to
generate the char device whenever the file is named "cuse" (turning
the regular file into a char device with the cuse major/minor behind
the scenes):

$ unshare -U -r -m
# memfs /mnt/memfs &
# ls -l /mnt/memfs
# echo -n > /mnt/memfs/cuse
-bash: /mnt/memfs/cuse: Input/output error
# ls -l /mnt/memfs/cuse
crwxrwxrwx. 1 root root 10, 203 Jan 17 18:24 /mnt/memfs/cuse
# cat /mnt/memfs/cuse
cat: /mnt/memfs/cuse: Permission denied

But then, I could not use that char device, even though it seems to
have the correct major/minor and permissions. The kernel FUSE code
seems to call init_special_inode() to handle character devices. I
don't understand why it seems to be safe.

Thanks!
Alban

* Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
  2018-01-17 18:56       ` Alban Crequy
@ 2018-01-17 19:31         ` Seth Forshee
  2018-01-18 10:29           ` Alban Crequy
From: Seth Forshee @ 2018-01-17 19:31 UTC
  To: Alban Crequy
  Cc: Dongsu Park, linux-kernel, containers, Eric W . Biederman,
	Miklos Szeredi, Sargun Dhillon, linux-fsdevel, Tejun Heo,
	David Herrmann, Tom Gundersen

On Wed, Jan 17, 2018 at 07:56:59PM +0100, Alban Crequy wrote:
> But then, I could not use that char device, even though it seems to
> have the correct major/minor and permissions. The kernel FUSE code
> seems to call init_special_inode() to handle character devices. I
> don't understand why it seems to be safe.

Because, for new mounts in non-init user namespaces, alloc_super() sets
the SB_I_NODEV flag in s_iflags, which disallows opening device nodes in
that filesystem.

Seth

* Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
  2018-01-17 19:31         ` Seth Forshee
@ 2018-01-18 10:29           ` Alban Crequy
From: Alban Crequy @ 2018-01-18 10:29 UTC
  To: Seth Forshee
  Cc: Dongsu Park, linux-kernel, containers, Eric W . Biederman,
	Miklos Szeredi, Sargun Dhillon, linux-fsdevel, Tejun Heo,
	David Herrmann, Tom Gundersen

On Wed, Jan 17, 2018 at 8:31 PM, Seth Forshee
<seth.forshee@canonical.com> wrote:
> Because, for new mounts in non-init user namespaces, alloc_super() sets
> the SB_I_NODEV flag in s_iflags, which disallows opening device nodes in
> that filesystem.

I see. Thanks for the explanation!

* Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
  2017-12-22 14:32 ` [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns Dongsu Park
  2017-12-23  3:46   ` Serge E. Hallyn
  2018-01-17 10:59   ` Alban Crequy
@ 2018-02-12 15:57   ` Miklos Szeredi
  2018-02-12 16:35     ` Eric W. Biederman
  2018-02-20  2:12   ` Eric W. Biederman
From: Miklos Szeredi @ 2018-02-12 15:57 UTC
  To: Dongsu Park
  Cc: lkml, containers, Alban Crequy, Eric W . Biederman, Seth Forshee,
	Sargun Dhillon, linux-fsdevel

On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
>
> In order to support mounts from namespaces other than
> init_user_ns, fuse must translate uids and gids to/from the
> userns of the process servicing requests on /dev/fuse. This
> patch does that, with a couple of restrictions on the namespace:
>
>  - The userns for the fuse connection is fixed to the namespace
>    from which /dev/fuse is opened.
>
>  - The namespace must be the same as s_user_ns.
>
> These restrictions simplify the implementation by avoiding the
> need to pass around userns references and by allowing fuse to
> rely on the checks in inode_change_ok for ownership changes.
> Either restriction could be relaxed in the future if needed.

Can we not introduce potential userspace interface regressions?

The issue with pid namespaces fixed in commit 5d6d3a301c4e ("fuse:
allow server to run in different pid_ns") will probably bite us here
as well.

We basically need two modes of operation:

a) old, backward compatible (not introducing any new failure modes),
created with a privileged mount
b) new, non-backward compatible, created with an unprivileged mount

Technically there would still be a risk of breaking userspace, since
we are using the same entry point for both, but let's hope that no
practical problems come from that.

> For cuse the namespace used for the connection is also simply
> current_user_ns() at the time /dev/cuse is opened.
>
> Patch v4 is available: https://patchwork.kernel.org/patch/8944661/
>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Miklos Szeredi <mszeredi@redhat.com>
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
> ---
>  fs/fuse/cuse.c   |  3 ++-
>  fs/fuse/dev.c    | 11 ++++++++---
>  fs/fuse/dir.c    | 14 +++++++-------
>  fs/fuse/fuse_i.h |  6 +++++-
>  fs/fuse/inode.c  | 31 +++++++++++++++++++------------
>  5 files changed, 41 insertions(+), 24 deletions(-)
>
> diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
> index e9e97803..b1b83259 100644
> --- a/fs/fuse/cuse.c
> +++ b/fs/fuse/cuse.c
> @@ -48,6 +48,7 @@
>  #include <linux/stat.h>
>  #include <linux/module.h>
>  #include <linux/uio.h>
> +#include <linux/user_namespace.h>
>
>  #include "fuse_i.h"
>
> @@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
>         if (!cc)
>                 return -ENOMEM;
>
> -       fuse_conn_init(&cc->fc);
> +       fuse_conn_init(&cc->fc, current_user_ns());
>
>         fud = fuse_dev_alloc(&cc->fc);
>         if (!fud) {
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 17f0d05b..0f780e16 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
>
>  static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
>  {
> -       req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
> -       req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
> +       req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
> +       req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
>         req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
>  }
>
> @@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
>         __set_bit(FR_WAITING, &req->flags);
>         if (for_background)
>                 __set_bit(FR_BACKGROUND, &req->flags);
> +       if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) {
> +               fuse_put_request(fc, req);
> +               return ERR_PTR(-EOVERFLOW);
> +       }
>
>         return req;
>
> @@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
>         in = &req->in;
>         reqsize = in->h.len;
>
> -       if (task_active_pid_ns(current) != fc->pid_ns) {
> +       if (task_active_pid_ns(current) != fc->pid_ns ||
> +           current_user_ns() != fc->user_ns) {

I don't get it.  Why recalculate the pid if the user_ns does not match?

>                 rcu_read_lock();
>                 in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
>                 rcu_read_unlock();
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 24967382..ad1cfac1 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
>         stat->ino = attr->ino;
>         stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
>         stat->nlink = attr->nlink;
> -       stat->uid = make_kuid(&init_user_ns, attr->uid);
> -       stat->gid = make_kgid(&init_user_ns, attr->gid);
> +       stat->uid = make_kuid(fc->user_ns, attr->uid);
> +       stat->gid = make_kgid(fc->user_ns, attr->gid);
>         stat->rdev = inode->i_rdev;
>         stat->atime.tv_sec = attr->atime;
>         stat->atime.tv_nsec = attr->atimensec;
> @@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
>         return true;
>  }
>
> -static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
> -                          bool trust_local_cmtime)
> +static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
> +                          struct fuse_setattr_in *arg, bool trust_local_cmtime)
>  {
>         unsigned ivalid = iattr->ia_valid;
>
>         if (ivalid & ATTR_MODE)
>                 arg->valid |= FATTR_MODE,   arg->mode = iattr->ia_mode;
>         if (ivalid & ATTR_UID)
> -               arg->valid |= FATTR_UID,    arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
> +               arg->valid |= FATTR_UID,    arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
>         if (ivalid & ATTR_GID)
> -               arg->valid |= FATTR_GID,    arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
> +               arg->valid |= FATTR_GID,    arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
>         if (ivalid & ATTR_SIZE)
>                 arg->valid |= FATTR_SIZE,   arg->size = iattr->ia_size;
>         if (ivalid & ATTR_ATIME) {
> @@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
>
>         memset(&inarg, 0, sizeof(inarg));
>         memset(&outarg, 0, sizeof(outarg));
> -       iattr_to_fattr(attr, &inarg, trust_local_cmtime);
> +       iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
>         if (file) {
>                 struct fuse_file *ff = file->private_data;
>                 inarg.valid |= FATTR_FH;
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index d5773ca6..364e65c8 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -26,6 +26,7 @@
>  #include <linux/xattr.h>
>  #include <linux/pid_namespace.h>
>  #include <linux/refcount.h>
> +#include <linux/user_namespace.h>
>
>  /** Max number of pages that can be used in a single read request */
>  #define FUSE_MAX_PAGES_PER_REQ 32
> @@ -466,6 +467,9 @@ struct fuse_conn {
>         /** The pid namespace for this mount */
>         struct pid_namespace *pid_ns;
>
> +       /** The user namespace for this mount */
> +       struct user_namespace *user_ns;
> +
>         /** Maximum read size */
>         unsigned max_read;
>
> @@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
>  /**
>   * Initialize fuse_conn
>   */
> -void fuse_conn_init(struct fuse_conn *fc);
> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
>
>  /**
>   * Release reference to fuse_conn
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 2f504d61..7f6b2e55 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
>         inode->i_ino     = fuse_squash_ino(attr->ino);
>         inode->i_mode    = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
>         set_nlink(inode, attr->nlink);
> -       inode->i_uid     = make_kuid(&init_user_ns, attr->uid);
> -       inode->i_gid     = make_kgid(&init_user_ns, attr->gid);
> +       inode->i_uid     = make_kuid(fc->user_ns, attr->uid);
> +       inode->i_gid     = make_kgid(fc->user_ns, attr->gid);
>         inode->i_blocks  = attr->blocks;
>         inode->i_atime.tv_sec   = attr->atime;
>         inode->i_atime.tv_nsec  = attr->atimensec;
> @@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
>         return err;
>  }
>
> -static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
> +static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
> +                         struct user_namespace *user_ns)
>  {
>         char *p;
>         memset(d, 0, sizeof(struct fuse_mount_data));
> @@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
>                 case OPT_USER_ID:
>                         if (fuse_match_uint(&args[0], &uv))
>                                 return 0;
> -                       d->user_id = make_kuid(current_user_ns(), uv);
> +                       d->user_id = make_kuid(user_ns, uv);
>                         if (!uid_valid(d->user_id))
>                                 return 0;
>                         d->user_id_present = 1;
> @@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
>                 case OPT_GROUP_ID:
>                         if (fuse_match_uint(&args[0], &uv))
>                                 return 0;
> -                       d->group_id = make_kgid(current_user_ns(), uv);
> +                       d->group_id = make_kgid(user_ns, uv);
>                         if (!gid_valid(d->group_id))
>                                 return 0;
>                         d->group_id_present = 1;
> @@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
>         struct super_block *sb = root->d_sb;
>         struct fuse_conn *fc = get_fuse_conn_super(sb);
>
> -       seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
> -       seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
> +       seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
> +       seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
>         if (fc->default_permissions)
>                 seq_puts(m, ",default_permissions");
>         if (fc->allow_other)
> @@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
>         fpq->connected = 1;
>  }
>
> -void fuse_conn_init(struct fuse_conn *fc)
> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
>  {
>         memset(fc, 0, sizeof(*fc));
>         spin_lock_init(&fc->lock);
> @@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc)
>         fc->attr_version = 1;
>         get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
>         fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
> +       fc->user_ns = get_user_ns(user_ns);
>  }
>  EXPORT_SYMBOL_GPL(fuse_conn_init);
>
> @@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc)
>                 if (fc->destroy_req)
>                         fuse_request_free(fc->destroy_req);
>                 put_pid_ns(fc->pid_ns);
> +               put_user_ns(fc->user_ns);
>                 fc->release(fc);
>         }
>  }
> @@ -1061,7 +1064,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>
>         sb->s_flags &= ~(MS_NOSEC | SB_I_VERSION);
>
> -       if (!parse_fuse_opt(data, &d, is_bdev))
> +       if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
>                 goto err;
>
>         if (is_bdev) {
> @@ -1086,8 +1089,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>         if (!file)
>                 goto err;
>
> -       if ((file->f_op != &fuse_dev_operations) ||
> -           (file->f_cred->user_ns != &init_user_ns))
> +       /*
> +        * Require mount to happen from the same user namespace which
> +        * opened /dev/fuse to prevent potential attacks.
> +        */
> +       if (file->f_op != &fuse_dev_operations ||
> +           file->f_cred->user_ns != sb->s_user_ns)
>                 goto err_fput;
>
>         fc = kmalloc(sizeof(*fc), GFP_KERNEL);
> @@ -1095,7 +1102,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>         if (!fc)
>                 goto err_fput;
>
> -       fuse_conn_init(fc);
> +       fuse_conn_init(fc, sb->s_user_ns);
>         fc->release = fuse_free_conn;
>
>         fud = fuse_dev_alloc(fc);
> --
> 2.13.6
>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
  2018-02-12 15:57   ` Miklos Szeredi
@ 2018-02-12 16:35     ` Eric W. Biederman
  2018-02-13 10:20       ` Miklos Szeredi
  0 siblings, 1 reply; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-12 16:35 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Dongsu Park, lkml, containers, Alban Crequy, Seth Forshee,
	Sargun Dhillon, linux-fsdevel

Miklos Szeredi <mszeredi@redhat.com> writes:

> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
>> From: Seth Forshee <seth.forshee@canonical.com>
>>
>> In order to support mounts from namespaces other than
>> init_user_ns, fuse must translate uids and gids to/from the
>> userns of the process servicing requests on /dev/fuse. This
>> patch does that, with a couple of restrictions on the namespace:
>>
>>  - The userns for the fuse connection is fixed to the namespace
>>    from which /dev/fuse is opened.
>>
>>  - The namespace must be the same as s_user_ns.
>>
>> These restrictions simplify the implementation by avoiding the
>> need to pass around userns references and by allowing fuse to
>> rely on the checks in inode_change_ok for ownership changes.
>> Either restriction could be relaxed in the future if needed.
>
> Can we not introduce potential userspace interface regressions?
>
> The issue with pid namespaces fixed in commit 5d6d3a301c4e ("fuse:
> allow server to run in different pid_ns") will probably bite us here
> as well.

Maybe, but unlike the pid namespace no one has been able to mount
fuse outside of init_user_ns so we are much less exposed.  I agree we
should be careful.

> We basically need two modes of operation:
>
> a) old, backward compatible (not introducing any new failure modes),
> created with privileged mount
> b) new, non-backward compatible, created with unprivileged mount
>
> Technically there would still be a risk from breaking userspace, since
> we are using the same entry point for both, but let's hope that no
> practical problems come from that.

Answering from a 10,000 foot perspective:

There are two cases.  Requests to read/write the filesystem from outside
of s_user_ns.  These run no risk of breaking userspace as this mode has
not been implemented before.

Restrictions at mount time to ensure we are not dealing with a crazy mix
of namespaces.  This has a small chance of breaking someone's crazy
setup.


Dropping requests to read/write the filesystem when the requester does
not map into s_user_ns should not be a problem to enable universally.  If
s_user_ns is init_user_ns everything maps so there is no restriction.
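Eric's point, that requests can simply be refused when the requester does not map into s_user_ns, can be illustrated outside the kernel. The sketch below is a minimal model, not kernel code. `struct ns_map`, `map_from_kernel()` and `gate_request()` are hypothetical names. The id map is reduced to a single extent, and the same map is reused for uids and gids, whereas real namespaces have separate uid and gid maps that may have several extents. The -EOVERFLOW return mirrors the check patch 08/11 adds to __fuse_get_req().

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical single-extent model of a user namespace id map:
 * ids [first, first + count) inside the namespace correspond to
 * kernel ids [lower_first, lower_first + count). */
struct ns_map {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

#define INVALID_ID ((uint32_t)-1)

/* Rough analogue of from_kuid(): kernel id -> id as seen inside the
 * namespace, or INVALID_ID when the kernel id has no mapping there. */
uint32_t map_from_kernel(const struct ns_map *ns, uint32_t kid)
{
	if (kid >= ns->lower_first && kid - ns->lower_first < ns->count)
		return ns->first + (kid - ns->lower_first);
	return INVALID_ID;
}

/* Rough analogue of the check added to __fuse_get_req(): translate the
 * caller's fsuid/fsgid into the connection's namespace and fail the
 * request with -EOVERFLOW if either one does not map. */
int gate_request(const struct ns_map *conn_ns, uint32_t fsuid,
		 uint32_t fsgid, uint32_t *uid_out, uint32_t *gid_out)
{
	*uid_out = map_from_kernel(conn_ns, fsuid);
	*gid_out = map_from_kernel(conn_ns, fsgid);
	if (*uid_out == INVALID_ID || *gid_out == INVALID_ID)
		return -EOVERFLOW;
	return 0;
}
```

With an identity map of `{ 0, 0, 4294967295U }`, which is how init_user_ns is set up, every valid id maps and nothing is refused. With a container map of 65536 ids starting at host id 100000, a request from host uid 1000 fails with -EOVERFLOW, while host uid 100000 is seen as uid 0.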



What we can do, if we want to ensure maximum backwards compatibility,
is this: if the fuse filesystem is mounted in init_user_ns but the device
for the communication channel is opened in some other user namespace, we
can just force the communication channel to operate in init_user_ns.

That will be 100% backwards compatible in all cases and as far as I can
see remove the need for having different ``modes'' of operation.
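The backwards-compatible rule proposed above can be expressed as a small decision function. This is a sketch under stated assumptions. `struct userns` and `pick_conn_ns()` are hypothetical, only namespace identity is modelled, and returning -EINVAL for the rejected case is assumed to match the error fuse_fill_super() reports when its /dev/fuse check fails.

```c
#include <errno.h>
#include <stddef.h>

/* Opaque stand-in for struct user_namespace; only identity matters. */
struct userns {
	const char *name;
};

/* Hypothetical helper: pick the namespace a fuse connection translates
 * ids against.
 *  - Mounted in init_user_ns: always use init_user_ns, wherever
 *    /dev/fuse was opened.
 *  - Otherwise: require /dev/fuse to have been opened in s_user_ns,
 *    as patch 08/11 does. */
const struct userns *pick_conn_ns(const struct userns *init_ns,
				  const struct userns *sb_ns,
				  const struct userns *opener_ns, int *err)
{
	*err = 0;
	if (sb_ns == init_ns)
		return init_ns;
	if (opener_ns != sb_ns) {
		*err = -EINVAL;
		return NULL;
	}
	return sb_ns;
}
```

For every mount the pre-patch code accepts (superblock and opener both in init_user_ns), the result is init_user_ns, as before. The only newly accepted init_user_ns combination is one where /dev/fuse was opened elsewhere, and that one is pinned to init_user_ns as well.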



This does look like the time to give all of this a hard look and see if
we can get these patches in shape to be merged.

Eric



>> For cuse the namespace used for the connection is also simply
>> current_user_ns() at the time /dev/cuse is opened.
>>
>> Patch v4 is available: https://patchwork.kernel.org/patch/8944661/
>>
>> Cc: linux-fsdevel@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Cc: Miklos Szeredi <mszeredi@redhat.com>
>> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
>> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
>> ---
>>  fs/fuse/cuse.c   |  3 ++-
>>  fs/fuse/dev.c    | 11 ++++++++---
>>  fs/fuse/dir.c    | 14 +++++++-------
>>  fs/fuse/fuse_i.h |  6 +++++-
>>  fs/fuse/inode.c  | 31 +++++++++++++++++++------------
>>  5 files changed, 41 insertions(+), 24 deletions(-)
>>
>> diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
>> index e9e97803..b1b83259 100644
>> --- a/fs/fuse/cuse.c
>> +++ b/fs/fuse/cuse.c
>> @@ -48,6 +48,7 @@
>>  #include <linux/stat.h>
>>  #include <linux/module.h>
>>  #include <linux/uio.h>
>> +#include <linux/user_namespace.h>
>>
>>  #include "fuse_i.h"
>>
>> @@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
>>         if (!cc)
>>                 return -ENOMEM;
>>
>> -       fuse_conn_init(&cc->fc);
>> +       fuse_conn_init(&cc->fc, current_user_ns());
>>
>>         fud = fuse_dev_alloc(&cc->fc);
>>         if (!fud) {
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index 17f0d05b..0f780e16 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
>>
>>  static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
>>  {
>> -       req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
>> -       req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
>> +       req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
>> +       req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
>>         req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
>>  }
>>
>> @@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
>>         __set_bit(FR_WAITING, &req->flags);
>>         if (for_background)
>>                 __set_bit(FR_BACKGROUND, &req->flags);
>> +       if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) {
>> +               fuse_put_request(fc, req);
>> +               return ERR_PTR(-EOVERFLOW);
>> +       }
>>
>>         return req;
>>
>> @@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
>>         in = &req->in;
>>         reqsize = in->h.len;
>>
>> -       if (task_active_pid_ns(current) != fc->pid_ns) {
>> +       if (task_active_pid_ns(current) != fc->pid_ns ||
>> +           current_user_ns() != fc->user_ns) {
>
> I don't get it.  Why recalculate the pid if the user_ns does not match?
>
>>                 rcu_read_lock();
>>                 in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
>>                 rcu_read_unlock();
>> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
>> index 24967382..ad1cfac1 100644
>> --- a/fs/fuse/dir.c
>> +++ b/fs/fuse/dir.c
>> @@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
>>         stat->ino = attr->ino;
>>         stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
>>         stat->nlink = attr->nlink;
>> -       stat->uid = make_kuid(&init_user_ns, attr->uid);
>> -       stat->gid = make_kgid(&init_user_ns, attr->gid);
>> +       stat->uid = make_kuid(fc->user_ns, attr->uid);
>> +       stat->gid = make_kgid(fc->user_ns, attr->gid);
>>         stat->rdev = inode->i_rdev;
>>         stat->atime.tv_sec = attr->atime;
>>         stat->atime.tv_nsec = attr->atimensec;
>> @@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
>>         return true;
>>  }
>>
>> -static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
>> -                          bool trust_local_cmtime)
>> +static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
>> +                          struct fuse_setattr_in *arg, bool trust_local_cmtime)
>>  {
>>         unsigned ivalid = iattr->ia_valid;
>>
>>         if (ivalid & ATTR_MODE)
>>                 arg->valid |= FATTR_MODE,   arg->mode = iattr->ia_mode;
>>         if (ivalid & ATTR_UID)
>> -               arg->valid |= FATTR_UID,    arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
>> +               arg->valid |= FATTR_UID,    arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
>>         if (ivalid & ATTR_GID)
>> -               arg->valid |= FATTR_GID,    arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
>> +               arg->valid |= FATTR_GID,    arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
>>         if (ivalid & ATTR_SIZE)
>>                 arg->valid |= FATTR_SIZE,   arg->size = iattr->ia_size;
>>         if (ivalid & ATTR_ATIME) {
>> @@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
>>
>>         memset(&inarg, 0, sizeof(inarg));
>>         memset(&outarg, 0, sizeof(outarg));
>> -       iattr_to_fattr(attr, &inarg, trust_local_cmtime);
>> +       iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
>>         if (file) {
>>                 struct fuse_file *ff = file->private_data;
>>                 inarg.valid |= FATTR_FH;
>> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
>> index d5773ca6..364e65c8 100644
>> --- a/fs/fuse/fuse_i.h
>> +++ b/fs/fuse/fuse_i.h
>> @@ -26,6 +26,7 @@
>>  #include <linux/xattr.h>
>>  #include <linux/pid_namespace.h>
>>  #include <linux/refcount.h>
>> +#include <linux/user_namespace.h>
>>
>>  /** Max number of pages that can be used in a single read request */
>>  #define FUSE_MAX_PAGES_PER_REQ 32
>> @@ -466,6 +467,9 @@ struct fuse_conn {
>>         /** The pid namespace for this mount */
>>         struct pid_namespace *pid_ns;
>>
>> +       /** The user namespace for this mount */
>> +       struct user_namespace *user_ns;
>> +
>>         /** Maximum read size */
>>         unsigned max_read;
>>
>> @@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
>>  /**
>>   * Initialize fuse_conn
>>   */
>> -void fuse_conn_init(struct fuse_conn *fc);
>> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
>>
>>  /**
>>   * Release reference to fuse_conn
>> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
>> index 2f504d61..7f6b2e55 100644
>> --- a/fs/fuse/inode.c
>> +++ b/fs/fuse/inode.c
>> @@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
>>         inode->i_ino     = fuse_squash_ino(attr->ino);
>>         inode->i_mode    = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
>>         set_nlink(inode, attr->nlink);
>> -       inode->i_uid     = make_kuid(&init_user_ns, attr->uid);
>> -       inode->i_gid     = make_kgid(&init_user_ns, attr->gid);
>> +       inode->i_uid     = make_kuid(fc->user_ns, attr->uid);
>> +       inode->i_gid     = make_kgid(fc->user_ns, attr->gid);
>>         inode->i_blocks  = attr->blocks;
>>         inode->i_atime.tv_sec   = attr->atime;
>>         inode->i_atime.tv_nsec  = attr->atimensec;
>> @@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
>>         return err;
>>  }
>>
>> -static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
>> +static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
>> +                         struct user_namespace *user_ns)
>>  {
>>         char *p;
>>         memset(d, 0, sizeof(struct fuse_mount_data));
>> @@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
>>                 case OPT_USER_ID:
>>                         if (fuse_match_uint(&args[0], &uv))
>>                                 return 0;
>> -                       d->user_id = make_kuid(current_user_ns(), uv);
>> +                       d->user_id = make_kuid(user_ns, uv);
>>                         if (!uid_valid(d->user_id))
>>                                 return 0;
>>                         d->user_id_present = 1;
>> @@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
>>                 case OPT_GROUP_ID:
>>                         if (fuse_match_uint(&args[0], &uv))
>>                                 return 0;
>> -                       d->group_id = make_kgid(current_user_ns(), uv);
>> +                       d->group_id = make_kgid(user_ns, uv);
>>                         if (!gid_valid(d->group_id))
>>                                 return 0;
>>                         d->group_id_present = 1;
>> @@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
>>         struct super_block *sb = root->d_sb;
>>         struct fuse_conn *fc = get_fuse_conn_super(sb);
>>
>> -       seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
>> -       seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
>> +       seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
>> +       seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
>>         if (fc->default_permissions)
>>                 seq_puts(m, ",default_permissions");
>>         if (fc->allow_other)
>> @@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
>>         fpq->connected = 1;
>>  }
>>
>> -void fuse_conn_init(struct fuse_conn *fc)
>> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
>>  {
>>         memset(fc, 0, sizeof(*fc));
>>         spin_lock_init(&fc->lock);
>> @@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc)
>>         fc->attr_version = 1;
>>         get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
>>         fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
>> +       fc->user_ns = get_user_ns(user_ns);
>>  }
>>  EXPORT_SYMBOL_GPL(fuse_conn_init);
>>
>> @@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc)
>>                 if (fc->destroy_req)
>>                         fuse_request_free(fc->destroy_req);
>>                 put_pid_ns(fc->pid_ns);
>> +               put_user_ns(fc->user_ns);
>>                 fc->release(fc);
>>         }
>>  }
>> @@ -1061,7 +1064,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>>
>>         sb->s_flags &= ~(MS_NOSEC | SB_I_VERSION);
>>
>> -       if (!parse_fuse_opt(data, &d, is_bdev))
>> +       if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
>>                 goto err;
>>
>>         if (is_bdev) {
>> @@ -1086,8 +1089,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>>         if (!file)
>>                 goto err;
>>
>> -       if ((file->f_op != &fuse_dev_operations) ||
>> -           (file->f_cred->user_ns != &init_user_ns))
>> +       /*
>> +        * Require mount to happen from the same user namespace which
>> +        * opened /dev/fuse to prevent potential attacks.
>> +        */
>> +       if (file->f_op != &fuse_dev_operations ||
>> +           file->f_cred->user_ns != sb->s_user_ns)
>>                 goto err_fput;
>>
>>         fc = kmalloc(sizeof(*fc), GFP_KERNEL);
>> @@ -1095,7 +1102,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>>         if (!fc)
>>                 goto err_fput;
>>
>> -       fuse_conn_init(fc);
>> +       fuse_conn_init(fc, sb->s_user_ns);
>>         fc->release = fuse_free_conn;
>>
>>         fud = fuse_dev_alloc(fc);
>> --
>> 2.13.6
>>


* Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
  2018-02-12 16:35     ` Eric W. Biederman
@ 2018-02-13 10:20       ` Miklos Szeredi
  2018-02-16 21:52         ` Eric W. Biederman
  0 siblings, 1 reply; 89+ messages in thread
From: Miklos Szeredi @ 2018-02-13 10:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Dongsu Park, lkml, containers, Alban Crequy, Seth Forshee,
	Sargun Dhillon, linux-fsdevel

On Mon, Feb 12, 2018 at 5:35 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Miklos Szeredi <mszeredi@redhat.com> writes:
>
>> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
>>> From: Seth Forshee <seth.forshee@canonical.com>
>>>
>>> In order to support mounts from namespaces other than
>>> init_user_ns, fuse must translate uids and gids to/from the
>>> userns of the process servicing requests on /dev/fuse. This
>>> patch does that, with a couple of restrictions on the namespace:
>>>
>>>  - The userns for the fuse connection is fixed to the namespace
>>>    from which /dev/fuse is opened.
>>>
>>>  - The namespace must be the same as s_user_ns.
>>>
>>> These restrictions simplify the implementation by avoiding the
>>> need to pass around userns references and by allowing fuse to
>>> rely on the checks in inode_change_ok for ownership changes.
>>> Either restriction could be relaxed in the future if needed.
>>
>> Can we not introduce potential userspace interface regressions?
>>
>> The issue with pid namespaces fixed in commit 5d6d3a301c4e ("fuse:
>> allow server to run in different pid_ns") will probably bite us here
>> as well.
>
> Maybe, but unlike the pid namespace no one has been able to mount
> fuse outside of init_user_ns so we are much less exposed.  I agree we
> should be careful.

Have to wrap my head around all the rules here.

There's the may_mount() one:

    ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN)

Um, first of all, why isn't it checking current->cred->user_ns?

Ah, there it is in sget():

    ns_capable(user_ns, CAP_SYS_ADMIN)

I get the plain capable(CAP_SYS_ADMIN) check in sget_userns() if fs
doesn't have FS_USERNS_MOUNT.  This is the one that prevents fuse
mounts from being created when (current->cred->user_ns !=
&init_user_ns).

Maybe there's a logic to this web of namespaces, but I don't yet see
it.  Is it documented somewhere?
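The three checks quoted above can be laid side by side in a toy model. Everything here is hypothetical and simplified. `struct uns`, `struct task`, `ns_capable_sim()` and `mount_allowed()` are illustrative names. The rule that the owner of a child namespace is also capable in it is left out, and the SB_KERNMOUNT/SB_SUBMOUNT exemptions are ignored.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified user namespace: just a parent link. */
struct uns {
	const struct uns *parent; /* NULL for the initial namespace */
};

/* Hypothetical credentials: the task's namespace and whether it holds
 * CAP_SYS_ADMIN there. */
struct task {
	const struct uns *cred_ns;
	bool has_sys_admin;
};

/* Simplified ns_capable(): a capability held in a namespace also counts
 * in every descendant namespace (owner rule omitted). */
static bool ns_capable_sim(const struct task *t, const struct uns *target)
{
	if (!t->has_sys_admin)
		return false;
	for (const struct uns *ns = target; ns; ns = ns->parent)
		if (ns == t->cred_ns)
			return true;
	return false;
}

/* The three checks, in the order a mount reaches them. */
bool mount_allowed(const struct task *t, const struct uns *init_ns,
		   const struct uns *mnt_ns_owner, bool fs_userns_mount)
{
	/* may_mount(): CAP_SYS_ADMIN over the mount namespace's owner */
	if (!ns_capable_sim(t, mnt_ns_owner))
		return false;
	/* sget(): CAP_SYS_ADMIN in current_user_ns() */
	if (!ns_capable_sim(t, t->cred_ns))
		return false;
	/* sget_userns(): plain capable() unless FS_USERNS_MOUNT is set */
	if (!fs_userns_mount && !ns_capable_sim(t, init_ns))
		return false;
	return true;
}
```

In this model, root of a child namespace passes may_mount() and the sget() check but is stopped by the plain capable() in sget_userns() unless the filesystem sets FS_USERNS_MOUNT.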

>> We basically need two modes of operation:
>>
>> a) old, backward compatible (not introducing any new failure modes),
>> created with privileged mount
>> b) new, non-backward compatible, created with unprivileged mount
>>
>> Technically there would still be a risk from breaking userspace, since
>> we are using the same entry point for both, but let's hope that no
>> practical problems come from that.
>
> Answering from a 10,000 foot perspective:
>
> There are two cases.  Requests to read/write the filesystem from outside
> of s_user_ns.  These run no risk of breaking userspace as this mode has
> not been implemented before.

This comes from the fact that (s_user_ns == &init_user_ns) and all
user namespaces are "inside" init_user_ns, right?

One question: why does current code use the from_[ug]id_munged()
variant, when the conversion can never fail?  Or can it?
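The difference between the two conversion helpers can be modelled in a few lines. The names below are hypothetical stand-ins. `from_kid()` plays the role of from_kuid(), which returns (uid_t)-1 for an unmapped id, and `from_kid_munged()` plays from_kuid_munged(), which substitutes overflowuid (65534 by default, tunable via the kernel.overflowuid sysctl). The one-extent map is a simplification.

```c
#include <stdint.h>

/* Hypothetical one-extent id map, as in a single uid_map line. */
struct idmap {
	uint32_t first, lower_first, count;
};

#define INVALID_ID ((uint32_t)-1)
/* Default value of the kernel's overflowuid. */
#define OVERFLOW_ID 65534u

/* Like from_kuid(): returns INVALID_ID for an unmapped kernel id. */
uint32_t from_kid(const struct idmap *m, uint32_t kid)
{
	if (kid >= m->lower_first && kid - m->lower_first < m->count)
		return m->first + (kid - m->lower_first);
	return INVALID_ID;
}

/* Like from_kuid_munged(): never fails, substitutes the overflow id. */
uint32_t from_kid_munged(const struct idmap *m, uint32_t kid)
{
	uint32_t id = from_kid(m, kid);

	return id == INVALID_ID ? OVERFLOW_ID : id;
}
```

Against an identity map covering every id, which is how init_user_ns is set up, the plain conversion never fails for a valid kuid, so the munged fallback is never taken. It only makes a difference for a namespace whose map does not cover the id.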

> Restrictions at mount time to ensure we are not dealing with a crazy mix
> of namespaces.  This has a small chance of breaking someone's crazy
> setup.
>
>
> Dropping requests to read/write the filesystem when the requester does
> not map into s_user_ns should not be a problem to enable universally.  If
> s_user_ns is init_user_ns everything maps so there is no restriction.
>
>
>
> What we can do, if we want to ensure maximum backwards compatibility,
> is this: if the fuse filesystem is mounted in init_user_ns but the device
> for the communication channel is opened in some other user namespace, we
> can just force the communication channel to operate in init_user_ns.
>
> That will be 100% backwards compatible in all cases and as far as I can
> see remove the need for having different ``modes'' of operation.

Okay.

Thanks,
Miklos


* Re: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes
  2017-12-22 14:32 ` [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes Dongsu Park
  2017-12-23  3:17   ` Serge E. Hallyn
  2018-01-05 19:24   ` Luis R. Rodriguez
@ 2018-02-13 13:18   ` Miklos Szeredi
  2018-02-16 22:00     ` Eric W. Biederman
  2 siblings, 1 reply; 89+ messages in thread
From: Miklos Szeredi @ 2018-02-13 13:18 UTC (permalink / raw)
  To: Dongsu Park
  Cc: lkml, containers, Alban Crequy, Eric W . Biederman, Seth Forshee,
	Sargun Dhillon, linux-fsdevel, Alexander Viro, Luis R. Rodriguez,
	Kees Cook

On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
> From: Eric W. Biederman <ebiederm@xmission.com>
>
> Allow users with CAP_CHOWN over the superblock of a filesystem to
> chown files.  Ordinarily the capable_wrt_inode_uidgid check is
> sufficient to allow access to files but when the underlying filesystem
> has uids or gids that don't map to the current user namespace it is
> not enough, so the chown permission checks need to be extended to
> allow this case.
>
> Calling chown on filesystem nodes whose uid or gid don't map is
> necessary if those nodes are going to be modified as writing back
> inodes which contain uids or gids that don't map is likely to cause
> filesystem corruption of the uid or gid fields.

How can the filesystem be corrupted if chown is denied?

It is not clear to me what the purpose of this patch is or what exact
use case it is fixing.
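The permission logic the quoted commit message describes can be sketched with its inputs reduced to booleans. `struct chown_ctx`, `capable_wrt_inode()` and `chown_ok_sketch()` are hypothetical. The first two clauses follow the shape of the existing chown_ok() check in fs/attr.c as understood here, and the third is the extra CAP_CHOWN-in-s_user_ns clause the patch describes.

```c
#include <stdbool.h>

/* Hypothetical inputs a chown permission check would look at. */
struct chown_ctx {
	bool fsuid_is_owner;        /* current_fsuid() == inode->i_uid */
	bool new_uid_is_owner;      /* requested uid == inode->i_uid */
	bool cap_chown_current_ns;  /* CAP_CHOWN in current_user_ns() */
	bool inode_ids_map_current; /* i_uid/i_gid map into current_user_ns() */
	bool cap_chown_sb_ns;       /* CAP_CHOWN in inode->i_sb->s_user_ns */
};

/* capable_wrt_inode_uidgid()-style check: the capability only counts if
 * the inode's ids are representable in the caller's namespace. */
static bool capable_wrt_inode(const struct chown_ctx *c)
{
	return c->cap_chown_current_ns && c->inode_ids_map_current;
}

/* chown_ok()-style check, optionally with the clause from patch 03/11. */
bool chown_ok_sketch(const struct chown_ctx *c, bool with_patch)
{
	if (c->fsuid_is_owner && c->new_uid_is_owner)
		return true;
	if (capable_wrt_inode(c))
		return true;
	if (with_patch && c->cap_chown_sb_ns)
		return true;
	return false;
}
```

The case the commit message targets is a caller who holds CAP_CHOWN in s_user_ns but whose namespace cannot represent the inode's current ids: the existing clauses deny the chown, and only the added clause allows it.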

Thanks,
Miklos


* Re: [PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root
  2017-12-23 12:38     ` Dongsu Park
@ 2018-02-13 13:37       ` Miklos Szeredi
  0 siblings, 0 replies; 89+ messages in thread
From: Miklos Szeredi @ 2018-02-13 13:37 UTC (permalink / raw)
  To: Dongsu Park
  Cc: Serge E. Hallyn, LKML, Linux Containers, Alban Crequy,
	Eric W . Biederman, Seth Forshee, Sargun Dhillon, linux-fsdevel,
	Alexander Viro

On Sat, Dec 23, 2017 at 1:38 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
> Hi,
>
> On Sat, Dec 23, 2017 at 4:26 AM, Serge E. Hallyn <serge@hallyn.com> wrote:
>> On Fri, Dec 22, 2017 at 03:32:28PM +0100, Dongsu Park wrote:
>>> From: Seth Forshee <seth.forshee@canonical.com>
>>>
>>> Expand the check in should_remove_suid() to keep privileges for
>>
>> I realize this description came from Seth, but reading it now,
>> 'Expand' seems wrong.  Expanding a check brings to my mind making
>> it stricter, not looser.  How about 'Relax the check' ?
>
> Makes sense. Will do.
>
>>> CAP_FSETID in s_user_ns rather than init_user_ns.
>>>
>>> Patch v4 is available: https://patchwork.kernel.org/patch/8944621/
>>>
>>> --EWB Changed from ns_capable(sb->s_user_ns, ) to capable_wrt_inode_uidgid
>>
>> Why exactly?
>>
>> This is wrong, because capable_wrt_inode_uidgid() does a check
>> against current_user_ns, not the  inode->i_sb->s_user_ns

I'm thoroughly confused.   s_user_ns is supposed to be about the
user namespace the filesystem perceives itself to be in, right?  How does that
come into play when checking permissions to do something?
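Serge's objection turns on which namespace each check consults. The toy model below uses hypothetical names and a simplified ns_capable() without the namespace-owner rule. It contrasts a capable_wrt_inode_uidgid()-style check, which looks at the caller's own namespace and requires the inode's ids to map into it, with an ns_capable(inode->i_sb->s_user_ns, ...) check, which looks at the namespace that owns the superblock.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user namespace: just a parent link. */
struct uns {
	const struct uns *parent; /* NULL for the initial namespace */
};

/* Simplified ns_capable(): a capability held in cred_ns counts in
 * cred_ns and all of its descendants (owner rule omitted). */
static bool ns_capable_sim(const struct uns *cred_ns, bool has_cap,
			   const struct uns *target)
{
	if (!has_cap)
		return false;
	for (const struct uns *ns = target; ns; ns = ns->parent)
		if (ns == cred_ns)
			return true;
	return false;
}

/* capable_wrt_inode_uidgid()-style: checked against the caller's own
 * namespace, and only if the inode's ids map into it. */
bool wrt_inode_check(const struct uns *cred_ns, bool has_cap,
		     bool inode_ids_map)
{
	return ns_capable_sim(cred_ns, has_cap, cred_ns) && inode_ids_map;
}

/* ns_capable(s_user_ns, ...)-style: checked against the namespace that
 * owns the superblock, regardless of id mapping. */
bool sb_ns_check(const struct uns *cred_ns, bool has_cap,
		 const struct uns *s_user_ns)
{
	return ns_capable_sim(cred_ns, has_cap, s_user_ns);
}
```

The two disagree, for example, for root of a namespace nested below s_user_ns: it passes the first check whenever the inode's ids map into its namespace, yet holds no capability in s_user_ns itself.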

Thanks,
Miklos


* Re: [PATCH 07/11] fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
  2017-12-22 14:32 ` [PATCH 07/11] fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems Dongsu Park
  2017-12-23  3:39   ` Serge E. Hallyn
@ 2018-02-14 12:28   ` Miklos Szeredi
  2018-02-19 22:56     ` Eric W. Biederman
  1 sibling, 1 reply; 89+ messages in thread
From: Miklos Szeredi @ 2018-02-14 12:28 UTC (permalink / raw)
  To: Dongsu Park
  Cc: lkml, Linux Containers, Alban Crequy, Eric W . Biederman,
	Seth Forshee, Sargun Dhillon, linux-fsdevel, Alexander Viro

On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
>
> The user in control of a super block should be allowed to freeze
> and thaw it. Relax the restrictions on the FIFREEZE and FITHAW
> ioctls to require CAP_SYS_ADMIN in s_user_ns.

Why is this required for unprivileged fuse?

Fuse doesn't support freeze, so this seems to make no sense in the
context of this patchset.

Thanks,
Miklos


* Re: [PATCH 10/11] fuse: Allow user namespace mounts
  2017-12-22 14:32 ` [PATCH 10/11] fuse: Allow user namespace mounts Dongsu Park
  2017-12-23  3:51   ` Serge E. Hallyn
@ 2018-02-14 13:44   ` Miklos Szeredi
  2018-02-15  8:46     ` Miklos Szeredi
  1 sibling, 1 reply; 89+ messages in thread
From: Miklos Szeredi @ 2018-02-14 13:44 UTC (permalink / raw)
  To: Dongsu Park
  Cc: lkml, Linux Containers, Alban Crequy, Eric W . Biederman,
	Seth Forshee, Sargun Dhillon, linux-fsdevel

On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
>
> To be able to mount fuse from non-init user namespaces, it's necessary
> to set FS_USERNS_MOUNT flag to fs_flags.
>
> Patch v4 is available: https://patchwork.kernel.org/patch/8944681/
>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Miklos Szeredi <mszeredi@redhat.com>
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> [dongsu: add a simple commit message]
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
> ---
>  fs/fuse/inode.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 7f6b2e55..8c98edee 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1212,7 +1212,7 @@ static void fuse_kill_sb_anon(struct super_block *sb)
>  static struct file_system_type fuse_fs_type = {
>         .owner          = THIS_MODULE,
>         .name           = "fuse",
> -       .fs_flags       = FS_HAS_SUBTYPE,
> +       .fs_flags       = FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
>         .mount          = fuse_mount,
>         .kill_sb        = fuse_kill_sb_anon,
>  };

I think enabling FS_USERNS_MOUNT should be pretty safe.

I was thinking opting out should be as simple as "chmod o-rw
/dev/fuse".  But that breaks libfuse, even though fusermount opens
/dev/fuse in privileged mode, so it shouldn't.  That can be fixed in
libfuse, but it's an unfortunate bug and it also means /dev/fuse is
configured with "crw-rw-rw-" in most cases.  Which means it will be
opting out, not opting in, which is the less safe version.
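The opt-out Miklos describes comes down to the "other" read/write permission bits on the device node. A userspace check of those bits might look like the hypothetical helper below. It inspects only the mode bits and says nothing about ACLs or LSM policy.

```c
#define _XOPEN_SOURCE 700
#include <stdbool.h>
#include <sys/stat.h>

/* Hypothetical helper: can "other" users open this character device
 * read-write, as with the common crw-rw-rw- (0666) /dev/fuse? */
bool others_can_open_rw(mode_t mode)
{
	return S_ISCHR(mode) && (mode & S_IROTH) && (mode & S_IWOTH);
}
```

Fed the st_mode that stat(2) reports for /dev/fuse, it would return true for the common crw-rw-rw- setup and false after `chmod o-rw /dev/fuse`.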

> @@ -1244,7 +1244,7 @@ static struct file_system_type fuseblk_fs_type = {
>         .name           = "fuseblk",
>         .mount          = fuse_mount_blk,
>         .kill_sb        = fuse_kill_sb_blk,
> -       .fs_flags       = FS_REQUIRES_DEV | FS_HAS_SUBTYPE,
> +       .fs_flags       = FS_REQUIRES_DEV | FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
>  };
>  MODULE_ALIAS_FS("fuseblk");

As I said, this hunk should be dropped from the first version, because
it's possibly unsafe.

Thanks,
Miklos


* Re: [PATCH 10/11] fuse: Allow user namespace mounts
  2018-02-14 13:44   ` Miklos Szeredi
@ 2018-02-15  8:46     ` Miklos Szeredi
  0 siblings, 0 replies; 89+ messages in thread
From: Miklos Szeredi @ 2018-02-15  8:46 UTC (permalink / raw)
  To: Dongsu Park
  Cc: lkml, Linux Containers, Alban Crequy, Eric W . Biederman,
	Seth Forshee, Sargun Dhillon, linux-fsdevel

On Wed, Feb 14, 2018 at 2:44 PM, Miklos Szeredi <mszeredi@redhat.com> wrote:
> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
>> From: Seth Forshee <seth.forshee@canonical.com>
>>
>> To be able to mount fuse from non-init user namespaces, it's necessary
>> to set FS_USERNS_MOUNT flag to fs_flags.
>>
>> Patch v4 is available: https://patchwork.kernel.org/patch/8944681/
>>
>> Cc: linux-fsdevel@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Cc: Miklos Szeredi <mszeredi@redhat.com>
>> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
>> [dongsu: add a simple commit message]
>> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
>> ---
>>  fs/fuse/inode.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
>> index 7f6b2e55..8c98edee 100644
>> --- a/fs/fuse/inode.c
>> +++ b/fs/fuse/inode.c
>> @@ -1212,7 +1212,7 @@ static void fuse_kill_sb_anon(struct super_block *sb)
>>  static struct file_system_type fuse_fs_type = {
>>         .owner          = THIS_MODULE,
>>         .name           = "fuse",
>> -       .fs_flags       = FS_HAS_SUBTYPE,
>> +       .fs_flags       = FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
>>         .mount          = fuse_mount,
>>         .kill_sb        = fuse_kill_sb_anon,
>>  };
>
> I think enabling FS_USERNS_MOUNT should be pretty safe.
>
> I was thinking opting out should be as simple as "chmod o-rw
> /dev/fuse".  But that breaks libfuse, even though fusermount opens
> /dev/fuse in privileged mode, so it shouldn't.

I'm talking rubbish, /dev/fuse is opened without privs in fusermount as well.

So there's no way to differentiate user_ns unpriv mounts from suid
fusermount unpriv mounts.

Maybe that's just as well...

Thanks,
Miklos

* Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
  2018-02-13 10:20       ` Miklos Szeredi
@ 2018-02-16 21:52         ` Eric W. Biederman
  0 siblings, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-16 21:52 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Dongsu Park, lkml, containers, Alban Crequy, Seth Forshee,
	Sargun Dhillon, linux-fsdevel

Miklos Szeredi <mszeredi@redhat.com> writes:

> On Mon, Feb 12, 2018 at 5:35 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Miklos Szeredi <mszeredi@redhat.com> writes:
>>
>>> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
>>>> From: Seth Forshee <seth.forshee@canonical.com>
>>>>
>>>> In order to support mounts from namespaces other than
>>>> init_user_ns, fuse must translate uids and gids to/from the
>>>> userns of the process servicing requests on /dev/fuse. This
>>>> patch does that, with a couple of restrictions on the namespace:
>>>>
>>>>  - The userns for the fuse connection is fixed to the namespace
>>>>    from which /dev/fuse is opened.
>>>>
>>>>  - The namespace must be the same as s_user_ns.
>>>>
>>>> These restrictions simplify the implementation by avoiding the
>>>> need to pass around userns references and by allowing fuse to
>>>> rely on the checks in inode_change_ok for ownership changes.
>>>> Either restriction could be relaxed in the future if needed.
>>>
>>> Can we not introduce potential userspace interface regressions?
>>>
>>> The issue with pid namespaces fixed in commit 5d6d3a301c4e ("fuse:
>>> allow server to run in different pid_ns") will probably bite us here
>>> as well.
>>
>> Maybe, but unlike the pid namespace no one has been able to mount
>> fuse outside of init_user_ns so we are much less exposed.  I agree we
>> should be careful.
>
> Have to wrap my head around all the rules here.
>
> There's the may_mount() one:
>
>     ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN)
>
> Um, first of all, why isn't it checking current->cred->user_ns?
>
> Ah, there it is in sget():
>
>     ns_capable(user_ns, CAP_SYS_ADMIN)
>
> I get the plain capable(CAP_SYS_ADMIN) check in sget_userns() if fs
> doesn't have FS_USERNS_MOUNT.  This is the one that prevents fuse
> mounts from being created when (current->cred->user_ns !=
> &init_user_ns).
>
> Maybe there's a logic to this web of namespaces, but I don't yet see
> it.  Is it documented somewhere?

I think this is a bit simpler than the fiddly details in the
implementation might make it look.

The fundamental idea is that permission to have full control over
a mount namespace, is different than permission to have full control
over an instance of a filesystem.

Implementing that separation of permission checks gets a little bit
fiddly.  The first challenge is that there are several filesystems like
sysfs and proc whose internal mount is created outside of a process.
Then there are the file systems like nfs and afs that have ``referral
points'' that transition you to other instances of those filesystems
when you transition over them.  That is the reason why there are
exceptions for SB_KERNMOUNT and SB_SUBMOUNT.

may_mount is just the permission check for the mount namespace.  It
checks that the current process has CAP_SYS_ADMIN in the user namespace
that owns the current mount namespace.  AKA is the process allowed to
change the mount namespace.

sget is just the permission check for mounting a filesystem.  It checks
that the mounter has CAP_SYS_ADMIN over the user namespace that will own
the newly mounted filesystem.

By the time execution gets to sget_userns, in general all of the
permission checks have been made.  But if the filesystem is not one
that supports mounting within a user namespace, the code checks
capable(CAP_SYS_ADMIN).

That is more convoluted than I would like but the checks derive from the
definition of what we are doing.
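
The split between the two permission checks can be modelled in a few lines of userspace C. This is only a sketch: the structs and model_* functions are hypothetical stand-ins for ns_capable(), may_mount() and the sget()/sget_userns() checks, CAP_SYS_ADMIN is reduced to one boolean, and both the owner-of-a-child-namespace rule and the SB_KERNMOUNT/SB_SUBMOUNT exceptions are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a user namespace: parent link plus depth. */
struct model_user_ns {
	const struct model_user_ns *parent; /* NULL for the initial namespace */
	int level;                          /* 0 for the initial namespace */
};

/* Hypothetical model of a task's credentials. */
struct model_cred {
	const struct model_user_ns *user_ns; /* namespace the task lives in */
	bool has_sys_admin;                  /* CAP_SYS_ADMIN in that namespace */
};

/*
 * Simplified ns_capable(target, CAP_SYS_ADMIN): true if the task holds the
 * capability in its own namespace and target is that namespace or one of
 * its descendants.
 */
static bool model_ns_capable(const struct model_cred *c,
			     const struct model_user_ns *target)
{
	assert(c != NULL && target != NULL);
	while (target != NULL && target->level > c->user_ns->level)
		target = target->parent;
	return target == c->user_ns && c->has_sys_admin;
}

/* may_mount(): may the task change the mount namespace owned by this ns? */
static bool model_may_mount(const struct model_cred *c,
			    const struct model_user_ns *mnt_ns_owner)
{
	return model_ns_capable(c, mnt_ns_owner);
}

/* sget()/sget_userns(): may the task create a superblock owned by sb_ns? */
static bool model_may_create_sb(const struct model_cred *c,
				const struct model_user_ns *sb_ns,
				const struct model_user_ns *init_ns,
				bool fs_userns_mount)
{
	if (!model_ns_capable(c, sb_ns))
		return false;
	/* Without FS_USERNS_MOUNT, fall back to plain capable(). */
	if (!fs_userns_mount && !model_ns_capable(c, init_ns))
		return false;
	return true;
}
```

With this model, root inside a container's user namespace may change its own mount namespace and create superblocks for FS_USERNS_MOUNT filesystems, but not for other filesystem types.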

>
>>> We basically need two modes of operation:
>>>
>>> a) old, backward compatible (not introducing any new failure modes),
>>> created with privileged mount
>>> b) new, non-backward compatible, created with unprivileged mount
>>>
>>> Technically there would still be a risk from breaking userspace, since
>>> we are using the same entry point for both, but let's hope that no
>>> practical problems come from that.
>>
>> Answering from a 10,000 foot perspective:
>>
>> There are two cases.  Requests to read/write the filesystem from outside
>> of s_user_ns.  These run no risk of breaking userspace as this mode has
>> not been implemented before.
>
> This comes from the fact that (s_user_ns == &init_user_ns) and all
> user namespaces are "inside" init_user_ns, right?

Yes.

> One question: why does current code use the from_[ug]id_munged()
> variant, when the conversion can never fail.  Or can it?

There is always at least (uid_t)-1 that can fail if it shows up on a
filesystem.  As far as I can tell no one was using it as a uid; there
were already uses of (uid_t)-1 as a special case, so I just grabbed it
to become INVALID_UID.

In practice the mapping can't fail unless someone malicious starts using
that id.

I believe I picked the _munged variant so that, if that case is ever
hit, we are guaranteed to return the 16-bit nobody user.

Eric

* Re: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes
  2018-02-13 13:18   ` Miklos Szeredi
@ 2018-02-16 22:00     ` Eric W. Biederman
  0 siblings, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-16 22:00 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Dongsu Park, lkml, containers, Alban Crequy, Seth Forshee,
	Sargun Dhillon, linux-fsdevel, Alexander Viro, Luis R. Rodriguez,
	Kees Cook

Miklos Szeredi <mszeredi@redhat.com> writes:

> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
>> From: Eric W. Biederman <ebiederm@xmission.com>
>>
>> Allow users with CAP_SYS_CHOWN over the superblock of a filesystem to
>> chown files.  Ordinarily the capable_wrt_inode_uidgid check is
>> sufficient to allow access to files but when the underlying filesystem
>> has uids or gids that don't map to the current user namespace it is
>> not enough, so the chown permission checks need to be extended to
>> allow this case.
>>
>> Calling chown on filesystem nodes whose uid or gid don't map is
>> necessary if those nodes are going to be modified as writing back
>> inodes which contain uids or gids that don't map is likely to cause
>> filesystem corruption of the uid or gid fields.
>
> How can the filesystem be corrupted if chown is denied?
>
> It is not clear to me what the purpose of this patch is or what the
> exact usecase this is fixing.

It isn't a fix and we can delay this one and similar patches
that enable things until we are certain all of the necessary
restrictions are in place.  This is not essential for safely getting
fully unprivileged mounting of fuse to work.

The overall strategy has been to handle as many of the generic concerns
at the vfs level as possible to separate filesystem concerns and generic
concerns.

In this case the generic concern is what happens when the uid is read
from the filesystem and it gets mapped to INVALID_UID and then the inode
for that file is written back.

That is a trap for the unwary filesystem implementation and not a case
that I think anyone will actually care about.  It is just not useful
to mount a filesystem and not map some of its ids.  So the generic
vfs code simply denies writes to files like those with a uid of
INVALID_UID or a gid of INVALID_GID, just to ensure that problems
don't show up.

This patch gets through those defenses.

Eric

* Re: [PATCH 07/11] fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
  2018-02-14 12:28   ` Miklos Szeredi
@ 2018-02-19 22:56     ` Eric W. Biederman
  0 siblings, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-19 22:56 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Dongsu Park, lkml, Linux Containers, Alban Crequy, Seth Forshee,
	Sargun Dhillon, linux-fsdevel, Alexander Viro

Miklos Szeredi <mszeredi@redhat.com> writes:

> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
>> From: Seth Forshee <seth.forshee@canonical.com>
>>
>> The user in control of a super block should be allowed to freeze
>> and thaw it. Relax the restrictions on the FIFREEZE and FITHAW
>> ioctls to require CAP_SYS_ADMIN in s_user_ns.
>
> Why is this required for unprivileged fuse?
>
> Fuse doesn't support freeze, so this seems to make no sense in the
> context of this patchset.

This isn't required to support fuse.  It is a relaxation in permissions
so it isn't strictly necessary for anything.

Until just recently Seth and I were working through the vfs, looking at
what we need in general for unprivileged mounts, with fuse as our focus
but without limiting ourselves to fuse.

I have been putting off relaxation of permissions like this because they
are not necessary for safety.  But in general they do make sense.

In practice I think all we need to worry about for fuse is the last 4 patches.


Eric

* Re: [PATCH 09/11] fuse: Restrict allow_other to the superblock's namespace or a descendant
  2017-12-22 14:32 ` [PATCH 09/11] fuse: Restrict allow_other to the superblock's namespace or a descendant Dongsu Park
  2017-12-23  3:50   ` Serge E. Hallyn
@ 2018-02-19 23:16   ` Eric W. Biederman
  1 sibling, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-19 23:16 UTC (permalink / raw)
  To: Dongsu Park
  Cc: linux-kernel, containers, Alban Crequy, Miklos Szeredi,
	Seth Forshee, Sargun Dhillon, linux-fsdevel, Serge Hallyn

Dongsu Park <dongsu@kinvolk.io> writes:

> From: Seth Forshee <seth.forshee@canonical.com>
>
> Unprivileged users are normally restricted from mounting with the
> allow_other option by system policy, but this could be bypassed
> for a mount done with user namespace root permissions. In such
> cases allow_other should not allow users outside the userns
> to access the mount as doing so would give the unprivileged user
> the ability to manipulate processes it would otherwise be unable
> to manipulate. Restrict allow_other to apply to users in the same
> userns used at mount or a descendant of that namespace. Also
> export current_in_userns() for use by fuse when built as a
> module.
>
> Patch v4 is available: https://patchwork.kernel.org/patch/8944671/
>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Serge Hallyn <serge@hallyn.com>
> Cc: Miklos Szeredi <mszeredi@redhat.com>
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>

Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>

> ---
>  fs/fuse/dir.c           | 2 +-
>  kernel/user_namespace.c | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index ad1cfac1..d41559a0 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
>  	const struct cred *cred;
>  
>  	if (fc->allow_other)
> -		return 1;
> +		return current_in_userns(fc->user_ns);
>  
>  	cred = current_cred();
>  	if (uid_eq(cred->euid, fc->user_id) &&
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 246d4d4c..492c255e 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace *target_ns)
>  {
>  	return in_userns(target_ns, current_user_ns());
>  }
> +EXPORT_SYMBOL(current_in_userns);
>  
>  static inline struct user_namespace *to_user_ns(struct ns_common *ns)
>  {

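The current_in_userns(fc->user_ns) test above amounts to an ancestor walk: a caller passes if its user namespace is fc->user_ns or a descendant of it. A userspace sketch of that walk, with hypothetical types in which each namespace records its parent and its depth:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user namespace: parent link plus depth in the tree. */
struct model_user_ns {
	const struct model_user_ns *parent; /* NULL for init_user_ns */
	int level;                          /* 0 for init_user_ns */
};

/* True if child is ancestor itself or one of its descendants. */
static bool model_in_userns(const struct model_user_ns *ancestor,
			    const struct model_user_ns *child)
{
	const struct model_user_ns *ns = child;

	assert(ancestor != NULL && child != NULL);
	while (ns->level > ancestor->level)
		ns = ns->parent;
	return ns == ancestor;
}

/* allow_other as restricted by the patch: fc->user_ns and below only. */
static bool model_allow_other(const struct model_user_ns *fc_user_ns,
			      const struct model_user_ns *caller_ns)
{
	return model_in_userns(fc_user_ns, caller_ns);
}
```

Callers in the initial namespace or in a sibling container are refused, which is what keeps an unprivileged mounter from reaching processes outside its namespace.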
* Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
  2017-12-22 14:32 ` [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns Dongsu Park
                     ` (2 preceding siblings ...)
  2018-02-12 15:57   ` Miklos Szeredi
@ 2018-02-20  2:12   ` Eric W. Biederman
  3 siblings, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-20  2:12 UTC (permalink / raw)
  To: Dongsu Park
  Cc: linux-kernel, containers, Alban Crequy, Miklos Szeredi,
	Seth Forshee, Sargun Dhillon, linux-fsdevel

Dongsu Park <dongsu@kinvolk.io> writes:

> From: Seth Forshee <seth.forshee@canonical.com>
>
> In order to support mounts from namespaces other than
> init_user_ns, fuse must translate uids and gids to/from the
> userns of the process servicing requests on /dev/fuse. This
> patch does that, with a couple of restrictions on the namespace:
>
>  - The userns for the fuse connection is fixed to the namespace
>    from which /dev/fuse is opened.
>
>  - The namespace must be the same as s_user_ns.
>
> These restrictions simplify the implementation by avoiding the
> need to pass around userns references and by allowing fuse to
> rely on the checks in inode_change_ok for ownership changes.
> Either restriction could be relaxed in the future if needed.
>
> For cuse the namespace used for the connection is also simply
> current_user_ns() at the time /dev/cuse is opened.
>
> Patch v4 is available: https://patchwork.kernel.org/patch/8944661/
>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Miklos Szeredi <mszeredi@redhat.com>
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
> ---
>  fs/fuse/cuse.c   |  3 ++-
>  fs/fuse/dev.c    | 11 ++++++++---
>  fs/fuse/dir.c    | 14 +++++++-------
>  fs/fuse/fuse_i.h |  6 +++++-
>  fs/fuse/inode.c  | 31 +++++++++++++++++++------------
>  5 files changed, 41 insertions(+), 24 deletions(-)
>
> diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
> index e9e97803..b1b83259 100644
> --- a/fs/fuse/cuse.c
> +++ b/fs/fuse/cuse.c
> @@ -48,6 +48,7 @@
>  #include <linux/stat.h>
>  #include <linux/module.h>
>  #include <linux/uio.h>
> +#include <linux/user_namespace.h>
>  
>  #include "fuse_i.h"
>  
> @@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
>  	if (!cc)
>  		return -ENOMEM;
>  
As noticed in the review this should probably say:
	if (current_user_ns() != &init_user_ns)
		return -EINVAL;

Just so we don't need to think about cuse being opened in a user
namespace at this point.  It is probably harmless.  But it isn't
what we are focusing on.

> -	fuse_conn_init(&cc->fc);
> +	fuse_conn_init(&cc->fc, current_user_ns());
>  
>  	fud = fuse_dev_alloc(&cc->fc);
>  	if (!fud) {


> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 17f0d05b..0f780e16 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
>  
>  static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
>  {
> -	req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
> -	req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
> +	req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
> +	req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
>  	req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
>  }
>  
> @@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
>  	__set_bit(FR_WAITING, &req->flags);
>  	if (for_background)
>  		__set_bit(FR_BACKGROUND, &req->flags);
> +	if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) {
> +		fuse_put_request(fc, req);
> +		return ERR_PTR(-EOVERFLOW);
> +	}
>  
>  	return req;
>  
> @@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
>  	in = &req->in;
>  	reqsize = in->h.len;
>  
> -	if (task_active_pid_ns(current) != fc->pid_ns) {
> +	if (task_active_pid_ns(current) != fc->pid_ns ||
> +	    current_user_ns() != fc->user_ns) {
>  		rcu_read_lock();
>  		in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
>  		rcu_read_unlock();

The hunk above is a rebase error.  I believe it started out by erroring
out in the same case the pid namespace case errored out.  Miklos has a
good point that we need to handle the case where we have servers running
in jails of one sort or another, because at least Sandstorm runs
applications in that fashion, and we have previously had error reports
about that configuration breaking.

I think we can easily fix that.  Either by adding extra translation as
we did for the pid namespace or changing the user namespace used on the
connection.  I believe extra translation like we did with the pid
namespace will be more consistent.  And again it won't be a special
case except possibly during mount.  Of course there is weirdness there.

Eric

* [PATCH v6 0/6] fuse: mounts from non-init user namespaces
       [not found] <cover.1512741134.git.dongsu@kinvolk.io>
                   ` (7 preceding siblings ...)
  2017-12-22 14:32 ` [PATCH 10/11] fuse: Allow user namespace mounts Dongsu Park
@ 2018-02-21 20:24 ` Eric W. Biederman
  2018-02-21 20:29   ` [PATCH v6 1/5] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read Eric W. Biederman
                     ` (5 more replies)
  8 siblings, 6 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-21 20:24 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn


This patchset builds on the work by Dongsu Park and Seth Forshee and is
reduced to the set of patches that just affect fuse.  The non-fuse
patches are far enough along we can ignore them except possibly for the
question of when does FS_USERNS_MOUNT get set in fuse_fs_type.

Fuse with a block device has been left as an exercise for a later time.

I had to change the core of this patchset around some as the previous
patches were showing signs of bitrot.  Some important explanations were
missing, some important functionality was missing, and xattr handling
was completely absent.

Miklos can you take a look and see what you think?

I think this much of the fuse changes are ready, and as such I would
like to get them in this development cycle if possible.

My apologies if I have lost someone's ack or review somewhere.  Let me
know and I will fix it.

These changes are also available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git userns-fuse-v6
  
Eric W. Biederman (4):
      fuse: Remove the buggy retranslation of pids in fuse_dev_do_read
      fuse: Fail all requests with invalid uids or gids
      fuse: Support fuse filesystems outside of init_user_ns
      fuse: Ensure posix acls are translated outside of init_user_ns

Seth Forshee (1):
      fuse: Restrict allow_other to the superblock's namespace or a descendant

 fs/fuse/acl.c           |  4 ++--
 fs/fuse/cuse.c          |  7 ++++++-
 fs/fuse/dev.c           | 26 +++++++++++++-------------
 fs/fuse/dir.c           | 16 ++++++++--------
 fs/fuse/fuse_i.h        |  7 ++++++-
 fs/fuse/inode.c         | 38 ++++++++++++++++++++++++++------------
 fs/fuse/xattr.c         | 43 +++++++++++++++++++++++++++++++++++++++++++
 kernel/user_namespace.c |  1 +
 8 files changed, 105 insertions(+), 37 deletions(-)

Eric

* [PATCH v6 1/5] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read
  2018-02-21 20:24 ` [PATCH v6 0/6] fuse: mounts from non-init user namespaces Eric W. Biederman
@ 2018-02-21 20:29   ` Eric W. Biederman
  2018-02-22 10:13     ` Miklos Szeredi
  2018-02-21 20:29   ` [PATCH v6 2/5] fuse: Fail all requests with invalid uids or gids Eric W. Biederman
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-21 20:29 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Eric W. Biederman

At the point of fuse_dev_do_read the user space process that initiated the
action on the fuse filesystem may no longer exist.  The process may have been
killed or may have fired an asynchronous request and exited.

If the initial process has exited, the code "pid_vnr(find_pid_ns(in->h.pid,
fc->pid_ns))" will either return a pid of 0, or in the unlikely event that
the pid has been reallocated it can return practically any pid.  Any pid is
possible as the pid allocator allocates pid numbers in different pid
namespaces independently.
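
A small userspace model shows both failure modes of the retranslation. Everything here is hypothetical: a flat table stands in for struct pid, each entry carries one number in fc->pid_ns and an independently allocated number in the server's pid namespace, and model_find_pid()/model_pid_vnr() only imitate find_pid_ns()/pid_vnr().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task visible in two pid namespaces. */
struct model_pid {
	int nr_in_fc_ns;     /* number in fc->pid_ns (what in->h.pid holds) */
	int nr_in_server_ns; /* number the fuse server would see */
	int alive;
};

/* Like find_pid_ns(): look up a live pid by its number in fc->pid_ns. */
static const struct model_pid *model_find_pid(const struct model_pid *tbl,
					      size_t n, int nr)
{
	assert(tbl != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (tbl[i].alive && tbl[i].nr_in_fc_ns == nr)
			return &tbl[i];
	return NULL;
}

/* Like pid_vnr(): 0 when there is no pid. */
static int model_pid_vnr(const struct model_pid *pid)
{
	return pid ? pid->nr_in_server_ns : 0;
}

/* The removed retranslation, done at read time rather than request time. */
static int model_retranslate(const struct model_pid *tbl, size_t n, int h_pid)
{
	return model_pid_vnr(model_find_pid(tbl, n, h_pid));
}
```

While the requester is alive the server sees its own number for it; once it exits the result is 0, and if the number is reused in fc->pid_ns the server gets whatever unrelated number its own namespace handed to the new task.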

The only way to make translation in fuse_dev_do_read reliable is to call
get_pid in fuse_req_init_context, and pid_vnr followed by put_pid in
fuse_dev_do_read.  That reference counting in other contexts has been shown
to bounce cache lines between processors and in general be slow.  So that is
not desirable.

The only known user of running the fuse server in a different pid namespace
from the filesystem does not care what the pids are in the fuse messages
so removing this code should not matter.

Getting the translation to a server running outside of the pid namespace
of a container can still be achieved by playing setns games at mount time.
It is also possible to add an option to pass a pid namespace into the fuse
filesystem at mount time.

Fixes: 5d6d3a301c4e ("fuse: allow server to run in different pid_ns")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/fuse/dev.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 5d06384c2cae..0fb58f364fa6 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1260,12 +1260,6 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
 	in = &req->in;
 	reqsize = in->h.len;
 
-	if (task_active_pid_ns(current) != fc->pid_ns) {
-		rcu_read_lock();
-		in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
-		rcu_read_unlock();
-	}
-
 	/* If request is too large, reply with an error and restart the read */
 	if (nbytes < reqsize) {
 		req->out.h.error = -EIO;
-- 
2.14.1

* [PATCH v6 2/5] fuse: Fail all requests with invalid uids or gids
  2018-02-21 20:24 ` [PATCH v6 0/6] fuse: mounts from non-init user namespaces Eric W. Biederman
  2018-02-21 20:29   ` [PATCH v6 1/5] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read Eric W. Biederman
@ 2018-02-21 20:29   ` Eric W. Biederman
  2018-02-22 10:26     ` Miklos Szeredi
  2018-02-21 20:29   ` [PATCH v6 3/5] fuse: Support fuse filesystems outside of init_user_ns Eric W. Biederman
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-21 20:29 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Eric W. Biederman

Upon a cursory examination the uid and gid of a fuse request are
necessary for correct operation.  Failing a fuse request where those
values are not reliable seems a straightforward and reliable means of
ensuring that fuse requests with bad data are not sent or processed.

In most cases the vfs will avoid actions it suspects will cause
an inode write back of an inode with an invalid uid or gid.  But that does
not map precisely to what fuse is doing, so test for this and solve
this at the fuse level as well.

Performing this work in fuse_req_init_context is cheap as the code is
already performing the translation here and only needs to check the
result of the translation to see if things are not representable in
a form the fuse server can handle.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/fuse/dev.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 0fb58f364fa6..216db3f51a31 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -112,11 +112,13 @@ static void __fuse_put_request(struct fuse_req *req)
 	refcount_dec(&req->count);
 }
 
-static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
+static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
 {
-	req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
-	req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
+	req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
+	req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
 	req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
+
+	return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
 }
 
 void fuse_set_initialized(struct fuse_conn *fc)
@@ -162,12 +164,13 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
 			wake_up(&fc->blocked_waitq);
 		goto out;
 	}
-
-	fuse_req_init_context(fc, req);
 	__set_bit(FR_WAITING, &req->flags);
 	if (for_background)
 		__set_bit(FR_BACKGROUND, &req->flags);
-
+	if (unlikely(!fuse_req_init_context(fc, req))) {
+		fuse_put_request(fc, req);
+		return ERR_PTR(-EOVERFLOW);
+	}
 	return req;
 
  out:
@@ -256,9 +259,12 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc,
 	if (!req)
 		req = get_reserved_req(fc, file);
 
-	fuse_req_init_context(fc, req);
 	__set_bit(FR_WAITING, &req->flags);
 	__clear_bit(FR_BACKGROUND, &req->flags);
+	if (unlikely(!fuse_req_init_context(fc, req))) {
+		fuse_put_request(fc, req);
+		return ERR_PTR(-EOVERFLOW);
+	}
 	return req;
 }
 
-- 
2.14.1

* [PATCH v6 3/5] fuse: Support fuse filesystems outside of init_user_ns
  2018-02-21 20:24 ` [PATCH v6 0/6] fuse: mounts from non-init user namespaces Eric W. Biederman
  2018-02-21 20:29   ` [PATCH v6 1/5] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read Eric W. Biederman
  2018-02-21 20:29   ` [PATCH v6 2/5] fuse: Fail all requests with invalid uids or gids Eric W. Biederman
@ 2018-02-21 20:29   ` Eric W. Biederman
  2018-02-21 20:29   ` [PATCH v6 4/5] fuse: Ensure posix acls are translated " Eric W. Biederman
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-21 20:29 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Eric W. Biederman

In order to support mounts from namespaces other than init_user_ns,
fuse must translate uids and gids to/from the userns of the process
servicing requests on /dev/fuse. This patch does that, with a couple
of restrictions on the namespace:

 - The userns for the fuse connection is fixed to the namespace
   from which /dev/fuse is opened.

 - The namespace must be the same as s_user_ns.

These restrictions simplify the implementation by avoiding the need to
pass around userns references and by allowing fuse to rely on the
checks in setattr_prepare for ownership changes.  Either restriction
could be relaxed in the future if needed.

For cuse the userns used is that of the opener of /dev/cuse.  Semantically the
cuse support does not appear safe for unprivileged users.  Practically
the permissions on /dev/cuse only make it accessible to the global root
user.  If something slips through the cracks in a user namespace the only
users who will be able to use the cuse device are those users mapped into
the user namespace.

Translation in the posix acl is updated to use the user namespace of
the filesystem.  Avoiding cases which might bypass this translation is
handled in a following change.
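
Roughly, passing fc->user_ns to posix_acl_from_xattr() means the ids stored in named ACL entries are interpreted relative to the fuse server's namespace. A userspace sketch under simplifying assumptions: hypothetical types, one map shared by uids and gids, and rejection of an unmappable entry with -EINVAL chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_INVALID ((uint32_t)-1)

/* Hypothetical single-extent map: namespace ids -> kernel ids. */
struct ns_map {
	uint32_t first, lower_first, count;
};

/* Like make_kuid(): namespace id -> kernel id, MODEL_INVALID if unmapped. */
static uint32_t model_make_kid(const struct ns_map *m, uint32_t id)
{
	assert(m != NULL);
	if (id - m->first < m->count)
		return m->lower_first + (id - m->first);
	return MODEL_INVALID;
}

enum acl_tag { ACL_TAG_USER_OBJ, ACL_TAG_USER, ACL_TAG_GROUP, ACL_TAG_OTHER };

struct acl_entry {
	enum acl_tag tag;
	uint32_t id; /* only meaningful for ACL_TAG_USER / ACL_TAG_GROUP */
};

/*
 * Translate named entries in place; reject the whole ACL if any id does
 * not map.  On failure, earlier entries are left already translated.
 */
static int model_acl_from_xattr(const struct ns_map *m,
				struct acl_entry *e, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (e[i].tag != ACL_TAG_USER && e[i].tag != ACL_TAG_GROUP)
			continue;
		e[i].id = model_make_kid(m, e[i].id);
		if (e[i].id == MODEL_INVALID)
			return -EINVAL;
	}
	return 0;
}
```

Owner, group-owner and other entries carry no id of their own, so only the named entries need translating.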

This change is strongly based on a similar change from Seth Forshee
and Dongsu Park.

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: <seth.forshee@canonical.com>
Cc: Dongsu Park <dongsu@kinvolk.io>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/fuse/acl.c    |  4 ++--
 fs/fuse/cuse.c   |  7 ++++++-
 fs/fuse/dev.c    |  4 ++--
 fs/fuse/dir.c    | 14 +++++++-------
 fs/fuse/fuse_i.h |  6 +++++-
 fs/fuse/inode.c  | 31 +++++++++++++++++++------------
 6 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index ec85765502f1..5a48cee6d7d3 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -34,7 +34,7 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type)
 		return ERR_PTR(-ENOMEM);
 	size = fuse_getxattr(inode, name, value, PAGE_SIZE);
 	if (size > 0)
-		acl = posix_acl_from_xattr(&init_user_ns, value, size);
+		acl = posix_acl_from_xattr(fc->user_ns, value, size);
 	else if ((size == 0) || (size == -ENODATA) ||
 		 (size == -EOPNOTSUPP && fc->no_getxattr))
 		acl = NULL;
@@ -81,7 +81,7 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 		if (!value)
 			return -ENOMEM;
 
-		ret = posix_acl_to_xattr(&init_user_ns, acl, value, size);
+		ret = posix_acl_to_xattr(fc->user_ns, acl, value, size);
 		if (ret < 0) {
 			kfree(value);
 			return ret;
diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index e9e97803442a..036ee477669e 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -48,6 +48,7 @@
 #include <linux/stat.h>
 #include <linux/module.h>
 #include <linux/uio.h>
+#include <linux/user_namespace.h>
 
 #include "fuse_i.h"
 
@@ -498,7 +499,11 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
 	if (!cc)
 		return -ENOMEM;
 
-	fuse_conn_init(&cc->fc);
+	/*
+	 * Limit the cuse channel to requests that can
+	 * be represented in file->f_cred->user_ns.
+	 */
+	fuse_conn_init(&cc->fc, file->f_cred->user_ns);
 
 	fud = fuse_dev_alloc(&cc->fc);
 	if (!fud) {
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 216db3f51a31..338cfda3eb8f 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
 
 static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
 {
-	req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
-	req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
+	req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
+	req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
 	req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
 
 	return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 24967382a7b1..ad1cfac1942f 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 	stat->ino = attr->ino;
 	stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
 	stat->nlink = attr->nlink;
-	stat->uid = make_kuid(&init_user_ns, attr->uid);
-	stat->gid = make_kgid(&init_user_ns, attr->gid);
+	stat->uid = make_kuid(fc->user_ns, attr->uid);
+	stat->gid = make_kgid(fc->user_ns, attr->gid);
 	stat->rdev = inode->i_rdev;
 	stat->atime.tv_sec = attr->atime;
 	stat->atime.tv_nsec = attr->atimensec;
@@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
 	return true;
 }
 
-static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
-			   bool trust_local_cmtime)
+static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
+			   struct fuse_setattr_in *arg, bool trust_local_cmtime)
 {
 	unsigned ivalid = iattr->ia_valid;
 
 	if (ivalid & ATTR_MODE)
 		arg->valid |= FATTR_MODE,   arg->mode = iattr->ia_mode;
 	if (ivalid & ATTR_UID)
-		arg->valid |= FATTR_UID,    arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
+		arg->valid |= FATTR_UID,    arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
 	if (ivalid & ATTR_GID)
-		arg->valid |= FATTR_GID,    arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
+		arg->valid |= FATTR_GID,    arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
 	if (ivalid & ATTR_SIZE)
 		arg->valid |= FATTR_SIZE,   arg->size = iattr->ia_size;
 	if (ivalid & ATTR_ATIME) {
@@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
 
 	memset(&inarg, 0, sizeof(inarg));
 	memset(&outarg, 0, sizeof(outarg));
-	iattr_to_fattr(attr, &inarg, trust_local_cmtime);
+	iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
 	if (file) {
 		struct fuse_file *ff = file->private_data;
 		inarg.valid |= FATTR_FH;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index c4c093bbf456..7772e2b4057e 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -26,6 +26,7 @@
 #include <linux/xattr.h>
 #include <linux/pid_namespace.h>
 #include <linux/refcount.h>
+#include <linux/user_namespace.h>
 
 /** Max number of pages that can be used in a single read request */
 #define FUSE_MAX_PAGES_PER_REQ 32
@@ -466,6 +467,9 @@ struct fuse_conn {
 	/** The pid namespace for this mount */
 	struct pid_namespace *pid_ns;
 
+	/** The user namespace for this mount */
+	struct user_namespace *user_ns;
+
 	/** Maximum read size */
 	unsigned max_read;
 
@@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
 /**
  * Initialize fuse_conn
  */
-void fuse_conn_init(struct fuse_conn *fc);
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
 
 /**
  * Release reference to fuse_conn
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 624f18bbfd2b..e018dc3999f4 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 	inode->i_ino     = fuse_squash_ino(attr->ino);
 	inode->i_mode    = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
 	set_nlink(inode, attr->nlink);
-	inode->i_uid     = make_kuid(&init_user_ns, attr->uid);
-	inode->i_gid     = make_kgid(&init_user_ns, attr->gid);
+	inode->i_uid     = make_kuid(fc->user_ns, attr->uid);
+	inode->i_gid     = make_kgid(fc->user_ns, attr->gid);
 	inode->i_blocks  = attr->blocks;
 	inode->i_atime.tv_sec   = attr->atime;
 	inode->i_atime.tv_nsec  = attr->atimensec;
@@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
 	return err;
 }
 
-static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
+static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
+			  struct user_namespace *user_ns)
 {
 	char *p;
 	memset(d, 0, sizeof(struct fuse_mount_data));
@@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
 		case OPT_USER_ID:
 			if (fuse_match_uint(&args[0], &uv))
 				return 0;
-			d->user_id = make_kuid(current_user_ns(), uv);
+			d->user_id = make_kuid(user_ns, uv);
 			if (!uid_valid(d->user_id))
 				return 0;
 			d->user_id_present = 1;
@@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
 		case OPT_GROUP_ID:
 			if (fuse_match_uint(&args[0], &uv))
 				return 0;
-			d->group_id = make_kgid(current_user_ns(), uv);
+			d->group_id = make_kgid(user_ns, uv);
 			if (!gid_valid(d->group_id))
 				return 0;
 			d->group_id_present = 1;
@@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
 	struct super_block *sb = root->d_sb;
 	struct fuse_conn *fc = get_fuse_conn_super(sb);
 
-	seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
-	seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
+	seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
+	seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
 	if (fc->default_permissions)
 		seq_puts(m, ",default_permissions");
 	if (fc->allow_other)
@@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
 	fpq->connected = 1;
 }
 
-void fuse_conn_init(struct fuse_conn *fc)
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
 {
 	memset(fc, 0, sizeof(*fc));
 	spin_lock_init(&fc->lock);
@@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc)
 	fc->attr_version = 1;
 	get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
 	fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
+	fc->user_ns = get_user_ns(user_ns);
 }
 EXPORT_SYMBOL_GPL(fuse_conn_init);
 
@@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc)
 		if (fc->destroy_req)
 			fuse_request_free(fc->destroy_req);
 		put_pid_ns(fc->pid_ns);
+		put_user_ns(fc->user_ns);
 		fc->release(fc);
 	}
 }
@@ -1061,7 +1064,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 
 	sb->s_flags &= ~(SB_NOSEC | SB_I_VERSION);
 
-	if (!parse_fuse_opt(data, &d, is_bdev))
+	if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
 		goto err;
 
 	if (is_bdev) {
@@ -1086,8 +1089,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	if (!file)
 		goto err;
 
-	if ((file->f_op != &fuse_dev_operations) ||
-	    (file->f_cred->user_ns != &init_user_ns))
+	/*
+	 * Require mount to happen from the same user namespace which
+	 * opened /dev/fuse to prevent potential attacks.
+	 */
+	if (file->f_op != &fuse_dev_operations ||
+	    file->f_cred->user_ns != sb->s_user_ns)
 		goto err_fput;
 
 	fc = kmalloc(sizeof(*fc), GFP_KERNEL);
@@ -1095,7 +1102,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	if (!fc)
 		goto err_fput;
 
-	fuse_conn_init(fc);
+	fuse_conn_init(fc, sb->s_user_ns);
 	fc->release = fuse_free_conn;
 
 	fud = fuse_dev_alloc(fc);
-- 
2.14.1

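The patch above routes every uid/gid conversion through fc->user_ns instead of &init_user_ns. Below is a minimal userspace sketch of that translation. It assumes a single-extent map, like one line of /proc/<pid>/uid_map. The model_* names are hypothetical stand-ins, not kernel APIs. Like from_kuid(), the model returns (uid_t)-1 for an id that has no mapping, which is the value fuse_req_init_context() checks for.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t model_uid_t;
#define MODEL_INVALID_UID ((model_uid_t)-1)

/* Hypothetical single-extent id map, one uid_map line: "first lower_first count". */
struct model_userns {
	model_uid_t first;        /* first id inside the namespace */
	model_uid_t lower_first;  /* matching id in the initial namespace */
	model_uid_t count;        /* number of ids covered by the extent */
};

/* Namespace id -> initial-namespace ("kernel") id, cf. make_kuid(). */
static model_uid_t model_make_kuid(const struct model_userns *ns, model_uid_t uid)
{
	if (uid >= ns->first && uid - ns->first < ns->count)
		return ns->lower_first + (uid - ns->first);
	return MODEL_INVALID_UID;
}

/* Initial-namespace id -> namespace id, cf. from_kuid(); -1 if unmapped. */
static model_uid_t model_from_kuid(const struct model_userns *ns, model_uid_t kuid)
{
	if (kuid >= ns->lower_first && kuid - ns->lower_first < ns->count)
		return ns->first + (kuid - ns->lower_first);
	return MODEL_INVALID_UID;
}
```

For example, with the map "0 100000 65536", host root (kernel uid 0) cannot be represented inside the mount. A request from it is therefore rejected rather than silently munged.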

* [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns
  2018-02-21 20:24 ` [PATCH v6 0/6] fuse: mounts from non-init user namespaces Eric W. Biederman
                     ` (2 preceding siblings ...)
  2018-02-21 20:29   ` [PATCH v6 3/5] fuse: Support fuse filesystems outside of init_user_ns Eric W. Biederman
@ 2018-02-21 20:29   ` Eric W. Biederman
  2018-02-22 11:40     ` Miklos Szeredi
  2018-02-21 20:29   ` [PATCH v6 5/5] fuse: Restrict allow_other to the superblock's namespace or a descendant Eric W. Biederman
  2018-02-26 23:52   ` [PATCH v7 0/7] fuse: mounts from non-init user namespaces Eric W. Biederman
  5 siblings, 1 reply; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-21 20:29 UTC
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Eric W. Biederman

Ensure the translation happens by failing to read or write
posix acls when the filesystem has not indicated it supports
posix acls.

This ensures that modern cached posix acl support is available
and used when dealing with posix acls.  This is important
because only that path has the code to convert the uids and
gids in posix acls into the user namespace of a fuse filesystem.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/fuse/fuse_i.h |  1 +
 fs/fuse/inode.c  |  7 +++++++
 fs/fuse/xattr.c  | 43 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 51 insertions(+)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 7772e2b4057e..986fa2b043ab 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -979,6 +979,7 @@ ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size);
 int fuse_removexattr(struct inode *inode, const char *name);
 extern const struct xattr_handler *fuse_xattr_handlers[];
 extern const struct xattr_handler *fuse_acl_xattr_handlers[];
+extern const struct xattr_handler *fuse_no_acl_xattr_handlers[];
 
 struct posix_acl;
 struct posix_acl *fuse_get_acl(struct inode *inode, int type);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index e018dc3999f4..a52cf2019a58 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1097,6 +1097,13 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	    file->f_cred->user_ns != sb->s_user_ns)
 		goto err_fput;
 
+	/*
+	 * If we are not in the initial user namespace posix
+	 * acls must be translated.
+	 */
+	if (sb->s_user_ns != &init_user_ns)
+		sb->s_xattr = fuse_no_acl_xattr_handlers;
+
 	fc = kmalloc(sizeof(*fc), GFP_KERNEL);
 	err = -ENOMEM;
 	if (!fc)
diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
index 3caac46b08b0..433717640f78 100644
--- a/fs/fuse/xattr.c
+++ b/fs/fuse/xattr.c
@@ -192,6 +192,26 @@ static int fuse_xattr_set(const struct xattr_handler *handler,
 	return fuse_setxattr(inode, name, value, size, flags);
 }
 
+static bool no_xattr_list(struct dentry *dentry)
+{
+	return false;
+}
+
+static int no_xattr_get(const struct xattr_handler *handler,
+			struct dentry *dentry, struct inode *inode,
+			const char *name, void *value, size_t size)
+{
+	return -EOPNOTSUPP;
+}
+
+static int no_xattr_set(const struct xattr_handler *handler,
+			struct dentry *dentry, struct inode *nodee,
+			const char *name, const void *value,
+			size_t size, int flags)
+{
+	return -EOPNOTSUPP;
+}
+
 static const struct xattr_handler fuse_xattr_handler = {
 	.prefix = "",
 	.get    = fuse_xattr_get,
@@ -209,3 +229,26 @@ const struct xattr_handler *fuse_acl_xattr_handlers[] = {
 	&fuse_xattr_handler,
 	NULL
 };
+
+static const struct xattr_handler fuse_no_acl_access_xattr_handler = {
+	.name  = XATTR_NAME_POSIX_ACL_ACCESS,
+	.flags = ACL_TYPE_ACCESS,
+	.list  = no_xattr_list,
+	.get   = no_xattr_get,
+	.set   = no_xattr_set,
+};
+
+static const struct xattr_handler fuse_no_acl_default_xattr_handler = {
+	.name  = XATTR_NAME_POSIX_ACL_DEFAULT,
+	.flags = ACL_TYPE_ACCESS,
+	.list  = no_xattr_list,
+	.get   = no_xattr_get,
+	.set   = no_xattr_set,
+};
+
+const struct xattr_handler *fuse_no_acl_xattr_handlers[] = {
+	&fuse_no_acl_access_xattr_handler,
+	&fuse_no_acl_default_xattr_handler,
+	&fuse_xattr_handler,
+	NULL
+};
-- 
2.14.1

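The handler table in this patch works because lookup stops at the first matching handler. The two EOPNOTSUPP stubs come before the catch-all "" prefix handler, so the POSIX ACL names never fall through to the raw passthrough. Below is a simplified userspace model of that resolution. It assumes first-match-wins over the array, with exact matching for handlers that set a name and prefix matching otherwise. The model_* names and the loop are illustrative only and are not the VFS code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for struct xattr_handler. */
struct model_handler {
	const char *name;    /* exact match when non-NULL */
	const char *prefix;  /* prefix match when name is NULL */
	int (*get)(const char *name);
};

static int no_acl_get(const char *name) { (void)name; return -EOPNOTSUPP; }
static int passthrough_get(const char *name) { (void)name; return 0; }

static const struct model_handler no_acl_access = { "system.posix_acl_access", NULL, no_acl_get };
static const struct model_handler no_acl_default = { "system.posix_acl_default", NULL, no_acl_get };
static const struct model_handler catch_all = { NULL, "", passthrough_get };

/* Same order as fuse_no_acl_xattr_handlers[]: stubs first, catch-all last. */
static const struct model_handler *const model_handlers[] = {
	&no_acl_access, &no_acl_default, &catch_all, NULL
};

/* Return the first handler whose name (exact) or prefix matches. */
static const struct model_handler *model_resolve(const char *name)
{
	for (size_t i = 0; model_handlers[i]; i++) {
		const struct model_handler *h = model_handlers[i];

		if (h->name) {
			if (strcmp(name, h->name) == 0)
				return h;
		} else if (strncmp(name, h->prefix, strlen(h->prefix)) == 0) {
			return h;
		}
	}
	return NULL;
}
```

If the catch-all were listed first, it would shadow the stubs, and raw ACL xattrs would again reach the server untranslated.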

* [PATCH v6 5/5] fuse: Restrict allow_other to the superblock's namespace or a descendant
  2018-02-21 20:24 ` [PATCH v6 0/6] fuse: mounts from non-init user namespaces Eric W. Biederman
                     ` (3 preceding siblings ...)
  2018-02-21 20:29   ` [PATCH v6 4/5] fuse: Ensure posix acls are translated " Eric W. Biederman
@ 2018-02-21 20:29   ` Eric W. Biederman
  2018-02-26 23:52   ` [PATCH v7 0/7] fuse: mounts from non-init user namespaces Eric W. Biederman
  5 siblings, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-21 20:29 UTC
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Eric W. Biederman

From: Seth Forshee <seth.forshee@canonical.com>

Unprivileged users are normally restricted from mounting with the
allow_other option by system policy, but this could be bypassed
for a mount done with user namespace root permissions. In such
cases allow_other should not allow users outside the userns
to access the mount as doing so would give the unprivileged user
the ability to manipulate processes it would otherwise be unable
to manipulate. Restrict allow_other to apply to users in the same
userns used at mount or a descendant of that namespace. Also
export current_in_userns() for use by fuse when built as a
module.

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/fuse/dir.c           | 2 +-
 kernel/user_namespace.c | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index ad1cfac1942f..d41559a0aa6b 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
 	const struct cred *cred;
 
 	if (fc->allow_other)
-		return 1;
+		return current_in_userns(fc->user_ns);
 
 	cred = current_cred();
 	if (uid_eq(cred->euid, fc->user_id) &&
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 246d4d4ce5c7..492c255e6c5a 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace *target_ns)
 {
 	return in_userns(target_ns, current_user_ns());
 }
+EXPORT_SYMBOL(current_in_userns);
 
 static inline struct user_namespace *to_user_ns(struct ns_common *ns)
 {
-- 
2.14.1

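After this patch, allow_other admits a process only if its user namespace is the mount's namespace or a descendant of it. Below is a userspace sketch of that ancestry test. It is modelled on the level/parent walk that in_userns() in kernel/user_namespace.c performs. The model_* struct and helpers are hypothetical stand-ins that keep only the fields the walk needs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct user_namespace: ancestry fields only. */
struct model_userns {
	const struct model_userns *parent;
	int level;  /* 0 for the initial namespace */
};

/* Walk up from child until reaching ancestor's depth, then compare. */
static bool model_in_userns(const struct model_userns *ancestor,
			    const struct model_userns *child)
{
	const struct model_userns *ns = child;

	while (ns->level > ancestor->level)
		ns = ns->parent;
	return ns == ancestor;
}

/* allow_other check after the patch: caller must be in the mount's userns or below. */
static bool model_allow_other_ok(const struct model_userns *mount_ns,
				 const struct model_userns *caller_ns)
{
	return model_in_userns(mount_ns, caller_ns);
}
```

Note that the initial namespace is not a descendant of a container's namespace. In this model, a host process is therefore refused access to an allow_other mount made inside the container.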

* Re: [PATCH v6 1/5] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read
  2018-02-21 20:29   ` [PATCH v6 1/5] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read Eric W. Biederman
@ 2018-02-22 10:13     ` Miklos Szeredi
  2018-02-22 19:04       ` Eric W. Biederman
  0 siblings, 1 reply; 89+ messages in thread
From: Miklos Szeredi @ 2018-02-22 10:13 UTC
  To: Eric W. Biederman
  Cc: lkml, Linux Containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn

On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> At the point of fuse_dev_do_read the user space process that initiated the
> action on the fuse filesystem may no longer exist.  The process may have been
> killed, or it may have fired an asynchronous request and exited.
>
> If the initial process has exited, the code "pid_vnr(find_pid_ns(in->h.pid,
> fc->pid_ns))" will either return a pid of 0, or in the unlikely event that
> the pid has been reallocated it can return practically any pid.  Any pid is
> possible as the pid allocator allocates pid numbers in different pid
> namespaces independently.
>
> The only way to make translation in fuse_dev_do_read reliable is to call
> get_pid in fuse_req_init_context, and pid_vnr followed by put_pid in
> fuse_dev_do_read.  That reference counting in other contexts has been shown
> to bounce cache lines between processors and in general be slow.  So that is
> not desirable.
>
> The only known user that runs the fuse server in a different pid namespace
> from the filesystem does not care what the pids are in the fuse messages,
> so removing this code should not matter.

Shouldn't we at least zero out the pid in that case?

Thanks,
Miklos


>
> Getting the translation to a server running outside of the pid namespace
> of a container can still be achieved by playing setns games at mount time.
> It is also possible to add an option to pass a pid namespace into the fuse
> filesystem at mount time.
>
> Fixes: 5d6d3a301c4e ("fuse: allow server to run in different pid_ns")
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/fuse/dev.c | 6 ------
>  1 file changed, 6 deletions(-)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 5d06384c2cae..0fb58f364fa6 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -1260,12 +1260,6 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
>         in = &req->in;
>         reqsize = in->h.len;
>
> -       if (task_active_pid_ns(current) != fc->pid_ns) {
> -               rcu_read_lock();
> -               in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
> -               rcu_read_unlock();
> -       }
> -
>         /* If request is too large, reply with an error and restart the read */
>         if (nbytes < reqsize) {
>                 req->out.h.error = -EIO;
> --
> 2.14.1
>

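Below is a toy model of the failure mode described in the patch, showing why re-translating a stored pid number at read time is unreliable. It assumes a per-namespace table from numbers to pid objects, with numbers allocated independently in each namespace. The model_* names are hypothetical. Once the task is gone, the re-lookup yields 0. Once the inner number is reused, it yields an unrelated outer-namespace pid.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pid object carrying one number per namespace level. */
struct model_pid {
	int nr_outer;  /* number in the reader's (outer) namespace */
	int nr_inner;  /* number in fc->pid_ns (inner) */
};

#define MODEL_MAX_NR 16
/* Inner-namespace table: number -> pid object, NULL when free. */
static struct model_pid *inner_table[MODEL_MAX_NR];

static struct model_pid *model_find_pid_ns(int nr)
{
	if (nr <= 0 || nr >= MODEL_MAX_NR)
		return NULL;
	return inner_table[nr];
}

/* Like pid_vnr(): a missing pid object translates to 0. */
static int model_pid_vnr(const struct model_pid *pid)
{
	return pid ? pid->nr_outer : 0;
}

/* What the removed code did: look up the number stored at request time again. */
static int model_retranslate(int stored_inner_nr)
{
	return model_pid_vnr(model_find_pid_ns(stored_inner_nr));
}
```

Holding a reference to the pid object from request creation until the read would avoid this, at the refcounting cost the patch description mentions.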

* Re: [PATCH v6 2/5] fuse: Fail all requests with invalid uids or gids
  2018-02-21 20:29   ` [PATCH v6 2/5] fuse: Fail all requests with invalid uids or gids Eric W. Biederman
@ 2018-02-22 10:26     ` Miklos Szeredi
  2018-02-22 18:15       ` Eric W. Biederman
  0 siblings, 1 reply; 89+ messages in thread
From: Miklos Szeredi @ 2018-02-22 10:26 UTC
  To: Eric W. Biederman
  Cc: lkml, Linux Containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn

On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Upon a cursory examination, the uid and gid of a fuse request are
> necessary for correct operation.  Failing a fuse request where those
> values are not reliable seems a straightforward and reliable means of
> ensuring that fuse requests with bad data are not sent or processed.
>
> In most cases the vfs will avoid actions it suspects will cause
> an inode write back of an inode with an invalid uid or gid.  But that does
> not map precisely to what fuse is doing, so test for this and solve
> this at the fuse level as well.
>
> Performing this work in fuse_req_init_context is cheap as the code is
> already performing the translation here and only needs to check the
> result of the translation to see if things are not representable in
> a form the fuse server can handle.
>
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---
>  fs/fuse/dev.c | 20 +++++++++++++-------
>  1 file changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 0fb58f364fa6..216db3f51a31 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -112,11 +112,13 @@ static void __fuse_put_request(struct fuse_req *req)
>         refcount_dec(&req->count);
>  }
>
> -static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
> +static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
>  {
> -       req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
> -       req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
> +       req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
> +       req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
>         req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
> +
> +       return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
>  }
>
>  void fuse_set_initialized(struct fuse_conn *fc)
> @@ -162,12 +164,13 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
>                         wake_up(&fc->blocked_waitq);
>                 goto out;
>         }
> -
> -       fuse_req_init_context(fc, req);
>         __set_bit(FR_WAITING, &req->flags);
>         if (for_background)
>                 __set_bit(FR_BACKGROUND, &req->flags);
> -
> +       if (unlikely(!fuse_req_init_context(fc, req))) {
> +               fuse_put_request(fc, req);
> +               return ERR_PTR(-EOVERFLOW);
> +       }
>         return req;
>
>   out:
> @@ -256,9 +259,12 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc,
>         if (!req)
>                 req = get_reserved_req(fc, file);
>
> -       fuse_req_init_context(fc, req);
>         __set_bit(FR_WAITING, &req->flags);
>         __clear_bit(FR_BACKGROUND, &req->flags);
> +       if (unlikely(!fuse_req_init_context(fc, req))) {
> +               fuse_put_request(fc, req);
> +               return ERR_PTR(-EOVERFLOW);
> +       }

I think failing the "_nofail" variant is the wrong thing to do.  This
is called to allocate a FLUSH request on close() and in readdirplus to
allocate a FORGET request.  Failing the latter results in a refcount
leak in userspace.  Failing the former results in posix locks not being
released on close().

Thanks,
Miklos


* Re: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns
  2018-02-21 20:29   ` [PATCH v6 4/5] fuse: Ensure posix acls are translated " Eric W. Biederman
@ 2018-02-22 11:40     ` Miklos Szeredi
  2018-02-22 19:18       ` Eric W. Biederman
  0 siblings, 1 reply; 89+ messages in thread
From: Miklos Szeredi @ 2018-02-22 11:40 UTC
  To: Eric W. Biederman
  Cc: lkml, Linux Containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn

On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Ensure the translation happens by failing to read or write
> posix acls when the filesystem has not indicated it supports
> posix acls.

For the first iteration this is fine, but  we could convert the raw
xattrs as well, if we later want to, right?

Thanks,
Miklos

>
> This ensures that modern cached posix acl support is available
> and used when dealing with posix acls.  This is important
> because only that path has the code to convert the uids and
> gids in posix acls into the user namespace of a fuse filesystem.
>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/fuse/fuse_i.h |  1 +
>  fs/fuse/inode.c  |  7 +++++++
>  fs/fuse/xattr.c  | 43 +++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 51 insertions(+)
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 7772e2b4057e..986fa2b043ab 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -979,6 +979,7 @@ ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size);
>  int fuse_removexattr(struct inode *inode, const char *name);
>  extern const struct xattr_handler *fuse_xattr_handlers[];
>  extern const struct xattr_handler *fuse_acl_xattr_handlers[];
> +extern const struct xattr_handler *fuse_no_acl_xattr_handlers[];
>
>  struct posix_acl;
>  struct posix_acl *fuse_get_acl(struct inode *inode, int type);
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index e018dc3999f4..a52cf2019a58 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1097,6 +1097,13 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>             file->f_cred->user_ns != sb->s_user_ns)
>                 goto err_fput;
>
> +       /*
> +        * If we are not in the initial user namespace posix
> +        * acls must be translated.
> +        */
> +       if (sb->s_user_ns != &init_user_ns)
> +               sb->s_xattr = fuse_no_acl_xattr_handlers;
> +
>         fc = kmalloc(sizeof(*fc), GFP_KERNEL);
>         err = -ENOMEM;
>         if (!fc)
> diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
> index 3caac46b08b0..433717640f78 100644
> --- a/fs/fuse/xattr.c
> +++ b/fs/fuse/xattr.c
> @@ -192,6 +192,26 @@ static int fuse_xattr_set(const struct xattr_handler *handler,
>         return fuse_setxattr(inode, name, value, size, flags);
>  }
>
> +static bool no_xattr_list(struct dentry *dentry)
> +{
> +       return false;
> +}
> +
> +static int no_xattr_get(const struct xattr_handler *handler,
> +                       struct dentry *dentry, struct inode *inode,
> +                       const char *name, void *value, size_t size)
> +{
> +       return -EOPNOTSUPP;
> +}
> +
> +static int no_xattr_set(const struct xattr_handler *handler,
> +                       struct dentry *dentry, struct inode *nodee,
> +                       const char *name, const void *value,
> +                       size_t size, int flags)
> +{
> +       return -EOPNOTSUPP;
> +}
> +
>  static const struct xattr_handler fuse_xattr_handler = {
>         .prefix = "",
>         .get    = fuse_xattr_get,
> @@ -209,3 +229,26 @@ const struct xattr_handler *fuse_acl_xattr_handlers[] = {
>         &fuse_xattr_handler,
>         NULL
>  };
> +
> +static const struct xattr_handler fuse_no_acl_access_xattr_handler = {
> +       .name  = XATTR_NAME_POSIX_ACL_ACCESS,
> +       .flags = ACL_TYPE_ACCESS,
> +       .list  = no_xattr_list,
> +       .get   = no_xattr_get,
> +       .set   = no_xattr_set,
> +};
> +
> +static const struct xattr_handler fuse_no_acl_default_xattr_handler = {
> +       .name  = XATTR_NAME_POSIX_ACL_DEFAULT,
> +       .flags = ACL_TYPE_ACCESS,
> +       .list  = no_xattr_list,
> +       .get   = no_xattr_get,
> +       .set   = no_xattr_set,
> +};
> +
> +const struct xattr_handler *fuse_no_acl_xattr_handlers[] = {
> +       &fuse_no_acl_access_xattr_handler,
> +       &fuse_no_acl_default_xattr_handler,
> +       &fuse_xattr_handler,
> +       NULL
> +};
> --
> 2.14.1
>


* Re: [PATCH v6 2/5] fuse: Fail all requests with invalid uids or gids
  2018-02-22 10:26     ` Miklos Szeredi
@ 2018-02-22 18:15       ` Eric W. Biederman
  0 siblings, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-22 18:15 UTC
  To: Miklos Szeredi
  Cc: lkml, Linux Containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn

Miklos Szeredi <mszeredi@redhat.com> writes:

> On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Upon a cursory examination, the uid and gid of a fuse request are
>> necessary for correct operation.  Failing a fuse request where those
>> values are not reliable seems a straightforward and reliable means of
>> ensuring that fuse requests with bad data are not sent or processed.
>>
>> In most cases the vfs will avoid actions it suspects will cause
>> an inode write back of an inode with an invalid uid or gid.  But that does
>> not map precisely to what fuse is doing, so test for this and solve
>> this at the fuse level as well.
>>
>> Performing this work in fuse_req_init_context is cheap as the code is
>> already performing the translation here and only needs to check the
>> result of the translation to see if things are not representable in
>> a form the fuse server can handle.
>>
>> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>> ---
>>  fs/fuse/dev.c | 20 +++++++++++++-------
>>  1 file changed, 13 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index 0fb58f364fa6..216db3f51a31 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -112,11 +112,13 @@ static void __fuse_put_request(struct fuse_req *req)
>>         refcount_dec(&req->count);
>>  }
>>
>> -static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
>> +static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
>>  {
>> -       req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
>> -       req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
>> +       req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
>> +       req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
>>         req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
>> +
>> +       return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
>>  }
>>
>>  void fuse_set_initialized(struct fuse_conn *fc)
>> @@ -162,12 +164,13 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
>>                         wake_up(&fc->blocked_waitq);
>>                 goto out;
>>         }
>> -
>> -       fuse_req_init_context(fc, req);
>>         __set_bit(FR_WAITING, &req->flags);
>>         if (for_background)
>>                 __set_bit(FR_BACKGROUND, &req->flags);
>> -
>> +       if (unlikely(!fuse_req_init_context(fc, req))) {
>> +               fuse_put_request(fc, req);
>> +               return ERR_PTR(-EOVERFLOW);
>> +       }
>>         return req;
>>
>>   out:
>> @@ -256,9 +259,12 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc,
>>         if (!req)
>>                 req = get_reserved_req(fc, file);
>>
>> -       fuse_req_init_context(fc, req);
>>         __set_bit(FR_WAITING, &req->flags);
>>         __clear_bit(FR_BACKGROUND, &req->flags);
>> +       if (unlikely(!fuse_req_init_context(fc, req))) {
>> +               fuse_put_request(fc, req);
>> +               return ERR_PTR(-EOVERFLOW);
>> +       }
>
> I think failing the "_nofail" variant is the wrong thing to do.  This
> is called to allocate a FLUSH request on close() and in readdirplus to
> allocate a FORGET request.  Failing the latter results in a refcount
> leak in userspace.  Failing the former results in posix locks not being
> released on close().

Doh!  You are quite correct.

Modifying fuse_get_req_nofail_nopages to fail is a bug.

I am thinking the proper solution is to write:

    static void fuse_req_init_context_nofail(struct fuse_req *req)
    {
            req->in.h.uid = 0;
            req->in.h.gid = 0;
            req->in.h.pid = 0;
    }

And use that in the nofail case, as it appears that neither flush nor
the eviction of inodes is an action triggered by user space, so user
space identifiers are meaningless in those cases.

I will respin this patch shortly.

Eric

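Below is a userspace sketch of the split proposed in this reply, using hypothetical model_* names. The normal path reports whether the translated ids are representable, so the caller can fail with -EOVERFLOW. The _nofail path never fails and fills in 0 for uid, gid and pid. The translation is an assumed stand-in that maps ids below a limit one-to-one and everything else to (uid_t)-1.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t model_id_t;
#define MODEL_INVALID_ID ((model_id_t)-1)

/* Hypothetical stand-in for the uid/gid/pid fields of the request header. */
struct model_in_header {
	model_id_t uid, gid, pid;
};

/* Assumed translation: ids below limit map 1:1, the rest are unmapped. */
static model_id_t model_translate(model_id_t id, model_id_t limit)
{
	return id < limit ? id : MODEL_INVALID_ID;
}

/* Normal path: translate, then report whether both ids are representable. */
static bool model_init_context(struct model_in_header *h, model_id_t fsuid,
			       model_id_t fsgid, model_id_t pid, model_id_t limit)
{
	h->uid = model_translate(fsuid, limit);
	h->gid = model_translate(fsgid, limit);
	h->pid = pid;
	return h->uid != MODEL_INVALID_ID && h->gid != MODEL_INVALID_ID;
}

/* Caller in the style of __fuse_get_req(): unrepresentable ids fail the request. */
static int model_get_req(struct model_in_header *h, model_id_t fsuid,
			 model_id_t fsgid, model_id_t pid, model_id_t limit)
{
	if (!model_init_context(h, fsuid, fsgid, pid, limit))
		return -EOVERFLOW;
	return 0;
}

/* _nofail path as proposed: never fails and identifies no user. */
static void model_init_context_nofail(struct model_in_header *h)
{
	h->uid = 0;
	h->gid = 0;
	h->pid = 0;
}
```

Keeping the nofail initialisation separate means FLUSH and FORGET are always sent, so neither the posix lock release nor the server's lookup count depends on the caller's ids being mappable.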

* Re: [PATCH v6 1/5] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read
  2018-02-22 10:13     ` Miklos Szeredi
@ 2018-02-22 19:04       ` Eric W. Biederman
  0 siblings, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-22 19:04 UTC
  To: Miklos Szeredi
  Cc: lkml, Linux Containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn

Miklos Szeredi <mszeredi@redhat.com> writes:

> On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> At the point of fuse_dev_do_read the user space process that initiated the
>> action on the fuse filesystem may no longer exist.  The process may have been
>> killed, or it may have fired an asynchronous request and exited.
>>
>> If the initial process has exited, the code "pid_vnr(find_pid_ns(in->h.pid,
>> fc->pid_ns))" will either return a pid of 0, or in the unlikely event that
>> the pid has been reallocated it can return practically any pid.  Any pid is
>> possible as the pid allocator allocates pid numbers in different pid
>> namespaces independently.
>>
>> The only way to make translation in fuse_dev_do_read reliable is to call
>> get_pid in fuse_req_init_context, and pid_vnr followed by put_pid in
>> fuse_dev_do_read.  That reference counting in other contexts has been shown
>> to bounce cache lines between processors and in general be slow.  So that is
>> not desirable.
>>
>> The only known user that runs the fuse server in a different pid namespace
>> from the filesystem does not care what the pids are in the fuse messages,
>> so removing this code should not matter.
>
> Shouldn't we at least zero out the pid in that case?

This is an explicit case of passing a file descriptor between pid
namespaces, so I think there are already plenty of "buyer beware" signs
out.  I don't know if there are any real-world advantages to zeroing
the pid.

I can see a case for using the pid namespace of the opener of /dev/fuse
instead of the pid namespace of the mounter of the fuse filesystem.
Although in practice I would be surprised if they were different.

I am very leery about looking at the current process during a read
operation.  Doing so during read/write tends to break caching, is error
prone (as the need for this patch demonstrates), and is generally likely
to be slower than not caring.

So yes, we can zero the pid.  But I don't think it is wise to zero the pid
unless we zero the pid in fuse_req_init_context.

Eric

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns
  2018-02-22 11:40     ` Miklos Szeredi
@ 2018-02-22 19:18       ` Eric W. Biederman
  2018-02-22 22:50         ` Eric W. Biederman
  0 siblings, 1 reply; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-22 19:18 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: lkml, Linux Containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn

Miklos Szeredi <mszeredi@redhat.com> writes:

> On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Ensure the translation happens by failing to read or write
>> posix acls when the filesystem has not indicated it supports
>> posix acls.
>
> For the first iteration this is fine, but  we could convert the raw
> xattrs as well, if we later want to, right?

I will say maybe.  This is tricky.  The code would not be too hard,
and the function to do the work, posix_acl_fix_xattr_userns, already
exists in fs/posix_acl.c.

I don't actually expect that to work long term.  I expect that, as the
kernel internals evolve, all filesystems that implement posix acls will
be expected to implement .get_acl and .set_acl.

I would have to reread the old thread that got us to this point with
posix acls before I could really understand the backwards compatible
fuse use case, and I would have to reread the rest of the acl processing
in the kernel before I could recall exactly what makes sense.

If there was an obvious way to whitelist xattrs that fuse can support
for user namespaces I think I would go for that.  Just to avoid future
problems with future xattrs.

Eric

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns
  2018-02-22 19:18       ` Eric W. Biederman
@ 2018-02-22 22:50         ` Eric W. Biederman
  2018-02-26  7:47           ` Miklos Szeredi
  0 siblings, 1 reply; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-22 22:50 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: lkml, Linux Containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn

ebiederm@xmission.com (Eric W. Biederman) writes:

> Miklos Szeredi <mszeredi@redhat.com> writes:
>
>> On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>> Ensure the translation happens by failing to read or write
>>> posix acls when the filesystem has not indicated it supports
>>> posix acls.
>>
>> For the first iteration this is fine, but  we could convert the raw
>> xattrs as well, if we later want to, right?
>
> I will say maybe.  This is tricky.   The code would not be too hard,
> and the function to do the work posix_acl_fix_xattr_userns already
> exists in fs/posix_acl.c
>
> I don't actually expect that to work longterm.  I expect the direction
> the kernel internals are moving is that all filesystems that implement
> posix acls will be expected to implement .get_acl and .set_acl.
>
> I would have to reread the old thread that got us to this point with
> posix acls before I could really understand the backwards compatible
> fuse use case, and I would have to reread the rest of the acl processing
> in the kernel before I could recall exactly what makes sense.
>
> If there was an obvious way to whitelist xattrs that fuse can support
> for user namespaces I think I would go for that.  Just to avoid future
> problems with future xattrs.

I am remembering why this is such a sticky issue.

Today when a posix acl is read from user space the code does:
      posix_acl_to_xattr(&init_user_ns, ...) in posix_acl_xattr_get
      posix_acl_fix_xattr_to_user() in getxattr

Similarly, when a posix acl is written from user space the code does:
      posix_acl_fix_xattr_from_user() in setxattr
      posix_acl_from_xattr(&init_user_ns, ...) in posix_acl_xattr_set

If every posix acl supporting filesystem in the kernel used
posix_acl_access_xattr_handler and posix_acl_default_xattr_handler, the
functions posix_acl_fix_xattr_to_user, posix_acl_fix_xattr_from_user,
and posix_acl_fix_xattr_userns could all be removed, and the posix acl
handling could be that little bit simpler and faster.

So if we could figure out how to use the generic acl support for the old
brand of fuse filesystems that don't set FUSE_POSIX_ACL it would be much
easier to support them long term.

Eric

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns
  2018-02-22 22:50         ` Eric W. Biederman
@ 2018-02-26  7:47           ` Miklos Szeredi
  2018-02-26 16:35             ` Eric W. Biederman
  0 siblings, 1 reply; 89+ messages in thread
From: Miklos Szeredi @ 2018-02-26  7:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: lkml, Linux Containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn

On Thu, Feb 22, 2018 at 11:50 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:

> So if we could figure out how to use the generic acl support for the old
> brand of fuse filesystems that don't set FUSE_POSIX_ACL it would be much
> easier to support them long term.

Simplest and most robust way seems to be to do everything the same (as
with FUSE_POSIX_ACL) but tell the vfs not to cache the acl.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns
  2018-02-26  7:47           ` Miklos Szeredi
@ 2018-02-26 16:35             ` Eric W. Biederman
  2018-02-26 21:51               ` Eric W. Biederman
  0 siblings, 1 reply; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-26 16:35 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: lkml, Linux Containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn

Miklos Szeredi <mszeredi@redhat.com> writes:

> On Thu, Feb 22, 2018 at 11:50 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>
>> So if we could figure out how to use the generic acl support for the old
>> brand of fuse filesystems that don't set FUSE_POSIX_ACL it would be much
>> easier to support them long term.
>
> Simplest and most robust way seems to be to do everything the same (as
> with FUSE_POSIX_ACL) but tell the vfs not to cache the acl.

Good point.  That sounds like for the !fc->posix_acl case we just
need a careful use of "forget_all_cached_acls(inode)".

I will take a quick look at that, and see if that is easy/sufficient to
cover the legacy fuse case.  Otherwise I will go with what I already
have here.

That feels like a better path.  And internally I would rename what is
today fc->posix_acl to fc->cached_posix_acl, to better convey the intent.
Fingers crossed.

Eric

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns
  2018-02-26 16:35             ` Eric W. Biederman
@ 2018-02-26 21:51               ` Eric W. Biederman
  0 siblings, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-26 21:51 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: lkml, Linux Containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn

ebiederm@xmission.com (Eric W. Biederman) writes:

> Miklos Szeredi <mszeredi@redhat.com> writes:
>
>> On Thu, Feb 22, 2018 at 11:50 PM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>
>>> So if we could figure out how to use the generic acl support for the old
>>> brand of fuse filesystems that don't set FUSE_POSIX_ACL it would be much
>>> easier to support them long term.
>>
>> Simplest and most robust way seems to be to do everything the same (as
>> with FUSE_POSIX_ACL) but tell the vfs not to cache the acl.
>
> Good point.  That sounds like for the !fc->posix_acl case we just
> need a careful use of "forget_all_cached_acls(inode)".
>
> I will take a quick look at that, and see if that is easy/sufficient to
> cover the legacy fuse case.  Otherwise I will go with what I already
> have here.
>
> That feels like a better path.  And internally I would call what is
> today fc->posix_acl fc->cached_posix_acl.  To better convey the intent.
> Fingers crossed.

It looks like simply setting
"inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE;" is the secret
sauce needed to disable caching in the legacy case and make everything
work.

I had to tweak the calls to forget_all_cached_acls so that they won't clear
the ACL_DONT_CACHE status, but otherwise that was an absolutely trivial
change to combine those two code paths.

I will post my updated patches shortly.

Eric

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v7 0/7] fuse: mounts from non-init user namespaces
  2018-02-21 20:24 ` [PATCH v6 0/6] fuse: mounts from non-init user namespaces Eric W. Biederman
                     ` (4 preceding siblings ...)
  2018-02-21 20:29   ` [PATCH v6 5/5] fuse: Restrict allow_other to the superblock's namespace or a descendant Eric W. Biederman
@ 2018-02-26 23:52   ` Eric W. Biederman
  2018-02-26 23:52     ` [PATCH v7 1/7] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read Eric W. Biederman
                       ` (7 more replies)
  5 siblings, 8 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-26 23:52 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn


This patchset builds on the work by Dongsu Park and Seth Forshee and is
reduced to the set of patches that just affect fuse.  The non-fuse
patches are far enough along that we can ignore them, except possibly for
the question of when FS_USERNS_MOUNT gets set in fuse_fs_type.

Fuse with a block device has been left as an exercise for a later time.

Since v5 I changed the core of this patchset around as the previous
patches were showing signs of bitrot.  Some important explanations were
missing, some important functionality was missing, and xattr handling
was completely absent.

Since v6 I have:
- Removed the failure case from fuse_get_req_nofail_nopages that I
  added.
- Updated fuse to always use posix_acl_access_xattr_handler and
  posix_acl_default_xattr_handler, by teaching fuse to set
  ACL_DONT_CACHE when FUSE_POSIX_ACL is not set.

Miklos, can you take a look and see what you think?

I think this subset of the fuse changes is ready, and as such I would
like to get it in this development cycle if possible.

These changes are also available at:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git userns-fuse-v7

Eric W. Biederman (6):
      fuse: Remove the buggy retranslation of pids in fuse_dev_do_read
      fuse: Fail all requests with invalid uids or gids
      fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE
      fuse: Cache a NULL acl when FUSE_GETXATTR returns -ENOSYS
      fuse: Simplfiy the posix acl handling logic.
      fuse: Support fuse filesystems outside of init_user_ns

Seth Forshee (1):
      fuse: Restrict allow_other to the superblock's namespace or a descendant

 fs/fuse/acl.c           | 10 +++++-----
 fs/fuse/cuse.c          |  7 ++++++-
 fs/fuse/dev.c           | 30 +++++++++++++++++-------------
 fs/fuse/dir.c           | 27 +++++++++++++--------------
 fs/fuse/fuse_i.h        | 11 ++++++++---
 fs/fuse/inode.c         | 44 +++++++++++++++++++++++++++++---------------
 fs/fuse/xattr.c         |  6 +-----
 fs/posix_acl.c          |  7 +++++--
 kernel/user_namespace.c |  1 +
 9 files changed, 85 insertions(+), 58 deletions(-)

Eric

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v7 1/7] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read
  2018-02-26 23:52   ` [PATCH v7 0/7] fuse: mounts from non-init user namespaces Eric W. Biederman
@ 2018-02-26 23:52     ` Eric W. Biederman
  2018-02-26 23:52     ` [PATCH v7 2/7] fuse: Fail all requests with invalid uids or gids Eric W. Biederman
                       ` (6 subsequent siblings)
  7 siblings, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-26 23:52 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Eric W. Biederman

At the point of fuse_dev_do_read, the user space process that initiated the
action on the fuse filesystem may no longer exist.  The process may have been
killed or may have fired an asynchronous request and exited.

If the initial process has exited, the code "pid_vnr(find_pid_ns(in->h.pid,
fc->pid_ns))" will either return a pid of 0, or in the unlikely event that
the pid has been reallocated it can return practically any pid.  Any pid is
possible as the pid allocator allocates pid numbers in different pid
namespaces independently.

The only way to make translation in fuse_dev_do_read reliable is to call
get_pid in fuse_req_init_context, and pid_vnr followed by put_pid in
fuse_dev_do_read.  That reference counting in other contexts has been shown
to bounce cache lines between processors and in general be slow.  So that is
not desirable.

The only known user of running the fuse server in a different pid namespace
from the filesystem does not care what the pids are in the fuse messages
so removing this code should not matter.

Getting the translation to a server running outside of the pid namespace
of a container can still be achieved by playing setns games at mount time.
It is also possible to add an option to pass a pid namespace into the fuse
filesystem at mount time.

Fixes: 5d6d3a301c4e ("fuse: allow server to run in different pid_ns")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/fuse/dev.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 5d06384c2cae..0fb58f364fa6 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1260,12 +1260,6 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
 	in = &req->in;
 	reqsize = in->h.len;
 
-	if (task_active_pid_ns(current) != fc->pid_ns) {
-		rcu_read_lock();
-		in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
-		rcu_read_unlock();
-	}
-
 	/* If request is too large, reply with an error and restart the read */
 	if (nbytes < reqsize) {
 		req->out.h.error = -EIO;
-- 
2.14.1

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v7 2/7] fuse: Fail all requests with invalid uids or gids
  2018-02-26 23:52   ` [PATCH v7 0/7] fuse: mounts from non-init user namespaces Eric W. Biederman
  2018-02-26 23:52     ` [PATCH v7 1/7] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read Eric W. Biederman
@ 2018-02-26 23:52     ` Eric W. Biederman
  2018-02-26 23:52     ` [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE Eric W. Biederman
                       ` (5 subsequent siblings)
  7 siblings, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-26 23:52 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Eric W. Biederman

Upon a cursory examination, the uid and gid of a fuse request are
necessary for correct operation.  Failing a fuse request where those
values are not reliable seems a straightforward and reliable means of
ensuring that fuse requests with bad data are not sent or processed.

In most cases the vfs will avoid actions that it suspects would cause
write back of an inode with an invalid uid or gid.  But that does
not map precisely to what fuse is doing, so test for this and solve
it at the fuse level as well.

Performing this work in fuse_req_init_context is cheap as the code is
already performing the translation here and only needs to check the
result of the translation to see if things are not representable in
a form the fuse server can handle.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/fuse/dev.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 0fb58f364fa6..2886a56d5f61 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -112,11 +112,20 @@ static void __fuse_put_request(struct fuse_req *req)
 	refcount_dec(&req->count);
 }
 
-static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
+static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
 {
-	req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
-	req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
+	req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
+	req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
 	req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
+
+	return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
+}
+
+static void fuse_req_init_context_nofail(struct fuse_req *req)
+{
+	req->in.h.uid = 0;
+	req->in.h.gid = 0;
+	req->in.h.pid = 0;
 }
 
 void fuse_set_initialized(struct fuse_conn *fc)
@@ -162,12 +171,13 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
 			wake_up(&fc->blocked_waitq);
 		goto out;
 	}
-
-	fuse_req_init_context(fc, req);
 	__set_bit(FR_WAITING, &req->flags);
 	if (for_background)
 		__set_bit(FR_BACKGROUND, &req->flags);
-
+	if (unlikely(!fuse_req_init_context(fc, req))) {
+		fuse_put_request(fc, req);
+		return ERR_PTR(-EOVERFLOW);
+	}
 	return req;
 
  out:
@@ -256,7 +266,7 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc,
 	if (!req)
 		req = get_reserved_req(fc, file);
 
-	fuse_req_init_context(fc, req);
+	fuse_req_init_context_nofail(req);
 	__set_bit(FR_WAITING, &req->flags);
 	__clear_bit(FR_BACKGROUND, &req->flags);
 	return req;
-- 
2.14.1

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE
  2018-02-26 23:52   ` [PATCH v7 0/7] fuse: mounts from non-init user namespaces Eric W. Biederman
  2018-02-26 23:52     ` [PATCH v7 1/7] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read Eric W. Biederman
  2018-02-26 23:52     ` [PATCH v7 2/7] fuse: Fail all requests with invalid uids or gids Eric W. Biederman
@ 2018-02-26 23:52     ` Eric W. Biederman
  2018-02-27  1:13       ` Linus Torvalds
  2018-02-26 23:52     ` [PATCH v7 4/7] fuse: Cache a NULL acl when FUSE_GETXATTR returns -ENOSYS Eric W. Biederman
                       ` (4 subsequent siblings)
  7 siblings, 1 reply; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-26 23:52 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Eric W. Biederman

Fuse is about to join overlayfs in relying on get_acl respecting
ACL_DONT_CACHE so update the documentation in get_acl to reflect that
fact.  The comment and this change description should give people a
clue that respecting ACL_DONT_CACHE in get_acl is important, and they
should audit the filesystems before removing that support.

Additionally, update the comment above the call to get_acl itself and
remove the wrong information that an implementation of get_acl can
prevent caching by calling forget_cached_acl.  Replace that with the
correct information that to prevent caching all that is necessary is
to set inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE when the
inode is initialized.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/posix_acl.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 2fd0fde16fe1..3c24fc263401 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -121,14 +121,17 @@ struct posix_acl *get_acl(struct inode *inode, int type)
 	 * could wait for that other task to complete its job, but it's easier
 	 * to just call ->get_acl to fetch the ACL ourself.  (This is going to
 	 * be an unlikely race.)
+	 *
+	 * ACL_DONT_CACHE is treated as another task updating the acl and
+	 * remains set.
 	 */
 	if (cmpxchg(p, ACL_NOT_CACHED, sentinel) != ACL_NOT_CACHED)
 		/* fall through */ ;
 
 	/*
 	 * Normally, the ACL returned by ->get_acl will be cached.
-	 * A filesystem can prevent that by calling
-	 * forget_cached_acl(inode, type) in ->get_acl.
+	 * A filesystem can prevent that by calling setting
+	 * inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE.
 	 *
 	 * If the filesystem doesn't have a get_acl() function at all, we'll
 	 * just create the negative cache entry.
-- 
2.14.1

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v7 4/7] fuse: Cache a NULL acl when FUSE_GETXATTR returns -ENOSYS
  2018-02-26 23:52   ` [PATCH v7 0/7] fuse: mounts from non-init user namespaces Eric W. Biederman
                       ` (2 preceding siblings ...)
  2018-02-26 23:52     ` [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE Eric W. Biederman
@ 2018-02-26 23:52     ` Eric W. Biederman
  2018-02-26 23:53     ` [PATCH v7 5/7] fuse: Simplfiy the posix acl handling logic Eric W. Biederman
                       ` (3 subsequent siblings)
  7 siblings, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-26 23:52 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Eric W. Biederman

When FUSE_GETXATTR will never return anything, call cache_no_acl to
cache that state in the vfs as well as in fuse with fc->no_getxattr.

The only code paths this affects are those that call fuse_get_acl,
and caching a NULL or returning it immediately has exactly the same
effect, so this should not affect anything.

This keeps the vfs from wasting its time calling down into fuse
when fuse isn't going to do anything, and it makes it clear
when a NULL should be cached for optimal performance.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/fuse/xattr.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
index 3caac46b08b0..0520a4f47226 100644
--- a/fs/fuse/xattr.c
+++ b/fs/fuse/xattr.c
@@ -82,6 +82,7 @@ ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
 		ret = min_t(ssize_t, outarg.size, XATTR_SIZE_MAX);
 	if (ret == -ENOSYS) {
 		fc->no_getxattr = 1;
+		cache_no_acl(inode);
 		ret = -EOPNOTSUPP;
 	}
 	return ret;
-- 
2.14.1

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v7 5/7] fuse: Simplfiy the posix acl handling logic.
  2018-02-26 23:52   ` [PATCH v7 0/7] fuse: mounts from non-init user namespaces Eric W. Biederman
                       ` (3 preceding siblings ...)
  2018-02-26 23:52     ` [PATCH v7 4/7] fuse: Cache a NULL acl when FUSE_GETXATTR returns -ENOSYS Eric W. Biederman
@ 2018-02-26 23:53     ` Eric W. Biederman
  2018-02-27  9:00       ` Miklos Szeredi
  2018-02-26 23:53     ` [PATCH v7 6/7] fuse: Support fuse filesystems outside of init_user_ns Eric W. Biederman
                       ` (2 subsequent siblings)
  7 siblings, 1 reply; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-26 23:53 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Eric W. Biederman

Rename the fuse connection flag posix_acl to cached_posix_acl, as that
is what it actually means: fuse will cache, and operate on, the
cached value of the posix acl.

When fc->cached_posix_acl is not set, set ACL_DONT_CACHE on the inode
so that get_acl and friends won't cache the acl values even if they
are called.

Replace forget_all_cached_acls with fuse_forget_cached_acls.  This
wrapper only takes effect when cached_posix_acl is true, to prevent
losing the nocache or noxattr status when posix acls are not
cached.

Always use posix_acl_access_xattr_handler and
posix_acl_default_xattr_handler so the fuse code benefits from the
generic posix acl handlers as much as possible.  This will become
important as the code works on translation of uid and gid in the
posix acls when fuse is not mounted in the initial user namespace.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/fuse/acl.c    |  6 +++---
 fs/fuse/dir.c    | 11 +++++------
 fs/fuse/fuse_i.h |  5 +++--
 fs/fuse/inode.c  | 13 ++++++++++---
 fs/fuse/xattr.c  |  5 -----
 5 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index ec85765502f1..8fb2153dbf50 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -19,7 +19,7 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type)
 	void *value = NULL;
 	struct posix_acl *acl;
 
-	if (!fc->posix_acl || fc->no_getxattr)
+	if (fc->no_getxattr)
 		return NULL;
 
 	if (type == ACL_TYPE_ACCESS)
@@ -53,7 +53,7 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 	const char *name;
 	int ret;
 
-	if (!fc->posix_acl || fc->no_setxattr)
+	if (fc->no_setxattr)
 		return -EOPNOTSUPP;
 
 	if (type == ACL_TYPE_ACCESS)
@@ -92,7 +92,7 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 	} else {
 		ret = fuse_removexattr(inode, name);
 	}
-	forget_all_cached_acls(inode);
+	fuse_forget_cached_acls(inode);
 	fuse_invalidate_attr(inode);
 
 	return ret;
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 24967382a7b1..a44ca509db4f 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -237,7 +237,7 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
 		if (ret || (outarg.attr.mode ^ inode->i_mode) & S_IFMT)
 			goto invalid;
 
-		forget_all_cached_acls(inode);
+		fuse_forget_cached_acls(inode);
 		fuse_change_attributes(inode, &outarg.attr,
 				       entry_attr_timeout(&outarg),
 				       attr_version);
@@ -930,7 +930,7 @@ static int fuse_update_get_attr(struct inode *inode, struct file *file,
 	int err = 0;
 
 	if (time_before64(fi->i_time, get_jiffies_64())) {
-		forget_all_cached_acls(inode);
+		fuse_forget_cached_acls(inode);
 		err = fuse_do_getattr(inode, stat, file);
 	} else if (stat) {
 		generic_fillattr(inode, stat);
@@ -1076,7 +1076,7 @@ static int fuse_perm_getattr(struct inode *inode, int mask)
 	if (mask & MAY_NOT_BLOCK)
 		return -ECHILD;
 
-	forget_all_cached_acls(inode);
+	fuse_forget_cached_acls(inode);
 	return fuse_do_getattr(inode, NULL, NULL);
 }
 
@@ -1246,7 +1246,7 @@ static int fuse_direntplus_link(struct file *file,
 		fi->nlookup++;
 		spin_unlock(&fc->lock);
 
-		forget_all_cached_acls(inode);
+		fuse_forget_cached_acls(inode);
 		fuse_change_attributes(inode, &o->attr,
 				       entry_attr_timeout(o),
 				       attr_version);
@@ -1764,8 +1764,7 @@ static int fuse_setattr(struct dentry *entry, struct iattr *attr)
 		 * If filesystem supports acls it may have updated acl xattrs in
 		 * the filesystem, so forget cached acls for the inode.
 		 */
-		if (fc->posix_acl)
-			forget_all_cached_acls(inode);
+		fuse_forget_cached_acls(inode);
 
 		/* Directory mode changed, may need to revalidate access */
 		if (d_is_dir(entry) && (attr->ia_valid & ATTR_MODE))
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index c4c093bbf456..3cf296d60bc0 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -619,7 +619,7 @@ struct fuse_conn {
 	unsigned no_lseek:1;
 
 	/** Does the filesystem support posix acls? */
-	unsigned posix_acl:1;
+	unsigned cached_posix_acl:1;
 
 	/** Check permissions based on the file mode or not? */
 	unsigned default_permissions:1;
@@ -913,6 +913,8 @@ void fuse_release_nowrite(struct inode *inode);
 
 u64 fuse_get_attr_version(struct fuse_conn *fc);
 
+void fuse_forget_cached_acls(struct inode *inode);
+
 /**
  * File-system tells the kernel to invalidate cache for the given node id.
  */
@@ -974,7 +976,6 @@ ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
 ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size);
 int fuse_removexattr(struct inode *inode, const char *name);
 extern const struct xattr_handler *fuse_xattr_handlers[];
-extern const struct xattr_handler *fuse_acl_xattr_handlers[];
 
 struct posix_acl;
 struct posix_acl *fuse_get_acl(struct inode *inode, int type);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 624f18bbfd2b..0c3ccca7c554 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -313,6 +313,8 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
 		if (!fc->writeback_cache || !S_ISREG(attr->mode))
 			inode->i_flags |= S_NOCMTIME;
 		inode->i_generation = generation;
+		if (!fc->cached_posix_acl)
+			inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE;
 		fuse_init_inode(inode, attr);
 		unlock_new_inode(inode);
 	} else if ((inode->i_mode ^ attr->mode) & S_IFMT) {
@@ -331,6 +333,12 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
 	return inode;
 }
 
+void fuse_forget_cached_acls(struct inode *inode)
+{
+	if (get_fuse_conn(inode)->cached_posix_acl)
+		forget_all_cached_acls(inode);
+}
+
 int fuse_reverse_inval_inode(struct super_block *sb, u64 nodeid,
 			     loff_t offset, loff_t len)
 {
@@ -343,7 +351,7 @@ int fuse_reverse_inval_inode(struct super_block *sb, u64 nodeid,
 		return -ENOENT;
 
 	fuse_invalidate_attr(inode);
-	forget_all_cached_acls(inode);
+	fuse_forget_cached_acls(inode);
 	if (offset >= 0) {
 		pg_start = offset >> PAGE_SHIFT;
 		if (len <= 0)
@@ -915,8 +923,7 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req)
 				fc->sb->s_time_gran = arg->time_gran;
 			if ((arg->flags & FUSE_POSIX_ACL)) {
 				fc->default_permissions = 1;
-				fc->posix_acl = 1;
-				fc->sb->s_xattr = fuse_acl_xattr_handlers;
+				fc->cached_posix_acl = 1;
 			}
 		} else {
 			ra_pages = fc->max_read / PAGE_SIZE;
diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
index 0520a4f47226..48a95e1bb020 100644
--- a/fs/fuse/xattr.c
+++ b/fs/fuse/xattr.c
@@ -200,11 +200,6 @@ static const struct xattr_handler fuse_xattr_handler = {
 };
 
 const struct xattr_handler *fuse_xattr_handlers[] = {
-	&fuse_xattr_handler,
-	NULL
-};
-
-const struct xattr_handler *fuse_acl_xattr_handlers[] = {
 	&posix_acl_access_xattr_handler,
 	&posix_acl_default_xattr_handler,
 	&fuse_xattr_handler,
-- 
2.14.1

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v7 6/7] fuse: Support fuse filesystems outside of init_user_ns
  2018-02-26 23:52   ` [PATCH v7 0/7] fuse: mounts from non-init user namespaces Eric W. Biederman
                       ` (4 preceding siblings ...)
  2018-02-26 23:53     ` [PATCH v7 5/7] fuse: Simplfiy the posix acl handling logic Eric W. Biederman
@ 2018-02-26 23:53     ` Eric W. Biederman
  2018-02-26 23:53     ` [PATCH v7 7/7] fuse: Restrict allow_other to the superblock's namespace or a descendant Eric W. Biederman
  2018-03-02 21:58     ` [PATCH v8 0/6] fuse: mounts from non-init user namespaces Eric W. Biederman
  7 siblings, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-26 23:53 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Eric W. Biederman

In order to support mounts from namespaces other than init_user_ns,
fuse must translate uids and gids to/from the userns of the process
servicing requests on /dev/fuse. This patch does that, with a couple
of restrictions on the namespace:

 - The userns for the fuse connection is fixed to the namespace
   from which /dev/fuse is opened.

 - The namespace must be the same as s_user_ns.

These restrictions simplify the implementation by avoiding the need to
pass around userns references and by allowing fuse to rely on the
checks in setattr_prepare for ownership changes.  Either restriction
could be relaxed in the future if needed.
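The uid/gid translation above can be pictured with a small userspace model. This is a hedged sketch, not the kernel's code. Real id maps can hold several extents and are reached through make_kuid()/from_kuid(). Here a single hypothetical `struct id_extent` stands in for one line of a uid_map, and `map_id_down()`, `map_id_up()` and `req_ids_representable()` are illustrative names. The last one mirrors how fuse_req_init_context() in this patch refuses a request whose fsuid or fsgid has no mapping in the server's namespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INVALID_ID ((uint32_t)-1)

/*
 * Hypothetical one-line id map, in the spirit of /proc/<pid>/uid_map:
 * ids [first, first + count) inside the namespace correspond to
 * [lower_first, lower_first + count) in the initial namespace.
 */
struct id_extent {
	uint32_t first;       /* first id as seen inside the namespace */
	uint32_t lower_first; /* matching id in the initial namespace */
	uint32_t count;       /* length of the range */
};

/* Plays the role of make_kuid(): namespace id -> global id. */
static uint32_t map_id_down(const struct id_extent *e, uint32_t id)
{
	assert(e->count > 0);
	/* Unsigned wrap-around pushes ids below 'first' out of range. */
	if (id - e->first < e->count)
		return e->lower_first + (id - e->first);
	return INVALID_ID;
}

/* Plays the role of from_kuid(): global id -> namespace id. */
static uint32_t map_id_up(const struct id_extent *e, uint32_t kid)
{
	assert(e->count > 0);
	if (kid - e->lower_first < e->count)
		return e->first + (kid - e->lower_first);
	return INVALID_ID;
}

/*
 * A request is only sent if the caller's fsuid and fsgid can both be
 * represented in the userns of the fuse server.
 */
static bool req_ids_representable(const struct id_extent *uids,
				  const struct id_extent *gids,
				  uint32_t fsuid, uint32_t fsgid,
				  uint32_t *out_uid, uint32_t *out_gid)
{
	*out_uid = map_id_up(uids, fsuid);
	*out_gid = map_id_up(gids, fsgid);
	return *out_uid != INVALID_ID && *out_gid != INVALID_ID;
}
```

For a namespace mapping ids 0..65535 onto 100000..165535, a caller with global fsuid 101000 reaches the server as uid 1000, while a caller with global uid 0 (unmapped) cannot issue a request at all.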

For cuse, the userns used is that of the opener of /dev/cuse.  Semantically
the cuse support does not appear safe for unprivileged users.  In practice,
the permissions on /dev/cuse make it accessible only to the global root
user.  If something slips through the cracks in a user namespace, the only
users who will be able to use the cuse device are those users mapped into
the user namespace.

Translation in the posix acl code is updated to use the user namespace of
the filesystem.  Avoiding cases which might bypass this translation is
handled in a following change.

This change is strongly based on a similar change from Seth Forshee
and Dongsu Park.

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: <seth.forshee@canonical.com>
Cc: Dongsu Park <dongsu@kinvolk.io>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/fuse/acl.c    |  4 ++--
 fs/fuse/cuse.c   |  7 ++++++-
 fs/fuse/dev.c    |  4 ++--
 fs/fuse/dir.c    | 14 +++++++-------
 fs/fuse/fuse_i.h |  6 +++++-
 fs/fuse/inode.c  | 31 +++++++++++++++++++------------
 6 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index 8fb2153dbf50..5a67c80e21d6 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -34,7 +34,7 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type)
 		return ERR_PTR(-ENOMEM);
 	size = fuse_getxattr(inode, name, value, PAGE_SIZE);
 	if (size > 0)
-		acl = posix_acl_from_xattr(&init_user_ns, value, size);
+		acl = posix_acl_from_xattr(fc->user_ns, value, size);
 	else if ((size == 0) || (size == -ENODATA) ||
 		 (size == -EOPNOTSUPP && fc->no_getxattr))
 		acl = NULL;
@@ -81,7 +81,7 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 		if (!value)
 			return -ENOMEM;
 
-		ret = posix_acl_to_xattr(&init_user_ns, acl, value, size);
+		ret = posix_acl_to_xattr(fc->user_ns, acl, value, size);
 		if (ret < 0) {
 			kfree(value);
 			return ret;
diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index e9e97803442a..036ee477669e 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -48,6 +48,7 @@
 #include <linux/stat.h>
 #include <linux/module.h>
 #include <linux/uio.h>
+#include <linux/user_namespace.h>
 
 #include "fuse_i.h"
 
@@ -498,7 +499,11 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
 	if (!cc)
 		return -ENOMEM;
 
-	fuse_conn_init(&cc->fc);
+	/*
+	 * Limit the cuse channel to requests that can
+	 * be represented in file->f_cred->user_ns.
+	 */
+	fuse_conn_init(&cc->fc, file->f_cred->user_ns);
 
 	fud = fuse_dev_alloc(&cc->fc);
 	if (!fud) {
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 2886a56d5f61..fce7915aea13 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
 
 static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
 {
-	req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
-	req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
+	req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
+	req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
 	req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
 
 	return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index a44ca509db4f..79cca1687457 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 	stat->ino = attr->ino;
 	stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
 	stat->nlink = attr->nlink;
-	stat->uid = make_kuid(&init_user_ns, attr->uid);
-	stat->gid = make_kgid(&init_user_ns, attr->gid);
+	stat->uid = make_kuid(fc->user_ns, attr->uid);
+	stat->gid = make_kgid(fc->user_ns, attr->gid);
 	stat->rdev = inode->i_rdev;
 	stat->atime.tv_sec = attr->atime;
 	stat->atime.tv_nsec = attr->atimensec;
@@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
 	return true;
 }
 
-static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
-			   bool trust_local_cmtime)
+static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
+			   struct fuse_setattr_in *arg, bool trust_local_cmtime)
 {
 	unsigned ivalid = iattr->ia_valid;
 
 	if (ivalid & ATTR_MODE)
 		arg->valid |= FATTR_MODE,   arg->mode = iattr->ia_mode;
 	if (ivalid & ATTR_UID)
-		arg->valid |= FATTR_UID,    arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
+		arg->valid |= FATTR_UID,    arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
 	if (ivalid & ATTR_GID)
-		arg->valid |= FATTR_GID,    arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
+		arg->valid |= FATTR_GID,    arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
 	if (ivalid & ATTR_SIZE)
 		arg->valid |= FATTR_SIZE,   arg->size = iattr->ia_size;
 	if (ivalid & ATTR_ATIME) {
@@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
 
 	memset(&inarg, 0, sizeof(inarg));
 	memset(&outarg, 0, sizeof(outarg));
-	iattr_to_fattr(attr, &inarg, trust_local_cmtime);
+	iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
 	if (file) {
 		struct fuse_file *ff = file->private_data;
 		inarg.valid |= FATTR_FH;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 3cf296d60bc0..eba0beea8634 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -26,6 +26,7 @@
 #include <linux/xattr.h>
 #include <linux/pid_namespace.h>
 #include <linux/refcount.h>
+#include <linux/user_namespace.h>
 
 /** Max number of pages that can be used in a single read request */
 #define FUSE_MAX_PAGES_PER_REQ 32
@@ -466,6 +467,9 @@ struct fuse_conn {
 	/** The pid namespace for this mount */
 	struct pid_namespace *pid_ns;
 
+	/** The user namespace for this mount */
+	struct user_namespace *user_ns;
+
 	/** Maximum read size */
 	unsigned max_read;
 
@@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
 /**
  * Initialize fuse_conn
  */
-void fuse_conn_init(struct fuse_conn *fc);
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
 
 /**
  * Release reference to fuse_conn
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 0c3ccca7c554..cd3d29610688 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 	inode->i_ino     = fuse_squash_ino(attr->ino);
 	inode->i_mode    = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
 	set_nlink(inode, attr->nlink);
-	inode->i_uid     = make_kuid(&init_user_ns, attr->uid);
-	inode->i_gid     = make_kgid(&init_user_ns, attr->gid);
+	inode->i_uid     = make_kuid(fc->user_ns, attr->uid);
+	inode->i_gid     = make_kgid(fc->user_ns, attr->gid);
 	inode->i_blocks  = attr->blocks;
 	inode->i_atime.tv_sec   = attr->atime;
 	inode->i_atime.tv_nsec  = attr->atimensec;
@@ -485,7 +485,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
 	return err;
 }
 
-static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
+static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
+			  struct user_namespace *user_ns)
 {
 	char *p;
 	memset(d, 0, sizeof(struct fuse_mount_data));
@@ -521,7 +522,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
 		case OPT_USER_ID:
 			if (fuse_match_uint(&args[0], &uv))
 				return 0;
-			d->user_id = make_kuid(current_user_ns(), uv);
+			d->user_id = make_kuid(user_ns, uv);
 			if (!uid_valid(d->user_id))
 				return 0;
 			d->user_id_present = 1;
@@ -530,7 +531,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
 		case OPT_GROUP_ID:
 			if (fuse_match_uint(&args[0], &uv))
 				return 0;
-			d->group_id = make_kgid(current_user_ns(), uv);
+			d->group_id = make_kgid(user_ns, uv);
 			if (!gid_valid(d->group_id))
 				return 0;
 			d->group_id_present = 1;
@@ -573,8 +574,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
 	struct super_block *sb = root->d_sb;
 	struct fuse_conn *fc = get_fuse_conn_super(sb);
 
-	seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
-	seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
+	seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
+	seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
 	if (fc->default_permissions)
 		seq_puts(m, ",default_permissions");
 	if (fc->allow_other)
@@ -605,7 +606,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
 	fpq->connected = 1;
 }
 
-void fuse_conn_init(struct fuse_conn *fc)
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
 {
 	memset(fc, 0, sizeof(*fc));
 	spin_lock_init(&fc->lock);
@@ -629,6 +630,7 @@ void fuse_conn_init(struct fuse_conn *fc)
 	fc->attr_version = 1;
 	get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
 	fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
+	fc->user_ns = get_user_ns(user_ns);
 }
 EXPORT_SYMBOL_GPL(fuse_conn_init);
 
@@ -638,6 +640,7 @@ void fuse_conn_put(struct fuse_conn *fc)
 		if (fc->destroy_req)
 			fuse_request_free(fc->destroy_req);
 		put_pid_ns(fc->pid_ns);
+		put_user_ns(fc->user_ns);
 		fc->release(fc);
 	}
 }
@@ -1068,7 +1071,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 
 	sb->s_flags &= ~(SB_NOSEC | SB_I_VERSION);
 
-	if (!parse_fuse_opt(data, &d, is_bdev))
+	if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
 		goto err;
 
 	if (is_bdev) {
@@ -1093,8 +1096,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	if (!file)
 		goto err;
 
-	if ((file->f_op != &fuse_dev_operations) ||
-	    (file->f_cred->user_ns != &init_user_ns))
+	/*
+	 * Require mount to happen from the same user namespace which
+	 * opened /dev/fuse to prevent potential attacks.
+	 */
+	if (file->f_op != &fuse_dev_operations ||
+	    file->f_cred->user_ns != sb->s_user_ns)
 		goto err_fput;
 
 	fc = kmalloc(sizeof(*fc), GFP_KERNEL);
@@ -1102,7 +1109,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	if (!fc)
 		goto err_fput;
 
-	fuse_conn_init(fc);
+	fuse_conn_init(fc, sb->s_user_ns);
 	fc->release = fuse_free_conn;
 
 	fud = fuse_dev_alloc(fc);
-- 
2.14.1

* [PATCH v7 7/7] fuse: Restrict allow_other to the superblock's namespace or a descendant
  2018-02-26 23:52   ` [PATCH v7 0/7] fuse: mounts from non-init user namespaces Eric W. Biederman
                       ` (5 preceding siblings ...)
  2018-02-26 23:53     ` [PATCH v7 6/7] fuse: Support fuse filesystems outside of init_user_ns Eric W. Biederman
@ 2018-02-26 23:53     ` Eric W. Biederman
  2018-03-02 21:58     ` [PATCH v8 0/6] fuse: mounts from non-init user namespaces Eric W. Biederman
  7 siblings, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-26 23:53 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Eric W. Biederman

From: Seth Forshee <seth.forshee@canonical.com>

Unprivileged users are normally restricted from mounting with the
allow_other option by system policy, but this could be bypassed
for a mount done with user namespace root permissions. In such
cases allow_other should not allow users outside the userns
to access the mount, as doing so would give the unprivileged user
the ability to manipulate processes it would otherwise be unable
to manipulate. Restrict allow_other to apply to users in the same
userns used at mount time or in a descendant of that namespace. Also
export current_in_userns() for use by fuse when built as a
module.
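The "same userns or a descendant" rule amounts to an ancestry test. A minimal model, assuming a pared-down hypothetical `struct userns` that keeps only a parent link and a nesting level: starting from the caller's namespace, walk up until reaching the depth of the mount's namespace, then compare pointers. `in_userns_sketch()` and `allow_other_permits()` are illustrative names. In the kernel, current_in_userns() performs the equivalent walk via in_userns(), starting from the current task's namespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, pared-down user namespace: only ancestry matters here. */
struct userns {
	const struct userns *parent; /* NULL for the initial namespace */
	int level;                   /* 0 for the initial namespace */
};

/* True if ns is target itself or one of target's descendants. */
static bool in_userns_sketch(const struct userns *target,
			     const struct userns *ns)
{
	assert(target != NULL);
	/* Climb from ns until we are no deeper than target. */
	while (ns && ns->level > target->level)
		ns = ns->parent;
	return ns == target;
}

/* allow_other as restricted by this patch. */
static bool allow_other_permits(const struct userns *mount_ns,
				const struct userns *caller_ns)
{
	return in_userns_sketch(mount_ns, caller_ns);
}
```

With a chain init, then A, then B below A, plus a sibling S of A, a mount made in A admits callers in A and B but not callers in init or S.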

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/fuse/dir.c           | 2 +-
 kernel/user_namespace.c | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 79cca1687457..0cbd1ff3dd48 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
 	const struct cred *cred;
 
 	if (fc->allow_other)
-		return 1;
+		return current_in_userns(fc->user_ns);
 
 	cred = current_cred();
 	if (uid_eq(cred->euid, fc->user_id) &&
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 246d4d4ce5c7..492c255e6c5a 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace *target_ns)
 {
 	return in_userns(target_ns, current_user_ns());
 }
+EXPORT_SYMBOL(current_in_userns);
 
 static inline struct user_namespace *to_user_ns(struct ns_common *ns)
 {
-- 
2.14.1

* Re: [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE
  2018-02-26 23:52     ` [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE Eric W. Biederman
@ 2018-02-27  1:13       ` Linus Torvalds
  2018-02-27  2:53         ` Eric W. Biederman
  0 siblings, 1 reply; 89+ messages in thread
From: Linus Torvalds @ 2018-02-27  1:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Miklos Szeredi, Linux Kernel Mailing List, Linux Containers,
	linux-fsdevel, Alban Crequy, Seth Forshee, Sargun Dhillon,
	Dongsu Park, Serge E. Hallyn

On Mon, Feb 26, 2018 at 3:52 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Additionally, update the comment above the call to get_acl itself and
> remove the wrong information that an implementation of get_acl can
> prevent caching by calling forget_cached_acl.

This part is just confusing.

First off, that comment is correct: a filesystem _can_ prevent the
returning of cached data by just calling forget_cached_acl().

Note that there are two different cases: saying that you _never_ want
to cache things (ACL_DONT_CACHE) and saying that there _currently_ is
no cached data (ACL_NOT_CACHED).

forget_cached_acl() just removes the current cache.

You're just replacing one case of "no cached" information with the other.

Just explain the two cases; don't try to muddy the waters even more.

PLUS you are just confusing things entirely. That whole new comment of yours:

+        * ACL_DONT_CACHE is treated as another task updating the acl and
+        * remains set.

is just garbage.

The code is very clear: it will only replace an ACL_NOT_CACHED entry:

        if (cmpxchg(p, ACL_NOT_CACHED, sentinel) != ACL_NOT_CACHED)
                /* fall through */ ;

this is basically just an atomic "if *p == ACL_NOT_CACHED then replace
it with 'sentinel'".

Your comment does not add any clarity at all, and only confuses
things. It has nothing to do with "treated as another task updating
the acl".

The fact is, ACL_DONT_CACHE is treated as if the cache is simply
already filled - it's just filled with "no cache".

So the only thing special is ACL_NOT_CACHED, which is the only thing
we will try to _replace_.
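To make the two cases concrete, here is a minimal userspace model of the caching logic just described, using C11 atomics in place of the kernel's cmpxchg(). It is a sketch, not fs/posix_acl.c. The tag values `NOT_CACHED`, `DONT_CACHE` and `SENTINEL` are stand-ins, chosen odd so that they can never equal the address of a real, aligned ACL. Error handling and reference counting are omitted, and a single fixed sentinel replaces the kernel's per-task one. `get_acl_sketch()` and `demo_fetch()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in tag values (odd, unlike any aligned object address). */
#define NOT_CACHED ((uintptr_t)1) /* empty; get_acl may fill it */
#define DONT_CACHE ((uintptr_t)3) /* never fill this slot */
#define SENTINEL   ((uintptr_t)5) /* a lookup is in flight */

static bool is_uncached(uintptr_t v)
{
	return v & 1;
}

/* An example ->get_acl result: the address of an aligned object. */
static _Alignas(8) long demo_acl;

static uintptr_t demo_fetch(void)
{
	return (uintptr_t)&demo_acl;
}

/* Model of get_acl(); fetch() stands in for the filesystem's ->get_acl. */
static uintptr_t get_acl_sketch(_Atomic uintptr_t *slot,
				uintptr_t (*fetch)(void))
{
	uintptr_t cur = atomic_load(slot);
	uintptr_t expected;
	uintptr_t acl;

	if (!is_uncached(cur))
		return cur; /* cache hit */

	/*
	 * "if *slot == NOT_CACHED then *slot = SENTINEL", atomically.
	 * DONT_CACHE (or another lookup's sentinel) makes this a no-op.
	 */
	expected = NOT_CACHED;
	atomic_compare_exchange_strong(slot, &expected, SENTINEL);

	acl = fetch();
	assert(!is_uncached(acl)); /* real ACLs are never tag values */

	/* Publish only if our sentinel is still in place. */
	expected = SENTINEL;
	atomic_compare_exchange_strong(slot, &expected, acl);
	return acl;
}
```

With `DONT_CACHE` in the slot both compare-and-swaps fail, so every call reaches ->get_acl and the slot is never filled. With `NOT_CACHED` the first call fills the slot and later calls are cache hits.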

So NAK on this patch entirely. It's just adding confusion, not adding
clarifications.

                Linus

* Re: [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE
  2018-02-27  1:13       ` Linus Torvalds
@ 2018-02-27  2:53         ` Eric W. Biederman
  2018-02-27  3:14           ` Eric W. Biederman
  2018-02-27  3:36           ` [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE Linus Torvalds
  0 siblings, 2 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-27  2:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Linux Kernel Mailing List, Linux Containers,
	linux-fsdevel, Alban Crequy, Seth Forshee, Sargun Dhillon,
	Dongsu Park, Serge E. Hallyn


So the reason for having a patch in the first place is that
2a3a2a3f3524 ("ovl: don't cache acl on overlay layer"),
which added ACL_DONT_CACHE, did not result in any comment updates
to get_acl.

Which means that if you read the comments in get_acl(), you
don't even think of ACL_DONT_CACHE.

Which means that this comment:
	/*
	 * If the ACL isn't being read yet, set our sentinel.  Otherwise, the
	 * current value of the ACL will not be ACL_NOT_CACHED and so our own
	 * sentinel will not be set; another task will update the cache.  We
	 * could wait for that other task to complete its job, but it's easier
	 * to just call ->get_acl to fetch the ACL ourself.  (This is going to
	 * be an unlikely race.)
	 */

Which presumes that the only reason the acl could be anything other
than ACL_NOT_CACHED is that get_acl() is already being called on it in
another task.

I wanted something to mention ACL_DONT_CACHE so someone would at least
think about that case if they ever set out to modify the code.

The code is perfectly clear, the comment is not.   That scares me.

And I had to read the code about a dozen times before I realized the
ACL_DONT_CACHE case even exists.  Not useful when I need to use
that to preserve historical fuse semantics.

So something is missing here even if my wording does not improve things.



Then we get this comment:
	/*
	 * Normally, the ACL returned by ->get_acl will be cached.
	 * A filesystem can prevent that by calling
	 * forget_cached_acl(inode, type) in ->get_acl.
	 */

Which was added in b8a7a3a66747 ("posix_acl: Inode acl caching fixes").
That comment is and always has been rubbish.

I don't have a clue what it is trying to say but it is not something
a person can use to write filesystem code with.


Truths:
- forget_cached_acl(inode, type) can be used to invalidate the acl
  cache.

- Calling forget_cached_acl from within the filesystem's ->get_acl
  method won't prevent a cached value from being returned because
  ->get_acl will be set.

- Calling forget_cached_acl from within the filesystem's ->get_acl
  method won't prevent a returned value from being cached
  because the caching happens after ->get_acl returns.

- Setting inode->i_acl = ACL_DONT_CACHE is the only way to prevent
  a value from ->get_acl from being cached.
  

In summary I only care about two things.
1) ACL_DONT_CACHE being mentioned somewhere in get_acl so people looking
   at the code, and people updating the code, will have a hint that they
   need to consider that case.

2) That misleading, completely bogus comment being removed/fixed.


And yes I agree the code is clear.  The comments are not.


Does this look better as a comment updating patch?

diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 2fd0fde16fe1..5453094b8828 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -98,6 +98,11 @@ struct posix_acl *get_acl(struct inode *inode, int type)
        struct posix_acl **p;
        struct posix_acl *acl;
 
+       /*
+        * To avoid caching the result of ->get_acl
+        * set inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE;
+        */
+
        /*
         * The sentinel is used to detect when another operation like
         * set_cached_acl() or forget_cached_acl() races with get_acl().
@@ -126,9 +131,7 @@ struct posix_acl *get_acl(struct inode *inode, int type)
                /* fall through */ ;
 
        /*
-        * Normally, the ACL returned by ->get_acl will be cached.
-        * A filesystem can prevent that by calling
-        * forget_cached_acl(inode, type) in ->get_acl.
+        * The ACL returned by ->get_acl will be cached.
         *
         * If the filesystem doesn't have a get_acl() function at all, we'll
         * just create the negative cache entry.

Eric

* Re: [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE
  2018-02-27  2:53         ` Eric W. Biederman
@ 2018-02-27  3:14           ` Eric W. Biederman
  2018-02-27  3:41             ` Linus Torvalds
  2018-02-27  3:36           ` [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE Linus Torvalds
  1 sibling, 1 reply; 89+ messages in thread
From: Eric W. Biederman @ 2018-02-27  3:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Linux Kernel Mailing List, Linux Containers,
	linux-fsdevel, Alban Crequy, Seth Forshee, Sargun Dhillon,
	Dongsu Park, Serge E. Hallyn

ebiederm@xmission.com (Eric W. Biederman) writes:

> So the reason for having a patch in the first place is that
> 2a3a2a3f3524 ("ovl: don't cache acl on overlay layer"),
> which added ACL_DONT_CACHE, did not result in any comment updates
> to get_acl.
>
> Which means that if you read the comments in get_acl(), you
> don't even think of ACL_DONT_CACHE.
>
> Which means that this comment:
> 	/*
> 	 * If the ACL isn't being read yet, set our sentinel.  Otherwise, the
> 	 * current value of the ACL will not be ACL_NOT_CACHED and so our own
> 	 * sentinel will not be set; another task will update the cache.  We
> 	 * could wait for that other task to complete its job, but it's easier
> 	 * to just call ->get_acl to fetch the ACL ourself.  (This is going to
> 	 * be an unlikely race.)
> 	 */
>
> Which presumes that the only reason the acl could be anything other
> than ACL_NOT_CACHED is that get_acl() is already being called on it in
> another task.
>
> I wanted something to mention ACL_DONT_CACHE so someone would at least
> think about that case if they ever set out to modify the code.
>
> The code is perfectly clear, the comment is not.   That scares me.
>
> And I had to read the code about a dozen times before I realized the
> ACL_DONT_CACHE case even exists.  Not useful when I need to use
> that to preserve historical fuse semantics.
>
> So something is missing here even if my wording does not improve things.
>
>
>
> Then we get this comment:
> 	/*
> 	 * Normally, the ACL returned by ->get_acl will be cached.
> 	 * A filesystem can prevent that by calling
> 	 * forget_cached_acl(inode, type) in ->get_acl.
> 	 */
>
> Which was added in b8a7a3a66747 ("posix_acl: Inode acl caching fixes").
> That comment is and always has been rubbish.
>
> I don't have a clue what it is trying to say but it is not something
> a person can use to write filesystem code with.
>
>
> Truths:
> - forget_cached_acl(inode, type) can be used to invalidate the acl
>   cache.
>
> - Calling forget_cached_acl from within the filesystem's ->get_acl
>   method won't prevent a cached value from being returned because
>   ->get_acl will be set.
>
> - Calling forget_cached_acl from within the filesystem's ->get_acl
>   method won't prevent a returned value from being cached
>   because the caching happens after ->get_acl returns.

Sigh.  Yes, it will, because we set the special sentinel value,
and forget_cached_acl will replace the sentinel value with
ACL_NOT_CACHED.

It is a terribly brittle and racy thing to do, and it probably won't
work for deciding, case by case in ->get_acl, to cache this acl but
not that one.

As such, I believe that usage of forget_cached_acl should be subsumed by
using ACL_NOT_CACHED.  If not, we should really come up with a different
helper function name to call from ->get_acl, preferably one that does
"cmpxchg(p, sentinel, ACL_NOT_CACHED)" so that we remove the races.
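The race described here can be replayed deterministically in a single thread. In this hedged sketch the tag values are stand-ins for a model, not the kernel's definitions. `forget_sketch()` behaves like forget_cached_acl() by unconditionally storing NOT_CACHED. `skip_caching_sketch()` is a hypothetical helper doing the proposed cmpxchg(p, sentinel, ACL_NOT_CACHED). `get_acl_replay()` runs one get_acl() pass, optionally lets "another task" publish an ACL via set_cached_acl() while ->get_acl is running, and returns whatever ends up cached. All three names are illustrative.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NOT_CACHED ((uintptr_t)1) /* stand-in for ACL_NOT_CACHED */
#define SENTINEL   ((uintptr_t)5) /* stand-in for the lookup sentinel */

static _Alignas(8) long fresh_acl; /* what our ->get_acl returns */
static _Alignas(8) long other_acl; /* what another task publishes */

/* Like forget_cached_acl(): clobbers whatever is in the slot. */
static void forget_sketch(_Atomic uintptr_t *slot)
{
	atomic_store(slot, NOT_CACHED);
}

/* Hypothetical helper: undo only our own sentinel. */
static void skip_caching_sketch(_Atomic uintptr_t *slot)
{
	uintptr_t expected = SENTINEL;

	atomic_compare_exchange_strong(slot, &expected, NOT_CACHED);
}

/*
 * One get_acl() pass whose ->get_acl asks for its result not to be
 * cached via dont_cache(); returns the value left in the cache.
 */
static uintptr_t get_acl_replay(_Atomic uintptr_t *slot,
				void (*dont_cache)(_Atomic uintptr_t *),
				bool concurrent_set)
{
	uintptr_t expected = NOT_CACHED;

	assert(atomic_load(slot) == NOT_CACHED);
	atomic_compare_exchange_strong(slot, &expected, SENTINEL);

	/* ---- inside ->get_acl ---- */
	if (concurrent_set) /* another task's set_cached_acl() */
		atomic_store(slot, (uintptr_t)&other_acl);
	dont_cache(slot);
	/* ---- back in get_acl(): try to publish our result ---- */

	expected = SENTINEL;
	atomic_compare_exchange_strong(slot, &expected, (uintptr_t)&fresh_acl);
	return atomic_load(slot);
}
```

Without a concurrent update both helpers keep the fresh result out of the cache. With one, `forget_sketch()` throws away the value the other task cached, while the sentinel-only helper leaves it in place.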


> - Setting inode->i_acl = ACL_DONT_CACHE is the only way to prevent
>   a value from ->get_acl from being cached.
>   
>
> In summary I only care about two things.
> 1) ACL_DONT_CACHE being mentioned somewhere in get_acl so people looking
>    at the code, and people updating the code, will have a hint that they
>    need to consider that case.
>
> 2) That misleading, completely bogus comment being removed/fixed.
>
>
> And yes I agree the code is clear.  The comments are not.
>
>
> Does this look better as a comment updating patch?
>
> diff --git a/fs/posix_acl.c b/fs/posix_acl.c
> index 2fd0fde16fe1..5453094b8828 100644
> --- a/fs/posix_acl.c
> +++ b/fs/posix_acl.c
> @@ -98,6 +98,11 @@ struct posix_acl *get_acl(struct inode *inode, int type)
>         struct posix_acl **p;
>         struct posix_acl *acl;
>  
> +       /*
> +        * To avoid caching the result of ->get_acl
> +        * set inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE;
> +        */
> +
>         /*
>          * The sentinel is used to detect when another operation like
>          * set_cached_acl() or forget_cached_acl() races with get_acl().
> @@ -126,9 +131,7 @@ struct posix_acl *get_acl(struct inode *inode, int type)
>                 /* fall through */ ;
>  
>         /*
> -        * Normally, the ACL returned by ->get_acl will be cached.
> -        * A filesystem can prevent that by calling
> -        * forget_cached_acl(inode, type) in ->get_acl.
> +        * The ACL returned by ->get_acl will be cached.
>          *
>          * If the filesystem doesn't have a get_acl() function at all, we'll
>          * just create the negative cache entry.
>
> Eric

* Re: [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE
  2018-02-27  2:53         ` Eric W. Biederman
  2018-02-27  3:14           ` Eric W. Biederman
@ 2018-02-27  3:36           ` Linus Torvalds
  1 sibling, 0 replies; 89+ messages in thread
From: Linus Torvalds @ 2018-02-27  3:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Miklos Szeredi, Linux Kernel Mailing List, Linux Containers,
	linux-fsdevel, Alban Crequy, Seth Forshee, Sargun Dhillon,
	Dongsu Park, Serge E. Hallyn

On Mon, Feb 26, 2018 at 6:53 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> So the reason for having a patch in the first place is that
> 2a3a2a3f3524 ("ovl: don't cache acl on overlay layer"),
> which added ACL_DONT_CACHE, did not result in any comment updates
> to get_acl.

I'm not opposed to just updating the comments.

I just think your updates were somewhat misleading.

> Which means that if you read the comments in get_acl(), you
> don't even think of ACL_DONT_CACHE.

Right. By all means add a comment about ACL_DONT_CACHE disabling the
cache entirely.

But don't _remove_ the other valid way to flush the cache, and don't
make that comment above cmpxchg() be even more confusing than the code
is.

> Does this look better as a comment updating patch?
>
> diff --git a/fs/posix_acl.c b/fs/posix_acl.c
> index 2fd0fde16fe1..5453094b8828 100644
> --- a/fs/posix_acl.c
> +++ b/fs/posix_acl.c
> @@ -98,6 +98,11 @@ struct posix_acl *get_acl(struct inode *inode, int type)
>         struct posix_acl **p;
>         struct posix_acl *acl;
>
> +       /*
> +        * To avoid caching the result of ->get_acl
> +        * set inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE;
> +        */
> +
>         /*
>          * The sentinel is used to detect when another operation like
>          * set_cached_acl() or forget_cached_acl() races with get_acl().
> @@ -126,9 +131,7 @@ struct posix_acl *get_acl(struct inode *inode, int type)
>                 /* fall through */ ;
>
>         /*
> -        * Normally, the ACL returned by ->get_acl will be cached.
> -        * A filesystem can prevent that by calling
> -        * forget_cached_acl(inode, type) in ->get_acl.
> +        * The ACL returned by ->get_acl will be cached.

Why do you hate forget_cached_acl()?

It's perfectly valid too. Don't remove that comment. Maybe reword it
to talk not about "preventing", but about "invalidating the cache".

But the old comment that you remove isn't _wrong_, it's just that the
"preventing" from returning the cached state with forget_cached_acl()
is just a one-time thing.

So forget_cached_acl() exists, and it works, and it does exactly what
its name says. It is a perfectly valid way to prevent the current
entry from being used in the future.

See? I object to you removing that, and trying to make it look like
ACL_DONT_CACHE is the *only* way to not cache something.

Because honestly, that's what your comment updates do. They take the
comments about _one_ case, and switch them over to be about the _other_
case.

But dammit, there are _two_ ways to not cache things.

"Fixing" the comment to talk about one and removing the other isn't a
fix. It's just a stupid change that now has the problem the other way
around!

So fix the comment to really just talk about both things.

First: talk about how to avoid caching entirely (ACL_DONT_CACHE).
Then, talk about how to invalidate the cache once it has been
instantiated (forget_cached_acl()).

Don't do this idiotic "remove the valid comment just because you
happened to care about the _other_ case"


              Linus

* Re: [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE
  2018-02-27  3:14           ` Eric W. Biederman
@ 2018-02-27  3:41             ` Linus Torvalds
  2018-03-02 19:53               ` [RFC][PATCH] fs/posix_acl: Update the comments and support lightweight cache skipping Eric W. Biederman
  0 siblings, 1 reply; 89+ messages in thread
From: Linus Torvalds @ 2018-02-27  3:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Miklos Szeredi, Linux Kernel Mailing List, Linux Containers,
	linux-fsdevel, Alban Crequy, Seth Forshee, Sargun Dhillon,
	Dongsu Park, Serge E. Hallyn

On Mon, Feb 26, 2018 at 7:14 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> As such, I believe that usage of forget_cached_acl should be subsumed by
> using ACL_NOT_CACHED.  If not, we should really come up with a different
> helper function name to call from ->get_acl, preferably one that does
> "cmpxchg(p, sentinel, ACL_NOT_CACHED)" so that we remove the races.

You make your bias very clear, by simply trying to hide the other case.

But for chrissake, that's not the state right now. That other case
exists. You can't - and shouldn't - try to just hide it.

Besides, that "forget_cached_acl()" approach actually has a valid use
case. Maybe you _do_ want to cache ACL's, but with a timeout or
revalidation.

ACL_DONT_CACHE really is a big hammer that makes caching not work at
all. It's not necessarily the right thing to do at all.
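One way the "cache with a timeout" idea could look, as a hedged, single-threaded sketch: a filesystem remembers when it last fetched an ACL and, once the entry is older than a time-to-live, invalidates it in the spirit of forget_cached_acl(), so the next lookup goes back to ->get_acl. The `struct acl_cache_entry`, `acl_maybe_expire()` and the TTL policy are hypothetical; the kernel does not provide them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NOT_CACHED ((uintptr_t)1) /* stand-in for ACL_NOT_CACHED */

/* Hypothetical per-inode bookkeeping kept by the filesystem. */
struct acl_cache_entry {
	uintptr_t acl;       /* cached ACL, or NOT_CACHED */
	uint64_t fetched_at; /* time of the last fetch, in seconds */
};

/* Drop the cached ACL once it is at least ttl old; true if dropped. */
static bool acl_maybe_expire(struct acl_cache_entry *e, uint64_t now,
			     uint64_t ttl)
{
	assert(now >= e->fetched_at);
	if (e->acl == NOT_CACHED || now - e->fetched_at < ttl)
		return false;
	e->acl = NOT_CACHED; /* the forget_cached_acl() step */
	return true;
}
```

Unlike ACL_DONT_CACHE, the entry stays usable between expiries, so only one ->get_acl call per TTL window is needed.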

                Linus

* Re: [PATCH v7 5/7] fuse: Simplfiy the posix acl handling logic.
  2018-02-26 23:53     ` [PATCH v7 5/7] fuse: Simplfiy the posix acl handling logic Eric W. Biederman
@ 2018-02-27  9:00       ` Miklos Szeredi
  2018-03-02 21:49         ` Eric W. Biederman
  0 siblings, 1 reply; 89+ messages in thread
From: Miklos Szeredi @ 2018-02-27  9:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: lkml, Linux Containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn

On Tue, Feb 27, 2018 at 12:53 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Rename the fuse connection flag posix_acl to cached_posix_acl as that
> is what it actually means: that fuse will cache and operate on the
> cached value of the posix acl.
>
> When fc->cached_posix_acl is not set, set ACL_DONT_CACHE on the inode
> so that get_acl and friends won't cache the acl values even if they
> are called.
>
> Replace forget_all_cached_acls with fuse_forget_cached_acls.  This
> wrapper only takes effect when cached_posix_acl is true to prevent
> losing the nocache or noxattr status in when posix acls are not
> cached.

Shouldn't forget_cached_acl() be taught about ACL_DONT_CACHE?  I think
it makes sense to generally not clear ACL_DONT_CACHE, since it's not
an actual acl value that needs forgetting.
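A user-space sketch of what this suggestion would mean, assuming C11 atomics stand in for the kernel's cmpxchg(). The marker values mirror ACL_NOT_CACHED ((void *)-1) and ACL_DONT_CACHE ((void *)-3), but `forget_keep_dont_cache()` and `after_forget()` are hypothetical names, and refcounting of a replaced ACL is deliberately left out:

```c
#include <assert.h>
#include <stdatomic.h>

#define NOT_CACHED ((void *)-1)   /* models ACL_NOT_CACHED */
#define DONT_CACHE ((void *)-3)   /* models ACL_DONT_CACHE */

/* Hypothetical forget_cached_acl() variant that never clears DONT_CACHE.
 * Releasing a replaced ACL is omitted from this model. */
static void forget_keep_dont_cache(_Atomic(void *) *p)
{
	void *old = atomic_load(p);

	/* A failed exchange reloads 'old', so the DONT_CACHE test is redone. */
	while (old != DONT_CACHE &&
	       !atomic_compare_exchange_weak(p, &old, NOT_CACHED))
		;
}

/* Convenience wrapper: run the forget on a slot that holds 'initial'. */
static void *after_forget(void *initial)
{
	_Atomic(void *) slot;

	atomic_init(&slot, initial);
	forget_keep_dont_cache(&slot);
	return atomic_load(&slot);
}
```

With this variant a real cached ACL is still dropped, while an inode marked DONT_CACHE keeps its marker.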

Thanks,
Miklos

* [RFC][PATCH] fs/posix_acl: Update the comments and support lightweight cache skipping
  2018-02-27  3:41             ` Linus Torvalds
@ 2018-03-02 19:53               ` Eric W. Biederman
  0 siblings, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-03-02 19:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Linux Kernel Mailing List, Linux Containers,
	linux-fsdevel, Alban Crequy, Seth Forshee, Sargun Dhillon,
	Dongsu Park, Serge E. Hallyn


The code has been missing a way for a ->get_acl method to not cache
a return value without risking invalidating a cached value
that was set while get_acl() was returning.

Add that support by implementing to_uncacheable_acl, to_cacheable_acl
and is_uncacheable_acl, and by handling uncacheable acls in get_acl().

Update the comments so that they are a little clearer about
what is going on in get_acl().

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---

Linus, my issue with the forget_cached_acl case was really that it was
too big of a hammer.  If you only sometimes want to cache acls,
forget_cached_acl called from ->get_acl can stomp on an acl you
explicitly cached with set_cached_acl.
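The stomping described above can be replayed in user space. This is a minimal single-threaded sketch of one interleaving, assuming C11 atomics stand in for the kernel's cache pointer and cmpxchg(); `slot`, `fresh_acl` and `replay()` are hypothetical, and the sentinel is just a unique address here rather than the per-task value the kernel uses:

```c
#include <assert.h>
#include <stdatomic.h>

#define NOT_CACHED ((void *)-1)              /* models ACL_NOT_CACHED */

static int fresh_acl;                        /* an ACL cached by ->set_acl */
static _Atomic(void *) slot;                 /* models inode->i_acl */

/* get_acl() has installed its sentinel, a concurrent ->set_acl then calls
 * set_cached_acl(), and only afterwards does ->get_acl decide that its
 * own result must not be cached. */
static void *replay(int use_cmpxchg)
{
	static char sentinel_obj;            /* any unique address will do */
	void *sentinel = &sentinel_obj;

	atomic_store(&slot, sentinel);               /* get_acl() claims slot */
	atomic_store(&slot, (void *)&fresh_acl);     /* set_cached_acl() */

	if (use_cmpxchg) {
		/* cmpxchg(p, sentinel, ACL_NOT_CACHED): undo only our claim. */
		void *expected = sentinel;

		atomic_compare_exchange_strong(&slot, &expected, NOT_CACHED);
	} else {
		/* forget_cached_acl(): drop whatever is cached. */
		atomic_store(&slot, NOT_CACHED);
	}
	return atomic_load(&slot);
}
```

replay(0) loses the freshly cached ACL; replay(1) leaves it in place because the sentinel is no longer there to be swapped out.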

With this change I can unify the horrible legacy fuse posix acl case,
which requires not caching acls, with a single if statement in the
->get_acl method:

+	if (!IS_ERR(acl) && !fc->posix_acl)
+		acl = to_uncacheable_acl(acl);
 	return acl;

I know that code is locally correct, even if fuse later decides to cache
negative acls when the underlying filesystem does not support xattrs.
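The helpers in this patch tag a pointer by setting its low bit. A user-space transcription of the same trick, assuming ACL allocations are at least 2-byte aligned; `struct posix_acl` here is a stand-in and `MODEL_ERR_PTR()` is a hypothetical model of ERR_PTR(). It also shows why get_acl() has to test IS_ERR() before is_uncacheable_acl(): odd errnos such as -EIO produce pointers with bit 0 already set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct posix_acl { int a_count; };   /* stand-in, not the kernel layout */

/* Same low-bit tagging as the patch, spelled with uintptr_t. */
static bool is_uncacheable_acl(struct posix_acl *acl)
{
	return ((uintptr_t)acl) & 1u;
}

static struct posix_acl *to_uncacheable_acl(struct posix_acl *acl)
{
	return (struct posix_acl *)((uintptr_t)acl | 1u);
}

static struct posix_acl *to_cacheable_acl(struct posix_acl *acl)
{
	return (struct posix_acl *)((uintptr_t)acl & ~(uintptr_t)1);
}

/* Hypothetical ERR_PTR() model: negative errnos become pointer values. */
#define MODEL_ERR_PTR(err) ((struct posix_acl *)(intptr_t)(err))
```

A NULL (negative) result survives the round trip as well, since tagging and untagging only touch bit 0.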

 fs/posix_acl.c            | 56 ++++++++++++++++++++++++++++++++++-------------
 include/linux/posix_acl.h | 17 ++++++++++++++
 2 files changed, 58 insertions(+), 15 deletions(-)

diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 2fd0fde16fe1..e58a68e18603 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -96,12 +96,16 @@ struct posix_acl *get_acl(struct inode *inode, int type)
 {
 	void *sentinel;
 	struct posix_acl **p;
-	struct posix_acl *acl;
+	struct posix_acl *acl, *to_cache;
 
 	/*
 	 * The sentinel is used to detect when another operation like
 	 * set_cached_acl() or forget_cached_acl() races with get_acl().
 	 * It is guaranteed that is_uncached_acl(sentinel) is true.
+	 *
+	 * This is sufficient to prevent races between ->set_acl
+	 * calling set_cached_acl (outside of filesystem specific
+	 * locking) and get_acl() caching the returned acl.
 	 */
 
 	acl = get_cached_acl(inode, type);
@@ -126,12 +130,18 @@ struct posix_acl *get_acl(struct inode *inode, int type)
 		/* fall through */ ;
 
 	/*
-	 * Normally, the ACL returned by ->get_acl will be cached.
-	 * A filesystem can prevent that by calling
-	 * forget_cached_acl(inode, type) in ->get_acl.
+	 * Normally, the ACL returned by ->get_acl() will be cached.
+	 *
+	 * A filesystem can prevent the acl returned by ->get_acl()
+	 * from being cached by wrapping it with to_uncacheable_acl().
+	 *
+	 * A filesystem can at any time affect the cache directly, and
+	 * cause in-progress calls of get_acl() not to update the cache,
+	 * by calling forget_cached_acl(inode, type) or
+	 * set_cached_acl(inode, type, acl).
 	 *
-	 * If the filesystem doesn't have a get_acl() function at all, we'll
-	 * just create the negative cache entry.
+	 * If the filesystem doesn't have a ->get_acl() function at
+	 * all, we'll just create the negative cache entry.
 	 */
 	if (!inode->i_op->get_acl) {
 		set_cached_acl(inode, type, NULL);
@@ -139,21 +149,37 @@ struct posix_acl *get_acl(struct inode *inode, int type)
 	}
 	acl = inode->i_op->get_acl(inode, type);
 
+
+	/* To keep the logic simple default to not caching an acl when
+	 * the sentinel is cleared.
+	 */
+	to_cache = ACL_NOT_CACHED;
 	if (IS_ERR(acl)) {
-		/*
-		 * Remove our sentinel so that we don't block future attempts
-		 * to cache the ACL.
+		/* Clears the sentinel so that we don't block future
+		 * attempts to cache the ACL, and return an error.
 		 */
-		cmpxchg(p, sentinel, ACL_NOT_CACHED);
-		return acl;
+	}
+	else if (is_uncacheable_acl(acl)) {
+		/* Clears the sentinel so that we don't block future
+		 * attempts to cache the ACL, and return a valid ACL.
+		 */
+		acl = to_cacheable_acl(acl);
+	}
+	else {
+		to_cache = acl;
+		posix_acl_dup(to_cache);
 	}
 
 	/*
-	 * Cache the result, but only if our sentinel is still in place.
+	 * Remove the sentinel and replace it with the value that
+	 * needs to be cached, but only if the sentinel is still in
+	 * place.
 	 */
-	posix_acl_dup(acl);
-	if (unlikely(cmpxchg(p, sentinel, acl) != sentinel))
-		posix_acl_release(acl);
+	if (unlikely(cmpxchg(p, sentinel, to_cache) != sentinel)) {
+		if (!is_uncached_acl(to_cache))
+			posix_acl_release(to_cache);
+	}
+
 	return acl;
 }
 EXPORT_SYMBOL(get_acl);
diff --git a/include/linux/posix_acl.h b/include/linux/posix_acl.h
index 540595a321a7..3be8929b9f48 100644
--- a/include/linux/posix_acl.h
+++ b/include/linux/posix_acl.h
@@ -56,6 +56,23 @@ posix_acl_release(struct posix_acl *acl)
 		kfree_rcu(acl, a_rcu);
 }
 
+/*
+ * Allow for acls returned from ->get_acl() to not be cached.
+ */
+static inline bool is_uncacheable_acl(struct posix_acl *acl)
+{
+	return ((unsigned long)acl) & 1UL;
+}
+
+static inline struct posix_acl *to_uncacheable_acl(struct posix_acl *acl)
+{
+	return (struct posix_acl *)(((unsigned long)acl) | 1UL);
+}
+
+static inline struct posix_acl *to_cacheable_acl(struct posix_acl *acl)
+{
+	return (struct posix_acl *)(((unsigned long)acl) & ~1UL);
+}
 
 /* posix_acl.c */
 
-- 
2.14.1

* Re: [PATCH v7 5/7] fuse: Simplfiy the posix acl handling logic.
  2018-02-27  9:00       ` Miklos Szeredi
@ 2018-03-02 21:49         ` Eric W. Biederman
  0 siblings, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-03-02 21:49 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: lkml, Linux Containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn

Miklos Szeredi <mszeredi@redhat.com> writes:

> On Tue, Feb 27, 2018 at 12:53 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Rename the fuse connection flag posix_acl to cached_posix_acl as that
>> is what it actually means.  That fuse will cache and operate on the
>> cached value of the posix acl.
>>
>> When fc->cached_posix_acl is not set, set ACL_DONT_CACHE on the inode
>> so that get_acl and friends won't cache the acl values even if they
>> are called.
>>
>> Replace forget_all_cached_acls with fuse_forget_cached_acls.  This
>> wrapper only takes effect when cached_posix_acl is true to prevent
>> losing the nocache or noxattr status in when posix acls are not
>> cached.
>
> Shouldn't forget_cached_acl() be taught about ACL_DONT_CACHE?  I think
> it makes sense to generally not clear ACL_DONT_CACHE, since it's not
> an actual acl value that needs forgetting.

After stopping to make certain I understand the issues, I don't think
it makes sense to teach forget_cached_acl about ACL_DONT_CACHE.

If you are forgetting a cached attribute, ACL_DONT_CACHE simply doesn't
make sense.

Further, it makes sense to cache a negative result for fuse when
fc->no_getxattr is set, even if you would ordinarily not cache posix acls.

So I think the better plan is to teach the posix acl code how to not
cache results on a case-by-case basis, as I did in the RFC patch I
posted a little earlier today.  That works with forget_cached_acl and it
supports local reasoning.  Further, while the performance might not be as
good as with ACL_DONT_CACHE, I don't think that matters, as always going to
the fuse server to get acls is almost certainly going to dominate the acl
costs.

Eric

* [PATCH v8 0/6] fuse: mounts from non-init user namespaces
  2018-02-26 23:52   ` [PATCH v7 0/7] fuse: mounts from non-init user namespaces Eric W. Biederman
                       ` (6 preceding siblings ...)
  2018-02-26 23:53     ` [PATCH v7 7/7] fuse: Restrict allow_other to the superblock's namespace or a descendant Eric W. Biederman
@ 2018-03-02 21:58     ` Eric W. Biederman
  2018-03-02 21:59       ` [PATCH v8 1/6] fs/posix_acl: Update the comments and support lightweight cache skipping Eric W. Biederman
                         ` (6 more replies)
  7 siblings, 7 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-03-02 21:58 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Linus Torvalds


This patchset builds on the work by Dongsu Park and Seth Forshee and is
reduced to the set of patches that just affect fuse.  The non-fuse
vfs patches are far enough along that we can ignore them, except possibly
for the question of when FS_USERNS_MOUNT gets set in fuse_fs_type.

Fuse with a block device has been left as an exercise for a later time.

Since v5 I changed the core of this patchset around as the previous
patches were showing signs of bitrot.  Some important explanations were
missing, some important functionality was missing, and xattr handling
was completely absent.

Since v6 I have:
- Removed the failure case from fuse_get_req_nofail_nopages that I
  added.
- Updated fuse to always use posix_acl_access_xattr_handler and
  posix_acl_default_xattr_handler, by teaching fuse to set
  ACL_DONT_CACHE when FUSE_POSIX_ACL is not set.

Since v7 I have:
- Rethought and reworked how I am unifying the cached and the non-cached
  posix acl case so the code is cleaner and simpler.
  - I have dropped enhancements to caching negative acls when
    fc->no_getxattr is set.
  - Removed the need to wrap forget_all_cached_acls in fuse.
- Reordered the patches so the posix acl work comes first.

Miklos, can you take a look and see what you think?

I think this much of the fuse changes are ready, and as such I would
like to get them in this development cycle if possible.

These changes are also available at:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git userns-fuse-v8

Eric W. Biederman (5):
      fs/posix_acl: Update the comments and support lightweight cache skipping
      fuse: Simplfiy the posix acl handling logic.
      fuse: Remove the buggy retranslation of pids in fuse_dev_do_read
      fuse: Fail all requests with invalid uids or gids
      fuse: Support fuse filesystems outside of init_user_ns

Seth Forshee (1):
      fuse: Restrict allow_other to the superblock's namespace or a descendant


 fs/fuse/acl.c             | 10 ++++++----
 fs/fuse/cuse.c            |  7 ++++++-
 fs/fuse/dev.c             | 30 ++++++++++++++++------------
 fs/fuse/dir.c             | 18 ++++++++---------
 fs/fuse/fuse_i.h          |  9 ++++++---
 fs/fuse/inode.c           | 34 +++++++++++++++++++-------------
 fs/fuse/xattr.c           |  5 -----
 fs/posix_acl.c            | 50 ++++++++++++++++++++++++++++++++---------------
 include/linux/posix_acl.h | 17 ++++++++++++++++
 kernel/user_namespace.c   |  1 +
 10 files changed, 116 insertions(+), 65 deletions(-)

* [PATCH v8 1/6] fs/posix_acl: Update the comments and support lightweight cache skipping
  2018-03-02 21:58     ` [PATCH v8 0/6] fuse: mounts from non-init user namespaces Eric W. Biederman
@ 2018-03-02 21:59       ` Eric W. Biederman
  2018-03-05  9:53         ` Miklos Szeredi
  2018-03-02 21:59       ` [PATCH v8 2/6] fuse: Simplfiy the posix acl handling logic Eric W. Biederman
                         ` (5 subsequent siblings)
  6 siblings, 1 reply; 89+ messages in thread
From: Eric W. Biederman @ 2018-03-02 21:59 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Linus Torvalds, Eric W. Biederman

The code has been missing a way for a ->get_acl method to not cache
a return value without risking invalidating a cached value
that was set while get_acl() was returning.

Add that support by implementing to_uncacheable_acl, to_cacheable_acl
and is_uncacheable_acl, and by handling uncacheable acls in get_acl().

Update the comments so that they are a little clearer about
what is going on in get_acl().
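The reference accounting at the end of the reworked get_acl() can be checked with a small single-threaded model. This is a sketch under stated assumptions: `struct acl`, `finish_get_acl()` and `enum result_kind` are hypothetical, the cmpxchg() is modelled non-atomically, and the pointer encoding the patch uses (IS_ERR() and the low tag bit) is replaced by an explicit enum:

```c
#include <assert.h>
#include <stddef.h>

struct acl { int refcount; };          /* stand-in for struct posix_acl */

#define NOT_CACHED ((void *)-1)        /* models ACL_NOT_CACHED */

/* What ->get_acl() returned, made explicit instead of pointer-encoded. */
enum result_kind { RESULT_ERROR, RESULT_UNCACHEABLE, RESULT_CACHEABLE };

/* Models the tail of get_acl(): 'slot' holds 'sentinel' unless another
 * set_cached_acl()/forget_cached_acl() got there first. */
static struct acl *finish_get_acl(void **slot, void *sentinel,
				  struct acl *acl, enum result_kind kind)
{
	void *to_cache = NOT_CACHED;   /* default: cache nothing */

	if (kind == RESULT_CACHEABLE) {
		if (acl)
			acl->refcount++;   /* posix_acl_dup(): cache's ref */
		to_cache = acl;            /* NULL caches a negative entry */
	}

	if (*slot == sentinel)             /* cmpxchg(p, sentinel, to_cache) */
		*slot = to_cache;
	else if (to_cache != NOT_CACHED && to_cache != NULL)
		((struct acl *)to_cache)->refcount--;  /* lost the race */

	return acl;
}
```

The invariant it shows: the extra reference taken for the cache is dropped exactly when the cache did not accept it, and uncacheable or error results never touch the refcount.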

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/posix_acl.c            | 50 ++++++++++++++++++++++++++++++++---------------
 include/linux/posix_acl.h | 17 ++++++++++++++++
 2 files changed, 51 insertions(+), 16 deletions(-)

diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 2fd0fde16fe1..00281bc30854 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -96,12 +96,16 @@ struct posix_acl *get_acl(struct inode *inode, int type)
 {
 	void *sentinel;
 	struct posix_acl **p;
-	struct posix_acl *acl;
+	struct posix_acl *acl, *to_cache;
 
 	/*
 	 * The sentinel is used to detect when another operation like
 	 * set_cached_acl() or forget_cached_acl() races with get_acl().
 	 * It is guaranteed that is_uncached_acl(sentinel) is true.
+	 *
+	 * This is sufficient to prevent races between ->set_acl
+	 * calling set_cached_acl (outside of filesystem specific
+	 * locking) and get_acl() caching the returned acl.
 	 */
 
 	acl = get_cached_acl(inode, type);
@@ -126,12 +130,18 @@ struct posix_acl *get_acl(struct inode *inode, int type)
 		/* fall through */ ;
 
 	/*
-	 * Normally, the ACL returned by ->get_acl will be cached.
-	 * A filesystem can prevent that by calling
-	 * forget_cached_acl(inode, type) in ->get_acl.
+	 * Normally, the ACL returned by ->get_acl() will be cached.
+	 *
+	 * A filesystem can prevent the acl returned by ->get_acl()
+	 * from being cached by wrapping it with to_uncacheable_acl().
 	 *
-	 * If the filesystem doesn't have a get_acl() function at all, we'll
-	 * just create the negative cache entry.
+	 * A filesystem can at any time affect the cache directly, and
+	 * cause in-progress calls of get_acl() not to update the cache,
+	 * by calling forget_cached_acl(inode, type) or
+	 * set_cached_acl(inode, type, acl).
+	 *
+	 * If the filesystem doesn't have a ->get_acl() function at
+	 * all, we'll just create the negative cache entry.
 	 */
 	if (!inode->i_op->get_acl) {
 		set_cached_acl(inode, type, NULL);
@@ -140,20 +150,28 @@ struct posix_acl *get_acl(struct inode *inode, int type)
 	acl = inode->i_op->get_acl(inode, type);
 
 	if (IS_ERR(acl)) {
-		/*
-		 * Remove our sentinel so that we don't block future attempts
-		 * to cache the ACL.
-		 */
-		cmpxchg(p, sentinel, ACL_NOT_CACHED);
-		return acl;
+		/* Don't cache an acl just return an error. */
+		to_cache = ACL_NOT_CACHED;
+	}
+	else if (is_uncacheable_acl(acl)) {
+		/* Don't cache an acl, but return one. */
+		to_cache = ACL_NOT_CACHED;
+		acl = to_cacheable_acl(acl);
+	}
+	else {
+		/* Cache and return the acl. */
+		to_cache = posix_acl_dup(acl);
 	}
 
 	/*
-	 * Cache the result, but only if our sentinel is still in place.
+	 * Remove the sentinel and replace it with the value to
+	 * cache, but only if the sentinel is still in place.
 	 */
-	posix_acl_dup(acl);
-	if (unlikely(cmpxchg(p, sentinel, acl) != sentinel))
-		posix_acl_release(acl);
+	if (unlikely(cmpxchg(p, sentinel, to_cache) != sentinel)) {
+		if (!is_uncached_acl(to_cache))
+			posix_acl_release(to_cache);
+	}
+
 	return acl;
 }
 EXPORT_SYMBOL(get_acl);
diff --git a/include/linux/posix_acl.h b/include/linux/posix_acl.h
index 540595a321a7..3be8929b9f48 100644
--- a/include/linux/posix_acl.h
+++ b/include/linux/posix_acl.h
@@ -56,6 +56,23 @@ posix_acl_release(struct posix_acl *acl)
 		kfree_rcu(acl, a_rcu);
 }
 
+/*
+ * Allow for acls returned from ->get_acl() to not be cached.
+ */
+static inline bool is_uncacheable_acl(struct posix_acl *acl)
+{
+	return ((unsigned long)acl) & 1UL;
+}
+
+static inline struct posix_acl *to_uncacheable_acl(struct posix_acl *acl)
+{
+	return (struct posix_acl *)(((unsigned long)acl) | 1UL);
+}
+
+static inline struct posix_acl *to_cacheable_acl(struct posix_acl *acl)
+{
+	return (struct posix_acl *)(((unsigned long)acl) & ~1UL);
+}
 
 /* posix_acl.c */
 
-- 
2.14.1

* [PATCH v8 2/6] fuse: Simplfiy the posix acl handling logic.
  2018-03-02 21:58     ` [PATCH v8 0/6] fuse: mounts from non-init user namespaces Eric W. Biederman
  2018-03-02 21:59       ` [PATCH v8 1/6] fs/posix_acl: Update the comments and support lightweight cache skipping Eric W. Biederman
@ 2018-03-02 21:59       ` Eric W. Biederman
  2018-03-02 21:59       ` [PATCH v8 3/6] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read Eric W. Biederman
                         ` (4 subsequent siblings)
  6 siblings, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-03-02 21:59 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Linus Torvalds, Eric W. Biederman

Rename the fuse connection flag posix_acl to cached_posix_acl, as that
is what it actually means: fuse will cache and operate on the
cached value of the posix acl.

Always use posix_acl_access_xattr_handler so the fuse code benefits
from the generic posix acl handlers as much as possible.  This will
become important as the code works on translation of uid and gid in
the posix acls when fuse is not mounted in the initial user namespace.

Update fuse_get_acl so that it does not cache the acl when the connection
is not caching acls.  This is all that is needed to ensure that
fuse_getxattr calls down into the fuse server when posix_acl_xattr_get
is called.  The updated code goes through fuse_get_acl, and as such has
the posix acl specific sanity checks and attribute handling, but no real
difference from the previous code that skipped it.

It can safely be assumed that fuse filesystems where acls are not
cached in the kernel do not set fc->default_permissions, as
default_permissions only checked posix acls if .get_acl was defined,
and before the cached acl flag was introduced fuse did not implement a
get_acl method.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/fuse/acl.c    | 6 ++++--
 fs/fuse/dir.c    | 2 +-
 fs/fuse/fuse_i.h | 3 +--
 fs/fuse/inode.c  | 3 +--
 fs/fuse/xattr.c  | 5 -----
 5 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index ec85765502f1..cfa58ee0c10b 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -19,7 +19,7 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type)
 	void *value = NULL;
 	struct posix_acl *acl;
 
-	if (!fc->posix_acl || fc->no_getxattr)
+	if (fc->no_getxattr)
 		return NULL;
 
 	if (type == ACL_TYPE_ACCESS)
@@ -44,6 +44,8 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type)
 		acl = ERR_PTR(size);
 
 	kfree(value);
+	if (!IS_ERR(acl) && !fc->cached_posix_acl)
+		acl = to_uncacheable_acl(acl);
 	return acl;
 }
 
@@ -53,7 +55,7 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 	const char *name;
 	int ret;
 
-	if (!fc->posix_acl || fc->no_setxattr)
+	if (fc->no_setxattr)
 		return -EOPNOTSUPP;
 
 	if (type == ACL_TYPE_ACCESS)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 24967382a7b1..43a45e83d313 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1764,7 +1764,7 @@ static int fuse_setattr(struct dentry *entry, struct iattr *attr)
 		 * If filesystem supports acls it may have updated acl xattrs in
 		 * the filesystem, so forget cached acls for the inode.
 		 */
-		if (fc->posix_acl)
+		if (fc->cached_posix_acl)
 			forget_all_cached_acls(inode);
 
 		/* Directory mode changed, may need to revalidate access */
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index c4c093bbf456..74ce02fb16d6 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -619,7 +619,7 @@ struct fuse_conn {
 	unsigned no_lseek:1;
 
 	/** Does the filesystem support posix acls? */
-	unsigned posix_acl:1;
+	unsigned cached_posix_acl:1;
 
 	/** Check permissions based on the file mode or not? */
 	unsigned default_permissions:1;
@@ -974,7 +974,6 @@ ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
 ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size);
 int fuse_removexattr(struct inode *inode, const char *name);
 extern const struct xattr_handler *fuse_xattr_handlers[];
-extern const struct xattr_handler *fuse_acl_xattr_handlers[];
 
 struct posix_acl;
 struct posix_acl *fuse_get_acl(struct inode *inode, int type);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 624f18bbfd2b..507f780046c5 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -915,8 +915,7 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req)
 				fc->sb->s_time_gran = arg->time_gran;
 			if ((arg->flags & FUSE_POSIX_ACL)) {
 				fc->default_permissions = 1;
-				fc->posix_acl = 1;
-				fc->sb->s_xattr = fuse_acl_xattr_handlers;
+				fc->cached_posix_acl = 1;
 			}
 		} else {
 			ra_pages = fc->max_read / PAGE_SIZE;
diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
index 3caac46b08b0..ed64c508585a 100644
--- a/fs/fuse/xattr.c
+++ b/fs/fuse/xattr.c
@@ -199,11 +199,6 @@ static const struct xattr_handler fuse_xattr_handler = {
 };
 
 const struct xattr_handler *fuse_xattr_handlers[] = {
-	&fuse_xattr_handler,
-	NULL
-};
-
-const struct xattr_handler *fuse_acl_xattr_handlers[] = {
 	&posix_acl_access_xattr_handler,
 	&posix_acl_default_xattr_handler,
 	&fuse_xattr_handler,
-- 
2.14.1

* [PATCH v8 3/6] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read
  2018-03-02 21:58     ` [PATCH v8 0/6] fuse: mounts from non-init user namespaces Eric W. Biederman
  2018-03-02 21:59       ` [PATCH v8 1/6] fs/posix_acl: Update the comments and support lightweight cache skipping Eric W. Biederman
  2018-03-02 21:59       ` [PATCH v8 2/6] fuse: Simplfiy the posix acl handling logic Eric W. Biederman
@ 2018-03-02 21:59       ` Eric W. Biederman
  2018-03-02 21:59       ` [PATCH v8 4/6] fuse: Fail all requests with invalid uids or gids Eric W. Biederman
                         ` (3 subsequent siblings)
  6 siblings, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-03-02 21:59 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Linus Torvalds, Eric W. Biederman

At the point of fuse_dev_do_read, the user space process that initiated the
action on the fuse filesystem may no longer exist.  The process may have been
killed, or may have fired an asynchronous request and exited.

If the initial process has exited, the code "pid_vnr(find_pid_ns(in->h.pid,
fc->pid_ns))" will either return a pid of 0 or, in the unlikely event that
the pid has been reallocated, practically any pid.  Any pid is possible
because the pid allocator allocates pid numbers in different pid
namespaces independently.
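The failure mode above can be illustrated with a toy model. It is a deliberate simplification, not how the kernel stores pids: each hypothetical `struct task` carries one number in fc->pid_ns (what in->h.pid records when the request is queued) and an unrelated number in the reader's namespace, and `retranslate()` stands in for pid_vnr(find_pid_ns(nr, fc->pid_ns)):

```c
#include <assert.h>

/* Toy task: its number in fc->pid_ns and, independently allocated,
 * its number as seen from the namespace of the reading server. */
struct task { int nr_in_fc_ns; int nr_in_reader_ns; int alive; };

static struct task tasks[8];
static int ntasks;

static int spawn(int nr_in_fc_ns, int nr_in_reader_ns)
{
	assert(ntasks < 8);
	tasks[ntasks] = (struct task){ nr_in_fc_ns, nr_in_reader_ns, 1 };
	return ntasks++;
}

static void exit_task(int idx)
{
	tasks[idx].alive = 0;
}

/* Models pid_vnr(find_pid_ns(nr, fc->pid_ns)) evaluated by the reader. */
static int retranslate(int nr_in_fc_ns)
{
	for (int i = 0; i < ntasks; i++)
		if (tasks[i].alive && tasks[i].nr_in_fc_ns == nr_in_fc_ns)
			return tasks[i].nr_in_reader_ns;
	return 0;                      /* pid_vnr(NULL) yields 0 */
}
```

Once the requester has exited, the stored number translates to 0; once that number is reused in fc->pid_ns, it translates to whatever the new, unrelated task is called in the reader's namespace.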

The only way to make translation in fuse_dev_do_read reliable is to call
get_pid in fuse_req_init_context, and pid_vnr followed by put_pid in
fuse_dev_do_read.  Reference counting like that has been shown in other
contexts to bounce cache lines between processors and to be slow in
general, so that is not desirable.

The only known user that runs the fuse server in a different pid namespace
from the filesystem does not care what the pids in the fuse messages are,
so removing this code should not matter.

Getting the translation to a server running outside of the pid namespace
of a container can still be achieved by playing setns games at mount time.
It is also possible to add an option to pass a pid namespace into the fuse
filesystem at mount time.

Fixes: 5d6d3a301c4e ("fuse: allow server to run in different pid_ns")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/fuse/dev.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 5d06384c2cae..0fb58f364fa6 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1260,12 +1260,6 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
 	in = &req->in;
 	reqsize = in->h.len;
 
-	if (task_active_pid_ns(current) != fc->pid_ns) {
-		rcu_read_lock();
-		in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
-		rcu_read_unlock();
-	}
-
 	/* If request is too large, reply with an error and restart the read */
 	if (nbytes < reqsize) {
 		req->out.h.error = -EIO;
-- 
2.14.1

* [PATCH v8 4/6] fuse: Fail all requests with invalid uids or gids
  2018-03-02 21:58     ` [PATCH v8 0/6] fuse: mounts from non-init user namespaces Eric W. Biederman
                         ` (2 preceding siblings ...)
  2018-03-02 21:59       ` [PATCH v8 3/6] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read Eric W. Biederman
@ 2018-03-02 21:59       ` Eric W. Biederman
  2018-03-02 21:59       ` [PATCH v8 5/6] fuse: Support fuse filesystems outside of init_user_ns Eric W. Biederman
                         ` (2 subsequent siblings)
  6 siblings, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-03-02 21:59 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Linus Torvalds, Eric W. Biederman

Upon a cursory examination, the uid and gid of a fuse request are
necessary for correct operation.  Failing a fuse request where those
values are not reliable seems a straightforward and reliable means of
ensuring that fuse requests with bad data are not sent or processed.

In most cases the vfs avoids actions it suspects will cause writeback
of an inode with an invalid uid or gid.  But that does not map
precisely onto what fuse is doing, so test for this and handle it
at the fuse level as well.

Performing this work in fuse_req_init_context is cheap as the code is
already performing the translation here and only needs to check the
result of the translation to see if things are not representable in
a form the fuse server can handle.
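A sketch of the difference between the munged and unmunged translations, assuming a single uid_map-style extent shared by uids and gids for brevity. In this patch fuse still translates into &init_user_ns; the container-style map below anticipates the following patch. `struct extent`, `from_global()`, `from_global_munged()` and `init_context()` are hypothetical models of from_kuid(), from_kuid_munged() and the patched fuse_req_init_context(); 65534 is the default overflowuid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical uid_map line: ids [first, first + count) inside the
 * namespace correspond to [lower_first, lower_first + count) globally. */
struct extent { uint32_t first, lower_first, count; };

#define INVALID_ID  ((uint32_t)-1)   /* models (uid_t)-1 */
#define OVERFLOW_ID 65534u           /* default overflowuid */

/* Models from_kuid(): global id -> namespace id, or INVALID_ID. */
static uint32_t from_global(const struct extent *e, uint32_t kid)
{
	if (kid >= e->lower_first && kid - e->lower_first < e->count)
		return e->first + (kid - e->lower_first);
	return INVALID_ID;
}

/* Models from_kuid_munged(): unmappable ids silently become OVERFLOW_ID. */
static uint32_t from_global_munged(const struct extent *e, uint32_t kid)
{
	uint32_t id = from_global(e, kid);

	return id == INVALID_ID ? OVERFLOW_ID : id;
}

/* Models the patched fuse_req_init_context(): report failure instead
 * of sending a munged id to the fuse server. */
static bool init_context(const struct extent *e, uint32_t fsuid,
			 uint32_t fsgid, uint32_t *h_uid, uint32_t *h_gid)
{
	*h_uid = from_global(e, fsuid);
	*h_gid = from_global(e, fsgid);
	return *h_uid != INVALID_ID && *h_gid != INVALID_ID;
}
```

With the munged variant the server would see 65534 and could not tell it apart from a real user with that id; failing the request (the patch returns -EOVERFLOW) avoids that ambiguity.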

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/fuse/dev.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 0fb58f364fa6..2886a56d5f61 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -112,11 +112,20 @@ static void __fuse_put_request(struct fuse_req *req)
 	refcount_dec(&req->count);
 }
 
-static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
+static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
 {
-	req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
-	req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
+	req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
+	req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
 	req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
+
+	return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
+}
+
+static void fuse_req_init_context_nofail(struct fuse_req *req)
+{
+	req->in.h.uid = 0;
+	req->in.h.gid = 0;
+	req->in.h.pid = 0;
 }
 
 void fuse_set_initialized(struct fuse_conn *fc)
@@ -162,12 +171,13 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
 			wake_up(&fc->blocked_waitq);
 		goto out;
 	}
-
-	fuse_req_init_context(fc, req);
 	__set_bit(FR_WAITING, &req->flags);
 	if (for_background)
 		__set_bit(FR_BACKGROUND, &req->flags);
-
+	if (unlikely(!fuse_req_init_context(fc, req))) {
+		fuse_put_request(fc, req);
+		return ERR_PTR(-EOVERFLOW);
+	}
 	return req;
 
  out:
@@ -256,7 +266,7 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc,
 	if (!req)
 		req = get_reserved_req(fc, file);
 
-	fuse_req_init_context(fc, req);
+	fuse_req_init_context_nofail(req);
 	__set_bit(FR_WAITING, &req->flags);
 	__clear_bit(FR_BACKGROUND, &req->flags);
 	return req;
-- 
2.14.1

* [PATCH v8 5/6] fuse: Support fuse filesystems outside of init_user_ns
  2018-03-02 21:58     ` [PATCH v8 0/6] fuse: mounts from non-init user namespaces Eric W. Biederman
                         ` (3 preceding siblings ...)
  2018-03-02 21:59       ` [PATCH v8 4/6] fuse: Fail all requests with invalid uids or gids Eric W. Biederman
@ 2018-03-02 21:59       ` Eric W. Biederman
  2018-03-02 21:59       ` [PATCH v8 6/6] fuse: Restrict allow_other to the superblock's namespace or a descendant Eric W. Biederman
  2018-03-08 21:23       ` [PATCH v9 0/4] fuse: mounts from non-init user namespaces Eric W. Biederman
  6 siblings, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-03-02 21:59 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Linus Torvalds, Eric W. Biederman

In order to support mounts from namespaces other than init_user_ns,
fuse must translate uids and gids to/from the userns of the process
servicing requests on /dev/fuse. This patch does that, with a couple
of restrictions on the namespace:

 - The userns for the fuse connection is fixed to the namespace
   from which /dev/fuse is opened.

 - The namespace must be the same as s_user_ns.

These restrictions simplify the implementation by avoiding the need to
pass around userns references and by allowing fuse to rely on the
checks in setattr_prepare for ownership changes.  Either restriction
could be relaxed in the future if needed.

For cuse, the userns used is that of the opener of /dev/cuse.  Semantically
the cuse support does not appear safe for unprivileged users.  Practically,
the permissions on /dev/cuse only make it accessible to the global root
user.  If something slips through the cracks in a user namespace, the only
users who will be able to use the cuse device are those users mapped into
the user namespace.

Translation in the posix acl is updated to use the user namespace of
the filesystem.  Avoiding cases which might bypass this translation is
handled in a following change.
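A sketch of what translating posix acls through fc->user_ns involves, assuming a single-extent map and the tag values of the POSIX ACL xattr format (ACL_USER = 0x02, ACL_GROUP = 0x08). Only named user and group entries carry ids; the model mirrors the kernel's posix_acl_from_xattr() rejecting the whole ACL with -EINVAL when an id does not map. `struct entry`, `make_global()` and `acl_ids_from_server()` are hypothetical, and a failed call may leave earlier entries already rewritten (the kernel simply discards the partial ACL).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Tag values from the POSIX ACL xattr format. */
enum { TAG_USER_OBJ = 0x01, TAG_USER = 0x02, TAG_GROUP_OBJ = 0x04,
       TAG_GROUP = 0x08, TAG_MASK = 0x10, TAG_OTHER = 0x20 };

struct entry { uint16_t tag; uint16_t perm; uint32_t id; };

/* Hypothetical single-extent uid/gid map of fc->user_ns. */
struct extent { uint32_t first, lower_first, count; };

#define INVALID_ID ((uint32_t)-1)

/* Models make_kuid()/make_kgid(): namespace id -> global id. */
static uint32_t make_global(const struct extent *e, uint32_t id)
{
	if (id >= e->first && id - e->first < e->count)
		return e->lower_first + (id - e->first);
	return INVALID_ID;
}

/* Models the id handling of posix_acl_from_xattr(fc->user_ns, ...). */
static int acl_ids_from_server(const struct extent *ns,
			       struct entry *ents, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		uint32_t kid;

		if (ents[i].tag != TAG_USER && ents[i].tag != TAG_GROUP)
			continue;          /* no id carried by this entry */
		kid = make_global(ns, ents[i].id);
		if (kid == INVALID_ID)
			return -EINVAL;    /* whole ACL rejected */
		ents[i].id = kid;
	}
	return 0;
}
```

The write path (posix_acl_to_xattr(fc->user_ns, ...)) is the mirror image: kernel ids are mapped back into the server's namespace before the xattr is sent.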

This change is strongly based on a similar change from Seth Forshee
and Dongsu Park.

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: <seth.forshee@canonical.com>
Cc: Dongsu Park <dongsu@kinvolk.io>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/fuse/acl.c    |  4 ++--
 fs/fuse/cuse.c   |  7 ++++++-
 fs/fuse/dev.c    |  4 ++--
 fs/fuse/dir.c    | 14 +++++++-------
 fs/fuse/fuse_i.h |  6 +++++-
 fs/fuse/inode.c  | 31 +++++++++++++++++++------------
 6 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index cfa58ee0c10b..0472735a89c3 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -34,7 +34,7 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type)
 		return ERR_PTR(-ENOMEM);
 	size = fuse_getxattr(inode, name, value, PAGE_SIZE);
 	if (size > 0)
-		acl = posix_acl_from_xattr(&init_user_ns, value, size);
+		acl = posix_acl_from_xattr(fc->user_ns, value, size);
 	else if ((size == 0) || (size == -ENODATA) ||
 		 (size == -EOPNOTSUPP && fc->no_getxattr))
 		acl = NULL;
@@ -83,7 +83,7 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 		if (!value)
 			return -ENOMEM;
 
-		ret = posix_acl_to_xattr(&init_user_ns, acl, value, size);
+		ret = posix_acl_to_xattr(fc->user_ns, acl, value, size);
 		if (ret < 0) {
 			kfree(value);
 			return ret;
diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index e9e97803442a..036ee477669e 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -48,6 +48,7 @@
 #include <linux/stat.h>
 #include <linux/module.h>
 #include <linux/uio.h>
+#include <linux/user_namespace.h>
 
 #include "fuse_i.h"
 
@@ -498,7 +499,11 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
 	if (!cc)
 		return -ENOMEM;
 
-	fuse_conn_init(&cc->fc);
+	/*
+	 * Limit the cuse channel to requests that can
+	 * be represented in file->f_cred->user_ns.
+	 */
+	fuse_conn_init(&cc->fc, file->f_cred->user_ns);
 
 	fud = fuse_dev_alloc(&cc->fc);
 	if (!fud) {
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 2886a56d5f61..fce7915aea13 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
 
 static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
 {
-	req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
-	req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
+	req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
+	req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
 	req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
 
 	return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 43a45e83d313..c749a4bd4ea3 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 	stat->ino = attr->ino;
 	stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
 	stat->nlink = attr->nlink;
-	stat->uid = make_kuid(&init_user_ns, attr->uid);
-	stat->gid = make_kgid(&init_user_ns, attr->gid);
+	stat->uid = make_kuid(fc->user_ns, attr->uid);
+	stat->gid = make_kgid(fc->user_ns, attr->gid);
 	stat->rdev = inode->i_rdev;
 	stat->atime.tv_sec = attr->atime;
 	stat->atime.tv_nsec = attr->atimensec;
@@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
 	return true;
 }
 
-static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
-			   bool trust_local_cmtime)
+static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
+			   struct fuse_setattr_in *arg, bool trust_local_cmtime)
 {
 	unsigned ivalid = iattr->ia_valid;
 
 	if (ivalid & ATTR_MODE)
 		arg->valid |= FATTR_MODE,   arg->mode = iattr->ia_mode;
 	if (ivalid & ATTR_UID)
-		arg->valid |= FATTR_UID,    arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
+		arg->valid |= FATTR_UID,    arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
 	if (ivalid & ATTR_GID)
-		arg->valid |= FATTR_GID,    arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
+		arg->valid |= FATTR_GID,    arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
 	if (ivalid & ATTR_SIZE)
 		arg->valid |= FATTR_SIZE,   arg->size = iattr->ia_size;
 	if (ivalid & ATTR_ATIME) {
@@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
 
 	memset(&inarg, 0, sizeof(inarg));
 	memset(&outarg, 0, sizeof(outarg));
-	iattr_to_fattr(attr, &inarg, trust_local_cmtime);
+	iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
 	if (file) {
 		struct fuse_file *ff = file->private_data;
 		inarg.valid |= FATTR_FH;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 74ce02fb16d6..dbb1d4ef1a0b 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -26,6 +26,7 @@
 #include <linux/xattr.h>
 #include <linux/pid_namespace.h>
 #include <linux/refcount.h>
+#include <linux/user_namespace.h>
 
 /** Max number of pages that can be used in a single read request */
 #define FUSE_MAX_PAGES_PER_REQ 32
@@ -466,6 +467,9 @@ struct fuse_conn {
 	/** The pid namespace for this mount */
 	struct pid_namespace *pid_ns;
 
+	/** The user namespace for this mount */
+	struct user_namespace *user_ns;
+
 	/** Maximum read size */
 	unsigned max_read;
 
@@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
 /**
  * Initialize fuse_conn
  */
-void fuse_conn_init(struct fuse_conn *fc);
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
 
 /**
  * Release reference to fuse_conn
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 507f780046c5..b5b2e1fc5bfd 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 	inode->i_ino     = fuse_squash_ino(attr->ino);
 	inode->i_mode    = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
 	set_nlink(inode, attr->nlink);
-	inode->i_uid     = make_kuid(&init_user_ns, attr->uid);
-	inode->i_gid     = make_kgid(&init_user_ns, attr->gid);
+	inode->i_uid     = make_kuid(fc->user_ns, attr->uid);
+	inode->i_gid     = make_kgid(fc->user_ns, attr->gid);
 	inode->i_blocks  = attr->blocks;
 	inode->i_atime.tv_sec   = attr->atime;
 	inode->i_atime.tv_nsec  = attr->atimensec;
@@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
 	return err;
 }
 
-static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
+static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
+			  struct user_namespace *user_ns)
 {
 	char *p;
 	memset(d, 0, sizeof(struct fuse_mount_data));
@@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
 		case OPT_USER_ID:
 			if (fuse_match_uint(&args[0], &uv))
 				return 0;
-			d->user_id = make_kuid(current_user_ns(), uv);
+			d->user_id = make_kuid(user_ns, uv);
 			if (!uid_valid(d->user_id))
 				return 0;
 			d->user_id_present = 1;
@@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
 		case OPT_GROUP_ID:
 			if (fuse_match_uint(&args[0], &uv))
 				return 0;
-			d->group_id = make_kgid(current_user_ns(), uv);
+			d->group_id = make_kgid(user_ns, uv);
 			if (!gid_valid(d->group_id))
 				return 0;
 			d->group_id_present = 1;
@@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
 	struct super_block *sb = root->d_sb;
 	struct fuse_conn *fc = get_fuse_conn_super(sb);
 
-	seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
-	seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
+	seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
+	seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
 	if (fc->default_permissions)
 		seq_puts(m, ",default_permissions");
 	if (fc->allow_other)
@@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
 	fpq->connected = 1;
 }
 
-void fuse_conn_init(struct fuse_conn *fc)
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
 {
 	memset(fc, 0, sizeof(*fc));
 	spin_lock_init(&fc->lock);
@@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc)
 	fc->attr_version = 1;
 	get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
 	fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
+	fc->user_ns = get_user_ns(user_ns);
 }
 EXPORT_SYMBOL_GPL(fuse_conn_init);
 
@@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc)
 		if (fc->destroy_req)
 			fuse_request_free(fc->destroy_req);
 		put_pid_ns(fc->pid_ns);
+		put_user_ns(fc->user_ns);
 		fc->release(fc);
 	}
 }
@@ -1060,7 +1063,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 
 	sb->s_flags &= ~(SB_NOSEC | SB_I_VERSION);
 
-	if (!parse_fuse_opt(data, &d, is_bdev))
+	if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
 		goto err;
 
 	if (is_bdev) {
@@ -1085,8 +1088,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	if (!file)
 		goto err;
 
-	if ((file->f_op != &fuse_dev_operations) ||
-	    (file->f_cred->user_ns != &init_user_ns))
+	/*
+	 * Require mount to happen from the same user namespace which
+	 * opened /dev/fuse to prevent potential attacks.
+	 */
+	if (file->f_op != &fuse_dev_operations ||
+	    file->f_cred->user_ns != sb->s_user_ns)
 		goto err_fput;
 
 	fc = kmalloc(sizeof(*fc), GFP_KERNEL);
@@ -1094,7 +1101,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	if (!fc)
 		goto err_fput;
 
-	fuse_conn_init(fc);
+	fuse_conn_init(fc, sb->s_user_ns);
 	fc->release = fuse_free_conn;
 
 	fud = fuse_dev_alloc(fc);
-- 
2.14.1

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v8 6/6] fuse: Restrict allow_other to the superblock's namespace or a descendant
  2018-03-02 21:58     ` [PATCH v8 0/6] fuse: mounts from non-init user namespaces Eric W. Biederman
                         ` (4 preceding siblings ...)
  2018-03-02 21:59       ` [PATCH v8 5/6] fuse: Support fuse filesystems outside of init_user_ns Eric W. Biederman
@ 2018-03-02 21:59       ` Eric W. Biederman
  2018-03-08 21:23       ` [PATCH v9 0/4] fuse: mounts from non-init user namespaces Eric W. Biederman
  6 siblings, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-03-02 21:59 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Linus Torvalds, Eric W. Biederman

From: Seth Forshee <seth.forshee@canonical.com>

Unprivileged users are normally restricted from mounting with the
allow_other option by system policy, but this could be bypassed
for a mount done with user namespace root permissions. In such
cases, allow_other should not allow users outside the userns
to access the mount, as doing so would give the unprivileged user
the ability to manipulate processes it would otherwise be unable
to manipulate. Restrict allow_other to apply to users in the same
userns used at mount or a descendant of that namespace. Also
export current_in_userns() for use by fuse when built as a
module.
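A minimal userspace sketch of the new rule, assuming a toy namespace
tree: struct userns_model, struct fuse_conn_model and the *_model
functions are hypothetical stand-ins, not kernel APIs.  The ancestor walk
follows the level-based loop used by the kernel's in_userns(); with
allow_other set, access depends only on where the caller's namespace sits
relative to the mount's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user namespace model: the initial namespace has level 0
 * and every child sits one level below its parent. */
struct userns_model {
	const struct userns_model *parent;
	int level;
};

/* Same walk as in_userns(): climb from @child until we reach the depth
 * of @ancestor, then check whether we landed on it. */
static bool in_userns_model(const struct userns_model *ancestor,
			    const struct userns_model *child)
{
	const struct userns_model *ns = child;

	while (ns->level > ancestor->level)
		ns = ns->parent;
	return ns == ancestor;
}

/* Hypothetical cut-down fuse_conn: only what the check below reads. */
struct fuse_conn_model {
	bool allow_other;
	const struct userns_model *user_ns;
	unsigned int user_id;
	unsigned int group_id;
};

/*
 * Sketch of fuse_allow_current_process() with this patch applied.
 * The real function compares uid, euid and suid (and the gid triple)
 * against fc->user_id/group_id; this model collapses each to one value.
 */
static bool allow_current_model(const struct fuse_conn_model *fc,
				const struct userns_model *cur_ns,
				unsigned int uid, unsigned int gid)
{
	if (fc->allow_other)
		return in_userns_model(fc->user_ns, cur_ns);
	return uid == fc->user_id && gid == fc->group_id;
}
```

In this model even a task in the initial namespace is refused once
allow_other is set on a mount owned by a container namespace, while tasks
in that namespace or any namespace nested below it are admitted.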

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/fuse/dir.c           | 2 +-
 kernel/user_namespace.c | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index c749a4bd4ea3..5461b63bb2a4 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
 	const struct cred *cred;
 
 	if (fc->allow_other)
-		return 1;
+		return current_in_userns(fc->user_ns);
 
 	cred = current_cred();
 	if (uid_eq(cred->euid, fc->user_id) &&
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 246d4d4ce5c7..492c255e6c5a 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace *target_ns)
 {
 	return in_userns(target_ns, current_user_ns());
 }
+EXPORT_SYMBOL(current_in_userns);
 
 static inline struct user_namespace *to_user_ns(struct ns_common *ns)
 {
-- 
2.14.1

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v8 1/6] fs/posix_acl: Update the comments and support lightweight cache skipping
  2018-03-02 21:59       ` [PATCH v8 1/6] fs/posix_acl: Update the comments and support lightweight cache skipping Eric W. Biederman
@ 2018-03-05  9:53         ` Miklos Szeredi
  2018-03-05 13:53           ` Eric W. Biederman
  0 siblings, 1 reply; 89+ messages in thread
From: Miklos Szeredi @ 2018-03-05  9:53 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: lkml, Linux Containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Linus Torvalds

On Fri, Mar 2, 2018 at 10:59 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> The code has been missing a way for a ->get_acl method to not cache
> a return value without risking invalidating a cached value
> that was set while get_acl() was returning.
>
> Add that support by implementing to_uncachable_acl, to_cachable_acl,
> is_uncacheable_acl, and dealing with uncachable acls in get_acl().

I don't like the pointer magic here.  Can't the uncachable bit just be
added to struct posix_acl?

 AFAICS that can be done even without increasing the size of that
struct (e.g. by unioning it with the rcu_head).
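A hypothetical layout sketch of the suggestion above, assuming simplified
stand-ins (struct rcu_head_model, struct posix_acl_before and
struct posix_acl_after) rather than the kernel's definitions.  It only
demonstrates the size argument: a flag overlaid on the RCU head, which is
written when the ACL is queued for freeing, leaves the struct the same
size.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for struct rcu_head: a list link plus a callback pointer. */
struct rcu_head_model {
	struct rcu_head_model *next;
	void (*func)(struct rcu_head_model *head);
};

/* Simplified current layout (the entries array is omitted). */
struct posix_acl_before {
	int a_refcount;
	struct rcu_head_model a_rcu;
	unsigned int a_count;
};

/*
 * Hypothetical layout with the suggested flag.  The RCU head is only
 * written when the last reference is dropped and the ACL is queued for
 * freeing, so a flag that matters while the ACL is live can share it.
 */
struct posix_acl_after {
	int a_refcount;
	union {
		struct rcu_head_model a_rcu;
		bool a_uncachable;
	};
	unsigned int a_count;
};

_Static_assert(sizeof(struct posix_acl_after) ==
	       sizeof(struct posix_acl_before),
	       "sharing storage with the RCU head must not grow the struct");
```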

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v8 1/6] fs/posix_acl: Update the comments and support lightweight cache skipping
  2018-03-05  9:53         ` Miklos Szeredi
@ 2018-03-05 13:53           ` Eric W. Biederman
  0 siblings, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-03-05 13:53 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: lkml, Linux Containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Linus Torvalds

Miklos Szeredi <mszeredi@redhat.com> writes:

> On Fri, Mar 2, 2018 at 10:59 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> The code has been missing a way for a ->get_acl method to not cache
>> a return value without risking invalidating a cached value
>> that was set while get_acl() was returning.
>>
>> Add that support by implementing to_uncachable_acl, to_cachable_acl,
>> is_uncacheable_acl, and dealing with uncachable acls in get_acl().
>
> I don't like the pointer magic here.  Can't the uncachable bit just be
> added to struct posix_acl?
>
>  AFAICS that can be done even without increasing the size of that
> struct (e.g. by unioning it with the rcu_head).

Except that would:
- add a possible cache line miss.
- make it unusable for overlayfs.

I am after very light-weight semantics that say don't cache this return
value but don't have any effects elsewhere.

We are already playing pointer magic games in this code.  This just uses
those games for the last piece of information to keep the logic clean.

I see two possible implementation alternatives:
- Make get_acl return a struct that contains the acl and a cachability flag.
- Add a helper that does "cmpxchg(p, sentinel, ACL_NOT_CACHED)".
  Such a helper function seems like a waste: it does side-effect magic,
  which is never particularly pleasant, and it is more code to execute
  in practice.  Though honestly it is my second choice.

  void dont_cache_my_return_acl(struct inode *inode, int type)
  {
      /* Valid only inside ->get_acl implementations */
      struct posix_acl **p = get_acl_type(inode, type);
      struct posix_acl *sentinel = uncached_acl_sentinel(current);
      cmpxchg(p, sentinel, ACL_NOT_CACHED);
  }
  EXPORT_SYMBOL(dont_cache_my_return_acl);

  It is just a few instructions more so I guess it isn't that bad.
  Especially for something that is not a common case.
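A userspace model of the sentinel protocol that dont_cache_my_return_acl()
relies on, written with C11 <stdatomic.h> compare-and-swap in place of the
kernel's cmpxchg().  Everything named *_model, the marker object, and the
two example callbacks are hypothetical; the kernel encodes sentinels
differently and also has a cached fast path that is left out here.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-ins: a real ACL and the "not cached" marker. */
struct acl_model {
	int id;
};

static struct acl_model not_cached_marker;
#define ACL_NOT_CACHED_MODEL (&not_cached_marker)

struct inode_model {
	_Atomic(struct acl_model *) i_acl;
};

/* Models dont_cache_my_return_acl(): swap our sentinel back to "not
 * cached" so get_acl_model() will not store the value we return.  Only
 * meaningful while the caller's sentinel is installed. */
static void dont_cache_model(struct inode_model *inode,
			     struct acl_model *sentinel)
{
	struct acl_model *expected = sentinel;

	atomic_compare_exchange_strong(&inode->i_acl, &expected,
				       ACL_NOT_CACHED_MODEL);
}

/* Signature of the filesystem's ->get_acl in this model. */
typedef struct acl_model *(*get_acl_fn)(struct inode_model *inode,
					struct acl_model *sentinel);

/*
 * Models the slow path of get_acl(): install a per-caller sentinel, ask
 * the filesystem, and publish the result only if our sentinel is still
 * in the slot.  A concurrent set or a dont_cache call replaces the
 * sentinel, so the final compare-and-swap fails and nothing is clobbered.
 */
static struct acl_model *get_acl_model(struct inode_model *inode,
				       struct acl_model *sentinel,
				       get_acl_fn fs_get_acl)
{
	struct acl_model *expected = ACL_NOT_CACHED_MODEL;
	struct acl_model *acl;

	atomic_compare_exchange_strong(&inode->i_acl, &expected, sentinel);
	acl = fs_get_acl(inode, sentinel);
	expected = sentinel;
	atomic_compare_exchange_strong(&inode->i_acl, &expected, acl);
	return acl;
}

/* Two example ->get_acl implementations. */
static struct acl_model example_acl = { 1 };

static struct acl_model *fs_cacheable(struct inode_model *inode,
				      struct acl_model *sentinel)
{
	(void)inode;
	(void)sentinel;
	return &example_acl;
}

static struct acl_model *fs_uncacheable(struct inode_model *inode,
					struct acl_model *sentinel)
{
	dont_cache_model(inode, sentinel);
	return &example_acl;
}
```

With the slot starting at the marker, a lookup through fs_uncacheable()
leaves it at the marker, while a lookup through fs_cacheable() stores
&example_acl in it.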

Do you think you could live with dont_cache_my_return_acl?

Otherwise I think I will respin this patch set without the acl
unification.  There is plenty of evidence of what it will look like
now.  We can deal with the rest of the patches.  Then we can come back
to exactly what acl unification in fuse should look like.

Eric

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v9 0/4] fuse: mounts from non-init user namespaces
  2018-03-02 21:58     ` [PATCH v8 0/6] fuse: mounts from non-init user namespaces Eric W. Biederman
                         ` (5 preceding siblings ...)
  2018-03-02 21:59       ` [PATCH v8 6/6] fuse: Restrict allow_other to the superblock's namespace or a descendant Eric W. Biederman
@ 2018-03-08 21:23       ` Eric W. Biederman
  2018-03-08 21:24         ` [PATCH v9 1/4] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read Eric W. Biederman
                           ` (4 more replies)
  6 siblings, 5 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-03-08 21:23 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn


This patchset builds on the work by Dongsu Park and Seth Forshee and is
reduced to the set of patches that affect only fuse.  The non-fuse
vfs patches are far enough along that we can ignore them, except possibly
for the question of when FS_USERNS_MOUNT gets set in fuse_fs_type.

Fuse with a block device has been left as an exercise for a later time.

Since v5 I changed the core of this patchset around as the previous
patches were showing signs of bitrot.  Some important explanations were
missing, some important functionality was missing, and xattr handling
was completely absent.

Since v6 I have:
- Removed the failure case from fuse_get_req_nofail_nopages that I
  added.
- Updated fuse to always use posix_acl_access_xattr_handler and
  posix_acl_default_xattr_handler, by teaching fuse to set
  ACL_DONT_CACHE when FUSE_POSIX_ACL is not set.

Since v7 I have:
- Rethought and reworked how I am unifying the cached and the non-cached
  posix acl case so the code is cleaner and simpler.
  - I have dropped enhancements to caching negative acls when
    fc->no_getxattr is set.
  - Removed the need to wrap forget_all_cached_acls in fuse.
- Reordered the patches so the posix acl work comes first.

Since v8 I have:
- Dropped and postponed the unification of the uncached and the cached
  posix acls case.  The code is not hard but tricky enough that it needs
  to be considered on its own merits.

Miklos can you take a look and see what you think?

Miklos if you could pick these up I would appreciate it.  If not I can
merge these through the userns tree.

These changes are also available at:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git userns-fuse-v9

Eric W. Biederman (3):
      fuse: Remove the buggy retranslation of pids in fuse_dev_do_read
      fuse: Fail all requests with invalid uids or gids
      fuse: Support fuse filesystems outside of init_user_ns

Seth Forshee (1):
      fuse: Restrict allow_other to the superblock's namespace or a descendant

 fs/fuse/acl.c           |  4 ++--
 fs/fuse/cuse.c          |  7 ++++++-
 fs/fuse/dev.c           | 30 +++++++++++++++++-------------
 fs/fuse/dir.c           | 16 ++++++++--------
 fs/fuse/fuse_i.h        |  6 +++++-
 fs/fuse/inode.c         | 31 +++++++++++++++++++------------
 kernel/user_namespace.c |  1 +
 7 files changed, 58 insertions(+), 37 deletions(-)

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v9 1/4] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read
  2018-03-08 21:23       ` [PATCH v9 0/4] fuse: mounts from non-init user namespaces Eric W. Biederman
@ 2018-03-08 21:24         ` Eric W. Biederman
  2018-03-08 21:24         ` [PATCH v9 2/4] fuse: Fail all requests with invalid uids or gids Eric W. Biederman
                           ` (3 subsequent siblings)
  4 siblings, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-03-08 21:24 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Eric W. Biederman

At the point of fuse_dev_do_read the user space process that initiated the
action on the fuse filesystem may no longer exist.  The process may have
been killed or may have fired an asynchronous request and exited.

If the initial process has exited, the code "pid_vnr(find_pid_ns(in->h.pid,
fc->pid_ns))" will either return a pid of 0 or, in the unlikely event that
the pid has been reallocated, practically any pid.  Any pid is
possible as the pid allocator allocates pid numbers in different pid
namespaces independently.
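A toy model of why the removed lookup is unreliable, assuming each pid
namespace is just a table from pid number to task.  struct pidns_model,
task_for_nr(), nr_in_ns() and retranslate() are hypothetical analogues of
find_pid_ns() and pid_vnr(), not the kernel implementations.

```c
#include <assert.h>

#define NR_PIDS 8

/* Hypothetical pid namespace: slot[nr] is the id of the task that
 * currently owns pid number nr in this namespace, or 0 if it is free. */
struct pidns_model {
	int slot[NR_PIDS];
};

/* Rough analogue of find_pid_ns(): which task owns @nr right now? */
static int task_for_nr(const struct pidns_model *ns, int nr)
{
	return (nr > 0 && nr < NR_PIDS) ? ns->slot[nr] : 0;
}

/* Rough analogue of pid_vnr(): @task's number in @ns, 0 if none. */
static int nr_in_ns(const struct pidns_model *ns, int task)
{
	if (task == 0)
		return 0;
	for (int nr = 1; nr < NR_PIDS; nr++)
		if (ns->slot[nr] == task)
			return nr;
	return 0;
}

/* What the removed code effectively did: look up, at read time, a pid
 * number that was recorded when the request was queued. */
static int retranslate(const struct pidns_model *fs_ns,
		       const struct pidns_model *server_ns, int recorded_nr)
{
	return nr_in_ns(server_ns, task_for_nr(fs_ns, recorded_nr));
}
```

If the requester (pid 3 in the container, pid 5 as seen by the server)
exits, retranslate() yields 0; if the container then hands number 3 to an
unrelated task, it yields that task's server-side number instead.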

The only way to make translation in fuse_dev_do_read reliable is to call
get_pid in fuse_req_init_context, and pid_vnr followed by put_pid in
fuse_dev_do_read.  That reference counting in other contexts has been shown
to bounce cache lines between processors and in general be slow.  So that is
not desirable.

The only known user that runs the fuse server in a different pid namespace
from the filesystem does not care what the pids in the fuse messages are,
so removing this code should not matter.

Getting the translation to a server running outside of the pid namespace
of a container can still be achieved by playing setns games at mount time.
It is also possible to add an option to pass a pid namespace into the fuse
filesystem at mount time.

Fixes: 5d6d3a301c4e ("fuse: allow server to run in different pid_ns")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/fuse/dev.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 5d06384c2cae..0fb58f364fa6 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1260,12 +1260,6 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
 	in = &req->in;
 	reqsize = in->h.len;
 
-	if (task_active_pid_ns(current) != fc->pid_ns) {
-		rcu_read_lock();
-		in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
-		rcu_read_unlock();
-	}
-
 	/* If request is too large, reply with an error and restart the read */
 	if (nbytes < reqsize) {
 		req->out.h.error = -EIO;
-- 
2.14.1

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v9 2/4] fuse: Fail all requests with invalid uids or gids
  2018-03-08 21:23       ` [PATCH v9 0/4] fuse: mounts from non-init user namespaces Eric W. Biederman
  2018-03-08 21:24         ` [PATCH v9 1/4] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read Eric W. Biederman
@ 2018-03-08 21:24         ` Eric W. Biederman
  2018-03-08 21:24         ` [PATCH v9 3/4] fuse: Support fuse filesystems outside of init_user_ns Eric W. Biederman
                           ` (2 subsequent siblings)
  4 siblings, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-03-08 21:24 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Eric W. Biederman

Upon a cursory examination, the uid and gid of a fuse request are
necessary for correct operation.  Failing a fuse request where those
values are not reliable seems a straightforward and reliable means of
ensuring that fuse requests with bad data are not sent or processed.

In most cases the vfs will avoid actions it suspects will cause
writeback of an inode with an invalid uid or gid.  But that does
not map precisely to what fuse is doing, so test for this and handle
it at the fuse level as well.

Performing this work in fuse_req_init_context is cheap as the code is
already performing the translation here and only needs to check the
result of the translation to see if things are not representable in
a form the fuse server can handle.
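A minimal sketch of the check, assuming a single-extent id map as a
stand-in for the target user namespace; struct id_map, map_id_up() and the
*_model functions are hypothetical.  Like from_kuid(), the mapping returns
(uint32_t)-1 for an unmapped id, whereas the from_kuid_munged() call being
replaced substitutes the overflow id (normally 65534) and lets the request
through.  The model takes the target namespace as a parameter so the
failure path is visible.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-extent uid/gid map for the target namespace:
 * ids [first, first + count) there are kernel ids
 * [lower_first, lower_first + count). */
struct id_map {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Rough analogue of from_kuid()/from_kgid(): (uint32_t)-1 when the
 * kernel id has no mapping in the target namespace. */
static uint32_t map_id_up(const struct id_map *m, uint32_t kid)
{
	if (kid >= m->lower_first && kid - m->lower_first < m->count)
		return m->first + (kid - m->lower_first);
	return (uint32_t)-1;
}

struct req_hdr_model {
	uint32_t uid;
	uint32_t gid;
};

/* Sketch of the bool-returning fuse_req_init_context(). */
static bool init_context_model(const struct id_map *m,
			       struct req_hdr_model *h,
			       uint32_t fsuid, uint32_t fsgid)
{
	h->uid = map_id_up(m, fsuid);
	h->gid = map_id_up(m, fsgid);
	return h->uid != (uint32_t)-1 && h->gid != (uint32_t)-1;
}

/* Sketch of the __fuse_get_req() change: refuse the request instead of
 * sending ids the server cannot interpret. */
static int get_req_model(const struct id_map *m, struct req_hdr_model *h,
			 uint32_t fsuid, uint32_t fsgid)
{
	if (!init_context_model(m, h, fsuid, fsgid))
		return -EOVERFLOW;
	return 0;
}
```

For example, with namespace ids 0-65535 mapped to kernel ids
100000-165535, a caller with kernel fsuid 101000 is sent as uid 1000,
while a caller with kernel fsuid 0 gets -EOVERFLOW.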

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/fuse/dev.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 0fb58f364fa6..2886a56d5f61 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -112,11 +112,20 @@ static void __fuse_put_request(struct fuse_req *req)
 	refcount_dec(&req->count);
 }
 
-static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
+static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
 {
-	req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
-	req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
+	req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
+	req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
 	req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
+
+	return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
+}
+
+static void fuse_req_init_context_nofail(struct fuse_req *req)
+{
+	req->in.h.uid = 0;
+	req->in.h.gid = 0;
+	req->in.h.pid = 0;
 }
 
 void fuse_set_initialized(struct fuse_conn *fc)
@@ -162,12 +171,13 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
 			wake_up(&fc->blocked_waitq);
 		goto out;
 	}
-
-	fuse_req_init_context(fc, req);
 	__set_bit(FR_WAITING, &req->flags);
 	if (for_background)
 		__set_bit(FR_BACKGROUND, &req->flags);
-
+	if (unlikely(!fuse_req_init_context(fc, req))) {
+		fuse_put_request(fc, req);
+		return ERR_PTR(-EOVERFLOW);
+	}
 	return req;
 
  out:
@@ -256,7 +266,7 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc,
 	if (!req)
 		req = get_reserved_req(fc, file);
 
-	fuse_req_init_context(fc, req);
+	fuse_req_init_context_nofail(req);
 	__set_bit(FR_WAITING, &req->flags);
 	__clear_bit(FR_BACKGROUND, &req->flags);
 	return req;
-- 
2.14.1

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v9 3/4] fuse: Support fuse filesystems outside of init_user_ns
  2018-03-08 21:23       ` [PATCH v9 0/4] fuse: mounts from non-init user namespaces Eric W. Biederman
  2018-03-08 21:24         ` [PATCH v9 1/4] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read Eric W. Biederman
  2018-03-08 21:24         ` [PATCH v9 2/4] fuse: Fail all requests with invalid uids or gids Eric W. Biederman
@ 2018-03-08 21:24         ` Eric W. Biederman
  2018-03-08 21:24         ` [PATCH v9 4/4] fuse: Restrict allow_other to the superblock's namespace or a descendant Eric W. Biederman
  2018-03-20 16:25         ` [PATCH v9 0/4] fuse: mounts from non-init user namespaces Miklos Szeredi
  4 siblings, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-03-08 21:24 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Eric W. Biederman

In order to support mounts from namespaces other than init_user_ns,
fuse must translate uids and gids to/from the userns of the process
servicing requests on /dev/fuse. This patch does that, with a couple
of restrictions on the namespace:

 - The userns for the fuse connection is fixed to the namespace
   from which /dev/fuse is opened.

 - The namespace must be the same as s_user_ns.

These restrictions simplify the implementation by avoiding the need to
pass around userns references and by allowing fuse to rely on the
checks in setattr_prepare for ownership changes.  Either restriction
could be relaxed in the future if needed.
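A rough userspace model of the two restrictions, assuming each user
namespace is reduced to one uid_map extent.  struct userns_map,
ns_to_kernel(), kernel_to_ns() and the other helpers are hypothetical
analogues of make_kuid(), from_kuid() and the fuse_fill_super() check, not
kernel code.  Both directions go through the single namespace pinned in
the connection, and mounting is refused unless the opener of /dev/fuse
shares the superblock's namespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INVALID_ID ((uint32_t)-1)

/* Hypothetical user namespace reduced to one uid_map extent:
 * ids [first, first + count) inside map to kernel ids
 * [lower_first, lower_first + count). */
struct userns_map {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Rough analogue of make_kuid(): namespace id -> kernel id. */
static uint32_t ns_to_kernel(const struct userns_map *ns, uint32_t id)
{
	if (id >= ns->first && id - ns->first < ns->count)
		return ns->lower_first + (id - ns->first);
	return INVALID_ID;
}

/* Rough analogue of from_kuid(): kernel id -> namespace id. */
static uint32_t kernel_to_ns(const struct userns_map *ns, uint32_t kid)
{
	if (kid >= ns->lower_first && kid - ns->lower_first < ns->count)
		return ns->first + (kid - ns->lower_first);
	return INVALID_ID;
}

struct fuse_conn_model {
	const struct userns_map *user_ns;	/* pinned at connection init */
};

/* Mirrors the fuse_fill_super() restriction: the opener of /dev/fuse
 * must be in the superblock's own user namespace. */
static bool mount_allowed(const struct userns_map *opener_ns,
			  const struct userns_map *sb_ns)
{
	return opener_ns == sb_ns;
}

/* Server -> kernel, as in fuse_change_attributes_common(). */
static uint32_t inode_uid_from_attr(const struct fuse_conn_model *fc,
				    uint32_t attr_uid)
{
	return ns_to_kernel(fc->user_ns, attr_uid);
}

/* Kernel -> server, as in iattr_to_fattr() for a chown. */
static uint32_t attr_uid_from_iattr(const struct fuse_conn_model *fc,
				    uint32_t ia_uid)
{
	return kernel_to_ns(fc->user_ns, ia_uid);
}
```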

For cuse, the userns used is that of the opener of /dev/cuse.  Semantically
the cuse support does not appear safe for unprivileged users.  Practically,
the permissions on /dev/cuse make it accessible only to the global root
user.  If something slips through the cracks in a user namespace, the only
users who will be able to use the cuse device are those users mapped into
the user namespace.

Translation in the posix acl is updated to use the user namespace of
the filesystem.  Avoiding cases which might bypass this translation is
handled in a following change.

This change is strongly based on a similar change from Seth Forshee
and Dongsu Park.

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: <seth.forshee@canonical.com>
Cc: Dongsu Park <dongsu@kinvolk.io>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/fuse/acl.c    |  4 ++--
 fs/fuse/cuse.c   |  7 ++++++-
 fs/fuse/dev.c    |  4 ++--
 fs/fuse/dir.c    | 14 +++++++-------
 fs/fuse/fuse_i.h |  6 +++++-
 fs/fuse/inode.c  | 31 +++++++++++++++++++------------
 6 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index ec85765502f1..5a48cee6d7d3 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -34,7 +34,7 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type)
 		return ERR_PTR(-ENOMEM);
 	size = fuse_getxattr(inode, name, value, PAGE_SIZE);
 	if (size > 0)
-		acl = posix_acl_from_xattr(&init_user_ns, value, size);
+		acl = posix_acl_from_xattr(fc->user_ns, value, size);
 	else if ((size == 0) || (size == -ENODATA) ||
 		 (size == -EOPNOTSUPP && fc->no_getxattr))
 		acl = NULL;
@@ -81,7 +81,7 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 		if (!value)
 			return -ENOMEM;
 
-		ret = posix_acl_to_xattr(&init_user_ns, acl, value, size);
+		ret = posix_acl_to_xattr(fc->user_ns, acl, value, size);
 		if (ret < 0) {
 			kfree(value);
 			return ret;
diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index e9e97803442a..036ee477669e 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -48,6 +48,7 @@
 #include <linux/stat.h>
 #include <linux/module.h>
 #include <linux/uio.h>
+#include <linux/user_namespace.h>
 
 #include "fuse_i.h"
 
@@ -498,7 +499,11 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
 	if (!cc)
 		return -ENOMEM;
 
-	fuse_conn_init(&cc->fc);
+	/*
+	 * Limit the cuse channel to requests that can
+	 * be represented in file->f_cred->user_ns.
+	 */
+	fuse_conn_init(&cc->fc, file->f_cred->user_ns);
 
 	fud = fuse_dev_alloc(&cc->fc);
 	if (!fud) {
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 2886a56d5f61..fce7915aea13 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
 
 static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
 {
-	req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
-	req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
+	req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
+	req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
 	req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
 
 	return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 24967382a7b1..ad1cfac1942f 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 	stat->ino = attr->ino;
 	stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
 	stat->nlink = attr->nlink;
-	stat->uid = make_kuid(&init_user_ns, attr->uid);
-	stat->gid = make_kgid(&init_user_ns, attr->gid);
+	stat->uid = make_kuid(fc->user_ns, attr->uid);
+	stat->gid = make_kgid(fc->user_ns, attr->gid);
 	stat->rdev = inode->i_rdev;
 	stat->atime.tv_sec = attr->atime;
 	stat->atime.tv_nsec = attr->atimensec;
@@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
 	return true;
 }
 
-static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
-			   bool trust_local_cmtime)
+static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
+			   struct fuse_setattr_in *arg, bool trust_local_cmtime)
 {
 	unsigned ivalid = iattr->ia_valid;
 
 	if (ivalid & ATTR_MODE)
 		arg->valid |= FATTR_MODE,   arg->mode = iattr->ia_mode;
 	if (ivalid & ATTR_UID)
-		arg->valid |= FATTR_UID,    arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
+		arg->valid |= FATTR_UID,    arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
 	if (ivalid & ATTR_GID)
-		arg->valid |= FATTR_GID,    arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
+		arg->valid |= FATTR_GID,    arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
 	if (ivalid & ATTR_SIZE)
 		arg->valid |= FATTR_SIZE,   arg->size = iattr->ia_size;
 	if (ivalid & ATTR_ATIME) {
@@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
 
 	memset(&inarg, 0, sizeof(inarg));
 	memset(&outarg, 0, sizeof(outarg));
-	iattr_to_fattr(attr, &inarg, trust_local_cmtime);
+	iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
 	if (file) {
 		struct fuse_file *ff = file->private_data;
 		inarg.valid |= FATTR_FH;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index c4c093bbf456..7772e2b4057e 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -26,6 +26,7 @@
 #include <linux/xattr.h>
 #include <linux/pid_namespace.h>
 #include <linux/refcount.h>
+#include <linux/user_namespace.h>
 
 /** Max number of pages that can be used in a single read request */
 #define FUSE_MAX_PAGES_PER_REQ 32
@@ -466,6 +467,9 @@ struct fuse_conn {
 	/** The pid namespace for this mount */
 	struct pid_namespace *pid_ns;
 
+	/** The user namespace for this mount */
+	struct user_namespace *user_ns;
+
 	/** Maximum read size */
 	unsigned max_read;
 
@@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
 /**
  * Initialize fuse_conn
  */
-void fuse_conn_init(struct fuse_conn *fc);
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
 
 /**
  * Release reference to fuse_conn
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 624f18bbfd2b..e018dc3999f4 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 	inode->i_ino     = fuse_squash_ino(attr->ino);
 	inode->i_mode    = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
 	set_nlink(inode, attr->nlink);
-	inode->i_uid     = make_kuid(&init_user_ns, attr->uid);
-	inode->i_gid     = make_kgid(&init_user_ns, attr->gid);
+	inode->i_uid     = make_kuid(fc->user_ns, attr->uid);
+	inode->i_gid     = make_kgid(fc->user_ns, attr->gid);
 	inode->i_blocks  = attr->blocks;
 	inode->i_atime.tv_sec   = attr->atime;
 	inode->i_atime.tv_nsec  = attr->atimensec;
@@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
 	return err;
 }
 
-static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
+static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
+			  struct user_namespace *user_ns)
 {
 	char *p;
 	memset(d, 0, sizeof(struct fuse_mount_data));
@@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
 		case OPT_USER_ID:
 			if (fuse_match_uint(&args[0], &uv))
 				return 0;
-			d->user_id = make_kuid(current_user_ns(), uv);
+			d->user_id = make_kuid(user_ns, uv);
 			if (!uid_valid(d->user_id))
 				return 0;
 			d->user_id_present = 1;
@@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
 		case OPT_GROUP_ID:
 			if (fuse_match_uint(&args[0], &uv))
 				return 0;
-			d->group_id = make_kgid(current_user_ns(), uv);
+			d->group_id = make_kgid(user_ns, uv);
 			if (!gid_valid(d->group_id))
 				return 0;
 			d->group_id_present = 1;
@@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
 	struct super_block *sb = root->d_sb;
 	struct fuse_conn *fc = get_fuse_conn_super(sb);
 
-	seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
-	seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
+	seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
+	seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
 	if (fc->default_permissions)
 		seq_puts(m, ",default_permissions");
 	if (fc->allow_other)
@@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
 	fpq->connected = 1;
 }
 
-void fuse_conn_init(struct fuse_conn *fc)
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
 {
 	memset(fc, 0, sizeof(*fc));
 	spin_lock_init(&fc->lock);
@@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc)
 	fc->attr_version = 1;
 	get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
 	fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
+	fc->user_ns = get_user_ns(user_ns);
 }
 EXPORT_SYMBOL_GPL(fuse_conn_init);
 
@@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc)
 		if (fc->destroy_req)
 			fuse_request_free(fc->destroy_req);
 		put_pid_ns(fc->pid_ns);
+		put_user_ns(fc->user_ns);
 		fc->release(fc);
 	}
 }
@@ -1061,7 +1064,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 
 	sb->s_flags &= ~(SB_NOSEC | SB_I_VERSION);
 
-	if (!parse_fuse_opt(data, &d, is_bdev))
+	if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
 		goto err;
 
 	if (is_bdev) {
@@ -1086,8 +1089,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	if (!file)
 		goto err;
 
-	if ((file->f_op != &fuse_dev_operations) ||
-	    (file->f_cred->user_ns != &init_user_ns))
+	/*
+	 * Require mount to happen from the same user namespace which
+	 * opened /dev/fuse to prevent potential attacks.
+	 */
+	if (file->f_op != &fuse_dev_operations ||
+	    file->f_cred->user_ns != sb->s_user_ns)
 		goto err_fput;
 
 	fc = kmalloc(sizeof(*fc), GFP_KERNEL);
@@ -1095,7 +1102,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	if (!fc)
 		goto err_fput;
 
-	fuse_conn_init(fc);
+	fuse_conn_init(fc, sb->s_user_ns);
 	fc->release = fuse_free_conn;
 
 	fud = fuse_dev_alloc(fc);
-- 
2.14.1
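The patch converts ids with `fc->user_ns` instead of `&init_user_ns`, so a uid the FUSE daemon reports is interpreted relative to the user namespace that owns the mount. Below is a rough user-space model of that translation. It assumes a simplified map built from extents like those written to `/proc/<pid>/uid_map`, where each extent is `first lower_first count`. The `model_` types and functions are hypothetical stand-ins for `make_kuid()`, `from_kuid()` and `uid_valid()`, not kernel code. An id that falls outside every extent becomes invalid instead of aliasing some host id. This is why `parse_fuse_opt()` can reject a `user_id=` value that does not map.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Small fixed capacity, chosen only for this sketch. */
#define MODEL_MAX_EXTENTS 5

/* Hypothetical model of one uid_map line: namespace ids
 * [first, first + count) map to parent ids [lower_first, lower_first + count). */
struct id_extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

struct model_userns {
	size_t nr_extents;
	struct id_extent extents[MODEL_MAX_EXTENTS];
};

/* Stand-in for INVALID_UID: (uint32_t)-1 is never a mapped id. */
#define MODEL_INVALID_ID UINT32_MAX

/* Rough analogue of make_kuid(ns, uid): namespace id -> host id. */
static uint32_t model_make_kuid(const struct model_userns *ns, uint32_t uid)
{
	assert(ns->nr_extents <= MODEL_MAX_EXTENTS);
	for (size_t i = 0; i < ns->nr_extents; i++) {
		const struct id_extent *e = &ns->extents[i];
		if (uid >= e->first && uid - e->first < e->count)
			return e->lower_first + (uid - e->first);
	}
	return MODEL_INVALID_ID; /* unmapped: no host id corresponds */
}

/* Rough analogue of from_kuid(ns, kuid): host id -> namespace id. */
static uint32_t model_from_kuid(const struct model_userns *ns, uint32_t kuid)
{
	assert(ns->nr_extents <= MODEL_MAX_EXTENTS);
	for (size_t i = 0; i < ns->nr_extents; i++) {
		const struct id_extent *e = &ns->extents[i];
		if (kuid >= e->lower_first && kuid - e->lower_first < e->count)
			return e->first + (kuid - e->lower_first);
	}
	return MODEL_INVALID_ID; /* host id has no name inside ns */
}

/* Rough analogue of uid_valid(). */
static bool model_uid_valid(uint32_t kuid)
{
	return kuid != MODEL_INVALID_ID;
}
```

With a container map of `0 100000 65536`, namespace uid 1000 corresponds to host uid 101000, while host root (uid 0) has no name inside the namespace at all. Before this patch every conversion used the initial namespace, whose map is the identity over all ids, so these lookups could not fail for a valid id.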

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v9 4/4] fuse: Restrict allow_other to the superblock's namespace or a descendant
  2018-03-08 21:23       ` [PATCH v9 0/4] fuse: mounts from non-init user namespaces Eric W. Biederman
                           ` (2 preceding siblings ...)
  2018-03-08 21:24         ` [PATCH v9 3/4] fuse: Support fuse filesystems outside of init_user_ns Eric W. Biederman
@ 2018-03-08 21:24         ` Eric W. Biederman
  2018-03-20 16:25         ` [PATCH v9 0/4] fuse: mounts from non-init user namespaces Miklos Szeredi
  4 siblings, 0 replies; 89+ messages in thread
From: Eric W. Biederman @ 2018-03-08 21:24 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Eric W. Biederman

From: Seth Forshee <seth.forshee@canonical.com>

Unprivileged users are normally restricted from mounting with the
allow_other option by system policy, but this could be bypassed
for a mount done with user namespace root permissions. In such
cases allow_other should not allow users outside the userns
to access the mount, as doing so would give the unprivileged user
the ability to manipulate processes it would otherwise be unable
to manipulate. Restrict allow_other to apply to users in the same
userns used at mount or a descendant of that namespace. Also
export current_in_userns() for use by fuse when built as a
module.

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/fuse/dir.c           | 2 +-
 kernel/user_namespace.c | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index ad1cfac1942f..d41559a0aa6b 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
 	const struct cred *cred;
 
 	if (fc->allow_other)
-		return 1;
+		return current_in_userns(fc->user_ns);
 
 	cred = current_cred();
 	if (uid_eq(cred->euid, fc->user_id) &&
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 246d4d4ce5c7..492c255e6c5a 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace *target_ns)
 {
 	return in_userns(target_ns, current_user_ns());
 }
+EXPORT_SYMBOL(current_in_userns);
 
 static inline struct user_namespace *to_user_ns(struct ns_common *ns)
 {
-- 
2.14.1
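`current_in_userns()` is true when the caller's user namespace is `target_ns` itself or one of its descendants. It determines this by walking up from the caller's namespace through its parents until it reaches the target's depth. The following is a minimal user-space model of that check and of the patched allow_other branch of `fuse_allow_current_process()`. It assumes a hypothetical `struct model_ns` that keeps only `parent` and `level`; the real `struct user_namespace` carries much more. The uid/gid checks used when allow_other is off are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct user_namespace: just the tree links. */
struct model_ns {
	const struct model_ns *parent; /* NULL only for the initial namespace */
	int level;                     /* 0 for the initial namespace */
};

/* Climb from child to ancestor's depth, then compare. True if child is
 * ancestor itself or nested somewhere below it. */
static bool model_in_userns(const struct model_ns *ancestor,
			    const struct model_ns *child)
{
	const struct model_ns *ns = child;

	while (ns->level > ancestor->level)
		ns = ns->parent;
	return ns == ancestor;
}

/* Sketch of the allow_other branch after the patch: instead of admitting
 * everyone, admit only tasks in the mount's namespace or a descendant. */
static bool model_allow_other_ok(const struct model_ns *mount_ns,
				 const struct model_ns *caller_ns)
{
	return model_in_userns(mount_ns, caller_ns);
}
```

In this model, a task in the initial namespace is refused access to an allow_other mount created inside a container's namespace, because the initial namespace is an ancestor of the mount's namespace, not a descendant. A sibling container is refused for the same reason.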

* Re: [PATCH v9 0/4] fuse: mounts from non-init user namespaces
  2018-03-08 21:23       ` [PATCH v9 0/4] fuse: mounts from non-init user namespaces Eric W. Biederman
                           ` (3 preceding siblings ...)
  2018-03-08 21:24         ` [PATCH v9 4/4] fuse: Restrict allow_other to the superblock's namespace or a descendant Eric W. Biederman
@ 2018-03-20 16:25         ` Miklos Szeredi
  2018-03-20 18:27           ` Eric W. Biederman
  4 siblings, 1 reply; 89+ messages in thread
From: Miklos Szeredi @ 2018-03-20 16:25 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: lkml, Linux Containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn

On Thu, Mar 8, 2018 at 10:23 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> This patchset builds on the work by Dongsu Park and Seth Forshee and is
> reduced to the set of patches that just affect fuse.  The non-fuse
> vfs patches are far enough along we can ignore them except possibly for the
> question of when does FS_USERNS_MOUNT get set in fuse_fs_type.
>
> Fuse with a block device has been left as an exercise for a later time.
>
> Since v5 I changed the core of this patchset around as the previous
> patches were showing signs of bitrot.  Some important explanations were
> missing, some important functionality was missing, and xattr handling
> was completely absent.
>
> Since v6 I have:
> - Removed the failure case from fuse_get_req_nofail_nopages that I
>   added.
> - Updated fuse to always use posix_acl_access_xattr_handler, and
>   posix_acl_default_xattr_handler, by teaching fuse to set
>   ACL_DONT_CACHE when FUSE_POSIX_ACL is not set.
>
> Since v7 I have:
> - Rethought and reworked how I am unifying the cached and the non-cached
>   posix acl case so the code is cleaner and simpler.
>   - I have dropped enhancements to caching negative acls when
>     fc->no_getxattr is set.
>   - Removed the need to wrap forget_all_cached_acls in fuse.
> - Reorder the patches so the posix acl work comes first
>
> Since v8 I have:
> - Dropped and postponed the unification of the uncached and the cached
>   posix acls case.  The code is not hard, but it is tricky enough that it
>   needs to be considered on its own merits.
>
> Miklos can you take a look and see what you think?
>
> Miklos if you could pick these up I would appreciate it.  If not I can
> merge these through the userns tree.

Thank you Eric for moving this along.  Patches pushed to:

  git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git for-next

I did just one modification to "fuse: Fail all requests with invalid
uids or gids": instead of zeroing out the context for the nofail case,
continue to use the "_munged" variants. I don't think this hurts and
is better for backward compatibility (I guess the only relevant use
would be for debugging output, but we don't want to regress even for
that if not necessary).

Thanks,
Miklos

* Re: [PATCH v9 0/4] fuse: mounts from non-init user namespaces
  2018-03-20 16:25         ` [PATCH v9 0/4] fuse: mounts from non-init user namespaces Miklos Szeredi
@ 2018-03-20 18:27           ` Eric W. Biederman
  2018-03-21  8:38             ` Miklos Szeredi
  0 siblings, 1 reply; 89+ messages in thread
From: Eric W. Biederman @ 2018-03-20 18:27 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: lkml, Linux Containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn

Miklos Szeredi <mszeredi@redhat.com> writes:

> On Thu, Mar 8, 2018 at 10:23 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> This patchset builds on the work by Dongsu Park and Seth Forshee and is
>> reduced to the set of patches that just affect fuse.  The non-fuse
>> vfs patches are far enough along we can ignore them except possibly for the
>> question of when does FS_USERNS_MOUNT get set in fuse_fs_type.
>>
>> Fuse with a block device has been left as an exercise for a later time.
>>
>> Since v5 I changed the core of this patchset around as the previous
>> patches were showing signs of bitrot.  Some important explanations were
>> missing, some important functionality was missing, and xattr handling
>> was completely absent.
>>
>> Since v6 I have:
>> - Removed the failure case from fuse_get_req_nofail_nopages that I
>>   added.
>> - Updated fuse to always use posix_acl_access_xattr_handler, and
>>   posix_acl_default_xattr_handler, by teaching fuse to set
>>   ACL_DONT_CACHE when FUSE_POSIX_ACL is not set.
>>
>> Since v7 I have:
>> - Rethought and reworked how I am unifying the cached and the non-cached
>>   posix acl case so the code is cleaner and simpler.
>>   - I have dropped enhancements to caching negative acls when
>>     fc->no_getxattr is set.
>>   - Removed the need to wrap forget_all_cached_acls in fuse.
>> - Reorder the patches so the posix acl work comes first
>>
>> Since v8 I have:
>> - Dropped and postponed the unification of the uncached and the cached
>>   posix acls case.  The code is not hard, but it is tricky enough that it
>>   needs to be considered on its own merits.
>>
>> Miklos can you take a look and see what you think?
>>
>> Miklos if you could pick these up I would appreciate it.  If not I can
>> merge these through the userns tree.
>
> Thank you Eric for moving this along.  Patches pushed to:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git for-next
>
> I did just one modification to "fuse: Fail all requests with invalid
> uids or gids": instead of zeroing out the context for the nofail case,
> continue to use the "_munged" variants. I don't think this hurts and
> is better for backward compatibility (I guess the only relevant use
> would be for debugging output, but we don't want to regress even for
> that if not necessary)

Hmm...

The thing is, the failure doesn't come from the difference between the
_munged and the normal variants.  The two differ only in how they
report failure: (uid16_t)-2 aka 0xfffe for the munged case and -1 for
the non-munged case.
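A minimal user-space model of the two variants makes the point concrete. Both fail on exactly the same ids; they differ only in the value returned on failure. The sketch uses hypothetical `model_` names and a single-extent map. It assumes an overflow uid of 65534 (0xfffe), which is the kernel's default `overflowuid` and can be changed through `/proc/sys/kernel/overflowuid`.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t model_uid_t;

/* Assumed default overflow id: 65534 == 0xfffe. */
#define MODEL_OVERFLOWUID ((model_uid_t)65534)

/* Hypothetical single-extent map: ns ids [0, count) <-> host [lower, lower + count). */
struct model_map {
	model_uid_t lower;
	model_uid_t count;
};

/* Non-munged lookup: an unmapped id comes back as (uid_t)-1. */
static model_uid_t model_from_kuid(const struct model_map *m, model_uid_t kuid)
{
	if (kuid >= m->lower && kuid - m->lower < m->count)
		return kuid - m->lower;
	return (model_uid_t)-1;
}

/* Munged lookup: same mapping, but failure is reported as the overflow id. */
static model_uid_t model_from_kuid_munged(const struct model_map *m,
					  model_uid_t kuid)
{
	model_uid_t uid = model_from_kuid(m, kuid);

	if (uid == (model_uid_t)-1)
		uid = MODEL_OVERFLOWUID;
	return uid;
}
```

For a mapped id such as host uid 101000 under the map `0 100000 65536`, both functions return 1000. For host uid 0, which is unmapped, the plain variant returns 0xffffffff and the munged variant returns 0xfffe.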

The failures are introduced by changing &init_user_ns to fc->user_ns.

The operations in question are iop->flush and fuse_force_forget (on an
error).  I don't know what value having ids on those paths adds:
they are operations that must succeed, and they should not change the
on-disk ids.  I was thinking that saying the most privileged id was
asking for the operation would make sense.

With the munged variants we will get (uid16_t)-2 aka 0xfffe aka
nobody asking for the operation if things don't map.  In practice
the doesn't-map case is new.

Since the ids should not be looked at anyway, I don't see that it makes
much difference which ids we use, so the munged case seems at least
plausible.

It might be better to use the non-munged variant and do:
	if (req->in.h.uid == (uid_t)-1)
		req->in.h.uid = 0;
	if (req->in.h.gid == (gid_t)-1)
		req->in.h.gid = 0;

That might be less surprising to userspace, as I don't think the
unmapped case has ever occurred in practice yet.  The vfs works hard
to keep the unmapped case from happening, but only in the context of
i_uid and i_gid, not current_fsuid and current_fsgid.

Eric
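The alternative proposed above (translate with the non-munged helper, then replace an unmappable id with 0 on the must-succeed paths) can be sketched in user space. This sketch only illustrates the proposal. It is not the code that was merged, since Miklos kept the munged variants. `struct model_in_header` and the helper functions are hypothetical, loosely patterned on the uid/gid fields of the FUSE request header, and the same single-extent map is assumed for uids and gids.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down request header: only the id fields matter here. */
struct model_in_header {
	uint32_t uid;
	uint32_t gid;
};

/* Hypothetical single-extent map shared by uids and gids. */
struct model_map {
	uint32_t lower;
	uint32_t count;
};

/* Non-munged translation: unmapped ids come back as (uint32_t)-1. */
static uint32_t model_from_kid(const struct model_map *m, uint32_t kid)
{
	if (kid >= m->lower && kid - m->lower < m->count)
		return kid - m->lower;
	return (uint32_t)-1;
}

/* Fill ids for a request that must not fail: unmapped ids become 0, the
 * namespace's most privileged id, instead of -1 or the overflow id. */
static void model_fill_nofail_ids(struct model_in_header *h,
				  const struct model_map *m,
				  uint32_t fsuid, uint32_t fsgid)
{
	h->uid = model_from_kid(m, fsuid);
	h->gid = model_from_kid(m, fsgid);
	if (h->uid == (uint32_t)-1)
		h->uid = 0;
	if (h->gid == (uint32_t)-1)
		h->gid = 0;
}
```

With the munged variants, the same unmapped ids would reach the daemon as 65534 rather than 0. That is the behavioral difference the two messages are weighing.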

* Re: [PATCH v9 0/4] fuse: mounts from non-init user namespaces
  2018-03-20 18:27           ` Eric W. Biederman
@ 2018-03-21  8:38             ` Miklos Szeredi
  0 siblings, 0 replies; 89+ messages in thread
From: Miklos Szeredi @ 2018-03-21  8:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: lkml, Linux Containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn

On Tue, Mar 20, 2018 at 7:27 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Miklos Szeredi <mszeredi@redhat.com> writes:

>> I did just one modification to "fuse: Fail all requests with invalid
>> uids or gids": instead of zeroing out the context for the nofail case,
>> continue to use the "_munged" variants. I don't think this hurts and
>> is better for backward compatibility (I guess the only relevant use
>> would be for debugging output, but we don't want to regress even for
>> that if not necessary)
>
> Hmm...
>
> The thing is, the failure doesn't come from the difference between the
> _munged and the normal variants.  The two differ only in how they
> report failure: (uid16_t)-2 aka 0xfffe for the munged case and -1 for
> the non-munged case.
>
> The failures are introduced by changing &init_user_ns to fc->user_ns.

Right.

> The operations in question are iop->flush and fuse_force_forget (on an
> error).  I don't know what value having ids on those paths adds:
> they are operations that must succeed, and they should not change the
> on-disk ids.  I was thinking that saying the most privileged id was
> asking for the operation would make sense.

I don't think anybody should actually *care* about the ids in flush,
but I'd still rather not change the current behavior for change's sake.

>
> With the munged variants we will get (uid16_t)-2 aka 0xfffe aka
> nobody asking for the operation if things don't map.  In practice
> the doesn't-map case is new.
>
> Since the ids should not be looked at anyway, I don't see that it makes
> much difference which ids we use, so the munged case seems at least
> plausible.
>
> It might be better to use the non-munged variant and do:
>         if (req->in.h.uid == (uid_t)-1)
>                 req->in.h.uid = 0;
>         if (req->in.h.gid == (gid_t)-1)
>                 req->in.h.gid = 0;
>
> That might be less surprising to userspace, as I don't think the
> unmapped case has ever occurred in practice yet.

Right, that would work too, but I don't think it actually matters, so
unless you can think of an actual security issue arising from using
the munged variants, I'd just leave it as it is.

Thanks,
Miklos

end of thread, other threads:[~2018-03-21  8:38 UTC | newest]

Thread overview: 89+ messages
     [not found] <cover.1512741134.git.dongsu@kinvolk.io>
2017-12-22 14:32 ` [PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev() Dongsu Park
2017-12-22 18:59   ` Coly Li
2017-12-23 12:00     ` Dongsu Park
2017-12-23  3:03   ` Serge E. Hallyn
2017-12-22 14:32 ` [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes Dongsu Park
2017-12-23  3:17   ` Serge E. Hallyn
2018-01-05 19:24   ` Luis R. Rodriguez
2018-01-09 15:10     ` Dongsu Park
2018-01-09 17:23       ` Luis R. Rodriguez
2018-02-13 13:18   ` Miklos Szeredi
2018-02-16 22:00     ` Eric W. Biederman
2017-12-22 14:32 ` [PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root Dongsu Park
2017-12-23  3:26   ` Serge E. Hallyn
2017-12-23 12:38     ` Dongsu Park
2018-02-13 13:37       ` Miklos Szeredi
2017-12-22 14:32 ` [PATCH 05/11] fs: Allow superblock owner to access do_remount_sb() Dongsu Park
2017-12-23  3:30   ` Serge E. Hallyn
2017-12-22 14:32 ` [PATCH 07/11] fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems Dongsu Park
2017-12-23  3:39   ` Serge E. Hallyn
2018-02-14 12:28   ` Miklos Szeredi
2018-02-19 22:56     ` Eric W. Biederman
2017-12-22 14:32 ` [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns Dongsu Park
2017-12-23  3:46   ` Serge E. Hallyn
2018-01-17 10:59   ` Alban Crequy
2018-01-17 14:29     ` Seth Forshee
2018-01-17 18:56       ` Alban Crequy
2018-01-17 19:31         ` Seth Forshee
2018-01-18 10:29           ` Alban Crequy
2018-02-12 15:57   ` Miklos Szeredi
2018-02-12 16:35     ` Eric W. Biederman
2018-02-13 10:20       ` Miklos Szeredi
2018-02-16 21:52         ` Eric W. Biederman
2018-02-20  2:12   ` Eric W. Biederman
2017-12-22 14:32 ` [PATCH 09/11] fuse: Restrict allow_other to the superblock's namespace or a descendant Dongsu Park
2017-12-23  3:50   ` Serge E. Hallyn
2018-02-19 23:16   ` Eric W. Biederman
2017-12-22 14:32 ` [PATCH 10/11] fuse: Allow user namespace mounts Dongsu Park
2017-12-23  3:51   ` Serge E. Hallyn
2018-02-14 13:44   ` Miklos Szeredi
2018-02-15  8:46     ` Miklos Szeredi
2018-02-21 20:24 ` [PATCH v6 0/6] fuse: mounts from non-init user namespaces Eric W. Biederman
2018-02-21 20:29   ` [PATCH v6 1/5] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read Eric W. Biederman
2018-02-22 10:13     ` Miklos Szeredi
2018-02-22 19:04       ` Eric W. Biederman
2018-02-21 20:29   ` [PATCH v6 2/5] fuse: Fail all requests with invalid uids or gids Eric W. Biederman
2018-02-22 10:26     ` Miklos Szeredi
2018-02-22 18:15       ` Eric W. Biederman
2018-02-21 20:29   ` [PATCH v6 3/5] fuse: Support fuse filesystems outside of init_user_ns Eric W. Biederman
2018-02-21 20:29   ` [PATCH v6 4/5] fuse: Ensure posix acls are translated " Eric W. Biederman
2018-02-22 11:40     ` Miklos Szeredi
2018-02-22 19:18       ` Eric W. Biederman
2018-02-22 22:50         ` Eric W. Biederman
2018-02-26  7:47           ` Miklos Szeredi
2018-02-26 16:35             ` Eric W. Biederman
2018-02-26 21:51               ` Eric W. Biederman
2018-02-21 20:29   ` [PATCH v6 5/5] fuse: Restrict allow_other to the superblock's namespace or a descendant Eric W. Biederman
2018-02-26 23:52   ` [PATCH v7 0/7] fuse: mounts from non-init user namespaces Eric W. Biederman
2018-02-26 23:52     ` [PATCH v7 1/7] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read Eric W. Biederman
2018-02-26 23:52     ` [PATCH v7 2/7] fuse: Fail all requests with invalid uids or gids Eric W. Biederman
2018-02-26 23:52     ` [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE Eric W. Biederman
2018-02-27  1:13       ` Linus Torvalds
2018-02-27  2:53         ` Eric W. Biederman
2018-02-27  3:14           ` Eric W. Biederman
2018-02-27  3:41             ` Linus Torvalds
2018-03-02 19:53               ` [RFC][PATCH] fs/posix_acl: Update the comments and support lightweight cache skipping Eric W. Biederman
2018-02-27  3:36           ` [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE Linus Torvalds
2018-02-26 23:52     ` [PATCH v7 4/7] fuse: Cache a NULL acl when FUSE_GETXATTR returns -ENOSYS Eric W. Biederman
2018-02-26 23:53     ` [PATCH v7 5/7] fuse: Simplfiy the posix acl handling logic Eric W. Biederman
2018-02-27  9:00       ` Miklos Szeredi
2018-03-02 21:49         ` Eric W. Biederman
2018-02-26 23:53     ` [PATCH v7 6/7] fuse: Support fuse filesystems outside of init_user_ns Eric W. Biederman
2018-02-26 23:53     ` [PATCH v7 7/7] fuse: Restrict allow_other to the superblock's namespace or a descendant Eric W. Biederman
2018-03-02 21:58     ` [PATCH v8 0/6] fuse: mounts from non-init user namespaces Eric W. Biederman
2018-03-02 21:59       ` [PATCH v8 1/6] fs/posix_acl: Update the comments and support lightweight cache skipping Eric W. Biederman
2018-03-05  9:53         ` Miklos Szeredi
2018-03-05 13:53           ` Eric W. Biederman
2018-03-02 21:59       ` [PATCH v8 2/6] fuse: Simplfiy the posix acl handling logic Eric W. Biederman
2018-03-02 21:59       ` [PATCH v8 3/6] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read Eric W. Biederman
2018-03-02 21:59       ` [PATCH v8 4/6] fuse: Fail all requests with invalid uids or gids Eric W. Biederman
2018-03-02 21:59       ` [PATCH v8 5/6] fuse: Support fuse filesystems outside of init_user_ns Eric W. Biederman
2018-03-02 21:59       ` [PATCH v8 6/6] fuse: Restrict allow_other to the superblock's namespace or a descendant Eric W. Biederman
2018-03-08 21:23       ` [PATCH v9 0/4] fuse: mounts from non-init user namespaces Eric W. Biederman
2018-03-08 21:24         ` [PATCH v9 1/4] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read Eric W. Biederman
2018-03-08 21:24         ` [PATCH v9 2/4] fuse: Fail all requests with invalid uids or gids Eric W. Biederman
2018-03-08 21:24         ` [PATCH v9 3/4] fuse: Support fuse filesystems outside of init_user_ns Eric W. Biederman
2018-03-08 21:24         ` [PATCH v9 4/4] fuse: Restrict allow_other to the superblock's namespace or a descendant Eric W. Biederman
2018-03-20 16:25         ` [PATCH v9 0/4] fuse: mounts from non-init user namespaces Miklos Szeredi
2018-03-20 18:27           ` Eric W. Biederman
2018-03-21  8:38             ` Miklos Szeredi