LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [git pull] mount API series
@ 2018-10-31  5:33 Al Viro
  2018-10-31 15:38 ` Eric W. Biederman
                   ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Al Viro @ 2018-10-31  5:33 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-fsdevel, linux-kernel

	mount API series from David Howells.  Last cycle's objections
had been of the "I'd do it differently" variety and with no such
differently done variants having ever materialized over several
cycles...

Conflicts: two trivial ones in
	drivers/infiniband/Kconfig (removal of select ANON_INODES)
	fs/f2fs/super.c (->remount signature change)
and a non-trivial one in fs/proc/inode.c - there we have
mainline adding
	/* procfs dentries and inodes don't require IO to create */
	s->s_shrink.seeks = 0;
to proc_fill_super() (in 4b85afbdacd2 "mm: zero-seek shrinkers")
while that series moves the sucker to fs/proc/root.c.  Resolved by
removing the old copy from fs/proc/inode.c and adding the same lines
into the new copy in fs/proc/root.c.
I'd put a variant of resolution into #proposed-merge.

David's cover letter follows; it's obviously over the top for commit
message of the merge.  Where to cut it is up to you...

=========================================================================

Here are a set of patches to create a filesystem context prior to setting
up a new mount, populating it with the parsed options/binary data, looking
up/creating the superblock, querying it and then effecting the mount.  This
is also used for remount since much of the parsing stuff is common in many
filesystems.

This allows namespaces and other information to be conveyed through the
mount procedure.  This is done with something like:

	fd = fsopen("nfs", 0);
	fsconfig(fd, FSCONFIG_SET_STRING, "option", "val", 0);
	fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
	struct fsinfo_statfs statfs;
	fsinfo(fd, NULL, NULL, &statfs, sizeof(statfs));
	mfd = fsmount(fd, MS_NODEV);
	move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

The new move_mount() syscall can also be used simply to move mounts around:

	move_mount(AT_FDCWD, "/mnt", AT_FDCWD, "/mnt2", 0);

And, in conjunction with the open_tree() syscall, can be used to clone
mounts:

	fd = open_tree(AT_FDCWD, "/mnt", AT_RECURSIVE | OPEN_TREE_CLONE);
	move_mount(mfd, "", AT_FDCWD, "/mnt2", MOVE_MOUNT_F_EMPTY_PATH);

File descriptors can be used as mountpoint references:

	fd = open_tree(AT_FDCWD, "/mnt", 0);
	move_mount(mfd, "", AT_FDCWD, "/mnt2", MOVE_MOUNT_F_EMPTY_PATH);
	move_mount(mfd, "", AT_FDCWD, "/mnt3", MOVE_MOUNT_F_EMPTY_PATH);

which, in this example, will *move* the mount at /mnt to /mnt2 and thence
to /mnt3.

Superblocks can be picked and reconfigured:

	fd = fspick(AT_FDCWD, "/mnt", 0)
	fsconfig(fd, FSCONFIG_SET_STRING, "option", "other-val", 0);
	fsconfig(fd, FSCONFIG_SET_STRING, "option2", "true", 0);
	fsconfig(fd, FSCONFIG_SET_STRING, "option3", "1234", 0);
	fsconfig(fd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0);

Filesystem parameters and other attributes can also be queried from the fd
returned by fsopen() or fspick():

	fd = fspick(AT_FDCWD, "/mnt", 0)
	struct fsinfo_params params = {
		.request	= FSINFO_ATTR_PARAMETER,
		.Nth		= 1,
		.Mth		= 3,
	};
	char param_buf[4096];
	fsinfo(fd, NULL, &params, param_buf, sizeof(param_buf));

which will retrieve the 4th value of the 2nd parameter (0 being first) as a
printable string.

Parameters and attributes can also be queried by path or on an ordinary fd:

	struct fsinfo_params params = {
		.request	= FSINFO_ATTR_VOLUME_NAME,
	};
	char param_buf[4096];
	fsinfo(AT_FDCWD, "/etc/passwd", &params, param_buf,
	       sizeof(param_buf));

The details of a filesystem's parser can also be queried:

	fd = fsopen("ext4", 0);
	struct fsinfo_params params = {
		.request	= FSINFO_ATTR_PARAM_NAME,
		.Nth		= 1,
	};
	char param_buf[4096];
	fsinfo(fd, NULL, &params, param_buf, sizeof(param_buf));

which, in this instance, will retrieve the name of parameter #1.

I have implemented filesystem context handling for procfs, nfs, mqueue,
cpuset, kernfs, sysfs, cgroup and afs filesystems.  Unconverted filesystems
are handled by a legacy filesystem wrapper for the moment.

Note that I didn't use netlink as that would make the core kernel depend on
CONFIG_NET and CONFIG_NETLINK and would introduce network namespacing
issues.


====================
WHY DO WE WANT THIS?
====================

Firstly, there's a bunch of problems with the mount(2) syscall:

 (1) It's actually six or seven different interfaces rolled into one and
     weird combinations of flags make it do different things beyond the
     original specification of the syscall.

 (2) It produces a particularly large and diverse set of errors, which have
     to be mapped back to a small error code.  Yes, there's dmesg - if you
     have it configured - but you can't necessarily see that if you're
     doing a mount inside of a container.

 (3) It copies a PAGE_SIZE block of data for each of the type, device name
     and options.

 (4) The size of the buffers is PAGE_SIZE - and this is arch dependent.

 (5) You can't mount into another mount namespace.  I could, for example,
     build a container without having to be in that container's namespace
     if I can do it from outside.

 (6) It's not really geared for the specification of multiple sources, but
     some filesystems really want that - overlayfs, for example.

and some problems in the internal kernel API:

 (1) There's no defined way to supply namespace configuration for the
     superblock - so, for instance, I can't say that I want to create a
     superblock in a particular network namespace (on automount, say).

     NFS hacks around this by creating multiple shadow file_system_types
     with different ->mount() ops.

 (2) When calling mount internally, unless you have NFS-like hacks, you
     have to generate or otherwise provide text config data which then gets
     parsed, when some of the time you could bypass the parsing stage
     entirely.

 (3) The amount of data in the data buffer is not known, but the data
     buffer might be on a kernel stack somewhere, leading to the
     possibility of tripping the stack underrun guard.

and other issues too:

 (1) Superblock remount in some filesystems applies options on an as-parsed
     basis, so if there's a parse failure, a partial alteration with no
     rollback is effected.

 (2) Under some circumstances, the mount data may get copied multiple times
     so that it can have multiple parsers applied to it or because it has
     to be parsed multiple times - for instance, once to get the
     preliminary info required to access the on-disk superblock and then
     again to update the superblock record in the kernel.

I want to be able to add support for a bunch of things:

 (1) UID, GID and Project ID mapping/translation.  I want to be able to
     install a translation table of some sort on the superblock to
     translate source identifiers (which may be foreign numeric UIDs/GIDs,
     text names, GUIDs) into system identifiers.  This needs to be done
     before the superblock is published[*].

     Note that this may, for example, involve using the context and the
     superblock held therein to issue an RPC to a server to look up
     translations.

     [*] By "published" I mean made available through mount so that other
     	 userspace processes can access it by path.

     Maybe specifying a translation range element with something like:

	fsconfig(fd, fsconfig_translate_uid, "<srcuid> <nsuid> <count>", 0, 0);

     The translation information also needs to propagate over an automount
     in some circumstances.

 (2) Namespace configuration.  I want to be able to tell the superblock
     creation process what namespaces should be applied when it created (in
     particular the userns and netns) for containerisation purposes, e.g.:

	fsconfig(fd, FSCONFIG_SET_NAMESPACE, "user", 0, userns_fd);
	fsconfig(fd, FSCONFIG_SET_NAMESPACE, "net", 0, netns_fd);

 (3) Namespace propagation.  I want to have a properly defined mechanism
     for propagating namespace configuration over automounts within the
     kernel.  This will be particularly useful for network filesystems.

 (4) Pre-mount attribute query.  A chunk of the changes is actually the
     fsinfo() syscall to query attributes of the filesystem beyond what's
     available in statx() and statfs().  This will allow a created
     superblock to be queried before it is published.

 (5) Upcall for configuration.  I would like to be able to query
     configuration that's stored in userspace when an automount is made.
     For instance, to look up network parameters for NFS or to find a cache
     selector for fscache.

     The internal fs_context could be passed to the upcall process or the
     kernel could read a config file directly if named appropriately for the
     superblock, perhaps:

	[/etc/fscontext.d/afs/example.com/cell.cfg]
	realm = EXAMPLE.COM
	translation = uid,3000,4000,100
	fscache = tag=fred

 (6) Event notifications.  I want to be able to install a watch on a
     superblock before it is published to catch things like quota events
     and EIO.

 (7) Large and binary parameters.  There might be at some point a need to
     pass large/binary objects like Microsoft PACs around.  If I understand
     PACs correctly, you can obtain these from the Kerberos server and then
     pass them to the file server when you connect.

     Having it possible to pass large or binary objects as individual
     fsconfig calls make parsing these trivial.  OTOH, some or all of this
     can potentially be handled with the use of the keyrings interface - as
     the afs filesystem does for passing kerberos tokens around; it's just
     that that seems overkill for a parameter you may only need once.


===================
SIGNIFICANT CHANGES
===================

 ver #13:

 (*) Fix the default handling of the source parameter for a filesystem that
     doesn't support it (stash the string to fc->source).

 (*) Fix cgroup mounting.  This is slightly awkward as we can't call
     vfs_get_tree() from within the ->get_tree() op as the former drops
     s_umount before returning.

 (*) Fixes/cleanups from Eric Biederman, including:

     - Fix error handling in do_remount().

 (*) In sample programs, define syscall symbols (__NR_xxx) to -1 if not
     defined in the header files so that the samples compile, but fail
     gracefully with ENOSYS.

 ver #12:

 (*) Rebased on v4.19-rc3.

 (*) Added three new context purposes: mount for hidden root, reconfigure
     for unmount, reconfigure for emergency remount.

 (*) Added a parameter for the new purpose into vfs_dup_fs_context().

 (*) Moved the reconfiguration hook from struct super_operations to struct
     fs_context_operations so they can be handled through the legacy
     wrapper.  mount -o remount now goes through that.

 (*) Changed the parameter description in the following ways:

     - Nominated one master name for each parameter, held in a simple
       string pointer array.  This makes it easy to simply look up a name
       for that parameter for logging.

     - Added a table of additional names for parameters.  The name chosen
       can be used to influence the action of the parameter.

     - Noted which parameter is the source specifier, if there is one.

 (*) Use correct user_ns for a new pidns superblock.

 (*) Fix mqueue to not crash on mounting.

 (*) Make VFS sample programs dependent on X86 to avoid errors in
     autobuilders due to unset syscall IDs in other arches.

 (*) [Miklós] Fixed subtype handling.

 ver #11:

 (*) Fixed AppArmor.

 (*) Capitalised all the UAPI constants.

 (*) Explicitly numbered the FSCONFIG_* UAPI constants.

 (*) Removed all the places ANON_INODES is selected.

 (*) Fixed a bug whereby the context gets freed twice (which broke mounts of
     procfs).

 (*) Split fsinfo() off into its own patch series.

 ver #10:

 (*) Renamed "option" to "parameter" in a number of places.

 (*) Replaced the use of write() to drive the configuration with an fsconfig()
     syscall.  This also allows at-style paths and fds to be presented as typed
     object.

 (*) Routed the key=value parameter concept all the way through from the
     fsconfig() system call to the LSM and filesystem.

 (*) Added a parameter-description concept and helper functions to help
     interpret a parameter and possibly convert the value.

 (*) Made it possible to query the parameter description using the fsinfo()
     syscall.  Added a test-fs-query sample to dump the parameters used by a
     filesystem.

 ver #9:

 (*) Dropped the fd cookie stuff and the FMODE_*/O_* split stuff.

 (*) Al added an open_tree() system call to allow a mount tree to be picked
     referenced or cloned into an O_PATH-style fd.  This can then be used
     with sys_move_mount().  Dropped the O_CLONE_MOUNT and O_NON_RECURSIVE
     open() flags.

 (*) Brought error logging back in, though only in the fs_context and not
     in the task_struct.

 (*) Separated MS_REMOUNT|MS_BIND handling from MS_REMOUNT handling.

 (*) Used anon_inodes for the fd returned by fsopen() and fspick().  This
     requires making it unconditional.

 (*) Fixed lots of bugs.  Especial thanks to Al and Eric Biggers for
     finding them and providing patches.

 (*) Wrote manual pages, which I'll post separately.

 ver #8:

 (*) Changed the way fsmount() mounts into the namespace according to some
     of Al's ideas.

 (*) Put better typing on the fd cookie obtained from __fdget() & co..

 (*) Stored the fd cookie in struct nameidata rather than the dfd number.

 (*) Changed sys_fsmount() to return an O_PATH-style fd rather than
     actually mounting into the mount namespace.

 (*) Separated internal FMODE_* handling from O_* handling to free up
     certain O_* flag numbers.

 (*) Added two new open flags (O_CLONE_MOUNT and O_NON_RECURSIVE) for use
     with open(O_PATH) to copy a mount or mount-subtree to an O_PATH fd.

 (*) Added a new syscall, sys_move_mount(), to move a mount from an
     dfd+path source to a dfd+path destination.

 (*) Added a file->f_mode flag (FMODE_NEED_UNMOUNT) that indicates that the
     vfsmount attached to file->f_path needs 'unmounting' if set.

 (*) Made sys_move_mount() clear FMODE_NEED_UNMOUNT if successful.

	[!] This doesn't work quite right.

 (*) Added a new syscall, fsinfo(), to query information about a
     filesystem.  The idea being that this will, in future, work with the
     fd from fsopen() too and permit querying of the parameters and
     metadata before fsmount() is called.

 ver #7:

 (*) Undo an incorrect MS_* -> SB_* conversion.

 (*) Pass the mount data buffer size to all the mount-related functions that
     take the data pointer.  This fixes a problem where someone (say SELinux)
     tries to copy the mount data, assuming it to be a page in size, and
     overruns the buffer - thereby incurring an oops by hitting a guard page.

 (*) Made the AFS filesystem use them as an example.  This is a much easier to
     deal with than with NFS or Ext4 as there are very few mount options.

 ver #6:

 (*) Dropped the supplementary error string facility for the moment.

 (*) Dropped the NFS patches for the moment.

 (*) Dropped the reserved file descriptor argument from fsopen() and
     replaced it with three reserved pointers that must be NULL.

 ver #5:

 (*) Renamed sb_config -> fs_context and adjusted variable names.

 (*) Differentiated the flags in sb->s_flags (now named SB_*) from those
     passed to mount(2) (named MS_*).

 (*) Renamed __vfs_new_fs_context() to vfs_new_fs_context() and made the
     caller always provide a struct file_system_type pointer and the
     parameters required.

 (*) Got rid of vfs_submount_fc() in favour of passing
     FS_CONTEXT_FOR_SUBMOUNT to vfs_new_fs_context().  The purpose is now
     used more.

 (*) Call ->validate() on the remount path.

 (*) Got rid of the inode locking in sys_fsmount().

 (*) Call security_sb_mountpoint() in the mount(2) path.

 ver #4:

 (*) Split the sb_config patch up somewhat.

 (*) Made the supplementary error string facility something attached to the
     task_struct rather than the sb_config so that error messages can be
     obtained from NFS doing a mount-root-and-pathwalk inside the
     nfs_get_tree() operation.

     Further, made this managed and read by prctl rather than through the
     mount fd so that it's more generally available.

 ver #3:

 (*) Rebased on 4.12-rc1.

 (*) Split the NFS patch up somewhat.

 ver #2:

 (*) Removed the ->fill_super() from sb_config_operations and passed it in
     directly to functions that want to call it.  NFS now calls
     nfs_fill_super() directly rather than jumping through a pointer to it
     since there's only the one option at the moment.

 (*) Removed ->mnt_ns and ->sb from sb_config and moved ->pid_ns into
     proc_sb_config.

 (*) Renamed create_super -> get_tree.

 (*) Renamed struct mount_context to struct sb_config and amended various
     variable names.

 (*) sys_fsmount() acquired AT_* flags and MS_* flags (for MNT_* flags)
     arguments.

 ver #1:

 (*) Split the sb_config stuff out into its own header.

 (*) Support non-context aware filesystems through a special set of
     sb_config operations.

 (*) Stored the created superblock and root dentry into the sb_config after
     creation rather than directly into a vfsmount.  This allows some
     arguments to be removed to various NFS functions.

 (*) Added an explicit superblock-creation step.  This allows a created
     superblock to then be mounted multiple times.

 (*) Added a flag to say that the sb_config is degraded and cannot have
     another go at having a superblock creation whilst getting rid of the
     one that says it's already mounted.

=========================================================================

The following changes since commit 11da3a7f84f19c26da6f86af878298694ede0804:

  Linux 4.19-rc3 (2018-09-09 17:26:43 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git work.mount

for you to fetch changes up to 2dcc1f3b7dcb58e6108b5a45a9dcccd6ab5fec19:

  vfs: Fix error handling in do_remount() (2018-10-30 15:58:06 -0400)

----------------------------------------------------------------
Al Viro (1):
      vfs: syscall: Add open_tree(2) to reference or clone a mount

David Howells (40):
      vfs: Require specification of size of mount data for internal mounts
      vfs: syscall: Add move_mount(2) to move mounts around
      teach move_mount(2) to work with OPEN_TREE_CLONE
      vfs: Suppress MS_* flag defs within the kernel unless explicitly enabled
      vfs: Introduce the basic header for the new mount API's filesystem context
      vfs: Introduce logging functions
      vfs: Add configuration parser helpers
      vfs: Add LSM hooks for the new mount API
      vfs: Put security flags into the fs_context struct
      selinux: Implement the new mount API LSM hooks
      smack: Implement filesystem context security hooks
      apparmor: Implement security hooks for the new mount API
      tomoyo: Implement security hooks for the new mount API
      vfs: Separate changing mount flags full remount
      vfs: Implement a filesystem superblock creation/configuration context
      vfs: Remove unused code after filesystem context changes
      procfs: Move proc_fill_super() to fs/proc/root.c
      proc: Add fs_context support to procfs
      ipc: Convert mqueue fs to fs_context
      cpuset: Use fs_context
      kernfs, sysfs, cgroup, intel_rdt: Support fs_context
      hugetlbfs: Convert to fs_context
      vfs: Remove kern_mount_data()
      vfs: Provide documentation for new mount API
      Make anon_inodes unconditional
      vfs: syscall: Add fsopen() to prepare for superblock creation
      vfs: Implement logging through fs_context
      vfs: Add some logging to the core users of the fs_context log
      vfs: syscall: Add fsconfig() for configuring and managing a context
      vfs: syscall: Add fsmount() to create a mount for a superblock
      vfs: syscall: Add fspick() to select a superblock for reconfiguration
      afs: Add fs_context support
      afs: Use fs_context to pass parameters over automount
      vfs: Add a sample program for the new mount API
      vfs: syscall: Add fsinfo() to query filesystem information
      afs: Add fsinfo support
      vfs: Allow fsinfo() to query what's in an fs_context
      vfs: Allow fsinfo() to be used to query an fs parameter description
      vfs: Implement parameter value retrieval with fsinfo()
      vfs: Fix error handling in do_remount()

 Documentation/filesystems/mount_api.txt   | 741 +++++++++++++++++++++++
 arch/arc/kernel/setup.c                   |   1 +
 arch/arm/kernel/atags_parse.c             |   1 +
 arch/arm/kvm/Kconfig                      |   1 -
 arch/arm64/kvm/Kconfig                    |   1 -
 arch/ia64/kernel/perfmon.c                |   3 +-
 arch/mips/kvm/Kconfig                     |   1 -
 arch/powerpc/kvm/Kconfig                  |   1 -
 arch/powerpc/platforms/cell/spufs/inode.c |   6 +-
 arch/s390/hypfs/inode.c                   |   7 +-
 arch/s390/kvm/Kconfig                     |   1 -
 arch/sh/kernel/setup.c                    |   1 +
 arch/sparc/kernel/setup_32.c              |   1 +
 arch/sparc/kernel/setup_64.c              |   1 +
 arch/x86/Kconfig                          |   1 -
 arch/x86/entry/syscalls/syscall_32.tbl    |   7 +
 arch/x86/entry/syscalls/syscall_64.tbl    |   7 +
 arch/x86/kernel/cpu/intel_rdt.h           |  15 +
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c  | 183 +++---
 arch/x86/kernel/setup.c                   |   1 +
 arch/x86/kvm/Kconfig                      |   1 -
 drivers/base/Kconfig                      |   1 -
 drivers/base/devtmpfs.c                   |   7 +-
 drivers/char/tpm/Kconfig                  |   1 -
 drivers/dax/super.c                       |   2 +-
 drivers/dma-buf/Kconfig                   |   1 -
 drivers/gpio/Kconfig                      |   1 -
 drivers/gpu/drm/drm_drv.c                 |   3 +-
 drivers/gpu/drm/i915/i915_gemfs.c         |   2 +-
 drivers/iio/Kconfig                       |   1 -
 drivers/infiniband/Kconfig                |   1 -
 drivers/infiniband/hw/qib/qib_fs.c        |   7 +-
 drivers/misc/cxl/api.c                    |   3 +-
 drivers/misc/ibmasm/ibmasmfs.c            |  11 +-
 drivers/mtd/mtdsuper.c                    |  26 +-
 drivers/oprofile/oprofilefs.c             |   8 +-
 drivers/scsi/cxlflash/ocxl_hw.c           |   2 +-
 drivers/staging/erofs/super.c             |  13 +-
 drivers/usb/gadget/function/f_fs.c        |   7 +-
 drivers/usb/gadget/legacy/inode.c         |   7 +-
 drivers/vfio/Kconfig                      |   1 -
 drivers/virtio/virtio_balloon.c           |   2 +-
 drivers/xen/xenfs/super.c                 |   7 +-
 fs/9p/vfs_super.c                         |   2 +-
 fs/Kconfig                                |   7 +
 fs/Makefile                               |   5 +-
 fs/adfs/super.c                           |   9 +-
 fs/affs/super.c                           |  13 +-
 fs/afs/internal.h                         |  10 +-
 fs/afs/mntpt.c                            | 147 ++---
 fs/afs/super.c                            | 634 +++++++++++++-------
 fs/afs/volume.c                           |   4 +-
 fs/aio.c                                  |   3 +-
 fs/anon_inodes.c                          |   3 +-
 fs/autofs/autofs_i.h                      |   2 +-
 fs/autofs/init.c                          |   4 +-
 fs/autofs/inode.c                         |   3 +-
 fs/befs/linuxvfs.c                        |  11 +-
 fs/bfs/inode.c                            |   8 +-
 fs/binfmt_misc.c                          |   7 +-
 fs/block_dev.c                            |   2 +-
 fs/btrfs/super.c                          |  30 +-
 fs/btrfs/tests/btrfs-tests.c              |   2 +-
 fs/ceph/super.c                           |   3 +-
 fs/cifs/cifs_dfs_ref.c                    |   3 +-
 fs/cifs/cifsfs.c                          |  18 +-
 fs/coda/inode.c                           |  11 +-
 fs/configfs/mount.c                       |   7 +-
 fs/cramfs/inode.c                         |  17 +-
 fs/debugfs/inode.c                        |  14 +-
 fs/devpts/inode.c                         |  10 +-
 fs/ecryptfs/main.c                        |   2 +-
 fs/efivarfs/super.c                       |   9 +-
 fs/efs/super.c                            |  14 +-
 fs/exofs/super.c                          |   7 +-
 fs/ext2/super.c                           |  14 +-
 fs/ext4/super.c                           |  16 +-
 fs/f2fs/super.c                           |  13 +-
 fs/fat/inode.c                            |   3 +-
 fs/fat/namei_msdos.c                      |   8 +-
 fs/fat/namei_vfat.c                       |   8 +-
 fs/file_table.c                           |   9 +-
 fs/filesystems.c                          |   4 +
 fs/freevxfs/vxfs_super.c                  |  12 +-
 fs/fs_context.c                           | 776 ++++++++++++++++++++++++
 fs/fs_parser.c                            | 555 +++++++++++++++++
 fs/fsopen.c                               | 568 ++++++++++++++++++
 fs/fuse/control.c                         |   9 +-
 fs/fuse/inode.c                           |  16 +-
 fs/gfs2/ops_fstype.c                      |   6 +-
 fs/gfs2/super.c                           |   4 +-
 fs/hfs/super.c                            |  12 +-
 fs/hfsplus/super.c                        |  12 +-
 fs/hostfs/hostfs_kern.c                   |   7 +-
 fs/hpfs/super.c                           |  11 +-
 fs/hugetlbfs/inode.c                      | 454 +++++++++-----
 fs/internal.h                             |  19 +-
 fs/isofs/inode.c                          |  11 +-
 fs/jffs2/super.c                          |  10 +-
 fs/jfs/super.c                            |  11 +-
 fs/kernfs/mount.c                         | 103 ++--
 fs/libfs.c                                |  20 +-
 fs/minix/inode.c                          |  14 +-
 fs/namei.c                                |   4 +-
 fs/namespace.c                            | 952 +++++++++++++++++++++++-------
 fs/nfs/internal.h                         |   4 +-
 fs/nfs/namespace.c                        |   3 +-
 fs/nfs/nfs4namespace.c                    |   3 +-
 fs/nfs/nfs4super.c                        |  27 +-
 fs/nfs/super.c                            |  22 +-
 fs/nfsd/nfsctl.c                          |   8 +-
 fs/nilfs2/super.c                         |  10 +-
 fs/notify/fanotify/Kconfig                |   1 -
 fs/notify/inotify/Kconfig                 |   1 -
 fs/nsfs.c                                 |   3 +-
 fs/ntfs/super.c                           |  13 +-
 fs/ocfs2/dlmfs/dlmfs.c                    |   5 +-
 fs/ocfs2/super.c                          |  14 +-
 fs/omfs/inode.c                           |   9 +-
 fs/openpromfs/inode.c                     |  11 +-
 fs/orangefs/orangefs-kernel.h             |   2 +-
 fs/orangefs/super.c                       |   5 +-
 fs/overlayfs/super.c                      |  11 +-
 fs/pipe.c                                 |   3 +-
 fs/pnode.c                                |   1 +
 fs/proc/inode.c                           |  49 +-
 fs/proc/internal.h                        |   5 +-
 fs/proc/root.c                            | 253 ++++++--
 fs/pstore/inode.c                         |  10 +-
 fs/qnx4/inode.c                           |  14 +-
 fs/qnx6/inode.c                           |  14 +-
 fs/ramfs/inode.c                          |   6 +-
 fs/reiserfs/super.c                       |  14 +-
 fs/romfs/super.c                          |  13 +-
 fs/squashfs/super.c                       |  12 +-
 fs/statfs.c                               | 587 ++++++++++++++++++
 fs/super.c                                | 486 +++++++++++----
 fs/sysfs/mount.c                          |  67 ++-
 fs/sysv/inode.c                           |   3 +-
 fs/sysv/super.c                           |  16 +-
 fs/tracefs/inode.c                        |  10 +-
 fs/ubifs/super.c                          |   5 +-
 fs/udf/super.c                            |  16 +-
 fs/ufs/super.c                            |  11 +-
 fs/xfs/xfs_super.c                        |  10 +-
 include/linux/cgroup.h                    |   3 +-
 include/linux/debugfs.h                   |   8 +-
 include/linux/errno.h                     |   1 +
 include/linux/fs.h                        |  47 +-
 include/linux/fs_context.h                | 215 +++++++
 include/linux/fs_parser.h                 | 119 ++++
 include/linux/fsinfo.h                    |  41 ++
 include/linux/kernfs.h                    |  43 +-
 include/linux/lsm_hooks.h                 |  84 ++-
 include/linux/module.h                    |   6 +
 include/linux/mount.h                     |  10 +-
 include/linux/mtd/super.h                 |   4 +-
 include/linux/ramfs.h                     |   4 +-
 include/linux/security.h                  |  70 ++-
 include/linux/shmem_fs.h                  |   3 +-
 include/linux/syscalls.h                  |  13 +
 include/uapi/linux/fcntl.h                |   2 +
 include/uapi/linux/fs.h                   |  56 +-
 include/uapi/linux/fsinfo.h               | 303 ++++++++++
 include/uapi/linux/mount.h                | 120 ++++
 init/Kconfig                              |  10 -
 init/do_mounts.c                          |   5 +-
 init/do_mounts_initrd.c                   |   1 +
 ipc/mqueue.c                              | 106 +++-
 ipc/namespace.c                           |   2 +-
 kernel/bpf/inode.c                        |   7 +-
 kernel/cgroup/cgroup-internal.h           |  50 +-
 kernel/cgroup/cgroup-v1.c                 | 413 ++++++++-----
 kernel/cgroup/cgroup.c                    | 291 ++++++---
 kernel/cgroup/cpuset.c                    |  85 ++-
 kernel/trace/trace.c                      |   7 +-
 mm/shmem.c                                |  10 +-
 mm/zsmalloc.c                             |   3 +-
 net/socket.c                              |   3 +-
 net/sunrpc/rpc_pipe.c                     |   7 +-
 samples/Kconfig                           |   9 +-
 samples/Makefile                          |   2 +-
 samples/statx/Makefile                    |   7 -
 samples/vfs/Makefile                      |  16 +
 samples/vfs/test-fs-query.c               | 145 +++++
 samples/vfs/test-fsinfo.c                 | 593 +++++++++++++++++++
 samples/vfs/test-fsmount.c                | 133 +++++
 samples/{statx => vfs}/test-statx.c       |   7 +-
 security/apparmor/apparmorfs.c            |   8 +-
 security/apparmor/include/mount.h         |  11 +-
 security/apparmor/lsm.c                   | 111 +++-
 security/apparmor/mount.c                 |  47 ++
 security/inode.c                          |   7 +-
 security/security.c                       |  64 +-
 security/selinux/hooks.c                  | 388 ++++++++----
 security/selinux/include/security.h       |  16 +-
 security/selinux/selinuxfs.c              |   8 +-
 security/smack/smack.h                    |  21 +-
 security/smack/smack_lsm.c                | 367 ++++++++++--
 security/smack/smackfs.c                  |   9 +-
 security/tomoyo/common.h                  |   3 +
 security/tomoyo/mount.c                   |  46 ++
 security/tomoyo/tomoyo.c                  |  19 +-
 203 files changed, 9699 insertions(+), 2025 deletions(-)
 create mode 100644 Documentation/filesystems/mount_api.txt
 create mode 100644 fs/fs_context.c
 create mode 100644 fs/fs_parser.c
 create mode 100644 fs/fsopen.c
 create mode 100644 include/linux/fs_context.h
 create mode 100644 include/linux/fs_parser.h
 create mode 100644 include/linux/fsinfo.h
 create mode 100644 include/uapi/linux/fsinfo.h
 create mode 100644 include/uapi/linux/mount.h
 delete mode 100644 samples/statx/Makefile
 create mode 100644 samples/vfs/Makefile
 create mode 100644 samples/vfs/test-fs-query.c
 create mode 100644 samples/vfs/test-fsinfo.c
 create mode 100644 samples/vfs/test-fsmount.c
 rename samples/{statx => vfs}/test-statx.c (98%)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [git pull] mount API series
  2018-10-31  5:33 [git pull] mount API series Al Viro
@ 2018-10-31 15:38 ` Eric W. Biederman
  2018-10-31 16:18   ` Eric W. Biederman
                     ` (2 more replies)
  2018-10-31 16:18 ` Linus Torvalds
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 25+ messages in thread
From: Eric W. Biederman @ 2018-10-31 15:38 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel, linux-kernel

Al Viro <viro@ZenIV.linux.org.uk> writes:

> 	mount API series from David Howells.  Last cycle's objections
> had been of the "I'd do it differently" variety and with no such
> differently done variants having ever materialized over several
> cycles...

Absolutely not.

My objections fundamentally is that I can find real problems when I look
at the code.  Further the changes have not been incremental changes that
have evolved the code from one state to another but complete
replacements of code that make code review very difficult and bisection
completely inapplicable.

I also object that this series completely fails to fix the worst but I
have ever seen in the mount API.  Whit no real intrest shown in working
to fix it.

A couple of bugs that I can see quickly.  Several of which I have
previously reported:

- There is an easily triggered NULL pointer deference with open_tree
  and mount propagation.

- Bisection will not work with the cpuset filesystem patch.  At least
  cpuset looks like it may be mountable now.

- The setting of fc->user_ns on proc remains broken.  In particular if you
  create a child user namespace and attempt to mount proc it will succeed
  instead of fail.

- The mqueue filesystem has the same issue as proc.  fc->user_ns is misset.

I suspect I didn't report it well but I reported both the proc and the
mqueue filesystem weeks before the merge window opened.


I am going to stop there.  I believe there are more issues in the code.
I am relieved that I am not seeing the loss of some of the security
hooks that I thought I saw last time I looked at the code.


Given both that I have reported bugs that remain unfixed and it's
non-evolutionary nature that makes this patchset hard to review I have
no confidence in this set of patches.

I think the scope has been too large for anyone to properly review the
code and it should be broken into smaller pieces.  That there were
significant open_tree and move_mount issues being found just before the
merge window opened seems a testament to that.  (AKA the first 3 or so
patches in the patchset and fundamental to the rest).


My apologies that we have not been able communication better and that I
have not been able to clearly point out all of the bugs.   My apologies
I have not yet been able to give more constructive feedback.

Still at the end of the day there remain regressions of existing
functionality in that tree.

Eric

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [git pull] mount API series
  2018-10-31  5:33 [git pull] mount API series Al Viro
  2018-10-31 15:38 ` Eric W. Biederman
@ 2018-10-31 16:18 ` Linus Torvalds
  2018-11-01 10:53   ` Steven Whitehouse
  2018-10-31 18:39 ` David Howells
  2018-10-31 18:45 ` [PATCH] vfs: Fix incorrect user_ns assignment in proc and mqueue David Howells
  3 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2018-10-31 16:18 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linux Kernel Mailing List

On Tue, Oct 30, 2018 at 10:34 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>         mount API series from David Howells.  Last cycle's objections
> had been of the "I'd do it differently" variety and with no such
> differently done variants having ever materialized over several
> cycles...

Having just lurked on the discussions about the bugs in here, I don't
think this is ready. Maybe they got fixed, but if so, it was recent
and it was pretty fundamental.

The stated aim of the series is to make the mount API _better_, not
worse. And right now it looks like a "two steps back, one theoretical
step forwards" kind of better.

              Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [git pull] mount API series
  2018-10-31 15:38 ` Eric W. Biederman
@ 2018-10-31 16:18   ` Eric W. Biederman
  2018-10-31 16:36   ` Al Viro
  2018-11-10 14:19   ` Steven Whitehouse
  2 siblings, 0 replies; 25+ messages in thread
From: Eric W. Biederman @ 2018-10-31 16:18 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel, linux-kernel

ebiederm@xmission.com (Eric W. Biederman) writes:

> I am going to stop there.  I believe there are more issues in the code.
> I am relieved that I am not seeing the loss of some of the security
> hooks that I thought I saw last time I looked at the code.

Bah.  Now I see the missing security hook.

There are a set of security hooks that allow security modules to parse
mount options.

On a good day they look like:

	security_mnt_opts opts;
        char *secdata;
        
	secdata = alloc_secdata();
        security_sb_copy_data("a,mount,options,string", secdata);

	security_init_mnt_opts(&opts);
	security_parse_opts_str(secdata, &opts);
        security_set_mnt_opts(sb, &opts, 0, NULL);
	security_free_mnt_opts(&opts);

In practice however things are not that explicit.  With
security_sb_kern_mount performing all of the mnt_opts work.

However after the rewrite in the patchset.

The function sb_kern_mount no longer exists and it's replacement
sb_get_tree out of necessity does not call parse_opts_str.  This is
because the mount options can no longer be passed as a string.

The legacy compatibility code also does not call sb_parse_opts_str.

The result is using the existing apis all of the security module command
line parsing except for (btrfs and nfs) no longer works.


The changes are not structured in a way that makes any of this easy to
find.  Which is why I have been saying I wouldn't do it that way.  It
also is the case that this pattern repeats through out the patches.
Replacing code with something brand new, instead of evolving what is
there.  That makes it easy for this kind of thing to slip through.

Eric

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [git pull] mount API series
  2018-10-31 15:38 ` Eric W. Biederman
  2018-10-31 16:18   ` Eric W. Biederman
@ 2018-10-31 16:36   ` Al Viro
  2018-11-01 16:51     ` Al Viro
  2018-11-10 14:19   ` Steven Whitehouse
  2 siblings, 1 reply; 25+ messages in thread
From: Al Viro @ 2018-10-31 16:36 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linus Torvalds, linux-fsdevel, linux-kernel

On Wed, Oct 31, 2018 at 10:38:17AM -0500, Eric W. Biederman wrote:
> A couple of bugs that I can see quickly.  Several of which I have
> previously reported:
> 
> - There is an easily triggered NULL pointer deference with open_tree
>   and mount propagation.

What the hell?  If the fixes that went in do not handle something,
especially if you have testcases, where the fuck have you been and
where _are_ those testcases, while we are at it?

Eric, this is bloody ridiculous - "I have an easily triggered NULL pointer
dereference in..., here it is" is hard to miscommunicate.  Sure, any such
needs to be fixed.  For crying out loud, that thing has not been hidden -
it sat in -next, it's been reposted several times...

I'm glad that you can see bugs.  How about _showing_ them?  Because I don't
see anything that would prevent this reply of yours from being posted
pretty much verbatim cycle after cycle, no matter what gets done.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [git pull] mount API series
  2018-10-31  5:33 [git pull] mount API series Al Viro
  2018-10-31 15:38 ` Eric W. Biederman
  2018-10-31 16:18 ` Linus Torvalds
@ 2018-10-31 18:39 ` David Howells
  2018-10-31 20:49   ` Miklos Szeredi
  2018-10-31 18:45 ` [PATCH] vfs: Fix incorrect user_ns assignment in proc and mqueue David Howells
  3 siblings, 1 reply; 25+ messages in thread
From: David Howells @ 2018-10-31 18:39 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: dhowells, Al Viro, Linus Torvalds, linux-fsdevel, linux-kernel

> My objections fundamentally is that I can find real problems when I look
> at the code.

Eric.

You have repeatedly stated that there are "thinkos, typos and bugs" in the
code, but you have not been very forthcoming in actually disclosing *what*
those things are.

You had a go at rewriting it for yourself, with different philosophical aims,
and at one point, you posted a partially reworked git branch my way.  However,
you changed the patches in various ways that rendered it incomparable with
mine, making it very hard to actually separate bug fixes from your other
changes.  No useful notes were given.

> Further the changes have not been incremental changes that have evolved the
> code from one state to another but complete replacements of code that make
> code review very difficult and bisection completely inapplicable.

... therefore I find this a little hypocritical.

Further, your rewrite breaks NFS and other things, such as mis-renumbering the
tokens in SELinux.  Splitting the lookup-and-create operation into separate
lookup-no-create and create-or-fail won't work for NFS.  Despite your
protestation, this *has* been discussed publically - and you even joined in
the discussion.

> A couple of bugs that I can see quickly.  Several of which I have
> previously reported:
> 
> - There is an easily triggered NULL pointer deference with open_tree
>   and mount propagation.

As far as I know, these have all been cleared up.  I think that I've cleared
up all the bugs that Alan Jenkins found (many thanks to him for that!).

> - Bisection will not work with the cpuset filesystem patch.

It won't?  I don't recall this been mentioned before (but I may have missed
it).

>   At least cpuset looks like it may be mountable now.

It does mount now.  Thanks for pointing that out.

> - The setting of fc->user_ns on proc remains broken.  In particular if you
>   create a child user namespace and attempt to mount proc it will succeed
>   instead of fail.

You did mention this before - but you haven't been clear on how to fix it.  I
tried to work this out from your branch, but it's mixed in with loads of other
changes.

> - The mqueue filesystem has the same issue as proc.  fc->user_ns is misset.

I have a patch that I think should fix both of these.  I'll follow up this
message with it, if you could check it please.

David

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] vfs: Fix incorrect user_ns assignment in proc and mqueue
  2018-10-31  5:33 [git pull] mount API series Al Viro
                   ` (2 preceding siblings ...)
  2018-10-31 18:39 ` David Howells
@ 2018-10-31 18:45 ` David Howells
  3 siblings, 0 replies; 25+ messages in thread
From: David Howells @ 2018-10-31 18:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: dhowells, Al Viro, Linus Torvalds, linux-fsdevel, linux-kernel,
	Alexey Dobriyan

    
The user namespace set on a proc superblock should derive from the pid_ns
that the superblock is associated with and, similarly, an mqueue superblock
should derive from the ipc_ns that that is associated with.

Fix both of these to set the proposed user_ns appropriately in the
respective get_tree() functions.

Fixes: a593f22857a3 ("proc: Add fs_context support to procfs")
Fixes: ec772aa43dc7 ("ipc: Convert mqueue fs to fs_context")
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Alexey Dobriyan <adobriyan@gmail.com>
---
 fs/proc/root.c |    2 ++
 ipc/mqueue.c   |    2 ++
 2 files changed, 4 insertions(+)

diff --git a/fs/proc/root.c b/fs/proc/root.c
index b0627e622850..3033a900d421 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -162,6 +162,8 @@ static int proc_get_tree(struct fs_context *fc)
 {
 	struct proc_fs_context *ctx = fc->fs_private;
 
+	put_user_ns(fc->user_ns);
+	fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
 	fc->s_fs_info = ctx->pid_ns;
 	return vfs_get_super(fc, vfs_get_keyed_super, proc_fill_super);
 }
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 869687d586a2..9e793c02f350 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -354,6 +354,8 @@ static int mqueue_get_tree(struct fs_context *fc)
 {
 	struct mqueue_fs_context *ctx = fc->fs_private;
 
+	put_user_ns(fc->user_ns);
+	fc->user_ns = get_user_ns(ctx->ipc_ns->user_ns);
 	fc->s_fs_info = ctx->ipc_ns;
 	return vfs_get_super(fc, vfs_get_keyed_super, mqueue_fill_super);
 }

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [git pull] mount API series
  2018-10-31 18:39 ` David Howells
@ 2018-10-31 20:49   ` Miklos Szeredi
  0 siblings, 0 replies; 25+ messages in thread
From: Miklos Szeredi @ 2018-10-31 20:49 UTC (permalink / raw)
  To: David Howells
  Cc: Eric W. Biederman, Al Viro, Linus Torvalds, linux-fsdevel, linux-kernel

On Wed, Oct 31, 2018 at 7:39 PM, David Howells <dhowells@redhat.com> wrote:
>> My objections fundamentally is that I can find real problems when I look
>> at the code.

I think the big risk with such a change is not that there are bugs,
but that we get the API wrong, and have to keep supporting a broken
API forever.

Were any of your objections of that type?

There's the argument about sharing of super blocks doing weird things,
but we can't get rid of that one due to having to support the old
mount(2) API forever anyway.  That's not a new burden, and the new API
allows fixing this incrementally by adding better models later.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [git pull] mount API series
  2018-10-31 16:18 ` Linus Torvalds
@ 2018-11-01 10:53   ` Steven Whitehouse
  2018-11-01 15:57     ` Linus Torvalds
  2018-11-01 17:18     ` David Howells
  0 siblings, 2 replies; 25+ messages in thread
From: Steven Whitehouse @ 2018-11-01 10:53 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro, Eric W. Biederman
  Cc: linux-fsdevel, Linux Kernel Mailing List

Hi,


On 31/10/18 16:18, Linus Torvalds wrote:
> On Tue, Oct 30, 2018 at 10:34 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>>          mount API series from David Howells.  Last cycle's objections
>> had been of the "I'd do it differently" variety and with no such
>> differently done variants having ever materialized over several
>> cycles...
> Having just lurked on the discussions about the bugs in here, I don't
> think this is ready. Maybe they got fixed, but if so, it was recent
> and it was pretty fundamental.
>
> The stated aim of the series is to make the mount API _better_, not
> worse. And right now it looks like a "two steps back, one theoretical
> step forwards" kind of better.
>
>                Linus

The design of the new mount API has been under discussion for some time, 
both on the mailing lists, and also at LSF/MM too. Al and David (and 
others) have put a lot of work into getting to the current position, and 
have specifically requested input from Eric about his concerns over past 
cycles.

When I look at the discussions I'm seeing two main issues (please 
correct me if you think I'm wrong about this) which are (a) whether the 
design is correct and (b) whether there are still bugs in the current 
patch set.

Which of these are you most concerned about? It seems to me that there 
is not a lot of point in spending a large amount of time in additional 
review/testing of the current patch set if the overall design is set to 
be rejected. If your concerns are only with the robustness/stability of 
the patch set, then that provides a clear route to resolving the current 
impasse, at least assuming that Eric is able to enumerate the issues 
that he has discovered.

It looks like David has already provided a fix for one of the issues 
which Eric mentioned in his recent email. Eric it would be good if you 
could confirm that this addresses that particular concern,

Steve.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [git pull] mount API series
  2018-11-01 10:53   ` Steven Whitehouse
@ 2018-11-01 15:57     ` Linus Torvalds
  2018-11-01 17:18     ` David Howells
  1 sibling, 0 replies; 25+ messages in thread
From: Linus Torvalds @ 2018-11-01 15:57 UTC (permalink / raw)
  To: swhiteho; +Cc: Al Viro, ebiederm, linux-fsdevel, Linux Kernel Mailing List

On Thu, Nov 1, 2018 at 3:53 AM Steven Whitehouse <swhiteho@redhat.com> wrote:
>
> When I look at the discussions I'm seeing two main issues (please
> correct me if you think I'm wrong about this) which are (a) whether the
> design is correct and (b) whether there are still bugs in the current
> patch set.
>
> Which of these are you most concerned about?

I'm most worried about bugs _due_ to the new design.

Exactly because it splits up what used to be an atomic sequence, I
worry about the intermediate states having issues (the refcounting
things we've already seen, for example), but I also worry about the
fact that it completely changes the model, and that that makes things
like security hooks fundamentally different.

The latter may not be a "bug" in the sense that it's all intentional,
but it does mean that I see *one* mount-time security hook now having
been replaced by *five* security hooks.

And that's ignoring the alloc/dup/free ones.

As far as I can tell, the patch-series simply added the hooks. It made
no attempt at making sure that previous hooks had sane semantics. Do
they?  So now a system that has an old mount hook can be bypassed by
simplky using the new model instead.

I dunno. The patches are illegible in this regard (and I don't blame
the fsmount ones, I blame the security subsystem that just is full of
random indirection to random sub-security systems, which in turn just
have hash lookups for data structures set up by other operations
entirely).

Eric was pointing out bugs as late as the weekend before the merge
window opened. That, to me, does not say "ready for the merge window".

                   Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [git pull] mount API series
  2018-10-31 16:36   ` Al Viro
@ 2018-11-01 16:51     ` Al Viro
  0 siblings, 0 replies; 25+ messages in thread
From: Al Viro @ 2018-11-01 16:51 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linus Torvalds, linux-fsdevel, linux-kernel

On Wed, Oct 31, 2018 at 04:36:01PM +0000, Al Viro wrote:
> On Wed, Oct 31, 2018 at 10:38:17AM -0500, Eric W. Biederman wrote:
> > A couple of bugs that I can see quickly.  Several of which I have
> > previously reported:
> > 
> > - There is an easily triggered NULL pointer deference with open_tree
> >   and mount propagation.
> 
> What the hell?  If the fixes that went in do not handle something,
> especially if you have testcases, where the fuck have you been and
> where _are_ those testcases, while we are at it?
> 
> Eric, this is bloody ridiculous - "I have an easily triggered NULL pointer
> dereference in..., here it is" is hard to miscommunicate.  Sure, any such
> needs to be fixed.  For crying out loud, that thing has not been hidden -
> it sat in -next, it's been reposted several times...

Again, would you mind telling what exactly does the above refer to and whether
it is still true?  I'm not asking to put details in every time you mention
something of that sort, but generally one is expected to come up with those
on demand, especially if the bug _is_ easily triggered.

You, IIRC, had been Cc'd on the threads where open_tree breakage was
dealt with.  I can certainly believe that there might be something else
in that area (or any other, for that matter).  I *am* interested in finding
and fixing that and I would rather appreciate the details of what you
are seeing.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [git pull] mount API series
  2018-11-01 10:53   ` Steven Whitehouse
  2018-11-01 15:57     ` Linus Torvalds
@ 2018-11-01 17:18     ` David Howells
  2018-11-01 18:33       ` Linus Torvalds
  2018-11-01 23:59       ` David Howells
  1 sibling, 2 replies; 25+ messages in thread
From: David Howells @ 2018-11-01 17:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, swhiteho, John Johansen, Alan Jenkins, Al Viro,
	ebiederm, linux-fsdevel, Linux Kernel Mailing List

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Exactly because it splits up what used to be an atomic sequence,

No.  What we have upstream now is most definitely *not* atomic.  That said,
some of the steps of the sequence are atomic in their own right:

 - Creation and initialisation of the superblock
 - Creation of the vfsmount
 - Attachment of a new vfsmount

and these units have not changed.  What I have done is to extract the
parameter processing and move it to the front and provided a way to provide
more exotic parameters more easily.  This means that all parse errors are
found before we start creating objects and applying parameters (which is a bug
in some of our existing remount operations that this fixes).

Also, as the parameters are passed one at a time, I've also inserted a
validation step so that the parameter set as a whole can be checked for
inter-parameter consistency before any object creation is performed.

Note further that upstream, remount/reconfiguration is *not* atomic, but with
these changes that becomes closer to being realisable.  I've looked into how
to do it for btrfs using RCU and a CoW'd struct with the settings in, but
that's a much bigger, more intrusive change that doesn't need to be coupled to
the mount API changes.

> I worry about the intermediate states having issues (the refcounting things
> we've already seen, for example), but I also worry about the fact that it
> completely changes the model, and that that makes things like security hooks
> fundamentally different.

A lot of the recent bugs aren't from fsopen()/fsconfig()/fsmount() at all, but
rather from the open_tree() syscall and, to a lesser extent, the move_mount()
syscall.

Would you be willing to take just the *internal* fs_context changes and leave
the UAPI for the next window?  I have patches that make NFS and BTRFS use it,
which would then allow me to remove nearly 1000 lines of LSM code[1], but I
can't reasonably ask Trond and Chris to take them unless there's some schedule
to get the core dependencies in.

[1] See the "security: Remove the set_mnt_opts and clone_mnt_opts ops" commit
    here:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=btrfs-mount-api

> The latter may not be a "bug" in the sense that it's all intentional,
> but it does mean that I see *one* mount-time security hook now having
> been replaced by *five* security hooks.

That's not really accurate.  I'm adding:

 (1) security_fs_context_parse_param()
 (2) security_fs_context_validate()
 (3) security_sb_get_tree()
 (4) security_sb_reconfigure()
 (5) security_sb_mountpoint()
 (6) security_move_mount()

Hook (1) allows individual mount parameters to examined/removed and then (2)
allows the whole set to be checked for consistency.  This is just parameter
parsing and checking.

Then (3) allows super_block::security to be initialised and (4) allows an
LSM's sb parameters to be reconfigured by remount (or changes to be
prohibited).  These use the context set up by (1) and checked by (2).

Finally, (5) is used to check that a mountpoint may be used and (6) to see if
a mount may be moved.

Further, I'm actually removing more than one:

 (1) security_sb_kern_mount()
 (2) security_sb_remount()
 (3) security_sb_copy_data()
 (4) security_sb_set_mnt_opts()
 (5) security_sb_clone_mnt_opts()
 (6) security_sb_parse_opts_str()

But four of them are only applicable to NFS and BTRFS, and can only be removed
after those have been dealt with.

> And that's ignoring the alloc/dup/free ones.

Yes, for the record, I'm adding:

 (7) security_fs_context_alloc()
 (8) security_fs_context_dup()
 (9) security_fs_context_free()

to allow the LSM to store data in the fs_context.

> As far as I can tell, the patch-series simply added the hooks. It made
> no attempt at making sure that previous hooks had sane semantics.

Ummm...  I can't say that what's upstream has sane semantics - that's not
really the focus of these patches.  What I've tried to do is to retain the
behaviour where possible.

> Do they?  So now a system that has an old mount hook can be bypassed by
> simplky using the new model instead.

Not so easily.  The LSM is *still* being consulted and I think both in SELinux
and Smack the same rules will still be applied.

There's a case in AppArmor that requires a rejigging of the FSM compiler to
make it work, and I discussed this with John Johansen at the LSS in Edinburgh
last week, but we can simply disallow the use of the new API with AppArmor.

> I dunno. The patches are illegible in this regard (and I don't blame
> the fsmount ones, I blame the security subsystem that just is full of
> random indirection to random sub-security systems, which in turn just
> have hash lookups for data structures set up by other operations
> entirely).
> 
> Eric was pointing out bugs as late as the weekend before the merge
> window opened. That, to me, does not say "ready for the merge window".

Actually, Alan Jenkins has been mostly the one pointing out the bugs and I
think we've got them all.  Leastways, I haven't heard more from him.

FWIW, the patches have been in linux-next for about two cycles now.

David

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [git pull] mount API series
  2018-11-01 17:18     ` David Howells
@ 2018-11-01 18:33       ` Linus Torvalds
  2018-11-01 22:05         ` Al Viro
  2018-11-01 23:59       ` David Howells
  1 sibling, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2018-11-01 18:33 UTC (permalink / raw)
  To: dhowells
  Cc: swhiteho, john.johansen, alan.christopher.jenkins, Al Viro,
	ebiederm, linux-fsdevel, Linux Kernel Mailing List

On Thu, Nov 1, 2018 at 10:18 AM David Howells <dhowells@redhat.com> wrote:
>
> No.  What we have upstream now is most definitely *not* atomic.  That said,
> some of the steps of the sequence are atomic in their own right:

It's more that what we have traditionally is atomic wrt user space.
Use space does *one* indivisible operation. There's no way for
user-space to do a half-way mount and say "ok, that's it, I'll just
exit now" (or skip a phase and fool the kernel into handling a
half-way setup issue).

So the new model does require a bit more care in tear-down etc. I'm
not so worried about this, actually, but it *is* different.

Afaik, some of the bugs were exactly in this half-way state bits.

[ And yes, looking back at some the reports I found, it was Alan
Jenkins and mainly the open-tree thing. ]

> Would you be willing to take just the *internal* fs_context changes and leave
> the UAPI for the next window?

Hmm. I had to think about that, but the more I thought about it, the
more I liked it.

Yes. Depending on how that is done, that would make a lot of my
worries go away. *Particularly* if it then allows us to do the
per-filesystem conversion one by one.

So if the patch series can be split up into a prep-phase that doesn't
change any user-visible semantics (including the security side), but
that uses the fs_context internally and allows the filesystems to be
converted to the new world order, than that would make merging the
early work much easier (and then my worry about the later phases would
probably be much less too).

It would be _really_ nice to see that prep-phase be done in a way
where each individual patch very obviously doesn't change semantics.
If it's obvious, then I'm happy to consider that pure prep-work and
willing to merge it after the merge window in order to make the _next_
merge window easier.

Al - can I ask you to look at helping David with something like that?
You tend to be very good at generating those patch-series with
"obviously no changes" for the individual patches, but the end result
ends up being totally different from the starting point (I'm thinking
of all the locking and dentry refcounting series).

              Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [git pull] mount API series
  2018-11-01 18:33       ` Linus Torvalds
@ 2018-11-01 22:05         ` Al Viro
  2018-11-01 22:07           ` Linus Torvalds
  0 siblings, 1 reply; 25+ messages in thread
From: Al Viro @ 2018-11-01 22:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, swhiteho, john.johansen, alan.christopher.jenkins,
	ebiederm, linux-fsdevel, Linux Kernel Mailing List

On Thu, Nov 01, 2018 at 11:33:31AM -0700, Linus Torvalds wrote:

> Al - can I ask you to look at helping David with something like that?
> You tend to be very good at generating those patch-series with
> "obviously no changes" for the individual patches, but the end result
> ends up being totally different from the starting point (I'm thinking
> of all the locking and dentry refcounting series).

I'll try.  Before we go there, I'd like to get the rest of vfs.git off
my hands - AFS series and misc pile.  Will send pull requests shortly,
then - this stuff.  Do you mind if we end up with work.mount rebased?
The usual objections re testing in -next do not apply in this case,
AFAICS...

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [git pull] mount API series
  2018-11-01 22:05         ` Al Viro
@ 2018-11-01 22:07           ` Linus Torvalds
  0 siblings, 0 replies; 25+ messages in thread
From: Linus Torvalds @ 2018-11-01 22:07 UTC (permalink / raw)
  To: Al Viro
  Cc: dhowells, swhiteho, john.johansen, alan.christopher.jenkins,
	ebiederm, linux-fsdevel, Linux Kernel Mailing List

On Thu, Nov 1, 2018 at 3:05 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>  Do you mind if we end up with work.mount rebased?
> The usual objections re testing in -next do not apply in this case,
> AFAICS...

I was assuming that the work.mount branch would be entirely re-done, yes.

              Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [git pull] mount API series
  2018-11-01 17:18     ` David Howells
  2018-11-01 18:33       ` Linus Torvalds
@ 2018-11-01 23:59       ` David Howells
  2018-11-02  4:07         ` Al Viro
  1 sibling, 1 reply; 25+ messages in thread
From: David Howells @ 2018-11-01 23:59 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro
  Cc: dhowells, swhiteho, john.johansen, alan.christopher.jenkins,
	ebiederm, linux-fsdevel, Linux Kernel Mailing List

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So if the patch series can be split up into a prep-phase that doesn't
> change any user-visible semantics (including the security side), but
> that uses the fs_context internally and allows the filesystems to be
> converted to the new world order, than that would make merging the
> early work much easier (and then my worry about the later phases would
> probably be much less too).

As a first go, I've rebased the patches to v4.19 (which required no other
changes), folded in some small bugfixes (fix error handling in remount, fix
incorrect user_ns in proc and mqueue) and split the set up.

There are now three branches in my git tree:

 (*) mount-api-core.  These are the internal-only patches that add the
     fs_context, the legacy wrapper and the security hooks and make certain
     filesystems make use of it.

 (*) mount-api-uapi.  This is mount-api-core with the UAPI-visible patches
     stacked thereon.

 (*) mount-api.  This is the original patchset.

"git diff mount-api mount-api-uapi" shows no differences.

Note that the commit "vfs: Implement logging through fs_context" appears in
both sets.  I was just going to leave it as macros that just wrap pr_notice(),
but I think it might be wiser to pull it out of line (as will be required
later) and make it produce messages at different levels.

The git tree in question is at:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git

David

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [git pull] mount API series
  2018-11-01 23:59       ` David Howells
@ 2018-11-02  4:07         ` Al Viro
  2018-11-02 19:42           ` Al Viro
  2018-11-03  6:30           ` Gao Xiang
  0 siblings, 2 replies; 25+ messages in thread
From: Al Viro @ 2018-11-02  4:07 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Linus Torvalds, swhiteho, john.johansen,
	alan.christopher.jenkins, ebiederm, linux-fsdevel,
	Linux Kernel Mailing List

On Thu, Nov 01, 2018 at 11:59:23PM +0000, David Howells wrote:

>  (*) mount-api-core.  These are the internal-only patches that add the
>      fs_context, the legacy wrapper and the security hooks and make certain
>      filesystems make use of it.

FWIW, while rereading that series I'd spotted something very odd in erofs.
It's orthogonal to everything else, but just to make sure it doesn't get
lost:
	* sbi->dev_name thing in erofs is used only for debugging printks,
basically.  Just use sb->s_id[] and be done with that.
	* dump struct erofs_mount_private - you don't need dev_name in
your erofs_fill_super().  Just use mount_bdev() in usual fashion.
	* what the hell are you doing with ->s_root???  Why would you
possibly want it hashed and what kind of dcache lookup could find it?
That d_rehash() looks deeply confused; what are you trying to do there?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [git pull] mount API series
  2018-11-02  4:07         ` Al Viro
@ 2018-11-02 19:42           ` Al Viro
  2018-11-03  6:14             ` Gao Xiang
  2018-11-03  6:30           ` Gao Xiang
  1 sibling, 1 reply; 25+ messages in thread
From: Al Viro @ 2018-11-02 19:42 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Linus Torvalds, swhiteho, john.johansen,
	alan.christopher.jenkins, ebiederm, linux-fsdevel,
	Linux Kernel Mailing List

On Fri, Nov 02, 2018 at 04:07:01AM +0000, Al Viro wrote:
> On Thu, Nov 01, 2018 at 11:59:23PM +0000, David Howells wrote:
> 
> >  (*) mount-api-core.  These are the internal-only patches that add the
> >      fs_context, the legacy wrapper and the security hooks and make certain
> >      filesystems make use of it.
> 
> FWIW, while rereading that series I'd spotted something very odd in erofs.
> It's orthogonal to everything else, but just to make sure it doesn't get
> lost:
> 	* sbi->dev_name thing in erofs is used only for debugging printks,
> basically.  Just use sb->s_id[] and be done with that.
> 	* dump struct erofs_mount_private - you don't need dev_name in
> your erofs_fill_super().  Just use mount_bdev() in usual fashion.
> 	* what the hell are you doing with ->s_root???  Why would you
> possibly want it hashed and what kind of dcache lookup could find it?
> That d_rehash() looks deeply confused; what are you trying to do there?

... and while we are at it, what happens to
                unsigned int nameoff = le16_to_cpu(de[mid].nameoff);
                unsigned int matched = min(startprfx, endprfx);

                struct qstr dname = QSTR_INIT(data + nameoff,
                        unlikely(mid >= ndirents - 1) ?
                                maxsize - nameoff :
                                le16_to_cpu(de[mid + 1].nameoff) - nameoff);

                /* string comparison without already matched prefix */
                int ret = dirnamecmp(name, &dname, &matched);
if le16_to_cpu(de[...].nameoff) is not monotonically increasing?  I.e.
what's to prevent e.g. (unsigned)-1 ending up in dname.len?

Corrupted fs image shouldn't oops the kernel...

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [git pull] mount API series
  2018-11-02 19:42           ` Al Viro
@ 2018-11-03  6:14             ` Gao Xiang
  0 siblings, 0 replies; 25+ messages in thread
From: Gao Xiang @ 2018-11-03  6:14 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, swhiteho, john.johansen,
	alan.christopher.jenkins, ebiederm, linux-fsdevel,
	Linux Kernel Mailing List

Hi Al,

On 2018/11/3 3:42, Al Viro wrote:
> On Fri, Nov 02, 2018 at 04:07:01AM +0000, Al Viro wrote:
>> On Thu, Nov 01, 2018 at 11:59:23PM +0000, David Howells wrote:
>>
>>>  (*) mount-api-core.  These are the internal-only patches that add the
>>>      fs_context, the legacy wrapper and the security hooks and make certain
>>>      filesystems make use of it.
>>
>> FWIW, while rereading that series I'd spotted something very odd in erofs.
>> It's orthogonal to everything else, but just to make sure it doesn't get
>> lost:
>> 	* sbi->dev_name thing in erofs is used only for debugging printks,
>> basically.  Just use sb->s_id[] and be done with that.
>> 	* dump struct erofs_mount_private - you don't need dev_name in
>> your erofs_fill_super().  Just use mount_bdev() in usual fashion.
>> 	* what the hell are you doing with ->s_root???  Why would you
>> possibly want it hashed and what kind of dcache lookup could find it?
>> That d_rehash() looks deeply confused; what are you trying to do there?
> 
> ... and while we are at it, what happens to
>                 unsigned int nameoff = le16_to_cpu(de[mid].nameoff);
>                 unsigned int matched = min(startprfx, endprfx);
> 
>                 struct qstr dname = QSTR_INIT(data + nameoff,
>                         unlikely(mid >= ndirents - 1) ?
>                                 maxsize - nameoff :
>                                 le16_to_cpu(de[mid + 1].nameoff) - nameoff);
> 
>                 /* string comparison without already matched prefix */
>                 int ret = dirnamecmp(name, &dname, &matched);
> if le16_to_cpu(de[...].nameoff) is not monotonically increasing?  I.e.
> what's to prevent e.g. (unsigned)-1 ending up in dname.len?
> 
> Corrupted fs image shouldn't oops the kernel...

Yes, thanks for pointing out. :)
I will add more boundary check later before moving into fs/ directory...
erofs now is under dm-verity for our HUAWEI mobile phone, so it doesn't be corruptted.

I will add more checks and meta checksum later after EROFS productization successfully... :)

Thanks,
Gao Xiang
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [git pull] mount API series
  2018-11-02  4:07         ` Al Viro
  2018-11-02 19:42           ` Al Viro
@ 2018-11-03  6:30           ` Gao Xiang
  1 sibling, 0 replies; 25+ messages in thread
From: Gao Xiang @ 2018-11-03  6:30 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, swhiteho, john.johansen,
	alan.christopher.jenkins, ebiederm, linux-fsdevel,
	Linux Kernel Mailing List

Hi Al,

On 2018/11/2 12:07, Al Viro wrote:
> On Thu, Nov 01, 2018 at 11:59:23PM +0000, David Howells wrote:
> 
>>  (*) mount-api-core.  These are the internal-only patches that add the
>>      fs_context, the legacy wrapper and the security hooks and make certain
>>      filesystems make use of it.
> 
> FWIW, while rereading that series I'd spotted something very odd in erofs.
> It's orthogonal to everything else, but just to make sure it doesn't get
> lost:
> 	* sbi->dev_name thing in erofs is used only for debugging printks,
> basically.  Just use sb->s_id[] and be done with that.
> 	* dump struct erofs_mount_private - you don't need dev_name in
> your erofs_fill_super().  Just use mount_bdev() in usual fashion.

OK, these two points are the same, the original alternative patch to fixup it is to use bdevname(),
However I saw what is done in drivers/usb/gadget/function/f_fs.c, therefore I fixed in as what I saw in f_fs.c.

Refer:
https://lists.ozlabs.org/pipermail/linux-erofs/2018-September/000548.html
https://lists.ozlabs.org/pipermail/linux-erofs/2018-September/000551.html

I could remove erofs_mount_private entirely if you want. :)

> 	* what the hell are you doing with ->s_root???  Why would you
> possibly want it hashed and what kind of dcache lookup could find it?
> That d_rehash() looks deeply confused; what are you trying to do there?
Thanks for pointing out. After I think into this piece of code, I also think that is redundant.
I will fix it immediately, thanks again for pointing out.

Thanks,
Gao Xiang

> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [git pull] mount API series
  2018-10-31 15:38 ` Eric W. Biederman
  2018-10-31 16:18   ` Eric W. Biederman
  2018-10-31 16:36   ` Al Viro
@ 2018-11-10 14:19   ` Steven Whitehouse
  2018-11-12  2:07     ` Eric W. Biederman
  2 siblings, 1 reply; 25+ messages in thread
From: Steven Whitehouse @ 2018-11-10 14:19 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Al Viro, Linus Torvalds, linux-fsdevel, linux-kernel

Hi,


On 31/10/18 15:38, Eric W. Biederman wrote:
> Al Viro <viro@ZenIV.linux.org.uk> writes:
>
>> 	mount API series from David Howells.  Last cycle's objections
>> had been of the "I'd do it differently" variety and with no such
>> differently done variants having ever materialized over several
>> cycles...
> Absolutely not.
>
> My objections fundamentally is that I can find real problems when I look
> at the code.  Further the changes have not been incremental changes that
> have evolved the code from one state to another but complete
> replacements of code that make code review very difficult and bisection
> completely inapplicable.
>
> I also object that this series completely fails to fix the worst but I
> have ever seen in the mount API.  Whit no real intrest shown in working
> to fix it.
>
> A couple of bugs that I can see quickly.  Several of which I have
> previously reported:
>
> - There is an easily triggered NULL pointer deference with open_tree
>    and mount propagation.
>
>
Can you share some details of what this NULL dereference is? David and 
Al have been working on the changes as requested by Linus later in this 
thread, and they'd like to tidy up this issue too at the same time if 
possible. We are not asking you to actually provide a fix, in case you 
are too busy to do so, however it would be good to know what the issue 
is so that we can make sure that it is resolved in the next round of 
patches,

Steve.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [git pull] mount API series
  2018-11-10 14:19   ` Steven Whitehouse
@ 2018-11-12  2:07     ` Eric W. Biederman
  2018-11-12 20:54       ` Al Viro
  0 siblings, 1 reply; 25+ messages in thread
From: Eric W. Biederman @ 2018-11-12  2:07 UTC (permalink / raw)
  To: Steven Whitehouse; +Cc: Al Viro, Linus Torvalds, linux-fsdevel, linux-kernel

Steven Whitehouse <swhiteho@redhat.com> writes:

> Hi,
>
>
> On 31/10/18 15:38, Eric W. Biederman wrote:
>> Al Viro <viro@ZenIV.linux.org.uk> writes:
>>
>>> 	mount API series from David Howells.  Last cycle's objections
>>> had been of the "I'd do it differently" variety and with no such
>>> differently done variants having ever materialized over several
>>> cycles...
>> Absolutely not.
>>
>> My objections fundamentally is that I can find real problems when I look
>> at the code.  Further the changes have not been incremental changes that
>> have evolved the code from one state to another but complete
>> replacements of code that make code review very difficult and bisection
>> completely inapplicable.
>>
>> I also object that this series completely fails to fix the worst but I
>> have ever seen in the mount API.  Whit no real intrest shown in working
>> to fix it.
>>
>> A couple of bugs that I can see quickly.  Several of which I have
>> previously reported:
>>
>> - There is an easily triggered NULL pointer deference with open_tree
>>    and mount propagation.
>>
>>
> Can you share some details of what this NULL dereference is? David and
> Al have been working on the changes as requested by Linus later in
> this thread, and they'd like to tidy up this issue too at the same
> time if possible. We are not asking you to actually provide a fix, in
> case you are too busy to do so, however it would be good to know what
> the issue is so that we can make sure that it is resolved in the next
> round of patches,

https://lore.kernel.org/lkml/87bm7n5k1r.fsf@xmission.com/

Eric

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [git pull] mount API series
  2018-11-12  2:07     ` Eric W. Biederman
@ 2018-11-12 20:54       ` Al Viro
  2018-12-17 23:10         ` Al Viro
  0 siblings, 1 reply; 25+ messages in thread
From: Al Viro @ 2018-11-12 20:54 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Steven Whitehouse, Linus Torvalds, linux-fsdevel, linux-kernel

On Sun, Nov 11, 2018 at 08:07:20PM -0600, Eric W. Biederman wrote:
> Steven Whitehouse <swhiteho@redhat.com> writes:

> > Can you share some details of what this NULL dereference is? David and
> > Al have been working on the changes as requested by Linus later in
> > this thread, and they'd like to tidy up this issue too at the same
> > time if possible. We are not asking you to actually provide a fix, in
> > case you are too busy to do so, however it would be good to know what
> > the issue is so that we can make sure that it is resolved in the next
> > round of patches,
> 
> https://lore.kernel.org/lkml/87bm7n5k1r.fsf@xmission.com/

Thought it had been dealt with, but you are right - oops is still there
and obviously needs fixing.  However, looking at that place in mainline...
How does that thing manage to work?  Look:
        /* Notice when we are propagating across user namespaces */
        if (m->mnt_ns->user_ns != user_ns)
                type |= CL_UNPRIVILEGED;
        child = copy_tree(last_source, last_source->mnt.mnt_root, type);
        if (IS_ERR(child))
                return PTR_ERR(child);
        child->mnt.mnt_flags &= ~MNT_LOCKED;
        mnt_set_mountpoint(m, mp, child);
        last_dest = m;
        last_source = child;
OK, we'd created the copy to be attached to the next candidate mountpoint.
If we have e.g. a 4-element peer group, we'll start with what we'd been
asked to mount, then call that sucker three times, getting a copy for
each of those mountpoints.  Right?  Now, what happens if the 1st, 3rd and
4th members live in your namespace, with the second one being elsewhere?
We have
	source_mnt: that'll go on top of the 1st mountpoint
	copy of source_mnt: that'll go on top of the 2nd mountpoint
	copy of copy of source_mnt: that'll go on top of the 3rd mountpoint
	copy of copy of copy of source_mnt: that'll go on top of the 4th one
And AFAICS your logics there has just made sure that everything except the
source_mnt will have MNT_LOCK_... all over the place.  IOW, the effect of
CL_UNPRIVELEGED is cumulative.

How the hell does that code avoid this randomness?  Note had the members of
that peer group been in a different order, you would've gotten a different result.
What am I missing here?

Oops is a separate story, and a regression in its own right; it needs to be
fixed.  But I would really like to sort out the semantics of the existing
code in that area, so that we don't end up with patch conflicts.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [git pull] mount API series
  2018-11-12 20:54       ` Al Viro
@ 2018-12-17 23:10         ` Al Viro
  2018-12-21 16:25           ` Eric W. Biederman
  0 siblings, 1 reply; 25+ messages in thread
From: Al Viro @ 2018-12-17 23:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Steven Whitehouse, Linus Torvalds, linux-fsdevel, linux-kernel

On Mon, Nov 12, 2018 at 08:54:39PM +0000, Al Viro wrote:
> On Sun, Nov 11, 2018 at 08:07:20PM -0600, Eric W. Biederman wrote:
> > Steven Whitehouse <swhiteho@redhat.com> writes:
> 
> > > Can you share some details of what this NULL dereference is? David and
> > > Al have been working on the changes as requested by Linus later in
> > > this thread, and they'd like to tidy up this issue too at the same
> > > time if possible. We are not asking you to actually provide a fix, in
> > > case you are too busy to do so, however it would be good to know what
> > > the issue is so that we can make sure that it is resolved in the next
> > > round of patches,
> > 
> > https://lore.kernel.org/lkml/87bm7n5k1r.fsf@xmission.com/
> 
> Thought it had been dealt with, but you are right - oops is still there
> and obviously needs fixing.  However, looking at that place in mainline...
> How does that thing manage to work?  Look:
>         /* Notice when we are propagating across user namespaces */
>         if (m->mnt_ns->user_ns != user_ns)
>                 type |= CL_UNPRIVILEGED;
>         child = copy_tree(last_source, last_source->mnt.mnt_root, type);
>         if (IS_ERR(child))
>                 return PTR_ERR(child);
>         child->mnt.mnt_flags &= ~MNT_LOCKED;
>         mnt_set_mountpoint(m, mp, child);
>         last_dest = m;
>         last_source = child;
> OK, we'd created the copy to be attached to the next candidate mountpoint.
> If we have e.g. a 4-element peer group, we'll start with what we'd been
> asked to mount, then call that sucker three times, getting a copy for
> each of those mountpoints.  Right?  Now, what happens if the 1st, 3rd and
> 4th members live in your namespace, with the second one being elsewhere?
> We have
> 	source_mnt: that'll go on top of the 1st mountpoint
> 	copy of source_mnt: that'll go on top of the 2nd mountpoint
> 	copy of copy of source_mnt: that'll go on top of the 3rd mountpoint
> 	copy of copy of copy of source_mnt: that'll go on top of the 4th one
> And AFAICS your logics there has just made sure that everything except the
> source_mnt will have MNT_LOCK_... all over the place.  IOW, the effect of
> CL_UNPRIVELEGED is cumulative.
> 
> How the hell does that code avoid this randomness?  Note had the members of
> that peer group been in a different order, you would've gotten a different result.
> What am I missing here?
> 
> Oops is a separate story, and a regression in its own right; it needs to be
> fixed.  But I would really like to sort out the semantics of the existing
> code in that area, so that we don't end up with patch conflicts.

Ping?

FWIW, I don't believe that CL_UNPRIVELEGED approach is workable; the logics
we want is, AFAICS, "when destination is not yours, whatever you attach to
it should be all locked".  Correct?

If so, the natural place to deal with that would be in commit_tree() after
propagate_mnt() has created all copies, not while creating those copies.
That, after all, is where we mark all those struct mount as belonging to
namespace...

Again, this is quite independent from the oops you've reported (and that,
BTW, can be triggered without any userns involvement - commit_tree() itself
will oops just fine if parent's ->mnt_ns is NULL, userns or no userns).
I think I understand how to deal with that thing, but it's a separate
story; handling of MNT_LOCK... is a problem that exists in the mainline
and whatever fix we come up with for this one will need to be backportable.

Al "resurfaced from long and thoroughly nasty dive through the LSM gutter
and finally getting around to more pleasant stuff" Viro

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [git pull] mount API series
  2018-12-17 23:10         ` Al Viro
@ 2018-12-21 16:25           ` Eric W. Biederman
  0 siblings, 0 replies; 25+ messages in thread
From: Eric W. Biederman @ 2018-12-21 16:25 UTC (permalink / raw)
  To: Al Viro; +Cc: Steven Whitehouse, Linus Torvalds, linux-fsdevel, linux-kernel

Al Viro <viro@zeniv.linux.org.uk> writes:

> On Mon, Nov 12, 2018 at 08:54:39PM +0000, Al Viro wrote:
>> On Sun, Nov 11, 2018 at 08:07:20PM -0600, Eric W. Biederman wrote:
>> > Steven Whitehouse <swhiteho@redhat.com> writes:
>> 
>> > > Can you share some details of what this NULL dereference is? David and
>> > > Al have been working on the changes as requested by Linus later in
>> > > this thread, and they'd like to tidy up this issue too at the same
>> > > time if possible. We are not asking you to actually provide a fix, in
>> > > case you are too busy to do so, however it would be good to know what
>> > > the issue is so that we can make sure that it is resolved in the next
>> > > round of patches,
>> > 
>> > https://lore.kernel.org/lkml/87bm7n5k1r.fsf@xmission.com/
>> 
>> Thought it had been dealt with, but you are right - oops is still there
>> and obviously needs fixing.  However, looking at that place in mainline...
>> How does that thing manage to work?  Look:
>>         /* Notice when we are propagating across user namespaces */
>>         if (m->mnt_ns->user_ns != user_ns)
>>                 type |= CL_UNPRIVILEGED;
>>         child = copy_tree(last_source, last_source->mnt.mnt_root, type);
>>         if (IS_ERR(child))
>>                 return PTR_ERR(child);
>>         child->mnt.mnt_flags &= ~MNT_LOCKED;
>>         mnt_set_mountpoint(m, mp, child);
>>         last_dest = m;
>>         last_source = child;

[Moved this question up as it's answer is a good foundation for the
rest]

> FWIW, I don't believe that CL_UNPRIVELEGED approach is workable; the logics
> we want is, AFAICS, "when destination is not yours, whatever you attach to
> it should be all locked".  Correct?

Kinda sorta.

There is a tree of privilege.  Mount namespaces owned by a more
privileged user_ns can propagate to slaves in a mount namespace owned by
less privileged user_ns.  Because the required permission for the making
the initial mount are insufficient a mount can not propagate from a less
privileged mount namespace to a more privileged mount namespace.

When propagating from a more privileged mount namespace to a less
privileged mount namespace we want to maintain some properties.

1) That mounts that propogate together are locked together.
2) That the mount flags readonly, nodev, nosuid, and noexec
   when set in a more privileged mount namespace are not changeable
   in a less privileged mount namespace.

>> OK, we'd created the copy to be attached to the next candidate mountpoint.
>> If we have e.g. a 4-element peer group, we'll start with what we'd been
>> asked to mount, then call that sucker three times, getting a copy for
>> each of those mountpoints.  Right?  Now, what happens if the 1st, 3rd and
>> 4th members live in your namespace, with the second one being
>> elsewhere?

As I understand the question we have a bug in the mount propagation
tree.  All peers need to be in mount namespaces that share a user
namespace.

Further copy_tree fundamentally needs to copy from either a peer node to
a peer node or a master node to a slave node to achieve the correct
results when setting up the propagation tree.

So I don't currently see how the logic in propgate_mnt could
possibly get into a problematic situation.

>> We have
>> 	source_mnt: that'll go on top of the 1st mountpoint
>> 	copy of source_mnt: that'll go on top of the 2nd mountpoint
>> 	copy of copy of source_mnt: that'll go on top of the 3rd mountpoint
>> 	copy of copy of copy of source_mnt: that'll go on top of the 4th one
>> And AFAICS your logics there has just made sure that everything except the
>> source_mnt will have MNT_LOCK_... all over the place.  IOW, the effect of
>> CL_UNPRIVELEGED is cumulative.
>> 
>> How the hell does that code avoid this randomness?  Note had the members of
>> that peer group been in a different order, you would've gotten a different result.
>> What am I missing here?

I believe it is that members of a peer group are guaranteed to share
a user namepace, or else they can't be peers.

>> Oops is a separate story, and a regression in its own right; it needs to be
>> fixed.  But I would really like to sort out the semantics of the existing
>> code in that area, so that we don't end up with patch conflicts.
>
> Ping?

Apologies for the delay I have been on my own deep dive and I initially
misunderstood the question.  For some reason I thought you were implying
that f2ebb3a921c1 ("smarter propagate_mnt()") changed the logic and
introduced the bug.  So I didn't understand why you were asking me.

> If so, the natural place to deal with that would be in commit_tree() after
> propagate_mnt() has created all copies, not while creating those copies.
> That, after all, is where we mark all those struct mount as belonging to
> namespace...

Alternatively we could add a new mount propagation flag call it
MNT_PRIV_BORDER that would mean that we don't need to look at mount
namespaces when considering the propgation tree.  We would just need
to notice that the slave we are propagating to sets MNT_PRIV_BORDER
and set CL_UNPRIVILEGED when creating the copy.

Then it would only be necessary to set MNT_BRIV_BORDER when we
are creating a slave with CL_UNPRIVLEGED is set.  That would
keep all of the information in the mount propagation tree avoiding
and let us ignore namespaces when performing mount propagation.

I think that would work better when propagating to unconnected mounts.

It raises the possibility of creating mount propagation trees
with surprising properties if someone deliberately copies from a less
privileged mount namespace to a more privileged mount namespace.  But
as the user is asking for that deliberately I don't think we care.  If
we do care we can always do something when we attach the floating island
of mounts to a mount namespace.

The ammended invariant would be that peers either all have
MNT_PRIV_BORDER set or have MNT_PRIV_BORDER clear.

A patch of I am thinking the code below.

> Again, this is quite independent from the oops you've reported (and that,
> BTW, can be triggered without any userns involvement - commit_tree() itself
> will oops just fine if parent's ->mnt_ns is NULL, userns or no userns).
> I think I understand how to deal with that thing, but it's a separate
> story; handling of MNT_LOCK... is a problem that exists in the mainline
> and whatever fix we come up with for this one will need to be backportable.

Yes that is oopsable even without logic introduced by 132c94e31b8b
("vfs: Carefully propogate mounts across user namespaces")

> Al "resurfaced from long and thoroughly nasty dive through the LSM gutter
> and finally getting around to more pleasant stuff" Viro

Do you have any guess when you will be posting the patches that resulted
from that deep dive for review?

Eric

diff --git a/fs/namespace.c b/fs/namespace.c
index f195ee3c5aad..c6faa7513c6c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1046,6 +1046,8 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 		list_add(&mnt->mnt_slave, &old->mnt_slave_list);
 		mnt->mnt_master = old;
 		CLEAR_MNT_SHARED(mnt);
+		if (flag & CL_UNPRIVILEGED)
+			mnt->mnt.mnt_flags |= MNT_PRIV_BORDER;
 	} else if (!(flag & CL_PRIVATE)) {
 		if ((flag & CL_MAKE_SHARED) || IS_MNT_SHARED(old))
 			list_add(&mnt->mnt_share, &old->mnt_share);
diff --git a/fs/pnode.c b/fs/pnode.c
index 53d411a371ce..1654e6198690 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -260,7 +260,7 @@ static int propagate_one(struct mount *m)
 	}
 		
 	/* Notice when we are propagating across user namespaces */
-	if (m->mnt_ns->user_ns != user_ns)
+	if ((type == CL_SHARED) && (m->mnt.mnt_flags & MNT_PRIV_BORDER))
 		type |= CL_UNPRIVILEGED;
 	child = copy_tree(last_source, last_source->mnt.mnt_root, type);
 	if (IS_ERR(child))
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 1f38e0201d05..4bf65f42ae57 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -52,6 +52,7 @@ struct mnt_namespace;
 			    MNT_DOOMED | MNT_SYNC_UMOUNT | MNT_MARKED)
 
 #define MNT_INTERNAL	0x4000
+#define MNT_PRIV_BORDER	0x8000	/* Propagation from greater to lesser privilege */
 
 #define MNT_LOCK_ATIME		0x040000
 #define MNT_LOCK_NOEXEC		0x080000


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2018-12-21 16:25 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-31  5:33 [git pull] mount API series Al Viro
2018-10-31 15:38 ` Eric W. Biederman
2018-10-31 16:18   ` Eric W. Biederman
2018-10-31 16:36   ` Al Viro
2018-11-01 16:51     ` Al Viro
2018-11-10 14:19   ` Steven Whitehouse
2018-11-12  2:07     ` Eric W. Biederman
2018-11-12 20:54       ` Al Viro
2018-12-17 23:10         ` Al Viro
2018-12-21 16:25           ` Eric W. Biederman
2018-10-31 16:18 ` Linus Torvalds
2018-11-01 10:53   ` Steven Whitehouse
2018-11-01 15:57     ` Linus Torvalds
2018-11-01 17:18     ` David Howells
2018-11-01 18:33       ` Linus Torvalds
2018-11-01 22:05         ` Al Viro
2018-11-01 22:07           ` Linus Torvalds
2018-11-01 23:59       ` David Howells
2018-11-02  4:07         ` Al Viro
2018-11-02 19:42           ` Al Viro
2018-11-03  6:14             ` Gao Xiang
2018-11-03  6:30           ` Gao Xiang
2018-10-31 18:39 ` David Howells
2018-10-31 20:49   ` Miklos Szeredi
2018-10-31 18:45 ` [PATCH] vfs: Fix incorrect user_ns assignment in proc and mqueue David Howells

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).