LKML Archive on lore.kernel.org
* [PATCH 00/14] VFS: Filesystem information [ver #18]
@ 2020-03-09 14:00 David Howells
  2020-03-09 14:00 ` [PATCH 01/14] VFS: Add additional RESOLVE_* flags " David Howells
                   ` (16 more replies)
  0 siblings, 17 replies; 50+ messages in thread
From: David Howells @ 2020-03-09 14:00 UTC (permalink / raw)
  To: torvalds, viro
  Cc: Theodore Ts'o, Stefan Metzmacher, Andreas Dilger, linux-ext4,
	Aleksa Sarai, Trond Myklebust, Anna Schumaker, linux-nfs,
	linux-api, dhowells, raven, mszeredi, christian, jannh,
	darrick.wong, kzak, jlayton, linux-api, linux-fsdevel,
	linux-security-module, linux-kernel


Here's a set of patches that adds a system call, fsinfo(), that allows
information about the VFS, mount topology, superblock and files to be
retrieved.

The patchset is based on top of the notifications patchset and allows event
counters implemented in the latter to be retrieved to allow overruns to be
efficiently managed.

Included are a couple of sample programs plus limited example code for NFS
and Ext4.  The example code is not intended to go upstream as-is.


=======
THE WHY
=======

Why do we want this?

Using /proc/mounts (or similar) has problems:

 (1) Reading from it holds a global lock (namespace_sem) that prevents
     mounting and unmounting.  Lots of data is encoded and mangled into
     text whilst the lock is held, including superblock option strings and
     mount point paths.  This causes performance problems when there are a
     lot of mount objects in a system.

 (2) Even though namespace_sem is held during a read, reading the whole
     file isn't necessarily atomic with respect to mount-type operations.
     If a read isn't satisfied in one go, then it may return to userspace
     briefly and then continue reading some way into the file.  But changes
     can occur in the interval that may then go unseen.

 (3) Determining what has changed means parsing and comparing consecutive
     outputs of /proc/mounts.

 (4) Querying a specific mount or superblock means searching through
     /proc/mounts and searching by path or mount ID - but we might have an
     fd we want to query.

 (5) Mount topology is not explicit.  One must derive it manually by
     comparing entries.

 (6) Whilst you can poll() it for events, it only tells you that something
     changed in the namespace, not what or whether you can even see the
     change.

To fix the notification issues, the preceding notifications patchset added
mount watch notifications whereby you can watch for notifications in a
specific mount subtree.  The notification messages include the ID(s) of the
affected mounts.

To support notifications, however, we need to be able to handle overruns in
the notification queue.  I added a number of event counters to struct
super_block and struct mount to allow you to pin down the changes, but
there needs to be a way to retrieve them.  Exposing them through /proc
would require adding yet another /proc/mounts-type file.  We could add
per-mount directories full of attributes in sysfs, but that has issues also
(see below).

Adding an extensible system call interface for retrieving filesystem
information also allows other things to be exposed:

 (1) Jeff Layton's error handling changes need a way to allow error event
     information to be retrieved.

 (2) Bits in masks returned by things like statx() and FS_IOC_GETFLAGS are
     actually 3-state { Set, Unset, Not supported }.  It could be useful to
     provide a way to expose information like this[*].

 (3) Limits of the numerical metadata values in a filesystem[*].

 (4) Filesystem capability information[*].  Filesystems don't all have the
     same capabilities, and even different instances may have different
     capabilities, particularly with network filesystems where the set of
     capabilities may be server-dependent.  Capabilities might even vary at
     file granularity - though possibly such information should be conveyed
     through statx() instead.

 (5) ID mapping/shifting tables in use for a superblock.

 (6) Filesystem-specific information.  I need something for AFS so that I
     can do pioctl()-emulation, thereby allowing me to implement certain of
     the AFS command line utilities that query state of a particular file.
     This could also have application for other filesystems, such as NFS,
     CIFS and ext4.

 [*] In a lot of cases these are probably fixed and can be memcpy'd from
     static data.
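
As an illustration of the 3-state idea in (2), the sketch below decodes one
bit from a pair of masks.  It assumes the filesystem reports both a
"supported" mask and a "value" mask; the enum and function names are
hypothetical and are not part of this patchset.

```c
#include <stdint.h>

/* Hypothetical 3-state result for a single flag bit. */
enum tri_state { TRI_UNSUPPORTED, TRI_UNSET, TRI_SET };

/*
 * Decode one bit.  A bit that is absent from supported_mask is reported
 * as unsupported regardless of what value_mask says.
 */
static enum tri_state flag_state(uint64_t supported_mask,
				 uint64_t value_mask, uint64_t bit)
{
	if (!(supported_mask & bit))
		return TRI_UNSUPPORTED;
	return (value_mask & bit) ? TRI_SET : TRI_UNSET;
}
```

With only a value mask, a clear bit cannot be told apart from a bit the
filesystem does not implement; the second mask is what carries that
distinction.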

There's a further consideration: I want to make it possible to have
fsconfig(fd, FSCONFIG_CMD_CREATE) be intercepted by a container manager
such that the manager can supervise a mount attempted inside the container.
The manager would be given an fd pointing to the fs_context struct and
would then need some way to query it (fsinfo()) and modify it (fsconfig()).
This could also be used to arbitrate user-requested mounts when containers
are not in play.


============================
WHY NOT USE PROCFS OR SYSFS?
============================

Why is it better to go with a new system call rather than adding more magic
stuff to /proc or /sys for each superblock object and each mount object?

 (1) It can be targeted.  It makes it easy to query directly by path or
     fd, but can also query by mount ID or fscontext fd.  procfs and sysfs
     cannot do three of these things easily.

 (2) Easier to provide LSM oversight.  Is the accessing process allowed to
     query information pertinent to a particular file?

 (3) It's more efficient as we can return specific binary data rather than
     making huge text dumps.  Granted, sysfs and procfs could present the
     same data, though as lots of little files which have to be
     individually opened, read, closed and parsed.

 (4) We wouldn't have the overhead of open and close (even adding a
     self-contained readfile() syscall has to do that internally).

 (5) Opening a file in procfs or sysfs has a pathwalk overhead for each
     file accessed.  We can use an integer attribute ID instead (yes, this
     is similar to ioctl) - but could also use a string ID if that is
     preferred.

 (6) Can query cross-namespace if, say, a container manager process is
     given an fs_context that hasn't yet been mounted into a namespace - or
     hasn't even been fully created yet.

 (7) Don't have to create/delete a bunch of sysfs/procfs nodes each time a
     mount happens or is removed - and since systemd makes much use of
     mount namespaces and mount propagation, this will create a lot of
     nodes.


================
DESIGN DECISIONS
================

 (1) Information is partitioned into sets of attributes.

 (2) Attribute IDs are integers as they're fast to compare.

 (3) Attribute values are typed (struct, list of structs, string, opaque
     blob).  The type is fixed for a particular attribute.

 (4) For structure types, the length is also a version.  New fields can be
     tacked onto the end.

 (5) When copying a versioned struct to userspace, the core handles a
     version mismatch by truncating or zero-padding the data as necessary.
     None of this is seen by the filesystem.

 (6) The core handles all the buffering and buffer resizing.

 (7) The filesystem never gets any access to the userspace parameter buffer
     or result buffer.

 (8) "Meta" attributes can describe other attributes.
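
Design decisions (4) and (5) can be sketched in plain userspace C as
follows.  This is only a model of the behaviour described above, not the
code in fs/fsinfo.c; the function name is hypothetical and memcpy() stands
in for the copy to userspace.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy a struct of ksize bytes (the size this side knows) into a caller
 * buffer of usize bytes.  A shorter caller buffer gets a truncated copy;
 * a longer one has its tail zero-filled.  The full size available is
 * returned either way, so the caller can tell which case applied.
 */
static size_t copy_versioned_struct(void *ubuf, size_t usize,
				    const void *kdata, size_t ksize)
{
	size_t n = usize < ksize ? usize : ksize;

	if (ubuf && usize) {
		memcpy(ubuf, kdata, n);
		if (usize > n)
			memset((char *)ubuf + n, 0, usize - n);
	}
	return ksize;
}
```

Because the fix-up happens in one place, a filesystem only ever fills in
the struct layout it was built against.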


========
OVERVIEW
========

fsinfo() is a system call that allows information about the filesystem at a
particular path point to be queried as a set of attributes.

Attribute values are of four basic types:

 (1) Structure with version-dependent length (the length is the version).

 (2) Variable-length string.

 (3) List of structures (all the same length).

 (4) Opaque blob.

Attributes can have multiple values either as a sequence of values or a
sequence-of-sequences of values and all the values of a particular
attribute must be of the same type.  Values can be up to INT_MAX size,
subject to memory availability.

Note that the values of an attribute *are* allowed to vary between dentries
within a single superblock, depending on the specific dentry that you're
looking at, but the values still have to be of the type for that attribute.

I've tried to make the interface as light as possible: attribute IDs are
integers rather than strings, and the core does all the buffer allocation
and expansion and all the extensibility support work rather than leaving
that to the filesystems.  This means that userspace pointers are not
exposed to the filesystem.


fsinfo() allows a variety of information to be retrieved about a filesystem
and the mount topology:

 (1) General superblock attributes:

     - Filesystem identifiers (UUID, volume label, device numbers, ...)
     - The limits on a filesystem's capabilities
     - Information on supported statx fields and attributes and IOC flags.
     - A variety of single-bit flags indicating supported capabilities.
     - Timestamp resolution and range.
     - The amount of space/free space in a filesystem (as statfs()).
     - Superblock notification counter.

 (2) Filesystem-specific superblock attributes:

     - Superblock-level timestamps.
     - Cell name, workgroup or other netfs grouping concept.
     - Server names and addresses.

 (3) VFS information:

     - Mount topology information.
     - Mount attributes.
     - Mount notification counter.
     - Mount point path.

 (4) Information about what the fsinfo() syscall itself supports, including
     the type and struct size of attributes.

The system is extensible:

 (1) New attributes can be added.  There is no requirement that a
     filesystem implement every attribute.  A helper function is provided
     to scan a list of attributes and a filesystem can have multiple such
     lists.

 (2) Structure attributes with version-dependent length can be made larger
     and have additional information tacked on the end, provided the layout
     of the existing fields is kept.  If an older process asks for a shorter
     structure, it will only be given the bits it asks for.  If a newer
     process asks for a longer structure on an older kernel, the extra
     space will be set to 0.  In all cases, the size of the data actually
     available is returned.

     In essence, the size of a structure is that structure's version: a
     smaller size is an earlier version and a later version includes
     everything that the earlier version did.

 (3) New single-bit capability flags can be added.  This is a structure-typed
     attribute and, as such, (2) applies.  Any bits you wanted but the kernel
     doesn't support are automatically set to 0.
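
From the caller's side, the returned size is what says which fields the
kernel actually filled in.  A minimal sketch, assuming a hypothetical
two-field attribute struct (neither the struct nor the macro is part of
the proposed UAPI):

```c
#include <stddef.h>

/* Hypothetical attribute struct: the first version had only 'a'. */
struct example_attr {
	unsigned long long a;	/* present since the first version */
	unsigned long long b;	/* appended by a later version */
};

/*
 * True if 'field' lies wholly within the first 'len' bytes, where 'len'
 * is the (non-negative) size reported back for the attribute.
 */
#define FIELD_VALID(type, field, len) \
	((size_t)(len) >= offsetof(type, field) + sizeof(((type *)0)->field))
```

A caller would check the return value for an error first and only then
apply FIELD_VALID() to it; a field that fails the test reads as 0 and
should be treated as "not provided by this kernel".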

fsinfo() may be called like the following, for example:

	struct fsinfo_params params = {
		.resolve_flags	= RESOLVE_NO_TRAILING_SYMLINKS,
		.flags		= FSINFO_FLAGS_QUERY_PATH,
		.request	= FSINFO_ATTR_AFS_SERVER_ADDRESSES,
		.Nth		= 2,
	};
	struct fsinfo_server_address address;
	len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc",
		     &params, sizeof(params),
		     &address, sizeof(address));

The above example would query an AFS filesystem to retrieve the address
list for the 3rd server, and:

	struct fsinfo_params params = {
		.resolve_flags	= RESOLVE_NO_TRAILING_SYMLINKS,
		.flags		= FSINFO_FLAGS_QUERY_PATH,
		.request	= FSINFO_ATTR_NFS_SERVER_NAME,
	};
	char server_name[256];
	len = fsinfo(AT_FDCWD, "/home/dhowells/", &params, sizeof(params),
		     server_name, sizeof(server_name));

would retrieve the name of the NFS server as a string.

In future, I want to make fsinfo() capable of querying a context created by
fsopen() or fspick(), e.g.:

	fd = fsopen("ext4", 0);
	struct fsinfo_params params = {
		.flags		= FSINFO_FLAGS_QUERY_FSCONTEXT,
		.request	= FSINFO_ATTR_CONFIGURATION,
	};
	char buffer[65536];
	fsinfo(fd, NULL, &params, sizeof(params), buffer, sizeof(buffer));

even if that context doesn't currently have a superblock attached.

The patches can be found here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git

on branch:

	fsinfo-core


===================
SIGNIFICANT CHANGES
===================

 ver #18:

 (*) Moved the mount and superblock notification patches into a different
     branch.

 (*) Made superblock configuration (->show_opts), bindmount path
     (->show_path) and filesystem statistics (->show_stats) available as
     the CONFIGURATION, MOUNT_PATH and FS_STATISTICS attributes.

 (*) Made mountpoint device name available, filtered through the superblock
     (->show_devname), as the SOURCE attribute.

 (*) Made the mountpoint available as a full path as well as a relative
     one.

 (*) Added more event counters to MOUNT_INFO, including a subtree
     notification counter, to make it easier to clean up after a
     notification overrun.

 (*) Made the event counter value returned by MOUNT_CHILDREN the sum of the
     five event counters.

 (*) Added a mount uniquifier and added that to the MOUNT_CHILDREN entries
     also so that mount ID reuse can be detected.

 (*) Merged the SB_NOTIFICATION attribute into the MOUNT_INFO attribute to
     avoid duplicate information.

 (*) Switched to using the RESOLVE_* flags rather than AT_* flags for
     pathwalk control.  Added more RESOLVE_* flags.

 (*) Used a lock instead of RCU to enumerate children for the
     MOUNT_CHILDREN attribute for safety.  This is probably worth
     revisiting at a later date, however.


 ver #17:

 (*) Applied comments from Jann Horn, Darrick Wong and Christian Brauner.

 (*) Rearranged the order in which fsinfo() does things so that the
     superblock operations table can have a function pointer rather than a
     table pointer.  The ->fsinfo() op is now called at least twice, once
     to determine the size of buffer needed and then to retrieve the data.
     If the retrieval step indicates yet more space is needed, the buffer
     will be expanded and that step repeated.

 (*) Merge the element size into the size in the fsinfo_attribute def and
     don't set size for strings or opaques.  Let a helper work that out.
     This means that strings can actually get larger than 4K.

 (*) A helper is provided to scan a list of attributes and call the
     appropriate get function.  This can be called from a filesystem's
     ->fsinfo() method multiple times.  It also handles attribute
     enumeration and info querying.

 (*) Rearranged the patches to put all the notification patches first.
     This allowed some of the bits to be squashed together.  At some point,
     I'll move the notification patches into a different branch.

 ver #16:

 (*) Split the features bits out of the fsinfo() core into their own patch
     and got rid of the name encoding attributes.

 (*) Renamed the 'array' type to 'list' and made AFS use it for returning
     server address lists.

 (*) Changed the ->fsinfo() method into an ->fsinfo_attributes[] table,
     where each attribute has a ->get() method to deal with it.  These
     tables can then be returned with an fsinfo meta attribute.

 (*) Dropped the fscontext query and parameter/description retrieval
     attributes for now.

 (*) Picked the mount topology attributes into this branch.

 (*) Picked the mount notifications into this branch and rebased on top of
     notifications-pipe-core.

 (*) Picked the superblock notifications into this branch.

 (*) Add sample code for Ext4 and NFS.

David
---
David Howells (14):
      VFS: Add additional RESOLVE_* flags
      fsinfo: Add fsinfo() syscall to query filesystem information
      fsinfo: Provide a bitmap of supported features
      fsinfo: Allow retrieval of superblock devname, options and stats
      fsinfo: Allow fsinfo() to look up a mount object by ID
      fsinfo: Add a uniquifier ID to struct mount
      fsinfo: Allow mount information to be queried
      fsinfo: Allow the mount topology propogation flags to be retrieved
      fsinfo: Provide notification overrun handling support
      fsinfo: sample: Mount listing program
      fsinfo: Add API documentation
      fsinfo: Add support for AFS
      fsinfo: Example support for Ext4
      fsinfo: Example support for NFS


 Documentation/filesystems/fsinfo.rst        |  564 +++++++++++++++++
 arch/alpha/kernel/syscalls/syscall.tbl      |    1 
 arch/arm/tools/syscall.tbl                  |    1 
 arch/arm64/include/asm/unistd.h             |    2 
 arch/ia64/kernel/syscalls/syscall.tbl       |    1 
 arch/m68k/kernel/syscalls/syscall.tbl       |    1 
 arch/microblaze/kernel/syscalls/syscall.tbl |    1 
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 
 arch/parisc/kernel/syscalls/syscall.tbl     |    1 
 arch/powerpc/kernel/syscalls/syscall.tbl    |    1 
 arch/s390/kernel/syscalls/syscall.tbl       |    1 
 arch/sh/kernel/syscalls/syscall.tbl         |    1 
 arch/sparc/kernel/syscalls/syscall.tbl      |    1 
 arch/x86/entry/syscalls/syscall_32.tbl      |    1 
 arch/x86/entry/syscalls/syscall_64.tbl      |    1 
 arch/xtensa/kernel/syscalls/syscall.tbl     |    1 
 fs/Kconfig                                  |    7 
 fs/Makefile                                 |    1 
 fs/afs/internal.h                           |    1 
 fs/afs/super.c                              |  218 +++++++
 fs/d_path.c                                 |    2 
 fs/ext4/Makefile                            |    1 
 fs/ext4/ext4.h                              |    6 
 fs/ext4/fsinfo.c                            |   45 +
 fs/ext4/super.c                             |    3 
 fs/fsinfo.c                                 |  720 ++++++++++++++++++++++
 fs/internal.h                               |   13 
 fs/mount.h                                  |    3 
 fs/namespace.c                              |  362 +++++++++++
 fs/nfs/Makefile                             |    1 
 fs/nfs/fsinfo.c                             |  230 +++++++
 fs/nfs/internal.h                           |    6 
 fs/nfs/nfs4super.c                          |    3 
 fs/nfs/super.c                              |    3 
 fs/open.c                                   |    8 
 include/linux/fcntl.h                       |    3 
 include/linux/fs.h                          |    4 
 include/linux/fsinfo.h                      |  111 +++
 include/linux/syscalls.h                    |    4 
 include/uapi/asm-generic/unistd.h           |    4 
 include/uapi/linux/fsinfo.h                 |  360 +++++++++++
 include/uapi/linux/mount.h                  |   10 
 include/uapi/linux/openat2.h                |    8 
 include/uapi/linux/windows.h                |   35 +
 kernel/sys_ni.c                             |    1 
 samples/vfs/Makefile                        |    7 
 samples/vfs/test-fsinfo.c                   |  880 +++++++++++++++++++++++++++
 samples/vfs/test-mntinfo.c                  |  277 ++++++++
 50 files changed, 3905 insertions(+), 14 deletions(-)
 create mode 100644 Documentation/filesystems/fsinfo.rst
 create mode 100644 fs/ext4/fsinfo.c
 create mode 100644 fs/fsinfo.c
 create mode 100644 fs/nfs/fsinfo.c
 create mode 100644 include/linux/fsinfo.h
 create mode 100644 include/uapi/linux/fsinfo.h
 create mode 100644 include/uapi/linux/windows.h
 create mode 100644 samples/vfs/test-fsinfo.c
 create mode 100644 samples/vfs/test-mntinfo.c




* [PATCH 01/14] VFS: Add additional RESOLVE_* flags [ver #18]
  2020-03-09 14:00 [PATCH 00/14] VFS: Filesystem information [ver #18] David Howells
@ 2020-03-09 14:00 ` David Howells
  2020-03-09 20:56   ` Stefan Metzmacher
                     ` (2 more replies)
  2020-03-09 14:01 ` [PATCH 02/14] fsinfo: Add fsinfo() syscall to query filesystem information " David Howells
                   ` (15 subsequent siblings)
  16 siblings, 3 replies; 50+ messages in thread
From: David Howells @ 2020-03-09 14:00 UTC (permalink / raw)
  To: torvalds, viro
  Cc: Stefan Metzmacher, Aleksa Sarai, dhowells, raven, mszeredi,
	christian, jannh, darrick.wong, kzak, jlayton, linux-api,
	linux-fsdevel, linux-security-module, linux-kernel

Add additional RESOLVE_* flags to correspond to AT_* flags that aren't
currently implemented:

	RESOLVE_NO_TRAILING_SYMLINKS    for AT_SYMLINK_NOFOLLOW
	RESOLVE_NO_TRAILING_AUTOMOUNTS  for AT_NO_AUTOMOUNT
	RESOLVE_EMPTY_PATH              for AT_EMPTY_PATH

This is necessary for fsinfo() to use RESOLVE_* flags instead of AT_* flags
if the latter are to be considered deprecated for new system calls.

Also make openat2() handle RESOLVE_NO_TRAILING_SYMLINKS.

Automounting is currently forced by doing an open(), so adding support to
openat2() for RESOLVE_NO_TRAILING_AUTOMOUNTS is not trivial.
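
The correspondence listed above could be expressed as a translation helper
like the one below.  This is a userspace sketch only: the helper name is
hypothetical, the RESOLVE_* values are the ones proposed in this patch
(they are not in existing uapi headers), and the AT_* fallbacks are the
usual Linux values for the case where <fcntl.h> does not expose them.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for AT_NO_AUTOMOUNT and AT_EMPTY_PATH */
#endif
#include <fcntl.h>

/* Fallbacks in case the libc headers did not expose these. */
#ifndef AT_NO_AUTOMOUNT
#define AT_NO_AUTOMOUNT		0x800
#endif
#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH		0x1000
#endif

/* Values proposed by this patch. */
#define RESOLVE_NO_TRAILING_SYMLINKS	0x20
#define RESOLVE_NO_TRAILING_AUTOMOUNTS	0x40
#define RESOLVE_EMPTY_PATH		0x80

/* Map the three AT_* flags onto their proposed RESOLVE_* counterparts. */
static unsigned int at_to_resolve_flags(int at_flags)
{
	unsigned int resolve = 0;

	if (at_flags & AT_SYMLINK_NOFOLLOW)
		resolve |= RESOLVE_NO_TRAILING_SYMLINKS;
	if (at_flags & AT_NO_AUTOMOUNT)
		resolve |= RESOLVE_NO_TRAILING_AUTOMOUNTS;
	if (at_flags & AT_EMPTY_PATH)
		resolve |= RESOLVE_EMPTY_PATH;
	return resolve;
}
```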

Reported-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Aleksa Sarai <cyphar@cyphar.com>
---

 fs/open.c                    |    8 +++++---
 include/linux/fcntl.h        |    3 ++-
 include/uapi/linux/openat2.h |    8 +++++++-
 3 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 0788b3715731..7c38a7605c21 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -977,7 +977,7 @@ inline struct open_how build_open_how(int flags, umode_t mode)
 inline int build_open_flags(const struct open_how *how, struct open_flags *op)
 {
 	int flags = how->flags;
-	int lookup_flags = 0;
+	int lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
 	int acc_mode = ACC_MODE(flags);
 
 	/* Must never be set by userspace */
@@ -1055,8 +1055,8 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
 
 	if (flags & O_DIRECTORY)
 		lookup_flags |= LOOKUP_DIRECTORY;
-	if (!(flags & O_NOFOLLOW))
-		lookup_flags |= LOOKUP_FOLLOW;
+	if (flags & O_NOFOLLOW)
+		lookup_flags &= ~LOOKUP_FOLLOW;
 
 	if (how->resolve & RESOLVE_NO_XDEV)
 		lookup_flags |= LOOKUP_NO_XDEV;
@@ -1068,6 +1068,8 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
 		lookup_flags |= LOOKUP_BENEATH;
 	if (how->resolve & RESOLVE_IN_ROOT)
 		lookup_flags |= LOOKUP_IN_ROOT;
+	if (how->resolve & RESOLVE_NO_TRAILING_SYMLINKS)
+		lookup_flags &= ~LOOKUP_FOLLOW;
 
 	op->lookup_flags = lookup_flags;
 	return 0;
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 7bcdcf4f6ab2..eacf17a8ca34 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -19,7 +19,8 @@
 /* List of all valid flags for the how->resolve argument: */
 #define VALID_RESOLVE_FLAGS \
 	(RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \
-	 RESOLVE_BENEATH | RESOLVE_IN_ROOT)
+	 RESOLVE_BENEATH | RESOLVE_IN_ROOT | RESOLVE_NO_TRAILING_SYMLINKS | \
+	 RESOLVE_NO_TRAILING_AUTOMOUNTS | RESOLVE_EMPTY_PATH)
 
 /* List of all open_how "versions". */
 #define OPEN_HOW_SIZE_VER0	24 /* sizeof first published struct */
diff --git a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h
index 58b1eb711360..2647a108f116 100644
--- a/include/uapi/linux/openat2.h
+++ b/include/uapi/linux/openat2.h
@@ -22,7 +22,10 @@ struct open_how {
 	__u64 resolve;
 };
 
-/* how->resolve flags for openat2(2). */
+/*
+ * Path resolution flags to replace AT_* flags in all new syscalls that would
+ * use them.
+ */
 #define RESOLVE_NO_XDEV		0x01 /* Block mount-point crossings
 					(includes bind-mounts). */
 #define RESOLVE_NO_MAGICLINKS	0x02 /* Block traversal through procfs-style
@@ -35,5 +38,8 @@ struct open_how {
 #define RESOLVE_IN_ROOT		0x10 /* Make all jumps to "/" and ".."
 					be scoped inside the dirfd
 					(similar to chroot(2)). */
+#define RESOLVE_NO_TRAILING_SYMLINKS	0x20 /* Don't follow trailing symlinks in the path */
+#define RESOLVE_NO_TRAILING_AUTOMOUNTS	0x40 /* Don't follow trailing automounts in the path */
+#define RESOLVE_EMPTY_PATH	0x80	/* Permit a path of "" to indicate the dfd exactly */
 
 #endif /* _UAPI_LINUX_OPENAT2_H */




* [PATCH 02/14] fsinfo: Add fsinfo() syscall to query filesystem information [ver #18]
  2020-03-09 14:00 [PATCH 00/14] VFS: Filesystem information [ver #18] David Howells
  2020-03-09 14:00 ` [PATCH 01/14] VFS: Add additional RESOLVE_* flags " David Howells
@ 2020-03-09 14:01 ` David Howells
  2020-03-10  9:31   ` Christian Brauner
  2020-03-09 14:01 ` [PATCH 03/14] fsinfo: Provide a bitmap of supported features [ver #18] David Howells
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 50+ messages in thread
From: David Howells @ 2020-03-09 14:01 UTC (permalink / raw)
  To: torvalds, viro
  Cc: linux-api, dhowells, raven, mszeredi, christian, jannh,
	darrick.wong, kzak, jlayton, linux-api, linux-fsdevel,
	linux-security-module, linux-kernel

Add a system call to allow filesystem information to be queried.  A request
value can be given to indicate the desired attribute.  Support is provided
for enumerating multi-value attributes.

===============
NEW SYSTEM CALL
===============

The new system call looks like:

	int ret = fsinfo(int dfd,
			 const char *pathname,
			 const struct fsinfo_params *params,
			 size_t params_size,
			 void *result_buffer,
			 size_t result_buf_size);

The params parameter optionally points to a block of parameters:

	struct fsinfo_params {
		__u32	resolve_flags;
		__u32	flags;
		__u32	request;
		__u32	Nth;
		__u32	Mth;
	};

If params is NULL, the default is that params->request is
FSINFO_ATTR_STATFS and all the other fields are 0.  params_size indicates
the size of the parameter struct.  If the parameter block is short compared
to what the kernel expects, the missing length will be set to 0; if the
parameter block is longer, an error will be given if the excess is not all
zeros.
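
The params_size rule can be modelled in userspace as below.  This is a
sketch of the described behaviour rather than the kernel code: the function
name is hypothetical, and -E2BIG is an assumed error code, since the text
above only says that "an error will be given".

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a caller-supplied parameter block of usize bytes into a block of
 * ksize bytes.  A short block is zero-extended; a long block is accepted
 * only if every byte beyond ksize is zero.
 */
static int copy_params(void *kparams, size_t ksize,
		       const void *uparams, size_t usize)
{
	const unsigned char *u = uparams;
	size_t n = usize < ksize ? usize : ksize;
	size_t i;

	/* Reject non-zero bytes this side does not understand. */
	for (i = ksize; i < usize; i++)
		if (u[i])
			return -E2BIG;

	memset(kparams, 0, ksize);
	if (n)
		memcpy(kparams, uparams, n);
	return 0;
}
```

Rejecting a non-zero tail means a newer caller that sets a field an older
kernel does not know about gets an error instead of being silently
misinterpreted.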

The object to be queried is specified as follows - part of params->flags
indicates the type of reference:

 (1) FSINFO_FLAGS_QUERY_PATH - dfd, pathname and params->resolve_flags
     indicate a filesystem object to query.  There is no separate system
     call providing an analogue of lstat() - RESOLVE_NO_TRAILING_SYMLINKS
     should be set in params->resolve_flags instead.
     RESOLVE_NO_TRAILING_AUTOMOUNTS can also be used to allow an automount
     point to be queried without triggering it.

 (2) FSINFO_FLAGS_QUERY_FD - dfd indicates a file descriptor pointing to
     the filesystem object to query.  pathname should be NULL.

 (3) FSINFO_FLAGS_QUERY_MOUNT - pathname indicates the numeric ID of the
     mountpoint to query as a string.  dfd is used to constrain which
     mounts can be accessed.  If dfd is AT_FDCWD, the mount must be within
     the subtree rooted at chroot, otherwise the mount must be within the
     subtree rooted at the directory specified by dfd.

 (4) In the future FSINFO_FLAGS_QUERY_FSCONTEXT will be added - dfd will
     indicate a context handle fd obtained from fsopen() or fspick(),
     allowing that to be queried before the target superblock is attached
     to the filesystem or even created.
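
For case (3), the mount ID has to be passed in the pathname argument as a
string.  A small sketch of preparing that argument, assuming the ID is
expected in decimal (the text above does not state the base) and using a
hypothetical helper name:

```c
#include <stdio.h>

/*
 * Format a mount ID into buf for use as the pathname argument.
 * Returns the string length, or -1 if buf is too small.
 */
static int mount_id_to_pathname(unsigned int mnt_id, char *buf, size_t size)
{
	int n = snprintf(buf, size, "%u", mnt_id);

	return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```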

params->request indicates the attribute/attributes to be queried.  This can
be one of:

	FSINFO_ATTR_STATFS		- statfs-style info
	FSINFO_ATTR_IDS			- Filesystem IDs
	FSINFO_ATTR_LIMITS		- Filesystem limits
	FSINFO_ATTR_SUPPORTS		- Support for statx, ioctl, etc.
	FSINFO_ATTR_TIMESTAMP_INFO	- Inode timestamp info
	FSINFO_ATTR_VOLUME_ID		- Volume ID (string)
	FSINFO_ATTR_VOLUME_UUID		- Volume UUID
	FSINFO_ATTR_VOLUME_NAME		- Volume name (string)
	FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO - Information about attr Nth
	FSINFO_ATTR_FSINFO_ATTRIBUTES	- List of supported attrs

Some attributes (such as the servers backing a network filesystem) can have
multiple values.  These can be enumerated by setting params->Nth and
params->Mth to 0, 1, ... until ENODATA is returned.
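
The enumeration pattern can be sketched with a stand-in for the system
call.  mock_query_nth() below is hypothetical: it pretends an attribute has
three values and fails with ENODATA beyond that, which is the termination
condition described above.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for fsinfo() on a three-valued attribute. */
static int mock_query_nth(unsigned int nth, int *value)
{
	static const int values[] = { 10, 20, 30 };

	if (nth >= sizeof(values) / sizeof(values[0])) {
		errno = ENODATA;
		return -1;
	}
	*value = values[nth];
	return (int)sizeof(*value);
}

/*
 * Step Nth through 0, 1, ... until ENODATA.  Returns the number of
 * values seen, or -1 if the query failed for any other reason.
 */
static int count_values(void)
{
	unsigned int nth;
	int value;

	for (nth = 0; ; nth++) {
		if (mock_query_nth(nth, &value) < 0)
			return errno == ENODATA ? (int)nth : -1;
	}
}
```

A real caller would treat ENODATA as the normal end of the list and report
anything else as a failure, as count_values() does here.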

result_buffer and result_buf_size point to the reply buffer.  The buffer is
filled up to the specified size, even if this means truncating the reply.
The size of the full reply is returned, irrespective of the amount of data
that was copied.  In future versions, this will allow extra fields to be
tacked on to the end of the reply, but anyone not expecting them will only
get the subset they're expecting.  If either result_buffer or
result_buf_size is 0, no copy will take place and the data size will be
returned.
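
Since the full size is always returned, a caller can probe for the size
first and then fetch into an exactly sized buffer.  The sketch below uses a
hypothetical stand-in, mock_query_string(), that follows the rules just
described; whether real string replies include a terminating NUL is not
assumed, so the caller adds its own.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for a string attribute query: copies at most
 * buf_size bytes and returns the full reply length regardless.
 */
static int mock_query_string(char *buf, size_t buf_size)
{
	static const char reply[] = "server.example.org";
	size_t full = sizeof(reply) - 1;
	size_t n = buf_size < full ? buf_size : full;

	if (buf && buf_size)
		memcpy(buf, reply, n);
	return (int)full;
}

/* Probe with a zero-length buffer, then fetch.  Caller frees the result. */
static char *fetch_string(void)
{
	int len = mock_query_string(NULL, 0);
	char *buf;

	if (len < 0)
		return NULL;
	buf = malloc((size_t)len + 1);
	if (!buf)
		return NULL;
	if (mock_query_string(buf, (size_t)len) != len) {
		free(buf);	/* size changed between the two calls */
		return NULL;
	}
	buf[len] = '\0';
	return buf;
}
```

The second call compares the returned size again because the value could
in principle change between the probe and the fetch; a caller may prefer
to retry in that case rather than give up.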

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---

 arch/alpha/kernel/syscalls/syscall.tbl      |    1 
 arch/arm/tools/syscall.tbl                  |    1 
 arch/arm64/include/asm/unistd.h             |    2 
 arch/ia64/kernel/syscalls/syscall.tbl       |    1 
 arch/m68k/kernel/syscalls/syscall.tbl       |    1 
 arch/microblaze/kernel/syscalls/syscall.tbl |    1 
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 
 arch/parisc/kernel/syscalls/syscall.tbl     |    1 
 arch/powerpc/kernel/syscalls/syscall.tbl    |    1 
 arch/s390/kernel/syscalls/syscall.tbl       |    1 
 arch/sh/kernel/syscalls/syscall.tbl         |    1 
 arch/sparc/kernel/syscalls/syscall.tbl      |    1 
 arch/x86/entry/syscalls/syscall_32.tbl      |    1 
 arch/x86/entry/syscalls/syscall_64.tbl      |    1 
 arch/xtensa/kernel/syscalls/syscall.tbl     |    1 
 fs/Kconfig                                  |    7 
 fs/Makefile                                 |    1 
 fs/fsinfo.c                                 |  582 +++++++++++++++++++++++++
 include/linux/fs.h                          |    4 
 include/linux/fsinfo.h                      |   73 +++
 include/linux/syscalls.h                    |    4 
 include/uapi/asm-generic/unistd.h           |    4 
 include/uapi/linux/fsinfo.h                 |  186 ++++++++
 kernel/sys_ni.c                             |    1 
 samples/vfs/Makefile                        |    5 
 samples/vfs/test-fsinfo.c                   |  633 +++++++++++++++++++++++++++
 28 files changed, 1516 insertions(+), 2 deletions(-)
 create mode 100644 fs/fsinfo.c
 create mode 100644 include/linux/fsinfo.h
 create mode 100644 include/uapi/linux/fsinfo.h
 create mode 100644 samples/vfs/test-fsinfo.c

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 7c0115af9010..4d0b07dde12d 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -479,3 +479,4 @@
 548	common	pidfd_getfd			sys_pidfd_getfd
 549	common	watch_mount			sys_watch_mount
 550	common	watch_sb			sys_watch_sb
+551	common	fsinfo				sys_fsinfo
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index f256f009a89f..fdda8382b420 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -453,3 +453,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index bc0f923e0e04..388eeb71cff0 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		441
+#define __NR_compat_syscalls		442
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index a4dafc659647..2316e60e031a 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -360,3 +360,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 893fb4151547..efc2723ca91f 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -439,3 +439,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 54aaf0d40c64..745c0f462fce 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -445,3 +445,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index fd34dd0efed0..499f83562a8c 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -378,3 +378,4 @@
 438	n32	pidfd_getfd			sys_pidfd_getfd
 439	n32	watch_mount			sys_watch_mount
 440	n32	watch_sb			sys_watch_sb
+441	n32	fsinfo				sys_fsinfo
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index db0f4c0a0a0b..b3188bc3ab3c 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -354,3 +354,4 @@
 438	n64	pidfd_getfd			sys_pidfd_getfd
 439	n64	watch_mount			sys_watch_mount
 440	n64	watch_sb			sys_watch_sb
+441	n64	fsinfo				sys_fsinfo
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index ce2e1326de8f..1a3e8ed5e538 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -427,3 +427,4 @@
 438	o32	pidfd_getfd			sys_pidfd_getfd
 439	o32	watch_mount			sys_watch_mount
 440	o32	watch_sb			sys_watch_sb
+441	o32	fsinfo				sys_fsinfo
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 6e4a7c08b64b..2572c215d861 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -437,3 +437,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 08943f3b8206..39d7ac7e918c 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -521,3 +521,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index b3b8529d2b74..ae4cefd3dd1b 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -442,3 +442,4 @@
 438  common	pidfd_getfd		sys_pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount		sys_watch_mount			sys_watch_mount
 440	common	watch_sb		sys_watch_sb			sys_watch_sb
+441  common	fsinfo			sys_fsinfo			sys_fsinfo
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 89307a20657c..05945b9aee4b 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -442,3 +442,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 4ff841a00450..b71b34d4b45c 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -485,3 +485,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index e2731d295f88..e118ba9aca4c 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -444,3 +444,4 @@
 438	i386	pidfd_getfd		sys_pidfd_getfd			__ia32_sys_pidfd_getfd
 439	i386	watch_mount		sys_watch_mount			__ia32_sys_watch_mount
 440	i386	watch_sb		sys_watch_sb			__ia32_sys_watch_sb
+441	i386	fsinfo			sys_fsinfo			__ia32_sys_fsinfo
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index f4391176102c..067f247471d0 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -361,6 +361,7 @@
 438	common	pidfd_getfd		__x64_sys_pidfd_getfd
 439	common	watch_mount		__x64_sys_watch_mount
 440	common	watch_sb		__x64_sys_watch_sb
+441	common	fsinfo			__x64_sys_fsinfo
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 8e7d731ed6cf..e1ec25099d10 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -410,3 +410,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/fs/Kconfig b/fs/Kconfig
index fef1365c23a5..01d0d436b3cd 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -15,6 +15,13 @@ config VALIDATE_FS_PARSER
 	  Enable this to perform validation of the parameter description for a
 	  filesystem when it is registered.
 
+config FSINFO
+	bool "Enable the fsinfo() system call"
+	help
+	  Enable the file system information querying system call to allow
+	  comprehensive information to be retrieved about a filesystem,
+	  superblock or mount object.
+
 if BLOCK
 
 config FS_IOMAP
diff --git a/fs/Makefile b/fs/Makefile
index 4477757780d0..b6bf2424c7f7 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -55,6 +55,7 @@ obj-$(CONFIG_COREDUMP)		+= coredump.o
 obj-$(CONFIG_SYSCTL)		+= drop_caches.o
 
 obj-$(CONFIG_FHANDLE)		+= fhandle.o
+obj-$(CONFIG_FSINFO)		+= fsinfo.o
 obj-y				+= iomap/
 
 obj-y				+= quota/
diff --git a/fs/fsinfo.c b/fs/fsinfo.c
new file mode 100644
index 000000000000..b7b81e9d7e21
--- /dev/null
+++ b/fs/fsinfo.c
@@ -0,0 +1,582 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Filesystem information query.
+ *
+ * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+#include <linux/syscalls.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/statfs.h>
+#include <linux/security.h>
+#include <linux/uaccess.h>
+#include <linux/fsinfo.h>
+#include <uapi/linux/mount.h>
+#include "internal.h"
+
+/**
+ * fsinfo_string - Store a NUL-terminated string as an fsinfo attribute value.
+ * @s: The string to store (may be NULL)
+ * @ctx: The parameter context
+ */
+int fsinfo_string(const char *s, struct fsinfo_context *ctx)
+{
+	unsigned int len;
+	char *p = ctx->buffer;
+	int ret = 0;
+
+	if (s) {
+		len = min_t(size_t, strlen(s), ctx->buf_size - 1);
+		if (!ctx->want_size_only) {
+			memcpy(p, s, len);
+			p[len] = 0;
+		}
+		ret = len;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(fsinfo_string);
+
+/*
+ * Get basic filesystem stats from statfs.
+ */
+static int fsinfo_generic_statfs(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_statfs *p = ctx->buffer;
+	struct kstatfs buf;
+	int ret;
+
+	ret = vfs_statfs(path, &buf);
+	if (ret < 0)
+		return ret;
+
+	p->f_blocks.lo	= buf.f_blocks;
+	p->f_bfree.lo	= buf.f_bfree;
+	p->f_bavail.lo	= buf.f_bavail;
+	p->f_files.lo	= buf.f_files;
+	p->f_ffree.lo	= buf.f_ffree;
+	p->f_favail.lo	= buf.f_ffree;
+	p->f_bsize	= buf.f_bsize;
+	p->f_frsize	= buf.f_frsize;
+	return sizeof(*p);
+}
+
+static int fsinfo_generic_ids(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_ids *p = ctx->buffer;
+	struct super_block *sb;
+	struct kstatfs buf;
+	int ret;
+
+	ret = vfs_statfs(path, &buf);
+	if (ret < 0 && ret != -ENOSYS)
+		return ret;
+	if (ret == 0)
+		memcpy(&p->f_fsid, &buf.f_fsid, sizeof(p->f_fsid));
+
+	sb = path->dentry->d_sb;
+	p->f_fstype	= sb->s_magic;
+	p->f_dev_major	= MAJOR(sb->s_dev);
+	p->f_dev_minor	= MINOR(sb->s_dev);
+	p->f_sb_id	= sb->s_unique_id;
+	strlcpy(p->f_fs_name, sb->s_type->name, sizeof(p->f_fs_name));
+	return sizeof(*p);
+}
+
+int fsinfo_generic_limits(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_limits *p = ctx->buffer;
+	struct super_block *sb = path->dentry->d_sb;
+
+	p->max_file_size.hi	= 0;
+	p->max_file_size.lo	= sb->s_maxbytes;
+	p->max_ino.hi		= 0;
+	p->max_ino.lo		= UINT_MAX;
+	p->max_hard_links	= sb->s_max_links;
+	p->max_uid		= UINT_MAX;
+	p->max_gid		= UINT_MAX;
+	p->max_projid		= UINT_MAX;
+	p->max_filename_len	= NAME_MAX;
+	p->max_symlink_len	= PATH_MAX;
+	p->max_xattr_name_len	= XATTR_NAME_MAX;
+	p->max_xattr_body_len	= XATTR_SIZE_MAX;
+	p->max_dev_major	= 0xffffff;
+	p->max_dev_minor	= 0xff;
+	return sizeof(*p);
+}
+EXPORT_SYMBOL(fsinfo_generic_limits);
+
+int fsinfo_generic_supports(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_supports *p = ctx->buffer;
+	struct super_block *sb = path->dentry->d_sb;
+
+	p->stx_mask = STATX_BASIC_STATS;
+	if (sb->s_d_op && sb->s_d_op->d_automount)
+		p->stx_attributes |= STATX_ATTR_AUTOMOUNT;
+	return sizeof(*p);
+}
+EXPORT_SYMBOL(fsinfo_generic_supports);
+
+static const struct fsinfo_timestamp_info fsinfo_default_timestamp_info = {
+	.atime = {
+		.minimum	= S64_MIN,
+		.maximum	= S64_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+	.mtime = {
+		.minimum	= S64_MIN,
+		.maximum	= S64_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+	.ctime = {
+		.minimum	= S64_MIN,
+		.maximum	= S64_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+	.btime = {
+		.minimum	= S64_MIN,
+		.maximum	= S64_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+};
+
+int fsinfo_generic_timestamp_info(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_timestamp_info *p = ctx->buffer;
+	struct super_block *sb = path->dentry->d_sb;
+	s8 exponent;
+
+	*p = fsinfo_default_timestamp_info;
+
+	if (sb->s_time_gran < 1000000000) {
+		if (sb->s_time_gran < 1000)
+			exponent = -9;
+		else if (sb->s_time_gran < 1000000)
+			exponent = -6;
+		else
+			exponent = -3;
+
+		p->atime.gran_exponent = exponent;
+		p->mtime.gran_exponent = exponent;
+		p->ctime.gran_exponent = exponent;
+		p->btime.gran_exponent = exponent;
+	}
+
+	return sizeof(*p);
+}
+EXPORT_SYMBOL(fsinfo_generic_timestamp_info);
+
+static int fsinfo_generic_volume_uuid(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_volume_uuid *p = ctx->buffer;
+	struct super_block *sb = path->dentry->d_sb;
+
+	memcpy(p, &sb->s_uuid, sizeof(*p));
+	return sizeof(*p);
+}
+
+static int fsinfo_generic_volume_id(struct path *path, struct fsinfo_context *ctx)
+{
+	return fsinfo_string(path->dentry->d_sb->s_id, ctx);
+}
+
+static const struct fsinfo_attribute fsinfo_common_attributes[] = {
+	FSINFO_VSTRUCT	(FSINFO_ATTR_STATFS,		fsinfo_generic_statfs),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_IDS,		fsinfo_generic_ids),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_LIMITS,		fsinfo_generic_limits),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_SUPPORTS,		fsinfo_generic_supports),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_TIMESTAMP_INFO,	fsinfo_generic_timestamp_info),
+	FSINFO_STRING	(FSINFO_ATTR_VOLUME_ID,		fsinfo_generic_volume_id),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_VOLUME_UUID,	fsinfo_generic_volume_uuid),
+
+	FSINFO_LIST	(FSINFO_ATTR_FSINFO_ATTRIBUTES,	(void *)123UL),
+	FSINFO_VSTRUCT_N(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO, (void *)123UL),
+	{}
+};
+
+/*
+ * Determine an attribute's minimum buffer size and, if the buffer is large
+ * enough, get the attribute value.
+ */
+static int fsinfo_get_this_attribute(struct path *path,
+				     struct fsinfo_context *ctx,
+				     const struct fsinfo_attribute *attr)
+{
+	int buf_size;
+
+	if (ctx->Nth != 0 && !(attr->flags & (FSINFO_FLAGS_N | FSINFO_FLAGS_NM)))
+		return -ENODATA;
+	if (ctx->Mth != 0 && !(attr->flags & FSINFO_FLAGS_NM))
+		return -ENODATA;
+
+	switch (attr->type) {
+	case FSINFO_TYPE_VSTRUCT:
+		ctx->clear_tail = true;
+		buf_size = attr->size;
+		break;
+	case FSINFO_TYPE_STRING:
+	case FSINFO_TYPE_OPAQUE:
+	case FSINFO_TYPE_LIST:
+		buf_size = 4096;
+		break;
+	default:
+		return -ENOPKG;
+	}
+
+	if (ctx->buf_size < buf_size)
+		return buf_size;
+
+	return attr->get(path, ctx);
+}
+
+static void fsinfo_attributes_insert(struct fsinfo_context *ctx,
+				     const struct fsinfo_attribute *attr)
+{
+	__u32 *p = ctx->buffer;
+	unsigned int i;
+
+	if (ctx->usage >= ctx->buf_size ||
+	    ctx->buf_size - ctx->usage < sizeof(__u32)) {
+		ctx->usage += sizeof(__u32);
+		return;
+	}
+
+	for (i = 0; i < ctx->usage / sizeof(__u32); i++)
+		if (p[i] == attr->attr_id)
+			return;
+
+	p[i] = attr->attr_id;
+	ctx->usage += sizeof(__u32);
+}
+
+static int fsinfo_list_attributes(struct path *path,
+				  struct fsinfo_context *ctx,
+				  const struct fsinfo_attribute *attributes)
+{
+	const struct fsinfo_attribute *a;
+
+	for (a = attributes; a->get; a++)
+		fsinfo_attributes_insert(ctx, a);
+	return -EOPNOTSUPP; /* We want to go through all the lists */
+}
+
+static int fsinfo_get_attribute_info(struct path *path,
+				     struct fsinfo_context *ctx,
+				     const struct fsinfo_attribute *attributes)
+{
+	const struct fsinfo_attribute *a;
+	struct fsinfo_attribute_info *p = ctx->buffer;
+
+	if (!ctx->buf_size)
+		return sizeof(*p);
+
+	for (a = attributes; a->get; a++) {
+		if (a->attr_id == ctx->Nth) {
+			p->attr_id	= a->attr_id;
+			p->type		= a->type;
+			p->flags	= a->flags;
+			p->size		= a->size;
+			p->size		= a->size;
+			return sizeof(*p);
+		}
+	}
+	return -EOPNOTSUPP; /* We want to go through all the lists */
+}
+
+/**
+ * fsinfo_get_attribute - Look up and handle an attribute
+ * @path: The object to query
+ * @ctx: Parameters to define a request and place to store result
+ * @attributes: List of attributes to search.
+ *
+ * Look through a list of attributes for one that matches the requested
+ * attribute then call the handler for it.
+ */
+int fsinfo_get_attribute(struct path *path, struct fsinfo_context *ctx,
+			 const struct fsinfo_attribute *attributes)
+{
+	const struct fsinfo_attribute *a;
+
+	switch (ctx->requested_attr) {
+	case FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO:
+		return fsinfo_get_attribute_info(path, ctx, attributes);
+	case FSINFO_ATTR_FSINFO_ATTRIBUTES:
+		return fsinfo_list_attributes(path, ctx, attributes);
+	default:
+		for (a = attributes; a->get; a++)
+			if (a->attr_id == ctx->requested_attr)
+				return fsinfo_get_this_attribute(path, ctx, a);
+		return -EOPNOTSUPP;
+	}
+}
+EXPORT_SYMBOL(fsinfo_get_attribute);
+
+/**
+ * fsinfo_call - Handle an fsinfo attribute generically
+ * @path: The object to query
+ * @ctx: Parameters to define a request and place to store result
+ */
+static int fsinfo_call(struct path *path, struct fsinfo_context *ctx)
+{
+	int ret;
+
+	if (path->dentry->d_sb->s_op->fsinfo) {
+		ret = path->dentry->d_sb->s_op->fsinfo(path, ctx);
+		if (ret != -EOPNOTSUPP)
+			return ret;
+	}
+	ret = fsinfo_get_attribute(path, ctx, fsinfo_common_attributes);
+	if (ret != -EOPNOTSUPP)
+		return ret;
+
+	switch (ctx->requested_attr) {
+	case FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO:
+		return -ENODATA;
+	case FSINFO_ATTR_FSINFO_ATTRIBUTES:
+		return ctx->usage;
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+/**
+ * vfs_fsinfo - Retrieve filesystem information
+ * @path: The object to query
+ * @ctx: Parameters to define a request and place to store result
+ *
+ * Get an attribute on a filesystem or an object within a filesystem.  The
+ * filesystem attribute to be queried is indicated by @ctx->requested_attr, and
+ * if it's a multi-valued attribute, the particular value is selected by
+ * @ctx->Nth and then @ctx->Mth.
+ *
+ * For common attributes, a value may be fabricated if it is not supported by
+ * the filesystem.
+ *
+ * On success, the size of the attribute's value is returned (0 is a valid
+ * size).  A buffer will have been allocated and will be pointed to by
+ * @ctx->buffer.  The caller must free this with kvfree().
+ *
+ * Errors can also be returned: -ENOMEM if a buffer cannot be allocated, -EPERM
+ * or -EACCES if permission is denied by the LSM, -EOPNOTSUPP if an attribute
+ * doesn't exist for the specified object or -ENODATA if the attribute exists,
+ * but the Nth,Mth value does not exist.  -EMSGSIZE indicates that the value is
+ * unmanageable internally and -ENOPKG indicates other internal failure.
+ *
+ * Errors such as -EIO may also come from attempts to access media or servers
+ * to obtain the requested information if it's not immediately to hand.
+ *
+ * [*] Note that the caller may set @ctx->want_size_only if it only wants the
+ *     size of the value and not the data.  If this is set, a buffer may not be
+ *     allocated under some circumstances.  This is intended for size query by
+ *     userspace.
+ *
+ * [*] Note that @ctx->clear_tail will be returned set if the data should be
+ *     padded out with zeros when writing it to userspace.
+ */
+static int vfs_fsinfo(struct path *path, struct fsinfo_context *ctx)
+{
+	struct dentry *dentry = path->dentry;
+	int ret;
+
+	ret = security_sb_statfs(dentry);
+	if (ret)
+		return ret;
+
+	/* Call the handler to find out the buffer size required. */
+	ctx->buf_size = 0;
+	ret = fsinfo_call(path, ctx);
+	if (ret < 0 || ctx->want_size_only)
+		return ret;
+	ctx->buf_size = ret;
+
+	do {
+		/* Allocate a buffer of the requested size. */
+		if (ctx->buf_size > INT_MAX)
+			return -EMSGSIZE;
+		ctx->buffer = kvzalloc(ctx->buf_size, GFP_KERNEL);
+		if (!ctx->buffer)
+			return -ENOMEM;
+
+		ctx->usage = 0;
+		ctx->skip = 0;
+		ret = fsinfo_call(path, ctx);
+		if (IS_ERR_VALUE((long)ret))
+			return ret;
+		if ((unsigned int)ret <= ctx->buf_size)
+			return ret; /* It fitted */
+
+		/* We need to resize the buffer */
+		ctx->buf_size = roundup(ret, PAGE_SIZE);
+		kvfree(ctx->buffer);
+		ctx->buffer = NULL;
+	} while (!signal_pending(current));
+
+	return -ERESTARTSYS;
+}
+
+static int vfs_fsinfo_path(int dfd, const char __user *pathname,
+			   unsigned int resolve_flags, struct fsinfo_context *ctx)
+{
+	struct path path;
+	unsigned lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
+	int ret = -EINVAL;
+
+	if (resolve_flags & ~VALID_RESOLVE_FLAGS)
+		return -EINVAL;
+
+	if (resolve_flags & RESOLVE_NO_XDEV)
+		lookup_flags |= LOOKUP_NO_XDEV;
+	if (resolve_flags & RESOLVE_NO_MAGICLINKS)
+		lookup_flags |= LOOKUP_NO_MAGICLINKS;
+	if (resolve_flags & RESOLVE_NO_SYMLINKS)
+		lookup_flags |= LOOKUP_NO_SYMLINKS;
+	if (resolve_flags & RESOLVE_BENEATH)
+		lookup_flags |= LOOKUP_BENEATH;
+	if (resolve_flags & RESOLVE_IN_ROOT)
+		lookup_flags |= LOOKUP_IN_ROOT;
+	if (resolve_flags & RESOLVE_NO_TRAILING_SYMLINKS)
+		lookup_flags &= ~LOOKUP_FOLLOW;
+	if (resolve_flags & RESOLVE_NO_TRAILING_AUTOMOUNTS)
+		lookup_flags &= ~LOOKUP_AUTOMOUNT;
+	if (resolve_flags & RESOLVE_EMPTY_PATH)
+		lookup_flags |= LOOKUP_EMPTY;
+
+retry:
+	ret = user_path_at(dfd, pathname, lookup_flags, &path);
+	if (ret)
+		goto out;
+
+	ret = vfs_fsinfo(&path, ctx);
+	path_put(&path);
+	if (retry_estale(ret, lookup_flags)) {
+		lookup_flags |= LOOKUP_REVAL;
+		goto retry;
+	}
+out:
+	return ret;
+}
+
+static int vfs_fsinfo_fd(unsigned int fd, struct fsinfo_context *ctx)
+{
+	struct fd f = fdget_raw(fd);
+	int ret = -EBADF;
+
+	if (f.file) {
+		ret = vfs_fsinfo(&f.file->f_path, ctx);
+		fdput(f);
+	}
+	return ret;
+}
+
+/**
+ * sys_fsinfo - System call to get filesystem information
+ * @dfd: Base directory to pathwalk from or fd referring to filesystem.
+ * @pathname: Filesystem to query or NULL.
+ * @params: Parameters to define request (NULL: FSINFO_ATTR_STATFS).
+ * @params_size: Size of parameter buffer.
+ * @result_buffer: Result buffer.
+ * @result_buf_size: Size of result buffer.
+ *
+ * Get information on a filesystem.  The filesystem attribute to be queried is
+ * indicated by @params->request, and some of the attributes can have multiple
+ * values, indexed by @params->Nth and @params->Mth.  If @params is NULL,
+ * then instance 0 of FSINFO_ATTR_STATFS is queried.  If an attribute does
+ * not exist, EOPNOTSUPP is returned; if the Nth,Mth value does not exist,
+ * ENODATA is returned.
+ *
+ * On success, the size of the attribute's value is returned.  If
+ * @result_buf_size is 0 or @result_buffer is NULL, only the size is returned.
+ * If the size of the value is larger than @result_buf_size, it will be
+ * truncated by the copy.  If the size of the value is smaller than
+ * @result_buf_size then the excess buffer space will be cleared.  The full
+ * size of the value will be returned, irrespective of how much data is
+ * actually placed in the buffer.
+ */
+SYSCALL_DEFINE6(fsinfo,
+		int, dfd,
+		const char __user *, pathname,
+		const struct fsinfo_params __user *, params,
+		size_t, params_size,
+		void __user *, result_buffer,
+		size_t, result_buf_size)
+{
+	struct fsinfo_context ctx;
+	struct fsinfo_params user_params = {};
+	unsigned int result_size;
+	void *r;
+	int ret;
+
+	if ((!params &&  params_size) ||
+	    ( params && !params_size) ||
+	    (!result_buffer &&  result_buf_size) ||
+	    ( result_buffer && !result_buf_size))
+		return -EINVAL;
+	if (result_buf_size > UINT_MAX)
+		return -EOVERFLOW;
+
+	memset(&ctx, 0, sizeof(ctx));
+	ctx.requested_attr	= FSINFO_ATTR_STATFS;
+	ctx.flags		= FSINFO_FLAGS_QUERY_PATH;
+	ctx.want_size_only	= (result_buf_size == 0);
+
+	if (params) {
+		ret = copy_struct_from_user(&user_params, sizeof(user_params),
+					    params, params_size);
+		if (ret < 0)
+			return ret;
+		if (user_params.flags & ~FSINFO_FLAGS_QUERY_MASK)
+			return -EINVAL;
+		ctx.flags = user_params.flags;
+		ctx.requested_attr = user_params.request;
+		ctx.Nth = user_params.Nth;
+		ctx.Mth = user_params.Mth;
+	}
+
+	switch (ctx.flags & FSINFO_FLAGS_QUERY_MASK) {
+	case FSINFO_FLAGS_QUERY_PATH:
+		ret = vfs_fsinfo_path(dfd, pathname, user_params.resolve_flags, &ctx);
+		break;
+	case FSINFO_FLAGS_QUERY_FD:
+		if (pathname)
+			return -EINVAL;
+		ret = vfs_fsinfo_fd(dfd, &ctx);
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	if (ret < 0)
+		goto error;
+
+	r = ctx.buffer + ctx.skip;
+	result_size = min_t(size_t, ret, result_buf_size);
+	if (result_size > 0 &&
+	    copy_to_user(result_buffer, r, result_size) != 0) {
+		ret = -EFAULT;
+		goto error;
+	}
+
+	/* Clear any part of the buffer that we won't fill if we're putting a
+	 * struct in there.  Strings, opaque objects and arrays are expected to
+	 * be variable length.
+	 */
+	if (ctx.clear_tail &&
+	    result_buf_size > result_size &&
+	    clear_user(result_buffer + result_size,
+		       result_buf_size - result_size) != 0) {
+		ret = -EFAULT;
+		goto error;
+	}
+
+error:
+	kvfree(ctx.buffer);
+	return ret;
+}
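
A note on the copy-out semantics documented for sys_fsinfo() above: the
value is truncated to the caller's buffer, the unused tail is zeroed only
for struct-type values, and the full value size is always returned.  The
following is a minimal userspace model of that behaviour only, not the
kernel code; fsinfo_copy_model() and its self-test are hypothetical names
made up for this illustration, and plain memcpy()/memset() stand in for
copy_to_user()/clear_user().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace model of the copy-out step at the end of
 * sys_fsinfo(); the function name is illustrative only.
 *
 * value/value_size: the attribute value produced on the kernel side
 * buf/buf_size:     the caller's result buffer (may be NULL/0)
 * clear_tail:       nonzero for struct-type values (ctx.clear_tail)
 *
 * Returns the full size of the value, however much was copied.
 */
static size_t fsinfo_copy_model(const void *value, size_t value_size,
				void *buf, size_t buf_size, int clear_tail)
{
	size_t n = value_size < buf_size ? value_size : buf_size;

	if (n > 0)
		memcpy(buf, value, n);	/* truncated if buf is short */
	if (clear_tail && buf_size > n)
		memset((char *)buf + n, 0, buf_size - n); /* zero the excess */
	return value_size;
}

/* Small self-check exercising the three documented cases. */
static int fsinfo_copy_model_selftest(void)
{
	const char value[4] = { 1, 2, 3, 4 };
	char big[8], small[2];

	/* Buffer larger than the value: excess is cleared. */
	memset(big, 0xff, sizeof(big));
	assert(fsinfo_copy_model(value, 4, big, 8, 1) == 4);
	assert(big[3] == 4 && big[4] == 0 && big[7] == 0);

	/* Buffer smaller than the value: truncated, full size returned. */
	assert(fsinfo_copy_model(value, 4, small, 2, 1) == 4);
	assert(small[0] == 1 && small[1] == 2);

	/* No buffer at all: size query only. */
	assert(fsinfo_copy_model(value, 4, NULL, 0, 0) == 4);
	return 0;
}
```

Under this model a caller could compare the return value with its buffer
size to detect truncation and retry with a larger buffer.
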
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9181cfcd5265..39178f89a6ad 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -69,6 +69,7 @@ struct fsverity_info;
 struct fsverity_operations;
 struct fs_context;
 struct fs_parameter_spec;
+struct fsinfo_context;
 
 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -1964,6 +1965,9 @@ struct super_operations {
 	int (*thaw_super) (struct super_block *);
 	int (*unfreeze_fs) (struct super_block *);
 	int (*statfs) (struct dentry *, struct kstatfs *);
+#ifdef CONFIG_FSINFO
+	int (*fsinfo)(struct path *, struct fsinfo_context *);
+#endif
 	int (*remount_fs) (struct super_block *, int *, char *);
 	void (*umount_begin) (struct super_block *);
 
diff --git a/include/linux/fsinfo.h b/include/linux/fsinfo.h
new file mode 100644
index 000000000000..bf806669b4fb
--- /dev/null
+++ b/include/linux/fsinfo.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Filesystem information query
+ *
+ * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#ifndef _LINUX_FSINFO_H
+#define _LINUX_FSINFO_H
+
+#ifdef CONFIG_FSINFO
+
+#include <uapi/linux/fsinfo.h>
+
+struct path;
+
+#define FSINFO_NORMAL_ATTR_MAX_SIZE 4096
+
+struct fsinfo_context {
+	__u32		flags;		/* [in] FSINFO_FLAGS_* */
+	__u32		requested_attr;	/* [in] What is being asked for */
+	__u32		Nth;		/* [in] Instance of it (some may have multiple) */
+	__u32		Mth;		/* [in] Subinstance */
+	bool		want_size_only;	/* [in] Just want to know the size, not the data */
+	bool		clear_tail;	/* [out] T if tail of buffer should be cleared */
+	unsigned int	skip;		/* [out] Number of bytes to skip in buffer */
+	unsigned int	usage;		/* [tmp] Amount of buffer used (if large) */
+	unsigned int	buf_size;	/* [tmp] Size of ->buffer[] */
+	void		*buffer;	/* [out] The reply buffer */
+};
+
+/*
+ * A filesystem information attribute definition.
+ */
+struct fsinfo_attribute {
+	unsigned int		attr_id;	/* The ID of the attribute */
+	enum fsinfo_value_type	type:8;		/* The type of the attribute's value(s) */
+	unsigned int		flags:8;
+	unsigned int		size:16;	/* - Value size (FSINFO_TYPE_VSTRUCT/LIST) */
+	int (*get)(struct path *path, struct fsinfo_context *params);
+};
+
+#define __FSINFO(A, T, S, G, F) \
+	{ .attr_id = A, .type = T, .flags = F, .size = S, .get = G }
+
+#define _FSINFO(A, T, S, G)	__FSINFO(A, T, S, G, 0)
+#define _FSINFO_N(A, T, S, G)	__FSINFO(A, T, S, G, FSINFO_FLAGS_N)
+#define _FSINFO_NM(A, T, S, G)	__FSINFO(A, T, S, G, FSINFO_FLAGS_NM)
+
+#define _FSINFO_VSTRUCT(A,S,G)	  _FSINFO   (A, FSINFO_TYPE_VSTRUCT, sizeof(S), G)
+#define _FSINFO_VSTRUCT_N(A,S,G)  _FSINFO_N (A, FSINFO_TYPE_VSTRUCT, sizeof(S), G)
+#define _FSINFO_VSTRUCT_NM(A,S,G) _FSINFO_NM(A, FSINFO_TYPE_VSTRUCT, sizeof(S), G)
+
+#define FSINFO_VSTRUCT(A,G)	_FSINFO_VSTRUCT   (A, A##__STRUCT, G)
+#define FSINFO_VSTRUCT_N(A,G)	_FSINFO_VSTRUCT_N (A, A##__STRUCT, G)
+#define FSINFO_VSTRUCT_NM(A,G)	_FSINFO_VSTRUCT_NM(A, A##__STRUCT, G)
+#define FSINFO_STRING(A,G)	_FSINFO   (A, FSINFO_TYPE_STRING, 0, G)
+#define FSINFO_STRING_N(A,G)	_FSINFO_N (A, FSINFO_TYPE_STRING, 0, G)
+#define FSINFO_STRING_NM(A,G)	_FSINFO_NM(A, FSINFO_TYPE_STRING, 0, G)
+#define FSINFO_OPAQUE(A,G)	_FSINFO   (A, FSINFO_TYPE_OPAQUE, 0, G)
+#define FSINFO_LIST(A,G)	_FSINFO   (A, FSINFO_TYPE_LIST, sizeof(A##__STRUCT), G)
+#define FSINFO_LIST_N(A,G)	_FSINFO_N (A, FSINFO_TYPE_LIST, sizeof(A##__STRUCT), G)
+
+extern int fsinfo_string(const char *, struct fsinfo_context *);
+extern int fsinfo_generic_timestamp_info(struct path *, struct fsinfo_context *);
+extern int fsinfo_generic_supports(struct path *, struct fsinfo_context *);
+extern int fsinfo_generic_limits(struct path *, struct fsinfo_context *);
+extern int fsinfo_get_attribute(struct path *, struct fsinfo_context *,
+				const struct fsinfo_attribute *);
+
+#endif /* CONFIG_FSINFO */
+
+#endif /* _LINUX_FSINFO_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index c84440d57f52..76064c0807e5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -47,6 +47,7 @@ struct stat64;
 struct statfs;
 struct statfs64;
 struct statx;
+struct fsinfo_params;
 struct __sysctl_args;
 struct sysinfo;
 struct timespec;
@@ -1007,6 +1008,9 @@ asmlinkage long sys_watch_mount(int dfd, const char __user *path,
 				unsigned int at_flags, int watch_fd, int watch_id);
 asmlinkage long sys_watch_sb(int dfd, const char __user *path,
 			     unsigned int at_flags, int watch_fd, int watch_id);
+asmlinkage long sys_fsinfo(int dfd, const char __user *pathname,
+			   struct fsinfo_params __user *params, size_t params_size,
+			   void __user *result_buffer, size_t result_buf_size);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 5bff318b7ffa..7d764f86d3f5 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -859,9 +859,11 @@ __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
 __SYSCALL(__NR_watch_mount, sys_watch_mount)
 #define __NR_watch_sb 440
 __SYSCALL(__NR_watch_sb, sys_watch_sb)
+#define __NR_fsinfo 441
+__SYSCALL(__NR_fsinfo, sys_fsinfo)
 
 #undef __NR_syscalls
-#define __NR_syscalls 441
+#define __NR_syscalls 442
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
new file mode 100644
index 000000000000..b56ebd525b03
--- /dev/null
+++ b/include/uapi/linux/fsinfo.h
@@ -0,0 +1,186 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* fsinfo() definitions.
+ *
+ * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+#ifndef _UAPI_LINUX_FSINFO_H
+#define _UAPI_LINUX_FSINFO_H
+
+#include <linux/types.h>
+#include <linux/socket.h>
+#include <linux/openat2.h>
+
+/*
+ * The filesystem attributes that can be requested.  Note that some attributes
+ * may have multiple instances which can be switched in the parameter block.
+ */
+#define FSINFO_ATTR_STATFS		0x00	/* statfs()-style state */
+#define FSINFO_ATTR_IDS			0x01	/* Filesystem IDs */
+#define FSINFO_ATTR_LIMITS		0x02	/* Filesystem limits */
+#define FSINFO_ATTR_SUPPORTS		0x03	/* What's supported in statx, iocflags, ... */
+#define FSINFO_ATTR_TIMESTAMP_INFO	0x04	/* Inode timestamp info */
+#define FSINFO_ATTR_VOLUME_ID		0x05	/* Volume ID (string) */
+#define FSINFO_ATTR_VOLUME_UUID		0x06	/* Volume UUID (LE uuid) */
+#define FSINFO_ATTR_VOLUME_NAME		0x07	/* Volume name (string) */
+
+#define FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO 0x100	/* Information about attr N (for path) */
+#define FSINFO_ATTR_FSINFO_ATTRIBUTES	0x101	/* List of supported attrs (for path) */
+
+/*
+ * Optional fsinfo() parameter structure.
+ *
+ * If this is not given, it is assumed that FSINFO_ATTR_STATFS instance 0,0 is
+ * desired.
+ */
+struct fsinfo_params {
+	__u32	flags;		/* Flags controlling fsinfo() specifically */
+#define FSINFO_FLAGS_QUERY_MASK	0x0007 /* What object should fsinfo() query? */
+#define FSINFO_FLAGS_QUERY_PATH	0x0000 /* - path, specified by dirfd,pathname,RESOLVE_EMPTY_PATH */
+#define FSINFO_FLAGS_QUERY_FD	0x0001 /* - fd specified by dirfd */
+	__u32	resolve_flags;	/* RESOLVE_* flags */
+	__u32	request;	/* ID of requested attribute */
+	__u32	Nth;		/* Instance of it (some may have multiple) */
+	__u32	Mth;		/* Subinstance of Nth instance */
+};
+
+enum fsinfo_value_type {
+	FSINFO_TYPE_VSTRUCT	= 0,	/* Version-lengthed struct (up to 4096 bytes) */
+	FSINFO_TYPE_STRING	= 1,	/* NUL-term var-length string (up to 4095 chars) */
+	FSINFO_TYPE_OPAQUE	= 2,	/* Opaque blob (unlimited size) */
+	FSINFO_TYPE_LIST	= 3,	/* List of ints/structs (unlimited size) */
+};
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO).
+ *
+ * This gives information about the attributes supported by fsinfo for the
+ * given path.
+ */
+struct fsinfo_attribute_info {
+	unsigned int		attr_id;	/* The ID of the attribute */
+	enum fsinfo_value_type	type;		/* The type of the attribute's value(s) */
+	unsigned int		flags;
+#define FSINFO_FLAGS_N		0x01		/* - Attr has a set of values */
+#define FSINFO_FLAGS_NM		0x02		/* - Attr has a set of sets of values */
+	unsigned int		size;		/* - Value size (FSINFO_TYPE_VSTRUCT/FSINFO_TYPE_LIST) */
+};
+
+#define FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO__STRUCT struct fsinfo_attribute_info
+#define FSINFO_ATTR_FSINFO_ATTRIBUTES__STRUCT __u32
+
+struct fsinfo_u128 {
+#if defined(__BYTE_ORDER) ? __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
+	__u64	hi;
+	__u64	lo;
+#elif defined(__BYTE_ORDER) ? __BYTE_ORDER == __LITTLE_ENDIAN : defined(__LITTLE_ENDIAN)
+	__u64	lo;
+	__u64	hi;
+#endif
+};
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_STATFS).
+ * - This gives extended filesystem information.
+ */
+struct fsinfo_statfs {
+	struct fsinfo_u128 f_blocks;	/* Total number of blocks in fs */
+	struct fsinfo_u128 f_bfree;	/* Total number of free blocks */
+	struct fsinfo_u128 f_bavail;	/* Number of free blocks available to ordinary user */
+	struct fsinfo_u128 f_files;	/* Total number of file nodes in fs */
+	struct fsinfo_u128 f_ffree;	/* Number of free file nodes */
+	struct fsinfo_u128 f_favail;	/* Number of file nodes available to ordinary user */
+	__u64	f_bsize;		/* Optimal block size */
+	__u64	f_frsize;		/* Fragment size */
+};
+
+#define FSINFO_ATTR_STATFS__STRUCT struct fsinfo_statfs
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_IDS).
+ *
+ * List of basic identifiers as are normally found in statfs().
+ */
+struct fsinfo_ids {
+	char	f_fs_name[15 + 1];	/* Filesystem name */
+	__u64	f_fsid;			/* Short 64-bit Filesystem ID (as statfs) */
+	__u64	f_sb_id;		/* Internal superblock ID for sbnotify()/mntnotify() */
+	__u32	f_fstype;		/* Filesystem type from linux/magic.h [uncond] */
+	__u32	f_dev_major;		/* As st_dev_* from struct statx [uncond] */
+	__u32	f_dev_minor;
+	__u32	__padding[1];
+};
+
+#define FSINFO_ATTR_IDS__STRUCT struct fsinfo_ids
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_LIMITS).
+ *
+ * List of supported filesystem limits.
+ */
+struct fsinfo_limits {
+	struct fsinfo_u128 max_file_size;	/* Maximum file size */
+	struct fsinfo_u128 max_ino;		/* Maximum inode number */
+	__u64	max_uid;			/* Maximum UID supported */
+	__u64	max_gid;			/* Maximum GID supported */
+	__u64	max_projid;			/* Maximum project ID supported */
+	__u64	max_hard_links;			/* Maximum number of hard links on a file */
+	__u64	max_xattr_body_len;		/* Maximum xattr content length */
+	__u32	max_xattr_name_len;		/* Maximum xattr name length */
+	__u32	max_filename_len;		/* Maximum filename length */
+	__u32	max_symlink_len;		/* Maximum symlink content length */
+	__u32	max_dev_major;			/* Maximum device major representable */
+	__u32	max_dev_minor;			/* Maximum device minor representable */
+	__u32	__padding[1];
+};
+
+#define FSINFO_ATTR_LIMITS__STRUCT struct fsinfo_limits
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_SUPPORTS).
+ *
+ * What's supported in various masks, such as statx() attribute and mask bits
+ * and IOC flags.
+ */
+struct fsinfo_supports {
+	__u64	stx_attributes;		/* What statx::stx_attributes are supported */
+	__u32	stx_mask;		/* What statx::stx_mask bits are supported */
+	__u32	fs_ioc_getflags;	/* What FS_IOC_GETFLAGS may return */
+	__u32	fs_ioc_setflags_set;	/* What FS_IOC_SETFLAGS may set */
+	__u32	fs_ioc_setflags_clear;	/* What FS_IOC_SETFLAGS may clear */
+	__u32	win_file_attrs;		/* What DOS/Windows FILE_* attributes are supported */
+	__u32	__padding[1];
+};
+
+#define FSINFO_ATTR_SUPPORTS__STRUCT struct fsinfo_supports
+
+struct fsinfo_timestamp_one {
+	__s64	minimum;	/* Minimum timestamp value in seconds */
+	__s64	maximum;	/* Maximum timestamp value in seconds */
+	__u16	gran_mantissa;	/* Granularity(secs) = mant * 10^exp */
+	__s8	gran_exponent;
+	__u8	__padding[5];
+};
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_TIMESTAMP_INFO).
+ */
+struct fsinfo_timestamp_info {
+	struct fsinfo_timestamp_one	atime;	/* Access time */
+	struct fsinfo_timestamp_one	mtime;	/* Modification time */
+	struct fsinfo_timestamp_one	ctime;	/* Change time */
+	struct fsinfo_timestamp_one	btime;	/* Birth/creation time */
+};
+
+#define FSINFO_ATTR_TIMESTAMP_INFO__STRUCT struct fsinfo_timestamp_info
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_VOLUME_UUID).
+ */
+struct fsinfo_volume_uuid {
+	__u8	uuid[16];
+};
+
+#define FSINFO_ATTR_VOLUME_UUID__STRUCT struct fsinfo_volume_uuid
+
+#endif /* _UAPI_LINUX_FSINFO_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 0ce01f86e5db..519317f3904c 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -51,6 +51,7 @@ COND_SYSCALL_COMPAT(io_pgetevents);
 COND_SYSCALL(io_uring_setup);
 COND_SYSCALL(io_uring_enter);
 COND_SYSCALL(io_uring_register);
+COND_SYSCALL(fsinfo);
 
 /* fs/xattr.c */
 
diff --git a/samples/vfs/Makefile b/samples/vfs/Makefile
index 65acdde5c117..9159ad1d7fc5 100644
--- a/samples/vfs/Makefile
+++ b/samples/vfs/Makefile
@@ -1,10 +1,15 @@
 # SPDX-License-Identifier: GPL-2.0-only
 # List of programs to build
+
 hostprogs := \
+	test-fsinfo \
 	test-fsmount \
 	test-statx
 
 always-y := $(hostprogs)
 
+HOSTCFLAGS_test-fsinfo.o += -I$(objtree)/usr/include
+HOSTLDLIBS_test-fsinfo += -static -lm
+
 HOSTCFLAGS_test-fsmount.o += -I$(objtree)/usr/include
 HOSTCFLAGS_test-statx.o += -I$(objtree)/usr/include
diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
new file mode 100644
index 000000000000..67aebf9fc9d8
--- /dev/null
+++ b/samples/vfs/test-fsinfo.c
@@ -0,0 +1,633 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Test the fsinfo() system call
+ *
+ * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#define _GNU_SOURCE
+#define _ATFILE_SOURCE
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <unistd.h>
+#include <ctype.h>
+#include <errno.h>
+#include <time.h>
+#include <math.h>
+#include <fcntl.h>
+#include <sys/syscall.h>
+#include <linux/fsinfo.h>
+#include <linux/socket.h>
+#include <sys/stat.h>
+#include <arpa/inet.h>
+
+#ifndef __NR_fsinfo
+#define __NR_fsinfo -1
+#endif
+
+static bool debug = 0;
+static bool list_last;
+
+static __attribute__((unused))
+ssize_t fsinfo(int dfd, const char *filename,
+	       struct fsinfo_params *params, size_t params_size,
+	       void *result_buffer, size_t result_buf_size)
+{
+	return syscall(__NR_fsinfo, dfd, filename,
+		       params, params_size,
+		       result_buffer, result_buf_size);
+}
+
+struct fsinfo_attribute {
+	unsigned int		attr_id;
+	enum fsinfo_value_type	type;
+	unsigned int		size;
+	const char		*name;
+	void (*dump)(void *reply, unsigned int size);
+};
+
+static const struct fsinfo_attribute fsinfo_attributes[];
+
+static ssize_t get_fsinfo(const char *, const char *, struct fsinfo_params *, void **);
+
+static void dump_hex(unsigned int *data, int from, int to)
+{
+	unsigned offset, print_offset = 1, col = 0;
+
+	from /= 4;
+	to = (to + 3) / 4;
+
+	for (offset = from; offset < to; offset++) {
+		if (print_offset) {
+			printf("%04x: ", offset * 4);
+			print_offset = 0;
+		}
+		printf("%08x", data[offset]);
+		col++;
+		if ((col & 3) == 0) {
+			printf("\n");
+			print_offset = 1;
+		} else {
+			printf(" ");
+		}
+	}
+
+	if (!print_offset)
+		printf("\n");
+}
+
+static void dump_attribute_info(void *reply, unsigned int size)
+{
+	struct fsinfo_attribute_info *attr_info = reply;
+	const struct fsinfo_attribute *attr;
+	char type[32], val_size[32];
+
+	switch (attr_info->type) {
+	case FSINFO_TYPE_VSTRUCT:	strcpy(type, "V-STRUCT");	break;
+	case FSINFO_TYPE_STRING:	strcpy(type, "STRING");		break;
+	case FSINFO_TYPE_OPAQUE:	strcpy(type, "OPAQUE");		break;
+	case FSINFO_TYPE_LIST:		strcpy(type, "LIST");		break;
+	default:
+		sprintf(type, "type-%x", attr_info->type);
+		break;
+	}
+
+	if (attr_info->flags & FSINFO_FLAGS_N)
+		strcat(type, " x N");
+	else if (attr_info->flags & FSINFO_FLAGS_NM)
+		strcat(type, " x NM");
+
+	for (attr = fsinfo_attributes; attr->name; attr++)
+		if (attr->attr_id == attr_info->attr_id)
+			break;
+
+	if (attr_info->size)
+		sprintf(val_size, "%u", attr_info->size);
+	else
+		strcpy(val_size, "-");
+
+	printf("%8x %-12s %08x %5s %s\n",
+	       attr_info->attr_id,
+	       type,
+	       attr_info->flags,
+	       val_size,
+	       attr->name ? attr->name : "");
+}
+
+static void dump_fsinfo_generic_statfs(void *reply, unsigned int size)
+{
+	struct fsinfo_statfs *f = reply;
+
+	printf("\n");
+	printf("\tblocks       : n=%llu fr=%llu av=%llu\n",
+	       (unsigned long long)f->f_blocks.lo,
+	       (unsigned long long)f->f_bfree.lo,
+	       (unsigned long long)f->f_bavail.lo);
+
+	printf("\tfiles        : n=%llu fr=%llu av=%llu\n",
+	       (unsigned long long)f->f_files.lo,
+	       (unsigned long long)f->f_ffree.lo,
+	       (unsigned long long)f->f_favail.lo);
+	printf("\tbsize        : %llu\n", f->f_bsize);
+	printf("\tfrsize       : %llu\n", f->f_frsize);
+}
+
+static void dump_fsinfo_generic_ids(void *reply, unsigned int size)
+{
+	struct fsinfo_ids *f = reply;
+
+	printf("\n");
+	printf("\tdev          : %02x:%02x\n", f->f_dev_major, f->f_dev_minor);
+	printf("\tfs           : type=%x name=%s\n", f->f_fstype, f->f_fs_name);
+	printf("\tfsid         : %llx\n", (unsigned long long)f->f_fsid);
+	printf("\tsbid         : %llx\n", (unsigned long long)f->f_sb_id);
+}
+
+static void dump_fsinfo_generic_limits(void *reply, unsigned int size)
+{
+	struct fsinfo_limits *f = reply;
+
+	printf("\n");
+	printf("\tmax file size: %llx%016llx\n",
+	       (unsigned long long)f->max_file_size.hi,
+	       (unsigned long long)f->max_file_size.lo);
+	printf("\tmax ino      : %llx%016llx\n",
+	       (unsigned long long)f->max_ino.hi,
+	       (unsigned long long)f->max_ino.lo);
+	printf("\tmax ids      : u=%llx g=%llx p=%llx\n",
+	       (unsigned long long)f->max_uid,
+	       (unsigned long long)f->max_gid,
+	       (unsigned long long)f->max_projid);
+	printf("\tmax dev      : maj=%x min=%x\n",
+	       f->max_dev_major, f->max_dev_minor);
+	printf("\tmax links    : %llx\n",
+	       (unsigned long long)f->max_hard_links);
+	printf("\tmax xattr    : n=%x b=%llx\n",
+	       f->max_xattr_name_len,
+	       (unsigned long long)f->max_xattr_body_len);
+	printf("\tmax len      : file=%x sym=%x\n",
+	       f->max_filename_len, f->max_symlink_len);
+}
+
+static void dump_fsinfo_generic_supports(void *reply, unsigned int size)
+{
+	struct fsinfo_supports *f = reply;
+
+	printf("\n");
+	printf("\tstx_attr     : %llx\n", (unsigned long long)f->stx_attributes);
+	printf("\tstx_mask     : %x\n", f->stx_mask);
+	printf("\tfs_ioc_*flags: get=%x set=%x clr=%x\n",
+	       f->fs_ioc_getflags, f->fs_ioc_setflags_set, f->fs_ioc_setflags_clear);
+	printf("\twin_fattrs   : %x\n", f->win_file_attrs);
+}
+
+static void print_time(struct fsinfo_timestamp_one *t, char stamp)
+{
+	printf("\t%ctime       : gran=%gs range=%llx-%llx\n",
+	       stamp,
+	       t->gran_mantissa * pow(10., t->gran_exponent),
+	       (long long)t->minimum,
+	       (long long)t->maximum);
+}
+
+static void dump_fsinfo_generic_timestamp_info(void *reply, unsigned int size)
+{
+	struct fsinfo_timestamp_info *f = reply;
+
+	printf("\n");
+	print_time(&f->atime, 'a');
+	print_time(&f->mtime, 'm');
+	print_time(&f->ctime, 'c');
+	print_time(&f->btime, 'b');
+}
+
+static void dump_fsinfo_generic_volume_uuid(void *reply, unsigned int size)
+{
+	struct fsinfo_volume_uuid *f = reply;
+
+	printf("%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x"
+	       "-%02x%02x%02x%02x%02x%02x\n",
+	       f->uuid[ 0], f->uuid[ 1],
+	       f->uuid[ 2], f->uuid[ 3],
+	       f->uuid[ 4], f->uuid[ 5],
+	       f->uuid[ 6], f->uuid[ 7],
+	       f->uuid[ 8], f->uuid[ 9],
+	       f->uuid[10], f->uuid[11],
+	       f->uuid[12], f->uuid[13],
+	       f->uuid[14], f->uuid[15]);
+}
+
+static void dump_string(void *reply, unsigned int size)
+{
+	char *s = reply, *p;
+	bool nl = false, last_nl = false;
+
+	p = s;
+	if (size >= 4096) {
+		size = 4096;
+		p[4092] = '.';
+		p[4093] = '.';
+		p[4094] = '.';
+		p[4095] = 0;
+	} else {
+		p[size] = 0;
+	}
+
+	for (p = s; *p; p++) {
+		if (*p == '\n') {
+			last_nl = nl = true;
+			continue;
+		}
+		last_nl = false;
+		if (!isprint(*p) && *p != '\t')
+			*p = '?';
+	}
+
+	if (nl)
+		putchar('\n');
+	printf("%s", s);
+	if (!last_nl)
+		putchar('\n');
+}
+
+#define dump_fsinfo_meta_attribute_info		(void *)0x123
+#define dump_fsinfo_meta_attributes		(void *)0x123
+
+/*
+ * Build the table of attributes that this program knows how to dump.
+ */
+#define __FSINFO(A, T, S, G, F, N)					\
+	{ .attr_id = A, .type = T, .size = S, .name = N, .dump = dump_##G }
+
+#define _FSINFO(A,T,S,G,N)	__FSINFO(A, T, S, G, 0, N)
+#define _FSINFO_N(A,T,S,G,N)	__FSINFO(A, T, S, G, FSINFO_FLAGS_N, N)
+#define _FSINFO_NM(A,T,S,G,N)	__FSINFO(A, T, S, G, FSINFO_FLAGS_NM, N)
+
+#define _FSINFO_VSTRUCT(A,S,G,N)    _FSINFO   (A, FSINFO_TYPE_VSTRUCT, sizeof(S), G, N)
+#define _FSINFO_VSTRUCT_N(A,S,G,N)  _FSINFO_N (A, FSINFO_TYPE_VSTRUCT, sizeof(S), G, N)
+#define _FSINFO_VSTRUCT_NM(A,S,G,N) _FSINFO_NM(A, FSINFO_TYPE_VSTRUCT, sizeof(S), G, N)
+
+#define FSINFO_VSTRUCT(A,G)	_FSINFO_VSTRUCT   (A, A##__STRUCT, G, #A)
+#define FSINFO_VSTRUCT_N(A,G)	_FSINFO_VSTRUCT_N (A, A##__STRUCT, G, #A)
+#define FSINFO_VSTRUCT_NM(A,G)	_FSINFO_VSTRUCT_NM(A, A##__STRUCT, G, #A)
+#define FSINFO_STRING(A,G)	_FSINFO   (A, FSINFO_TYPE_STRING, 0, G, #A)
+#define FSINFO_STRING_N(A,G)	_FSINFO_N (A, FSINFO_TYPE_STRING, 0, G, #A)
+#define FSINFO_STRING_NM(A,G)	_FSINFO_NM(A, FSINFO_TYPE_STRING, 0, G, #A)
+#define FSINFO_OPAQUE(A,G)	_FSINFO   (A, FSINFO_TYPE_OPAQUE, 0, G, #A)
+#define FSINFO_LIST(A,G)	_FSINFO   (A, FSINFO_TYPE_LIST, sizeof(A##__STRUCT), G, #A)
+#define FSINFO_LIST_N(A,G)	_FSINFO_N (A, FSINFO_TYPE_LIST, sizeof(A##__STRUCT), G, #A)
+
+static const struct fsinfo_attribute fsinfo_attributes[] = {
+	FSINFO_VSTRUCT	(FSINFO_ATTR_STATFS,		fsinfo_generic_statfs),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_IDS,		fsinfo_generic_ids),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_LIMITS,		fsinfo_generic_limits),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_SUPPORTS,		fsinfo_generic_supports),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_TIMESTAMP_INFO,	fsinfo_generic_timestamp_info),
+	FSINFO_STRING	(FSINFO_ATTR_VOLUME_ID,		string),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_VOLUME_UUID,	fsinfo_generic_volume_uuid),
+	FSINFO_STRING	(FSINFO_ATTR_VOLUME_NAME,	string),
+	FSINFO_VSTRUCT_N(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO, fsinfo_meta_attribute_info),
+	FSINFO_LIST	(FSINFO_ATTR_FSINFO_ATTRIBUTES,	fsinfo_meta_attributes),
+	{}
+};
+
+static void dump_value(unsigned int attr_id,
+		       const struct fsinfo_attribute *attr,
+		       const struct fsinfo_attribute_info *attr_info,
+		       void *reply, unsigned int size)
+{
+	if (!attr || !attr->dump) {
+		printf("<no dumper>\n");
+		return;
+	}
+
+	if (attr->type == FSINFO_TYPE_VSTRUCT && size < attr->size) {
+		printf("<short data %u/%u>\n", size, attr->size);
+		return;
+	}
+
+	attr->dump(reply, size);
+}
+
+static void dump_list(unsigned int attr_id,
+		      const struct fsinfo_attribute *attr,
+		      const struct fsinfo_attribute_info *attr_info,
+		      void *reply, unsigned int size)
+{
+	size_t elem_size = attr_info->size;
+	unsigned int ix = 0;
+
+	printf("\n");
+	if (!attr || !attr->dump) {
+		printf("<no dumper>\n");
+		return;
+	}
+
+	if (attr->type == FSINFO_TYPE_VSTRUCT && size < attr->size) {
+		printf("<short data %u/%u>\n", size, attr->size);
+		return;
+	}
+
+	list_last = false;
+	while (size >= elem_size) {
+		printf("\t[%02x] ", ix);
+		if (size == elem_size)
+			list_last = true;
+		attr->dump(reply, size);
+		reply += elem_size;
+		size -= elem_size;
+		ix++;
+	}
+}
+
+/*
+ * Call fsinfo, expanding the buffer as necessary.
+ */
+static ssize_t get_fsinfo(const char *file, const char *name,
+			  struct fsinfo_params *params, void **_r)
+{
+	ssize_t ret;
+	size_t buf_size = 4096;
+	void *r;
+
+	for (;;) {
+		r = malloc(buf_size);
+		if (!r) {
+			perror("malloc");
+			exit(1);
+		}
+		memset(r, 0xbd, buf_size);
+
+		errno = 0;
+		ret = fsinfo(AT_FDCWD, file, params, sizeof(*params), r, buf_size - 1);
+		if (ret == -1)
+			goto error;
+
+		if (ret <= buf_size - 1)
+			break;
+		buf_size = (ret + 4096 - 1) & ~(4096 - 1);
+	}
+
+	if (debug)
+		printf("fsinfo(%s,%s,%u,%u) = %zd\n",
+		       file, name, params->Nth, params->Mth, ret);
+
+	((char *)r)[ret] = 0;
+	*_r = r;
+	return ret;
+
+error:
+	*_r = NULL;
+	free(r);
+	if (debug)
+		printf("fsinfo(%s,%s,%u,%u) = %m\n",
+		       file, name, params->Nth, params->Mth);
+	return ret;
+}
+
+/*
+ * Try one subinstance of an attribute.
+ */
+static int try_one(const char *file, struct fsinfo_params *params,
+		   const struct fsinfo_attribute_info *attr_info, bool raw)
+{
+	const struct fsinfo_attribute *attr;
+	const char *name;
+	size_t size = 4096;
+	char namebuf[32];
+	void *r;
+
+	for (attr = fsinfo_attributes; attr->name; attr++) {
+		if (attr->attr_id == params->request) {
+			name = attr->name;
+			if (strncmp(name, "fsinfo_generic_", 15) == 0)
+				name += 15;
+			goto found;
+		}
+	}
+
+	sprintf(namebuf, "<unknown-%x>", params->request);
+	name = namebuf;
+	attr = NULL;
+
+found:
+	size = get_fsinfo(file, name, params, &r);
+
+	if (size == -1) {
+		if (errno == ENODATA) {
+			if (!(attr_info->flags & (FSINFO_FLAGS_N | FSINFO_FLAGS_NM)) &&
+			    params->Nth == 0 && params->Mth == 0) {
+				fprintf(stderr,
+					"Unexpected ENODATA (0x%x{%u}{%u})\n",
+					params->request, params->Nth, params->Mth);
+				exit(1);
+			}
+			free(r);
+			return (params->Mth == 0) ? 2 : 1;
+		}
+		if (errno == EOPNOTSUPP) {
+			if (params->Nth > 0 || params->Mth > 0) {
+				fprintf(stderr,
+					"Should return -ENODATA (0x%x{%u}{%u})\n",
+					params->request, params->Nth, params->Mth);
+				exit(1);
+			}
+			//printf("\e[33m%s\e[m: <not supported>\n",
+			//       fsinfo_attr_names[attr]);
+			free(r);
+			return 2;
+		}
+		perror(file);
+		exit(1);
+	}
+
+	if (raw) {
+		if (size > 4096)
+			size = 4096;
+		dump_hex(r, 0, size);
+		free(r);
+		return 0;
+	}
+
+	switch (attr_info->flags & (FSINFO_FLAGS_N | FSINFO_FLAGS_NM)) {
+	case 0:
+		printf("\e[33m%s\e[m: ", name);
+		break;
+	case FSINFO_FLAGS_N:
+		printf("\e[33m%s{%u}\e[m: ", name, params->Nth);
+		break;
+	case FSINFO_FLAGS_NM:
+		printf("\e[33m%s{%u,%u}\e[m: ", name, params->Nth, params->Mth);
+		break;
+	}
+
+	switch (attr_info->type) {
+	case FSINFO_TYPE_VSTRUCT:
+	case FSINFO_TYPE_STRING:
+		dump_value(params->request, attr, attr_info, r, size);
+		free(r);
+		return 0;
+
+	case FSINFO_TYPE_LIST:
+		dump_list(params->request, attr, attr_info, r, size);
+		free(r);
+		return 0;
+
+	case FSINFO_TYPE_OPAQUE:
+		free(r);
+		return 0;
+
+	default:
+		fprintf(stderr, "Fishy about %u 0x%x,%x,%x\n",
+			params->request, attr_info->type, attr_info->flags, attr_info->size);
+		exit(1);
+	}
+}
+
+static int cmp_u32(const void *a, const void *b)
+{
+	return *(const int *)a - *(const int *)b;
+}
+
+/*
+ * List the attributes supported for a path and dump each one (or, with -M,
+ * just describe them).
+ */
+int main(int argc, char **argv)
+{
+	struct fsinfo_attribute_info attr_info;
+	struct fsinfo_params params = {
+		.resolve_flags	= RESOLVE_NO_TRAILING_SYMLINKS,
+		.flags		= FSINFO_FLAGS_QUERY_PATH,
+	};
+	unsigned int *attrs, ret, nr, i;
+	bool meta = false;
+	int raw = 0, opt, Nth, Mth;
+
+	while ((opt = getopt(argc, argv, "Madlr"))) {
+		switch (opt) {
+		case 'M':
+			meta = true;
+			continue;
+		case 'a':
+			params.resolve_flags |= RESOLVE_NO_TRAILING_AUTOMOUNTS;
+			params.flags = FSINFO_FLAGS_QUERY_PATH;
+			continue;
+		case 'd':
+			debug = true;
+			continue;
+		case 'l':
+			params.resolve_flags &= ~RESOLVE_NO_TRAILING_SYMLINKS;
+			params.flags = FSINFO_FLAGS_QUERY_PATH;
+			continue;
+		case 'r':
+			raw = 1;
+			continue;
+		}
+		break;
+	}
+
+	argc -= optind;
+	argv += optind;
+
+	if (argc != 1) {
+		printf("Format: test-fsinfo [-Madlr] <path>\n");
+		exit(2);
+	}
+
+	/* Retrieve a list of supported attribute IDs */
+	params.request = FSINFO_ATTR_FSINFO_ATTRIBUTES;
+	params.Nth = 0;
+	params.Mth = 0;
+	ret = get_fsinfo(argv[0], "attributes", &params, (void **)&attrs);
+	if (ret == -1) {
+		fprintf(stderr, "Unable to get attribute list: %m\n");
+		exit(1);
+	}
+
+	if (ret % sizeof(attrs[0])) {
+		fprintf(stderr, "Bad length of attribute list (0x%x)\n", ret);
+		exit(2);
+	}
+
+	nr = ret / sizeof(attrs[0]);
+	qsort(attrs, nr, sizeof(attrs[0]), cmp_u32);
+
+	if (meta) {
+		printf("ATTR ID  TYPE         FLAGS    SIZE  NAME\n");
+		printf("======== ============ ======== ===== =========\n");
+		for (i = 0; i < nr; i++) {
+			params.request = FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO;
+			params.Nth = attrs[i];
+			params.Mth = 0;
+			ret = fsinfo(AT_FDCWD, argv[0],
+				     &params, sizeof(params),
+				     &attr_info, sizeof(attr_info));
+			if (ret == -1) {
+				fprintf(stderr, "Can't get info for attribute %x: %m\n", attrs[i]);
+				exit(1);
+			}
+
+			dump_attribute_info(&attr_info, ret);
+		}
+		exit(0);
+	}
+
+	for (i = 0; i < nr; i++) {
+		params.request = FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO;
+		params.Nth = attrs[i];
+		params.Mth = 0;
+		ret = fsinfo(AT_FDCWD, argv[0],
+			     &params, sizeof(params),
+			     &attr_info, sizeof(attr_info));
+		if (ret == -1) {
+			fprintf(stderr, "Can't get info for attribute %x: %m\n", attrs[i]);
+			exit(1);
+		}
+
+		if (attrs[i] == FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO ||
+		    attrs[i] == FSINFO_ATTR_FSINFO_ATTRIBUTES)
+			continue;
+
+		if (attrs[i] != attr_info.attr_id) {
+			fprintf(stderr, "ID for %03x returned %03x\n",
+				attrs[i], attr_info.attr_id);
+			break;
+		}
+		Nth = 0;
+		do {
+			Mth = 0;
+			do {
+				params.request = attrs[i];
+				params.Nth = Nth;
+				params.Mth = Mth;
+
+				switch (try_one(argv[0], &params, &attr_info, raw)) {
+				case 0:
+					continue;
+				case 1:
+					goto done_M;
+				case 2:
+					goto done_N;
+				}
+			} while (++Mth < 100);
+
+		done_M:
+			if (Mth >= 100) {
+				fprintf(stderr, "Fishy: Mth %x[%u][%u]\n", attrs[i], Nth, Mth);
+				break;
+			}
+
+		} while (++Nth < 100);
+
+	done_N:
+		if (Nth >= 100) {
+			fprintf(stderr, "Fishy: Nth %x[%u]\n", attrs[i], Nth);
+			break;
+		}
+	}
+
+	return 0;
+}




* [PATCH 03/14] fsinfo: Provide a bitmap of supported features [ver #18]
  2020-03-09 14:00 [PATCH 00/14] VFS: Filesystem information [ver #18] David Howells
  2020-03-09 14:00 ` [PATCH 01/14] VFS: Add additional RESOLVE_* flags " David Howells
  2020-03-09 14:01 ` [PATCH 02/14] fsinfo: Add fsinfo() syscall to query filesystem information " David Howells
@ 2020-03-09 14:01 ` David Howells
  2020-03-09 14:01 ` [PATCH 04/14] fsinfo: Allow retrieval of superblock devname, options and stats " David Howells
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2020-03-09 14:01 UTC (permalink / raw)
  To: torvalds, viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong, kzak,
	jlayton, linux-api, linux-fsdevel, linux-security-module,
	linux-kernel

Provide a bitmap of features that a filesystem may provide for the path
being queried.  Features include such things as:

 (1) The general class of filesystem, such as kernel-interface,
     block-based, flash-based, network-based.

 (2) Supported inode features, such as which timestamps are supported,
     whether simple numeric user, group or project IDs are supported and
     whether user identification is actually more complex behind the
     scenes.

 (3) Supported volume features, such as whether it has a UUID, a name or a
     filesystem ID.

 (4) Supported filesystem features, such as which file types are supported
     and whether sparse files, extended attributes and quotas are
     supported.

 (5) Supported interface features, such as whether locking and leases are
     supported, what open flags are honoured and how i_version is managed.

For some filesystems, this may be an immutable set and can just be memcpy'd
into the reply buffer.

Signed-off-by: David Howells <dhowells@redhat.com>
---
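As a rough illustration of the bitmap layout described above, here is a
minimal user-space sketch.  It is not part of the patch: the demo_* names
are hypothetical stand-ins for the UAPI definitions added below, only three
feature numbers are copied over, and the static table merely imitates the
"immutable set that can be memcpy'd" case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for enum fsinfo_feature / struct fsinfo_features.
 * The three feature numbers match the patch; DEMO_FEAT__NR mirrors
 * FSINFO_FEAT__NR. */
enum demo_feature {
	DEMO_FEAT_IS_BLOCK_FS	= 1,
	DEMO_FEAT_XATTRS	= 19,
	DEMO_FEAT_HAS_MTIME	= 47,
	DEMO_FEAT__NR		= 50,
};

struct demo_features {
	uint32_t nr_features;
	uint8_t  features[(DEMO_FEAT__NR + 7) / 8];
};

/* Same indexing as fsinfo_set_feature(): bit (n % 8) of byte (n / 8). */
static inline void demo_set_feature(struct demo_features *p, unsigned int n)
{
	assert(n < DEMO_FEAT__NR);
	p->features[n / 8] |= 1u << (n % 8);
}

/* Test a bit; anything beyond what the reply claims to cover (or beyond
 * what this program was built with) is treated as "not supported". */
static inline bool demo_has_feature(const struct demo_features *p,
				    unsigned int n)
{
	if (n >= p->nr_features || n >= DEMO_FEAT__NR)
		return false;
	return p->features[n / 8] & (1u << (n % 8));
}

/* An assumed fixed per-filesystem set, laid out ahead of time. */
static const struct demo_features demo_static_features = {
	.nr_features = DEMO_FEAT__NR,
	.features = {
		[DEMO_FEAT_IS_BLOCK_FS / 8] = 1u << (DEMO_FEAT_IS_BLOCK_FS % 8),
		[DEMO_FEAT_HAS_MTIME / 8]   = 1u << (DEMO_FEAT_HAS_MTIME % 8),
	},
};

/* Copy the fixed set into a reply buffer and return the reply size. */
static inline size_t demo_get_features(void *buffer)
{
	memcpy(buffer, &demo_static_features, sizeof(demo_static_features));
	return sizeof(demo_static_features);
}
```

A caller would fill a struct demo_features with demo_get_features() and then
ask demo_has_feature(&f, DEMO_FEAT_HAS_MTIME).  Checking the requested bit
against nr_features is a design choice of this sketch, intended to let a
program built against a newer feature list degrade gracefully on an older
kernel; the patch itself does not mandate it.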

 fs/fsinfo.c                 |   30 +++++++++++++++++++
 include/linux/fsinfo.h      |   38 ++++++++++++++++++++++++
 include/uapi/linux/fsinfo.h |   67 ++++++++++++++++++++++++++++++++++++++++++
 samples/vfs/test-fsinfo.c   |   69 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 204 insertions(+)

diff --git a/fs/fsinfo.c b/fs/fsinfo.c
index b7b81e9d7e21..662b0edde151 100644
--- a/fs/fsinfo.c
+++ b/fs/fsinfo.c
@@ -121,6 +121,35 @@ int fsinfo_generic_supports(struct path *path, struct fsinfo_context *ctx)
 }
 EXPORT_SYMBOL(fsinfo_generic_supports);
 
+int fsinfo_generic_features(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_features *p = ctx->buffer;
+	struct super_block *sb = path->dentry->d_sb;
+
+	fsinfo_init_features(p);
+	if (sb->s_mtd)
+		fsinfo_set_feature(p, FSINFO_FEAT_IS_FLASH_FS);
+	else if (sb->s_bdev)
+		fsinfo_set_feature(p, FSINFO_FEAT_IS_BLOCK_FS);
+
+	if (sb->s_quota_types & QTYPE_MASK_USR)
+		fsinfo_set_feature(p, FSINFO_FEAT_USER_QUOTAS);
+	if (sb->s_quota_types & QTYPE_MASK_GRP)
+		fsinfo_set_feature(p, FSINFO_FEAT_GROUP_QUOTAS);
+	if (sb->s_quota_types & QTYPE_MASK_PRJ)
+		fsinfo_set_feature(p, FSINFO_FEAT_PROJECT_QUOTAS);
+	if (sb->s_d_op && sb->s_d_op->d_automount)
+		fsinfo_set_feature(p, FSINFO_FEAT_AUTOMOUNTS);
+	if (sb->s_id[0])
+		fsinfo_set_feature(p, FSINFO_FEAT_VOLUME_ID);
+
+	fsinfo_set_feature(p, FSINFO_FEAT_HAS_ATIME);
+	fsinfo_set_feature(p, FSINFO_FEAT_HAS_CTIME);
+	fsinfo_set_feature(p, FSINFO_FEAT_HAS_MTIME);
+	return sizeof(*p);
+}
+EXPORT_SYMBOL(fsinfo_generic_features);
+
 static const struct fsinfo_timestamp_info fsinfo_default_timestamp_info = {
 	.atime = {
 		.minimum	= S64_MIN,
@@ -196,6 +225,7 @@ static const struct fsinfo_attribute fsinfo_common_attributes[] = {
 	FSINFO_VSTRUCT	(FSINFO_ATTR_TIMESTAMP_INFO,	fsinfo_generic_timestamp_info),
 	FSINFO_STRING	(FSINFO_ATTR_VOLUME_ID,		fsinfo_generic_volume_id),
 	FSINFO_VSTRUCT	(FSINFO_ATTR_VOLUME_UUID,	fsinfo_generic_volume_uuid),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_FEATURES,		fsinfo_generic_features),
 
 	FSINFO_LIST	(FSINFO_ATTR_FSINFO_ATTRIBUTES,	(void *)123UL),
 	FSINFO_VSTRUCT_N(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO, (void *)123UL),
diff --git a/include/linux/fsinfo.h b/include/linux/fsinfo.h
index bf806669b4fb..3f08e61c3270 100644
--- a/include/linux/fsinfo.h
+++ b/include/linux/fsinfo.h
@@ -67,6 +67,44 @@ extern int fsinfo_generic_supports(struct path *, struct fsinfo_context *);
 extern int fsinfo_generic_limits(struct path *, struct fsinfo_context *);
 extern int fsinfo_get_attribute(struct path *, struct fsinfo_context *,
 				const struct fsinfo_attribute *);
+extern int fsinfo_generic_features(struct path *, struct fsinfo_context *);
+
+static inline void fsinfo_init_features(struct fsinfo_features *p)
+{
+	p->nr_features = FSINFO_FEAT__NR;
+}
+
+static inline void fsinfo_set_feature(struct fsinfo_features *p,
+				      enum fsinfo_feature feature)
+{
+	p->features[feature / 8] |= 1 << (feature % 8);
+}
+
+static inline void fsinfo_clear_feature(struct fsinfo_features *p,
+					enum fsinfo_feature feature)
+{
+	p->features[feature / 8] &= ~(1 << (feature % 8));
+}
+
+/**
+ * fsinfo_set_unix_features - Set standard UNIX features.
+ * @p: The features mask to alter
+ */
+static inline void fsinfo_set_unix_features(struct fsinfo_features *p)
+{
+	fsinfo_set_feature(p, FSINFO_FEAT_UIDS);
+	fsinfo_set_feature(p, FSINFO_FEAT_GIDS);
+	fsinfo_set_feature(p, FSINFO_FEAT_DIRECTORIES);
+	fsinfo_set_feature(p, FSINFO_FEAT_SYMLINKS);
+	fsinfo_set_feature(p, FSINFO_FEAT_HARD_LINKS);
+	fsinfo_set_feature(p, FSINFO_FEAT_DEVICE_FILES);
+	fsinfo_set_feature(p, FSINFO_FEAT_UNIX_SPECIALS);
+	fsinfo_set_feature(p, FSINFO_FEAT_SPARSE);
+	fsinfo_set_feature(p, FSINFO_FEAT_HAS_ATIME);
+	fsinfo_set_feature(p, FSINFO_FEAT_HAS_CTIME);
+	fsinfo_set_feature(p, FSINFO_FEAT_HAS_MTIME);
+	fsinfo_set_feature(p, FSINFO_FEAT_HAS_INODE_NUMBERS);
+}
 
 #endif /* CONFIG_FSINFO */
 
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
index b56ebd525b03..448378301456 100644
--- a/include/uapi/linux/fsinfo.h
+++ b/include/uapi/linux/fsinfo.h
@@ -23,6 +23,7 @@
 #define FSINFO_ATTR_VOLUME_ID		0x05	/* Volume ID (string) */
 #define FSINFO_ATTR_VOLUME_UUID		0x06	/* Volume UUID (LE uuid) */
 #define FSINFO_ATTR_VOLUME_NAME		0x07	/* Volume name (string) */
+#define FSINFO_ATTR_FEATURES		0x08	/* Filesystem features (bits) */
 
 #define FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO 0x100	/* Information about attr N (for path) */
 #define FSINFO_ATTR_FSINFO_ATTRIBUTES	0x101	/* List of supported attrs (for path) */
@@ -154,6 +155,72 @@ struct fsinfo_supports {
 
 #define FSINFO_ATTR_SUPPORTS__STRUCT struct fsinfo_supports
 
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_FEATURES).
+ *
+ * Bitmask of those filesystem features that can be represented as single bits.
+ */
+enum fsinfo_feature {
+	FSINFO_FEAT_IS_KERNEL_FS	= 0,	/* fs is kernel-special filesystem */
+	FSINFO_FEAT_IS_BLOCK_FS		= 1,	/* fs is block-based filesystem */
+	FSINFO_FEAT_IS_FLASH_FS		= 2,	/* fs is flash filesystem */
+	FSINFO_FEAT_IS_NETWORK_FS	= 3,	/* fs is network filesystem */
+	FSINFO_FEAT_IS_AUTOMOUNTER_FS	= 4,	/* fs is automounter special filesystem */
+	FSINFO_FEAT_IS_MEMORY_FS	= 5,	/* fs is memory-based filesystem */
+	FSINFO_FEAT_AUTOMOUNTS		= 6,	/* fs supports automounts */
+	FSINFO_FEAT_ADV_LOCKS		= 7,	/* fs supports advisory file locking */
+	FSINFO_FEAT_MAND_LOCKS		= 8,	/* fs supports mandatory file locking */
+	FSINFO_FEAT_LEASES		= 9,	/* fs supports file leases */
+	FSINFO_FEAT_UIDS		= 10,	/* fs supports numeric uids */
+	FSINFO_FEAT_GIDS		= 11,	/* fs supports numeric gids */
+	FSINFO_FEAT_PROJIDS		= 12,	/* fs supports numeric project ids */
+	FSINFO_FEAT_STRING_USER_IDS	= 13,	/* fs supports string user identifiers */
+	FSINFO_FEAT_GUID_USER_IDS	= 14,	/* fs supports GUID user identifiers */
+	FSINFO_FEAT_WINDOWS_ATTRS	= 15,	/* fs has windows attributes */
+	FSINFO_FEAT_USER_QUOTAS		= 16,	/* fs has per-user quotas */
+	FSINFO_FEAT_GROUP_QUOTAS	= 17,	/* fs has per-group quotas */
+	FSINFO_FEAT_PROJECT_QUOTAS	= 18,	/* fs has per-project quotas */
+	FSINFO_FEAT_XATTRS		= 19,	/* fs has xattrs */
+	FSINFO_FEAT_JOURNAL		= 20,	/* fs has a journal */
+	FSINFO_FEAT_DATA_IS_JOURNALLED	= 21,	/* fs is using data journalling */
+	FSINFO_FEAT_O_SYNC		= 22,	/* fs supports O_SYNC */
+	FSINFO_FEAT_O_DIRECT		= 23,	/* fs supports O_DIRECT */
+	FSINFO_FEAT_VOLUME_ID		= 24,	/* fs has a volume ID */
+	FSINFO_FEAT_VOLUME_UUID		= 25,	/* fs has a volume UUID */
+	FSINFO_FEAT_VOLUME_NAME		= 26,	/* fs has a volume name */
+	FSINFO_FEAT_VOLUME_FSID		= 27,	/* fs has a volume FSID */
+	FSINFO_FEAT_IVER_ALL_CHANGE	= 28,	/* i_version represents data + meta changes */
+	FSINFO_FEAT_IVER_DATA_CHANGE	= 29,	/* i_version represents data changes only */
+	FSINFO_FEAT_IVER_MONO_INCR	= 30,	/* i_version incremented monotonically */
+	FSINFO_FEAT_DIRECTORIES		= 31,	/* fs supports (sub)directories */
+	FSINFO_FEAT_SYMLINKS		= 32,	/* fs supports symlinks */
+	FSINFO_FEAT_HARD_LINKS		= 33,	/* fs supports hard links */
+	FSINFO_FEAT_HARD_LINKS_1DIR	= 34,	/* fs supports hard links in same dir only */
+	FSINFO_FEAT_DEVICE_FILES	= 35,	/* fs supports bdev, cdev */
+	FSINFO_FEAT_UNIX_SPECIALS	= 36,	/* fs supports pipe, fifo, socket */
+	FSINFO_FEAT_RESOURCE_FORKS	= 37,	/* fs supports resource forks/streams */
+	FSINFO_FEAT_NAME_CASE_INDEP	= 38,	/* Filename case independence is mandatory */
+	FSINFO_FEAT_NAME_NON_UTF8	= 39,	/* fs has non-utf8 names */
+	FSINFO_FEAT_NAME_HAS_CODEPAGE	= 40,	/* fs has a filename codepage */
+	FSINFO_FEAT_SPARSE		= 41,	/* fs supports sparse files */
+	FSINFO_FEAT_NOT_PERSISTENT	= 42,	/* fs is not persistent */
+	FSINFO_FEAT_NO_UNIX_MODE	= 43,	/* fs does not support unix mode bits */
+	FSINFO_FEAT_HAS_ATIME		= 44,	/* fs supports access time */
+	FSINFO_FEAT_HAS_BTIME		= 45,	/* fs supports birth/creation time */
+	FSINFO_FEAT_HAS_CTIME		= 46,	/* fs supports change time */
+	FSINFO_FEAT_HAS_MTIME		= 47,	/* fs supports modification time */
+	FSINFO_FEAT_HAS_ACL		= 48,	/* fs supports ACLs of some sort */
+	FSINFO_FEAT_HAS_INODE_NUMBERS	= 49,	/* fs has inode numbers */
+	FSINFO_FEAT__NR
+};
+
+struct fsinfo_features {
+	__u32	nr_features;	/* Number of supported features (FSINFO_FEAT__NR) */
+	__u8	features[(FSINFO_FEAT__NR + 7) / 8];
+};
+
+#define FSINFO_ATTR_FEATURES__STRUCT struct fsinfo_features
+
 struct fsinfo_timestamp_one {
 	__s64	minimum;	/* Minimum timestamp value in seconds */
 	__s64	maximum;	/* Maximum timestamp value in seconds */
diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
index 67aebf9fc9d8..a48072c77401 100644
--- a/samples/vfs/test-fsinfo.c
+++ b/samples/vfs/test-fsinfo.c
@@ -184,6 +184,74 @@ static void dump_fsinfo_generic_supports(void *reply, unsigned int size)
 	printf("\twin_fattrs   : %x\n", f->win_file_attrs);
 }
 
+#define FSINFO_FEATURE_NAME(C) [FSINFO_FEAT_##C] = #C
+static const char *fsinfo_feature_names[FSINFO_FEAT__NR] = {
+	FSINFO_FEATURE_NAME(IS_KERNEL_FS),
+	FSINFO_FEATURE_NAME(IS_BLOCK_FS),
+	FSINFO_FEATURE_NAME(IS_FLASH_FS),
+	FSINFO_FEATURE_NAME(IS_NETWORK_FS),
+	FSINFO_FEATURE_NAME(IS_AUTOMOUNTER_FS),
+	FSINFO_FEATURE_NAME(IS_MEMORY_FS),
+	FSINFO_FEATURE_NAME(AUTOMOUNTS),
+	FSINFO_FEATURE_NAME(ADV_LOCKS),
+	FSINFO_FEATURE_NAME(MAND_LOCKS),
+	FSINFO_FEATURE_NAME(LEASES),
+	FSINFO_FEATURE_NAME(UIDS),
+	FSINFO_FEATURE_NAME(GIDS),
+	FSINFO_FEATURE_NAME(PROJIDS),
+	FSINFO_FEATURE_NAME(STRING_USER_IDS),
+	FSINFO_FEATURE_NAME(GUID_USER_IDS),
+	FSINFO_FEATURE_NAME(WINDOWS_ATTRS),
+	FSINFO_FEATURE_NAME(USER_QUOTAS),
+	FSINFO_FEATURE_NAME(GROUP_QUOTAS),
+	FSINFO_FEATURE_NAME(PROJECT_QUOTAS),
+	FSINFO_FEATURE_NAME(XATTRS),
+	FSINFO_FEATURE_NAME(JOURNAL),
+	FSINFO_FEATURE_NAME(DATA_IS_JOURNALLED),
+	FSINFO_FEATURE_NAME(O_SYNC),
+	FSINFO_FEATURE_NAME(O_DIRECT),
+	FSINFO_FEATURE_NAME(VOLUME_ID),
+	FSINFO_FEATURE_NAME(VOLUME_UUID),
+	FSINFO_FEATURE_NAME(VOLUME_NAME),
+	FSINFO_FEATURE_NAME(VOLUME_FSID),
+	FSINFO_FEATURE_NAME(IVER_ALL_CHANGE),
+	FSINFO_FEATURE_NAME(IVER_DATA_CHANGE),
+	FSINFO_FEATURE_NAME(IVER_MONO_INCR),
+	FSINFO_FEATURE_NAME(DIRECTORIES),
+	FSINFO_FEATURE_NAME(SYMLINKS),
+	FSINFO_FEATURE_NAME(HARD_LINKS),
+	FSINFO_FEATURE_NAME(HARD_LINKS_1DIR),
+	FSINFO_FEATURE_NAME(DEVICE_FILES),
+	FSINFO_FEATURE_NAME(UNIX_SPECIALS),
+	FSINFO_FEATURE_NAME(RESOURCE_FORKS),
+	FSINFO_FEATURE_NAME(NAME_CASE_INDEP),
+	FSINFO_FEATURE_NAME(NAME_NON_UTF8),
+	FSINFO_FEATURE_NAME(NAME_HAS_CODEPAGE),
+	FSINFO_FEATURE_NAME(SPARSE),
+	FSINFO_FEATURE_NAME(NOT_PERSISTENT),
+	FSINFO_FEATURE_NAME(NO_UNIX_MODE),
+	FSINFO_FEATURE_NAME(HAS_ATIME),
+	FSINFO_FEATURE_NAME(HAS_BTIME),
+	FSINFO_FEATURE_NAME(HAS_CTIME),
+	FSINFO_FEATURE_NAME(HAS_MTIME),
+	FSINFO_FEATURE_NAME(HAS_ACL),
+	FSINFO_FEATURE_NAME(HAS_INODE_NUMBERS),
+};
+
+static void dump_fsinfo_generic_features(void *reply, unsigned int size)
+{
+	struct fsinfo_features *f = reply;
+	int i;
+
+	printf("\n\t");
+	for (i = 0; i < sizeof(f->features); i++)
+		printf("%02x", f->features[i]);
+	printf(" (nr=%u)\n", f->nr_features);
+	for (i = 0; i < FSINFO_FEAT__NR; i++)
+		if (f->features[i / 8] & (1 << (i % 8)))
+			printf("\t- %s\n", fsinfo_feature_names[i]);
+}
+
 static void print_time(struct fsinfo_timestamp_one *t, char stamp)
 {
 	printf("\t%ctime       : gran=%gs range=%llx-%llx\n",
@@ -285,6 +353,7 @@ static const struct fsinfo_attribute fsinfo_attributes[] = {
 	FSINFO_VSTRUCT	(FSINFO_ATTR_IDS,		fsinfo_generic_ids),
 	FSINFO_VSTRUCT	(FSINFO_ATTR_LIMITS,		fsinfo_generic_limits),
 	FSINFO_VSTRUCT	(FSINFO_ATTR_SUPPORTS,		fsinfo_generic_supports),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_FEATURES,		fsinfo_generic_features),
 	FSINFO_VSTRUCT	(FSINFO_ATTR_TIMESTAMP_INFO,	fsinfo_generic_timestamp_info),
 	FSINFO_STRING	(FSINFO_ATTR_VOLUME_ID,		string),
 	FSINFO_VSTRUCT	(FSINFO_ATTR_VOLUME_UUID,	fsinfo_generic_volume_uuid),
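
For readers following the bitmap layout used by struct fsinfo_features and
the sample dumper above, here is a minimal userspace sketch of the same
indexing: feature N lives in byte N / 8 at bit N % 8.  The example_* names
and EXAMPLE_FEAT_NR are hypothetical stand-ins and are not part of the
patch; only the byte/bit arithmetic is taken from it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for FSINFO_FEAT__NR (the last feature shown is 49). */
#define EXAMPLE_FEAT_NR 50

struct example_features {
	uint32_t nr_features;				/* Number of valid bits */
	uint8_t  features[(EXAMPLE_FEAT_NR + 7) / 8];	/* Rounded up to whole bytes */
};

/* Set feature bit n: byte n / 8, bit n % 8 within that byte. */
static void example_set_feature(struct example_features *f, unsigned int n)
{
	assert(f != NULL && n < EXAMPLE_FEAT_NR);
	f->features[n / 8] |= (uint8_t)(1u << (n % 8));
}

/* Test feature bit n; bits beyond the advertised count read as clear. */
static bool example_has_feature(const struct example_features *f, unsigned int n)
{
	assert(f != NULL);
	if (n >= f->nr_features || n >= EXAMPLE_FEAT_NR)
		return false;
	return (f->features[n / 8] & (1u << (n % 8))) != 0;
}
```

Checking n against nr_features is an assumption about how a caller might
cope with a kernel that knows fewer features than the header it was built
against.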




* [PATCH 04/14] fsinfo: Allow retrieval of superblock devname, options and stats [ver #18]
  2020-03-09 14:00 [PATCH 00/14] VFS: Filesystem information [ver #18] David Howells
                   ` (2 preceding siblings ...)
  2020-03-09 14:01 ` [PATCH 03/14] fsinfo: Provide a bitmap of supported features [ver #18] David Howells
@ 2020-03-09 14:01 ` David Howells
  2020-03-09 14:01 ` [PATCH 05/14] fsinfo: Allow fsinfo() to look up a mount object by ID " David Howells
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2020-03-09 14:01 UTC (permalink / raw)
  To: torvalds, viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong, kzak,
	jlayton, linux-api, linux-fsdevel, linux-security-module,
	linux-kernel

Provide fsinfo() attributes to retrieve superblock device name, options,
and statistics in string form.  The following attributes are defined:

	FSINFO_ATTR_SOURCE		- Mount-specific device name
	FSINFO_ATTR_CONFIGURATION	- Mount options
	FSINFO_ATTR_FS_STATISTICS	- Filesystem statistics

FSINFO_ATTR_SOURCE could be made indexable by params->Nth to handle the
case where there is more than one source (e.g. the bcachefs filesystem).
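
For FSINFO_ATTR_CONFIGURATION, the code below drops the leading comma that
option text may start with by recording a skip offset rather than moving the
string.  A minimal userspace sketch of that step follows; the helper name
and signature are hypothetical and only mirror the count/skip adjustment.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: given count bytes of option text in buf, report how
 * many leading bytes the caller should skip and return the remaining length.
 */
static size_t example_strip_leading_comma(const char *buf, size_t count,
					  size_t *skip)
{
	assert(buf != NULL && skip != NULL);
	*skip = 0;
	if (count > 0 && buf[0] == ',') {
		*skip = 1;	/* Start copying from the second byte */
		count--;	/* One byte fewer to return */
	}
	return count;
}
```

Recording an offset instead of calling memmove() keeps the step cheap; the
data is then copied out starting from buf + skip.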

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/fsinfo.c                 |   41 +++++++++++++++++++++++++++++++++++++++++
 fs/internal.h               |    2 ++
 fs/namespace.c              |   39 +++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fsinfo.h |    3 +++
 samples/vfs/test-fsinfo.c   |    4 ++++
 5 files changed, 89 insertions(+)

diff --git a/fs/fsinfo.c b/fs/fsinfo.c
index 662b0edde151..9562bce5253c 100644
--- a/fs/fsinfo.c
+++ b/fs/fsinfo.c
@@ -217,6 +217,44 @@ static int fsinfo_generic_volume_id(struct path *path, struct fsinfo_context *ct
 	return fsinfo_string(path->dentry->d_sb->s_id, ctx);
 }
 
+/*
+ * Retrieve the superblock configuration (mount options) or statistics as a
+ * string.  For the configuration, the initial comma is stripped off.
+ */
+static int fsinfo_generic_seq_read(struct path *path, struct fsinfo_context *ctx)
+{
+	struct super_block *sb = path->dentry->d_sb;
+	struct seq_file m = {
+		.buf	= ctx->buffer,
+		.size	= ctx->buf_size,
+	};
+	int ret = 0;
+
+	switch (ctx->requested_attr) {
+	case FSINFO_ATTR_CONFIGURATION:
+		if (sb->s_op->show_options)
+			ret = sb->s_op->show_options(&m, path->mnt->mnt_root);
+		break;
+
+	case FSINFO_ATTR_FS_STATISTICS:
+		if (sb->s_op->show_stats)
+			ret = sb->s_op->show_stats(&m, path->mnt->mnt_root);
+		break;
+	}
+
+	if (ret < 0)
+		return ret;
+	if (seq_has_overflowed(&m))
+		return ctx->buf_size + PAGE_SIZE;
+	if (ctx->requested_attr == FSINFO_ATTR_CONFIGURATION) {
+		if (m.count > 0 && ((char *)ctx->buffer)[0] == ',') {
+			m.count--;
+			ctx->skip = 1;
+		}
+	}
+	return m.count;
+}
+
 static const struct fsinfo_attribute fsinfo_common_attributes[] = {
 	FSINFO_VSTRUCT	(FSINFO_ATTR_STATFS,		fsinfo_generic_statfs),
 	FSINFO_VSTRUCT	(FSINFO_ATTR_IDS,		fsinfo_generic_ids),
@@ -226,6 +264,9 @@ static const struct fsinfo_attribute fsinfo_common_attributes[] = {
 	FSINFO_STRING	(FSINFO_ATTR_VOLUME_ID,		fsinfo_generic_volume_id),
 	FSINFO_VSTRUCT	(FSINFO_ATTR_VOLUME_UUID,	fsinfo_generic_volume_uuid),
 	FSINFO_VSTRUCT	(FSINFO_ATTR_FEATURES,		fsinfo_generic_features),
+	FSINFO_STRING	(FSINFO_ATTR_SOURCE,		fsinfo_generic_mount_source),
+	FSINFO_STRING	(FSINFO_ATTR_CONFIGURATION,	fsinfo_generic_seq_read),
+	FSINFO_STRING	(FSINFO_ATTR_FS_STATISTICS,	fsinfo_generic_seq_read),
 
 	FSINFO_LIST	(FSINFO_ATTR_FSINFO_ATTRIBUTES,	(void *)123UL),
 	FSINFO_VSTRUCT_N(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO, (void *)123UL),
diff --git a/fs/internal.h b/fs/internal.h
index a0d90f23593c..6f2cc77bf38d 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -91,6 +91,8 @@ extern int __mnt_want_write_file(struct file *);
 extern void __mnt_drop_write_file(struct file *);
 
 extern void dissolve_on_fput(struct vfsmount *);
+extern int fsinfo_generic_mount_source(struct path *, struct fsinfo_context *);
+
 /*
  * fs_struct.c
  */
diff --git a/fs/namespace.c b/fs/namespace.c
index 54d237251941..e26e06447993 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -30,6 +30,7 @@
 #include <uapi/linux/mount.h>
 #include <linux/fs_context.h>
 #include <linux/shmem_fs.h>
+#include <linux/fsinfo.h>
 
 #include "pnode.h"
 #include "internal.h"
@@ -3997,3 +3998,41 @@ const struct proc_ns_operations mntns_operations = {
 	.install	= mntns_install,
 	.owner		= mntns_owner,
 };
+
+#ifdef CONFIG_FSINFO
+static inline void mangle(struct seq_file *m, const char *s)
+{
+	seq_escape(m, s, " \t\n\\");
+}
+
+/*
+ * Return the mount source/device name as seen from this mountpoint.  Shared
+ * mounts may vary here and the filesystem is permitted to substitute its own
+ * rendering.
+ */
+int fsinfo_generic_mount_source(struct path *path, struct fsinfo_context *ctx)
+{
+	struct super_block *sb = path->mnt->mnt_sb;
+	struct mount *mnt = real_mount(path->mnt);
+	struct seq_file m = {
+		.buf	= ctx->buffer,
+		.size	= ctx->buf_size,
+	};
+	int ret;
+
+	if (sb->s_op->show_devname) {
+		ret = sb->s_op->show_devname(&m, mnt->mnt.mnt_root);
+		if (ret < 0)
+			return ret;
+	} else {
+		if (!mnt->mnt_devname)
+			return fsinfo_string("none", ctx);
+		mangle(&m, mnt->mnt_devname);
+	}
+
+	if (seq_has_overflowed(&m))
+		return ctx->buf_size + PAGE_SIZE;
+	return m.count;
+}
+
+#endif /* CONFIG_FSINFO */
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
index 448378301456..253b5213a775 100644
--- a/include/uapi/linux/fsinfo.h
+++ b/include/uapi/linux/fsinfo.h
@@ -24,6 +24,9 @@
 #define FSINFO_ATTR_VOLUME_UUID		0x06	/* Volume UUID (LE uuid) */
 #define FSINFO_ATTR_VOLUME_NAME		0x07	/* Volume name (string) */
 #define FSINFO_ATTR_FEATURES		0x08	/* Filesystem features (bits) */
+#define FSINFO_ATTR_SOURCE		0x09	/* Superblock source/device name (string) */
+#define FSINFO_ATTR_CONFIGURATION	0x0a	/* Superblock configuration/options (string) */
+#define FSINFO_ATTR_FS_STATISTICS	0x0b	/* Superblock filesystem statistics (string) */
 
 #define FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO 0x100	/* Information about attr N (for path) */
 #define FSINFO_ATTR_FSINFO_ATTRIBUTES	0x101	/* List of supported attrs (for path) */
diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
index a48072c77401..1002b718cdbc 100644
--- a/samples/vfs/test-fsinfo.c
+++ b/samples/vfs/test-fsinfo.c
@@ -358,6 +358,10 @@ static const struct fsinfo_attribute fsinfo_attributes[] = {
 	FSINFO_STRING	(FSINFO_ATTR_VOLUME_ID,		string),
 	FSINFO_VSTRUCT	(FSINFO_ATTR_VOLUME_UUID,	fsinfo_generic_volume_uuid),
 	FSINFO_STRING	(FSINFO_ATTR_VOLUME_NAME,	string),
+	FSINFO_STRING	(FSINFO_ATTR_SOURCE,		string),
+	FSINFO_STRING	(FSINFO_ATTR_CONFIGURATION,	string),
+	FSINFO_STRING	(FSINFO_ATTR_FS_STATISTICS,	string),
+
 	FSINFO_VSTRUCT_N(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO, fsinfo_meta_attribute_info),
 	FSINFO_LIST	(FSINFO_ATTR_FSINFO_ATTRIBUTES,	fsinfo_meta_attributes),
 	{}




* [PATCH 05/14] fsinfo: Allow fsinfo() to look up a mount object by ID [ver #18]
  2020-03-09 14:00 [PATCH 00/14] VFS: Filesystem information [ver #18] David Howells
                   ` (3 preceding siblings ...)
  2020-03-09 14:01 ` [PATCH 04/14] fsinfo: Allow retrieval of superblock devname, options and stats " David Howells
@ 2020-03-09 14:01 ` David Howells
  2020-03-09 14:01 ` [PATCH 06/14] fsinfo: Add a uniquifier ID to struct mount " David Howells
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2020-03-09 14:01 UTC (permalink / raw)
  To: torvalds, viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong, kzak,
	jlayton, linux-api, linux-fsdevel, linux-security-module,
	linux-kernel

Allow the fsinfo() syscall to look up a mount object by ID rather than by
pathname.  This is necessary as there can be multiple mounts stacked up at
the same pathname and there's no way to look through them otherwise.

This is done by passing FSINFO_FLAGS_QUERY_MOUNT to fsinfo() in the
parameters and then passing the mount ID as a string to fsinfo() in place
of the filename:

	struct fsinfo_params params = {
		.flags	 = FSINFO_FLAGS_QUERY_MOUNT,
		.request = FSINFO_ATTR_IDS,
	};

	ret = fsinfo(AT_FDCWD, "21", &params, buffer, sizeof(buffer));
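
Since the mount ID travels as a string, it has to be parsed and
range-checked before use.  The following is a rough userspace analogue of
that step, assuming the ID must fit in a non-negative int; the function name
is hypothetical and strtoul() stands in for the kernel's parser.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical userspace analogue: parse a mount ID string into *id.
 * Returns 0 on success or a negative errno value on failure.
 */
static int example_parse_mount_id(const char *s, int *id)
{
	char *end;
	unsigned long v;

	assert(s != NULL && id != NULL);
	if (*s < '0' || *s > '9')	/* Reject empty input, signs, spaces */
		return -EINVAL;
	errno = 0;
	v = strtoul(s, &end, 0);	/* Base 0: decimal, octal or hex */
	if (errno == ERANGE || v > INT_MAX)
		return -ERANGE;
	if (*end != '\0')		/* Trailing junk after the number */
		return -EINVAL;
	*id = (int)v;
	return 0;
}
```

With this sketch "21" and "0x15" both yield 21, whereas "2147483648" is
rejected as out of range.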

The caller is only permitted to query a mount object if the root directory
of that mount connects directly to the current chroot if dfd == AT_FDCWD[*]
or the directory specified by dfd otherwise.  Note that this is not
available to the pathwalk of any other syscall.

[*] This needs to be something other than AT_FDCWD, perhaps AT_FDROOT.

[!] This probably needs an LSM hook.

[!] This might want to check the permissions on all the intervening dirs -
    but it would have to do that under RCU conditions.

[!] This might want to check a CAP_* flag.
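
The connectivity rule described above can be pictured with a toy model in
which each mount has only a parent pointer and the topmost mount is its own
parent.  This sketch covers just the walk across mounts; it deliberately
omits the dentry walk, locking and RCU that the real check needs, and all
names in it are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy mount: only a parent pointer; a topmost mount is its own parent. */
struct example_mount {
	struct example_mount *parent;
};

/* Walk up from m; true if ancestor is met before running out of parents. */
static bool example_mount_connected(const struct example_mount *ancestor,
				    const struct example_mount *m)
{
	assert(ancestor != NULL && m != NULL);
	while (m != ancestor) {
		if (m->parent == m)	/* Hit the top without meeting ancestor */
			return false;
		m = m->parent;
	}
	return true;
}
```

A mount counts as connected to itself here, and a mount in an unrelated tree
is never reached, so a query for it would be refused.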

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/fsinfo.c                 |   53 +++++++++++++++++++
 fs/internal.h               |    1 
 fs/namespace.c              |  117 ++++++++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/fsinfo.h |    1 
 samples/vfs/test-fsinfo.c   |    7 ++-
 5 files changed, 175 insertions(+), 4 deletions(-)

diff --git a/fs/fsinfo.c b/fs/fsinfo.c
index 9562bce5253c..bafeb73feaf4 100644
--- a/fs/fsinfo.c
+++ b/fs/fsinfo.c
@@ -547,6 +547,56 @@ static int vfs_fsinfo_fd(unsigned int fd, struct fsinfo_context *ctx)
 	return ret;
 }
 
+/*
+ * Look up the root of a mount object.  This allows access to mount objects
+ * (and their attached superblocks) that can't be retrieved by path because
+ * they're entirely covered.
+ *
+ * We only permit access to a mount that connects directly to either the dentry
+ * pointed to by dfd or our chroot (if dfd is AT_FDCWD).
+ */
+static int vfs_fsinfo_mount(int dfd, const char __user *filename,
+			    struct fsinfo_context *ctx)
+{
+	struct path path;
+	struct fd f = {};
+	char *name;
+	unsigned long mnt_id;
+	int ret;
+
+	if (!filename)
+		return -EINVAL;
+
+	name = strndup_user(filename, 32);
+	if (IS_ERR(name))
+		return PTR_ERR(name);
+	ret = kstrtoul(name, 0, &mnt_id);
+	if (ret == 0 && mnt_id > INT_MAX)
+		ret = -ERANGE;
+	if (ret < 0)
+		goto out_name;
+
+	if (dfd != AT_FDCWD) {
+		ret = -EBADF;
+		f = fdget_raw(dfd);
+		if (!f.file)
+			goto out_name;
+	}
+
+	ret = lookup_mount_object(f.file ? &f.file->f_path : NULL,
+				  mnt_id, &path);
+	if (ret < 0)
+		goto out_fd;
+
+	ret = vfs_fsinfo(&path, ctx);
+	path_put(&path);
+out_fd:
+	fdput(f);
+out_name:
+	kfree(name);
+	return ret;
+}
+
 /**
  * sys_fsinfo - System call to get filesystem information
  * @dfd: Base directory to pathwalk from or fd referring to filesystem.
@@ -620,6 +670,9 @@ SYSCALL_DEFINE6(fsinfo,
 			return -EINVAL;
 		ret = vfs_fsinfo_fd(dfd, &ctx);
 		break;
+	case FSINFO_FLAGS_QUERY_MOUNT:
+		ret = vfs_fsinfo_mount(dfd, pathname, &ctx);
+		break;
 	default:
 		return -EINVAL;
 	}
diff --git a/fs/internal.h b/fs/internal.h
index 6f2cc77bf38d..abbd5299e7dc 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -91,6 +91,7 @@ extern int __mnt_want_write_file(struct file *);
 extern void __mnt_drop_write_file(struct file *);
 
 extern void dissolve_on_fput(struct vfsmount *);
+extern int lookup_mount_object(struct path *, int, struct path *);
 extern int fsinfo_generic_mount_source(struct path *, struct fsinfo_context *);
 
 /*
diff --git a/fs/namespace.c b/fs/namespace.c
index e26e06447993..f33cec5fe885 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -64,7 +64,7 @@ static int __init set_mphash_entries(char *str)
 __setup("mphash_entries=", set_mphash_entries);
 
 static u64 event;
-static DEFINE_IDA(mnt_id_ida);
+static DEFINE_IDR(mnt_id_ida);
 static DEFINE_IDA(mnt_group_ida);
 
 static struct hlist_head *mount_hashtable __read_mostly;
@@ -105,17 +105,27 @@ static inline struct hlist_head *mp_hash(struct dentry *dentry)
 
 static int mnt_alloc_id(struct mount *mnt)
 {
-	int res = ida_alloc(&mnt_id_ida, GFP_KERNEL);
+	int res;
 
+	/* Allocate an ID, but don't set the pointer back to the mount until
+	 * later, as once we do that, we have to follow RCU protocols to get
+	 * rid of the mount struct.
+	 */
+	res = idr_alloc(&mnt_id_ida, NULL, 0, INT_MAX, GFP_KERNEL);
 	if (res < 0)
 		return res;
 	mnt->mnt_id = res;
 	return 0;
 }
 
+static void mnt_publish_id(struct mount *mnt)
+{
+	idr_replace(&mnt_id_ida, mnt, mnt->mnt_id);
+}
+
 static void mnt_free_id(struct mount *mnt)
 {
-	ida_free(&mnt_id_ida, mnt->mnt_id);
+	idr_remove(&mnt_id_ida, mnt->mnt_id);
 }
 
 /*
@@ -959,6 +969,7 @@ struct vfsmount *vfs_create_mount(struct fs_context *fc)
 	lock_mount_hash();
 	list_add_tail(&mnt->mnt_instance, &mnt->mnt.mnt_sb->s_mounts);
 	unlock_mount_hash();
+	mnt_publish_id(mnt);
 	return &mnt->mnt;
 }
 EXPORT_SYMBOL(vfs_create_mount);
@@ -1052,6 +1063,7 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 	lock_mount_hash();
 	list_add_tail(&mnt->mnt_instance, &sb->s_mounts);
 	unlock_mount_hash();
+	mnt_publish_id(mnt);
 
 	if ((flag & CL_SLAVE) ||
 	    ((flag & CL_SHARED_TO_SLAVE) && IS_MNT_SHARED(old))) {
@@ -4035,4 +4047,103 @@ int fsinfo_generic_mount_source(struct path *path, struct fsinfo_context *ctx)
 	return m.count;
 }
 
+/*
+ * See if one path point connects directly to another by ancestral relationship
+ * across mountpoints.  Must call with the RCU read lock held.
+ */
+static bool are_paths_connected(struct path *ancestor, struct path *to_check)
+{
+	struct mount *mnt, *parent;
+	struct path cursor;
+	unsigned seq;
+	bool connected;
+
+	seq = 0;
+restart:
+	cursor = *to_check;
+
+	read_seqbegin_or_lock(&rename_lock, &seq);
+	while (cursor.mnt != ancestor->mnt) {
+		mnt = real_mount(cursor.mnt);
+		parent = READ_ONCE(mnt->mnt_parent);
+		if (mnt == parent)
+			goto failed;
+		cursor.dentry = READ_ONCE(mnt->mnt_mountpoint);
+		cursor.mnt = &parent->mnt;
+	}
+
+	while (cursor.dentry != ancestor->dentry) {
+		if (cursor.dentry == cursor.mnt->mnt_root ||
+		    IS_ROOT(cursor.dentry))
+			goto failed;
+		cursor.dentry = READ_ONCE(cursor.dentry->d_parent);
+	}
+
+	connected = true;
+out:
+	done_seqretry(&rename_lock, seq);
+	return connected;
+
+failed:
+	if (need_seqretry(&rename_lock, seq)) {
+		seq = 1;
+		goto restart;
+	}
+	connected = false;
+	goto out;
+}
+
+/**
+ * lookup_mount_object - Look up a vfsmount object by ID
+ * @root: The mount root must connect backwards to this point (or chroot if NULL).
+ * @mnt_id: The ID of the mount.
+ * @_mntpt: Where to return the resulting mountpoint path.
+ *
+ * Look up the root of the mount with the corresponding ID.  This is only
+ * permitted if that mount connects directly to the specified root/chroot.
+ */
+int lookup_mount_object(struct path *root, int mnt_id, struct path *_mntpt)
+{
+	struct mount *mnt;
+	struct path stop, mntpt = {};
+	int ret = -EPERM;
+
+	if (!root)
+		get_fs_root(current->fs, &stop);
+	else
+		stop = *root;
+
+	rcu_read_lock();
+	lock_mount_hash();
+	mnt = idr_find(&mnt_id_ida, mnt_id);
+	if (!mnt)
+		goto out_unlock_mh;
+	if (mnt->mnt.mnt_flags & (MNT_SYNC_UMOUNT | MNT_UMOUNT | MNT_DOOMED))
+		goto out_unlock_mh;
+	if (mnt_get_count(mnt) == 0)
+		goto out_unlock_mh;
+	mnt_add_count(mnt, 1);
+	mntpt.mnt = &mnt->mnt;
+	mntpt.dentry = dget(mnt->mnt.mnt_root);
+	unlock_mount_hash();
+
+	if (are_paths_connected(&stop, &mntpt)) {
+		*_mntpt = mntpt;
+		mntpt.mnt = NULL;
+		mntpt.dentry = NULL;
+		ret = 0;
+	}
+
+out_unlock:
+	rcu_read_unlock();
+	if (!root)
+		path_put(&stop);
+	path_put(&mntpt);
+	return ret;
+
+out_unlock_mh:
+	unlock_mount_hash();
+	goto out_unlock;
+}
+
 #endif /* CONFIG_FSINFO */
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
index 253b5213a775..491a59e8cc95 100644
--- a/include/uapi/linux/fsinfo.h
+++ b/include/uapi/linux/fsinfo.h
@@ -42,6 +42,7 @@ struct fsinfo_params {
 #define FSINFO_FLAGS_QUERY_MASK	0x0007 /* What object should fsinfo() query? */
 #define FSINFO_FLAGS_QUERY_PATH	0x0000 /* - path, specified by dirfd,pathname,AT_EMPTY_PATH */
 #define FSINFO_FLAGS_QUERY_FD	0x0001 /* - fd specified by dirfd */
+#define FSINFO_FLAGS_QUERY_MOUNT 0x0002	/* - mount object (path=>mount_id, dirfd=>subtree) */
 	__u32	resolve_flags;	/* RESOLVE_* flags */
 	__u32	request;	/* ID of requested attribute */
 	__u32	Nth;		/* Instance of it (some may have multiple) */
diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
index 1002b718cdbc..c407bda4134f 100644
--- a/samples/vfs/test-fsinfo.c
+++ b/samples/vfs/test-fsinfo.c
@@ -579,7 +579,7 @@ int main(int argc, char **argv)
 	bool meta = false;
 	int raw = 0, opt, Nth, Mth;
 
-	while ((opt = getopt(argc, argv, "Madlr"))) {
+	while ((opt = getopt(argc, argv, "Madmlr"))) {
 		switch (opt) {
 		case 'M':
 			meta = true;
@@ -595,6 +595,10 @@ int main(int argc, char **argv)
 			params.resolve_flags &= ~RESOLVE_NO_TRAILING_SYMLINKS;
 			params.flags = FSINFO_FLAGS_QUERY_PATH;
 			continue;
+		case 'm':
+			params.resolve_flags = 0;
+			params.flags = FSINFO_FLAGS_QUERY_MOUNT;
+			continue;
 		case 'r':
 			raw = 1;
 			continue;
@@ -607,6 +611,7 @@ int main(int argc, char **argv)
 
 	if (argc != 1) {
 		printf("Format: test-fsinfo [-Madlr] <path>\n");
+		printf("Format: test-fsinfo [-Mdr] -m <mnt_id>\n");
 		exit(2);
 	}
 




* [PATCH 06/14] fsinfo: Add a uniquifier ID to struct mount [ver #18]
  2020-03-09 14:00 [PATCH 00/14] VFS: Filesystem information [ver #18] David Howells
                   ` (4 preceding siblings ...)
  2020-03-09 14:01 ` [PATCH 05/14] fsinfo: Allow fsinfo() to look up a mount object by ID " David Howells
@ 2020-03-09 14:01 ` David Howells
  2020-03-09 14:01 ` [PATCH 07/14] fsinfo: Allow mount information to be queried " David Howells
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2020-03-09 14:01 UTC (permalink / raw)
  To: torvalds, viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong, kzak,
	jlayton, linux-api, linux-fsdevel, linux-security-module,
	linux-kernel

Add a uniquifier ID to struct mount that is effectively unique over the
kernel lifetime to deal with mnt_id values being reused.  This can then
be exported through fsinfo() to allow detection of replacement mounts that
happen to end up with the same mount ID.

The normal mount handle is still used for referring to a particular mount.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/mount.h     |    3 +++
 fs/namespace.c |    3 +++
 2 files changed, 6 insertions(+)

diff --git a/fs/mount.h b/fs/mount.h
index 381f842f3a27..9afbd2a7f692 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -73,6 +73,9 @@ struct mount {
 	int mnt_expiry_mark;		/* true if marked for expiry */
 	struct hlist_head mnt_pins;
 	struct hlist_head mnt_stuck_children;
+#ifdef CONFIG_FSINFO
+	u64	mnt_unique_id;		/* ID unique over lifetime of kernel */
+#endif
 #ifdef CONFIG_MOUNT_NOTIFICATIONS
 	atomic_t mnt_topology_changes;	/* Number of topology changes applied */
 	atomic_t mnt_attr_changes;	/* Number of attribute changes applied */
diff --git a/fs/namespace.c b/fs/namespace.c
index f33cec5fe885..54e8eb93fdd6 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -115,6 +115,9 @@ static int mnt_alloc_id(struct mount *mnt)
 	if (res < 0)
 		return res;
 	mnt->mnt_id = res;
+#ifdef CONFIG_FSINFO
+	vfs_generate_unique_id(&mnt->mnt_unique_id);
+#endif
 	return 0;
 }
 




* [PATCH 07/14] fsinfo: Allow mount information to be queried [ver #18]
  2020-03-09 14:00 [PATCH 00/14] VFS: Filesystem information [ver #18] David Howells
                   ` (5 preceding siblings ...)
  2020-03-09 14:01 ` [PATCH 06/14] fsinfo: Add a uniquifier ID to struct mount " David Howells
@ 2020-03-09 14:01 ` David Howells
  2020-03-10  9:04   ` Miklos Szeredi
  2020-03-09 14:02 ` [PATCH 08/14] fsinfo: Allow the mount topology propogation flags to be retrieved " David Howells
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 50+ messages in thread
From: David Howells @ 2020-03-09 14:01 UTC (permalink / raw)
  To: torvalds, viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong, kzak,
	jlayton, linux-api, linux-fsdevel, linux-security-module,
	linux-kernel

Allow mount information, including information about the topology tree, to
be queried with the fsinfo() system call.  Setting FSINFO_FLAGS_QUERY_MOUNT
allows overlapping mounts to be queried by indicating that the syscall
should interpret the pathname as a number indicating the mount ID.

To this end, a number of fsinfo() attributes are provided:

 (1) FSINFO_ATTR_MOUNT_INFO.

     This is a structure providing information about a mount, including:

	- Mounted superblock ID (mount ID uniquifier).
	- Mount ID (can be used with FSINFO_FLAGS_QUERY_MOUNT).
	- Parent mount ID.
	- Mount attributes (eg. R/O, NOEXEC).
	- Mount change/notification counter.

     Note that the parent mount ID is overridden to the ID of the queried
     mount if the parent lies outside of the chroot or dfd tree.

 (2) FSINFO_ATTR_MOUNT_PATH.

     This is a string providing information about a bind mount relative to
     the root that was bound off, though it may get overridden by the
     filesystem (NFS unconditionally sets it to "/", for example).

 (3) FSINFO_ATTR_MOUNT_POINT.

     This is a string indicating the name of the mountpoint within the
     parent mount, limited to the parent's mounted root and the chroot.

 (4) FSINFO_ATTR_MOUNT_POINT_FULL.

     This is a string indicating the full path of the mountpoint, limited to
     the chroot.

 (5) FSINFO_ATTR_MOUNT_CHILDREN.

     This produces an array of structures, one for each child and capped
     with one for the argument mount (checked after listing all the
     children).  Each element contains the mount ID and the change counter
     of the respective mount object.
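
The list-building approach used for FSINFO_ATTR_MOUNT_CHILDREN can be
sketched in userspace as follows: every record is accounted for in a running
usage total, but is only copied if it fits, so that the caller can learn how
large a buffer it needs.  The record layout and the example_* names are
hypothetical and only loosely modelled on the code in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical list element; the real layout is defined by the UAPI header. */
struct example_child {
	uint64_t mnt_unique_id;
	uint32_t mnt_id;
	uint32_t pad;
};

struct example_ctx {
	unsigned char *buffer;	/* May be NULL to just measure */
	size_t buf_size;
	size_t usage;		/* Bytes required so far, fitted or not */
};

/* Account for one record and copy it into the buffer only if it fits. */
static void example_store(struct example_ctx *ctx, uint64_t uniq, uint32_t id)
{
	struct example_child rec = { .mnt_unique_id = uniq, .mnt_id = id };
	size_t off;

	assert(ctx != NULL);
	off = ctx->usage;
	ctx->usage = off + sizeof(rec);
	if (ctx->buffer && ctx->usage <= ctx->buf_size)
		memcpy(ctx->buffer + off, &rec, sizeof(rec));
}
```

If the final usage exceeds buf_size, the caller can retry with a buffer of
at least that size; the last record stored is the one for the queried mount
itself.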

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/d_path.c                 |    2 
 fs/fsinfo.c                 |   14 +++
 fs/internal.h               |   10 ++
 fs/namespace.c              |  177 +++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fsinfo.h |   36 +++++++++
 samples/vfs/test-fsinfo.c   |   43 ++++++++++
 6 files changed, 281 insertions(+), 1 deletion(-)

diff --git a/fs/d_path.c b/fs/d_path.c
index 0f1fc1743302..4c203f64e45e 100644
--- a/fs/d_path.c
+++ b/fs/d_path.c
@@ -229,7 +229,7 @@ static int prepend_unreachable(char **buffer, int *buflen)
 	return prepend(buffer, buflen, "(unreachable)", 13);
 }
 
-static void get_fs_root_rcu(struct fs_struct *fs, struct path *root)
+void get_fs_root_rcu(struct fs_struct *fs, struct path *root)
 {
 	unsigned seq;
 
diff --git a/fs/fsinfo.c b/fs/fsinfo.c
index bafeb73feaf4..6d2bc03998e4 100644
--- a/fs/fsinfo.c
+++ b/fs/fsinfo.c
@@ -236,6 +236,14 @@ static int fsinfo_generic_seq_read(struct path *path, struct fsinfo_context *ctx
 			ret = sb->s_op->show_options(&m, path->mnt->mnt_root);
 		break;
 
+	case FSINFO_ATTR_MOUNT_PATH:
+		if (sb->s_op->show_path) {
+			ret = sb->s_op->show_path(&m, path->mnt->mnt_root);
+		} else {
+			seq_dentry(&m, path->mnt->mnt_root, " \t\n\\");
+		}
+		break;
+
 	case FSINFO_ATTR_FS_STATISTICS:
 		if (sb->s_op->show_stats)
 			ret = sb->s_op->show_stats(&m, path->mnt->mnt_root);
@@ -270,6 +278,12 @@ static const struct fsinfo_attribute fsinfo_common_attributes[] = {
 
 	FSINFO_LIST	(FSINFO_ATTR_FSINFO_ATTRIBUTES,	(void *)123UL),
 	FSINFO_VSTRUCT_N(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO, (void *)123UL),
+
+	FSINFO_VSTRUCT	(FSINFO_ATTR_MOUNT_INFO,	fsinfo_generic_mount_info),
+	FSINFO_STRING	(FSINFO_ATTR_MOUNT_PATH,	fsinfo_generic_seq_read),
+	FSINFO_STRING	(FSINFO_ATTR_MOUNT_POINT,	fsinfo_generic_mount_point),
+	FSINFO_STRING	(FSINFO_ATTR_MOUNT_POINT_FULL,	fsinfo_generic_mount_point_full),
+	FSINFO_LIST	(FSINFO_ATTR_MOUNT_CHILDREN,	fsinfo_generic_mount_children),
 	{}
 };
 
diff --git a/fs/internal.h b/fs/internal.h
index abbd5299e7dc..1a318dc85f2f 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -15,6 +15,7 @@ struct mount;
 struct shrink_control;
 struct fs_context;
 struct user_namespace;
+struct fsinfo_context;
 
 /*
  * block_dev.c
@@ -47,6 +48,11 @@ extern int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
  */
 extern void __init chrdev_init(void);
 
+/*
+ * d_path.c
+ */
+extern void get_fs_root_rcu(struct fs_struct *fs, struct path *root);
+
 /*
  * fs_context.c
  */
@@ -93,6 +99,10 @@ extern void __mnt_drop_write_file(struct file *);
 extern void dissolve_on_fput(struct vfsmount *);
 extern int lookup_mount_object(struct path *, int, struct path *);
 extern int fsinfo_generic_mount_source(struct path *, struct fsinfo_context *);
+extern int fsinfo_generic_mount_info(struct path *, struct fsinfo_context *);
+extern int fsinfo_generic_mount_point(struct path *, struct fsinfo_context *);
+extern int fsinfo_generic_mount_point_full(struct path *, struct fsinfo_context *);
+extern int fsinfo_generic_mount_children(struct path *, struct fsinfo_context *);
 
 /*
  * fs_struct.c
diff --git a/fs/namespace.c b/fs/namespace.c
index 54e8eb93fdd6..a6cb8c6b004f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -4149,4 +4149,181 @@ int lookup_mount_object(struct path *root, int mnt_id, struct path *_mntpt)
 	goto out_unlock;
 }
 
+/*
+ * Retrieve information about the nominated mount.
+ */
+int fsinfo_generic_mount_info(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_mount_info *p = ctx->buffer;
+	struct super_block *sb;
+	struct mount *m;
+	struct path root;
+	unsigned int flags;
+
+	m = real_mount(path->mnt);
+	sb = m->mnt.mnt_sb;
+
+	p->sb_unique_id		= sb->s_unique_id;
+	p->mnt_unique_id	= m->mnt_unique_id;
+	p->mnt_id		= m->mnt_id;
+	p->parent_id		= m->mnt_parent->mnt_id;
+
+	get_fs_root(current->fs, &root);
+	if (path->mnt == root.mnt) {
+		p->parent_id = p->mnt_id;
+	} else {
+		rcu_read_lock();
+		if (!are_paths_connected(&root, path))
+			p->parent_id = p->mnt_id;
+		rcu_read_unlock();
+	}
+	if (IS_MNT_SHARED(m))
+		p->group_id = m->mnt_group_id;
+	if (IS_MNT_SLAVE(m)) {
+		int master = m->mnt_master->mnt_group_id;
+		int dom = get_dominating_id(m, &root);
+		p->master_id = master;
+		if (dom && dom != master)
+			p->from_id = dom;
+	}
+	path_put(&root);
+
+	flags = READ_ONCE(m->mnt.mnt_flags);
+	if (flags & MNT_READONLY)
+		p->attr |= MOUNT_ATTR_RDONLY;
+	if (flags & MNT_NOSUID)
+		p->attr |= MOUNT_ATTR_NOSUID;
+	if (flags & MNT_NODEV)
+		p->attr |= MOUNT_ATTR_NODEV;
+	if (flags & MNT_NOEXEC)
+		p->attr |= MOUNT_ATTR_NOEXEC;
+	if (flags & MNT_NODIRATIME)
+		p->attr |= MOUNT_ATTR_NODIRATIME;
+
+	if (flags & MNT_NOATIME)
+		p->attr |= MOUNT_ATTR_NOATIME;
+	else if (flags & MNT_RELATIME)
+		p->attr |= MOUNT_ATTR_RELATIME;
+	else
+		p->attr |= MOUNT_ATTR_STRICTATIME;
+	return sizeof(*p);
+}
+
+/*
+ * Return the path of this mount relative to its parent and clipped to
+ * the current chroot.
+ */
+int fsinfo_generic_mount_point(struct path *path, struct fsinfo_context *ctx)
+{
+	struct mountpoint *mp;
+	struct mount *m, *parent;
+	struct path mountpoint, root;
+	void *p;
+
+	rcu_read_lock();
+
+	m = real_mount(path->mnt);
+	parent = m->mnt_parent;
+	if (parent == m)
+		goto skip;
+	mp = READ_ONCE(m->mnt_mp);
+	if (mp)
+		goto found;
+skip:
+	rcu_read_unlock();
+	return -ENODATA;
+
+found:
+	mountpoint.mnt = &parent->mnt;
+	mountpoint.dentry = READ_ONCE(mp->m_dentry);
+
+	get_fs_root_rcu(current->fs, &root);
+	if (path->mnt == root.mnt) {
+		rcu_read_unlock();
+		return fsinfo_string("/", ctx);
+	}
+
+	if (root.mnt != &parent->mnt) {
+		root.mnt = &parent->mnt;
+		root.dentry = parent->mnt.mnt_root;
+	}
+
+	p = __d_path(&mountpoint, &root, ctx->buffer, ctx->buf_size);
+	rcu_read_unlock();
+
+	if (IS_ERR(p))
+		return PTR_ERR(p);
+	if (!p)
+		return -EPERM;
+
+	ctx->skip = p - ctx->buffer;
+	return (ctx->buffer + ctx->buf_size) - p;
+}
+
+/*
+ * Return the path of this mount from the current chroot.
+ */
+int fsinfo_generic_mount_point_full(struct path *path, struct fsinfo_context *ctx)
+{
+	struct path root;
+	void *p;
+
+	rcu_read_lock();
+	get_fs_root_rcu(current->fs, &root);
+	p = __d_path(path, &root, ctx->buffer, ctx->buf_size);
+	rcu_read_unlock();
+
+	if (IS_ERR(p))
+		return PTR_ERR(p);
+	if (!p)
+		return -EPERM;
+
+	ctx->skip = p - ctx->buffer;
+	return (ctx->buffer + ctx->buf_size) - p;
+}
+
+/*
+ * Store a mount record into the fsinfo buffer.
+ */
+static void fsinfo_store_mount(struct fsinfo_context *ctx, const struct mount *p)
+{
+	struct fsinfo_mount_child record = {};
+	unsigned int usage = ctx->usage;
+
+	if (ctx->usage >= INT_MAX)
+		return;
+	ctx->usage = usage + sizeof(record);
+
+	if (ctx->buffer && ctx->usage <= ctx->buf_size) {
+		record.mnt_unique_id	= p->mnt_unique_id;
+		record.mnt_id		= p->mnt_id;
+		memcpy(ctx->buffer + usage, &record, sizeof(record));
+	}
+}
+
+/*
+ * Return information about the submounts relative to path.
+ */
+int fsinfo_generic_mount_children(struct path *path, struct fsinfo_context *ctx)
+{
+	struct mount *m, *child;
+
+	m = real_mount(path->mnt);
+
+	read_seqlock_excl(&mount_lock);
+
+	list_for_each_entry_rcu(child, &m->mnt_mounts, mnt_child) {
+		if (child->mnt_parent != m)
+			continue;
+		fsinfo_store_mount(ctx, child);
+	}
+
+	/* End the list with a copy of the parameter mount's details so that
+	 * userspace can quickly check for changes.
+	 */
+	fsinfo_store_mount(ctx, m);
+	read_sequnlock_excl(&mount_lock);
+	return ctx->usage;
+}
+
 #endif /* CONFIG_FSINFO */
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
index 491a59e8cc95..7a8b577f54b7 100644
--- a/include/uapi/linux/fsinfo.h
+++ b/include/uapi/linux/fsinfo.h
@@ -31,6 +31,12 @@
 #define FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO 0x100	/* Information about attr N (for path) */
 #define FSINFO_ATTR_FSINFO_ATTRIBUTES	0x101	/* List of supported attrs (for path) */
 
+#define FSINFO_ATTR_MOUNT_INFO		0x200	/* Mount object information */
+#define FSINFO_ATTR_MOUNT_PATH		0x201	/* Bind mount/superblock path (string) */
+#define FSINFO_ATTR_MOUNT_POINT		0x202	/* Relative path of mount in parent (string) */
+#define FSINFO_ATTR_MOUNT_POINT_FULL	0x203	/* Absolute path of mount (string) */
+#define FSINFO_ATTR_MOUNT_CHILDREN	0x204	/* Children of this mount (list) */
+
 /*
  * Optional fsinfo() parameter structure.
  *
@@ -71,6 +77,7 @@ struct fsinfo_attribute_info {
 	unsigned int		size;		/* - Value size (FSINFO_STRUCT/FSINFO_LIST) */
 };
 
+#define FSINFO_ATTR_FSINFO_ATTRIBUTES__STRUCT __u32
 #define FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO__STRUCT struct fsinfo_attribute_info
 #define FSINFO_ATTR_FSINFO_ATTRIBUTES__STRUCT __u32
 
@@ -84,6 +91,35 @@ struct fsinfo_u128 {
 #endif
 };
 
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_MOUNT_INFO).
+ */
+struct fsinfo_mount_info {
+	__u64	sb_unique_id;		/* Kernel-lifetime unique superblock ID */
+	__u64	mnt_unique_id;		/* Kernel-lifetime unique mount ID */
+	__u32	mnt_id;			/* Mount identifier (use with AT_FSINFO_MOUNTID_PATH) */
+	__u32	parent_id;		/* Parent mount identifier */
+	__u32	group_id;		/* Mount group ID */
+	__u32	master_id;		/* Slave master group ID */
+	__u32	from_id;		/* Slave propagated from ID */
+	__u32	attr;			/* MOUNT_ATTR_* flags */
+	__u32	__padding[1];
+};
+
+#define FSINFO_ATTR_MOUNT_INFO__STRUCT struct fsinfo_mount_info
+
+/*
+ * Information struct element for fsinfo(FSINFO_ATTR_MOUNT_CHILDREN).
+ * - An extra element is placed on the end representing the parent mount.
+ */
+struct fsinfo_mount_child {
+	__u64	mnt_unique_id;		/* Kernel-lifetime unique mount ID */
+	__u32	mnt_id;			/* Mount identifier (use with AT_FSINFO_MOUNTID_PATH) */
+	__u32	__padding[1];
+};
+
+#define FSINFO_ATTR_MOUNT_CHILDREN__STRUCT struct fsinfo_mount_child
+
 /*
  * Information struct for fsinfo(FSINFO_ATTR_STATFS).
  * - This gives extended filesystem information.
diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
index c407bda4134f..2f9fe3b24bca 100644
--- a/samples/vfs/test-fsinfo.c
+++ b/samples/vfs/test-fsinfo.c
@@ -288,6 +288,43 @@ static void dump_fsinfo_generic_volume_uuid(void *reply, unsigned int size)
 	       f->uuid[14], f->uuid[15]);
 }
 
+static void dump_fsinfo_generic_mount_info(void *reply, unsigned int size)
+{
+	struct fsinfo_mount_info *r = reply;
+
+	printf("\n");
+	printf("\tsb_uniq : %llx\n", (unsigned long long)r->sb_unique_id);
+	printf("\tmnt_uniq: %llx\n", (unsigned long long)r->mnt_unique_id);
+	printf("\tmnt_id  : %x\n", r->mnt_id);
+	printf("\tparent  : %x\n", r->parent_id);
+	printf("\tgroup   : %x\n", r->group_id);
+	printf("\tattr    : %x\n", r->attr);
+}
+
+static void dump_fsinfo_generic_mount_child(void *reply, unsigned int size)
+{
+	struct fsinfo_mount_child *r = reply;
+	ssize_t mplen;
+	char path[32], *mp;
+
+	struct fsinfo_params params = {
+		.flags		= FSINFO_FLAGS_QUERY_MOUNT,
+		.request	= FSINFO_ATTR_MOUNT_POINT,
+	};
+
+	if (!list_last) {
+		sprintf(path, "%u", r->mnt_id);
+		mplen = get_fsinfo(path, "FSINFO_ATTR_MOUNT_POINT", &params, (void **)&mp);
+		if (mplen < 0)
+			mp = "-";
+	} else {
+		mp = "<this>";
+	}
+
+	printf("%8x %16llx %s\n",
+	       r->mnt_id, (unsigned long long)r->mnt_unique_id, mp);
+}
+
 static void dump_string(void *reply, unsigned int size)
 {
 	char *s = reply, *p;
@@ -364,6 +401,12 @@ static const struct fsinfo_attribute fsinfo_attributes[] = {
 
 	FSINFO_VSTRUCT_N(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO, fsinfo_meta_attribute_info),
 	FSINFO_LIST	(FSINFO_ATTR_FSINFO_ATTRIBUTES,	fsinfo_meta_attributes),
+
+	FSINFO_VSTRUCT	(FSINFO_ATTR_MOUNT_INFO,	fsinfo_generic_mount_info),
+	FSINFO_STRING	(FSINFO_ATTR_MOUNT_PATH,	string),
+	FSINFO_STRING_N	(FSINFO_ATTR_MOUNT_POINT,	string),
+	FSINFO_STRING_N	(FSINFO_ATTR_MOUNT_POINT_FULL,	string),
+	FSINFO_LIST	(FSINFO_ATTR_MOUNT_CHILDREN,	fsinfo_generic_mount_child),
 	{}
 };
 




* [PATCH 08/14] fsinfo: Allow the mount topology propogation flags to be retrieved [ver #18]
  2020-03-09 14:00 [PATCH 00/14] VFS: Filesystem information [ver #18] David Howells
                   ` (6 preceding siblings ...)
  2020-03-09 14:01 ` [PATCH 07/14] fsinfo: Allow mount information to be queried " David Howells
@ 2020-03-09 14:02 ` David Howells
  2020-03-10  8:42   ` Christian Brauner
  2020-03-09 14:02 ` [PATCH 09/14] fsinfo: Provide notification overrun handling support " David Howells
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 50+ messages in thread
From: David Howells @ 2020-03-09 14:02 UTC (permalink / raw)
  To: torvalds, viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong, kzak,
	jlayton, linux-api, linux-fsdevel, linux-security-module,
	linux-kernel

Allow the mount topology propagation flags to be retrieved as part of the
FSINFO_ATTR_MOUNT_INFO attributes.
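
As a rough userspace illustration of how the new field might be consumed,
the sketch below turns a propagation mask into text.  The flag values are
copied from the include/uapi/linux/mount.h hunk of this patch;
propagation_to_string() is a hypothetical helper that is not part of the
series, and the mask is assumed to have already been read from
fsinfo_mount_info::propagation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Values copied from this patch's include/uapi/linux/mount.h hunk. */
#define MOUNT_PROPAGATION_UNBINDABLE	0x00000001
#define MOUNT_PROPAGATION_SLAVE		0x00000002
#define MOUNT_PROPAGATION_PRIVATE	0x00000000
#define MOUNT_PROPAGATION_SHARED	0x00000004

/*
 * Hypothetical helper (not part of the patch): render a propagation mask
 * as a comma-separated list.  A mount may be both shared and slave, so the
 * bits are tested independently.  "private" is reported only when none of
 * the known bits is set, since MOUNT_PROPAGATION_PRIVATE is zero.
 */
static const char *propagation_to_string(uint32_t propagation,
					  char *buf, size_t size)
{
	static const struct {
		uint32_t	bit;
		const char	*name;
	} names[] = {
		{ MOUNT_PROPAGATION_SHARED,	"shared" },
		{ MOUNT_PROPAGATION_SLAVE,	"slave" },
		{ MOUNT_PROPAGATION_UNBINDABLE,	"unbindable" },
	};
	const uint32_t known = MOUNT_PROPAGATION_SHARED |
			       MOUNT_PROPAGATION_SLAVE |
			       MOUNT_PROPAGATION_UNBINDABLE;
	size_t used = 0, i;

	assert(buf && size > 0);
	buf[0] = '\0';

	if ((propagation & known) == MOUNT_PROPAGATION_PRIVATE) {
		snprintf(buf, size, "private");
		return buf;
	}

	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		int n;

		if (!(propagation & names[i].bit))
			continue;
		n = snprintf(buf + used, size - used, "%s%s",
			     used ? "," : "", names[i].name);
		if (n < 0 || (size_t)n >= size - used)
			break;	/* Output truncated; stop appending */
		used += n;
	}
	return buf;
}
```

With these values a mask of 0x4 renders as "shared" and 0x6 as
"shared,slave".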

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/namespace.c              |    7 ++++++-
 include/uapi/linux/fsinfo.h |    2 +-
 include/uapi/linux/mount.h  |   10 +++++++++-
 samples/vfs/test-fsinfo.c   |    1 +
 4 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index a6cb8c6b004f..88aef45bcfa8 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -4177,15 +4177,20 @@ int fsinfo_generic_mount_info(struct path *path, struct fsinfo_context *ctx)
 			p->parent_id = p->mnt_id;
 		rcu_read_unlock();
 	}
-	if (IS_MNT_SHARED(m))
+	if (IS_MNT_SHARED(m)) {
 		p->group_id = m->mnt_group_id;
+		p->propagation |= MOUNT_PROPAGATION_SHARED;
+	}
 	if (IS_MNT_SLAVE(m)) {
 		int master = m->mnt_master->mnt_group_id;
 		int dom = get_dominating_id(m, &root);
 		p->master_id = master;
 		if (dom && dom != master)
 			p->from_id = dom;
+		p->propagation |= MOUNT_PROPAGATION_SLAVE;
 	}
+	if (IS_MNT_UNBINDABLE(m))
+		p->propagation |= MOUNT_PROPAGATION_UNBINDABLE;
 	path_put(&root);
 
 	flags = READ_ONCE(m->mnt.mnt_flags);
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
index 7a8b577f54b7..909d6104933b 100644
--- a/include/uapi/linux/fsinfo.h
+++ b/include/uapi/linux/fsinfo.h
@@ -103,7 +103,7 @@ struct fsinfo_mount_info {
 	__u32	master_id;		/* Slave master group ID */
 	__u32	from_id;		/* Slave propagated from ID */
 	__u32	attr;			/* MOUNT_ATTR_* flags */
-	__u32	__padding[1];
+	__u32	propagation;		/* MOUNT_PROPAGATION_* flags */
 };
 
 #define FSINFO_ATTR_MOUNT_INFO__STRUCT struct fsinfo_mount_info
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index 96a0240f23fe..39e50fe9d8d9 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -105,7 +105,7 @@ enum fsconfig_command {
 #define FSMOUNT_CLOEXEC		0x00000001
 
 /*
- * Mount attributes.
+ * Mount object attributes (these are separate to filesystem attributes).
  */
 #define MOUNT_ATTR_RDONLY	0x00000001 /* Mount read-only */
 #define MOUNT_ATTR_NOSUID	0x00000002 /* Ignore suid and sgid bits */
@@ -117,4 +117,12 @@ enum fsconfig_command {
 #define MOUNT_ATTR_STRICTATIME	0x00000020 /* - Always perform atime updates */
 #define MOUNT_ATTR_NODIRATIME	0x00000080 /* Do not update directory access times */
 
+/*
+ * Mount object propagation attributes.
+ */
+#define MOUNT_PROPAGATION_UNBINDABLE	0x00000001 /* Mount is unbindable */
+#define MOUNT_PROPAGATION_SLAVE		0x00000002 /* Mount is slave */
+#define MOUNT_PROPAGATION_PRIVATE	0x00000000 /* Mount is private (ie. not shared) */
+#define MOUNT_PROPAGATION_SHARED	0x00000004 /* Mount is shared */
+
 #endif /* _UAPI_LINUX_MOUNT_H */
diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
index 2f9fe3b24bca..bdc7ea952630 100644
--- a/samples/vfs/test-fsinfo.c
+++ b/samples/vfs/test-fsinfo.c
@@ -299,6 +299,7 @@ static void dump_fsinfo_generic_mount_info(void *reply, unsigned int size)
 	printf("\tparent  : %x\n", r->parent_id);
 	printf("\tgroup   : %x\n", r->group_id);
 	printf("\tattr    : %x\n", r->attr);
+	printf("\tpropag  : %x\n", r->propagation);
 }
 
 static void dump_fsinfo_generic_mount_child(void *reply, unsigned int size)




* [PATCH 09/14] fsinfo: Provide notification overrun handling support [ver #18]
  2020-03-09 14:00 [PATCH 00/14] VFS: Filesystem information [ver #18] David Howells
                   ` (7 preceding siblings ...)
  2020-03-09 14:02 ` [PATCH 08/14] fsinfo: Allow the mount topology propogation flags to be retrieved " David Howells
@ 2020-03-09 14:02 ` David Howells
  2020-03-09 14:02 ` [PATCH 10/14] fsinfo: sample: Mount listing program " David Howells
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2020-03-09 14:02 UTC (permalink / raw)
  To: torvalds, viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong, kzak,
	jlayton, linux-api, linux-fsdevel, linux-security-module,
	linux-kernel

Provide support for the handling of an overrun in a watch queue.  In the
event that an overrun occurs, the watcher needs to be able to find out what
it was that they missed.  To this end, previous patches added event
counters to the superblock and mount object structures.

The counters can be read using fsinfo() and the
FSINFO_ATTR_MOUNT_INFO attribute.

	struct fsinfo_mount_info {
		__u64	mnt_unique_id;
		__u32	sb_changes;
		__u32	sb_notifications;
		__u32	mnt_attr_changes;
		__u32	mnt_topology_changes;
		__u32	mnt_subtree_notifications;
	...
	};

There's a uniquifier and five event counters:

 (1) mnt_unique_id - This is an effectively non-repeating ID given to each
     mount object on creation.  This allows the caller to check that the
     mount ID didn't get reused (the 32-bit mount ID is more efficient to
     look up).

 (2) sb_changes - Count of superblock configuration changes.

 (3) sb_notifications - Count of other superblock notifications (errors,
     quota overruns, etc.).

 (4) mnt_attr_changes - Count of attribute changes on a mount object.

 (5) mnt_topology_changes - Count of alterations to the mount tree that
     affected this node.

 (6) mnt_subtree_notifications - Count of mount object event notifications
     that were generated in the subtree rooted at this node.  This excludes
     events generated on this node itself and does not include superblock
     events.

The counters are also accessible through the FSINFO_ATTR_MOUNT_CHILDREN
attribute, where a list of all the children of a mount can be scanned.  The
record returned for each child includes the sum of the above five counters
for that child.  An additional record is added at the end for the queried
object and that also includes the sum of its five counters.
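
One way a watcher might use this after an overrun is to keep the previous
FSINFO_ATTR_MOUNT_CHILDREN list and compare it with a fresh one, descending
only into children whose sum moved.  The following is a minimal sketch
under that assumption: struct mount_child_rec mirrors the
fsinfo_mount_child layout from this patch, and find_changed_children() is a
hypothetical helper, not something the series provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors struct fsinfo_mount_child as laid out by this patch. */
struct mount_child_rec {
	uint64_t	mnt_unique_id;	/* Kernel-lifetime unique mount ID */
	uint32_t	mnt_id;		/* Mount identifier (may be reused) */
	uint32_t	notify_sum;	/* Sum of the five event counters */
};

/*
 * Hypothetical helper: compare an old and a current snapshot of a children
 * list.  A current record counts as changed if no old record has the same
 * mnt_unique_id (a new mount, or a reused mnt_id) or if its notify_sum
 * differs.  Indices into cur[] are stored in changed[] up to max_changed;
 * the total number of changed records is returned.
 */
static size_t find_changed_children(const struct mount_child_rec *old, size_t n_old,
				    const struct mount_child_rec *cur, size_t n_cur,
				    size_t *changed, size_t max_changed)
{
	size_t i, j, n = 0;

	for (i = 0; i < n_cur; i++) {
		int same = 0;

		for (j = 0; j < n_old; j++) {
			if (old[j].mnt_unique_id != cur[i].mnt_unique_id)
				continue;
			same = (old[j].notify_sum == cur[i].notify_sum);
			break;
		}
		if (same)
			continue;
		if (n < max_changed)
			changed[n] = i;
		n++;
	}
	return n;
}
```

The sketch matches on mnt_unique_id rather than mnt_id because the 32-bit
ID can be reused.  It does not report children that have disappeared; the
trailing record for the queried mount is compared like any other, so a
change there is a hint to look for removals.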

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/namespace.c              |   31 ++++++++++++++++++++++++++-----
 include/uapi/linux/fsinfo.h |    9 ++++++++-
 samples/vfs/test-fsinfo.c   |    7 +++++--
 3 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 88aef45bcfa8..2b651003e6af 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -4167,6 +4167,15 @@ int fsinfo_generic_mount_info(struct path *path, struct fsinfo_context *ctx)
 	p->mnt_unique_id	= m->mnt_unique_id;
 	p->mnt_id		= m->mnt_id;
 	p->parent_id		= m->mnt_parent->mnt_id;
+#ifdef CONFIG_SB_NOTIFICATIONS
+	p->sb_changes		= atomic_read(&sb->s_change_counter);
+	p->sb_notifications	= atomic_read(&sb->s_notify_counter);
+#endif
+#ifdef CONFIG_MOUNT_NOTIFICATIONS
+	p->mnt_attr_changes	= atomic_read(&m->mnt_attr_changes);
+	p->mnt_topology_changes	= atomic_read(&m->mnt_topology_changes);
+	p->mnt_subtree_notifications = atomic_read(&m->mnt_subtree_notifications);
+#endif
 
 	get_fs_root(current->fs, &root);
 	if (path->mnt == root.mnt) {
@@ -4293,17 +4302,29 @@ int fsinfo_generic_mount_point_full(struct path *path, struct fsinfo_context *ct
 static void fsinfo_store_mount(struct fsinfo_context *ctx, const struct mount *p)
 {
 	struct fsinfo_mount_child record = {};
+	const struct super_block *sb = p->mnt.mnt_sb;
 	unsigned int usage = ctx->usage;
 
 	if (ctx->usage >= INT_MAX)
 		return;
 	ctx->usage = usage + sizeof(record);
+	if (!ctx->buffer || ctx->usage > ctx->buf_size)
+		return;
 
-	if (ctx->buffer && ctx->usage <= ctx->buf_size) {
-		record.mnt_unique_id	= p->mnt_unique_id;
-		record.mnt_id		= p->mnt_id;
-		memcpy(ctx->buffer + usage, &record, sizeof(record));
-	}
+	record.mnt_unique_id	= p->mnt_unique_id;
+	record.mnt_id		= p->mnt_id;
+	record.notify_sum	= 0;
+#ifdef CONFIG_SB_NOTIFICATIONS
+	record.notify_sum	+= (atomic_read(&sb->s_change_counter) +
+				    atomic_read(&sb->s_notify_counter));
+#endif
+#ifdef CONFIG_MOUNT_NOTIFICATIONS
+	record.notify_sum	+= (atomic_read(&p->mnt_attr_changes) +
+				    atomic_read(&p->mnt_topology_changes) +
+				    atomic_read(&p->mnt_subtree_notifications));
+#endif
+
+	memcpy(ctx->buffer + usage, &record, sizeof(record));
 }
 
 /*
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
index 909d6104933b..826b788b0795 100644
--- a/include/uapi/linux/fsinfo.h
+++ b/include/uapi/linux/fsinfo.h
@@ -104,6 +104,11 @@ struct fsinfo_mount_info {
 	__u32	from_id;		/* Slave propagated from ID */
 	__u32	attr;			/* MOUNT_ATTR_* flags */
 	__u32	propagation;		/* MOUNT_PROPAGATION_* flags */
+	__u32	sb_changes;		/* Number of sb configuration changes */
+	__u32	sb_notifications;	/* Number of other sb notifications */
+	__u32	mnt_attr_changes;	/* Number of attribute changes to this mount. */
+	__u32	mnt_topology_changes;	/* Number of topology changes to this mount. */
+	__u32	mnt_subtree_notifications; /* Number of notifications in mount subtree */
 };
 
 #define FSINFO_ATTR_MOUNT_INFO__STRUCT struct fsinfo_mount_info
@@ -115,7 +120,9 @@ struct fsinfo_mount_info {
 struct fsinfo_mount_child {
 	__u64	mnt_unique_id;		/* Kernel-lifetime unique mount ID */
 	__u32	mnt_id;			/* Mount identifier (use with AT_FSINFO_MOUNTID_PATH) */
-	__u32	__padding[1];
+	__u32	notify_sum;		/* Sum of sb_changes, sb_notifications, mnt_attr_changes,
+					 * mnt_topology_changes and mnt_subtree_notifications.
+					 */
 };
 
 #define FSINFO_ATTR_MOUNT_CHILDREN__STRUCT struct fsinfo_mount_child
diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
index bdc7ea952630..91434f459ba5 100644
--- a/samples/vfs/test-fsinfo.c
+++ b/samples/vfs/test-fsinfo.c
@@ -300,6 +300,9 @@ static void dump_fsinfo_generic_mount_info(void *reply, unsigned int size)
 	printf("\tgroup   : %x\n", r->group_id);
 	printf("\tattr    : %x\n", r->attr);
 	printf("\tpropag  : %x\n", r->propagation);
+	printf("\tsb_nfy  : changes=%u other=%u\n", r->sb_changes, r->sb_notifications);
+	printf("\tmnt_nfy : attr=%u topology=%u subtree=%u\n",
+	       r->mnt_attr_changes, r->mnt_topology_changes, r->mnt_subtree_notifications);
 }
 
 static void dump_fsinfo_generic_mount_child(void *reply, unsigned int size)
@@ -322,8 +325,8 @@ static void dump_fsinfo_generic_mount_child(void *reply, unsigned int size)
 		mp = "<this>";
 	}
 
-	printf("%8x %16llx %s\n",
-	       r->mnt_id, (unsigned long long)r->mnt_unique_id, mp);
+	printf("%8x %16llx %10u %s\n",
+	       r->mnt_id, (unsigned long long)r->mnt_unique_id, r->notify_sum, mp);
 }
 
 static void dump_string(void *reply, unsigned int size)




* [PATCH 10/14] fsinfo: sample: Mount listing program [ver #18]
  2020-03-09 14:00 [PATCH 00/14] VFS: Filesystem information [ver #18] David Howells
                   ` (8 preceding siblings ...)
  2020-03-09 14:02 ` [PATCH 09/14] fsinfo: Provide notification overrun handling support " David Howells
@ 2020-03-09 14:02 ` David Howells
  2020-03-09 14:02 ` [PATCH 11/14] fsinfo: Add API documentation " David Howells
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2020-03-09 14:02 UTC (permalink / raw)
  To: torvalds, viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong, kzak,
	jlayton, linux-api, linux-fsdevel, linux-security-module,
	linux-kernel

Implement a program to demonstrate mount listing using the new fsinfo()
syscall.  For example, to dump the tree from mount 21:

# ./test-mntinfo -m 21
MOUNT                            MOUNT ID   CHANGE#  AT P DEV   TYPE
-------------------------------- ---------- -------- -- - ----- --------
21                                       21        0  e 4  0:14 sysfs
 \_ kernel/security                      24        0  e 4   0:8 securityfs
 \_ fs/cgroup                            28        4 2f 4  0:18 tmpfs
 |   \_ unified                          29        0  e 4  0:19 cgroup2
 |   \_ systemd                          30        0  e 4  0:1a cgroup
 |   \_ blkio                            34        0  e 4  0:1e cgroup
 |   \_ net_cls,net_prio                 35        0  e 4  0:1f cgroup
 |   \_ perf_event                       36        0  e 4  0:20 cgroup
 |   \_ freezer                          37        0  e 4  0:21 cgroup
 |   \_ devices                          38        0  e 4  0:22 cgroup
 |   \_ cpu,cpuacct                      39        0  e 4  0:23 cgroup
 |   \_ rdma                             40        0  e 4  0:24 cgroup
 |   \_ memory                           41        0  e 4  0:25 cgroup
 |   \_ cpuset                           42        0  e 4  0:26 cgroup
 |   \_ hugetlb                          43        0  e 4  0:27 cgroup
 \_ fs/pstore                            31        0  e 4  0:1b pstore
 \_ firmware/efi/efivars                 32        0  e 4  0:1c efivarfs
 \_ fs/bpf                               33        0  e 4  0:1d bpf
 \_ kernel/config                        92        0  0 4  0:28 configfs
 \_ fs/selinux                           44        0  0 4  0:11 selinuxfs
 \_ kernel/debug                         45        1  0 4   0:7 debugfs
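
The sample walks the children list using an element size queried from the
kernel rather than sizeof(struct fsinfo_mount_child).  The sketch below
shows that technique in isolation, assuming the list has already been read
into a buffer.  read_list_element() and count_children() are hypothetical
helpers written for illustration, and struct mount_child_rec mirrors the
fsinfo_mount_child layout used by this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors struct fsinfo_mount_child as used by this series. */
struct mount_child_rec {
	uint64_t	mnt_unique_id;
	uint32_t	mnt_id;
	uint32_t	notify_sum;
};

/*
 * Hypothetical helper: copy element idx out of a list whose elements are
 * stride bytes apart.  Only the bytes common to both layouts are copied;
 * any fields the kernel did not supply are left zeroed.  Returns 0 on
 * success or -1 if idx is out of range.
 */
static int read_list_element(const void *list, size_t list_size, size_t stride,
			     size_t idx, struct mount_child_rec *out)
{
	size_t copy = stride < sizeof(*out) ? stride : sizeof(*out);

	if (stride == 0 || idx >= list_size / stride)
		return -1;

	memset(out, 0, sizeof(*out));
	memcpy(out, (const unsigned char *)list + idx * stride, copy);
	return 0;
}

/*
 * Hypothetical helper: number of real children in the list.  The final
 * element describes the queried mount itself, so it is not counted.
 */
static size_t count_children(size_t list_size, size_t stride)
{
	size_t n = stride ? list_size / stride : 0;

	return n ? n - 1 : 0;
}
```

Copying the smaller of the two sizes is the same precaution that
display_mount() takes with children_list_interval.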

Signed-off-by: David Howells <dhowells@redhat.com>
---

 samples/vfs/Makefile       |    2 
 samples/vfs/test-mntinfo.c |  277 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 279 insertions(+)
 create mode 100644 samples/vfs/test-mntinfo.c

diff --git a/samples/vfs/Makefile b/samples/vfs/Makefile
index 9159ad1d7fc5..19be60ab950e 100644
--- a/samples/vfs/Makefile
+++ b/samples/vfs/Makefile
@@ -4,12 +4,14 @@
 hostprogs := \
 	test-fsinfo \
 	test-fsmount \
+	test-mntinfo \
 	test-statx
 
 always-y := $(hostprogs)
 
 HOSTCFLAGS_test-fsinfo.o += -I$(objtree)/usr/include
 HOSTLDLIBS_test-fsinfo += -static -lm
+HOSTCFLAGS_test-mntinfo.o += -I$(objtree)/usr/include
 
 HOSTCFLAGS_test-fsmount.o += -I$(objtree)/usr/include
 HOSTCFLAGS_test-statx.o += -I$(objtree)/usr/include
diff --git a/samples/vfs/test-mntinfo.c b/samples/vfs/test-mntinfo.c
new file mode 100644
index 000000000000..5d2eb483e3e5
--- /dev/null
+++ b/samples/vfs/test-mntinfo.c
@@ -0,0 +1,277 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Test the fsinfo() system call
+ *
+ * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#define _GNU_SOURCE
+#define _ATFILE_SOURCE
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <unistd.h>
+#include <ctype.h>
+#include <errno.h>
+#include <time.h>
+#include <math.h>
+#include <sys/syscall.h>
+#include <linux/fsinfo.h>
+#include <linux/socket.h>
+#include <linux/fcntl.h>
+#include <sys/stat.h>
+#include <arpa/inet.h>
+
+#ifndef __NR_fsinfo
+#define __NR_fsinfo -1
+#endif
+
+static __attribute__((unused))
+ssize_t fsinfo(int dfd, const char *filename,
+	       struct fsinfo_params *params, size_t params_size,
+	       void *result_buffer, size_t result_buf_size)
+{
+	return syscall(__NR_fsinfo, dfd, filename,
+		       params, params_size,
+		       result_buffer, result_buf_size);
+}
+
+static char tree_buf[4096];
+static char bar_buf[4096];
+static unsigned int children_list_interval;
+
+/*
+ * Get an fsinfo attribute in a statically allocated buffer.
+ */
+static void get_attr(unsigned int mnt_id, unsigned int attr, unsigned int Nth,
+		     void *buf, size_t buf_size)
+{
+	struct fsinfo_params params = {
+		.flags		= FSINFO_FLAGS_QUERY_MOUNT,
+		.request	= attr,
+		.Nth		= Nth,
+	};
+	char file[32];
+	long ret;
+
+	sprintf(file, "%u", mnt_id);
+
+	memset(buf, 0xbd, buf_size);
+
+	ret = fsinfo(AT_FDCWD, file, &params, sizeof(params), buf, buf_size);
+	if (ret == -1) {
+		fprintf(stderr, "mount-%s: %m\n", file);
+		exit(1);
+	}
+}
+
+/*
+ * Get an fsinfo attribute in a dynamically allocated buffer.
+ */
+static void *get_attr_alloc(unsigned int mnt_id, unsigned int attr,
+			    unsigned int Nth, size_t *_size)
+{
+	struct fsinfo_params params = {
+		.flags		= FSINFO_FLAGS_QUERY_MOUNT,
+		.request	= attr,
+		.Nth		= Nth,
+	};
+	size_t buf_size = 4096;
+	char file[32];
+	void *r;
+	long ret;
+
+	sprintf(file, "%u", mnt_id);
+
+	for (;;) {
+		r = malloc(buf_size);
+		if (!r) {
+			perror("malloc");
+			exit(1);
+		}
+		memset(r, 0xbd, buf_size);
+
+		ret = fsinfo(AT_FDCWD, file, &params, sizeof(params), r, buf_size);
+		if (ret == -1) {
+			fprintf(stderr, "mount-%s: %x,%x,%x %m\n",
+				file, params.request, params.Nth, params.Mth);
+			exit(1);
+		}
+
+		if (ret <= buf_size) {
+			*_size = ret;
+			break;
+		}
+		buf_size = (ret + 4096 - 1) & ~(4096 - 1);
+	}
+
+	return r;
+}
+
+/*
+ * Display a mount and then recurse through its children.
+ */
+static void display_mount(unsigned int mnt_id, unsigned int depth, char *path)
+{
+	struct fsinfo_mount_child child;
+	struct fsinfo_mount_info info;
+	struct fsinfo_ids ids;
+	void *children;
+	unsigned int d;
+	size_t ch_size, p_size;
+	char dev[64];
+	int i, n, s;
+
+	get_attr(mnt_id, FSINFO_ATTR_MOUNT_INFO, 0, &info, sizeof(info));
+	get_attr(mnt_id, FSINFO_ATTR_IDS, 0, &ids, sizeof(ids));
+	if (depth > 0)
+		printf("%s", tree_buf);
+
+	s = strlen(path);
+	printf("%s", !s ? "\"\"" : path);
+	if (!s)
+		s += 2;
+	s += depth;
+	if (s < 38)
+		s = 38 - s;
+	else
+		s = 1;
+	printf("%*.*s", s, s, "");
+
+	sprintf(dev, "%x:%x", ids.f_dev_major, ids.f_dev_minor);
+	printf("%10u %8x %2x %x %5s %s",
+	       info.mnt_id,
+	       (info.sb_changes +
+		info.sb_notifications +
+		info.mnt_attr_changes +
+		info.mnt_topology_changes +
+		info.mnt_subtree_notifications),
+	       info.attr, info.propagation,
+	       dev, ids.f_fs_name);
+	putchar('\n');
+
+	children = get_attr_alloc(mnt_id, FSINFO_ATTR_MOUNT_CHILDREN, 0, &ch_size);
+	n = ch_size / children_list_interval - 1;
+
+	bar_buf[depth + 1] = '|';
+	if (depth > 0) {
+		tree_buf[depth - 4 + 1] = bar_buf[depth - 4 + 1];
+		tree_buf[depth - 4 + 2] = ' ';
+	}
+
+	tree_buf[depth + 0] = ' ';
+	tree_buf[depth + 1] = '\\';
+	tree_buf[depth + 2] = '_';
+	tree_buf[depth + 3] = ' ';
+	tree_buf[depth + 4] = 0;
+	d = depth + 4;
+
+	memset(&child, 0, sizeof(child));
+	for (i = 0; i < n; i++) {
+		void *p = children + i * children_list_interval;
+
+		if (sizeof(child) >= children_list_interval)
+			memcpy(&child, p, children_list_interval);
+		else
+			memcpy(&child, p, sizeof(child));
+
+		if (i == n - 1)
+			bar_buf[depth + 1] = ' ';
+		path = get_attr_alloc(child.mnt_id, FSINFO_ATTR_MOUNT_POINT,
+				      0, &p_size);
+		display_mount(child.mnt_id, d, path + 1);
+		free(path);
+	}
+
+	free(children);
+	if (depth > 0) {
+		tree_buf[depth - 4 + 1] = '\\';
+		tree_buf[depth - 4 + 2] = '_';
+	}
+	tree_buf[depth] = 0;
+}
+
+/*
+ * Find the ID of whatever is at the nominated path.
+ */
+static unsigned int lookup_mnt_by_path(const char *path)
+{
+	struct fsinfo_mount_info mnt;
+	struct fsinfo_params params = {
+		.flags		= FSINFO_FLAGS_QUERY_PATH,
+		.request	= FSINFO_ATTR_MOUNT_INFO,
+	};
+
+	if (fsinfo(AT_FDCWD, path, &params, sizeof(params), &mnt, sizeof(mnt)) == -1) {
+		perror(path);
+		exit(1);
+	}
+
+	return mnt.mnt_id;
+}
+
+/*
+ * Determine the element size for the mount child list.
+ */
+static unsigned int query_list_element_size(int mnt_id, unsigned int attr)
+{
+	struct fsinfo_attribute_info attr_info;
+
+	get_attr(mnt_id, FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO, attr,
+		 &attr_info, sizeof(attr_info));
+	return attr_info.size;
+}
+
+/*
+ * Parse the command line and display the tree below the selected mount.
+ */
+int main(int argc, char **argv)
+{
+	unsigned int mnt_id;
+	char *path;
+	bool use_mnt_id = false;
+	int opt;
+
+	while ((opt = getopt(argc, argv, "m")) != -1) {
+		switch (opt) {
+		case 'm':
+			use_mnt_id = true;
+			continue;
+		}
+		break;
+	}
+
+	argc -= optind;
+	argv += optind;
+
+	switch (argc) {
+	case 0:
+		mnt_id = lookup_mnt_by_path("/");
+		path = "ROOT";
+		break;
+	case 1:
+		path = argv[0];
+		if (use_mnt_id) {
+			mnt_id = strtoul(argv[0], NULL, 0);
+			break;
+		}
+
+		mnt_id = lookup_mnt_by_path(argv[0]);
+		break;
+	default:
+		printf("Format: test-mntinfo\n");
+		printf("Format: test-mntinfo <path>\n");
+		printf("Format: test-mntinfo -m <mnt_id>\n");
+		exit(2);
+	}
+
+	children_list_interval =
+		query_list_element_size(mnt_id, FSINFO_ATTR_MOUNT_CHILDREN);
+
+	printf("MOUNT                                 MOUNT ID   CHANGE#  AT P DEV   TYPE\n");
+	printf("------------------------------------- ---------- -------- -- - ----- --------\n");
+	display_mount(mnt_id, 0, path);
+	return 0;
+}




* [PATCH 11/14] fsinfo: Add API documentation [ver #18]
  2020-03-09 14:00 [PATCH 00/14] VFS: Filesystem information [ver #18] David Howells
                   ` (9 preceding siblings ...)
  2020-03-09 14:02 ` [PATCH 10/14] fsinfo: sample: Mount listing program " David Howells
@ 2020-03-09 14:02 ` David Howells
  2020-03-09 14:02 ` [PATCH 12/14] fsinfo: Add support for AFS " David Howells
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2020-03-09 14:02 UTC (permalink / raw)
  To: torvalds, viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong, kzak,
	jlayton, linux-api, linux-fsdevel, linux-security-module,
	linux-kernel

Add API documentation for fsinfo.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 Documentation/filesystems/fsinfo.rst |  564 ++++++++++++++++++++++++++++++++++
 1 file changed, 564 insertions(+)
 create mode 100644 Documentation/filesystems/fsinfo.rst

diff --git a/Documentation/filesystems/fsinfo.rst b/Documentation/filesystems/fsinfo.rst
new file mode 100644
index 000000000000..1f02e7d53bed
--- /dev/null
+++ b/Documentation/filesystems/fsinfo.rst
@@ -0,0 +1,564 @@
+============================
+Filesystem Information Query
+============================
+
+The fsinfo() system call allows the querying of filesystem and filesystem
+security information beyond what stat(), statx() and statfs() can obtain.  It
+does not require a file to be opened as does ioctl().
+
+fsinfo() may be called with a path, with an open file descriptor or with a mount
+object identifier.
+
+The fsinfo() system call needs to be configured on by enabling:
+
+	"File systems"/"Enable the fsinfo() system call" (CONFIG_FSINFO)
+
+This document has the following sections:
+
+.. contents:: :local:
+
+
+Overview
+========
+
+The fsinfo() system call retrieves one of a number of attributes, the IDs of
+which can be found in include/uapi/linux/fsinfo.h::
+
+	FSINFO_ATTR_STATFS	- statfs()-style state
+	FSINFO_ATTR_IDS		- Filesystem IDs
+	FSINFO_ATTR_LIMITS	- Filesystem limits
+	...
+	FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO - Information about an attribute
+	FSINFO_ATTR_FSINFO_ATTRIBUTES - List of available attributes
+	...
+	FSINFO_ATTR_MOUNT_INFO	- Information about the mount topology
+	...
+
+Each attribute can have zero or more values, which can be of one of the
+following types:
+
+ * ``FSINFO_TYPE_VSTRUCT``.  This is a structure with a version-dependent
+   length.  New versions of the kernel may append more fields, though they are
+   not permitted to remove or replace old ones.
+
+   Older applications, expecting an older version of the field, can ask for a
+   shorter struct and will only get the fields they requested; newer
+   applications running on an older kernel will get the extra fields they
+   requested filled with zeros.  Either way, the system call returns the size
+   of the internal struct, regardless of how much data it returned.
+
+   This allows for struct-type fields to be extended in future.
+
+ * ``FSINFO_TYPE_STRING``.  This is a variable-length string of up to INT_MAX
+   characters (no NUL character is included).  The returned string will be
+   truncated if the output buffer is too small.  The total size of the string
+   is returned, regardless of any truncation.
+
+ * ``FSINFO_TYPE_OPAQUE``.  This is a variable-length blob of indeterminate
+   structure.  It may be up to INT_MAX bytes in size.
+
+ * ``FSINFO_TYPE_LIST``.  This is a variable-length list of fixed-size
+   structures.  The element size may not vary over time, so the element format
+   must be designed with care.  The maximum length is INT_MAX bytes, though
+   this depends on the kernel being able to allocate an internal buffer large
+   enough.
+
+Value type is an inherent property of an attribute and all the values of an
+attribute must be of that type.  Each attribute can have a single value, a
+sequence of values or a sequence-of-sequences of values.
+
+
+Filesystem API
+==============
+
+If the filesystem wishes to override the generic queryable attributes or
+provide queryable attributes of its own, it should define a handler function
+and point the appropriate superblock op to it::
+
+	int (*fsinfo)(struct path *path, struct fsinfo_context *ctx);
+
+The core calls this function to see if it wants to handle the attribute.  For
+each table of attributes it has (and it can have more than one), it should
+call::
+
+	int fsinfo_get_attribute(struct path *path, struct fsinfo_context *ctx,
+				 const struct fsinfo_attribute *attrs);
+
+to scan the table to see if the requested one is in there.  This function also
+handles determining the size of struct attributes, enumerating attributes for
+the FSINFO_ATTR_FSINFO_ATTRIBUTES and querying information about an attribute
+for FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO.
+
+If it doesn't want to handle the attribute, -EOPNOTSUPP should be returned.  The
+core will then examine the generic attribute table.
+
+
+Attribute Table
+---------------
+
+An attribute table is a sequence of ``struct fsinfo_attribute`` terminated with
+a blank entry.  Entries can be created with a set of helper macros::
+
+	FSINFO_VSTRUCT(A,G)
+	FSINFO_VSTRUCT_N(A,G)
+	FSINFO_VSTRUCT_NM(A,G)
+	FSINFO_STRING(A,G)
+	FSINFO_STRING_N(A,G)
+	FSINFO_STRING_NM(A,G)
+	FSINFO_OPAQUE(A,G)
+	FSINFO_LIST(A,G)
+	FSINFO_LIST_N(A,G)
+
+The names of the macros are a combination of type (vstruct, string, opaque and
+list) and an optional qualifier, if the attribute has N values or N lots of M
+values.  ``A`` is the name of the attribute and ``G`` is a function to get a
+value for that attribute.
+
+For vstruct- and list-type attributes, it is expected that there is a macro
+defined with the name ``A##__STRUCT`` that indicates the structure type.
+
+The get function needs to match the following type::
+
+	int (*get)(struct path *path, struct fsinfo_context *ctx);
+
+where "path" indicates the object to be queried and ctx is a context describing
+the parameters and the output buffer.  The function should return the total
+size of the data it would like to produce or an error.
+
+
+Context Structure
+-----------------
+
+The context struct looks like::
+
+	struct fsinfo_context {
+		__u32		requested_attr;
+		__u32		Nth;
+		__u32		Mth;
+		bool		want_size_only;
+		unsigned int	skip;
+		unsigned int	usage;
+		unsigned int	buf_size;
+		void		*buffer;
+		...
+	};
+
+The fields relevant to the filesystem are as follows:
+
+ * ``requested_attr``
+
+   Which attribute is being requested.  EOPNOTSUPP should be returned if the
+   attribute is not supported by the filesystem or the LSM.
+
+ * ``Nth`` and ``Mth``
+
+   Which value of an attribute is being requested.
+
+   For a single-value attribute Nth and Mth will both be 0.
+
+   For a "1D" attribute, Nth will indicate which value and Mth will always
+   be 0.  Take, for example, FSINFO_ATTR_AFS_SERVER_NAME - for a network
+   filesystem, the superblock will be backed by a number of servers.  This will
+   return the name of the Nth server.  ENODATA will be returned if Nth goes
+   beyond the end of the array.
+
+   For a "2D" attribute, Mth will indicate the index in the Nth set of values.
+   Take, for example, an attribute for a network filesystem that returns
+   server addresses - each server may have one or more addresses.  This could
+   return the Mth address of the Nth server.  ENODATA should be returned if the
+   Nth set doesn't exist or the Mth element of the Nth set doesn't exist.
+
+ * ``want_size_only``
+
+   Is set to true if the caller only wants the size of the value so that the
+   get function doesn't have to make expensive calculations or calls to
+   retrieve the value.
+
+ * ``skip``
+
+   This indicates how far into the buffer the data to be returned starts.  This
+   can be used to trim the front off the buffer or to handle backward-filling.
+
+ * ``usage``
+
+   This indicates how much of the buffer has been used so far for a list or
+   opaque type attribute.  This is updated by the fsinfo_note_param*()
+   functions.
+
+ * ``buf_size``
+
+   This indicates the current size of the buffer.  For the list type and the
+   opaque type this will be increased if the current buffer won't hold the
+   value and the filesystem will be called again.
+
+ * ``buffer``
+
+   This points to the output buffer.  It will be buf_size in size and will be
+   resized if the returned size is larger than this.
+
+To simplify filesystem code, there will always be at least a minimal buffer
+available if a ->get() method gets called.
+
+
+Helper Functions
+================
+
+The API includes a number of helper functions:
+
+ * ``int fsinfo_string(const char *s, struct fsinfo_context *ctx);``
+
+   This places the specified string into the buffer set in the context.  If the
+   string is NULL, the buffer will be left empty.
+
+ * ``int fsinfo_generic_timestamp_info(struct path *, struct fsinfo_context *);``
+ * ``int fsinfo_generic_supports(struct path *, struct fsinfo_context *);``
+ * ``int fsinfo_generic_limits(struct path *, struct fsinfo_context *);``
+
+   These set the generic information for timestamp resolution and range
+   information, supported features and number limits and are called for the
+   corresponding attributes if the filesystem doesn't override them.
+
+   If the filesystem does override them, it can call the above functions and
+   then amend the results.
+
+ * ``void fsinfo_set_feature(struct fsinfo_features *ft,
+			     enum fsinfo_feature feature);``
+
+   This function sets a feature flag.
+
+ * ``void fsinfo_clear_feature(struct fsinfo_features *ft,
+			       enum fsinfo_feature feature);``
+
+   This function clears a feature flag.
+
+ * ``void fsinfo_set_unix_features(struct fsinfo_features *ft);``
+
+   Set feature flags appropriate to the features of a standard UNIX filesystem,
+   such as having numeric UIDs and GIDs; allowing the creation of directories,
+   symbolic links, hard links, device files, FIFO and socket files; permitting
+   sparse files; and having access, change and modification times.
+
+
+Attribute Summary
+=================
+
+To summarise the attributes that are defined::
+
+  Symbolic name				Type
+  =====================================	===============
+  FSINFO_ATTR_STATFS			vstruct
+  FSINFO_ATTR_IDS			vstruct
+  FSINFO_ATTR_LIMITS			vstruct
+  FSINFO_ATTR_SUPPORTS			vstruct
+  FSINFO_ATTR_TIMESTAMP_INFO		vstruct
+  FSINFO_ATTR_VOLUME_ID			string
+  FSINFO_ATTR_VOLUME_UUID		vstruct
+  FSINFO_ATTR_VOLUME_NAME		string
+  FSINFO_ATTR_FEATURES			vstruct
+  FSINFO_ATTR_SOURCE			string
+  FSINFO_ATTR_CONFIGURATION		string
+  FSINFO_ATTR_FS_STATISTICS		string
+  FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO	N × vstruct
+  FSINFO_ATTR_FSINFO_ATTRIBUTES		list
+  FSINFO_ATTR_MOUNT_INFO		vstruct
+  FSINFO_ATTR_MOUNT_PATH		string
+  FSINFO_ATTR_MOUNT_POINT		string
+  FSINFO_ATTR_MOUNT_CHILDREN		list
+  FSINFO_ATTR_AFS_CELL_NAME		string
+  FSINFO_ATTR_AFS_SERVER_NAME		N × string
+  FSINFO_ATTR_AFS_SERVER_ADDRESSES	N × list
+
+
+Attribute Catalogue
+===================
+
+A number of the attributes convey information about a filesystem superblock:
+
+ *  ``FSINFO_ATTR_STATFS``
+
+    This struct-type attribute gives most of the equivalent data to statfs(),
+    but with all the fields as unconditional 64-bit or 128-bit integers.  Note
+    that static data like IDs that don't change are retrieved with
+    FSINFO_ATTR_IDS instead.
+
+    Further, superblock flags (such as MS_RDONLY) are not exposed by this
+    attribute; rather the parameters must be listed and the attributes picked
+    out from that.
+
+ *  ``FSINFO_ATTR_IDS``
+
+    This struct-type attribute conveys various identifiers used by the target
+    filesystem.  This includes the filesystem name, the NFS filesystem ID, the
+    superblock ID used in notifications, the filesystem magic type number and
+    the primary device ID.
+
+ *  ``FSINFO_ATTR_LIMITS``
+
+    This struct-type attribute conveys the limits on various aspects of a
+    filesystem, such as maximum file, symlink and xattr sizes, maximum filename
+    and xattr name length, maximum number of symlinks, maximum device major and
+    minor numbers and maximum UID, GID and project ID numbers.
+
+ *  ``FSINFO_ATTR_SUPPORTS``
+
+    This struct-type attribute conveys information about the support the
+    filesystem has for various UAPI features of a filesystem.  This includes
+    information about which bits are supported in various masks employed by the
+    statx system call, what FS_IOC_* flags are supported by ioctls and what
+    DOS/Windows file attribute flags are supported.
+
+ *  ``FSINFO_ATTR_TIMESTAMP_INFO``
+
+    This struct-type attribute conveys information about the resolution and
+    range of the timestamps available in a filesystem.  The resolutions are
+    given as a mantissa and exponent (resolution = mantissa * 10^exponent
+    seconds), where the exponent can be negative to indicate a sub-second
+    resolution (-9 being nanoseconds, for example).
+
+ *  ``FSINFO_ATTR_VOLUME_ID``
+
+    This is a string-type attribute that conveys the superblock identifier for
+    the volume.  By default it will be filled in from the contents of s_id from
+    the superblock.  For a block-based filesystem, for example, this might be
+    the name of the primary block device.
+
+ *  ``FSINFO_ATTR_VOLUME_UUID``
+
+    This is a struct-type attribute that conveys the UUID identifier for the
+    volume.  By default it will be filled in from the contents of s_uuid from
+    the superblock.  If this doesn't exist, it will be entirely zeros.
+
+ *  ``FSINFO_ATTR_VOLUME_NAME``
+
+    This is a string-type attribute that conveys the name of the volume.  By
+    default it will return EOPNOTSUPP.  For a disk-based filesystem, it might
+    convey the partition label; for a network-based filesystem, it might convey
+    the name of the remote volume.
+
+ *  ``FSINFO_ATTR_FEATURES``
+
+    This is a special attribute, being a set of single-bit feature flags,
+    formatted as a struct-type attribute.  The meanings of the feature bits are
+    listed below - see the "Feature Bit Catalogue" section.  The feature bits
+    are grouped numerically into bytes, such that features 0-7 are in byte 0,
+    8-15 are in byte 1, 16-23 in byte 2 and so on.
+
+    Any feature bit that's not supported by the kernel will be set to false if
+    asked for.  The highest supported feature is set at the beginning of the
+    structure.
+
+ *  ``FSINFO_ATTR_SOURCE``
+ *  ``FSINFO_ATTR_CONFIGURATION``
+ *  ``FSINFO_ATTR_FS_STATISTICS``
+
+    These attributes return the mountpoint device name (as processed by the
+    filesystem), the superblock configuration (mount) options and the
+    superblock statistics in string form, as presented through a variety
+    of /proc files.
+
+
+Some attributes give information about fsinfo itself:
+
+ *  ``FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO``
+
+    This struct-type attribute gives metadata about the attribute with the ID
+    specified by the Nth parameter, including its type, default size and
+    element size.
+
+ *  ``FSINFO_ATTR_FSINFO_ATTRIBUTES``
+
+    This list-type attribute gives a list of the attribute IDs available at the
+    point of reference.  FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO can then be used to
+    query each attribute.
+
+
+Some attributes give information about mount objects:
+
+ *  ``FSINFO_ATTR_MOUNT_INFO``
+
+    This gives information about a particular mount point, including its IDs,
+    its topological relationships, its attributes and its event counters.
+
+ *  ``FSINFO_ATTR_MOUNT_PATH``
+
+    This gives information about the path set by binding a mount, though it may
+    be overridden by the filesystem.
+
+ *  ``FSINFO_ATTR_MOUNT_POINT``
+ *  ``FSINFO_ATTR_MOUNT_POINT_FULL``
+
+    These give the path to the mount point for a mount object, in the former
+    relative to its parent mount's mount point (limited to chroot) and in the
+    latter as a full path from the chroot.
+
+ *  ``FSINFO_ATTR_MOUNT_CHILDREN``
+
+    This gives a list of all the child mounts of the queried mount.  This is
+    presented as tuples of { mount ID, mount uniquifier, event counter sum }
+    and includes at the end a tuple representing the queried mount.
+
+
+Finally there are filesystem-specific attributes, e.g.:
+
+ *  ``FSINFO_ATTR_AFS_CELL_NAME``
+
+    This is a string-type attribute that retrieves the AFS cell name of the
+    target object.
+
+ *  ``FSINFO_ATTR_AFS_SERVER_NAME``
+
+    This is a string-type attribute that conveys the name of the Nth server
+    backing a network-filesystem superblock.
+
+ *  ``FSINFO_ATTR_AFS_SERVER_ADDRESSES``
+
+    This is a list-type attribute that conveys the addresses of the Nth server,
+    corresponding to the Nth server returned by FSINFO_ATTR_AFS_SERVER_NAME.
+
+
+Feature Bit Catalogue
+=====================
+
+The feature bits convey single true/false assertions about a specific instance
+of a filesystem (ie. a specific superblock).  They are accessed using the
+"FSINFO_ATTR_FEATURES" attribute:
+
+ *  ``FSINFO_FEAT_IS_KERNEL_FS``
+ *  ``FSINFO_FEAT_IS_BLOCK_FS``
+ *  ``FSINFO_FEAT_IS_FLASH_FS``
+ *  ``FSINFO_FEAT_IS_NETWORK_FS``
+ *  ``FSINFO_FEAT_IS_AUTOMOUNTER_FS``
+ *  ``FSINFO_FEAT_IS_MEMORY_FS``
+
+    These indicate what kind of filesystem the target is: kernel API (proc),
+    block-based (ext4), flash/nvm-based (jffs2), remote over the network (NFS),
+    local quasi-filesystem that acts as a tray of mountpoints (autofs), plain
+    in-memory filesystem (shmem).
+
+ *  ``FSINFO_FEAT_AUTOMOUNTS``
+
+    This indicates if a filesystem may have objects that are automount points.
+
+ *  ``FSINFO_FEAT_ADV_LOCKS``
+ *  ``FSINFO_FEAT_MAND_LOCKS``
+ *  ``FSINFO_FEAT_LEASES``
+
+    These indicate if a filesystem supports advisory locks, mandatory locks or
+    leases.
+
+ *  ``FSINFO_FEAT_UIDS``
+ *  ``FSINFO_FEAT_GIDS``
+ *  ``FSINFO_FEAT_PROJIDS``
+
+    These indicate if a filesystem supports/stores/transports numeric user IDs,
+    group IDs or project IDs.  The "FSINFO_ATTR_LIMITS" attribute can be used
+    to find out the upper limits on the ID values.
+
+ *  ``FSINFO_FEAT_STRING_USER_IDS``
+
+    This indicates if a filesystem supports/stores/transports string user
+    identifiers.
+
+ *  ``FSINFO_FEAT_GUID_USER_IDS``
+
+    This indicates if a filesystem supports/stores/transports Windows GUIDs as
+    user identifiers (eg. ntfs).
+
+ *  ``FSINFO_FEAT_WINDOWS_ATTRS``
+
+    This indicates if a filesystem supports Windows FILE_* attribute bits
+    (eg. cifs, jfs).  The "FSINFO_ATTR_SUPPORTS" attribute can be used to find
+    out which Windows file attributes are supported by the filesystem.
+
+ *  ``FSINFO_FEAT_USER_QUOTAS``
+ *  ``FSINFO_FEAT_GROUP_QUOTAS``
+ *  ``FSINFO_FEAT_PROJECT_QUOTAS``
+
+    These indicate if a filesystem supports quotas for users, groups or
+    projects.
+
+ *  ``FSINFO_FEAT_XATTRS``
+
+    This indicates if a filesystem supports extended attributes.  The
+    "FSINFO_ATTR_LIMITS" attribute can be used to find out the upper limits on
+    the supported name and body lengths.
+
+ *  ``FSINFO_FEAT_JOURNAL``
+ *  ``FSINFO_FEAT_DATA_IS_JOURNALLED``
+
+    These indicate whether the filesystem has a journal and whether data
+    changes are logged to it.
+
+ *  ``FSINFO_FEAT_O_SYNC``
+ *  ``FSINFO_FEAT_O_DIRECT``
+
+    These indicate whether the filesystem supports the O_SYNC and O_DIRECT
+    flags.
+
+ *  ``FSINFO_FEAT_VOLUME_ID``
+ *  ``FSINFO_FEAT_VOLUME_UUID``
+ *  ``FSINFO_FEAT_VOLUME_NAME``
+ *  ``FSINFO_FEAT_VOLUME_FSID``
+
+    These indicate whether ID, UUID, name and FSID identifiers actually exist
+    in the filesystem and thus might be considered persistent.
+
+ *  ``FSINFO_FEAT_IVER_ALL_CHANGE``
+ *  ``FSINFO_FEAT_IVER_DATA_CHANGE``
+ *  ``FSINFO_FEAT_IVER_MONO_INCR``
+
+    These indicate whether i_version in the inode is supported and, if so, what
+    mode it operates in.  The first two indicate if it's changed for any data
+    or metadata change, or whether it's only changed for any data changes; the
+    last indicates whether or not it's monotonically increasing for each such
+    change.
+
+ *  ``FSINFO_FEAT_HARD_LINKS``
+ *  ``FSINFO_FEAT_HARD_LINKS_1DIR``
+
+    These indicate whether the filesystem can have hard links made in it, and
+    whether they can be made between directories or only within the same
+    directory.
+
+ *  ``FSINFO_FEAT_DIRECTORIES``
+ *  ``FSINFO_FEAT_SYMLINKS``
+ *  ``FSINFO_FEAT_DEVICE_FILES``
+ *  ``FSINFO_FEAT_UNIX_SPECIALS``
+
+    These indicate whether directories; symbolic links; device files; or pipes
+    and sockets can be made within the filesystem.
+
+ *  ``FSINFO_FEAT_RESOURCE_FORKS``
+
+    This indicates if the filesystem supports resource forks.
+
+ *  ``FSINFO_FEAT_NAME_CASE_INDEP``
+ *  ``FSINFO_FEAT_NAME_NON_UTF8``
+ *  ``FSINFO_FEAT_NAME_HAS_CODEPAGE``
+
+    These indicate if the filesystem supports case-independent file names,
+    whether the filenames are non-utf8 (see the "FSINFO_ATTR_NAME_ENCODING"
+    attribute) and whether a codepage is in use to transliterate them (see
+    the "FSINFO_ATTR_NAME_CODEPAGE" attribute).
+
+ *  ``FSINFO_FEAT_SPARSE``
+
+    This indicates if a filesystem supports sparse files.
+
+ *  ``FSINFO_FEAT_NOT_PERSISTENT``
+
+    This indicates if a filesystem is not persistent.
+
+ *  ``FSINFO_FEAT_NO_UNIX_MODE``
+
+    This indicates if a filesystem doesn't support UNIX mode bits (though they
+    may be manufactured from other bits, such as Windows file attribute flags).
+
+ *  ``FSINFO_FEAT_HAS_ATIME``
+ *  ``FSINFO_FEAT_HAS_BTIME``
+ *  ``FSINFO_FEAT_HAS_CTIME``
+ *  ``FSINFO_FEAT_HAS_MTIME``
+
+    These indicate which timestamps a filesystem supports (access, birth,
+    change, modify).  The range and resolutions can be queried with the
+    "FSINFO_ATTR_TIMESTAMP_INFO" attribute.

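The documentation above says the feature bits are grouped into bytes, with
features 0-7 in byte 0, 8-15 in byte 1 and so on.  A minimal userspace sketch
of that indexing follows.  The helper names are hypothetical and not part of
the patch set, and the bit order within each byte (least significant bit
first) is an assumption, since the text only states which byte a feature
lands in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helpers, not the kernel's fsinfo_set_feature().
 * Assumption: feature n lives in byte n / 8 at bit n % 8, counting from
 * the least significant bit.
 */
static bool feature_is_set(const uint8_t *bits, size_t nr_bytes,
			   unsigned int feature)
{
	assert(bits != NULL);
	if (feature / 8 >= nr_bytes)
		return false;	/* Beyond the returned data: report as unset */
	return (bits[feature / 8] >> (feature % 8)) & 1;
}

static void feature_set(uint8_t *bits, size_t nr_bytes, unsigned int feature)
{
	assert(bits != NULL);
	if (feature / 8 < nr_bytes)
		bits[feature / 8] |= (uint8_t)(1u << (feature % 8));
}
```

Treating an out-of-range feature number as unset mirrors the documented rule
that a feature bit the kernel doesn't support reads as false.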



* [PATCH 12/14] fsinfo: Add support for AFS [ver #18]
  2020-03-09 14:00 [PATCH 00/14] VFS: Filesystem information [ver #18] David Howells
                   ` (10 preceding siblings ...)
  2020-03-09 14:02 ` [PATCH 11/14] fsinfo: Add API documentation " David Howells
@ 2020-03-09 14:02 ` David Howells
  2020-03-09 14:02 ` [PATCH 13/14] fsinfo: Example support for Ext4 " David Howells
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2020-03-09 14:02 UTC (permalink / raw)
  To: torvalds, viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong, kzak,
	jlayton, linux-api, linux-fsdevel, linux-security-module,
	linux-kernel

Add fsinfo support to the AFS filesystem.  This allows the export of server
lists, amongst other things, which is necessary to implement some of the
AFS 'fs' command set, such as "checkservers", "getserverprefs" and
"whereis".

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/afs/internal.h           |    1 
 fs/afs/super.c              |  218 +++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fsinfo.h |   15 +++
 samples/vfs/test-fsinfo.c   |   49 ++++++++++
 4 files changed, 281 insertions(+), 2 deletions(-)

diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 1d81fc4c3058..b4b2a8a18e9f 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -248,6 +248,7 @@ struct afs_super_info {
 	struct afs_volume	*volume;	/* volume record */
 	enum afs_flock_mode	flock_mode:8;	/* File locking emulation mode */
 	bool			dyn_root;	/* True if dynamic root */
+	bool			autocell;	/* True if autocell */
 };
 
 static inline struct afs_super_info *AFS_FS_S(struct super_block *sb)
diff --git a/fs/afs/super.c b/fs/afs/super.c
index dda7a9a66848..969248a192a2 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -26,9 +26,13 @@
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
 #include <linux/magic.h>
+#include <linux/fsinfo.h>
 #include <net/net_namespace.h>
 #include "internal.h"
 
+#ifdef CONFIG_FSINFO
+static int afs_fsinfo(struct path *path, struct fsinfo_context *ctx);
+#endif
 static void afs_i_init_once(void *foo);
 static void afs_kill_super(struct super_block *sb);
 static struct inode *afs_alloc_inode(struct super_block *sb);
@@ -54,6 +58,9 @@ int afs_net_id;
 
 static const struct super_operations afs_super_ops = {
 	.statfs		= afs_statfs,
+#ifdef CONFIG_FSINFO
+	.fsinfo		= afs_fsinfo,
+#endif
 	.alloc_inode	= afs_alloc_inode,
 	.drop_inode	= afs_drop_inode,
 	.destroy_inode	= afs_destroy_inode,
@@ -193,7 +200,7 @@ static int afs_show_options(struct seq_file *m, struct dentry *root)
 
 	if (as->dyn_root)
 		seq_puts(m, ",dyn");
-	if (test_bit(AFS_VNODE_AUTOCELL, &AFS_FS_I(d_inode(root))->flags))
+	if (as->autocell)
 		seq_puts(m, ",autocell");
 	switch (as->flock_mode) {
 	case afs_flock_mode_unset:	break;
@@ -458,7 +465,7 @@ static int afs_fill_super(struct super_block *sb, struct afs_fs_context *ctx)
 	if (IS_ERR(inode))
 		return PTR_ERR(inode);
 
-	if (ctx->autocell || as->dyn_root)
+	if (as->autocell || as->dyn_root)
 		set_bit(AFS_VNODE_AUTOCELL, &AFS_FS_I(inode)->flags);
 
 	ret = -ENOMEM;
@@ -498,6 +505,8 @@ static struct afs_super_info *afs_alloc_sbi(struct fs_context *fc)
 			as->cell = afs_get_cell(ctx->cell);
 			as->volume = __afs_get_volume(ctx->volume);
 		}
+		if (ctx->autocell)
+			as->autocell = true;
 	}
 	return as;
 }
@@ -760,3 +769,208 @@ static int afs_statfs(struct dentry *dentry, struct kstatfs *buf)
 
 	return ret;
 }
+
+#ifdef CONFIG_FSINFO
+static const struct fsinfo_timestamp_info afs_timestamp_info = {
+	.atime = {
+		.minimum	= 0,
+		.maximum	= UINT_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+	.mtime = {
+		.minimum	= 0,
+		.maximum	= UINT_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+	.ctime = {
+		.minimum	= 0,
+		.maximum	= UINT_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+	.btime = {
+		.minimum	= 0,
+		.maximum	= UINT_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+};
+
+static int afs_fsinfo_get_timestamp(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_timestamp_info *tsinfo = ctx->buffer;
+	*tsinfo = afs_timestamp_info;
+	return sizeof(*tsinfo);
+}
+
+static int afs_fsinfo_get_limits(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_limits *lim = ctx->buffer;
+
+	lim->max_file_size.hi	= 0;
+	lim->max_file_size.lo	= MAX_LFS_FILESIZE;
+	/* Inode numbers can be 96-bit on YFS, but that's hard to determine. */
+	lim->max_ino.hi		= 0;
+	lim->max_ino.lo		= UINT_MAX;
+	lim->max_hard_links	= UINT_MAX;
+	lim->max_uid		= UINT_MAX;
+	lim->max_gid		= UINT_MAX;
+	lim->max_filename_len	= AFSNAMEMAX - 1;
+	lim->max_symlink_len	= AFSPATHMAX - 1;
+	return sizeof(*lim);
+}
+
+static int afs_fsinfo_get_supports(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_supports *p = ctx->buffer;
+
+	p->stx_mask = (STATX_TYPE | STATX_MODE |
+		       STATX_NLINK |
+		       STATX_UID | STATX_GID |
+		       STATX_MTIME | STATX_INO |
+		       STATX_SIZE);
+	p->stx_attributes = STATX_ATTR_AUTOMOUNT;
+	return sizeof(*p);
+}
+
+static int afs_fsinfo_get_features(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_features *p = ctx->buffer;
+
+	fsinfo_set_feature(p, FSINFO_FEAT_IS_NETWORK_FS);
+	fsinfo_set_feature(p, FSINFO_FEAT_AUTOMOUNTS);
+	fsinfo_set_feature(p, FSINFO_FEAT_ADV_LOCKS);
+	fsinfo_set_feature(p, FSINFO_FEAT_UIDS);
+	fsinfo_set_feature(p, FSINFO_FEAT_GIDS);
+	fsinfo_set_feature(p, FSINFO_FEAT_VOLUME_ID);
+	fsinfo_set_feature(p, FSINFO_FEAT_VOLUME_NAME);
+	fsinfo_set_feature(p, FSINFO_FEAT_IVER_MONO_INCR);
+	fsinfo_set_feature(p, FSINFO_FEAT_SYMLINKS);
+	fsinfo_set_feature(p, FSINFO_FEAT_HARD_LINKS_1DIR);
+	fsinfo_set_feature(p, FSINFO_FEAT_HAS_MTIME);
+	fsinfo_set_feature(p, FSINFO_FEAT_HAS_INODE_NUMBERS);
+	return sizeof(*p);
+}
+
+static int afs_dyn_fsinfo_get_features(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_features *p = ctx->buffer;
+
+	fsinfo_set_feature(p, FSINFO_FEAT_IS_AUTOMOUNTER_FS);
+	fsinfo_set_feature(p, FSINFO_FEAT_AUTOMOUNTS);
+	return sizeof(*p);
+}
+
+static int afs_fsinfo_get_volume_name(struct path *path, struct fsinfo_context *ctx)
+{
+	struct afs_super_info *as = AFS_FS_S(path->dentry->d_sb);
+	struct afs_volume *volume = as->volume;
+
+	memcpy(ctx->buffer, volume->name, volume->name_len);
+	return volume->name_len;
+}
+
+static int afs_fsinfo_get_cell_name(struct path *path, struct fsinfo_context *ctx)
+{
+	struct afs_super_info *as = AFS_FS_S(path->dentry->d_sb);
+	struct afs_cell *cell = as->cell;
+
+	memcpy(ctx->buffer, cell->name, cell->name_len);
+	return cell->name_len;
+}
+
+static int afs_fsinfo_get_server_name(struct path *path, struct fsinfo_context *ctx)
+{
+	struct afs_server_list *slist;
+	struct afs_super_info *as = AFS_FS_S(path->dentry->d_sb);
+	struct afs_volume *volume = as->volume;
+	struct afs_server *server;
+	int ret = -ENODATA;
+
+	read_lock(&volume->servers_lock);
+	slist = volume->servers;
+	if (slist) {
+		if (ctx->Nth < slist->nr_servers) {
+			server = slist->servers[ctx->Nth].server;
+			ret = sprintf(ctx->buffer, "%pU", &server->uuid);
+		}
+	}
+
+	read_unlock(&volume->servers_lock);
+	return ret;
+}
+
+static int afs_fsinfo_get_server_address(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_afs_server_address *p = ctx->buffer;
+	struct afs_server_list *slist;
+	struct afs_super_info *as = AFS_FS_S(path->dentry->d_sb);
+	struct afs_addr_list *alist;
+	struct afs_volume *volume = as->volume;
+	struct afs_server *server;
+	struct afs_net *net = afs_d2net(path->dentry);
+	unsigned int i;
+	int ret = -ENODATA;
+
+	read_lock(&volume->servers_lock);
+	slist = afs_get_serverlist(volume->servers);
+	read_unlock(&volume->servers_lock);
+
+	if (ctx->Nth >= slist->nr_servers)
+		goto put_slist;
+	server = slist->servers[ctx->Nth].server;
+
+	read_lock(&server->fs_lock);
+	alist = afs_get_addrlist(rcu_dereference_protected(
+					 server->addresses,
+					 lockdep_is_held(&server->fs_lock)));
+	read_unlock(&server->fs_lock);
+	if (!alist)
+		goto put_slist;
+
+	ret = alist->nr_addrs * sizeof(*p);
+	if (ret <= ctx->buf_size) {
+		for (i = 0; i < alist->nr_addrs; i++)
+			memcpy(&p[i].address, &alist->addrs[i],
+			       sizeof(struct sockaddr_rxrpc));
+	}
+
+	afs_put_addrlist(alist);
+put_slist:
+	afs_put_serverlist(net, slist);
+	return ret;
+}
+
+static const struct fsinfo_attribute afs_fsinfo_attributes[] = {
+	FSINFO_VSTRUCT	(FSINFO_ATTR_TIMESTAMP_INFO,	afs_fsinfo_get_timestamp),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_LIMITS,		afs_fsinfo_get_limits),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_SUPPORTS,		afs_fsinfo_get_supports),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_FEATURES,		afs_fsinfo_get_features),
+	FSINFO_STRING	(FSINFO_ATTR_VOLUME_NAME,	afs_fsinfo_get_volume_name),
+	FSINFO_STRING	(FSINFO_ATTR_AFS_CELL_NAME,	afs_fsinfo_get_cell_name),
+	FSINFO_STRING_N	(FSINFO_ATTR_AFS_SERVER_NAME,	afs_fsinfo_get_server_name),
+	FSINFO_LIST_N	(FSINFO_ATTR_AFS_SERVER_ADDRESSES, afs_fsinfo_get_server_address),
+	{}
+};
+
+static const struct fsinfo_attribute afs_dyn_fsinfo_attributes[] = {
+	FSINFO_VSTRUCT(FSINFO_ATTR_TIMESTAMP_INFO,	afs_fsinfo_get_timestamp),
+	FSINFO_VSTRUCT(FSINFO_ATTR_FEATURES,		afs_dyn_fsinfo_get_features),
+	{}
+};
+
+static int afs_fsinfo(struct path *path, struct fsinfo_context *ctx)
+{
+	struct afs_super_info *as = AFS_FS_S(path->dentry->d_sb);
+	int ret;
+
+	if (as->dyn_root)
+		ret = fsinfo_get_attribute(path, ctx, afs_dyn_fsinfo_attributes);
+	else
+		ret = fsinfo_get_attribute(path, ctx, afs_fsinfo_attributes);
+	return ret;
+}
+
+#endif /* CONFIG_FSINFO */
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
index 826b788b0795..154c13a55819 100644
--- a/include/uapi/linux/fsinfo.h
+++ b/include/uapi/linux/fsinfo.h
@@ -37,6 +37,10 @@
 #define FSINFO_ATTR_MOUNT_POINT_FULL	0x203	/* Absolute path of mount (string) */
 #define FSINFO_ATTR_MOUNT_CHILDREN	0x204	/* Children of this mount (list) */
 
+#define FSINFO_ATTR_AFS_CELL_NAME	0x300	/* AFS cell name (string) */
+#define FSINFO_ATTR_AFS_SERVER_NAME	0x301	/* Name of the Nth server (string) */
+#define FSINFO_ATTR_AFS_SERVER_ADDRESSES 0x302	/* List of addresses of the Nth server */
+
 /*
  * Optional fsinfo() parameter structure.
  *
@@ -297,4 +301,15 @@ struct fsinfo_volume_uuid {
 
 #define FSINFO_ATTR_VOLUME_UUID__STRUCT struct fsinfo_volume_uuid
 
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_AFS_SERVER_ADDRESSES).
+ *
+ * Get the addresses of the Nth server for a network filesystem.
+ */
+struct fsinfo_afs_server_address {
+	struct __kernel_sockaddr_storage address;
+};
+
+#define FSINFO_ATTR_AFS_SERVER_ADDRESSES__STRUCT struct fsinfo_afs_server_address
+
 #endif /* _UAPI_LINUX_FSINFO_H */
diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
index 91434f459ba5..82944f09e0c9 100644
--- a/samples/vfs/test-fsinfo.c
+++ b/samples/vfs/test-fsinfo.c
@@ -23,6 +23,7 @@
 #include <linux/socket.h>
 #include <sys/stat.h>
 #include <arpa/inet.h>
+#include <linux/rxrpc.h>
 
 #ifndef __NR_fsinfo
 #define __NR_fsinfo -1
@@ -329,6 +330,50 @@ static void dump_fsinfo_generic_mount_child(void *reply, unsigned int size)
 	       r->mnt_id, (unsigned long long)r->mnt_unique_id, r->notify_sum, mp);
 }
 
+static void dump_afs_fsinfo_server_address(void *reply, unsigned int size)
+{
+	struct fsinfo_afs_server_address *f = reply;
+	struct sockaddr_storage *ss = (struct sockaddr_storage *)&f->address;
+	struct sockaddr_rxrpc *srx;
+	struct sockaddr_in6 *sin6;
+	struct sockaddr_in *sin;
+	char proto[32], buf[1024];
+
+	if (ss->ss_family == AF_RXRPC) {
+		srx = (struct sockaddr_rxrpc *)ss;
+		printf("%5u ", srx->srx_service);
+		switch (srx->transport_type) {
+		case SOCK_DGRAM:
+			sprintf(proto, "udp");
+			break;
+		case SOCK_STREAM:
+			sprintf(proto, "tcp");
+			break;
+		default:
+			sprintf(proto, "%3u", srx->transport_type);
+			break;
+		}
+		ss = (struct sockaddr_storage *)&srx->transport;
+	}
+
+	switch (ss->ss_family) {
+	case AF_INET:
+		sin = (struct sockaddr_in *)ss;
+		if (!inet_ntop(AF_INET, &sin->sin_addr, buf, sizeof(buf)))
+			break;
+		printf("%5u/%s %s\n", ntohs(sin->sin_port), proto, buf);
+		return;
+	case AF_INET6:
+		sin6 = (struct sockaddr_in6 *)ss;
+		if (!inet_ntop(AF_INET6, &sin6->sin6_addr, buf, sizeof(buf)))
+			break;
+		printf("%5u/%s %s\n", ntohs(sin6->sin6_port), proto, buf);
+		return;
+	}
+
+	printf("family=%u\n", ss->ss_family);
+}
+
 static void dump_string(void *reply, unsigned int size)
 {
 	char *s = reply, *p;
@@ -411,6 +456,10 @@ static const struct fsinfo_attribute fsinfo_attributes[] = {
 	FSINFO_STRING_N	(FSINFO_ATTR_MOUNT_POINT,	string),
 	FSINFO_STRING_N	(FSINFO_ATTR_MOUNT_POINT_FULL,	string),
 	FSINFO_LIST	(FSINFO_ATTR_MOUNT_CHILDREN,	fsinfo_generic_mount_child),
+
+	FSINFO_STRING	(FSINFO_ATTR_AFS_CELL_NAME,	string),
+	FSINFO_STRING_N	(FSINFO_ATTR_AFS_SERVER_NAME,	string),
+	FSINFO_LIST_N	(FSINFO_ATTR_AFS_SERVER_ADDRESSES, afs_fsinfo_server_address),
 	{}
 };
 

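The list-type get function in the AFS patch above returns the total size it
wants and only fills the buffer if that size fits in buf_size; the
documentation says the core then enlarges the buffer and calls the filesystem
again.  The userspace mock below sketches that contract.  Everything in it
(struct mock_ctx, mock_get(), mock_query() and the record sizes) is
hypothetical and only imitates the described behaviour; it is not the core's
actual code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Cut-down, hypothetical stand-in for struct fsinfo_context. */
struct mock_ctx {
	unsigned int	buf_size;
	void		*buffer;
};

#define MOCK_RECORDS	5
#define MOCK_RECSIZE	128

/* Mock get function: always reports the full size it would produce. */
static int mock_get(struct mock_ctx *ctx)
{
	int ret = MOCK_RECORDS * MOCK_RECSIZE;

	/* Only fill the buffer if the whole value fits. */
	if ((unsigned int)ret <= ctx->buf_size)
		memset(ctx->buffer, 0xaa, ret);
	return ret;
}

/* Mock caller: grow the buffer and retry until the value fits. */
static int mock_query(struct mock_ctx *ctx)
{
	void *bigger;
	int ret;

	assert(ctx != NULL);
	for (;;) {
		ret = mock_get(ctx);
		if (ret < 0 || (unsigned int)ret <= ctx->buf_size)
			return ret;
		bigger = realloc(ctx->buffer, ret);
		if (!bigger)
			return -ENOMEM;
		ctx->buffer = bigger;
		ctx->buf_size = ret;
	}
}
```

Returning the required size even when nothing was copied is what lets the
caller size the buffer in a single retry rather than guessing.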

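For a "1D" attribute such as FSINFO_ATTR_AFS_SERVER_NAME, the get function
returns -ENODATA once Nth runs past the end of the server list.  A caller can
therefore enumerate values by incrementing Nth until that error appears.  The
mock below illustrates the pattern in plain userspace C; the server names and
both functions are invented for the example and do not call fsinfo().

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Invented data standing in for a volume's server list. */
static const char *const mock_servers[] = {
	"server-a", "server-b", "server-c",
};
#define NR_MOCK_SERVERS (sizeof(mock_servers) / sizeof(mock_servers[0]))

/* Mock of a 1D string attribute: value length, or -ENODATA past the end. */
static int mock_get_server_name(unsigned int nth, char *buf, size_t size)
{
	assert(buf != NULL);
	if (nth >= NR_MOCK_SERVERS)
		return -ENODATA;
	return snprintf(buf, size, "%s", mock_servers[nth]);
}

/* Walk Nth upwards until the getter reports an error. */
static unsigned int count_servers(void)
{
	char name[64];
	unsigned int nth = 0;

	while (mock_get_server_name(nth, name, sizeof(name)) >= 0)
		nth++;
	return nth;
}
```

A 2D attribute would nest a second loop over Mth inside the loop over Nth,
stopping each level on the same error.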


* [PATCH 13/14] fsinfo: Example support for Ext4 [ver #18]
  2020-03-09 14:00 [PATCH 00/14] VFS: Filesystem information [ver #18] David Howells
                   ` (11 preceding siblings ...)
  2020-03-09 14:02 ` [PATCH 12/14] fsinfo: Add support for AFS " David Howells
@ 2020-03-09 14:02 ` David Howells
  2020-03-09 14:02 ` [PATCH 14/14] fsinfo: Example support for NFS " David Howells
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2020-03-09 14:02 UTC (permalink / raw)
  To: torvalds, viro
  Cc: Theodore Ts'o, Andreas Dilger, linux-ext4, dhowells, raven,
	mszeredi, christian, jannh, darrick.wong, kzak, jlayton,
	linux-api, linux-fsdevel, linux-security-module, linux-kernel

Add the ability to list some Ext4 volume timestamps as an example.

Is this useful for ext4?  Is there anything else that could be useful?

Signed-off-by: David Howells <dhowells@redhat.com>
cc: "Theodore Ts'o" <tytso@mit.edu>
cc: Andreas Dilger <adilger.kernel@dilger.ca>
cc: linux-ext4@vger.kernel.org
---

 fs/ext4/Makefile            |    1 +
 fs/ext4/ext4.h              |    6 ++++++
 fs/ext4/fsinfo.c            |   45 +++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/super.c             |    3 +++
 include/uapi/linux/fsinfo.h |   16 +++++++++++++++
 samples/vfs/test-fsinfo.c   |   35 +++++++++++++++++++++++++++++++++
 6 files changed, 106 insertions(+)
 create mode 100644 fs/ext4/fsinfo.c

diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
index 4ccb3c9189d8..71d5b460c7c7 100644
--- a/fs/ext4/Makefile
+++ b/fs/ext4/Makefile
@@ -16,3 +16,4 @@ ext4-$(CONFIG_EXT4_FS_SECURITY)		+= xattr_security.o
 ext4-inode-test-objs			+= inode-test.o
 obj-$(CONFIG_EXT4_KUNIT_TESTS)		+= ext4-inode-test.o
 ext4-$(CONFIG_FS_VERITY)		+= verity.o
+ext4-$(CONFIG_FSINFO)			+= fsinfo.o
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9a2ee2428ecc..461968a87cd6 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -42,6 +42,7 @@
 
 #include <linux/fscrypt.h>
 #include <linux/fsverity.h>
+#include <linux/fsinfo.h>
 
 #include <linux/compiler.h>
 
@@ -3166,6 +3167,11 @@ extern const struct inode_operations ext4_file_inode_operations;
 extern const struct file_operations ext4_file_operations;
 extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
 
+/* fsinfo.c */
+#ifdef CONFIG_FSINFO
+extern int ext4_fsinfo(struct path *path, struct fsinfo_context *ctx);
+#endif
+
 /* inline.c */
 extern int ext4_get_max_inline_size(struct inode *inode);
 extern int ext4_find_inline_data_nolock(struct inode *inode);
diff --git a/fs/ext4/fsinfo.c b/fs/ext4/fsinfo.c
new file mode 100644
index 000000000000..785f82a74dc9
--- /dev/null
+++ b/fs/ext4/fsinfo.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Filesystem information for ext4
+ *
+ * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include <linux/mount.h>
+#include "ext4.h"
+
+static int ext4_fsinfo_get_volume_name(struct path *path, struct fsinfo_context *ctx)
+{
+	const struct ext4_sb_info *sbi = EXT4_SB(path->mnt->mnt_sb);
+	const struct ext4_super_block *es = sbi->s_es;
+
+	memcpy(ctx->buffer, es->s_volume_name, sizeof(es->s_volume_name));
+	return strlen(ctx->buffer);
+}
+
+static int ext4_fsinfo_get_timestamps(struct path *path, struct fsinfo_context *ctx)
+{
+	const struct ext4_sb_info *sbi = EXT4_SB(path->mnt->mnt_sb);
+	const struct ext4_super_block *es = sbi->s_es;
+	struct fsinfo_ext4_timestamps *ts = ctx->buffer;
+
+#define Z(R,S) R = le32_to_cpu(S) | (((u64)S##_hi) << 32)
+	Z(ts->mkfs_time,	es->s_mkfs_time);
+	Z(ts->mount_time,	es->s_mtime);
+	Z(ts->write_time,	es->s_wtime);
+	Z(ts->last_check_time,	es->s_lastcheck);
+	Z(ts->first_error_time,	es->s_first_error_time);
+	Z(ts->last_error_time,	es->s_last_error_time);
+	return sizeof(*ts);
+}
+
+static const struct fsinfo_attribute ext4_fsinfo_attributes[] = {
+	FSINFO_STRING	(FSINFO_ATTR_VOLUME_NAME,	ext4_fsinfo_get_volume_name),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_EXT4_TIMESTAMPS,	ext4_fsinfo_get_timestamps),
+	{}
+};
+
+int ext4_fsinfo(struct path *path, struct fsinfo_context *ctx)
+{
+	return fsinfo_get_attribute(path, ctx, ext4_fsinfo_attributes);
+}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 8434217549b3..02b4df073c4b 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1477,6 +1477,9 @@ static const struct super_operations ext4_sops = {
 	.freeze_fs	= ext4_freeze,
 	.unfreeze_fs	= ext4_unfreeze,
 	.statfs		= ext4_statfs,
+#ifdef CONFIG_FSINFO
+	.fsinfo		= ext4_fsinfo,
+#endif
 	.remount_fs	= ext4_remount,
 	.show_options	= ext4_show_options,
 #ifdef CONFIG_QUOTA
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
index 154c13a55819..d8d05f0f1473 100644
--- a/include/uapi/linux/fsinfo.h
+++ b/include/uapi/linux/fsinfo.h
@@ -41,6 +41,8 @@
 #define FSINFO_ATTR_AFS_SERVER_NAME	0x301	/* Name of the Nth server (string) */
 #define FSINFO_ATTR_AFS_SERVER_ADDRESSES 0x302	/* List of addresses of the Nth server */
 
+#define FSINFO_ATTR_EXT4_TIMESTAMPS	0x400	/* Ext4 superblock timestamps */
+
 /*
  * Optional fsinfo() parameter structure.
  *
@@ -312,4 +314,18 @@ struct fsinfo_afs_server_address {
 
 #define FSINFO_ATTR_AFS_SERVER_ADDRESSES__STRUCT struct fsinfo_afs_server_address
 
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_EXT4_TIMESTAMPS).
+ */
+struct fsinfo_ext4_timestamps {
+	__u64		mkfs_time;
+	__u64		mount_time;
+	__u64		write_time;
+	__u64		last_check_time;
+	__u64		first_error_time;
+	__u64		last_error_time;
+};
+
+#define FSINFO_ATTR_EXT4_TIMESTAMPS__STRUCT struct fsinfo_ext4_timestamps
+
 #endif /* _UAPI_LINUX_FSINFO_H */
diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
index 82944f09e0c9..829297e9d1b6 100644
--- a/samples/vfs/test-fsinfo.c
+++ b/samples/vfs/test-fsinfo.c
@@ -374,6 +374,40 @@ static void dump_afs_fsinfo_server_address(void *reply, unsigned int size)
 	printf("family=%u\n", ss->ss_family);
 }
 
+static char *dump_ext4_time(char *buffer, time_t tim)
+{
+	struct tm tm;
+	int len;
+
+	if (tim == 0)
+		return "-";
+
+	if (!localtime_r(&tim, &tm)) {
+		perror("localtime_r");
+		exit(1);
+	}
+	len = strftime(buffer, 100, "%F %T", &tm);
+	if (len == 0) {
+		perror("strftime");
+		exit(1);
+	}
+	return buffer;
+}
+
+static void dump_ext4_fsinfo_timestamps(void *reply, unsigned int size)
+{
+	struct fsinfo_ext4_timestamps *r = reply;
+	char buffer[100];
+
+	printf("\n");
+	printf("\tmkfs    : %s\n", dump_ext4_time(buffer, r->mkfs_time));
+	printf("\tmount   : %s\n", dump_ext4_time(buffer, r->mount_time));
+	printf("\twrite   : %s\n", dump_ext4_time(buffer, r->write_time));
+	printf("\tfsck    : %s\n", dump_ext4_time(buffer, r->last_check_time));
+	printf("\t1st-err : %s\n", dump_ext4_time(buffer, r->first_error_time));
+	printf("\tlast-err: %s\n", dump_ext4_time(buffer, r->last_error_time));
+}
+
 static void dump_string(void *reply, unsigned int size)
 {
 	char *s = reply, *p;
@@ -460,6 +494,7 @@ static const struct fsinfo_attribute fsinfo_attributes[] = {
 	FSINFO_STRING	(FSINFO_ATTR_AFS_CELL_NAME,	string),
 	FSINFO_STRING	(FSINFO_ATTR_AFS_SERVER_NAME,	string),
 	FSINFO_LIST_N	(FSINFO_ATTR_AFS_SERVER_ADDRESSES, afs_fsinfo_server_address),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_EXT4_TIMESTAMPS,	ext4_fsinfo_timestamps),
 	{}
 };
 



^ permalink raw reply	[flat|nested] 50+ messages in thread
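
A note on the Z() macro in fs/ext4/fsinfo.c above: each ext4 superblock
timestamp is stored as a 32-bit low word plus a separate high part, and
the two are joined into a single 64-bit seconds value.  Below is a minimal
userspace sketch of that combination.  The name ext4_combine_time() is
hypothetical, the high part is assumed to be 8 bits wide, and both inputs
are assumed to be in CPU byte order already (the on-disk low word is
little-endian, so kernel code would convert it first).

```c
#include <stdint.h>

/* Hypothetical helper: join the low 32 bits and the high byte of a
 * split timestamp into one 64-bit seconds value.  Both inputs are
 * assumed to be in CPU byte order already. */
static uint64_t ext4_combine_time(uint32_t lo, uint8_t hi)
{
	/* Widen before shifting so the high byte is not lost. */
	return (uint64_t)lo | ((uint64_t)hi << 32);
}
```

The cast before the shift matters: shifting a 32-bit or narrower operand
by 32 would not give the intended result.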

* [PATCH 14/14] fsinfo: Example support for NFS [ver #18]
  2020-03-09 14:00 [PATCH 00/14] VFS: Filesystem information [ver #18] David Howells
                   ` (12 preceding siblings ...)
  2020-03-09 14:02 ` [PATCH 13/14] fsinfo: Example support for Ext4 " David Howells
@ 2020-03-09 14:02 ` David Howells
  2020-03-09 17:50 ` [PATCH 00/14] VFS: Filesystem information " Jeff Layton
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2020-03-09 14:02 UTC (permalink / raw)
  To: torvalds, viro
  Cc: Trond Myklebust, Anna Schumaker, linux-nfs, dhowells, raven,
	mszeredi, christian, jannh, darrick.wong, kzak, jlayton,
	linux-api, linux-fsdevel, linux-security-module, linux-kernel

Add the ability to list NFS server addresses and hostname, timestamp
information and capabilities as an example.

Is this useful for export from NFS?  Is there anything else that would be
useful?

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: Anna Schumaker <anna.schumaker@netapp.com>
cc: linux-nfs@vger.kernel.org
---

 fs/nfs/Makefile              |    1 
 fs/nfs/fsinfo.c              |  230 ++++++++++++++++++++++++++++++++++++++++++
 fs/nfs/internal.h            |    6 +
 fs/nfs/nfs4super.c           |    3 +
 fs/nfs/super.c               |    3 +
 include/uapi/linux/fsinfo.h  |   29 +++++
 include/uapi/linux/windows.h |   35 ++++++
 samples/vfs/test-fsinfo.c    |   38 +++++++
 8 files changed, 345 insertions(+)
 create mode 100644 fs/nfs/fsinfo.c
 create mode 100644 include/uapi/linux/windows.h

diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index 2433c3e03cfa..20fbc9596833 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -13,6 +13,7 @@ nfs-y 			:= client.o dir.o file.o getroot.o inode.o super.o \
 nfs-$(CONFIG_ROOT_NFS)	+= nfsroot.o
 nfs-$(CONFIG_SYSCTL)	+= sysctl.o
 nfs-$(CONFIG_NFS_FSCACHE) += fscache.o fscache-index.o
+nfs-$(CONFIG_FSINFO)	+= fsinfo.o
 
 obj-$(CONFIG_NFS_V2) += nfsv2.o
 nfsv2-y := nfs2super.o proc.o nfs2xdr.o
diff --git a/fs/nfs/fsinfo.c b/fs/nfs/fsinfo.c
new file mode 100644
index 000000000000..a0299ec27efd
--- /dev/null
+++ b/fs/nfs/fsinfo.c
@@ -0,0 +1,230 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Filesystem information for NFS
+ *
+ * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include <linux/nfs_fs.h>
+#include <linux/windows.h>
+#include "internal.h"
+
+static const struct fsinfo_timestamp_info nfs_timestamp_info = {
+	.atime = {
+		.minimum	= 0,
+		.maximum	= UINT_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+	.mtime = {
+		.minimum	= 0,
+		.maximum	= UINT_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+	.ctime = {
+		.minimum	= 0,
+		.maximum	= UINT_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+	.btime = {
+		.minimum	= 0,
+		.maximum	= UINT_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+};
+
+static int nfs_fsinfo_get_timestamp_info(struct path *path, struct fsinfo_context *ctx)
+{
+	const struct nfs_server *server = NFS_SB(path->dentry->d_sb);
+	struct fsinfo_timestamp_info *r = ctx->buffer;
+	unsigned long long nsec;
+	unsigned int rem, mant;
+	int exp = -9;
+
+	*r = nfs_timestamp_info;
+
+	nsec = server->time_delta.tv_nsec;
+	nsec += server->time_delta.tv_sec * 1000000000ULL;
+	if (nsec == 0)
+		goto out;
+
+	do {
+		mant = nsec;
+		rem = do_div(nsec, 10);
+		if (rem)
+			break;
+		exp++;
+	} while (nsec);
+
+	r->atime.gran_mantissa = mant;
+	r->atime.gran_exponent = exp;
+	r->btime.gran_mantissa = mant;
+	r->btime.gran_exponent = exp;
+	r->ctime.gran_mantissa = mant;
+	r->ctime.gran_exponent = exp;
+	r->mtime.gran_mantissa = mant;
+	r->mtime.gran_exponent = exp;
+
+out:
+	return sizeof(*r);
+}
+
+static int nfs_fsinfo_get_info(struct path *path, struct fsinfo_context *ctx)
+{
+	const struct nfs_server *server = NFS_SB(path->dentry->d_sb);
+	const struct nfs_client *clp = server->nfs_client;
+	struct fsinfo_nfs_info *r = ctx->buffer;
+
+	r->version		= clp->rpc_ops->version;
+	r->minor_version	= clp->cl_minorversion;
+	r->transport_proto	= clp->cl_proto;
+	return sizeof(*r);
+}
+
+static int nfs_fsinfo_get_server_name(struct path *path, struct fsinfo_context *ctx)
+{
+	const struct nfs_server *server = NFS_SB(path->dentry->d_sb);
+	const struct nfs_client *clp = server->nfs_client;
+
+	return fsinfo_string(clp->cl_hostname, ctx);
+}
+
+static int nfs_fsinfo_get_server_addresses(struct path *path, struct fsinfo_context *ctx)
+{
+	const struct nfs_server *server = NFS_SB(path->dentry->d_sb);
+	const struct nfs_client *clp = server->nfs_client;
+	struct fsinfo_nfs_server_address *addr = ctx->buffer;
+	int ret;
+
+	ret = 1 * sizeof(*addr);
+	if (ret <= ctx->buf_size)
+		memcpy(&addr[0].address, &clp->cl_addr, clp->cl_addrlen);
+	return ret;
+
+}
+
+static int nfs_fsinfo_get_gssapi_name(struct path *path, struct fsinfo_context *ctx)
+{
+	const struct nfs_server *server = NFS_SB(path->dentry->d_sb);
+	const struct nfs_client *clp = server->nfs_client;
+
+	return fsinfo_string(clp->cl_acceptor, ctx);
+}
+
+static int nfs_fsinfo_get_limits(struct path *path, struct fsinfo_context *ctx)
+{
+	const struct nfs_server *server = NFS_SB(path->dentry->d_sb);
+	struct fsinfo_limits *lim = ctx->buffer;
+
+	lim->max_file_size.hi	= 0;
+	lim->max_file_size.lo	= server->maxfilesize;
+	lim->max_ino.hi		= 0;
+	lim->max_ino.lo		= U64_MAX;
+	lim->max_hard_links	= UINT_MAX;
+	lim->max_uid		= UINT_MAX;
+	lim->max_gid		= UINT_MAX;
+	lim->max_filename_len	= NAME_MAX - 1;
+	lim->max_symlink_len	= PATH_MAX - 1;
+	return sizeof(*lim);
+}
+
+static int nfs_fsinfo_get_supports(struct path *path, struct fsinfo_context *ctx)
+{
+	const struct nfs_server *server = NFS_SB(path->dentry->d_sb);
+	struct fsinfo_supports *sup = ctx->buffer;
+
+	/* Don't set STATX_INO as i_ino is fabricated and may not be unique. */
+
+	if (server->caps & NFS_CAP_MODE)
+		sup->stx_mask |= STATX_TYPE | STATX_MODE;
+	if (server->caps & NFS_CAP_OWNER)
+		sup->stx_mask |= STATX_UID;
+	if (server->caps & NFS_CAP_OWNER_GROUP)
+		sup->stx_mask |= STATX_GID;
+	if (server->caps & NFS_CAP_ATIME)
+		sup->stx_mask |= STATX_ATIME;
+	if (server->caps & NFS_CAP_CTIME)
+		sup->stx_mask |= STATX_CTIME;
+	if (server->caps & NFS_CAP_MTIME)
+		sup->stx_mask |= STATX_MTIME;
+	if (server->attr_bitmask[0] & FATTR4_WORD0_SIZE)
+		sup->stx_mask |= STATX_SIZE;
+	if (server->attr_bitmask[1] & FATTR4_WORD1_NUMLINKS)
+		sup->stx_mask |= STATX_NLINK;
+
+	if (server->attr_bitmask[0] & FATTR4_WORD0_ARCHIVE)
+		sup->win_file_attrs |= ATTR_ARCHIVE;
+	if (server->attr_bitmask[0] & FATTR4_WORD0_HIDDEN)
+		sup->win_file_attrs |= ATTR_HIDDEN;
+	if (server->attr_bitmask[1] & FATTR4_WORD1_SYSTEM)
+		sup->win_file_attrs |= ATTR_SYSTEM;
+
+	sup->stx_attributes = STATX_ATTR_AUTOMOUNT;
+	return sizeof(*sup);
+}
+
+static int nfs_fsinfo_get_features(struct path *path, struct fsinfo_context *ctx)
+{
+	const struct nfs_server *server = NFS_SB(path->dentry->d_sb);
+	struct fsinfo_features *ft = ctx->buffer;
+
+	fsinfo_set_feature(ft, FSINFO_FEAT_IS_NETWORK_FS);
+	fsinfo_set_feature(ft, FSINFO_FEAT_AUTOMOUNTS);
+	fsinfo_set_feature(ft, FSINFO_FEAT_O_SYNC);
+	fsinfo_set_feature(ft, FSINFO_FEAT_O_DIRECT);
+	fsinfo_set_feature(ft, FSINFO_FEAT_ADV_LOCKS);
+	fsinfo_set_feature(ft, FSINFO_FEAT_DEVICE_FILES);
+	fsinfo_set_feature(ft, FSINFO_FEAT_UNIX_SPECIALS);
+	if (server->nfs_client->rpc_ops->version == 4) {
+		fsinfo_set_feature(ft, FSINFO_FEAT_LEASES);
+		fsinfo_set_feature(ft, FSINFO_FEAT_IVER_ALL_CHANGE);
+	}
+
+	if (server->caps & NFS_CAP_OWNER)
+		fsinfo_set_feature(ft, FSINFO_FEAT_UIDS);
+	if (server->caps & NFS_CAP_OWNER_GROUP)
+		fsinfo_set_feature(ft, FSINFO_FEAT_GIDS);
+	if (!(server->caps & NFS_CAP_MODE))
+		fsinfo_set_feature(ft, FSINFO_FEAT_NO_UNIX_MODE);
+	if (server->caps & NFS_CAP_ACLS)
+		fsinfo_set_feature(ft, FSINFO_FEAT_HAS_ACL);
+	if (server->caps & NFS_CAP_SYMLINKS)
+		fsinfo_set_feature(ft, FSINFO_FEAT_SYMLINKS);
+	if (server->caps & NFS_CAP_HARDLINKS)
+		fsinfo_set_feature(ft, FSINFO_FEAT_HARD_LINKS);
+	if (server->caps & NFS_CAP_ATIME)
+		fsinfo_set_feature(ft, FSINFO_FEAT_HAS_ATIME);
+	if (server->caps & NFS_CAP_CTIME)
+		fsinfo_set_feature(ft, FSINFO_FEAT_HAS_CTIME);
+	if (server->caps & NFS_CAP_MTIME)
+		fsinfo_set_feature(ft, FSINFO_FEAT_HAS_MTIME);
+
+	if (server->attr_bitmask[0] & FATTR4_WORD0_CASE_INSENSITIVE)
+		fsinfo_set_feature(ft, FSINFO_FEAT_NAME_CASE_INDEP);
+	if ((server->attr_bitmask[0] & FATTR4_WORD0_ARCHIVE) ||
+	    (server->attr_bitmask[0] & FATTR4_WORD0_HIDDEN) ||
+	    (server->attr_bitmask[1] & FATTR4_WORD1_SYSTEM))
+		fsinfo_set_feature(ft, FSINFO_FEAT_WINDOWS_ATTRS);
+
+	return sizeof(*ft);
+}
+
+static const struct fsinfo_attribute nfs_fsinfo_attributes[] = {
+	FSINFO_VSTRUCT	(FSINFO_ATTR_TIMESTAMP_INFO,	nfs_fsinfo_get_timestamp_info),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_LIMITS,		nfs_fsinfo_get_limits),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_SUPPORTS,		nfs_fsinfo_get_supports),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_FEATURES,		nfs_fsinfo_get_features),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_NFS_INFO,		nfs_fsinfo_get_info),
+	FSINFO_STRING	(FSINFO_ATTR_NFS_SERVER_NAME,	nfs_fsinfo_get_server_name),
+	FSINFO_LIST	(FSINFO_ATTR_NFS_SERVER_ADDRESSES, nfs_fsinfo_get_server_addresses),
+	FSINFO_STRING	(FSINFO_ATTR_NFS_GSSAPI_NAME,	nfs_fsinfo_get_gssapi_name),
+	{}
+};
+
+int nfs_fsinfo(struct path *path, struct fsinfo_context *ctx)
+{
+	return fsinfo_get_attribute(path, ctx, nfs_fsinfo_attributes);
+}
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index f80c47d5ff27..59e407066b45 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -10,6 +10,7 @@
 #include <linux/sunrpc/addr.h>
 #include <linux/nfs_page.h>
 #include <linux/wait_bit.h>
+#include <linux/fsinfo.h>
 
 #define NFS_SB_MASK (SB_RDONLY|SB_NOSUID|SB_NODEV|SB_NOEXEC|SB_SYNCHRONOUS)
 
@@ -247,6 +248,11 @@ extern const struct svc_version nfs4_callback_version4;
 /* fs_context.c */
 extern struct file_system_type nfs_fs_type;
 
+/* fsinfo.c */
+#ifdef CONFIG_FSINFO
+extern int nfs_fsinfo(struct path *path, struct fsinfo_context *ctx);
+#endif
+
 /* pagelist.c */
 extern int __init nfs_init_nfspagecache(void);
 extern void nfs_destroy_nfspagecache(void);
diff --git a/fs/nfs/nfs4super.c b/fs/nfs/nfs4super.c
index 1475f932d7da..cd38da87cbd3 100644
--- a/fs/nfs/nfs4super.c
+++ b/fs/nfs/nfs4super.c
@@ -26,6 +26,9 @@ static const struct super_operations nfs4_sops = {
 	.write_inode	= nfs4_write_inode,
 	.drop_inode	= nfs_drop_inode,
 	.statfs		= nfs_statfs,
+#ifdef CONFIG_FSINFO
+	.fsinfo		= nfs_fsinfo,
+#endif
 	.evict_inode	= nfs4_evict_inode,
 	.umount_begin	= nfs_umount_begin,
 	.show_options	= nfs_show_options,
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index dada09b391c6..27ac751d3789 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -76,6 +76,9 @@ const struct super_operations nfs_sops = {
 	.write_inode	= nfs_write_inode,
 	.drop_inode	= nfs_drop_inode,
 	.statfs		= nfs_statfs,
+#ifdef CONFIG_FSINFO
+	.fsinfo		= nfs_fsinfo,
+#endif
 	.evict_inode	= nfs_evict_inode,
 	.umount_begin	= nfs_umount_begin,
 	.show_options	= nfs_show_options,
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
index d8d05f0f1473..346cf0cf42cb 100644
--- a/include/uapi/linux/fsinfo.h
+++ b/include/uapi/linux/fsinfo.h
@@ -43,6 +43,11 @@
 
 #define FSINFO_ATTR_EXT4_TIMESTAMPS	0x400	/* Ext4 superblock timestamps */
 
+#define FSINFO_ATTR_NFS_INFO		0x500	/* Information about an NFS mount */
+#define FSINFO_ATTR_NFS_SERVER_NAME	0x501	/* Name of the server (string) */
+#define FSINFO_ATTR_NFS_SERVER_ADDRESSES 0x502	/* List of addresses of the server */
+#define FSINFO_ATTR_NFS_GSSAPI_NAME	0x503	/* GSSAPI acceptor name */
+
 /*
  * Optional fsinfo() parameter structure.
  *
@@ -328,4 +333,28 @@ struct fsinfo_ext4_timestamps {
 
 #define FSINFO_ATTR_EXT4_TIMESTAMPS__STRUCT struct fsinfo_ext4_timestamps
 
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_NFS_INFO).
+ *
+ * Get information about an NFS mount.
+ */
+struct fsinfo_nfs_info {
+	__u32		version;
+	__u32		minor_version;
+	__u32		transport_proto;
+};
+
+#define FSINFO_ATTR_NFS_INFO__STRUCT struct fsinfo_nfs_info
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_NFS_SERVER_ADDRESSES).
+ *
+ * Get the addresses of the server for an NFS mount.
+ */
+struct fsinfo_nfs_server_address {
+	struct __kernel_sockaddr_storage address;
+};
+
+#define FSINFO_ATTR_NFS_SERVER_ADDRESSES__STRUCT struct fsinfo_nfs_server_address
+
 #endif /* _UAPI_LINUX_FSINFO_H */
diff --git a/include/uapi/linux/windows.h b/include/uapi/linux/windows.h
new file mode 100644
index 000000000000..17efb9a40529
--- /dev/null
+++ b/include/uapi/linux/windows.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Common windows attributes
+ */
+#ifndef _UAPI_LINUX_WINDOWS_H
+#define _UAPI_LINUX_WINDOWS_H
+
+/*
+ * File Attribute flags
+ */
+#define ATTR_READONLY		0x0001
+#define ATTR_HIDDEN		0x0002
+#define ATTR_SYSTEM		0x0004
+#define ATTR_VOLUME		0x0008
+#define ATTR_DIRECTORY		0x0010
+#define ATTR_ARCHIVE		0x0020
+#define ATTR_DEVICE		0x0040
+#define ATTR_NORMAL		0x0080
+#define ATTR_TEMPORARY		0x0100
+#define ATTR_SPARSE		0x0200
+#define ATTR_REPARSE		0x0400
+#define ATTR_COMPRESSED		0x0800
+#define ATTR_OFFLINE		0x1000	/* ie file not immediately available -
+					   on offline storage */
+#define ATTR_NOT_CONTENT_INDEXED 0x2000
+#define ATTR_ENCRYPTED		0x4000
+#define ATTR_POSIX_SEMANTICS	0x01000000
+#define ATTR_BACKUP_SEMANTICS	0x02000000
+#define ATTR_DELETE_ON_CLOSE	0x04000000
+#define ATTR_SEQUENTIAL_SCAN	0x08000000
+#define ATTR_RANDOM_ACCESS	0x10000000
+#define ATTR_NO_BUFFERING	0x20000000
+#define ATTR_WRITE_THROUGH	0x80000000
+
+#endif /* _UAPI_LINUX_WINDOWS_H */
diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
index 829297e9d1b6..b03869faef01 100644
--- a/samples/vfs/test-fsinfo.c
+++ b/samples/vfs/test-fsinfo.c
@@ -408,6 +408,40 @@ static void dump_ext4_fsinfo_timestamps(void *reply, unsigned int size)
 	printf("\tlast-err: %s\n", dump_ext4_time(buffer, r->last_error_time));
 }
 
+static void dump_nfs_fsinfo_info(void *reply, unsigned int size)
+{
+	struct fsinfo_nfs_info *r = reply;
+
+	printf("ver=%u.%u proto=%u\n", r->version, r->minor_version, r->transport_proto);
+}
+
+static void dump_nfs_fsinfo_server_addresses(void *reply, unsigned int size)
+{
+	struct fsinfo_nfs_server_address *r = reply;
+	struct sockaddr_storage *ss = (struct sockaddr_storage *)&r->address;
+	struct sockaddr_in6 *sin6;
+	struct sockaddr_in *sin;
+	char buf[1024];
+
+	switch (ss->ss_family) {
+	case AF_INET:
+		sin = (struct sockaddr_in *)ss;
+		if (!inet_ntop(AF_INET, &sin->sin_addr, buf, sizeof(buf)))
+			break;
+		printf("%5u %s\n", ntohs(sin->sin_port), buf);
+		return;
+	case AF_INET6:
+		sin6 = (struct sockaddr_in6 *)ss;
+		if (!inet_ntop(AF_INET6, &sin6->sin6_addr, buf, sizeof(buf)))
+			break;
+		printf("%5u %s\n", ntohs(sin6->sin6_port), buf);
+		return;
+	default:
+		printf("family=%u\n", ss->ss_family);
+		return;
+	}
+}
+
 static void dump_string(void *reply, unsigned int size)
 {
 	char *s = reply, *p;
@@ -495,6 +529,10 @@ static const struct fsinfo_attribute fsinfo_attributes[] = {
 	FSINFO_STRING	(FSINFO_ATTR_AFS_SERVER_NAME,	string),
 	FSINFO_LIST_N	(FSINFO_ATTR_AFS_SERVER_ADDRESSES, afs_fsinfo_server_address),
 	FSINFO_VSTRUCT	(FSINFO_ATTR_EXT4_TIMESTAMPS,	ext4_fsinfo_timestamps),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_NFS_INFO,		nfs_fsinfo_info),
+	FSINFO_STRING	(FSINFO_ATTR_NFS_SERVER_NAME,	string),
+	FSINFO_LIST	(FSINFO_ATTR_NFS_SERVER_ADDRESSES, nfs_fsinfo_server_addresses),
+	FSINFO_STRING	(FSINFO_ATTR_NFS_GSSAPI_NAME,	string),
 	{}
 };
 



^ permalink raw reply	[flat|nested] 50+ messages in thread
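
A note on nfs_fsinfo_get_timestamp_info() above: it reports timestamp
granularity as a decimal mantissa and exponent, derived from the server's
time_delta in nanoseconds by stripping trailing decimal zeros.  Below is a
minimal userspace sketch of the same decomposition.  The struct gran type
and the gran_from_nsec() name are hypothetical, and the loop is written
with % and / rather than the kernel's do_div().

```c
#include <stdint.h>

/* Hypothetical result type: granularity = mantissa * 10^exponent seconds. */
struct gran {
	unsigned int mantissa;
	int exponent;
};

/* Strip trailing decimal zeros from a nanosecond count, raising the
 * exponent (which starts at -9, i.e. nanoseconds) once per zero removed.
 * A zero input is reported as 1 * 10^0, matching the static default in
 * the patch.  Assumes the stripped value fits in an unsigned int. */
static struct gran gran_from_nsec(uint64_t nsec)
{
	struct gran g = { .mantissa = 1, .exponent = 0 };
	int exp = -9;

	if (nsec == 0)
		return g;
	while (nsec % 10 == 0) {
		nsec /= 10;
		exp++;
	}
	g.mantissa = (unsigned int)nsec;
	g.exponent = exp;
	return g;
}
```

For instance, a time_delta of 2500ns has one factor of 100 to strip, so it
comes out as mantissa 25 and exponent -7.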

* Re: [PATCH 00/14] VFS: Filesystem information [ver #18]
  2020-03-09 14:00 [PATCH 00/14] VFS: Filesystem information [ver #18] David Howells
                   ` (13 preceding siblings ...)
  2020-03-09 14:02 ` [PATCH 14/14] fsinfo: Example support for NFS " David Howells
@ 2020-03-09 17:50 ` Jeff Layton
  2020-03-09 19:22   ` Andres Freund
  2020-03-09 20:02 ` Miklos Szeredi
  2020-03-09 22:52 ` David Howells
  16 siblings, 1 reply; 50+ messages in thread
From: Jeff Layton @ 2020-03-09 17:50 UTC (permalink / raw)
  To: David Howells, torvalds, viro
  Cc: Theodore Ts'o, Stefan Metzmacher, Andreas Dilger, linux-ext4,
	Aleksa Sarai, Trond Myklebust, Anna Schumaker, linux-nfs,
	linux-api, raven, mszeredi, christian, jannh, darrick.wong, kzak,
	linux-fsdevel, linux-security-module, linux-kernel,
	Andres Freund

On Mon, 2020-03-09 at 14:00 +0000, David Howells wrote:
> Here's a set of patches that adds a system call, fsinfo(), that allows
> information about the VFS, mount topology, superblock and files to be
> retrieved.
> 
> The patchset is based on top of the notifications patchset and allows event
> counters implemented in the latter to be retrieved to allow overruns to be
> efficiently managed.
> 
> Included are a couple of sample programs plus limited example code for NFS
> and Ext4.  The example code is not intended to go upstream as-is.
> 
> 
> =======
> THE WHY
> =======
> 
> Why do we want this?
> 
> Using /proc/mounts (or similar) has problems:
> 
>  (1) Reading from it holds a global lock (namespace_sem) that prevents
>      mounting and unmounting.  Lots of data is encoded and mangled into
>      text whilst the lock is held, including superblock option strings and
>      mount point paths.  This causes performance problems when there are a
>      lot of mount objects in a system.
> 
>  (2) Even though namespace_sem is held during a read, reading the whole
>      file isn't necessarily atomic with respect to mount-type operations.
>      If a read isn't satisfied in one go, then it may return to userspace
>      briefly and then continue reading some way into the file.  But changes
>      can occur in the interval that may then go unseen.
> 
>  (3) Determining what has changed means parsing and comparing consecutive
>      outputs of /proc/mounts.
> 
>  (4) Querying a specific mount or superblock means searching through
>      /proc/mounts and searching by path or mount ID - but we might have an
>      fd we want to query.
> 
>  (5) Mount topology is not explicit.  One must derive it manually by
>      comparing entries.
> 
>  (6) Whilst you can poll() it for events, it only tells you that something
>      changed in the namespace, not what or whether you can even see the
>      change.
> 
> To fix the notification issues, the preceding notifications patchset added
> mount watch notifications whereby you can watch for notifications in a
> specific mount subtree.  The notification messages include the ID(s) of the
> affected mounts.
> 
> To support notifications, however, we need to be able to handle overruns in
> the notification queue.  I added a number of event counters to struct
> super_block and struct mount to allow you to pin down the changes, but
> there needs to be a way to retrieve them.  Exposing them through /proc
> would require adding yet another /proc/mounts-type file.  We could add
> per-mount directories full of attributes in sysfs, but that has issues also
> (see below).
> 
> Adding an extensible system call interface for retrieving filesystem
> information also allows other things to be exposed:
> 
>  (1) Jeff Layton's error handling changes need a way to allow error event
>      information to be retrieved.
> 
>  (2) Bits in masks returned by things like statx() and FS_IOC_GETFLAGS are
>      actually 3-state { Set, Unset, Not supported }.  It could be useful to
>      provide a way to expose information like this[*].
> 
>  (3) Limits of the numerical metadata values in a filesystem[*].
> 
>  (4) Filesystem capability information[*].  Filesystems don't all have the
>      same capabilities, and even different instances may have different
>      capabilities, particularly with network filesystems where the set
>      of capabilities may be server-dependent.  Capabilities might even
>      vary at file granularity - though possibly such information should
>      be conveyed through statx() instead.
> 
>  (5) ID mapping/shifting tables in use for a superblock.
> 
>  (6) Filesystem-specific information.  I need something for AFS so that I
>      can do pioctl()-emulation, thereby allowing me to implement certain of
>      the AFS command line utilities that query state of a particular file.
>      This could also have application for other filesystems, such as NFS,
>      CIFS and ext4.
> 
>  [*] In a lot of cases these are probably fixed and can be memcpy'd from
>      static data.
> 
> There's a further consideration: I want to make it possible to have
> fsconfig(fd, FSCONFIG_CMD_CREATE) be intercepted by a container manager
> such that the manager can supervise a mount attempted inside the container.
> The manager would be given an fd pointing to the fs_context struct and
> would then need some way to query it (fsinfo()) and modify it (fsconfig()).
> This could also be used to arbitrate user-requested mounts when containers
> are not in play.
> 
> 
> ============================
> WHY NOT USE PROCFS OR SYSFS?
> ============================
> 
> Why is it better to go with a new system call rather than adding more magic
> stuff to /proc or /sys for each superblock object and each mount object?
> 
>  (1) It can be targeted.  It makes it easy to query directly by path or
>      fd, but can also query by mount ID or fscontext fd.  procfs and sysfs
>      cannot do three of these things easily.
> 
>  (2) Easier to provide LSM oversight.  Is the accessing process allowed to
>      query information pertinent to a particular file?
> 
>  (3) It's more efficient as we can return specific binary data rather than
>      making huge text dumps.  Granted, sysfs and procfs could present the
>      same data, though as lots of little files which have to be
>      individually opened, read, closed and parsed.
> 
>  (4) We wouldn't have the overhead of open and close (even adding a
>      self-contained readfile() syscall has to do that internally).
> 
>  (5) Opening a file in procfs or sysfs has a pathwalk overhead for each
>      file accessed.  We can use an integer attribute ID instead (yes, this
>      is similar to ioctl) - but could also use a string ID if that is
>      preferred.
> 
>  (6) Can query cross-namespace if, say, a container manager process is
>      given an fs_context that hasn't yet been mounted into a namespace - or
>      hasn't even been fully created yet.
> 
>  (7) Don't have to create/delete a bunch of sysfs/procfs nodes each time a
>      mount happens or is removed - and since systemd makes much use of
>      mount namespaces and mount propagation, this will create a lot of
>      nodes.
> 
> 
> ================
> DESIGN DECISIONS
> ================
> 
>  (1) Information is partitioned into sets of attributes.
> 
>  (2) Attribute IDs are integers as they're fast to compare.
> 
>  (3) Attribute values are typed (struct, list of structs, string, opaque
>      blob).  The type is fixed for a particular attribute.
> 
>  (4) For structure types, the length is also a version.  New fields can be
>      tacked onto the end.
> 
>  (5) When copying a versioned struct to userspace, the core handles a
>      version mismatch by truncating or zero-padding the data as necessary.
>      None of this is seen by the filesystem.
> 
>  (6) The core handles all the buffering and buffer resizing.
> 
>  (7) The filesystem never gets any access to the userspace parameter buffer
>      or result buffer.
> 
>  (8) "Meta" attributes can describe other attributes.
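
Points (4) and (5) above can be illustrated with a minimal userspace
sketch of the truncate-or-zero-pad copy.  This is not the code from the
patchset; copy_versioned() and its parameters are hypothetical names used
only to show the idea.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy a kernel-side struct of src_len bytes into a
 * caller buffer of dst_len bytes.  A shorter caller buffer gets a
 * truncated copy; a longer one gets its tail zero-filled.  The return
 * value is the size of the data actually available, regardless of how
 * much was copied. */
static size_t copy_versioned(void *dst, size_t dst_len,
			     const void *src, size_t src_len)
{
	size_t n = src_len < dst_len ? src_len : dst_len;

	memcpy(dst, src, n);
	if (dst_len > n)
		memset((char *)dst + n, 0, dst_len - n);
	return src_len;
}
```

Because new fields are only ever appended, an older caller that passes a
shorter buffer still sees every field it knows about at the same offset.
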
> 
> 
> ========
> OVERVIEW
> ========
> 
> fsinfo() is a system call that allows information about the filesystem at a
> particular path point to be queried as a set of attributes.
> 
> Attribute values are of four basic types:
> 
>  (1) Structure with version-dependent length (the length is the version).
> 
>  (2) Variable-length string.
> 
>  (3) List of structures (all the same length).
> 
>  (4) Opaque blob.
> 
> Attributes can have multiple values either as a sequence of values or a
> sequence-of-sequences of values and all the values of a particular
> attribute must be of the same type.  Values can be up to INT_MAX size,
> subject to memory availability.
> 
> Note that the values of an attribute *are* allowed to vary between dentries
> within a single superblock, depending on the specific dentry that you're
> looking at, but the values still have to be of the type for that attribute.
> 
> I've tried to make the interface as light as possible, so integer attribute
> ID rather than string and the core does all the buffer allocation and
> expansion and all the extensibility support work rather than leaving that
> to the filesystems.  This means that userspace pointers are not exposed to
> the filesystem.
> 
> 
> fsinfo() allows a variety of information to be retrieved about a filesystem
> and the mount topology:
> 
>  (1) General superblock attributes:
> 
>      - Filesystem identifiers (UUID, volume label, device numbers, ...)
>      - The limits on a filesystem's capabilities
>      - Information on supported statx fields and attributes and IOC flags.
>      - A variety of single-bit flags indicating supported capabilities.
>      - Timestamp resolution and range.
>      - The amount of space/free space in a filesystem (as statfs()).
>      - Superblock notification counter.
> 
>  (2) Filesystem-specific superblock attributes:
> 
>      - Superblock-level timestamps.
>      - Cell name, workgroup or other netfs grouping concept.
>      - Server names and addresses.
> 
>  (3) VFS information:
> 
>      - Mount topology information.
>      - Mount attributes.
>      - Mount notification counter.
>      - Mount point path.
> 
>  (4) Information about what the fsinfo() syscall itself supports, including
>      the type and struct size of attributes.
> 
> The system is extensible:
> 
>  (1) New attributes can be added.  There is no requirement that a
>      filesystem implement every attribute.  A helper function is provided
>      to scan a list of attributes and a filesystem can have multiple such
>      lists.
> 
>  (2) Structure attributes whose length is their version can be made larger
>      and have additional information tacked on the end, provided the
>      layout of the existing fields is kept.  If an older process asks for a shorter
>      structure, it will only be given the bits it asks for.  If a newer
>      process asks for a longer structure on an older kernel, the extra
>      space will be set to 0.  In all cases, the size of the data actually
>      available is returned.
> 
>      In essence, the size of a structure is that structure's version: a
>      smaller size is an earlier version and a later version includes
>      everything that the earlier version did.
> 
>  (3) New single-bit capability flags can be added.  This is a structure-typed
>      attribute and, as such, (2) applies.  Any bits you wanted but the kernel
>      doesn't support are automatically set to 0.
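
For (3), a userspace consumer could test capability bits along the lines
of the sketch below.  This assumes the flags arrive as a plain byte array
with bit N stored in byte N/8 at position N%8; that layout, and the name
feature_is_set(), are assumptions for illustration and are not taken from
the patchset.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: test bit `nr` in a feature bitmap of `len` bytes.
 * A bit beyond the end of the returned data is treated as clear, which
 * is the same answer that zero-padding by the kernel would give. */
static bool feature_is_set(const uint8_t *bits, size_t len, unsigned int nr)
{
	size_t byte = nr / 8;

	if (byte >= len)
		return false;
	return bits[byte] & (1u << (nr % 8));
}
```

A caller built against a newer header can therefore ask about a flag that
an older kernel has never heard of and simply get "not set" back.
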
> 
> fsinfo() may be called like the following, for example:
> 
> 	struct fsinfo_params params = {
> 		.resolve_flags	= RESOLVE_NO_TRAILING_SYMLINKS,
> 		.flags		= FSINFO_FLAGS_QUERY_PATH,
> 		.request	= FSINFO_ATTR_AFS_SERVER_ADDRESSES,
> 		.Nth		= 2,
> 	};
> 	struct fsinfo_afs_server_address address;
> 	len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc", &params,
> 		     &address, sizeof(address));
> 
> The above example would query an AFS filesystem to retrieve the address
> list for the 3rd server, and:
> 
> 	struct fsinfo_params params = {
> 		.resolve_flags	= RESOLVE_NO_TRAILING_SYMLINKS,
> 		.flags		= FSINFO_FLAGS_QUERY_PATH,
> 		.request	= FSINFO_ATTR_NFS_SERVER_NAME,
> 	};
> 	char server_name[256];
> 	len = fsinfo(AT_FDCWD, "/home/dhowells/", &params,
> 		     &server_name, sizeof(server_name));
> 
> would retrieve the name of the NFS server as a string.
> 
> In future, I want to make fsinfo() capable of querying a context created by
> fsopen() or fspick(), e.g.:
> 
> 	fd = fsopen("ext4", 0);
> 	struct fsinfo_params params = {
> 		.flags		= FSINFO_FLAGS_QUERY_FSCONTEXT,
> 		.request	= FSINFO_ATTR_CONFIGURATION,
> 	};
> 	char buffer[65536];
> 	fsinfo(fd, NULL, &params, &buffer, sizeof(buffer));
> 
> even if that context doesn't currently have a superblock attached.
> 
> The patches can be found here also:
> 
> 	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
> 
> on branch:
> 
> 	fsinfo-core
> 
> 
> ===================
> SIGNIFICANT CHANGES
> ===================
> 
>  ver #18:
> 
>  (*) Moved the mount and superblock notification patches into a different
>      branch.
> 
>  (*) Made superblock configuration (->show_opts), bindmount path
>      (->show_path) and filesystem statistics (->show_stats) available as
>      the CONFIGURATION, MOUNT_PATH and FS_STATISTICS attributes.
> 
>  (*) Made mountpoint device name available, filtered through the superblock
>      (->show_devname), as the SOURCE attribute.
> 
>  (*) Made the mountpoint available as a full path as well as a relative
>      one.
> 
>  (*) Added more event counters to MOUNT_INFO, including a subtree
>      notification counter, to make it easier to clean up after a
>      notification overrun.
> 
>  (*) Made the event counter value returned by MOUNT_CHILDREN the sum of the
>      five event counters.
> 
>  (*) Added a mount uniquifier and added that to the MOUNT_CHILDREN entries
>      also so that mount ID reuse can be detected.
> 
>  (*) Merged the SB_NOTIFICATION attribute into the MOUNT_INFO attribute to
>      avoid duplicate information.
> 
>  (*) Switched to using the RESOLVE_* flags rather than AT_* flags for
>      pathwalk control.  Added more RESOLVE_* flags.
> 
>  (*) Used a lock instead of RCU to enumerate children for the
>      MOUNT_CHILDREN attribute for safety.  This is probably worth
>      revisiting at a later date, however.
> 
> 
>  ver #17:
> 
>  (*) Applied comments from Jann Horn, Darrick Wong and Christian Brauner.
> 
>  (*) Rearranged the order in which fsinfo() does things so that the
>      superblock operations table can have a function pointer rather than a
>      table pointer.  The ->fsinfo() op is now called at least twice, once
>      to determine the size of buffer needed and then to retrieve the data.
>      If the retrieval step indicates yet more space is needed, the buffer
>      will be expanded and that step repeated.
> 
>  (*) Merge the element size into the size in the fsinfo_attribute def and
>      don't set size for strings or opaques.  Let a helper work that out.
>      This means that strings can actually get larger than 4K.
> 
>  (*) A helper is provided to scan a list of attributes and call the
>      appropriate get function.  This can be called from a filesystem's
>      ->fsinfo() method multiple times.  It also handles attribute
>      enumeration and info querying.
> 
>  (*) Rearranged the patches to put all the notification patches first.
>      This allowed some of the bits to be squashed together.  At some point,
>      I'll move the notification patches into a different branch.
> 
>  ver #16:
> 
>  (*) Split the features bits out of the fsinfo() core into their own patch
>      and got rid of the name encoding attributes.
> 
>  (*) Renamed the 'array' type to 'list' and made AFS use it for returning
>      server address lists.
> 
>  (*) Changed the ->fsinfo() method into an ->fsinfo_attributes[] table,
>      where each attribute has a ->get() method to deal with it.  These
>      tables can then be returned with an fsinfo meta attribute.
> 
>  (*) Dropped the fscontext query and parameter/description retrieval
>      attributes for now.
> 
>  (*) Picked the mount topology attributes into this branch.
> 
>  (*) Picked the mount notifications into this branch and rebased on top of
>      notifications-pipe-core.
> 
>  (*) Picked the superblock notifications into this branch.
> 
>  (*) Add sample code for Ext4 and NFS.
> 
> David
> ---
> David Howells (14):
>       VFS: Add additional RESOLVE_* flags
>       fsinfo: Add fsinfo() syscall to query filesystem information
>       fsinfo: Provide a bitmap of supported features
>       fsinfo: Allow retrieval of superblock devname, options and stats
>       fsinfo: Allow fsinfo() to look up a mount object by ID
>       fsinfo: Add a uniquifier ID to struct mount
>       fsinfo: Allow mount information to be queried
>       fsinfo: Allow the mount topology propogation flags to be retrieved
>       fsinfo: Provide notification overrun handling support
>       fsinfo: sample: Mount listing program
>       fsinfo: Add API documentation
>       fsinfo: Add support for AFS
>       fsinfo: Example support for Ext4
>       fsinfo: Example support for NFS
> 
> 
>  Documentation/filesystems/fsinfo.rst        |  564 +++++++++++++++++
>  arch/alpha/kernel/syscalls/syscall.tbl      |    1 
>  arch/arm/tools/syscall.tbl                  |    1 
>  arch/arm64/include/asm/unistd.h             |    2 
>  arch/ia64/kernel/syscalls/syscall.tbl       |    1 
>  arch/m68k/kernel/syscalls/syscall.tbl       |    1 
>  arch/microblaze/kernel/syscalls/syscall.tbl |    1 
>  arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 
>  arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 
>  arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 
>  arch/parisc/kernel/syscalls/syscall.tbl     |    1 
>  arch/powerpc/kernel/syscalls/syscall.tbl    |    1 
>  arch/s390/kernel/syscalls/syscall.tbl       |    1 
>  arch/sh/kernel/syscalls/syscall.tbl         |    1 
>  arch/sparc/kernel/syscalls/syscall.tbl      |    1 
>  arch/x86/entry/syscalls/syscall_32.tbl      |    1 
>  arch/x86/entry/syscalls/syscall_64.tbl      |    1 
>  arch/xtensa/kernel/syscalls/syscall.tbl     |    1 
>  fs/Kconfig                                  |    7 
>  fs/Makefile                                 |    1 
>  fs/afs/internal.h                           |    1 
>  fs/afs/super.c                              |  218 +++++++
>  fs/d_path.c                                 |    2 
>  fs/ext4/Makefile                            |    1 
>  fs/ext4/ext4.h                              |    6 
>  fs/ext4/fsinfo.c                            |   45 +
>  fs/ext4/super.c                             |    3 
>  fs/fsinfo.c                                 |  720 ++++++++++++++++++++++
>  fs/internal.h                               |   13 
>  fs/mount.h                                  |    3 
>  fs/namespace.c                              |  362 +++++++++++
>  fs/nfs/Makefile                             |    1 
>  fs/nfs/fsinfo.c                             |  230 +++++++
>  fs/nfs/internal.h                           |    6 
>  fs/nfs/nfs4super.c                          |    3 
>  fs/nfs/super.c                              |    3 
>  fs/open.c                                   |    8 
>  include/linux/fcntl.h                       |    3 
>  include/linux/fs.h                          |    4 
>  include/linux/fsinfo.h                      |  111 +++
>  include/linux/syscalls.h                    |    4 
>  include/uapi/asm-generic/unistd.h           |    4 
>  include/uapi/linux/fsinfo.h                 |  360 +++++++++++
>  include/uapi/linux/mount.h                  |   10 
>  include/uapi/linux/openat2.h                |    8 
>  include/uapi/linux/windows.h                |   35 +
>  kernel/sys_ni.c                             |    1 
>  samples/vfs/Makefile                        |    7 
>  samples/vfs/test-fsinfo.c                   |  880 +++++++++++++++++++++++++++
>  samples/vfs/test-mntinfo.c                  |  277 ++++++++
>  50 files changed, 3905 insertions(+), 14 deletions(-)
>  create mode 100644 Documentation/filesystems/fsinfo.rst
>  create mode 100644 fs/ext4/fsinfo.c
>  create mode 100644 fs/fsinfo.c
>  create mode 100644 fs/nfs/fsinfo.c
>  create mode 100644 include/linux/fsinfo.h
>  create mode 100644 include/uapi/linux/fsinfo.h
>  create mode 100644 include/uapi/linux/windows.h
>  create mode 100644 samples/vfs/test-fsinfo.c
>  create mode 100644 samples/vfs/test-mntinfo.c
> 
> 
The PostgreSQL devs asked a while back for some way to tell whether
there have been any writeback errors on a superblock w/o having to do
any sort of flush -- just "have there been any so far".

I sent a patch a few weeks ago to make syncfs() return errors when there
have been writeback errors on the superblock. It's not merged yet, but
once we have something like that in place, we could expose info from the
errseq_t to userland using this interface.

Something like this patch would do it (which depends on a few others in
my tree, nothing very large though):

---------------------------8<-----------------------

[PATCH] vfs: allow fsinfo to fetch the current state of s_wb_err

Add a new "error_state" struct to fsinfo, and teach the kernel to fill
that out from sb->s_wb_err. There are two fields:

wb_error_last: the most recently recorded errno for the filesystem

wb_error_cookie: this value will change vs. the previously fetched
                 value if a new error was recorded since it was last
		 checked. Callers should treat this as an opaque value
		 that can be compared to earlier fetched values.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/fsinfo.c                 | 11 +++++++++++
 include/uapi/linux/fsinfo.h | 13 +++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/fs/fsinfo.c b/fs/fsinfo.c
index 6d2bc03998e4..3bbe6d7b1a79 100644
--- a/fs/fsinfo.c
+++ b/fs/fsinfo.c
@@ -275,6 +275,7 @@ static const struct fsinfo_attribute fsinfo_common_attributes[] = {
 	FSINFO_STRING	(FSINFO_ATTR_SOURCE,		fsinfo_generic_mount_source),
 	FSINFO_STRING	(FSINFO_ATTR_CONFIGURATION,	fsinfo_generic_seq_read),
 	FSINFO_STRING	(FSINFO_ATTR_FS_STATISTICS,	fsinfo_generic_seq_read),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_ERROR_STATE,	fsinfo_generic_error_state),
 
 	FSINFO_LIST	(FSINFO_ATTR_FSINFO_ATTRIBUTES,	(void *)123UL),
 	FSINFO_VSTRUCT_N(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO, (void *)123UL),
@@ -376,6 +377,16 @@ static int fsinfo_get_attribute_info(struct path *path,
 	return -EOPNOTSUPP; /* We want to go through all the lists */
 }
 
+static int fsinfo_generic_error_state(struct path *path,
+				      struct fsinfo_context *ctx)
+{
+	struct fsinfo_error_state *es = ctx->buffer;
+
+	es->wb_error_cookie = errseq_scrape(&path->dentry->d_sb->s_wb_err);
+	es->wb_error_last = es->wb_error_cookie & MAX_ERRNO;
+	return sizeof(*es);
+}
+
 /**
  * fsinfo_get_attribute - Look up and handle an attribute
  * @path: The object to query
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
index 346cf0cf42cb..3d33744c2320 100644
--- a/include/uapi/linux/fsinfo.h
+++ b/include/uapi/linux/fsinfo.h
@@ -27,6 +27,7 @@
 #define FSINFO_ATTR_SOURCE		0x09	/* Superblock source/device name (string) */
 #define FSINFO_ATTR_CONFIGURATION	0x0a	/* Superblock configuration/options (string) */
 #define FSINFO_ATTR_FS_STATISTICS	0x0b	/* Superblock filesystem statistics (string) */
+#define FSINFO_ATTR_ERROR_STATE	0x0c	/* errseq_t state */
 
 #define FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO 0x100	/* Information about attr N (for path) */
 #define FSINFO_ATTR_FSINFO_ATTRIBUTES	0x101	/* List of supported attrs (for path) */
@@ -357,4 +358,16 @@ struct fsinfo_nfs_server_address {
 
 #define FSINFO_ATTR_NFS_SERVER_ADDRESSES__STRUCT struct fsinfo_nfs_server_address
 
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_ERROR_STATE).
+ *
+ * Retrieve the error state for a filesystem.
+ */
+struct fsinfo_error_state {
+	__u32		wb_error_cookie;	/* writeback error cookie */
+	__u32		wb_error_last;		/* latest writeback error */
+};
+
+#define FSINFO_ATTR_ERROR_STATE__STRUCT struct fsinfo_error_state
+
 #endif /* _UAPI_LINUX_FSINFO_H */
-- 
2.24.1
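
For illustration only, a userland consumer of the proposed attribute might
compare cookies as below.  This is a minimal sketch, assuming the struct
layout from the patch above; the struct is redefined locally because the uapi
header is not merged, and wb_error_seen_since() is a hypothetical helper.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copy of the layout proposed in the patch (an assumption here). */
struct fsinfo_error_state {
	uint32_t wb_error_cookie;	/* opaque; compare for equality only */
	uint32_t wb_error_last;		/* latest writeback error */
};

/*
 * Hypothetical helper: report whether a new writeback error was recorded
 * between two samples.  The cookie is never decoded, only compared.
 */
bool wb_error_seen_since(const struct fsinfo_error_state *prev,
			 const struct fsinfo_error_state *cur)
{
	return prev->wb_error_cookie != cur->wb_error_cookie;
}
```

A caller would fetch the state once at startup, fetch it again at each
checkpoint, and treat any difference in the cookie as "something failed since
the previous sample".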




* Re: [PATCH 00/14] VFS: Filesystem information [ver #18]
  2020-03-09 17:50 ` [PATCH 00/14] VFS: Filesystem information " Jeff Layton
@ 2020-03-09 19:22   ` Andres Freund
  2020-03-09 22:49     ` Jeff Layton
  0 siblings, 1 reply; 50+ messages in thread
From: Andres Freund @ 2020-03-09 19:22 UTC (permalink / raw)
  To: Jeff Layton
  Cc: David Howells, torvalds, viro, Theodore Ts'o,
	Stefan Metzmacher, Andreas Dilger, linux-ext4, Aleksa Sarai,
	Trond Myklebust, Anna Schumaker, linux-nfs, linux-api, raven,
	mszeredi, christian, jannh, darrick.wong, kzak, linux-fsdevel,
	linux-security-module, linux-kernel

Hi,

On 2020-03-09 13:50:59 -0400, Jeff Layton wrote:
> The PostgreSQL devs asked a while back for some way to tell whether
> there have been any writeback errors on a superblock w/o having to do
> any sort of flush -- just "have there been any so far".

Indeed.


> I sent a patch a few weeks ago to make syncfs() return errors when there
> have been writeback errors on the superblock. It's not merged yet, but
> once we have something like that in place, we could expose info from the
> errseq_t to userland using this interface.

I'm still a bit worried about the details of errseq_t being exposed to
userland. Partially because it seems to restrict further evolution of
errseq_t, and partially because it will likely end up with userland trying
to understand it (it's e.g. just too attractive to report a count of
errors etc).

Is there a reason not to report a 64-bit counter instead of the
cookie? In contrast to the struct file case we'd only have the space
overhead once per superblock, rather than once per #files * #fd. And it
seems that the maintenance of that counter could be done without
widespread changes, e.g. instead/in addition to your change:

> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index ccb14b6a16b5..897439475315 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -51,7 +51,10 @@ static inline void mapping_set_error(struct address_space *mapping, int error)
>  		return;
>
>  	/* Record in wb_err for checkers using errseq_t based tracking */
> -	filemap_set_wb_err(mapping, error);
> +	__filemap_set_wb_err(mapping, error);
> +
> +	/* Record it in superblock */
> +	errseq_set(&mapping->host->i_sb->s_wb_err, error);
>
>  	/* Record it in flags for now, for legacy callers */
>  	if (error == -ENOSPC)

Btw, seems like mapping_set_error() should have a non-inline cold path?

Greetings,

Andres Freund


* Re: [PATCH 00/14] VFS: Filesystem information [ver #18]
  2020-03-09 14:00 [PATCH 00/14] VFS: Filesystem information [ver #18] David Howells
                   ` (14 preceding siblings ...)
  2020-03-09 17:50 ` [PATCH 00/14] VFS: Filesystem information " Jeff Layton
@ 2020-03-09 20:02 ` Miklos Szeredi
  2020-03-09 22:52 ` David Howells
  16 siblings, 0 replies; 50+ messages in thread
From: Miklos Szeredi @ 2020-03-09 20:02 UTC (permalink / raw)
  To: David Howells
  Cc: torvalds, viro, Theodore Ts'o, Stefan Metzmacher,
	Andreas Dilger, linux-ext4, Aleksa Sarai, Trond Myklebust,
	Anna Schumaker, linux-nfs, linux-api, raven, mszeredi, christian,
	jannh, darrick.wong, kzak, jlayton, linux-fsdevel,
	linux-security-module, linux-kernel

On Mon, Mar 09, 2020 at 02:00:46PM +0000, David Howells wrote:
> ============================
> WHY NOT USE PROCFS OR SYSFS?
> ============================

And here's the updated patch (hopefully addressed all of Al's concerns)
that uses procfs and a new mountfs.

Get mountinfo from open file:

  cat /proc/$PID/fdmount/$FD/*

Get mountinfo by mount ID:

  mount -t mountfs mountfs /mountfs
  cat /mountfs/$MNT_ID/*

> Why is it better to go with a new system call rather than adding more magic
> stuff to /proc or /sysfs for each superblock object and each mount object?
> 
>  (1) It can be targeted.  It makes it easy to query directly by path or
>      fd, but can also query by mount ID or fscontext fd.  procfs and sysfs
>      cannot do three of these things easily.

See above: with the addition of open(path, O_PATH) it can do all of these.

> 
>  (2) Easier to provide LSM oversight.  Is the accessing process allowed to
>      query information pertinent to a particular file?

Not quite sure why this would be easier for a new ad-hoc interface than for
the well established filesystem API.

> 
>  (3) It's more efficient as we can return specific binary data rather than
>      making huge text dumps.  Granted, sysfs and procfs could present the
>      same data, though as lots of little files which have to be
>      individually opened, read, closed and parsed.
> 
>  (4) We wouldn't have the overhead of open and close (even adding a
>      self-contained readfile() syscall has to do that internally).
> 
>  (5) Opening a file in procfs or sysfs has a pathwalk overhead for each
>      file accessed.  We can use an integer attribute ID instead (yes, this
>      is similar to ioctl) - but could also use a string ID if that is
>      preferred.

Is that super-high performance really warranted?  What would be the
application of that?

> 
>  (6) Can query cross-namespace if, say, a container manager process is
>      given an fs_context that hasn't yet been mounted into a namespace - or
>      hasn't even been fully created yet.

This patch can do that too.

> 
>  (7) Don't have to create/delete a bunch of sysfs/procfs nodes each time a
>      mount happens or is removed - and since systemd makes much use of
>      mount namespaces and mount propagation, this will create a lot of
>      nodes.

This patch creates a single struct mountfs_entry per mount, which is 48 bytes.

Now onto the advantages of a filesystem based API:

 - immediately usable from all programming languages, including scripts

 - same goes for future extensions: no need to update libc, utils, language
   bindings, strace, etc...
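
As a rough illustration of the first point: once an attribute file has been
read with ordinary file I/O, the remaining work is trivial text parsing.  The
sketch below handles the comma-separated mount ID list that the "children"
attribute emits in the patch below.  parse_children() is a hypothetical
helper, not part of the patch, and it is deliberately lenient (strtol()
accepts leading whitespace, and a trailing comma is tolerated).

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a list such as "21,22,35\n" into ids[].
 * Returns the number of IDs stored, or -1 if the input is malformed or
 * holds more than max entries.
 */
int parse_children(const char *s, int *ids, int max)
{
	int n = 0;

	while (*s && *s != '\n') {
		char *end;
		long v;

		errno = 0;
		v = strtol(s, &end, 10);
		if (end == s || errno || v < 0 || v > INT_MAX || n == max)
			return -1;
		ids[n++] = (int)v;

		if (*end == ',')
			s = end + 1;	/* step over the separator */
		else if (*end == '\n' || *end == '\0')
			s = end;	/* end of list */
		else
			return -1;	/* unexpected character */
	}
	return n;
}
```

An empty file (a mount with no children) yields a count of 0 rather than an
error, matching the way the attribute prints nothing in that case.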

Thanks,
Miklos

---
 fs/Makefile              |    1 
 fs/mount.h               |    8 
 fs/mountfs/Makefile      |    1 
 fs/mountfs/super.c       |  502 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/namespace.c           |   31 ++
 fs/proc/base.c           |    2 
 fs/proc/fd.c             |   82 +++++++
 fs/proc/fd.h             |    3 
 fs/proc_namespace.c      |   22 --
 fs/seq_file.c            |   23 ++
 include/linux/seq_file.h |    1 
 11 files changed, 654 insertions(+), 22 deletions(-)

--- a/fs/Makefile
+++ b/fs/Makefile
@@ -135,3 +135,4 @@ obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
 obj-$(CONFIG_EROFS_FS)		+= erofs/
 obj-$(CONFIG_VBOXSF_FS)		+= vboxsf/
 obj-$(CONFIG_ZONEFS_FS)		+= zonefs/
+obj-y				+= mountfs/
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -72,6 +72,7 @@ struct mount {
 	int mnt_expiry_mark;		/* true if marked for expiry */
 	struct hlist_head mnt_pins;
 	struct hlist_head mnt_stuck_children;
+	struct mountfs_entry *mnt_mountfs_entry;
 } __randomize_layout;
 
 #define MNT_NS_INTERNAL ERR_PTR(-EINVAL) /* distinct from any mnt_namespace */
@@ -153,3 +154,10 @@ static inline bool is_anon_ns(struct mnt
 {
 	return ns->seq == 0;
 }
+
+void mnt_namespace_lock_read(void);
+void mnt_namespace_unlock_read(void);
+
+void mountfs_create(struct mount *mnt);
+extern void mountfs_remove(struct mount *mnt);
+int mountfs_lookup_internal(struct vfsmount *m, struct path *path);
--- /dev/null
+++ b/fs/mountfs/Makefile
@@ -0,0 +1 @@
+obj-y				+= super.o
--- /dev/null
+++ b/fs/mountfs/super.c
@@ -0,0 +1,502 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "../pnode.h"
+#include <linux/fs.h>
+#include <linux/kref.h>
+#include <linux/nsproxy.h>
+#include <linux/fs_struct.h>
+#include <linux/fs_context.h>
+
+#define MOUNTFS_SUPER_MAGIC 0x4e756f4d
+
+static DEFINE_SPINLOCK(mountfs_lock);
+static struct rb_root mountfs_entries = RB_ROOT;
+static struct vfsmount *mountfs_mnt __read_mostly;
+
+struct mountfs_entry {
+	struct kref kref;
+	struct mount *mnt;
+	struct rb_node node;
+	int id;
+};
+
+static const char *mountfs_attrs[] = {
+	"root", "mountpoint", "id", "parent", "options", "children",
+	"group", "master", "propagate_from"
+};
+
+#define MOUNTFS_INO(id) (((unsigned long) id + 1) * \
+			 (ARRAY_SIZE(mountfs_attrs) + 1))
+
+void mountfs_entry_release(struct kref *kref)
+{
+	kfree(container_of(kref, struct mountfs_entry, kref));
+}
+
+void mountfs_entry_put(struct mountfs_entry *entry)
+{
+	kref_put(&entry->kref, mountfs_entry_release);
+}
+
+static bool mountfs_entry_visible(struct mountfs_entry *entry)
+{
+	struct mount *mnt;
+	bool visible = false;
+
+	rcu_read_lock();
+	mnt = rcu_dereference(entry->mnt);
+	if (mnt && mnt->mnt_ns == current->nsproxy->mnt_ns)
+		visible = true;
+	rcu_read_unlock();
+
+	return visible;
+}
+static int mountfs_attr_show(struct seq_file *sf, void *v)
+{
+	const char *name = sf->file->f_path.dentry->d_name.name;
+	struct mountfs_entry *entry = sf->private;
+	struct mount *mnt;
+	struct vfsmount *m;
+	struct super_block *sb;
+	struct path root;
+	int tmp, err = -ENODEV;
+
+	mnt_namespace_lock_read();
+
+	mnt = entry->mnt;
+	if (!mnt || !mnt->mnt_ns)
+		goto out;
+
+	err = 0;
+	m = &mnt->mnt;
+	sb = m->mnt_sb;
+
+	if (strcmp(name, "root") == 0) {
+		if (sb->s_op->show_path) {
+			err = sb->s_op->show_path(sf, m->mnt_root);
+		} else {
+			seq_dentry(sf, m->mnt_root, " \t\n\\");
+		}
+		seq_putc(sf, '\n');
+	} else if (strcmp(name, "mountpoint") == 0) {
+		struct path mnt_path = { .dentry = m->mnt_root, .mnt = m };
+
+		get_fs_root(current->fs, &root);
+		err = seq_path_root(sf, &mnt_path, &root, " \t\n\\");
+		if (err == SEQ_SKIP) {
+			seq_puts(sf, "(unreachable)");
+			err = 0;
+		}
+		seq_putc(sf, '\n');
+		path_put(&root);
+	} else if (strcmp(name, "id") == 0) {
+		seq_printf(sf, "%i\n", mnt->mnt_id);
+	} else if (strcmp(name, "parent") == 0) {
+		tmp = rcu_dereference(mnt->mnt_parent)->mnt_id;
+		seq_printf(sf, "%i\n", tmp);
+	} else if (strcmp(name, "options") == 0) {
+		int mnt_flags = READ_ONCE(m->mnt_flags);
+
+		seq_puts(sf, mnt_flags & MNT_READONLY ? "ro" : "rw");
+		seq_mnt_opts(sf, mnt_flags);
+		seq_putc(sf, '\n');
+	} else if (strcmp(name, "children") == 0) {
+		struct mount *child;
+		bool first = true;
+
+		list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
+			if (!first)
+				seq_putc(sf, ',');
+			else
+				first = false;
+			seq_printf(sf, "%i", child->mnt_id);
+		}
+		if (!first)
+			seq_putc(sf, '\n');
+	} else if (strcmp(name, "group") == 0) {
+		if (IS_MNT_SHARED(mnt))
+			seq_printf(sf, "%i\n", mnt->mnt_group_id);
+	} else if (strcmp(name, "master") == 0) {
+		if (IS_MNT_SLAVE(mnt)) {
+			tmp = rcu_dereference(mnt->mnt_master)->mnt_group_id;
+			seq_printf(sf, "%i\n", tmp);
+		}
+	} else if (strcmp(name, "propagate_from") == 0) {
+		if (IS_MNT_SLAVE(mnt)) {
+			get_fs_root(current->fs, &root);
+			tmp = get_dominating_id(mnt, &root);
+			if (tmp)
+				seq_printf(sf, "%i\n", tmp);
+		}
+	} else {
+		WARN_ON(1);
+		err = -EIO;
+	}
+out:
+	mnt_namespace_unlock_read();
+
+	return err;
+}
+
+static int mountfs_attr_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, mountfs_attr_show, inode->i_private);
+}
+
+static const struct file_operations mountfs_attr_fops = {
+	.open		= mountfs_attr_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static struct mountfs_entry *mountfs_node_to_entry(struct rb_node *node)
+{
+	return rb_entry(node, struct mountfs_entry, node);
+}
+
+static struct rb_node **mountfs_find_node(int id, struct rb_node **parent)
+{
+	struct rb_node **link = &mountfs_entries.rb_node;
+
+	*parent = NULL;
+	while (*link) {
+		struct mountfs_entry *entry = mountfs_node_to_entry(*link);
+
+		*parent = *link;
+		if (id < entry->id)
+			link = &entry->node.rb_left;
+		else if (id > entry->id)
+			link = &entry->node.rb_right;
+		else
+			break;
+	}
+	return link;
+}
+
+void mountfs_create(struct mount *mnt)
+{
+	struct mountfs_entry *entry;
+	struct rb_node **link, *parent;
+
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (!entry) {
+		WARN(1, "failed to allocate mountfs entry");
+		return;
+	}
+	kref_init(&entry->kref);
+	entry->mnt = mnt;
+	entry->id = mnt->mnt_id;
+
+	spin_lock(&mountfs_lock);
+	link = mountfs_find_node(entry->id, &parent);
+	if (!WARN_ON(*link)) {
+		rb_link_node(&entry->node, parent, link);
+		rb_insert_color(&entry->node, &mountfs_entries);
+		mnt->mnt_mountfs_entry = entry;
+	} else {
+		kfree(entry);
+	}
+	spin_unlock(&mountfs_lock);
+}
+
+void mountfs_remove(struct mount *mnt)
+{
+	struct mountfs_entry *entry = mnt->mnt_mountfs_entry;
+
+	if (!entry)
+		return;
+	spin_lock(&mountfs_lock);
+	entry->mnt = NULL;
+	rb_erase(&entry->node, &mountfs_entries);
+	spin_unlock(&mountfs_lock);
+
+	mountfs_entry_put(entry);
+
+	mnt->mnt_mountfs_entry = NULL;
+}
+
+static struct mountfs_entry *mountfs_get_entry(const char *name)
+{
+	struct mountfs_entry *entry = NULL;
+	struct rb_node **link, *dummy;
+	unsigned long mnt_id;
+	char buf[32];
+	int ret;
+
+	ret = kstrtoul(name, 10, &mnt_id);
+	if (ret || mnt_id > INT_MAX)
+		return NULL;
+
+	snprintf(buf, sizeof(buf), "%lu", mnt_id);
+	if (strcmp(buf, name) != 0)
+		return NULL;
+
+	spin_lock(&mountfs_lock);
+	link = mountfs_find_node(mnt_id, &dummy);
+	if (*link) {
+		entry = mountfs_node_to_entry(*link);
+		if (!mountfs_entry_visible(entry))
+			entry = NULL;
+		else
+			kref_get(&entry->kref);
+	}
+	spin_unlock(&mountfs_lock);
+
+	return entry;
+}
+
+static void mountfs_init_inode(struct inode *inode, umode_t mode);
+
+static struct dentry *mountfs_lookup_entry(struct dentry *dentry,
+					   struct mountfs_entry *entry,
+					   int idx)
+{
+	struct inode *inode;
+
+	inode = new_inode(dentry->d_sb);
+	if (!inode) {
+		mountfs_entry_put(entry);
+		return ERR_PTR(-ENOMEM);
+	}
+	inode->i_private = entry;
+	inode->i_ino = MOUNTFS_INO(entry->id) + idx;
+	mountfs_init_inode(inode, idx ? S_IFREG | 0444 : S_IFDIR | 0555);
+	return d_splice_alias(inode, dentry);
+
+}
+
+static struct dentry *mountfs_lookup(struct inode *dir, struct dentry *dentry,
+				     unsigned int flags)
+{
+	struct mountfs_entry *entry = dir->i_private;
+	int i = 0;
+
+	if (entry) {
+		for (i = 0; i < ARRAY_SIZE(mountfs_attrs); i++)
+			if (strcmp(mountfs_attrs[i], dentry->d_name.name) == 0)
+				break;
+		if (i == ARRAY_SIZE(mountfs_attrs))
+			return ERR_PTR(-ENOENT);
+		i++;
+		kref_get(&entry->kref);
+	} else {
+		entry = mountfs_get_entry(dentry->d_name.name);
+		if (!entry)
+			return ERR_PTR(-ENOENT);
+	}
+
+	return mountfs_lookup_entry(dentry, entry, i);
+}
+
+static int mountfs_d_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct mountfs_entry *entry = dentry->d_inode->i_private;
+
+	/* root: valid */
+	if (!entry)
+		return 1;
+
+	/* removed: invalid */
+	if (!entry->mnt)
+		return 0;
+
+	/* attribute or visible in this namespace: valid */
+	if (!d_can_lookup(dentry) || mountfs_entry_visible(entry))
+		return 1;
+
+	/* invisible in this namespace: valid but deny entry */
+	return -ENOENT;
+}
+
+static int mountfs_readdir(struct file *file, struct dir_context *ctx)
+{
+	struct rb_node *node;
+	struct mountfs_entry *entry = file_inode(file)->i_private;
+	char name[32];
+	const char *s;
+	unsigned int len, pos, id;
+
+	if (ctx->pos - 2 > INT_MAX || !dir_emit_dots(file, ctx))
+		return 0;
+
+	if (entry) {
+		while (ctx->pos - 2 < ARRAY_SIZE(mountfs_attrs)) {
+			s = mountfs_attrs[ctx->pos - 2];
+			if (!dir_emit(ctx, s, strlen(s),
+				      MOUNTFS_INO(entry->id) + ctx->pos,
+				      DT_REG))
+				break;
+			ctx->pos++;
+		}
+		return 0;
+	}
+
+	pos = ctx->pos - 2;
+	do {
+		spin_lock(&mountfs_lock);
+		mountfs_find_node(pos, &node);
+		pos = 1U + INT_MAX;
+		do {
+			if (!node) {
+				spin_unlock(&mountfs_lock);
+				goto out;
+			}
+			entry = mountfs_node_to_entry(node);
+			node = rb_next(node);
+		} while (!mountfs_entry_visible(entry));
+		if (node)
+			pos = mountfs_node_to_entry(node)->id;
+		id = entry->id;
+		spin_unlock(&mountfs_lock);
+
+		len = snprintf(name, sizeof(name), "%i", id);
+		ctx->pos = id + 2;
+		if (!dir_emit(ctx, name, len, MOUNTFS_INO(id), DT_DIR))
+			return 0;
+	} while (pos <= INT_MAX);
+out:
+	ctx->pos = pos + 2;
+	return 0;
+}
+
+int mountfs_lookup_internal(struct vfsmount *m, struct path *path)
+{
+	char name[32];
+	struct qstr this = { .name = name };
+	struct mount *mnt = real_mount(m);
+	struct mountfs_entry *entry = mnt->mnt_mountfs_entry;
+	struct dentry *dentry, *old, *root = mountfs_mnt->mnt_root;
+
+	this.len = snprintf(name, sizeof(name), "%i", mnt->mnt_id);
+	dentry = d_hash_and_lookup(root, &this);
+	if (dentry && dentry->d_inode->i_private != entry) {
+		d_invalidate(dentry);
+		dput(dentry);
+		dentry = NULL;
+	}
+	if (!dentry) {
+		dentry = d_alloc(root, &this);
+		if (!dentry)
+			return -ENOMEM;
+
+		kref_get(&entry->kref);
+		old = mountfs_lookup_entry(dentry, entry, 0);
+		if (old) {
+			dput(dentry);
+			if (IS_ERR(old))
+				return PTR_ERR(old);
+			dentry = old;
+		}
+	}
+
+	*path = (struct path) { .mnt = mountfs_mnt, .dentry = dentry };
+	return 0;
+}
+
+static const struct dentry_operations mountfs_dops = {
+	.d_revalidate = mountfs_d_revalidate,
+};
+
+static const struct inode_operations mountfs_iops = {
+	.lookup = mountfs_lookup,
+};
+
+static const struct file_operations mountfs_fops = {
+	.iterate_shared = mountfs_readdir,
+	.read = generic_read_dir,
+	.llseek = generic_file_llseek,
+};
+
+static void mountfs_init_inode(struct inode *inode, umode_t mode)
+{
+	inode->i_mode = mode;
+	inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
+	if (S_ISREG(mode)) {
+		inode->i_size = PAGE_SIZE;
+		inode->i_fop = &mountfs_attr_fops;
+	} else {
+		inode->i_op = &mountfs_iops;
+		inode->i_fop = &mountfs_fops;
+	}
+}
+
+static void mountfs_evict_inode(struct inode *inode)
+{
+	struct mountfs_entry *entry = inode->i_private;
+
+	clear_inode(inode);
+	if (entry)
+		mountfs_entry_put(entry);
+}
+
+static const struct super_operations mountfs_sops = {
+	.statfs		= simple_statfs,
+	.drop_inode	= generic_delete_inode,
+	.evict_inode	= mountfs_evict_inode,
+};
+
+static int mountfs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+	struct inode *root;
+
+	sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV;
+	sb->s_blocksize = PAGE_SIZE;
+	sb->s_blocksize_bits = PAGE_SHIFT;
+	sb->s_magic = MOUNTFS_SUPER_MAGIC;
+	sb->s_time_gran = 1;
+	sb->s_shrink.seeks = 0;
+	sb->s_op = &mountfs_sops;
+	sb->s_d_op = &mountfs_dops;
+
+	root = new_inode(sb);
+	if (!root)
+		return -ENOMEM;
+
+	root->i_ino = 1;
+	mountfs_init_inode(root, S_IFDIR | 0444);
+
+	sb->s_root = d_make_root(root);
+	if (!sb->s_root)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int mountfs_get_tree(struct fs_context *fc)
+{
+	return get_tree_single(fc, mountfs_fill_super);
+}
+
+static const struct fs_context_operations mountfs_context_ops = {
+	.get_tree = mountfs_get_tree,
+};
+
+static int mountfs_init_fs_context(struct fs_context *fc)
+{
+	fc->ops = &mountfs_context_ops;
+	fc->global = true;
+	return 0;
+}
+
+static struct file_system_type mountfs_fs_type = {
+	.name = "mountfs",
+	.init_fs_context = mountfs_init_fs_context,
+	.kill_sb = kill_anon_super,
+};
+
+static int __init mountfs_init(void)
+{
+	int err;
+
+	err = register_filesystem(&mountfs_fs_type);
+	if (!err) {
+		mountfs_mnt = kern_mount(&mountfs_fs_type);
+		if (IS_ERR(mountfs_mnt)) {
+			err = PTR_ERR(mountfs_mnt);
+			unregister_filesystem(&mountfs_fs_type);
+		}
+	}
+	return err;
+}
+fs_initcall(mountfs_init);
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -943,6 +943,8 @@ struct vfsmount *vfs_create_mount(struct
 
 	if (fc->sb_flags & SB_KERNMOUNT)
 		mnt->mnt.mnt_flags = MNT_INTERNAL;
+	else
+		mountfs_create(mnt);
 
 	atomic_inc(&fc->root->d_sb->s_active);
 	mnt->mnt.mnt_sb		= fc->root->d_sb;
@@ -1013,7 +1015,7 @@ vfs_submount(const struct dentry *mountp
 }
 EXPORT_SYMBOL_GPL(vfs_submount);
 
-static struct mount *clone_mnt(struct mount *old, struct dentry *root,
+static struct mount *clone_mnt_common(struct mount *old, struct dentry *root,
 					int flag)
 {
 	struct super_block *sb = old->mnt.mnt_sb;
@@ -1079,6 +1081,17 @@ static struct mount *clone_mnt(struct mo
 	return ERR_PTR(err);
 }
 
+static struct mount *clone_mnt(struct mount *old, struct dentry *root,
+			       int flag)
+{
+	struct mount *mnt = clone_mnt_common(old, root, flag);
+
+	if (!IS_ERR(mnt))
+		mountfs_create(mnt);
+
+	return mnt;
+}
+
 static void cleanup_mnt(struct mount *mnt)
 {
 	struct hlist_node *p;
@@ -1091,6 +1104,7 @@ static void cleanup_mnt(struct mount *mn
 	 * so mnt_get_writers() below is safe.
 	 */
 	WARN_ON(mnt_get_writers(mnt));
+
 	if (unlikely(mnt->mnt_pins.first))
 		mnt_pin_kill(mnt);
 	hlist_for_each_entry_safe(m, p, &mnt->mnt_stuck_children, mnt_umount) {
@@ -1171,6 +1185,8 @@ static void mntput_no_expire(struct moun
 	unlock_mount_hash();
 	shrink_dentry_list(&list);
 
+	mountfs_remove(mnt);
+
 	if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
 		struct task_struct *task = current;
 		if (likely(!(task->flags & PF_KTHREAD))) {
@@ -1237,13 +1253,14 @@ EXPORT_SYMBOL(path_is_mountpoint);
 struct vfsmount *mnt_clone_internal(const struct path *path)
 {
 	struct mount *p;
-	p = clone_mnt(real_mount(path->mnt), path->dentry, CL_PRIVATE);
+	p = clone_mnt_common(real_mount(path->mnt), path->dentry, CL_PRIVATE);
 	if (IS_ERR(p))
 		return ERR_CAST(p);
 	p->mnt.mnt_flags |= MNT_INTERNAL;
 	return &p->mnt;
 }
 
+
 #ifdef CONFIG_PROC_FS
 /* iterator; we want it to have access to namespace_sem, thus here... */
 static void *m_start(struct seq_file *m, loff_t *pos)
@@ -1385,6 +1402,16 @@ static inline void namespace_lock(void)
 	down_write(&namespace_sem);
 }
 
+void mnt_namespace_lock_read(void)
+{
+	down_read(&namespace_sem);
+}
+
+void mnt_namespace_unlock_read(void)
+{
+	up_read(&namespace_sem);
+}
+
 enum umount_tree_flags {
 	UMOUNT_SYNC = 1,
 	UMOUNT_PROPAGATE = 2,
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3092,6 +3092,7 @@ static const struct pid_entry tgid_base_
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
 	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
 	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
+	DIR("fdmount",    S_IRUSR|S_IXUSR, proc_fdmount_inode_operations, proc_fdmount_operations),
 	DIR("ns",	  S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
 	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
@@ -3497,6 +3498,7 @@ static const struct inode_operations pro
 static const struct pid_entry tid_base_stuff[] = {
 	DIR("fd",        S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
 	DIR("fdinfo",    S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
+	DIR("fdmount",   S_IRUSR|S_IXUSR, proc_fdmount_inode_operations, proc_fdmount_operations),
 	DIR("ns",	 S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
 	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -361,3 +361,85 @@ const struct file_operations proc_fdinfo
 	.iterate_shared	= proc_readfdinfo,
 	.llseek		= generic_file_llseek,
 };
+
+static int proc_fdmount_link(struct dentry *dentry, struct path *path)
+{
+	struct files_struct *files = NULL;
+	struct task_struct *task;
+	struct path fd_path;
+	int ret = -ENOENT;
+
+	task = get_proc_task(d_inode(dentry));
+	if (task) {
+		files = get_files_struct(task);
+		put_task_struct(task);
+	}
+
+	if (files) {
+		unsigned int fd = proc_fd(d_inode(dentry));
+		struct file *fd_file;
+
+		spin_lock(&files->file_lock);
+		fd_file = fcheck_files(files, fd);
+		if (fd_file) {
+			fd_path = fd_file->f_path;
+			path_get(&fd_path);
+			ret = 0;
+		}
+		spin_unlock(&files->file_lock);
+		put_files_struct(files);
+	}
+	if (!ret) {
+		ret = mountfs_lookup_internal(fd_path.mnt, path);
+		path_put(&fd_path);
+	}
+
+	return ret;
+}
+
+static struct dentry *proc_fdmount_instantiate(struct dentry *dentry,
+	struct task_struct *task, const void *ptr)
+{
+	const struct fd_data *data = ptr;
+	struct proc_inode *ei;
+	struct inode *inode;
+
+	inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | 0400);
+	if (!inode)
+		return ERR_PTR(-ENOENT);
+
+	ei = PROC_I(inode);
+	ei->fd = data->fd;
+
+	inode->i_op = &proc_pid_link_inode_operations;
+	inode->i_size = 64;
+
+	ei->op.proc_get_link = proc_fdmount_link;
+	tid_fd_update_inode(task, inode, 0);
+
+	d_set_d_op(dentry, &tid_fd_dentry_operations);
+	return d_splice_alias(inode, dentry);
+}
+
+static struct dentry *
+proc_lookupfdmount(struct inode *dir, struct dentry *dentry, unsigned int flags)
+{
+	return proc_lookupfd_common(dir, dentry, proc_fdmount_instantiate);
+}
+
+static int proc_readfdmount(struct file *file, struct dir_context *ctx)
+{
+	return proc_readfd_common(file, ctx,
+				  proc_fdmount_instantiate);
+}
+
+const struct inode_operations proc_fdmount_inode_operations = {
+	.lookup		= proc_lookupfdmount,
+	.setattr	= proc_setattr,
+};
+
+const struct file_operations proc_fdmount_operations = {
+	.read		= generic_read_dir,
+	.iterate_shared	= proc_readfdmount,
+	.llseek		= generic_file_llseek,
+};
--- a/fs/proc/fd.h
+++ b/fs/proc/fd.h
@@ -10,6 +10,9 @@ extern const struct inode_operations pro
 extern const struct file_operations proc_fdinfo_operations;
 extern const struct inode_operations proc_fdinfo_inode_operations;
 
+extern const struct file_operations proc_fdmount_operations;
+extern const struct inode_operations proc_fdmount_inode_operations;
+
 extern int proc_fd_permission(struct inode *inode, int mask);
 
 static inline unsigned int proc_fd(struct inode *inode)
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -61,24 +61,6 @@ static int show_sb_opts(struct seq_file
 	return security_sb_show_options(m, sb);
 }
 
-static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
-{
-	static const struct proc_fs_info mnt_info[] = {
-		{ MNT_NOSUID, ",nosuid" },
-		{ MNT_NODEV, ",nodev" },
-		{ MNT_NOEXEC, ",noexec" },
-		{ MNT_NOATIME, ",noatime" },
-		{ MNT_NODIRATIME, ",nodiratime" },
-		{ MNT_RELATIME, ",relatime" },
-		{ 0, NULL }
-	};
-	const struct proc_fs_info *fs_infop;
-
-	for (fs_infop = mnt_info; fs_infop->flag; fs_infop++) {
-		if (mnt->mnt_flags & fs_infop->flag)
-			seq_puts(m, fs_infop->str);
-	}
-}
 
 static inline void mangle(struct seq_file *m, const char *s)
 {
@@ -120,7 +102,7 @@ static int show_vfsmnt(struct seq_file *
 	err = show_sb_opts(m, sb);
 	if (err)
 		goto out;
-	show_mnt_opts(m, mnt);
+	seq_mnt_opts(m, mnt->mnt_flags);
 	if (sb->s_op->show_options)
 		err = sb->s_op->show_options(m, mnt_path.dentry);
 	seq_puts(m, " 0 0\n");
@@ -153,7 +135,7 @@ static int show_mountinfo(struct seq_fil
 		goto out;
 
 	seq_puts(m, mnt->mnt_flags & MNT_READONLY ? " ro" : " rw");
-	show_mnt_opts(m, mnt);
+	seq_mnt_opts(m, mnt->mnt_flags);
 
 	/* Tagged fields ("foo:X" or "bar") */
 	if (IS_MNT_SHARED(r))
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -15,6 +15,7 @@
 #include <linux/cred.h>
 #include <linux/mm.h>
 #include <linux/printk.h>
+#include <linux/mount.h>
 #include <linux/string_helpers.h>
 
 #include <linux/uaccess.h>
@@ -548,6 +549,28 @@ int seq_dentry(struct seq_file *m, struc
 }
 EXPORT_SYMBOL(seq_dentry);
 
+void seq_mnt_opts(struct seq_file *m, int mnt_flags)
+{
+	unsigned int i;
+	static const struct {
+		int flag;
+		const char *str;
+	} mnt_info[] = {
+		{ MNT_NOSUID, ",nosuid" },
+		{ MNT_NODEV, ",nodev" },
+		{ MNT_NOEXEC, ",noexec" },
+		{ MNT_NOATIME, ",noatime" },
+		{ MNT_NODIRATIME, ",nodiratime" },
+		{ MNT_RELATIME, ",relatime" },
+		{ 0, NULL }
+	};
+
+	for (i = 0; mnt_info[i].flag; i++) {
+		if (mnt_flags & mnt_info[i].flag)
+			seq_puts(m, mnt_info[i].str);
+	}
+}
+
 static void *single_start(struct seq_file *p, loff_t *pos)
 {
 	return NULL + (*pos == 0);
--- a/include/linux/seq_file.h
+++ b/include/linux/seq_file.h
@@ -138,6 +138,7 @@ int seq_file_path(struct seq_file *, str
 int seq_dentry(struct seq_file *, struct dentry *, const char *);
 int seq_path_root(struct seq_file *m, const struct path *path,
 		  const struct path *root, const char *esc);
+void seq_mnt_opts(struct seq_file *m, int mnt_flags);
 
 int single_open(struct file *, int (*)(struct seq_file *, void *), void *);
 int single_open_size(struct file *, int (*)(struct seq_file *, void *), void *, size_t);
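
For readers who want to try the table-driven rendering used by seq_mnt_opts() outside the kernel, here is a minimal userspace sketch.  It is not part of the patch: toy_mnt_opts() and the TOY_MNT_* names are hypothetical, the flag values are chosen only for illustration, and the output goes to a caller-supplied buffer rather than a seq_file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative flag bits; hypothetical stand-ins for the kernel's MNT_* flags. */
enum {
	TOY_MNT_NOSUID		= 0x01,
	TOY_MNT_NODEV		= 0x02,
	TOY_MNT_NOEXEC		= 0x04,
	TOY_MNT_NOATIME		= 0x08,
	TOY_MNT_NODIRATIME	= 0x10,
	TOY_MNT_RELATIME	= 0x20,
};

/*
 * Append ",option" for every flag set in mnt_flags, in table order.
 * Returns the length of the resulting string; stops early if the next
 * option would not fit in buf.
 */
static size_t toy_mnt_opts(char *buf, size_t size, int mnt_flags)
{
	static const struct {
		int flag;
		const char *str;
	} mnt_info[] = {
		{ TOY_MNT_NOSUID, ",nosuid" },
		{ TOY_MNT_NODEV, ",nodev" },
		{ TOY_MNT_NOEXEC, ",noexec" },
		{ TOY_MNT_NOATIME, ",noatime" },
		{ TOY_MNT_NODIRATIME, ",nodiratime" },
		{ TOY_MNT_RELATIME, ",relatime" },
		{ 0, NULL }
	};
	size_t len = 0;
	unsigned int i;

	assert(buf && size > 0);
	buf[0] = '\0';

	for (i = 0; mnt_info[i].flag; i++) {
		if (mnt_flags & mnt_info[i].flag) {
			size_t n = strlen(mnt_info[i].str);

			if (len + n >= size)
				break;	/* would overflow: leave output truncated */
			memcpy(buf + len, mnt_info[i].str, n + 1);
			len += n;
		}
	}
	return len;
}
```

The zero-flag sentinel entry terminates the loop, as in the patch, so adding an option is a one-line change to the table.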

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 01/14] VFS: Add additional RESOLVE_* flags [ver #18]
  2020-03-09 14:00 ` [PATCH 01/14] VFS: Add additional RESOLVE_* flags " David Howells
@ 2020-03-09 20:56   ` Stefan Metzmacher
  2020-03-09 21:13   ` David Howells
  2020-03-10  0:55   ` Aleksa Sarai
  2 siblings, 0 replies; 50+ messages in thread
From: Stefan Metzmacher @ 2020-03-09 20:56 UTC (permalink / raw)
  To: David Howells, torvalds, viro
  Cc: Aleksa Sarai, raven, mszeredi, christian, jannh, darrick.wong,
	kzak, jlayton, linux-api, linux-fsdevel, linux-security-module,
	linux-kernel


[-- Attachment #1.1: Type: text/plain, Size: 1032 bytes --]

Hi David,

> Add additional RESOLVE_* flags to correspond to AT_* flags that aren't
> currently implemented:
> 
> 	RESOLVE_NO_TRAILING_SYMLINKS    for AT_SYMLINK_NOFOLLOW
> 	RESOLVE_NO_TRAILING_AUTOMOUNTS  for AT_NO_AUTOMOUNT
> 	RESOLVE_EMPTY_PATH              for AT_EMPTY_PATH

Thanks for changing the names!

> This is necessary for fsinfo() to use RESOLVE_* flags instead of AT_* flags
> if the latter are to be considered deprecated for new system calls.
> 
> Also make openat2() handle RESOLVE_NO_TRAILING_SYMLINKS.
> 
> Automounting is currently forced by doing an open(), so adding support to
> openat2() for RESOLVE_NO_TRAILING_AUTOMOUNTS is not trivial.

lookup_flags &= ~LOOKUP_AUTOMOUNT won't work?

At least it should cause EINVAL (or something similar) instead of being
silently ignored. The same applies to RESOLVE_EMPTY_PATH, it should be
handled with some logic or also cause EINVAL.

vfs_statx()/vfs_stat_set_lookup_flags() seems to have a similar logic
using the AT_* flags.

metze


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 01/14] VFS: Add additional RESOLVE_* flags [ver #18]
  2020-03-09 14:00 ` [PATCH 01/14] VFS: Add additional RESOLVE_* flags " David Howells
  2020-03-09 20:56   ` Stefan Metzmacher
@ 2020-03-09 21:13   ` David Howells
  2020-03-10  0:55   ` Aleksa Sarai
  2 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2020-03-09 21:13 UTC (permalink / raw)
  To: Stefan Metzmacher
  Cc: dhowells, torvalds, viro, Aleksa Sarai, raven, mszeredi,
	christian, jannh, darrick.wong, kzak, jlayton, linux-api,
	linux-fsdevel, linux-security-module, linux-kernel

Stefan Metzmacher <metze@samba.org> wrote:

> > Automounting is currently forced by doing an open(), so adding support to
> > openat2() for RESOLVE_NO_TRAILING_AUTOMOUNTS is not trivial.
> 
> lookup_flags &= ~LOOKUP_AUTOMOUNT won't work?

No.  LOOKUP_OPEN overrides that.

David


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/14] VFS: Filesystem information [ver #18]
  2020-03-09 19:22   ` Andres Freund
@ 2020-03-09 22:49     ` Jeff Layton
  2020-03-10  0:18       ` Andres Freund
  0 siblings, 1 reply; 50+ messages in thread
From: Jeff Layton @ 2020-03-09 22:49 UTC (permalink / raw)
  To: Andres Freund
  Cc: David Howells, torvalds, viro, Theodore Ts'o,
	Stefan Metzmacher, Andreas Dilger, linux-ext4, Aleksa Sarai,
	Trond Myklebust, Anna Schumaker, linux-nfs, linux-api, raven,
	mszeredi, christian, jannh, darrick.wong, kzak, linux-fsdevel,
	linux-security-module, linux-kernel

On Mon, 2020-03-09 at 12:22 -0700, Andres Freund wrote:
> Hi,
> 
> On 2020-03-09 13:50:59 -0400, Jeff Layton wrote:
> > The PostgreSQL devs asked a while back for some way to tell whether
> > there have been any writeback errors on a superblock w/o having to do
> > any sort of flush -- just "have there been any so far".
> 
> Indeed.
> 
> 
> > I sent a patch a few weeks ago to make syncfs() return errors when there
> > have been writeback errors on the superblock. It's not merged yet, but
> > once we have something like that in place, we could expose info from the
> > errseq_t to userland using this interface.
> 
> I'm still a bit worried about the details of errseq_t being exposed to
> userland. Partially because it seems to restrict further evolution of
> errseq_t, and partially because it will likely end up with userland trying
> to understand it (it's e.g. just too attractive to report a count of
> errors etc).

Trying to interpret the counter field won't really tell you anything.
The counter is not incremented unless someone has queried the value
since it was last checked. A single increment could represent a single
writeback error or 10000 identical ones.

There _is_ a flag that tells you whether someone has queried it, but
that gets masked off before copying the cookie to userland.
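
To make the counter behaviour above concrete, here is a simplified userspace model.  It is only a sketch under my own assumptions: the toy_* names, the bit layout and the single-threaded updates are illustrative and are not the kernel's errseq_t implementation.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_ERR_MASK	0xfffu		/* low bits: positive errno value */
#define TOY_SEEN	(1u << 12)	/* set once somebody has queried the value */
#define TOY_CTR_INC	(1u << 13)	/* counter occupies the bits above the flag */

typedef uint32_t toy_errseq_t;

/* Record an error (negative errno).  The counter only moves if SEEN was set. */
static void toy_errseq_set(toy_errseq_t *eseq, int err)
{
	toy_errseq_t old = *eseq;
	toy_errseq_t cur;

	assert(err < 0 && (unsigned int)-err <= TOY_ERR_MASK);

	/* Drop the old errno and the SEEN flag, keep the counter bits. */
	cur = (old & ~(TOY_ERR_MASK | TOY_SEEN)) | (unsigned int)-err;
	if (old & TOY_SEEN)
		cur += TOY_CTR_INC;
	*eseq = cur;
}

/* Query the value: mark it as seen and hand back a cookie without the flag. */
static toy_errseq_t toy_errseq_query(toy_errseq_t *eseq)
{
	if (*eseq)
		*eseq |= TOY_SEEN;
	return *eseq & ~TOY_SEEN;
}

/* Extract the counter bits from a cookie returned by toy_errseq_query(). */
static unsigned int toy_errseq_counter(toy_errseq_t cookie)
{
	return cookie >> 13;
}
```

In this model, recording the same error 10000 times between two queries leaves the counter where it was, so two cookies can only be compared for equality; the difference between their counters says nothing about how many errors occurred.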

> Is there a reason to not instead report a 64bit counter instead of the
> cookie? In contrast to the struct file case we'd only have the space
> overhead once per superblock, rather than once per #files * #fd. And it
> seems that the maintenance of that counter could be done without
> widespread changes, e.g. instead/in addition to your change:
> 

What problem would moving to a 64-bit counter solve? I get the concern
about people trying to get a counter out of the cookie field, but giving
people an explicit 64-bit counter seems even more open to
misinterpretation.

All that said, is an opaque cookie still something you'd find useful?

> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index ccb14b6a16b5..897439475315 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -51,7 +51,10 @@ static inline void mapping_set_error(struct address_space *mapping, int error)
> >  		return;
> > 
> >  	/* Record in wb_err for checkers using errseq_t based tracking */
> > -	filemap_set_wb_err(mapping, error);
> > +	__filemap_set_wb_err(mapping, error);
> > +
> > +	/* Record it in superblock */
> > +	errseq_set(&mapping->host->i_sb->s_wb_err, error);
> > 
> >  	/* Record it in flags for now, for legacy callers */
> >  	if (error == -ENOSPC)
> 
> Btw, seems like mapping_set_error() should have a non-inline cold path?

Good point. I'll do that in the next iteration.

-- 
Jeff Layton <jlayton@redhat.com>


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/14] VFS: Filesystem information [ver #18]
  2020-03-09 14:00 [PATCH 00/14] VFS: Filesystem information [ver #18] David Howells
                   ` (15 preceding siblings ...)
  2020-03-09 20:02 ` Miklos Szeredi
@ 2020-03-09 22:52 ` David Howells
  2020-03-10  9:18   ` Miklos Szeredi
  16 siblings, 1 reply; 50+ messages in thread
From: David Howells @ 2020-03-09 22:52 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dhowells, torvalds, viro, Theodore Ts'o, Stefan Metzmacher,
	Andreas Dilger, linux-ext4, Aleksa Sarai, Trond Myklebust,
	Anna Schumaker, linux-nfs, linux-api, raven, mszeredi, christian,
	jannh, darrick.wong, kzak, jlayton, linux-fsdevel,
	linux-security-module, linux-kernel

Miklos Szeredi <miklos@szeredi.hu> wrote:

> >  (1) It can be targeted.  It makes it easy to query directly by path or
> >      fd, but can also query by mount ID or fscontext fd.  procfs and sysfs
> >      cannot do three of these things easily.
> 
> See above: with the addition of open(path, O_PATH) it can do all of these.

That's a horrible interface.  To query a file by path, you have to do:

	fd = open(path, O_PATH);
	sprintf(procpath, "/proc/self/fdmount/%u/<attr>", fd);
	fd2 = open(procpath, O_RDONLY);
	read(fd2, ...);
	close(fd2);
	close(fd);

See point (3) about efficiency also.  You're having to open *two* files.
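
Spelled out with error handling, the sequence looks roughly like the sketch below.  The helper names are hypothetical, and /proc/self/fdmount exists only with the mountfs patch under discussion applied, so on other kernels the second open simply fails.

```c
#define _GNU_SOURCE		/* for O_PATH */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Build "/proc/self/fdmount/<fd>/<attr>".  Returns 0 on success, or -1 if
 * the attribute name contains a '/' or the result does not fit in buf.
 */
static int fdmount_attr_path(char *buf, size_t size, int fd, const char *attr)
{
	int n;

	assert(buf && attr && fd >= 0);
	if (strchr(attr, '/'))
		return -1;
	n = snprintf(buf, size, "/proc/self/fdmount/%d/%s", fd, attr);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/*
 * Read one attribute of the mount that holds 'path'.  Returns the number
 * of bytes read, or -1 on any failure.  Note the two opens and two closes
 * needed for a single query.
 */
static ssize_t read_mount_attr(const char *path, const char *attr,
			       char *out, size_t outsz)
{
	char procpath[64];
	ssize_t n = -1;
	int fd, fd2;

	fd = open(path, O_PATH | O_CLOEXEC);		/* first open */
	if (fd < 0)
		return -1;

	if (fdmount_attr_path(procpath, sizeof(procpath), fd, attr) == 0) {
		fd2 = open(procpath, O_RDONLY | O_CLOEXEC);	/* second open */
		if (fd2 >= 0) {
			n = read(fd2, out, outsz);
			close(fd2);
		}
	}
	close(fd);
	return n;
}
```

That is six system calls (open, open, read, close, close, plus the path lookup done twice) for what fsinfo() would do in one.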

> >  (2) Easier to provide LSM oversight.  Is the accessing process allowed to
> >      query information pertinent to a particular file?
> 
> Not quite sure why this would be easier for a new ad-hoc interface than for
> the well established filesystem API.

You're right.  That's why fsinfo() uses standard pathwalk where possible,
e.g.:

	fsinfo(AT_FDCWD, "/path/to/file", ...);

or a fairly standard fd-querying interface:

	fsinfo(fd, "", { resolve_flags = RESOLVE_EMPTY_PATH },  ...);

to query an open file descriptor.  These are well-established filesystem APIs.

Where I vary from this is allowing direct specification of a mount ID also,
with a special flag to say that's what I'm doing:

	fsinfo(AT_FDCWD, "23", { flags = FSINFO_QUERY_FLAGS_MOUNT },  ...);

> >  (7) Don't have to create/delete a bunch of sysfs/procfs nodes each time a
> >      mount happens or is removed - and since systemd makes much use of
> >      mount namespaces and mount propagation, this will create a lot of
> >      nodes.
> 
> This patch creates a single struct mountfs_entry per mount, which is 48bytes.

fsinfo() doesn't create any.  Furthermore, it seems that mounts get multiplied
8-10 times by systemd - though, as you say, it's not necessarily a great deal
of memory.

> Now onto the advantages of a filesystem based API:
> 
>  - immediately usable from all programming languages, including scripts

This is not true.  You can't open O_PATH from shell scripts, so you can't
query things by path that you can't or shouldn't open (dev file paths, for
example; symlinks).

I imagine you're thinking of something like:

	{
		id=`cat /proc/self/fdmount/5/parent_mount`
	} 5</my/path/to/my/file

but what if /my/path/to/my/file is actually /dev/foobar?

I've had a grep through the bash sources, but can't seem to find anywhere that
uses O_PATH.

>  - same goes for future extensions: no need to update libc, utils, language
>    bindings, strace, etc...

Applications and libraries using these attributes would have to change anyway
to make use of additional information.

But it's not a good argument since you now have to have text parsers that
change over time.

David


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/14] VFS: Filesystem information [ver #18]
  2020-03-09 22:49     ` Jeff Layton
@ 2020-03-10  0:18       ` Andres Freund
  0 siblings, 0 replies; 50+ messages in thread
From: Andres Freund @ 2020-03-10  0:18 UTC (permalink / raw)
  To: Jeff Layton
  Cc: David Howells, torvalds, viro, Theodore Ts'o,
	Stefan Metzmacher, Andreas Dilger, linux-ext4, Aleksa Sarai,
	Trond Myklebust, Anna Schumaker, linux-nfs, linux-api, raven,
	mszeredi, christian, jannh, darrick.wong, kzak, linux-fsdevel,
	linux-security-module, linux-kernel

Hi,

On 2020-03-09 18:49:31 -0400, Jeff Layton wrote:
> On Mon, 2020-03-09 at 12:22 -0700, Andres Freund wrote:
> > On 2020-03-09 13:50:59 -0400, Jeff Layton wrote:
> > > I sent a patch a few weeks ago to make syncfs() return errors when there
> > > have been writeback errors on the superblock. It's not merged yet, but
> > > once we have something like that in place, we could expose info from the
> > > errseq_t to userland using this interface.
> >
> > I'm still a bit worried about the details of errseq_t being exposed to
> > userland. Partially because it seems to restrict further evolution of
> > > errseq_t, and partially because it will likely end up with userland trying
> > to understand it (it's e.g. just too attractive to report a count of
> > errors etc).
>
> Trying to interpret the counter field won't really tell you anything.
> The counter is not incremented unless someone has queried the value
> since it was last checked. A single increment could represent a single
> writeback error or 10000 identical ones.

Oh, right.  A zero errseq would still indicate something, but that's
probably fine.


> > Is there a reason to not instead report a 64bit counter instead of the
> > cookie? In contrast to the struct file case we'd only have the space
> > overhead once per superblock, rather than once per #files * #fd. And it
> > seems that the maintenance of that counter could be done without
> > widespread changes, e.g. instead/in addition to your change:

> What problem would moving to a 64-bit counter solve? I get the concern
> about people trying to get a counter out of the cookie field, but giving
> people an explicit 64-bit counter seems even more open to
> misinterpretation.

Well, you could get an actual error count out of it? I was thinking that
that value would get incremented every time mapping_set_error() is
called, which should make it a meaningful count?

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 01/14] VFS: Add additional RESOLVE_* flags [ver #18]
  2020-03-09 14:00 ` [PATCH 01/14] VFS: Add additional RESOLVE_* flags " David Howells
  2020-03-09 20:56   ` Stefan Metzmacher
  2020-03-09 21:13   ` David Howells
@ 2020-03-10  0:55   ` Aleksa Sarai
  2020-03-10  1:14     ` Linus Torvalds
  2020-03-10  7:25     ` David Howells
  2 siblings, 2 replies; 50+ messages in thread
From: Aleksa Sarai @ 2020-03-10  0:55 UTC (permalink / raw)
  To: David Howells
  Cc: torvalds, viro, Stefan Metzmacher, raven, mszeredi, christian,
	jannh, darrick.wong, kzak, jlayton, linux-api, linux-fsdevel,
	linux-security-module, linux-kernel

[-- Attachment #1: Type: application/pgp-encrypted, Size: 11 bytes --]

[-- Attachment #2: msg.asc --]
[-- Type: application/octet-stream, Size: 3754 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 01/14] VFS: Add additional RESOLVE_* flags [ver #18]
  2020-03-10  0:55   ` Aleksa Sarai
@ 2020-03-10  1:14     ` Linus Torvalds
  2020-03-10  7:25     ` David Howells
  1 sibling, 0 replies; 50+ messages in thread
From: Linus Torvalds @ 2020-03-10  1:14 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: David Howells, Al Viro, Stefan Metzmacher, Ian Kent,
	Miklos Szeredi, Christian Brauner, Jann Horn, Darrick J. Wong,
	Karel Zak, jlayton, Linux API, linux-fsdevel, LSM List,
	Linux Kernel Mailing List

On Mon, Mar 9, 2020 at 5:56 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2020-03-09, David Howells <dhowells@redhat.com> wrote:
> > This is necessary for fsinfo() to use RESOLVE_* flags instead of AT_* flags
> > if the latter are to be considered deprecated for new system calls.
> >
> > Also make openat2() handle RESOLVE_NO_TRAILING_SYMLINKS.

No, please let's not do this.

We have O_NOFOLLOW, and we can't get rid of it.

So adding RESOLVE_NO_TRAILING_SYMLINKS isn't a cleanup. It's just
extra complexity for absolutely zero gain.


> After thinking about what Christian said some more, I reckon we
> shouldn't support both O_NOFOLLOW and RESOLVE_NO_TRAILING_SYMLINKS. But
> that means we'll need to cherry-pick this patch and get it into mainline
> before v5.6.

No.

It simply means that we shouldn't have RESOLVE_NO_TRAILING_SYMLINKS at all.

Adding that flag is a mistake. It causes problems like this, where
suddenly people say "what if both flags are set".

Just don't do it.

There's no way in hell we can ever get rid of O_NOFOLLOW anyway, since
people will continue to use plain open() and openat().
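
For reference, the existing behaviour with plain open() can be checked with a small test like the following.  This is only a sketch: the helper names and the /tmp fixture are made up for the example, and the ELOOP result is what Linux returns for a trailing symlink when O_PATH is not also given.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Open read-only, refusing a trailing symlink.  Returns an fd or -errno. */
static int open_nofollow(const char *path)
{
	int fd;

	assert(path);
	fd = open(path, O_RDONLY | O_NOFOLLOW);
	return fd < 0 ? -errno : fd;
}

/*
 * Create a temporary regular file and a symlink pointing at it.
 * Fills 'target' and 'lnk' with the two names.  Returns 0 or -errno.
 */
static int make_symlink_fixture(char *target, size_t tsz, char *lnk, size_t lsz)
{
	int fd, n;

	n = snprintf(target, tsz, "/tmp/nofollow-XXXXXX");
	if (n < 0 || (size_t)n >= tsz)
		return -ENAMETOOLONG;
	fd = mkstemp(target);
	if (fd < 0)
		return -errno;
	close(fd);

	n = snprintf(lnk, lsz, "%s.lnk", target);
	if (n < 0 || (size_t)n >= lsz) {
		unlink(target);
		return -ENAMETOOLONG;
	}
	if (symlink(target, lnk) < 0) {
		int err = -errno;

		unlink(target);
		return err;
	}
	return 0;
}
```

Calling open_nofollow() on the regular file gives a usable fd, while calling it on the symlink gives -ELOOP.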

So adding RESOLVE_NO_TRAILING_SYMLINKS is entirely redundant.

Don't deprecate the old flags that are going to always stay around,
don't add stupid new flags that add no value.

It's that easy.

              Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 01/14] VFS: Add additional RESOLVE_* flags [ver #18]
  2020-03-10  0:55   ` Aleksa Sarai
  2020-03-10  1:14     ` Linus Torvalds
@ 2020-03-10  7:25     ` David Howells
  2020-03-11 17:59       ` Linus Torvalds
  1 sibling, 1 reply; 50+ messages in thread
From: David Howells @ 2020-03-10  7:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, Aleksa Sarai, Al Viro, Stefan Metzmacher, Ian Kent,
	Miklos Szeredi, Christian Brauner, Jann Horn, Darrick J. Wong,
	Karel Zak, jlayton, Linux API, linux-fsdevel, LSM List,
	Linux Kernel Mailing List

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > > Also make openat2() handle RESOLVE_NO_TRAILING_SYMLINKS.
> 
> No, please let's not do this.
> 
> We have O_NOFOLLOW, and we can't get rid of it.
> 
> So adding RESOLVE_NO_TRAILING_SYMLINKS isn't a cleanup. It's just
> extra complexity for absolutely zero gain.

Okay.  So what's the equivalent of AT_SYMLINK_NOFOLLOW in RESOLVE_* flag
terms?  RESOLVE_NO_SYMLINKS is not equivalent, though O_NOFOLLOW is.  The
reason I ask is that RESOLVE_* flags can't be easily extended to non-open
syscalls that don't take O_* flags without it.  Would you prefer that new
non-open syscalls continue to take AT_* and ignore RESOLVE_* flags?  That
would be fine by me.
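
As a reminder of what AT_SYMLINK_NOFOLLOW means for a non-open syscall today, here is a small sketch using fstatat().  The helper name is hypothetical; the calls themselves are the standard POSIX ones.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Inspect 'path' without following a trailing symlink.
 * Returns 1 and stores the link target in 'target' if path is a symlink,
 * 0 if it is some other kind of object, or -1 on error.
 */
static int symlink_target_nofollow(const char *path, char *target, size_t size)
{
	struct stat st;
	ssize_t n;

	assert(path && target && size > 0);

	/* AT_SYMLINK_NOFOLLOW: stat the link itself, not what it points to. */
	if (fstatat(AT_FDCWD, path, &st, AT_SYMLINK_NOFOLLOW) < 0)
		return -1;
	if (!S_ISLNK(st.st_mode))
		return 0;

	n = readlink(path, target, size - 1);
	if (n < 0)
		return -1;
	target[n] = '\0';
	return 1;
}
```

Only the final component is affected by the flag; symlinks in the middle of the path are still followed, which is the difference from RESOLVE_NO_SYMLINKS mentioned above.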

David


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 08/14] fsinfo: Allow the mount topology propogation flags to be retrieved [ver #18]
  2020-03-09 14:02 ` [PATCH 08/14] fsinfo: Allow the mount topology propogation flags to be retrieved " David Howells
@ 2020-03-10  8:42   ` Christian Brauner
  0 siblings, 0 replies; 50+ messages in thread
From: Christian Brauner @ 2020-03-10  8:42 UTC (permalink / raw)
  To: David Howells
  Cc: torvalds, viro, raven, mszeredi, christian, jannh, darrick.wong,
	kzak, jlayton, linux-api, linux-fsdevel, linux-security-module,
	linux-kernel

On Mon, Mar 09, 2020 at 02:02:01PM +0000, David Howells wrote:
> Allow the mount topology propagation flags to be retrieved as part of the
> FSINFO_ATTR_MOUNT_INFO attributes.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>

(Btw, I had a patchset for the old stat* family of syscalls a while back
https://lwn.net/ml/linux-fsdevel/20180418092722.20136-1-christian.brauner@ubuntu.com/)

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 07/14] fsinfo: Allow mount information to be queried [ver #18]
  2020-03-09 14:01 ` [PATCH 07/14] fsinfo: Allow mount information to be queried " David Howells
@ 2020-03-10  9:04   ` Miklos Szeredi
  0 siblings, 0 replies; 50+ messages in thread
From: Miklos Szeredi @ 2020-03-10  9:04 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Al Viro, Ian Kent, Miklos Szeredi,
	Christian Brauner, Jann Horn, Darrick J. Wong, Karel Zak,
	Jeff Layton, Linux API, linux-fsdevel, LSM, linux-kernel

On Mon, Mar 9, 2020 at 3:02 PM David Howells <dhowells@redhat.com> wrote:
>
> Allow mount information, including information about the topology tree to
> be queried with the fsinfo() system call.  Setting AT_FSINFO_QUERY_MOUNT
> allows overlapping mounts to be queried by indicating that the syscall
> should interpret the pathname as a number indicating the mount ID.
>
> To this end, a number of fsinfo() attributes are provided:
>
>  (1) FSINFO_ATTR_MOUNT_INFO.
>
>      This is a structure providing information about a mount, including:
>
>         - Mounted superblock ID (mount ID uniquifier).
>         - Mount ID (can be used with AT_FSINFO_QUERY_MOUNT).
>         - Parent mount ID.
>         - Mount attributes (eg. R/O, NOEXEC).
>         - Mount change/notification counter.
>
>      Note that the parent mount ID is overridden to the ID of the queried
>      mount if the parent lies outside of the chroot or dfd tree.
>
>  (2) FSINFO_ATTR_MOUNT_PATH.
>
>      This is a string providing information about a bind mount relative to
>      the root that was bound off, though it may get overridden by the
>      filesystem (NFS unconditionally sets it to "/", for example).
>
>  (3) FSINFO_ATTR_MOUNT_POINT.
>
>      This is a string indicating the name of the mountpoint within the
>      parent mount, limited to the parent's mounted root and the chroot.
>
>  (4) FSINFO_ATTR_MOUNT_POINT_FULL.
>
>      This is a string indicating the full path of the mountpoint, limited to
>      the chroot.
>
>  (5) FSINFO_ATTR_MOUNT_CHILDREN.
>
>      This produces an array of structures, one for each child and capped
>      with one for the argument mount (checked after listing all the
>      children).  Each element contains the mount ID and the change counter
>      of the respective mount object.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
>
>  fs/d_path.c                 |    2
>  fs/fsinfo.c                 |   14 +++
>  fs/internal.h               |   10 ++
>  fs/namespace.c              |  177 +++++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/fsinfo.h |   36 +++++++++
>  samples/vfs/test-fsinfo.c   |   43 ++++++++++
>  6 files changed, 281 insertions(+), 1 deletion(-)
>
> diff --git a/fs/d_path.c b/fs/d_path.c
> index 0f1fc1743302..4c203f64e45e 100644
> --- a/fs/d_path.c
> +++ b/fs/d_path.c
> @@ -229,7 +229,7 @@ static int prepend_unreachable(char **buffer, int *buflen)
>         return prepend(buffer, buflen, "(unreachable)", 13);
>  }
>
> -static void get_fs_root_rcu(struct fs_struct *fs, struct path *root)
> +void get_fs_root_rcu(struct fs_struct *fs, struct path *root)
>  {
>         unsigned seq;
>
> diff --git a/fs/fsinfo.c b/fs/fsinfo.c
> index bafeb73feaf4..6d2bc03998e4 100644
> --- a/fs/fsinfo.c
> +++ b/fs/fsinfo.c
> @@ -236,6 +236,14 @@ static int fsinfo_generic_seq_read(struct path *path, struct fsinfo_context *ctx
>                         ret = sb->s_op->show_options(&m, path->mnt->mnt_root);
>                 break;
>
> +       case FSINFO_ATTR_MOUNT_PATH:
> +               if (sb->s_op->show_path) {
> +                       ret = sb->s_op->show_path(&m, path->mnt->mnt_root);
> +               } else {
> +                       seq_dentry(&m, path->mnt->mnt_root, " \t\n\\");
> +               }
> +               break;
> +
>         case FSINFO_ATTR_FS_STATISTICS:
>                 if (sb->s_op->show_stats)
>                         ret = sb->s_op->show_stats(&m, path->mnt->mnt_root);
> @@ -270,6 +278,12 @@ static const struct fsinfo_attribute fsinfo_common_attributes[] = {
>
>         FSINFO_LIST     (FSINFO_ATTR_FSINFO_ATTRIBUTES, (void *)123UL),
>         FSINFO_VSTRUCT_N(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO, (void *)123UL),
> +
> +       FSINFO_VSTRUCT  (FSINFO_ATTR_MOUNT_INFO,        fsinfo_generic_mount_info),
> +       FSINFO_STRING   (FSINFO_ATTR_MOUNT_PATH,        fsinfo_generic_seq_read),
> +       FSINFO_STRING   (FSINFO_ATTR_MOUNT_POINT,       fsinfo_generic_mount_point),
> +       FSINFO_STRING   (FSINFO_ATTR_MOUNT_POINT_FULL,  fsinfo_generic_mount_point_full),
> +       FSINFO_LIST     (FSINFO_ATTR_MOUNT_CHILDREN,    fsinfo_generic_mount_children),
>         {}
>  };
>
> diff --git a/fs/internal.h b/fs/internal.h
> index abbd5299e7dc..1a318dc85f2f 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -15,6 +15,7 @@ struct mount;
>  struct shrink_control;
>  struct fs_context;
>  struct user_namespace;
> +struct fsinfo_context;
>
>  /*
>   * block_dev.c
> @@ -47,6 +48,11 @@ extern int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
>   */
>  extern void __init chrdev_init(void);
>
> +/*
> + * d_path.c
> + */
> +extern void get_fs_root_rcu(struct fs_struct *fs, struct path *root);
> +
>  /*
>   * fs_context.c
>   */
> @@ -93,6 +99,10 @@ extern void __mnt_drop_write_file(struct file *);
>  extern void dissolve_on_fput(struct vfsmount *);
>  extern int lookup_mount_object(struct path *, int, struct path *);
>  extern int fsinfo_generic_mount_source(struct path *, struct fsinfo_context *);
> +extern int fsinfo_generic_mount_info(struct path *, struct fsinfo_context *);
> +extern int fsinfo_generic_mount_point(struct path *, struct fsinfo_context *);
> +extern int fsinfo_generic_mount_point_full(struct path *, struct fsinfo_context *);
> +extern int fsinfo_generic_mount_children(struct path *, struct fsinfo_context *);
>
>  /*
>   * fs_struct.c
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 54e8eb93fdd6..a6cb8c6b004f 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -4149,4 +4149,181 @@ int lookup_mount_object(struct path *root, int mnt_id, struct path *_mntpt)
>         goto out_unlock;
>  }
>
> +/*
> + * Retrieve information about the nominated mount.
> + */
> +int fsinfo_generic_mount_info(struct path *path, struct fsinfo_context *ctx)
> +{
> +       struct fsinfo_mount_info *p = ctx->buffer;
> +       struct super_block *sb;
> +       struct mount *m;
> +       struct path root;
> +       unsigned int flags;
> +
> +       m = real_mount(path->mnt);
> +       sb = m->mnt.mnt_sb;
> +
> +       p->sb_unique_id         = sb->s_unique_id;
> +       p->mnt_unique_id        = m->mnt_unique_id;
> +       p->mnt_id               = m->mnt_id;
> +       p->parent_id            = m->mnt_parent->mnt_id;
> +
> +       get_fs_root(current->fs, &root);
> +       if (path->mnt == root.mnt) {
> +               p->parent_id = p->mnt_id;
> +       } else {
> +               rcu_read_lock();
> +               if (!are_paths_connected(&root, path))
> +                       p->parent_id = p->mnt_id;
> +               rcu_read_unlock();
> +       }
> +       if (IS_MNT_SHARED(m))
> +               p->group_id = m->mnt_group_id;
> +       if (IS_MNT_SLAVE(m)) {
> +               int master = m->mnt_master->mnt_group_id;
> +               int dom = get_dominating_id(m, &root);

This isn't safe without namespace_sem or mount_lock.

Thanks,
Miklos


* Re: [PATCH 00/14] VFS: Filesystem information [ver #18]
  2020-03-09 22:52 ` David Howells
@ 2020-03-10  9:18   ` Miklos Szeredi
  0 siblings, 0 replies; 50+ messages in thread
From: Miklos Szeredi @ 2020-03-10  9:18 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Al Viro, Theodore Ts'o, Stefan Metzmacher,
	Andreas Dilger, linux-ext4, Aleksa Sarai, Trond Myklebust,
	Anna Schumaker, Linux NFS list, Linux API, Ian Kent,
	Miklos Szeredi, Christian Brauner, Jann Horn, Darrick J. Wong,
	Karel Zak, Jeff Layton, linux-fsdevel, LSM, linux-kernel

On Mon, Mar 9, 2020 at 11:53 PM David Howells <dhowells@redhat.com> wrote:
>
> Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > >  (1) It can be targetted.  It makes it easy to query directly by path or
> > >      fd, but can also query by mount ID or fscontext fd.  procfs and sysfs
> > >      cannot do three of these things easily.
> >
> > See above: with the addition of open(path, O_PATH) it can do all of these.
>
> That's a horrible interface.  To query a file by path, you have to do:
>
>         fd = open(path, O_PATH);
>         sprintf(procpath, "/proc/self/fdmount/%u/<attr>", fd);
>         fd2 = open(procpath, O_RDONLY);
>         read(fd2, ...);
>         close(fd2);
>         close(fd);
>
> See point (3) about efficiency also.  You're having to open *two* files.

I completely agree, opening two files is surely going to kill the
performance of applications needing to retrieve a billion mount
attributes per second.</sarcasm>

> > >  (2) Easier to provide LSM oversight.  Is the accessing process allowed to
> > >      query information pertinent to a particular file?
> >
> > Not quite sure why this would be easier for a new ad-hoc interface than for
> > the well established filesystem API.
>
> You're right.  That's why fsinfo() uses standard pathwalk where possible,
> e.g.:
>
>         fsinfo(AT_FDCWD, "/path/to/file", ...);
>
> or a fairly standard fd-querying interface:
>
>         fsinfo(fd, "", { resolve_flags = RESOLVE_EMPTY_PATH },  ...);
>
> to query an open file descriptor.  These are well-established filesystem APIs.

Yes.  The problem is with the "..." part where you pass random
structures to a function.  That's useful sometimes, but at the very
least it breaks type safety, and it is not what I would call a "clean" API.

> > Now onto the advantages of a filesystem based API:
> >
> >  - immediately usable from all programming languages, including scripts
>
> This is not true.  You can't open O_PATH from shell scripts, so you can't
> query things by path that you can't or shouldn't open (dev file paths, for
> example; symlinks).

Yes.  However, you just wrote the core of a utility that could do this
(in 6 lines, no less).  Now try that feat with fsinfo(2)!

Thanks,
Miklos


* Re: [PATCH 02/14] fsinfo: Add fsinfo() syscall to query filesystem information [ver #18]
  2020-03-09 14:01 ` [PATCH 02/14] fsinfo: Add fsinfo() syscall to query filesystem information " David Howells
@ 2020-03-10  9:31   ` Christian Brauner
  2020-03-10  9:32     ` [PATCH v19 01/14] fsinfo: Add fsinfo() syscall to query filesystem information Christian Brauner
  0 siblings, 1 reply; 50+ messages in thread
From: Christian Brauner @ 2020-03-10  9:31 UTC (permalink / raw)
  To: David Howells
  Cc: torvalds, viro, linux-api, raven, mszeredi, christian, jannh,
	darrick.wong, kzak, jlayton, linux-fsdevel,
	linux-security-module, linux-kernel

On Mon, Mar 09, 2020 at 02:01:06PM +0000, David Howells wrote:
> Add a system call to allow filesystem information to be queried.  A request
> value can be given to indicate the desired attribute.  Support is provided
> for enumerating multi-value attributes.
> 
> ===============
> NEW SYSTEM CALL
> ===============
> 
> The new system call looks like:
> 
> 	int ret = fsinfo(int dfd,
> 			 const char *pathname,
> 			 const struct fsinfo_params *params,
> 			 size_t params_size,
> 			 void *result_buffer,
> 			 size_t result_buf_size);
> 
> The params parameter optionally points to a block of parameters:
> 
> 	struct fsinfo_params {
> 		__u32	resolve_flags;
> 		__u32	flags;
> 		__u32	request;
> 		__u32	Nth;
> 		__u32	Mth;
> 	};
> 
> If params is NULL, the default is that params->request is
> FSINFO_ATTR_STATFS and all the other fields are 0.  params_size indicates
> the size of the parameter struct.  If the parameter block is short compared
> to what the kernel expects, the missing length will be set to 0; if the
> parameter block is longer, an error will be given if the excess is not all
> zeros.
> 
> The object to be queried is specified as follows - part of params->flags
> indicates the type of reference:
> 
>  (1) FSINFO_FLAGS_QUERY_PATH - dfd, pathname and resolve_flags indicate a
>      filesystem object to query.  There is no separate system call
>      providing an analogue of lstat() - RESOLVE_NO_TRAILING_SYMLINKS should
>      be set in resolve_flags instead.  RESOLVE_NO_TRAILING_AUTOMOUNTS can
>      also be used to allow an automount point to be queried without
>      triggering it.
> 
>  (2) FSINFO_FLAGS_QUERY_FD - dfd indicates a file descriptor pointing to
>      the filesystem object to query.  pathname should be NULL.
> 
>  (3) FSINFO_FLAGS_QUERY_MOUNT - pathname indicates the numeric ID of the
>      mountpoint to query as a string.  dfd is used to constrain which
>      mounts can be accessed.  If dfd is AT_FDCWD, the mount must be within
>      the subtree rooted at chroot, otherwise the mount must be within the
>      subtree rooted at the directory specified by dfd.
> 
>  (4) In the future FSINFO_FLAGS_QUERY_FSCONTEXT will be added - dfd will
>      indicate a context handle fd obtained from fsopen() or fspick(),
>      allowing that to be queried before the target superblock is attached
>      to the filesystem or even created.
> 
> params->request indicates the attribute/attributes to be queried.  This can
> be one of:
> 
> 	FSINFO_ATTR_STATFS		- statfs-style info
> 	FSINFO_ATTR_IDS			- Filesystem IDs
> 	FSINFO_ATTR_LIMITS		- Filesystem limits
> 	FSINFO_ATTR_SUPPORTS		- Support for statx, ioctl, etc.
> 	FSINFO_ATTR_TIMESTAMP_INFO	- Inode timestamp info
> 	FSINFO_ATTR_VOLUME_ID		- Volume ID (string)
> 	FSINFO_ATTR_VOLUME_UUID		- Volume UUID
> 	FSINFO_ATTR_VOLUME_NAME		- Volume name (string)
> 	FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO - Information about attr Nth
> 	FSINFO_ATTR_FSINFO_ATTRIBUTES	- List of supported attrs
> 
> Some attributes (such as the servers backing a network filesystem) can have
> multiple values.  These can be enumerated by setting params->Nth and
> params->Mth to 0, 1, ... until ENODATA is returned.
> 
> result_buffer and result_buf_size point to the reply buffer.  The buffer is
> filled up to the specified size, even if this means truncating the reply.
> The size of the full reply is returned, irrespective of the amount of
> data that was copied.  In future versions, this will allow extra fields to
> be tacked on to the end of the reply, but anyone not expecting them will
> only get the subset they're expecting.  If either result_buffer or
> result_buf_size is 0, no copy will take place and the data size will be
> returned.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: linux-api@vger.kernel.org

You're missing the wire-up of the syscall into the arm64 unistd32.h table
and this is all in one patch. I'd rather do it like we have done for all
other syscalls recently, and split this into:
- actual syscall implementation
- final wiring-up patch
Will make it easier to apply and spot merge conflicts when multiple
syscalls are proposed. I'm going to respond to this mail here with two
patches. One could replace this one I'm responding to and the other one
should probably go on top of the series.
(Please note that the same missing arm64 unistd32.h handling also likely
 affects the watch syscalls as I haven't seen them in there when I added
 fsinfo().)



* [PATCH v19 01/14] fsinfo: Add fsinfo() syscall to query filesystem information
  2020-03-10  9:31   ` Christian Brauner
@ 2020-03-10  9:32     ` Christian Brauner
  2020-03-10  9:32       ` [PATCH v19 14/14] arch: wire up fsinfo syscall Christian Brauner
  0 siblings, 1 reply; 50+ messages in thread
From: Christian Brauner @ 2020-03-10  9:32 UTC (permalink / raw)
  To: christian.brauner
  Cc: christian, darrick.wong, dhowells, jannh, jlayton, kzak,
	linux-api, linux-fsdevel, linux-kernel, linux-security-module,
	mszeredi, raven, torvalds, viro

From: David Howells <dhowells@redhat.com>

Add a system call to allow filesystem information to be queried.  A request
value can be given to indicate the desired attribute.  Support is provided
for enumerating multi-value attributes.

===============
NEW SYSTEM CALL
===============

The new system call looks like:

	int ret = fsinfo(int dfd,
			 const char *pathname,
			 const struct fsinfo_params *params,
			 size_t params_size,
			 void *result_buffer,
			 size_t result_buf_size);

The params parameter optionally points to a block of parameters:

	struct fsinfo_params {
		__u32	resolve_flags;
		__u32	flags;
		__u32	request;
		__u32	Nth;
		__u32	Mth;
	};

If params is NULL, the default is that params->request is
FSINFO_ATTR_STATFS and all the other fields are 0.  params_size indicates
the size of the parameter struct.  If the parameter block is short compared
to what the kernel expects, the missing length will be set to 0; if the
parameter block is longer, an error will be given if the excess is not all
zeros.
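
As a rough illustration of the size rule above (this is not the code from the
patch), the following userspace sketch mimics it.  The struct name params_v1
and the helper copy_params() are hypothetical, and the choice of -E2BIG for a
non-zero excess is an assumption borrowed from copy_struct_from_user(); the
text above only says that an error is given.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's view of struct fsinfo_params. */
struct params_v1 {
	uint32_t resolve_flags;
	uint32_t flags;
	uint32_t request;
	uint32_t Nth;
	uint32_t Mth;
};

/*
 * Copy a caller-supplied parameter block of usize bytes into *dst.
 * A short block is zero-extended; a long block is accepted only if
 * every byte beyond sizeof(*dst) is zero (assumed error: -E2BIG).
 */
static int copy_params(struct params_v1 *dst, const void *src, size_t usize)
{
	const unsigned char *s = src;
	size_t ksize = sizeof(*dst);
	size_t i;

	assert(dst != NULL && src != NULL);

	/* Start from all zeros so that missing fields default to 0. */
	memset(dst, 0, ksize);

	/* Reject a longer block whose excess is not all zeros. */
	for (i = ksize; i < usize; i++)
		if (s[i] != 0)
			return -E2BIG;

	memcpy(dst, src, usize < ksize ? usize : ksize);
	return 0;
}
```

An older caller that only knows about the first three fields would pass a
12-byte block and see Nth and Mth default to 0; a newer caller with extra
trailing fields is accepted as long as those fields are zero.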

The object to be queried is specified as follows - part of params->flags
indicates the type of reference:

 (1) FSINFO_FLAGS_QUERY_PATH - dfd, pathname and resolve_flags indicate a
     filesystem object to query.  There is no separate system call
     providing an analogue of lstat() - RESOLVE_NO_TRAILING_SYMLINKS should
     be set in resolve_flags instead.  RESOLVE_NO_TRAILING_AUTOMOUNTS can
     also be used to allow an automount point to be queried without
     triggering it.

 (2) FSINFO_FLAGS_QUERY_FD - dfd indicates a file descriptor pointing to
     the filesystem object to query.  pathname should be NULL.

 (3) FSINFO_FLAGS_QUERY_MOUNT - pathname indicates the numeric ID of the
     mountpoint to query as a string.  dfd is used to constrain which
     mounts can be accessed.  If dfd is AT_FDCWD, the mount must be within
     the subtree rooted at chroot, otherwise the mount must be within the
     subtree rooted at the directory specified by dfd.

 (4) In the future FSINFO_FLAGS_QUERY_FSCONTEXT will be added - dfd will
     indicate a context handle fd obtained from fsopen() or fspick(),
     allowing that to be queried before the target superblock is attached
     to the filesystem or even created.

params->request indicates the attribute/attributes to be queried.  This can
be one of:

	FSINFO_ATTR_STATFS		- statfs-style info
	FSINFO_ATTR_IDS			- Filesystem IDs
	FSINFO_ATTR_LIMITS		- Filesystem limits
	FSINFO_ATTR_SUPPORTS		- Support for statx, ioctl, etc.
	FSINFO_ATTR_TIMESTAMP_INFO	- Inode timestamp info
	FSINFO_ATTR_VOLUME_ID		- Volume ID (string)
	FSINFO_ATTR_VOLUME_UUID		- Volume UUID
	FSINFO_ATTR_VOLUME_NAME		- Volume name (string)
	FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO - Information about attr Nth
	FSINFO_ATTR_FSINFO_ATTRIBUTES	- List of supported attrs

Some attributes (such as the servers backing a network filesystem) can have
multiple values.  These can be enumerated by setting params->Nth and
params->Mth to 0, 1, ... until ENODATA is returned.
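
The enumeration pattern described above might look roughly like this from
userspace.  Since fsinfo() is not generally available, the sketch uses a
hypothetical mock_query() in place of the real call, with made-up server
names; only the "step Nth until ENODATA" loop is the point.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for an fsinfo() call on a multi-value attribute:
 * copies value number nth into buf (truncating if needed) and returns its
 * full length, or returns -1 with errno set to ENODATA once nth runs past
 * the last value.
 */
static long mock_query(unsigned int nth, char *buf, size_t buf_size)
{
	static const char *const servers[] = { "alpha", "beta", "gamma" };
	size_t len;

	assert(buf != NULL || buf_size == 0);

	if (nth >= sizeof(servers) / sizeof(servers[0])) {
		errno = ENODATA;
		return -1;
	}

	len = strlen(servers[nth]);
	if (buf_size > 0) {
		size_t n = len < buf_size - 1 ? len : buf_size - 1;

		memcpy(buf, servers[nth], n);
		buf[n] = '\0';
	}
	return (long)len;
}

/*
 * Walk Nth = 0, 1, ... until ENODATA.  Returns the number of values seen,
 * or -1 if the query failed with any other error.
 */
static int count_values(void)
{
	char buf[64];
	unsigned int nth;

	for (nth = 0; ; nth++) {
		if (mock_query(nth, buf, sizeof(buf)) < 0)
			return errno == ENODATA ? (int)nth : -1;
	}
}
```

For an attribute indexed by both Nth and Mth, the same loop would be nested,
resetting Mth to 0 each time Nth is advanced.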

result_buffer and result_buf_size point to the reply buffer.  The buffer is
filled up to the specified size, even if this means truncating the reply.
The size of the full reply is returned, irrespective of the amount of
data that was copied.  In future versions, this will allow extra fields to
be tacked on to the end of the reply, but anyone not expecting them will
only get the subset they're expecting.  If either result_buffer or
result_buf_size is 0, no copy will take place and the data size will be
returned.
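
The reply-buffer behaviour can be sketched as below.  This is a userspace
model rather than the kernel's copy-out path, and deliver_result() is a
hypothetical name; clearing the unused tail of the buffer follows the
sys_fsinfo() comment in fs/fsinfo.c later in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Model of how a reply of value_size bytes is handed back: copy as much as
 * fits, zero any unused buffer space, and always report the full size so
 * that the caller can detect truncation and retry with a larger buffer.
 */
static size_t deliver_result(const void *value, size_t value_size,
			     void *buf, size_t buf_size)
{
	size_t n;

	assert(value != NULL || value_size == 0);

	if (buf == NULL || buf_size == 0)
		return value_size;	/* Size query only, nothing copied */

	n = value_size < buf_size ? value_size : buf_size;
	if (n > 0)
		memcpy(buf, value, n);	/* May truncate the reply */
	memset((char *)buf + n, 0, buf_size - n); /* Clear the excess */
	return value_size;
}
```

A caller would typically make one call with a zero-sized buffer to learn the
size, allocate, and then call again, comparing the returned size against the
buffer size in case the value grew in between.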

[christian.brauner@ubuntu.com: split out syscall wire-up]
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---
 fs/Kconfig                  |   7 +
 fs/Makefile                 |   1 +
 fs/fsinfo.c                 | 582 +++++++++++++++++++++++++++++++++
 include/linux/fs.h          |   4 +
 include/linux/fsinfo.h      |  73 +++++
 include/uapi/linux/fsinfo.h | 186 +++++++++++
 kernel/sys_ni.c             |   1 +
 samples/vfs/Makefile        |   5 +
 samples/vfs/test-fsinfo.c   | 633 ++++++++++++++++++++++++++++++++++++
 9 files changed, 1492 insertions(+)
 create mode 100644 fs/fsinfo.c
 create mode 100644 include/linux/fsinfo.h
 create mode 100644 include/uapi/linux/fsinfo.h
 create mode 100644 samples/vfs/test-fsinfo.c

diff --git a/fs/Kconfig b/fs/Kconfig
index fef1365c23a5..01d0d436b3cd 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -15,6 +15,13 @@ config VALIDATE_FS_PARSER
 	  Enable this to perform validation of the parameter description for a
 	  filesystem when it is registered.
 
+config FSINFO
+	bool "Enable the fsinfo() system call"
+	help
+	  Enable the file system information querying system call to allow
+	  comprehensive information to be retrieved about a filesystem,
+	  superblock or mount object.
+
 if BLOCK
 
 config FS_IOMAP
diff --git a/fs/Makefile b/fs/Makefile
index 4477757780d0..b6bf2424c7f7 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -55,6 +55,7 @@ obj-$(CONFIG_COREDUMP)		+= coredump.o
 obj-$(CONFIG_SYSCTL)		+= drop_caches.o
 
 obj-$(CONFIG_FHANDLE)		+= fhandle.o
+obj-$(CONFIG_FSINFO)		+= fsinfo.o
 obj-y				+= iomap/
 
 obj-y				+= quota/
diff --git a/fs/fsinfo.c b/fs/fsinfo.c
new file mode 100644
index 000000000000..b7b81e9d7e21
--- /dev/null
+++ b/fs/fsinfo.c
@@ -0,0 +1,582 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Filesystem information query.
+ *
+ * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+#include <linux/syscalls.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/statfs.h>
+#include <linux/security.h>
+#include <linux/uaccess.h>
+#include <linux/fsinfo.h>
+#include <uapi/linux/mount.h>
+#include "internal.h"
+
+/**
+ * fsinfo_string - Store a NUL-terminated string as an fsinfo attribute value.
+ * @s: The string to store (may be NULL)
+ * @ctx: The parameter context
+ */
+int fsinfo_string(const char *s, struct fsinfo_context *ctx)
+{
+	unsigned int len;
+	char *p = ctx->buffer;
+	int ret = 0;
+
+	if (s) {
+		len = min_t(size_t, strlen(s), ctx->buf_size - 1);
+		if (!ctx->want_size_only) {
+			memcpy(p, s, len);
+			p[len] = 0;
+		}
+		ret = len;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(fsinfo_string);
+
+/*
+ * Get basic filesystem stats from statfs.
+ */
+static int fsinfo_generic_statfs(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_statfs *p = ctx->buffer;
+	struct kstatfs buf;
+	int ret;
+
+	ret = vfs_statfs(path, &buf);
+	if (ret < 0)
+		return ret;
+
+	p->f_blocks.lo	= buf.f_blocks;
+	p->f_bfree.lo	= buf.f_bfree;
+	p->f_bavail.lo	= buf.f_bavail;
+	p->f_files.lo	= buf.f_files;
+	p->f_ffree.lo	= buf.f_ffree;
+	p->f_favail.lo	= buf.f_ffree;
+	p->f_bsize	= buf.f_bsize;
+	p->f_frsize	= buf.f_frsize;
+	return sizeof(*p);
+}
+
+static int fsinfo_generic_ids(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_ids *p = ctx->buffer;
+	struct super_block *sb;
+	struct kstatfs buf;
+	int ret;
+
+	ret = vfs_statfs(path, &buf);
+	if (ret < 0 && ret != -ENOSYS)
+		return ret;
+	if (ret == 0)
+		memcpy(&p->f_fsid, &buf.f_fsid, sizeof(p->f_fsid));
+
+	sb = path->dentry->d_sb;
+	p->f_fstype	= sb->s_magic;
+	p->f_dev_major	= MAJOR(sb->s_dev);
+	p->f_dev_minor	= MINOR(sb->s_dev);
+	p->f_sb_id	= sb->s_unique_id;
+	strlcpy(p->f_fs_name, sb->s_type->name, sizeof(p->f_fs_name));
+	return sizeof(*p);
+}
+
+int fsinfo_generic_limits(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_limits *p = ctx->buffer;
+	struct super_block *sb = path->dentry->d_sb;
+
+	p->max_file_size.hi	= 0;
+	p->max_file_size.lo	= sb->s_maxbytes;
+	p->max_ino.hi		= 0;
+	p->max_ino.lo		= UINT_MAX;
+	p->max_hard_links	= sb->s_max_links;
+	p->max_uid		= UINT_MAX;
+	p->max_gid		= UINT_MAX;
+	p->max_projid		= UINT_MAX;
+	p->max_filename_len	= NAME_MAX;
+	p->max_symlink_len	= PATH_MAX;
+	p->max_xattr_name_len	= XATTR_NAME_MAX;
+	p->max_xattr_body_len	= XATTR_SIZE_MAX;
+	p->max_dev_major	= 0xffffff;
+	p->max_dev_minor	= 0xff;
+	return sizeof(*p);
+}
+EXPORT_SYMBOL(fsinfo_generic_limits);
+
+int fsinfo_generic_supports(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_supports *p = ctx->buffer;
+	struct super_block *sb = path->dentry->d_sb;
+
+	p->stx_mask = STATX_BASIC_STATS;
+	if (sb->s_d_op && sb->s_d_op->d_automount)
+		p->stx_attributes |= STATX_ATTR_AUTOMOUNT;
+	return sizeof(*p);
+}
+EXPORT_SYMBOL(fsinfo_generic_supports);
+
+static const struct fsinfo_timestamp_info fsinfo_default_timestamp_info = {
+	.atime = {
+		.minimum	= S64_MIN,
+		.maximum	= S64_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+	.mtime = {
+		.minimum	= S64_MIN,
+		.maximum	= S64_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+	.ctime = {
+		.minimum	= S64_MIN,
+		.maximum	= S64_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+	.btime = {
+		.minimum	= S64_MIN,
+		.maximum	= S64_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+};
+
+int fsinfo_generic_timestamp_info(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_timestamp_info *p = ctx->buffer;
+	struct super_block *sb = path->dentry->d_sb;
+	s8 exponent;
+
+	*p = fsinfo_default_timestamp_info;
+
+	if (sb->s_time_gran < 1000000000) {
+		if (sb->s_time_gran < 1000)
+			exponent = -9;
+		else if (sb->s_time_gran < 1000000)
+			exponent = -6;
+		else
+			exponent = -3;
+
+		p->atime.gran_exponent = exponent;
+		p->mtime.gran_exponent = exponent;
+		p->ctime.gran_exponent = exponent;
+		p->btime.gran_exponent = exponent;
+	}
+
+	return sizeof(*p);
+}
+EXPORT_SYMBOL(fsinfo_generic_timestamp_info);
+
+static int fsinfo_generic_volume_uuid(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_volume_uuid *p = ctx->buffer;
+	struct super_block *sb = path->dentry->d_sb;
+
+	memcpy(p, &sb->s_uuid, sizeof(*p));
+	return sizeof(*p);
+}
+
+static int fsinfo_generic_volume_id(struct path *path, struct fsinfo_context *ctx)
+{
+	return fsinfo_string(path->dentry->d_sb->s_id, ctx);
+}
+
+static const struct fsinfo_attribute fsinfo_common_attributes[] = {
+	FSINFO_VSTRUCT	(FSINFO_ATTR_STATFS,		fsinfo_generic_statfs),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_IDS,		fsinfo_generic_ids),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_LIMITS,		fsinfo_generic_limits),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_SUPPORTS,		fsinfo_generic_supports),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_TIMESTAMP_INFO,	fsinfo_generic_timestamp_info),
+	FSINFO_STRING	(FSINFO_ATTR_VOLUME_ID,		fsinfo_generic_volume_id),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_VOLUME_UUID,	fsinfo_generic_volume_uuid),
+
+	FSINFO_LIST	(FSINFO_ATTR_FSINFO_ATTRIBUTES,	(void *)123UL),
+	FSINFO_VSTRUCT_N(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO, (void *)123UL),
+	{}
+};
+
+/*
+ * Determine an attribute's minimum buffer size and, if the buffer is large
+ * enough, get the attribute value.
+ */
+static int fsinfo_get_this_attribute(struct path *path,
+				     struct fsinfo_context *ctx,
+				     const struct fsinfo_attribute *attr)
+{
+	int buf_size;
+
+	if (ctx->Nth != 0 && !(attr->flags & (FSINFO_FLAGS_N | FSINFO_FLAGS_NM)))
+		return -ENODATA;
+	if (ctx->Mth != 0 && !(attr->flags & FSINFO_FLAGS_NM))
+		return -ENODATA;
+
+	switch (attr->type) {
+	case FSINFO_TYPE_VSTRUCT:
+		ctx->clear_tail = true;
+		buf_size = attr->size;
+		break;
+	case FSINFO_TYPE_STRING:
+	case FSINFO_TYPE_OPAQUE:
+	case FSINFO_TYPE_LIST:
+		buf_size = 4096;
+		break;
+	default:
+		return -ENOPKG;
+	}
+
+	if (ctx->buf_size < buf_size)
+		return buf_size;
+
+	return attr->get(path, ctx);
+}
+
+static void fsinfo_attributes_insert(struct fsinfo_context *ctx,
+				     const struct fsinfo_attribute *attr)
+{
+	__u32 *p = ctx->buffer;
+	unsigned int i;
+
+	if (ctx->usage >= ctx->buf_size ||
+	    ctx->buf_size - ctx->usage < sizeof(__u32)) {
+		ctx->usage += sizeof(__u32);
+		return;
+	}
+
+	for (i = 0; i < ctx->usage / sizeof(__u32); i++)
+		if (p[i] == attr->attr_id)
+			return;
+
+	p[i] = attr->attr_id;
+	ctx->usage += sizeof(__u32);
+}
+
+static int fsinfo_list_attributes(struct path *path,
+				  struct fsinfo_context *ctx,
+				  const struct fsinfo_attribute *attributes)
+{
+	const struct fsinfo_attribute *a;
+
+	for (a = attributes; a->get; a++)
+		fsinfo_attributes_insert(ctx, a);
+	return -EOPNOTSUPP; /* We want to go through all the lists */
+}
+
+static int fsinfo_get_attribute_info(struct path *path,
+				     struct fsinfo_context *ctx,
+				     const struct fsinfo_attribute *attributes)
+{
+	const struct fsinfo_attribute *a;
+	struct fsinfo_attribute_info *p = ctx->buffer;
+
+	if (!ctx->buf_size)
+		return sizeof(*p);
+
+	for (a = attributes; a->get; a++) {
+		if (a->attr_id == ctx->Nth) {
+			p->attr_id	= a->attr_id;
+			p->type		= a->type;
+			p->flags	= a->flags;
+			p->size		= a->size;
+			p->size		= a->size;
+			return sizeof(*p);
+		}
+	}
+	return -EOPNOTSUPP; /* We want to go through all the lists */
+}
+
+/**
+ * fsinfo_get_attribute - Look up and handle an attribute
+ * @path: The object to query
+ * @ctx: Parameters to define a request and place to store result
+ * @attributes: List of attributes to search.
+ *
+ * Look through a list of attributes for one that matches the requested
+ * attribute then call the handler for it.
+ */
+int fsinfo_get_attribute(struct path *path, struct fsinfo_context *ctx,
+			 const struct fsinfo_attribute *attributes)
+{
+	const struct fsinfo_attribute *a;
+
+	switch (ctx->requested_attr) {
+	case FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO:
+		return fsinfo_get_attribute_info(path, ctx, attributes);
+	case FSINFO_ATTR_FSINFO_ATTRIBUTES:
+		return fsinfo_list_attributes(path, ctx, attributes);
+	default:
+		for (a = attributes; a->get; a++)
+			if (a->attr_id == ctx->requested_attr)
+				return fsinfo_get_this_attribute(path, ctx, a);
+		return -EOPNOTSUPP;
+	}
+}
+EXPORT_SYMBOL(fsinfo_get_attribute);
+
+/**
+ * fsinfo_call - Handle an fsinfo attribute generically
+ * @path: The object to query
+ * @ctx: Parameters to define a request and place to store result
+ */
+static int fsinfo_call(struct path *path, struct fsinfo_context *ctx)
+{
+	int ret;
+
+	if (path->dentry->d_sb->s_op->fsinfo) {
+		ret = path->dentry->d_sb->s_op->fsinfo(path, ctx);
+		if (ret != -EOPNOTSUPP)
+			return ret;
+	}
+	ret = fsinfo_get_attribute(path, ctx, fsinfo_common_attributes);
+	if (ret != -EOPNOTSUPP)
+		return ret;
+
+	switch (ctx->requested_attr) {
+	case FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO:
+		return -ENODATA;
+	case FSINFO_ATTR_FSINFO_ATTRIBUTES:
+		return ctx->usage;
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+/**
+ * vfs_fsinfo - Retrieve filesystem information
+ * @path: The object to query
+ * @ctx: Parameters to define a request and place to store result
+ *
+ * Get an attribute on a filesystem or an object within a filesystem.  The
+ * filesystem attribute to be queried is indicated by @ctx->requested_attr, and
+ * if it's a multi-valued attribute, the particular value is selected by
+ * @ctx->Nth and then @ctx->Mth.
+ *
+ * For common attributes, a value may be fabricated if it is not supported by
+ * the filesystem.
+ *
+ * On success, the size of the attribute's value is returned (0 is a valid
+ * size).  A buffer will have been allocated and will be pointed to by
+ * @ctx->buffer.  The caller must free this with kvfree().
+ *
+ * Errors can also be returned: -ENOMEM if a buffer cannot be allocated, -EPERM
+ * or -EACCES if permission is denied by the LSM, -EOPNOTSUPP if an attribute
+ * doesn't exist for the specified object or -ENODATA if the attribute exists,
+ * but the Nth,Mth value does not exist.  -EMSGSIZE indicates that the value is
+ * unmanageable internally and -ENOPKG indicates other internal failure.
+ *
+ * Errors such as -EIO may also come from attempts to access media or servers
+ * to obtain the requested information if it's not immediately to hand.
+ *
+ * [*] Note that the caller may set @ctx->want_size_only if it only wants the
+ *     size of the value and not the data.  If this is set, a buffer may not be
+ *     allocated under some circumstances.  This is intended for size query by
+ *     userspace.
+ *
+ * [*] Note that @ctx->clear_tail will be returned set if the data should be
+ *     padded out with zeros when writing it to userspace.
+ */
+static int vfs_fsinfo(struct path *path, struct fsinfo_context *ctx)
+{
+	struct dentry *dentry = path->dentry;
+	int ret;
+
+	ret = security_sb_statfs(dentry);
+	if (ret)
+		return ret;
+
+	/* Call the handler to find out the buffer size required. */
+	ctx->buf_size = 0;
+	ret = fsinfo_call(path, ctx);
+	if (ret < 0 || ctx->want_size_only)
+		return ret;
+	ctx->buf_size = ret;
+
+	do {
+		/* Allocate a buffer of the requested size. */
+		if (ctx->buf_size > INT_MAX)
+			return -EMSGSIZE;
+		ctx->buffer = kvzalloc(ctx->buf_size, GFP_KERNEL);
+		if (!ctx->buffer)
+			return -ENOMEM;
+
+		ctx->usage = 0;
+		ctx->skip = 0;
+		ret = fsinfo_call(path, ctx);
+		if (IS_ERR_VALUE((long)ret))
+			return ret;
+		if ((unsigned int)ret <= ctx->buf_size)
+			return ret; /* It fitted */
+
+		/* We need to resize the buffer */
+		ctx->buf_size = roundup(ret, PAGE_SIZE);
+		kvfree(ctx->buffer);
+		ctx->buffer = NULL;
+	} while (!signal_pending(current));
+
+	return -ERESTARTSYS;
+}
+
+static int vfs_fsinfo_path(int dfd, const char __user *pathname,
+			   unsigned int resolve_flags, struct fsinfo_context *ctx)
+{
+	struct path path;
+	unsigned lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
+	int ret = -EINVAL;
+
+	if (resolve_flags & ~VALID_RESOLVE_FLAGS)
+		return -EINVAL;
+
+	if (resolve_flags & RESOLVE_NO_XDEV)
+		lookup_flags |= LOOKUP_NO_XDEV;
+	if (resolve_flags & RESOLVE_NO_MAGICLINKS)
+		lookup_flags |= LOOKUP_NO_MAGICLINKS;
+	if (resolve_flags & RESOLVE_NO_SYMLINKS)
+		lookup_flags |= LOOKUP_NO_SYMLINKS;
+	if (resolve_flags & RESOLVE_BENEATH)
+		lookup_flags |= LOOKUP_BENEATH;
+	if (resolve_flags & RESOLVE_IN_ROOT)
+		lookup_flags |= LOOKUP_IN_ROOT;
+	if (resolve_flags & RESOLVE_NO_TRAILING_SYMLINKS)
+		lookup_flags &= ~LOOKUP_FOLLOW;
+	if (resolve_flags & RESOLVE_NO_TRAILING_AUTOMOUNTS)
+		lookup_flags &= ~LOOKUP_AUTOMOUNT;
+	if (resolve_flags & RESOLVE_EMPTY_PATH)
+		lookup_flags |= LOOKUP_EMPTY;
+
+retry:
+	ret = user_path_at(dfd, pathname, lookup_flags, &path);
+	if (ret)
+		goto out;
+
+	ret = vfs_fsinfo(&path, ctx);
+	path_put(&path);
+	if (retry_estale(ret, lookup_flags)) {
+		lookup_flags |= LOOKUP_REVAL;
+		goto retry;
+	}
+out:
+	return ret;
+}
+
+static int vfs_fsinfo_fd(unsigned int fd, struct fsinfo_context *ctx)
+{
+	struct fd f = fdget_raw(fd);
+	int ret = -EBADF;
+
+	if (f.file) {
+		ret = vfs_fsinfo(&f.file->f_path, ctx);
+		fdput(f);
+	}
+	return ret;
+}
+
+/**
+ * sys_fsinfo - System call to get filesystem information
+ * @dfd: Base directory to pathwalk from or fd referring to filesystem.
+ * @pathname: Filesystem to query or NULL.
+ * @params: Parameters to define request (NULL: FSINFO_ATTR_STATFS).
+ * @params_size: Size of parameter buffer.
+ * @result_buffer: Result buffer.
+ * @result_buf_size: Size of result buffer.
+ *
+ * Get information on a filesystem.  The filesystem attribute to be queried is
+ * indicated by @params->request, and some of the attributes can have multiple
+ * values, indexed by @params->Nth and @params->Mth.  If @params is NULL,
+ * then the 0th FSINFO_ATTR_STATFS attribute is queried.  If an attribute does
+ * not exist, EOPNOTSUPP is returned; if the Nth,Mth value does not exist,
+ * ENODATA is returned.
+ *
+ * On success, the size of the attribute's value is returned.  If
+ * @result_buf_size is 0 or @result_buffer is NULL, only the size is returned.
+ * If the size of the value is larger than @result_buf_size, it will be
+ * truncated by the copy.  If the size of the value is smaller than
+ * @result_buf_size then the excess buffer space will be cleared.  The full
+ * size of the value will be returned, irrespective of how much data is
+ * actually placed in the buffer.
+ */
+SYSCALL_DEFINE6(fsinfo,
+		int, dfd,
+		const char __user *, pathname,
+		const struct fsinfo_params __user *, params,
+		size_t, params_size,
+		void __user *, result_buffer,
+		size_t, result_buf_size)
+{
+	struct fsinfo_context ctx;
+	struct fsinfo_params user_params = {};
+	unsigned int result_size;
+	void *r;
+	int ret;
+
+	if ((!params &&  params_size) ||
+	    ( params && !params_size) ||
+	    (!result_buffer &&  result_buf_size) ||
+	    ( result_buffer && !result_buf_size))
+		return -EINVAL;
+	if (result_buf_size > UINT_MAX)
+		return -EOVERFLOW;
+
+	memset(&ctx, 0, sizeof(ctx));
+	ctx.requested_attr	= FSINFO_ATTR_STATFS;
+	ctx.flags		= FSINFO_FLAGS_QUERY_PATH;
+	ctx.want_size_only	= (result_buf_size == 0);
+
+	if (params) {
+		ret = copy_struct_from_user(&user_params, sizeof(user_params),
+					    params, params_size);
+		if (ret < 0)
+			return ret;
+		if (user_params.flags & ~FSINFO_FLAGS_QUERY_MASK)
+			return -EINVAL;
+		ctx.flags = user_params.flags;
+		ctx.requested_attr = user_params.request;
+		ctx.Nth = user_params.Nth;
+		ctx.Mth = user_params.Mth;
+	}
+
+	switch (ctx.flags & FSINFO_FLAGS_QUERY_MASK) {
+	case FSINFO_FLAGS_QUERY_PATH:
+		ret = vfs_fsinfo_path(dfd, pathname, user_params.resolve_flags, &ctx);
+		break;
+	case FSINFO_FLAGS_QUERY_FD:
+		if (pathname)
+			return -EINVAL;
+		ret = vfs_fsinfo_fd(dfd, &ctx);
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	if (ret < 0)
+		goto error;
+
+	r = ctx.buffer + ctx.skip;
+	result_size = min_t(size_t, ret, result_buf_size);
+	if (result_size > 0 &&
+	    copy_to_user(result_buffer, r, result_size) != 0) {
+		ret = -EFAULT;
+		goto error;
+	}
+
+	/* Clear any part of the buffer that we won't fill if we're putting a
+	 * struct in there.  Strings, opaque objects and arrays are expected to
+	 * be variable length.
+	 */
+	if (ctx.clear_tail &&
+	    result_buf_size > result_size &&
+	    clear_user(result_buffer + result_size,
+		       result_buf_size - result_size) != 0) {
+		ret = -EFAULT;
+		goto error;
+	}
+
+error:
+	kvfree(ctx.buffer);
+	return ret;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9181cfcd5265..39178f89a6ad 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -69,6 +69,7 @@ struct fsverity_info;
 struct fsverity_operations;
 struct fs_context;
 struct fs_parameter_spec;
+struct fsinfo_context;
 
 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -1964,6 +1965,9 @@ struct super_operations {
 	int (*thaw_super) (struct super_block *);
 	int (*unfreeze_fs) (struct super_block *);
 	int (*statfs) (struct dentry *, struct kstatfs *);
+#ifdef CONFIG_FSINFO
+	int (*fsinfo)(struct path *, struct fsinfo_context *);
+#endif
 	int (*remount_fs) (struct super_block *, int *, char *);
 	void (*umount_begin) (struct super_block *);
 
diff --git a/include/linux/fsinfo.h b/include/linux/fsinfo.h
new file mode 100644
index 000000000000..bf806669b4fb
--- /dev/null
+++ b/include/linux/fsinfo.h
@@ -0,0 +1,73 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Filesystem information query
+ *
+ * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#ifndef _LINUX_FSINFO_H
+#define _LINUX_FSINFO_H
+
+#ifdef CONFIG_FSINFO
+
+#include <uapi/linux/fsinfo.h>
+
+struct path;
+
+#define FSINFO_NORMAL_ATTR_MAX_SIZE 4096
+
+struct fsinfo_context {
+	__u32		flags;		/* [in] FSINFO_FLAGS_* */
+	__u32		requested_attr;	/* [in] What is being asked for */
+	__u32		Nth;		/* [in] Instance of it (some may have multiple) */
+	__u32		Mth;		/* [in] Subinstance */
+	bool		want_size_only;	/* [in] Just want to know the size, not the data */
+	bool		clear_tail;	/* [out] T if tail of buffer should be cleared */
+	unsigned int	skip;		/* [out] Number of bytes to skip in buffer */
+	unsigned int	usage;		/* [tmp] Amount of buffer used (if large) */
+	unsigned int	buf_size;	/* [tmp] Size of ->buffer[] */
+	void		*buffer;	/* [out] The reply buffer */
+};
+
+/*
+ * A filesystem information attribute definition.
+ */
+struct fsinfo_attribute {
+	unsigned int		attr_id;	/* The ID of the attribute */
+	enum fsinfo_value_type	type:8;		/* The type of the attribute's value(s) */
+	unsigned int		flags:8;
+	unsigned int		size:16;	/* - Value size (FSINFO_TYPE_VSTRUCT/FSINFO_TYPE_LIST) */
+	int (*get)(struct path *path, struct fsinfo_context *params);
+};
+
+#define __FSINFO(A, T, S, G, F) \
+	{ .attr_id = A, .type = T, .flags = F, .size = S, .get = G }
+
+#define _FSINFO(A, T, S, G)	__FSINFO(A, T, S, G, 0)
+#define _FSINFO_N(A, T, S, G)	__FSINFO(A, T, S, G, FSINFO_FLAGS_N)
+#define _FSINFO_NM(A, T, S, G)	__FSINFO(A, T, S, G, FSINFO_FLAGS_NM)
+
+#define _FSINFO_VSTRUCT(A,S,G)	  _FSINFO   (A, FSINFO_TYPE_VSTRUCT, sizeof(S), G)
+#define _FSINFO_VSTRUCT_N(A,S,G)  _FSINFO_N (A, FSINFO_TYPE_VSTRUCT, sizeof(S), G)
+#define _FSINFO_VSTRUCT_NM(A,S,G) _FSINFO_NM(A, FSINFO_TYPE_VSTRUCT, sizeof(S), G)
+
+#define FSINFO_VSTRUCT(A,G)	_FSINFO_VSTRUCT   (A, A##__STRUCT, G)
+#define FSINFO_VSTRUCT_N(A,G)	_FSINFO_VSTRUCT_N (A, A##__STRUCT, G)
+#define FSINFO_VSTRUCT_NM(A,G)	_FSINFO_VSTRUCT_NM(A, A##__STRUCT, G)
+#define FSINFO_STRING(A,G)	_FSINFO   (A, FSINFO_TYPE_STRING, 0, G)
+#define FSINFO_STRING_N(A,G)	_FSINFO_N (A, FSINFO_TYPE_STRING, 0, G)
+#define FSINFO_STRING_NM(A,G)	_FSINFO_NM(A, FSINFO_TYPE_STRING, 0, G)
+#define FSINFO_OPAQUE(A,G)	_FSINFO   (A, FSINFO_TYPE_OPAQUE, 0, G)
+#define FSINFO_LIST(A,G)	_FSINFO   (A, FSINFO_TYPE_LIST, sizeof(A##__STRUCT), G)
+#define FSINFO_LIST_N(A,G)	_FSINFO_N (A, FSINFO_TYPE_LIST, sizeof(A##__STRUCT), G)
+
+extern int fsinfo_string(const char *, struct fsinfo_context *);
+extern int fsinfo_generic_timestamp_info(struct path *, struct fsinfo_context *);
+extern int fsinfo_generic_supports(struct path *, struct fsinfo_context *);
+extern int fsinfo_generic_limits(struct path *, struct fsinfo_context *);
+extern int fsinfo_get_attribute(struct path *, struct fsinfo_context *,
+				const struct fsinfo_attribute *);
+
+#endif /* CONFIG_FSINFO */
+
+#endif /* _LINUX_FSINFO_H */
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
new file mode 100644
index 000000000000..b56ebd525b03
--- /dev/null
+++ b/include/uapi/linux/fsinfo.h
@@ -0,0 +1,186 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* fsinfo() definitions.
+ *
+ * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+#ifndef _UAPI_LINUX_FSINFO_H
+#define _UAPI_LINUX_FSINFO_H
+
+#include <linux/types.h>
+#include <linux/socket.h>
+#include <linux/openat2.h>
+
+/*
+ * The filesystem attributes that can be requested.  Note that some attributes
+ * may have multiple instances which can be switched in the parameter block.
+ */
+#define FSINFO_ATTR_STATFS		0x00	/* statfs()-style state */
+#define FSINFO_ATTR_IDS			0x01	/* Filesystem IDs */
+#define FSINFO_ATTR_LIMITS		0x02	/* Filesystem limits */
+#define FSINFO_ATTR_SUPPORTS		0x03	/* What's supported in statx, iocflags, ... */
+#define FSINFO_ATTR_TIMESTAMP_INFO	0x04	/* Inode timestamp info */
+#define FSINFO_ATTR_VOLUME_ID		0x05	/* Volume ID (string) */
+#define FSINFO_ATTR_VOLUME_UUID		0x06	/* Volume UUID (LE uuid) */
+#define FSINFO_ATTR_VOLUME_NAME		0x07	/* Volume name (string) */
+
+#define FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO 0x100	/* Information about attr N (for path) */
+#define FSINFO_ATTR_FSINFO_ATTRIBUTES	0x101	/* List of supported attrs (for path) */
+
+/*
+ * Optional fsinfo() parameter structure.
+ *
+ * If this is not given, it is assumed that FSINFO_ATTR_STATFS instance 0,0 is
+ * desired.
+ */
+struct fsinfo_params {
+	__u32	flags;		/* Flags controlling fsinfo() specifically */
+#define FSINFO_FLAGS_QUERY_MASK	0x0007 /* What object should fsinfo() query? */
+#define FSINFO_FLAGS_QUERY_PATH	0x0000 /* - path, specified by dirfd,pathname,RESOLVE_EMPTY_PATH */
+#define FSINFO_FLAGS_QUERY_FD	0x0001 /* - fd specified by dirfd */
+	__u32	resolve_flags;	/* RESOLVE_* flags */
+	__u32	request;	/* ID of requested attribute */
+	__u32	Nth;		/* Instance of it (some may have multiple) */
+	__u32	Mth;		/* Subinstance of Nth instance */
+};
+
+enum fsinfo_value_type {
+	FSINFO_TYPE_VSTRUCT	= 0,	/* Version-lengthed struct (up to 4096 bytes) */
+	FSINFO_TYPE_STRING	= 1,	/* NUL-term var-length string (up to 4095 chars) */
+	FSINFO_TYPE_OPAQUE	= 2,	/* Opaque blob (unlimited size) */
+	FSINFO_TYPE_LIST	= 3,	/* List of ints/structs (unlimited size) */
+};
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO).
+ *
+ * This gives information about the attributes supported by fsinfo for the
+ * given path.
+ */
+struct fsinfo_attribute_info {
+	unsigned int		attr_id;	/* The ID of the attribute */
+	enum fsinfo_value_type	type;		/* The type of the attribute's value(s) */
+	unsigned int		flags;
+#define FSINFO_FLAGS_N		0x01		/* - Attr has a set of values */
+#define FSINFO_FLAGS_NM		0x02		/* - Attr has a set of sets of values */
+	unsigned int		size;		/* - Value size (FSINFO_TYPE_VSTRUCT/FSINFO_TYPE_LIST) */
+};
+
+#define FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO__STRUCT struct fsinfo_attribute_info
+#define FSINFO_ATTR_FSINFO_ATTRIBUTES__STRUCT __u32
+
+struct fsinfo_u128 {
+#if defined(__BYTE_ORDER) ? __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
+	__u64	hi;
+	__u64	lo;
+#elif defined(__BYTE_ORDER) ? __BYTE_ORDER == __LITTLE_ENDIAN : defined(__LITTLE_ENDIAN)
+	__u64	lo;
+	__u64	hi;
+#endif
+};
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_STATFS).
+ * - This gives extended filesystem information.
+ */
+struct fsinfo_statfs {
+	struct fsinfo_u128 f_blocks;	/* Total number of blocks in fs */
+	struct fsinfo_u128 f_bfree;	/* Total number of free blocks */
+	struct fsinfo_u128 f_bavail;	/* Number of free blocks available to ordinary user */
+	struct fsinfo_u128 f_files;	/* Total number of file nodes in fs */
+	struct fsinfo_u128 f_ffree;	/* Number of free file nodes */
+	struct fsinfo_u128 f_favail;	/* Number of file nodes available to ordinary user */
+	__u64	f_bsize;		/* Optimal block size */
+	__u64	f_frsize;		/* Fragment size */
+};
+
+#define FSINFO_ATTR_STATFS__STRUCT struct fsinfo_statfs
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_IDS).
+ *
+ * List of basic identifiers as is normally found in statfs().
+ */
+struct fsinfo_ids {
+	char	f_fs_name[15 + 1];	/* Filesystem name */
+	__u64	f_fsid;			/* Short 64-bit Filesystem ID (as statfs) */
+	__u64	f_sb_id;		/* Internal superblock ID for sbnotify()/mntnotify() */
+	__u32	f_fstype;		/* Filesystem type from linux/magic.h [uncond] */
+	__u32	f_dev_major;		/* As st_dev_* from struct statx [uncond] */
+	__u32	f_dev_minor;
+	__u32	__padding[1];
+};
+
+#define FSINFO_ATTR_IDS__STRUCT struct fsinfo_ids
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_LIMITS).
+ *
+ * List of supported filesystem limits.
+ */
+struct fsinfo_limits {
+	struct fsinfo_u128 max_file_size;	/* Maximum file size */
+	struct fsinfo_u128 max_ino;		/* Maximum inode number */
+	__u64	max_uid;			/* Maximum UID supported */
+	__u64	max_gid;			/* Maximum GID supported */
+	__u64	max_projid;			/* Maximum project ID supported */
+	__u64	max_hard_links;			/* Maximum number of hard links on a file */
+	__u64	max_xattr_body_len;		/* Maximum xattr content length */
+	__u32	max_xattr_name_len;		/* Maximum xattr name length */
+	__u32	max_filename_len;		/* Maximum filename length */
+	__u32	max_symlink_len;		/* Maximum symlink content length */
+	__u32	max_dev_major;			/* Maximum device major representable */
+	__u32	max_dev_minor;			/* Maximum device minor representable */
+	__u32	__padding[1];
+};
+
+#define FSINFO_ATTR_LIMITS__STRUCT struct fsinfo_limits
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_SUPPORTS).
+ *
+ * What's supported in various masks, such as statx() attribute and mask bits
+ * and IOC flags.
+ */
+struct fsinfo_supports {
+	__u64	stx_attributes;		/* What statx::stx_attributes are supported */
+	__u32	stx_mask;		/* What statx::stx_mask bits are supported */
+	__u32	fs_ioc_getflags;	/* What FS_IOC_GETFLAGS may return */
+	__u32	fs_ioc_setflags_set;	/* What FS_IOC_SETFLAGS may set */
+	__u32	fs_ioc_setflags_clear;	/* What FS_IOC_SETFLAGS may clear */
+	__u32	win_file_attrs;		/* What DOS/Windows FILE_* attributes are supported */
+	__u32	__padding[1];
+};
+
+#define FSINFO_ATTR_SUPPORTS__STRUCT struct fsinfo_supports
+
+struct fsinfo_timestamp_one {
+	__s64	minimum;	/* Minimum timestamp value in seconds */
+	__s64	maximum;	/* Maximum timestamp value in seconds */
+	__u16	gran_mantissa;	/* Granularity(secs) = mant * 10^exp */
+	__s8	gran_exponent;
+	__u8	__padding[5];
+};
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_TIMESTAMP_INFO).
+ */
+struct fsinfo_timestamp_info {
+	struct fsinfo_timestamp_one	atime;	/* Access time */
+	struct fsinfo_timestamp_one	mtime;	/* Modification time */
+	struct fsinfo_timestamp_one	ctime;	/* Change time */
+	struct fsinfo_timestamp_one	btime;	/* Birth/creation time */
+};
+
+#define FSINFO_ATTR_TIMESTAMP_INFO__STRUCT struct fsinfo_timestamp_info
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_VOLUME_UUID).
+ */
+struct fsinfo_volume_uuid {
+	__u8	uuid[16];
+};
+
+#define FSINFO_ATTR_VOLUME_UUID__STRUCT struct fsinfo_volume_uuid
+
+#endif /* _UAPI_LINUX_FSINFO_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 0ce01f86e5db..519317f3904c 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -51,6 +51,7 @@ COND_SYSCALL_COMPAT(io_pgetevents);
 COND_SYSCALL(io_uring_setup);
 COND_SYSCALL(io_uring_enter);
 COND_SYSCALL(io_uring_register);
+COND_SYSCALL(fsinfo);
 
 /* fs/xattr.c */
 
diff --git a/samples/vfs/Makefile b/samples/vfs/Makefile
index 65acdde5c117..9159ad1d7fc5 100644
--- a/samples/vfs/Makefile
+++ b/samples/vfs/Makefile
@@ -1,10 +1,15 @@
 # SPDX-License-Identifier: GPL-2.0-only
 # List of programs to build
+
 hostprogs := \
+	test-fsinfo \
 	test-fsmount \
 	test-statx
 
 always-y := $(hostprogs)
 
+HOSTCFLAGS_test-fsinfo.o += -I$(objtree)/usr/include
+HOSTLDLIBS_test-fsinfo += -static -lm
+
 HOSTCFLAGS_test-fsmount.o += -I$(objtree)/usr/include
 HOSTCFLAGS_test-statx.o += -I$(objtree)/usr/include
diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
new file mode 100644
index 000000000000..67aebf9fc9d8
--- /dev/null
+++ b/samples/vfs/test-fsinfo.c
@@ -0,0 +1,633 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Test the fsinfo() system call
+ *
+ * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#define _GNU_SOURCE
+#define _ATFILE_SOURCE
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <unistd.h>
+#include <ctype.h>
+#include <errno.h>
+#include <time.h>
+#include <math.h>
+#include <fcntl.h>
+#include <sys/syscall.h>
+#include <linux/fsinfo.h>
+#include <linux/socket.h>
+#include <sys/stat.h>
+#include <arpa/inet.h>
+
+#ifndef __NR_fsinfo
+#define __NR_fsinfo -1
+#endif
+
+static bool debug = 0;
+static bool list_last;
+
+static __attribute__((unused))
+ssize_t fsinfo(int dfd, const char *filename,
+	       struct fsinfo_params *params, size_t params_size,
+	       void *result_buffer, size_t result_buf_size)
+{
+	return syscall(__NR_fsinfo, dfd, filename,
+		       params, params_size,
+		       result_buffer, result_buf_size);
+}
+
+struct fsinfo_attribute {
+	unsigned int		attr_id;
+	enum fsinfo_value_type	type;
+	unsigned int		size;
+	const char		*name;
+	void (*dump)(void *reply, unsigned int size);
+};
+
+static const struct fsinfo_attribute fsinfo_attributes[];
+
+static ssize_t get_fsinfo(const char *, const char *, struct fsinfo_params *, void **);
+
+static void dump_hex(unsigned int *data, int from, int to)
+{
+	unsigned offset, print_offset = 1, col = 0;
+
+	from /= 4;
+	to = (to + 3) / 4;
+
+	for (offset = from; offset < to; offset++) {
+		if (print_offset) {
+			printf("%04x: ", offset * 4);
+			print_offset = 0;
+		}
+		printf("%08x", data[offset]);
+		col++;
+		if ((col & 3) == 0) {
+			printf("\n");
+			print_offset = 1;
+		} else {
+			printf(" ");
+		}
+	}
+
+	if (!print_offset)
+		printf("\n");
+}
+
+static void dump_attribute_info(void *reply, unsigned int size)
+{
+	struct fsinfo_attribute_info *attr_info = reply;
+	const struct fsinfo_attribute *attr;
+	char type[32], val_size[32];
+
+	switch (attr_info->type) {
+	case FSINFO_TYPE_VSTRUCT:	strcpy(type, "V-STRUCT");	break;
+	case FSINFO_TYPE_STRING:	strcpy(type, "STRING");		break;
+	case FSINFO_TYPE_OPAQUE:	strcpy(type, "OPAQUE");		break;
+	case FSINFO_TYPE_LIST:		strcpy(type, "LIST");		break;
+	default:
+		sprintf(type, "type-%x", attr_info->type);
+		break;
+	}
+
+	if (attr_info->flags & FSINFO_FLAGS_N)
+		strcat(type, " x N");
+	else if (attr_info->flags & FSINFO_FLAGS_NM)
+		strcat(type, " x NM");
+
+	for (attr = fsinfo_attributes; attr->name; attr++)
+		if (attr->attr_id == attr_info->attr_id)
+			break;
+
+	if (attr_info->size)
+		sprintf(val_size, "%u", attr_info->size);
+	else
+		strcpy(val_size, "-");
+
+	printf("%8x %-12s %08x %5s %s\n",
+	       attr_info->attr_id,
+	       type,
+	       attr_info->flags,
+	       val_size,
+	       attr->name ? attr->name : "");
+}
+
+static void dump_fsinfo_generic_statfs(void *reply, unsigned int size)
+{
+	struct fsinfo_statfs *f = reply;
+
+	printf("\n");
+	printf("\tblocks       : n=%llu fr=%llu av=%llu\n",
+	       (unsigned long long)f->f_blocks.lo,
+	       (unsigned long long)f->f_bfree.lo,
+	       (unsigned long long)f->f_bavail.lo);
+
+	printf("\tfiles        : n=%llu fr=%llu av=%llu\n",
+	       (unsigned long long)f->f_files.lo,
+	       (unsigned long long)f->f_ffree.lo,
+	       (unsigned long long)f->f_favail.lo);
+	printf("\tbsize        : %llu\n", f->f_bsize);
+	printf("\tfrsize       : %llu\n", f->f_frsize);
+}
+
+static void dump_fsinfo_generic_ids(void *reply, unsigned int size)
+{
+	struct fsinfo_ids *f = reply;
+
+	printf("\n");
+	printf("\tdev          : %02x:%02x\n", f->f_dev_major, f->f_dev_minor);
+	printf("\tfs           : type=%x name=%s\n", f->f_fstype, f->f_fs_name);
+	printf("\tfsid         : %llx\n", (unsigned long long)f->f_fsid);
+	printf("\tsbid         : %llx\n", (unsigned long long)f->f_sb_id);
+}
+
+static void dump_fsinfo_generic_limits(void *reply, unsigned int size)
+{
+	struct fsinfo_limits *f = reply;
+
+	printf("\n");
+	printf("\tmax file size: %llx%016llx\n",
+	       (unsigned long long)f->max_file_size.hi,
+	       (unsigned long long)f->max_file_size.lo);
+	printf("\tmax ino      : %llx%016llx\n",
+	       (unsigned long long)f->max_ino.hi,
+	       (unsigned long long)f->max_ino.lo);
+	printf("\tmax ids      : u=%llx g=%llx p=%llx\n",
+	       (unsigned long long)f->max_uid,
+	       (unsigned long long)f->max_gid,
+	       (unsigned long long)f->max_projid);
+	printf("\tmax dev      : maj=%x min=%x\n",
+	       f->max_dev_major, f->max_dev_minor);
+	printf("\tmax links    : %llx\n",
+	       (unsigned long long)f->max_hard_links);
+	printf("\tmax xattr    : n=%x b=%llx\n",
+	       f->max_xattr_name_len,
+	       (unsigned long long)f->max_xattr_body_len);
+	printf("\tmax len      : file=%x sym=%x\n",
+	       f->max_filename_len, f->max_symlink_len);
+}
+
+static void dump_fsinfo_generic_supports(void *reply, unsigned int size)
+{
+	struct fsinfo_supports *f = reply;
+
+	printf("\n");
+	printf("\tstx_attr     : %llx\n", (unsigned long long)f->stx_attributes);
+	printf("\tstx_mask     : %x\n", f->stx_mask);
+	printf("\tfs_ioc_*flags: get=%x set=%x clr=%x\n",
+	       f->fs_ioc_getflags, f->fs_ioc_setflags_set, f->fs_ioc_setflags_clear);
+	printf("\twin_fattrs   : %x\n", f->win_file_attrs);
+}
+
+static void print_time(struct fsinfo_timestamp_one *t, char stamp)
+{
+	printf("\t%ctime       : gran=%gs range=%llx-%llx\n",
+	       stamp,
+	       t->gran_mantissa * pow(10., t->gran_exponent),
+	       (long long)t->minimum,
+	       (long long)t->maximum);
+}
+
+static void dump_fsinfo_generic_timestamp_info(void *reply, unsigned int size)
+{
+	struct fsinfo_timestamp_info *f = reply;
+
+	printf("\n");
+	print_time(&f->atime, 'a');
+	print_time(&f->mtime, 'm');
+	print_time(&f->ctime, 'c');
+	print_time(&f->btime, 'b');
+}
+
+static void dump_fsinfo_generic_volume_uuid(void *reply, unsigned int size)
+{
+	struct fsinfo_volume_uuid *f = reply;
+
+	printf("%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x"
+	       "-%02x%02x%02x%02x%02x%02x\n",
+	       f->uuid[ 0], f->uuid[ 1],
+	       f->uuid[ 2], f->uuid[ 3],
+	       f->uuid[ 4], f->uuid[ 5],
+	       f->uuid[ 6], f->uuid[ 7],
+	       f->uuid[ 8], f->uuid[ 9],
+	       f->uuid[10], f->uuid[11],
+	       f->uuid[12], f->uuid[13],
+	       f->uuid[14], f->uuid[15]);
+}
+
+static void dump_string(void *reply, unsigned int size)
+{
+	char *s = reply, *p;
+	bool nl = false, last_nl = false;
+
+	p = s;
+	if (size >= 4096) {
+		size = 4096;
+		p[4092] = '.';
+		p[4093] = '.';
+		p[4094] = '.';
+		p[4095] = 0;
+	} else {
+		p[size] = 0;
+	}
+
+	for (p = s; *p; p++) {
+		if (*p == '\n') {
+			last_nl = nl = true;
+			continue;
+		}
+		last_nl = false;
+		if (!isprint(*p) && *p != '\t')
+			*p = '?';
+	}
+
+	if (nl)
+		putchar('\n');
+	printf("%s", s);
+	if (!last_nl)
+		putchar('\n');
+}
+
+#define dump_fsinfo_meta_attribute_info		(void *)0x123
+#define dump_fsinfo_meta_attributes		(void *)0x123
+
+/*
+ * Attribute table construction macros, mirroring those in linux/fsinfo.h.
+ */
+#define __FSINFO(A, T, S, G, F, N)					\
+	{ .attr_id = A, .type = T, .size = S, .name = N, .dump = dump_##G }
+
+#define _FSINFO(A,T,S,G,N)	__FSINFO(A, T, S, G, 0, N)
+#define _FSINFO_N(A,T,S,G,N)	__FSINFO(A, T, S, G, FSINFO_FLAGS_N, N)
+#define _FSINFO_NM(A,T,S,G,N)	__FSINFO(A, T, S, G, FSINFO_FLAGS_NM, N)
+
+#define _FSINFO_VSTRUCT(A,S,G,N)    _FSINFO   (A, FSINFO_TYPE_VSTRUCT, sizeof(S), G, N)
+#define _FSINFO_VSTRUCT_N(A,S,G,N)  _FSINFO_N (A, FSINFO_TYPE_VSTRUCT, sizeof(S), G, N)
+#define _FSINFO_VSTRUCT_NM(A,S,G,N) _FSINFO_NM(A, FSINFO_TYPE_VSTRUCT, sizeof(S), G, N)
+
+#define FSINFO_VSTRUCT(A,G)	_FSINFO_VSTRUCT   (A, A##__STRUCT, G, #A)
+#define FSINFO_VSTRUCT_N(A,G)	_FSINFO_VSTRUCT_N (A, A##__STRUCT, G, #A)
+#define FSINFO_VSTRUCT_NM(A,G)	_FSINFO_VSTRUCT_NM(A, A##__STRUCT, G, #A)
+#define FSINFO_STRING(A,G)	_FSINFO   (A, FSINFO_TYPE_STRING, 0, G, #A)
+#define FSINFO_STRING_N(A,G)	_FSINFO_N (A, FSINFO_TYPE_STRING, 0, G, #A)
+#define FSINFO_STRING_NM(A,G)	_FSINFO_NM(A, FSINFO_TYPE_STRING, 0, G, #A)
+#define FSINFO_OPAQUE(A,G)	_FSINFO   (A, FSINFO_TYPE_OPAQUE, 0, G, #A)
+#define FSINFO_LIST(A,G)	_FSINFO   (A, FSINFO_TYPE_LIST, sizeof(A##__STRUCT), G, #A)
+#define FSINFO_LIST_N(A,G)	_FSINFO_N (A, FSINFO_TYPE_LIST, sizeof(A##__STRUCT), G, #A)
+
+static const struct fsinfo_attribute fsinfo_attributes[] = {
+	FSINFO_VSTRUCT	(FSINFO_ATTR_STATFS,		fsinfo_generic_statfs),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_IDS,		fsinfo_generic_ids),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_LIMITS,		fsinfo_generic_limits),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_SUPPORTS,		fsinfo_generic_supports),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_TIMESTAMP_INFO,	fsinfo_generic_timestamp_info),
+	FSINFO_STRING	(FSINFO_ATTR_VOLUME_ID,		string),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_VOLUME_UUID,	fsinfo_generic_volume_uuid),
+	FSINFO_STRING	(FSINFO_ATTR_VOLUME_NAME,	string),
+	FSINFO_VSTRUCT_N(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO, fsinfo_meta_attribute_info),
+	FSINFO_LIST	(FSINFO_ATTR_FSINFO_ATTRIBUTES,	fsinfo_meta_attributes),
+	{}
+};
+
+static void dump_value(unsigned int attr_id,
+		       const struct fsinfo_attribute *attr,
+		       const struct fsinfo_attribute_info *attr_info,
+		       void *reply, unsigned int size)
+{
+	if (!attr || !attr->dump) {
+		printf("<no dumper>\n");
+		return;
+	}
+
+	if (attr->type == FSINFO_TYPE_VSTRUCT && size < attr->size) {
+		printf("<short data %u/%u>\n", size, attr->size);
+		return;
+	}
+
+	attr->dump(reply, size);
+}
+
+static void dump_list(unsigned int attr_id,
+		      const struct fsinfo_attribute *attr,
+		      const struct fsinfo_attribute_info *attr_info,
+		      void *reply, unsigned int size)
+{
+	size_t elem_size = attr_info->size;
+	unsigned int ix = 0;
+
+	printf("\n");
+	if (!attr || !attr->dump) {
+		printf("<no dumper>\n");
+		return;
+	}
+
+	if (attr->type == FSINFO_TYPE_VSTRUCT && size < attr->size) {
+		printf("<short data %u/%u>\n", size, attr->size);
+		return;
+	}
+
+	list_last = false;
+	while (size >= elem_size) {
+		printf("\t[%02x] ", ix);
+		if (size == elem_size)
+			list_last = true;
+		attr->dump(reply, size);
+		reply += elem_size;
+		size -= elem_size;
+		ix++;
+	}
+}
+
+/*
+ * Call fsinfo, expanding the buffer as necessary.
+ */
+static ssize_t get_fsinfo(const char *file, const char *name,
+			  struct fsinfo_params *params, void **_r)
+{
+	ssize_t ret;
+	size_t buf_size = 4096;
+	void *r;
+
+	for (;;) {
+		r = malloc(buf_size);
+		if (!r) {
+			perror("malloc");
+			exit(1);
+		}
+		memset(r, 0xbd, buf_size);
+
+		errno = 0;
+		ret = fsinfo(AT_FDCWD, file, params, sizeof(*params), r, buf_size - 1);
+		if (ret == -1)
+			goto error;
+
+		if (ret <= buf_size - 1)
+			break;
+		buf_size = (ret + 4096) & ~(4096 - 1);
+	}
+
+	if (debug)
+		printf("fsinfo(%s,%s,%u,%u) = %zd\n",
+		       file, name, params->Nth, params->Mth, ret);
+
+	((char *)r)[ret] = 0;
+	*_r = r;
+	return ret;
+
+error:
+	*_r = NULL;
+	free(r);
+	if (debug)
+		printf("fsinfo(%s,%s,%u,%u) = %m\n",
+		       file, name, params->Nth, params->Mth);
+	return ret;
+}
+
+/*
+ * Try one subinstance of an attribute.
+ */
+static int try_one(const char *file, struct fsinfo_params *params,
+		   const struct fsinfo_attribute_info *attr_info, bool raw)
+{
+	const struct fsinfo_attribute *attr;
+	const char *name;
+	size_t size = 4096;
+	char namebuf[32];
+	void *r;
+
+	for (attr = fsinfo_attributes; attr->name; attr++) {
+		if (attr->attr_id == params->request) {
+			name = attr->name;
+			if (strncmp(name, "fsinfo_generic_", 15) == 0)
+				name += 15;
+			goto found;
+		}
+	}
+
+	sprintf(namebuf, "<unknown-%x>", params->request);
+	name = namebuf;
+	attr = NULL;
+
+found:
+	size = get_fsinfo(file, name, params, &r);
+
+	if (size == -1) {
+		if (errno == ENODATA) {
+			if (!(attr_info->flags & (FSINFO_FLAGS_N | FSINFO_FLAGS_NM)) &&
+			    params->Nth == 0 && params->Mth == 0) {
+				fprintf(stderr,
+					"Unexpected ENODATA (0x%x{%u}{%u})\n",
+					params->request, params->Nth, params->Mth);
+				exit(1);
+			}
+			free(r);
+			return (params->Mth == 0) ? 2 : 1;
+		}
+		if (errno == EOPNOTSUPP) {
+			if (params->Nth > 0 || params->Mth > 0) {
+				fprintf(stderr,
+					"Should return -ENODATA (0x%x{%u}{%u})\n",
+					params->request, params->Nth, params->Mth);
+				exit(1);
+			}
+			//printf("\e[33m%s\e[m: <not supported>\n",
+			//       fsinfo_attr_names[attr]);
+			free(r);
+			return 2;
+		}
+		perror(file);
+		exit(1);
+	}
+
+	if (raw) {
+		if (size > 4096)
+			size = 4096;
+		dump_hex(r, 0, size);
+		free(r);
+		return 0;
+	}
+
+	switch (attr_info->flags & (FSINFO_FLAGS_N | FSINFO_FLAGS_NM)) {
+	case 0:
+		printf("\e[33m%s\e[m: ", name);
+		break;
+	case FSINFO_FLAGS_N:
+		printf("\e[33m%s{%u}\e[m: ", name, params->Nth);
+		break;
+	case FSINFO_FLAGS_NM:
+		printf("\e[33m%s{%u,%u}\e[m: ", name, params->Nth, params->Mth);
+		break;
+	}
+
+	switch (attr_info->type) {
+	case FSINFO_TYPE_VSTRUCT:
+	case FSINFO_TYPE_STRING:
+		dump_value(params->request, attr, attr_info, r, size);
+		free(r);
+		return 0;
+
+	case FSINFO_TYPE_LIST:
+		dump_list(params->request, attr, attr_info, r, size);
+		free(r);
+		return 0;
+
+	case FSINFO_TYPE_OPAQUE:
+		free(r);
+		return 0;
+
+	default:
+		fprintf(stderr, "Fishy about %u 0x%x,%x,%x\n",
+			params->request, attr_info->type, attr_info->flags, attr_info->size);
+		exit(1);
+	}
+}
+
+static int cmp_u32(const void *a, const void *b)
+{
+	return *(const int *)a - *(const int *)b;
+}
+
+/*
+ * List the supported attributes, then dump each one (or its metadata with -M).
+ */
+int main(int argc, char **argv)
+{
+	struct fsinfo_attribute_info attr_info;
+	struct fsinfo_params params = {
+		.resolve_flags	= RESOLVE_NO_TRAILING_SYMLINKS,
+		.flags		= FSINFO_FLAGS_QUERY_PATH,
+	};
+	unsigned int *attrs, ret, nr, i;
+	bool meta = false;
+	int raw = 0, opt, Nth, Mth;
+
+	while ((opt = getopt(argc, argv, "Madlr")) != -1) {
+		switch (opt) {
+		case 'M':
+			meta = true;
+			continue;
+		case 'a':
+			params.resolve_flags |= RESOLVE_NO_TRAILING_AUTOMOUNTS;
+			params.flags = FSINFO_FLAGS_QUERY_PATH;
+			continue;
+		case 'd':
+			debug = true;
+			continue;
+		case 'l':
+			params.resolve_flags &= ~RESOLVE_NO_TRAILING_SYMLINKS;
+			params.flags = FSINFO_FLAGS_QUERY_PATH;
+			continue;
+		case 'r':
+			raw = 1;
+			continue;
+		}
+		break;
+	}
+
+	argc -= optind;
+	argv += optind;
+
+	if (argc != 1) {
+		printf("Format: test-fsinfo [-Madlr] <path>\n");
+		exit(2);
+	}
+
+	/* Retrieve a list of supported attribute IDs */
+	params.request = FSINFO_ATTR_FSINFO_ATTRIBUTES;
+	params.Nth = 0;
+	params.Mth = 0;
+	ret = get_fsinfo(argv[0], "attributes", &params, (void **)&attrs);
+	if (ret == -1) {
+		fprintf(stderr, "Unable to get attribute list: %m\n");
+		exit(1);
+	}
+
+	if (ret % sizeof(attrs[0])) {
+		fprintf(stderr, "Bad length of attribute list (0x%x)\n", ret);
+		exit(2);
+	}
+
+	nr = ret / sizeof(attrs[0]);
+	qsort(attrs, nr, sizeof(attrs[0]), cmp_u32);
+
+	if (meta) {
+		printf("ATTR ID  TYPE         FLAGS    SIZE  NAME\n");
+		printf("======== ============ ======== ===== =========\n");
+		for (i = 0; i < nr; i++) {
+			params.request = FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO;
+			params.Nth = attrs[i];
+			params.Mth = 0;
+			ret = fsinfo(AT_FDCWD, argv[0],
+				     &params, sizeof(params),
+				     &attr_info, sizeof(attr_info));
+			if (ret == -1) {
+				fprintf(stderr, "Can't get info for attribute %x: %m\n", attrs[i]);
+				exit(1);
+			}
+
+			dump_attribute_info(&attr_info, ret);
+		}
+		exit(0);
+	}
+
+	for (i = 0; i < nr; i++) {
+		params.request = FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO;
+		params.Nth = attrs[i];
+		params.Mth = 0;
+		ret = fsinfo(AT_FDCWD, argv[0],
+			     &params, sizeof(params),
+			     &attr_info, sizeof(attr_info));
+		if (ret == -1) {
+			fprintf(stderr, "Can't get info for attribute %x: %m\n", attrs[i]);
+			exit(1);
+		}
+
+		if (attrs[i] == FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO ||
+		    attrs[i] == FSINFO_ATTR_FSINFO_ATTRIBUTES)
+			continue;
+
+		if (attrs[i] != attr_info.attr_id) {
+			fprintf(stderr, "ID for %03x returned %03x\n",
+				attrs[i], attr_info.attr_id);
+			break;
+		}
+		Nth = 0;
+		do {
+			Mth = 0;
+			do {
+				params.request = attrs[i];
+				params.Nth = Nth;
+				params.Mth = Mth;
+
+				switch (try_one(argv[0], &params, &attr_info, raw)) {
+				case 0:
+					continue;
+				case 1:
+					goto done_M;
+				case 2:
+					goto done_N;
+				}
+			} while (++Mth < 100);
+
+		done_M:
+			if (Mth >= 100) {
+				fprintf(stderr, "Fishy: Mth %x[%u][%u]\n", attrs[i], Nth, Mth);
+				break;
+			}
+
+		} while (++Nth < 100);
+
+	done_N:
+		if (Nth >= 100) {
+			fprintf(stderr, "Fishy: Nth %x[%u]\n", attrs[i], Nth);
+			break;
+		}
+	}
+
+	return 0;
+}
-- 
2.25.1


* [PATCH v19 14/14] arch: wire up fsinfo syscall
  2020-03-10  9:32     ` [PATCH v19 01/14] fsinfo: Add fsinfo() syscall to query filesystem information Christian Brauner
@ 2020-03-10  9:32       ` Christian Brauner
  0 siblings, 0 replies; 50+ messages in thread
From: Christian Brauner @ 2020-03-10  9:32 UTC (permalink / raw)
  To: christian.brauner
  Cc: christian, darrick.wong, dhowells, jannh, jlayton, kzak,
	linux-api, linux-fsdevel, linux-kernel, linux-security-module,
	mszeredi, raven, torvalds, viro

This wires up the fsinfo() syscall for all architectures.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 arch/alpha/kernel/syscalls/syscall.tbl      | 1 +
 arch/arm/tools/syscall.tbl                  | 1 +
 arch/arm64/include/asm/unistd.h             | 2 +-
 arch/arm64/include/asm/unistd32.h           | 2 ++
 arch/ia64/kernel/syscalls/syscall.tbl       | 1 +
 arch/m68k/kernel/syscalls/syscall.tbl       | 1 +
 arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   | 1 +
 arch/parisc/kernel/syscalls/syscall.tbl     | 1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    | 1 +
 arch/s390/kernel/syscalls/syscall.tbl       | 1 +
 arch/sh/kernel/syscalls/syscall.tbl         | 1 +
 arch/sparc/kernel/syscalls/syscall.tbl      | 1 +
 arch/x86/entry/syscalls/syscall_32.tbl      | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl      | 1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     | 1 +
 include/linux/syscalls.h                    | 4 ++++
 include/uapi/asm-generic/unistd.h           | 4 +++-
 20 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 7c0115af9010..4d0b07dde12d 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -479,3 +479,4 @@
 548	common	pidfd_getfd			sys_pidfd_getfd
 549	common	watch_mount			sys_watch_mount
 550	common	watch_sb			sys_watch_sb
+551	common	fsinfo				sys_fsinfo
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index f256f009a89f..fdda8382b420 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -453,3 +453,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index bc0f923e0e04..388eeb71cff0 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		441
+#define __NR_compat_syscalls		442
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index c1c61635f89c..1f7d2c8d481a 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -883,6 +883,8 @@ __SYSCALL(__NR_clone3, sys_clone3)
 __SYSCALL(__NR_openat2, sys_openat2)
 #define __NR_pidfd_getfd 438
 __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
+#define __NR_fsinfo 441
+__SYSCALL(__NR_fsinfo, sys_fsinfo)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index a4dafc659647..2316e60e031a 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -360,3 +360,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 893fb4151547..efc2723ca91f 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -439,3 +439,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 54aaf0d40c64..745c0f462fce 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -445,3 +445,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index fd34dd0efed0..499f83562a8c 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -378,3 +378,4 @@
 438	n32	pidfd_getfd			sys_pidfd_getfd
 439	n32	watch_mount			sys_watch_mount
 440	n32	watch_sb			sys_watch_sb
+441	n32	fsinfo				sys_fsinfo
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index db0f4c0a0a0b..b3188bc3ab3c 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -354,3 +354,4 @@
 438	n64	pidfd_getfd			sys_pidfd_getfd
 439	n64	watch_mount			sys_watch_mount
 440	n64	watch_sb			sys_watch_sb
+441	n64	fsinfo				sys_fsinfo
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index ce2e1326de8f..1a3e8ed5e538 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -427,3 +427,4 @@
 438	o32	pidfd_getfd			sys_pidfd_getfd
 439	o32	watch_mount			sys_watch_mount
 440	o32	watch_sb			sys_watch_sb
+441	o32	fsinfo				sys_fsinfo
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 6e4a7c08b64b..2572c215d861 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -437,3 +437,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 08943f3b8206..39d7ac7e918c 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -521,3 +521,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index b3b8529d2b74..ae4cefd3dd1b 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -442,3 +442,4 @@
 438  common	pidfd_getfd		sys_pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount		sys_watch_mount			sys_watch_mount
 440	common	watch_sb		sys_watch_sb			sys_watch_sb
+441  common	fsinfo			sys_fsinfo			sys_fsinfo
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 89307a20657c..05945b9aee4b 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -442,3 +442,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 4ff841a00450..b71b34d4b45c 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -485,3 +485,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index e2731d295f88..e118ba9aca4c 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -444,3 +444,4 @@
 438	i386	pidfd_getfd		sys_pidfd_getfd			__ia32_sys_pidfd_getfd
 439	i386	watch_mount		sys_watch_mount			__ia32_sys_watch_mount
 440	i386	watch_sb		sys_watch_sb			__ia32_sys_watch_sb
+441	i386	fsinfo			sys_fsinfo			__ia32_sys_fsinfo
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index f4391176102c..067f247471d0 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -361,6 +361,7 @@
 438	common	pidfd_getfd		__x64_sys_pidfd_getfd
 439	common	watch_mount		__x64_sys_watch_mount
 440	common	watch_sb		__x64_sys_watch_sb
+441	common	fsinfo			__x64_sys_fsinfo
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 8e7d731ed6cf..e1ec25099d10 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -410,3 +410,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index c84440d57f52..76064c0807e5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -47,6 +47,7 @@ struct stat64;
 struct statfs;
 struct statfs64;
 struct statx;
+struct fsinfo_params;
 struct __sysctl_args;
 struct sysinfo;
 struct timespec;
@@ -1007,6 +1008,9 @@ asmlinkage long sys_watch_mount(int dfd, const char __user *path,
 				unsigned int at_flags, int watch_fd, int watch_id);
 asmlinkage long sys_watch_sb(int dfd, const char __user *path,
 			     unsigned int at_flags, int watch_fd, int watch_id);
+asmlinkage long sys_fsinfo(int dfd, const char __user *pathname,
+			   struct fsinfo_params __user *params, size_t params_size,
+			   void __user *result_buffer, size_t result_buf_size);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 5bff318b7ffa..7d764f86d3f5 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -859,9 +859,11 @@ __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
 __SYSCALL(__NR_watch_mount, sys_watch_mount)
 #define __NR_watch_sb 440
 __SYSCALL(__NR_watch_sb, sys_watch_sb)
+#define __NR_fsinfo 441
+__SYSCALL(__NR_fsinfo, sys_fsinfo)
 
 #undef __NR_syscalls
-#define __NR_syscalls 441
+#define __NR_syscalls 442
 
 /*
  * 32 bit systems traditionally used different
-- 
2.25.1


* Re: [PATCH 01/14] VFS: Add additional RESOLVE_* flags [ver #18]
  2020-03-10  7:25     ` David Howells
@ 2020-03-11 17:59       ` Linus Torvalds
  2020-03-12  9:08         ` Stefan Metzmacher
  2020-03-13  9:50         ` Aleksa Sarai
  0 siblings, 2 replies; 50+ messages in thread
From: Linus Torvalds @ 2020-03-11 17:59 UTC (permalink / raw)
  To: David Howells
  Cc: Aleksa Sarai, Al Viro, Stefan Metzmacher, Ian Kent,
	Miklos Szeredi, Christian Brauner, Jann Horn, Darrick J. Wong,
	Karel Zak, jlayton, Linux API, linux-fsdevel, LSM List,
	Linux Kernel Mailing List

On Tue, Mar 10, 2020 at 12:25 AM David Howells <dhowells@redhat.com> wrote:
>
> Okay.  So what's the equivalent of AT_SYMLINK_NOFOLLOW in RESOLVE_* flag
> terms?

Nothing.

openat2() takes two sets of flags. We'll never get rid of
AT_SYMLINK_NOFOLLOW / O_NOFOLLOW, and we've added RESOLVE_NO_SYMLINKS
to the new set of flags. It's just a separate namespace.

We will _not_ be adding a RESOLVE_XYZ flag for O_NOFOLLOW or
AT_SYMLINK_NOFOLLOW. At least not visible to user space - because as
people already figured out, that just causes problems with consistency
issues.

And yes, the fact that we then have three different user-visible
namespaces (O_xyz flags for open(), AT_xyz flags for linkat(), and now
RESOLVE_xyz flags for openat2()) is sad and messy. But it's an
inherent messiness from just how the world works. We can't get rid of
it.

If we need linkat2() and friends, so be it. Do we?

Could we have a _fourth_ set of flags that are simply for internal use
that is a superset of them all? Sure. But no, it's almost certainly
not worth it. Four is not better than three.

Now, some type-safety in the kernel to make sure that we can't mix
AT_xyz with O_xyz or RESOLVE_xyz - that might be worth it. Although
judging by past experience, not enough people run sparse for it to
really be worth it.

               Linus

PS. Yeah, we also have that LOOKUP_xyz namespace, and the access mode
namespace, so we already have those internal format versions too.

* Re: [PATCH 01/14] VFS: Add additional RESOLVE_* flags [ver #18]
  2020-03-11 17:59       ` Linus Torvalds
@ 2020-03-12  9:08         ` Stefan Metzmacher
  2020-03-12 16:24           ` Linus Torvalds
  2020-03-12 16:56           ` David Howells
  2020-03-13  9:50         ` Aleksa Sarai
  1 sibling, 2 replies; 50+ messages in thread
From: Stefan Metzmacher @ 2020-03-12  9:08 UTC (permalink / raw)
  To: Linus Torvalds, David Howells
  Cc: Aleksa Sarai, Al Viro, Ian Kent, Miklos Szeredi,
	Christian Brauner, Jann Horn, Darrick J. Wong, Karel Zak,
	jlayton, Linux API, linux-fsdevel, LSM List,
	Linux Kernel Mailing List


Hi Linus,

>> Okay.  So what's the equivalent of AT_SYMLINK_NOFOLLOW in RESOLVE_* flag
>> terms?
> 
> Nothing.
> 
> openat2() takes two sets of flags. We'll never get rid of
> AT_SYMLINK_NOFOLLOW / O_NOFOLLOW, and we've added RESOLVE_NO_SYMLINKS
> to the new set of flags. It's just a separate namespace.
> 
> We will _not_ be adding a RESOLVE_XYZ flag for O_NOFOLLOW or
> AT_SYMLINK_NOFOLLOW. At least not visible to user space - because as
> people already figured out, that just causes problems with consistency
> issues.
> 
> And yes, the fact that we then have three different user-visible
> namespaces (O_xyz flags for open(), AT_xyz flags for linkat(), and now
> RESOLVE_xyz flags for openat2()) is sad and messy. But it's an
> inherent messiness from just how the world works. We can't get rid of
> it.

For openat2() and other existing syscalls I agree, that it's good to
have just a single bit to control a feature.

The whole discussion was triggered by the introduction of a completely
new fsinfo() call:

>> The new system call looks like:
>> 
>> 	int ret = fsinfo(int dfd,
>> 			 const char *pathname,
>> 			 const struct fsinfo_params *params,
>> 			 size_t params_size,
>> 			 void *result_buffer,
>> 			 size_t result_buf_size);
>> 
>> The params parameter optionally points to a block of parameters:
>> 
>> 	struct fsinfo_params {
>> 		__u32	resolve_flags;

If I remember correctly this was named at_flags initially.
And I thought it would be great to also have the new RESOLVE_XYZ feature
available for that new path based syscall.

Would you propose to have 'at_flags' and 'resolve_flags' passed in here?
Or is there something even better you would propose for new syscalls?

>> 		__u32	flags;
>> 		__u32	request;
>> 		__u32	Nth;
>> 		__u32	Mth;
>> 	};

> If we need linkat2() and friends, so be it. Do we?

Yes, I'm going to propose something like this, as it would make the life
much easier for Samba to have the new features available on all path
based syscalls.

In addition I'll propose to have a way to specify the source of
removeat and unlinkat also by fd in addition to the source parent fd
and relative path; the reason is also to detect races of path
recycling. pidfd_open() solved a similar problem for pid recycling.

> Could we have a _fourth_ set of flags that are simply for internal use
> that is a superset of them all? Sure. But no, it's almost certainly
> not worth it. Four is not better than three.

As you pointed out below, the LOOKUP_xyz namespace is already in place...
And the discussion was more about a possible single namespace for
completely new syscalls.

> Now, some type-safety in the kernel to make sure that we can't mix
> AT_xyz with O_xyz or RESOLVE_xyz - that might be worth it. Although
> judging by past experience, not enough people run sparse for it to
> really be worth it.

I'm new to all this and maybe too naive, but would a build bot running
sparse on linux-next be able to catch this early enough?

metze




* Re: [PATCH 01/14] VFS: Add additional RESOLVE_* flags [ver #18]
  2020-03-12  9:08         ` Stefan Metzmacher
@ 2020-03-12 16:24           ` Linus Torvalds
  2020-03-12 17:11             ` Stefan Metzmacher
  2020-03-12 19:25             ` Al Viro
  2020-03-12 16:56           ` David Howells
  1 sibling, 2 replies; 50+ messages in thread
From: Linus Torvalds @ 2020-03-12 16:24 UTC (permalink / raw)
  To: Stefan Metzmacher
  Cc: David Howells, Aleksa Sarai, Al Viro, Ian Kent, Miklos Szeredi,
	Christian Brauner, Jann Horn, Darrick J. Wong, Karel Zak,
	jlayton, Linux API, linux-fsdevel, LSM List,
	Linux Kernel Mailing List

On Thu, Mar 12, 2020 at 2:08 AM Stefan Metzmacher <metze@samba.org> wrote:
>
> The whole discussion was triggered by the introduction of a completely
> new fsinfo() call:
>
> Would you propose to have 'at_flags' and 'resolve_flags' passed in here?

Yes, I think that would be the way to go.
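
To make that concrete, here is a minimal sketch of a parameter block that
carries both sets of flags side by side. The struct name and the position of
at_flags are assumptions for illustration only and are not taken from any
posted patch; the other fields are copied from the struct quoted in the patch
description, with uint32_t standing in for __u32.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: AT_* and RESOLVE_* flags kept in separate words. */
struct fsinfo_params_sketch {
	uint32_t at_flags;	/* AT_* flags (assumed additional field) */
	uint32_t resolve_flags;	/* RESOLVE_* flags, as in the quoted struct */
	uint32_t flags;
	uint32_t request;
	uint32_t Nth;
	uint32_t Mth;
};

/* Size a caller would pass as params_size for this sketch layout. */
static size_t fsinfo_params_sketch_size(void)
{
	return sizeof(struct fsinfo_params_sketch);
}
```

Keeping the two words separate means neither namespace has to be translated
into the other, which is the point being made about not mixing AT_xyz and
RESOLVE_xyz values.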

> > If we need linkat2() and friends, so be it. Do we?
>
> Yes, I'm going to propose something like this, as it would make the life
> much easier for Samba to have the new features available on all path
> based syscalls.

Will samba actually use them? I think we've had extensions before that
weren't worth the non-portability pain?

But yes, if we have a major package like samba use it, then by all
means let's add linkat2(). How many things are we talking about? We
have a number of system calls that do *not* take flags, but do do
pathname walking. (I'm thinking things like "mkdirat()"?)

> In addition I'll propose to have a way to specify the source of
> removeat and unlinkat also by fd in addition to the source parent fd
> and relative path; the reason is also to detect races of path
> recycling.

Would that be basically just an AT_EMPTY_PATH kind of thing? IOW,
you'd be able to remove a file by doing

   fd = open(path.., O_PATH);
   unlinkat(fd, "", AT_EMPTY_PATH);

Hmm. We have _not_ allowed filesystem changes without that last
component lookup. Of course, with our dentry model, we *can* do it,
but this smells fairly fundamental to me.

It might avoid some of the extra system calls (ie you could use
openat2() to do the path walking part, and then
unlinkat(AT_EMPTY_PATH) to remove it, and have a "fstat()" etc in
between to verify that it's the right type of file or whatever - and
you'd not need an unlinkat2() with resolve flags).

I think Al needs to ok this kind of change. Maybe you've already
discussed it with him and I just missed it.

            Linus

* Re: [PATCH 01/14] VFS: Add additional RESOLVE_* flags [ver #18]
  2020-03-12  9:08         ` Stefan Metzmacher
  2020-03-12 16:24           ` Linus Torvalds
@ 2020-03-12 16:56           ` David Howells
  2020-03-12 18:09             ` Linus Torvalds
  1 sibling, 1 reply; 50+ messages in thread
From: David Howells @ 2020-03-12 16:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, Stefan Metzmacher, Aleksa Sarai, Al Viro, Ian Kent,
	Miklos Szeredi, Christian Brauner, Jann Horn, Darrick J. Wong,
	Karel Zak, jlayton, Linux API, linux-fsdevel, LSM List,
	Linux Kernel Mailing List

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > The whole discussion was triggered by the introduction of a completely
> > new fsinfo() call:
> >
> > Would you propose to have 'at_flags' and 'resolve_flags' passed in here?
> 
> Yes, I think that would be the way to go.

Okay, I can do that.

Any thoughts on which set of flags should override the other?  If we're making
RESOLVE_* flags the new definitive interface, then I feel they should probably
override the AT_* flags where there's a conflict, ie. RESOLVE_NO_SYMLINKS
should override AT_SYMLINK_FOLLOW for example.

David


* Re: [PATCH 01/14] VFS: Add additional RESOLVE_* flags [ver #18]
  2020-03-12 16:24           ` Linus Torvalds
@ 2020-03-12 17:11             ` Stefan Metzmacher
  2020-03-12 19:37               ` Al Viro
                                 ` (2 more replies)
  2020-03-12 19:25             ` Al Viro
  1 sibling, 3 replies; 50+ messages in thread
From: Stefan Metzmacher @ 2020-03-12 17:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Aleksa Sarai, Al Viro, Ian Kent, Miklos Szeredi,
	Christian Brauner, Jann Horn, Darrick J. Wong, Karel Zak,
	jlayton, Linux API, linux-fsdevel, LSM List,
	Linux Kernel Mailing List, Jeremy Allison, Ralph Böhme,
	Volker Lendecke


Am 12.03.20 um 17:24 schrieb Linus Torvalds:
> On Thu, Mar 12, 2020 at 2:08 AM Stefan Metzmacher <metze@samba.org> wrote:
>>
>> The whole discussion was triggered by the introduction of a completely
>> new fsinfo() call:
>>
>> Would you propose to have 'at_flags' and 'resolve_flags' passed in here?
> 
> Yes, I think that would be the way to go.

Ok, that's also fine for me:-)

>>> If we need linkat2() and friends, so be it. Do we?
>>
>> Yes, I'm going to propose something like this, as it would make the life
>> much easier for Samba to have the new features available on all path
>> based syscalls.
> 
> Will samba actually use them? I think we've had extensions before that
> weren't worth the non-portability pain?

Yes, we're currently moving to the portable *at() calls as a start.
And we already make use of Linux-only features for performance reasons
in other places. Having the new resolve flags will make it possible to
move some of the performance intensive work into non-linux specific
modules as fallback.

I hope that we'll use most of this through io_uring in the end,
that's the reason Jens added the IORING_REGISTER_PERSONALITY feature
used for IORING_OP_OPENAT2.

> But yes, if we have a major package like samba use it, then by all
> means let's add linkat2(). How many things are we talking about? We
> have a number of system calls that do *not* take flags, but do do
> pathname walking. I'm thinking things like "mkdirat()"?)

I haven't looked them up in detail yet.
Jeremy can you provide a list?

Do you think we could route some of them like mkdirat() and mknodat()
via openat2() instead of creating new syscalls?

>> In addition I'll propose to have a way to specify the source of
>> removeat and unlinkat also by fd in addition to the source parent fd
>> and relative path; the reason is also to detect races of path
>> recycling.
> 
> Would that be basically just an AT_EMPTY_PATH kind of thing? IOW,
> you'd be able to remove a file by doing
> 
>    fd = open(path.., O_PATH);
>    unlinkat(fd, "", AT_EMPTY_PATH);
> 
> Hmm. We have _not_ allowed filesystem changes without that last
> component lookup. Of course, with our dentry model, we *can* do it,
> but this smells fairly fundamental to me.
>
> It might avoid some of the extra system calls (ie you could use
> openat2() to do the path walking part, and then
> unlinkat(AT_EMPTY_PATH) to remove it, and have a "fstat()" etc in
> between to verify that it's the right type of file or whatever - and
> you'd not need an unlinkat2() with resolve flags).

If that works safely for hardlinks and having another process doing a
rename between openat2() and unlinkat(), we could try that.

My initial naive idea was to have one syscall instead of
linkat2/renameat3/unlinkat2.

int xlinkat(int src_dfd, const char *src_path,
            int dst_dfd, const char *dst_path,
            const struct xlinkat_how *how, size_t how_size);

struct xlinkat_how {
       __u64 src_at_flags;
       __u64 src_resolve_flags;
       __u64 dst_at_flags;
       __u64 dst_resolve_flags;
       __u64 rename_flags;
       __s32 src_fd;
};

With src_dfd=-1, src_path=NULL, how.src_fd = -1, this would be like
linkat().
With dst_dfd=-1, dst_path=NULL, it would be like unlinkat().
Otherwise a renameat2().

If how.src_fd is not -1, it would be checked to be the same path as
specified by src_dfd and src_path.

> I think Al needs to ok this kind of change. Maybe you've already
> discussed it with him and I just missed it.

This is the first time I'm discussing this.

Thanks for the useful feedback!
metze




* Re: [PATCH 01/14] VFS: Add additional RESOLVE_* flags [ver #18]
  2020-03-12 16:56           ` David Howells
@ 2020-03-12 18:09             ` Linus Torvalds
  2020-03-13  9:53               ` Aleksa Sarai
  0 siblings, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2020-03-12 18:09 UTC (permalink / raw)
  To: David Howells
  Cc: Stefan Metzmacher, Aleksa Sarai, Al Viro, Ian Kent,
	Miklos Szeredi, Christian Brauner, Jann Horn, Darrick J. Wong,
	Karel Zak, jlayton, Linux API, linux-fsdevel, LSM List,
	Linux Kernel Mailing List

On Thu, Mar 12, 2020 at 9:56 AM David Howells <dhowells@redhat.com> wrote:
>
> Any thoughts on which set of flags should override the other?

Do we need to care? I don't think we actually have conflicts, because
the semantics aren't the same, and they are about independent issues.

>                  If we're making
> RESOLVE_* flags the new definitive interface, then I feel they should probably
> override the AT_* flags where there's a conflict, ie. RESOLVE_NO_SYMLINKS
> should override AT_SYMLINK_FOLLOW for example.

That's just for a linkat2() system call? I think the natural semantic
is the one that falls out directly: RESOLVE_NO_SYMLINKS will cause it
to fail with -ELOOP if it is a symlink.

NOTE! This isn't really a "conflict". It's actually two different and
independent things:

 - without AT_SYMLINK_FOLLOW, a linkat() simply won't even try to
follow the symlink, and will link to the symlink itself instead.

 - RESOLVE_NO_SYMLINKS says "never follow symlinks".

Note how one does *NOT* override the other, quite the reverse. They
are about different things. One is about the _behavior_ when the last
component is a symlink, and the other is about permission to follow
any symlinks.

So all combinations make sense:

 - no AT_SYMLINK_FOLLOW, no RESOLVE_NO_SYMLINKS: just link to the
target, whether it's a symlink or not

   This is obviously our historical link() behavior.

 - no AT_SYMLINK_FOLLOW, yes RESOLVE_NO_SYMLINKS: just link to the
target, whether it's a symlink or not, but if there's a symlink in the
middle, return -ELOOP

   Note how this case doesn't follow the last one, so
RESOLVE_NO_SYMLINKS isn't an issue for the last component, but _is_ an
issue for the components in the middle.

 - AT_SYMLINK_FOLLOW, no RESOLVE_NO_SYMLINKS: just link to the target,
following the symlink if it exists

   This is obviously the historical AT_SYMLINK_FOLLOW behavior

 - AT_SYMLINK_FOLLOW | RESOLVE_NO_SYMLINKS: just link to the target,
return -ELOOP if it's a symlink (or if there's a symlink on the way).

   This is the natural behavior for "refuse to follow any symlinks anywhere".

note how they are all completely sane versions, and in no case does
one flag really override the other.
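
The four combinations above can be written down as a small truth-table
model. This is only a sketch: linkat2() does not exist, and the enum and the
function name below are made up for illustration. It models the outcome
described in the list and performs no filesystem access.

```c
#include <stdbool.h>

/* Possible outcomes of the hypothetical linkat2()-style source lookup. */
enum link_result {
	LINK_TO_NAMED_OBJECT,	/* link the final component as it is */
	LINK_TO_SYMLINK_TARGET,	/* final symlink was followed first */
	LINK_FAILS_ELOOP,	/* a symlink was met but not permitted */
};

/*
 * at_symlink_follow:   AT_SYMLINK_FOLLOW was passed
 * resolve_no_symlinks: RESOLVE_NO_SYMLINKS was passed
 * symlink_in_middle:   some non-final path component is a symlink
 * final_is_symlink:    the last path component is a symlink
 */
static enum link_result model_link(bool at_symlink_follow,
				   bool resolve_no_symlinks,
				   bool symlink_in_middle,
				   bool final_is_symlink)
{
	/* Intermediate symlinks always need permission to be followed. */
	if (symlink_in_middle && resolve_no_symlinks)
		return LINK_FAILS_ELOOP;

	/* Without AT_SYMLINK_FOLLOW the last component is never followed. */
	if (!final_is_symlink || !at_symlink_follow)
		return LINK_TO_NAMED_OBJECT;

	/* Asked to follow the final symlink: allowed only if not forbidden. */
	return resolve_no_symlinks ? LINK_FAILS_ELOOP : LINK_TO_SYMLINK_TARGET;
}
```

For example, model_link(true, true, false, true) yields LINK_FAILS_ELOOP,
while model_link(false, true, false, true) yields LINK_TO_NAMED_OBJECT:
one flag selects the behavior for the last component, the other grants or
withholds permission to follow symlinks.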

If anything, we actually miss a third flag: the "don't allow linking
to a final symlink at all" (but allow intermediate symlinks). We've
never had that behavior, although I think POSIX makes that case
undefined (ie you're not guaranteed to be able to link to a symlink in
the first place in POSIX).

I guess that third case could be emulated with open(O_PATH) + fstat to
check it's not a symlink + linkat(fd,AT_EMPTY_PATH) if it turns out
somebody would want something like that (and we decided that
AT_EMPTY_PATH is ok for linkat()).

I doubt anybody cares.

                 Linus

* Re: [PATCH 01/14] VFS: Add additional RESOLVE_* flags [ver #18]
  2020-03-12 16:24           ` Linus Torvalds
  2020-03-12 17:11             ` Stefan Metzmacher
@ 2020-03-12 19:25             ` Al Viro
  1 sibling, 0 replies; 50+ messages in thread
From: Al Viro @ 2020-03-12 19:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stefan Metzmacher, David Howells, Aleksa Sarai, Ian Kent,
	Miklos Szeredi, Christian Brauner, Jann Horn, Darrick J. Wong,
	Karel Zak, jlayton, Linux API, linux-fsdevel, LSM List,
	Linux Kernel Mailing List

On Thu, Mar 12, 2020 at 09:24:49AM -0700, Linus Torvalds wrote:
> Would that be basically just an AT_EMPTY_PATH kind of thing? IOW,
> you'd be able to remove a file by doing
> 
>    fd = open(path.., O_PATH);
>    unlinkat(fd, "", AT_EMPTY_PATH);
> 
> Hmm. We have _not_ allowed filesystem changes without that last
> component lookup. Of course, with our dentry model, we *can* do it,
> but this smells fairly fundamental to me.

That's a bloody bad idea.  It breeds fuckloads of corner cases, it does not
match the locking model at all and I don't want to even think of e.g.
the interplay with open-by-fhandle ("Parent?  What parent?"), etc.

Fundamentally, there are operations on objects and there are operations
on links to objects.  Mixing those is the recipe for massive headache.

> It might avoid some of the extra system calls (ie you could use
> openat2() to do the path walking part, and then
> unlinkat(AT_EMPTY_PATH) to remove it, and have a "fstat()" etc in
> between to verify that it's the right type of file or whatever - and
> you'd not need an unlinkat2() with resolve flags).
> 
> I think Al needs to ok this kind of change. Maybe you've already
> discussed it with him and I just missed it.

They have not.  And IME samba folks tend to present the set of
primitives they want without bothering to explain what do they
want to factorize that way, let alone why it should be factorized
that way...

* Re: [PATCH 01/14] VFS: Add additional RESOLVE_* flags [ver #18]
  2020-03-12 17:11             ` Stefan Metzmacher
@ 2020-03-12 19:37               ` Al Viro
  2020-03-12 21:48               ` Jeremy Allison
  2020-03-13  9:59               ` Aleksa Sarai
  2 siblings, 0 replies; 50+ messages in thread
From: Al Viro @ 2020-03-12 19:37 UTC (permalink / raw)
  To: Stefan Metzmacher
  Cc: Linus Torvalds, David Howells, Aleksa Sarai, Ian Kent,
	Miklos Szeredi, Christian Brauner, Jann Horn, Darrick J. Wong,
	Karel Zak, jlayton, Linux API, linux-fsdevel, LSM List,
	Linux Kernel Mailing List, Jeremy Allison, Ralph Böhme,
	Volker Lendecke

On Thu, Mar 12, 2020 at 06:11:09PM +0100, Stefan Metzmacher wrote:

> If that works safely for hardlinks and having another process doing a
> rename between openat2() and unlinkat(), we could try that.
> 
> My initial naive idea was to have one syscall instead of
> linkat2/renameat3/unlinkat2.
> 
> int xlinkat(int src_dfd, const char *src_path,
>             int dst_dfd, const char *dst_path,
>             const struct xlinkat_how *how, size_t how_size);
> 
> struct xlinkat_how {
>        __u64 src_at_flags;
>        __u64 src_resolve_flags;
>        __u64 dst_at_flags;
>        __u64 dst_resolve_flags;
>        __u64 rename_flags;
>        __s32 src_fd;
> };
> 
> With src_dfd=-1, src_path=NULL, how.src_fd = -1, this would be like
> linkat().
> With dst_dfd=-1, dst_path=NULL, it would be like unlinkat().
> Otherwise a renameat2().
>
> If how.src_fd is not -1, it would be checked to be the same path as
> specified by src_dfd and src_path.

"Checked" as in...?  And is that the same path or another link to the
same object, or...?

The idea of dumping all 3 into the same syscall looks wrong - compare
the effects of link() and rename() on the opened files, for starters,
and try to come up with documentation for all of that.  Multiplexors
tend to be very bad, in large part because they have such bloody
convoluted semantics...

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 01/14] VFS: Add additional RESOLVE_* flags [ver #18]
  2020-03-12 17:11             ` Stefan Metzmacher
  2020-03-12 19:37               ` Al Viro
@ 2020-03-12 21:48               ` Jeremy Allison
  2020-03-13  9:59               ` Aleksa Sarai
  2 siblings, 0 replies; 50+ messages in thread
From: Jeremy Allison @ 2020-03-12 21:48 UTC (permalink / raw)
  To: Stefan Metzmacher
  Cc: Linus Torvalds, David Howells, Aleksa Sarai, Al Viro, Ian Kent,
	Miklos Szeredi, Christian Brauner, Jann Horn, Darrick J. Wong,
	Karel Zak, jlayton, Linux API, linux-fsdevel, LSM List,
	Linux Kernel Mailing List, Ralph Böhme, Volker Lendecke

On Thu, Mar 12, 2020 at 06:11:09PM +0100, Stefan Metzmacher wrote:
> Am 12.03.20 um 17:24 schrieb Linus Torvalds:
> > On Thu, Mar 12, 2020 at 2:08 AM Stefan Metzmacher <metze@samba.org> wrote:
> >>
> >> The whole discussion was triggered by the introduction of a completely
> >> new fsinfo() call:
> >>
> >> Would you propose to have 'at_flags' and 'resolve_flags' passed in here?
> > 
> > Yes, I think that would be the way to go.
> 
> Ok, that's also fine for me:-)
> 
> >>> If we need linkat2() and friends, so be it. Do we?
> >>
> >> Yes, I'm going to propose something like this, as it would make life
> >> much easier for Samba to have the new features available on all path
> >> based syscalls.
> > 
> > Will samba actually use them? I think we've had extensions before that
> > weren't worth the non-portability pain?
> 
> Yes, we're currently moving to the portable *at() calls as a start.
> And we already make use of Linux-only features for performance reasons
> in other places. Having the new resolve flags will make it possible to
> move some of the performance intensive work into non-linux specific
> modules as fallback.
> 
> I hope that we'll use most of this through io_uring in the end,
> that's the reason Jens added the IORING_REGISTER_PERSONALITY feature
> used for IORING_OP_OPENAT2.
> 
> > But yes, if we have a major package like samba use it, then by all
> > means let's add linkat2(). How many things are we talking about? We
> > have a number of system calls that do *not* take flags, but do do
> > pathname walking. I'm thinking things like "mkdirat()"?)
> 
> I haven't looked them up in detail yet.
> Jeremy can you provide a list?

Fixing the flags argument on fchmodat() to actually *implement*
AT_SYMLINK_NOFOLLOW would be a good start :-).

As for the syscalls that don't have flags, I'm thinking of things like:

getxattr/setxattr/removexattr just off the top of my head.

Jeremy.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 01/14] VFS: Add additional RESOLVE_* flags [ver #18]
  2020-03-11 17:59       ` Linus Torvalds
  2020-03-12  9:08         ` Stefan Metzmacher
@ 2020-03-13  9:50         ` Aleksa Sarai
  1 sibling, 0 replies; 50+ messages in thread
From: Aleksa Sarai @ 2020-03-13  9:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Al Viro, Stefan Metzmacher, Ian Kent,
	Miklos Szeredi, Christian Brauner, Jann Horn, Darrick J. Wong,
	Karel Zak, jlayton, Linux API, linux-fsdevel, LSM List,
	Linux Kernel Mailing List

[-- Attachment #1: Type: application/pgp-encrypted, Size: 11 bytes --]

[-- Attachment #2: msg.asc --]
[-- Type: application/octet-stream, Size: 4108 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 01/14] VFS: Add additional RESOLVE_* flags [ver #18]
  2020-03-12 18:09             ` Linus Torvalds
@ 2020-03-13  9:53               ` Aleksa Sarai
  0 siblings, 0 replies; 50+ messages in thread
From: Aleksa Sarai @ 2020-03-13  9:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Stefan Metzmacher, Al Viro, Ian Kent,
	Miklos Szeredi, Christian Brauner, Jann Horn, Darrick J. Wong,
	Karel Zak, jlayton, Linux API, linux-fsdevel, LSM List,
	Linux Kernel Mailing List

[-- Attachment #1: Type: application/pgp-encrypted, Size: 11 bytes --]

[-- Attachment #2: msg.asc --]
[-- Type: application/octet-stream, Size: 2040 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 01/14] VFS: Add additional RESOLVE_* flags [ver #18]
  2020-03-12 17:11             ` Stefan Metzmacher
  2020-03-12 19:37               ` Al Viro
  2020-03-12 21:48               ` Jeremy Allison
@ 2020-03-13  9:59               ` Aleksa Sarai
  2020-03-13 10:00                 ` Aleksa Sarai
                                   ` (2 more replies)
  2 siblings, 3 replies; 50+ messages in thread
From: Aleksa Sarai @ 2020-03-13  9:59 UTC (permalink / raw)
  To: Stefan Metzmacher
  Cc: Linus Torvalds, David Howells, Al Viro, Ian Kent, Miklos Szeredi,
	Christian Brauner, Jann Horn, Darrick J. Wong, Karel Zak,
	jlayton, Linux API, linux-fsdevel, LSM List,
	Linux Kernel Mailing List, Jeremy Allison, Ralph Böhme,
	Volker Lendecke

[-- Attachment #1: Type: application/pgp-encrypted, Size: 11 bytes --]

[-- Attachment #2: msg.asc --]
[-- Type: application/octet-stream, Size: 2800 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 01/14] VFS: Add additional RESOLVE_* flags [ver #18]
  2020-03-13  9:59               ` Aleksa Sarai
@ 2020-03-13 10:00                 ` Aleksa Sarai
  2020-03-13 16:48                 ` Jeremy Allison
  2020-03-13 18:28                 ` Al Viro
  2 siblings, 0 replies; 50+ messages in thread
From: Aleksa Sarai @ 2020-03-13 10:00 UTC (permalink / raw)
  To: Stefan Metzmacher
  Cc: Linus Torvalds, David Howells, Al Viro, Ian Kent, Miklos Szeredi,
	Christian Brauner, Jann Horn, Darrick J. Wong, Karel Zak,
	jlayton, Linux API, linux-fsdevel, LSM List,
	Linux Kernel Mailing List, Jeremy Allison, Ralph Böhme,
	Volker Lendecke

[-- Attachment #1: Type: application/pgp-encrypted, Size: 11 bytes --]

[-- Attachment #2: msg.asc --]
[-- Type: application/octet-stream, Size: 1796 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 01/14] VFS: Add additional RESOLVE_* flags [ver #18]
  2020-03-13  9:59               ` Aleksa Sarai
  2020-03-13 10:00                 ` Aleksa Sarai
@ 2020-03-13 16:48                 ` Jeremy Allison
  2020-03-13 18:28                 ` Al Viro
  2 siblings, 0 replies; 50+ messages in thread
From: Jeremy Allison @ 2020-03-13 16:48 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Stefan Metzmacher, Linus Torvalds, David Howells, Al Viro,
	Ian Kent, Miklos Szeredi, Christian Brauner, Jann Horn,
	Darrick J. Wong, Karel Zak, jlayton, Linux API, linux-fsdevel,
	LSM List, Linux Kernel Mailing List, Ralph Böhme,
	Volker Lendecke

On Fri, Mar 13, 2020 at 08:59:01PM +1100, Aleksa Sarai wrote:
> 
> I have heard some folks asking for a way to create a directory and get a
> handle to it atomically -- so arguably this is something that could be
> inside openat2()'s feature set (O_MKDIR?). But I'm not sure how popular
> of an idea this is.

This would be very useful to prevent race conditions between making
a directory and setting EAs on it, as Samba needs to do for
DOS attributes and Windows/NFSv4 ACLs.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 01/14] VFS: Add additional RESOLVE_* flags [ver #18]
  2020-03-13  9:59               ` Aleksa Sarai
  2020-03-13 10:00                 ` Aleksa Sarai
  2020-03-13 16:48                 ` Jeremy Allison
@ 2020-03-13 18:28                 ` Al Viro
  2020-03-13 18:35                   ` Jeremy Allison
  2020-03-16 14:21                   ` Aleksa Sarai
  2 siblings, 2 replies; 50+ messages in thread
From: Al Viro @ 2020-03-13 18:28 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Stefan Metzmacher, Linus Torvalds, David Howells, Ian Kent,
	Miklos Szeredi, Christian Brauner, Jann Horn, Darrick J. Wong,
	Karel Zak, jlayton, Linux API, linux-fsdevel, LSM List,
	Linux Kernel Mailing List, Jeremy Allison, Ralph Böhme,
	Volker Lendecke

On Fri, Mar 13, 2020 at 08:59:01PM +1100, Aleksa Sarai wrote:
> On 2020-03-12, Stefan Metzmacher <metze@samba.org> wrote:
> > Am 12.03.20 um 17:24 schrieb Linus Torvalds:
> > > But yes, if we have a major package like samba use it, then by all
> > > means let's add linkat2(). How many things are we talking about? We
> > > have a number of system calls that do *not* take flags, but do do
> > > pathname walking. I'm thinking things like "mkdirat()"?)
> > 
> > I haven't looked them up in detail yet.
> > Jeremy can you provide a list?
> > 
> > Do you think we could route some of them like mkdirat() and mknodat()
> > via openat2() instead of creating new syscalls?
> 
> I have heard some folks asking for a way to create a directory and get a
> handle to it atomically -- so arguably this is something that could be
> inside openat2()'s feature set (O_MKDIR?). But I'm not sure how popular
> of an idea this is.

For fuck sake, *NO*!

We don't need any more multiplexors from hell.  mkdir() and open() have
deeply different interpretation of pathnames (and anyone who asks for
e.g. traversals of dangling symlinks on mkdir() is insane).  Don't try to
mix those; even O_TMPFILE had been a mistake.

Folks, we'd paid very dearly for the atomic_open() merge.  We are _still_
paying for it - and keep finding bugs induced by the convoluted horrors
in that thing (see yesterday's pull from vfs.git#fixes for the latest crop).
I hope to get it into more or less sane shape (part - this cycle, with
followups in the next one), but the last thing we need is more complexity
in the area.

Keep the semantics simple and regular; corner cases _suck_.  "Infinitely
extensible (without review)" is no virtue.  And having nowhere to hide
very special flags for very special kludges is a bloody good thing.

Every fucking time we had a multiplexed syscall, it had been a massive
source of trouble.  If it has uniform semantics - fine; we don't need
arseloads of read_this(2)/read_that(2).  But when you need pages upon
pages to describe the subtle differences in the interpretation of
its arguments, you have already lost.  It will be full of corner
cases, they will get zero testing and they will rot.  Inevitably.  All
the faster for the lack of people who would be able to keep all of that
in head.

We do have a mechanism for multiplexing; on amd64 it lives in do_syscall_64().
We really don't need openat2() turning into another one.  Syscall table
slots are not in a short supply, and the level of review one gets from
"new syscall added" is higher than from "make fubar(2) recognize a new
member in options->union_full_of_crap if it has RESOLVE_TO_WANK_WITH_RIGHT_HAND
set in options->flags, affecting its behaviour in some odd ways".
Which is a good thing, damnit.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 01/14] VFS: Add additional RESOLVE_* flags [ver #18]
  2020-03-13 18:28                 ` Al Viro
@ 2020-03-13 18:35                   ` Jeremy Allison
  2020-03-16 14:21                   ` Aleksa Sarai
  1 sibling, 0 replies; 50+ messages in thread
From: Jeremy Allison @ 2020-03-13 18:35 UTC (permalink / raw)
  To: Al Viro
  Cc: Aleksa Sarai, Stefan Metzmacher, Linus Torvalds, David Howells,
	Ian Kent, Miklos Szeredi, Christian Brauner, Jann Horn,
	Darrick J. Wong, Karel Zak, jlayton, Linux API, linux-fsdevel,
	LSM List, Linux Kernel Mailing List, Ralph Böhme,
	Volker Lendecke

On Fri, Mar 13, 2020 at 06:28:44PM +0000, Al Viro wrote:
> On Fri, Mar 13, 2020 at 08:59:01PM +1100, Aleksa Sarai wrote:
> > On 2020-03-12, Stefan Metzmacher <metze@samba.org> wrote:
> > > Am 12.03.20 um 17:24 schrieb Linus Torvalds:
> > > > But yes, if we have a major package like samba use it, then by all
> > > > means let's add linkat2(). How many things are we talking about? We
> > > > have a number of system calls that do *not* take flags, but do do
> > > > pathname walking. I'm thinking things like "mkdirat()"?)
> > > 
> > > I haven't looked them up in detail yet.
> > > Jeremy can you provide a list?
> > > 
> > > Do you think we could route some of them like mkdirat() and mknodat()
> > > via openat2() instead of creating new syscalls?
> > 
> > I have heard some folks asking for a way to create a directory and get a
> > handle to it atomically -- so arguably this is something that could be
> > inside openat2()'s feature set (O_MKDIR?). But I'm not sure how popular
> > of an idea this is.
> 
> For fuck sake, *NO*!
> 
> We don't need any more multiplexors from hell.  mkdir() and open() have
> deeply different interpretation of pathnames (and anyone who asks for
> e.g. traversals of dangling symlinks on mkdir() is insane).  Don't try to
> mix those; even O_TMPFILE had been a mistake.
> 
> Folks, we'd paid very dearly for the atomic_open() merge.  We are _still_
> paying for it - and keep finding bugs induced by the convoluted horrors
> in that thing (see yesterday's pull from vfs.git#fixes for the latest crop).
> I hope to get it into more or less sane shape (part - this cycle, with
> followups in the next one), but the last thing we need is more complexity
> in the area.

Can we disentangle the laudable desire to keep kernel internals
simple (which I completely agree with :-) from the desire to
keep user-space interfaces simple?

Having some way of doing a mkdir() that returns an open fd
on the new directory *is* a very useful thing for many applications,
but I really don't care how the kernel implements it. We have so much
Linux-specific code already that one more thing won't matter :-).

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 01/14] VFS: Add additional RESOLVE_* flags [ver #18]
  2020-03-16 14:21                   ` Aleksa Sarai
@ 2020-03-16 14:20                     ` Aleksa Sarai
  0 siblings, 0 replies; 50+ messages in thread
From: Aleksa Sarai @ 2020-03-16 14:20 UTC (permalink / raw)
  To: Al Viro
  Cc: Stefan Metzmacher, Linus Torvalds, David Howells, Ian Kent,
	Miklos Szeredi, Christian Brauner, Jann Horn, Darrick J. Wong,
	Karel Zak, jlayton, Linux API, linux-fsdevel, LSM List,
	Linux Kernel Mailing List, Jeremy Allison, Ralph Böhme,
	Volker Lendecke

[-- Attachment #1: Type: text/plain, Size: 2196 bytes --]

On 2020-03-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Mar 13, 2020 at 08:59:01PM +1100, Aleksa Sarai wrote:
> > On 2020-03-12, Stefan Metzmacher <metze@samba.org> wrote:
> > > Am 12.03.20 um 17:24 schrieb Linus Torvalds:
> > > > But yes, if we have a major package like samba use it, then by all
> > > > means let's add linkat2(). How many things are we talking about? We
> > > > have a number of system calls that do *not* take flags, but do do
> > > > pathname walking. I'm thinking things like "mkdirat()"?)
> > > 
> > > I haven't looked them up in detail yet.
> > > Jeremy can you provide a list?
> > > 
> > > Do you think we could route some of them like mkdirat() and mknodat()
> > > via openat2() instead of creating new syscalls?
> > 
> > I have heard some folks asking for a way to create a directory and get a
> > handle to it atomically -- so arguably this is something that could be
> > inside openat2()'s feature set (O_MKDIR?). But I'm not sure how popular
> > of an idea this is.
> 
> For fuck sake, *NO*!
> 
> We don't need any more multiplexors from hell.  mkdir() and open() have
> deeply different interpretation of pathnames (and anyone who asks for
> e.g. traversals of dangling symlinks on mkdir() is insane).  Don't try to
> mix those; even O_TMPFILE had been a mistake.

I agree that O_TMPFILE is a mess, and you're right that it wouldn't be a
good idea to fold it into open*(). But what is your opinion on a
hypothetical mkdirat2() which would let you get an fd to the directory
that was just created?

> We really don't need openat2() turning into another one.  Syscall table
> slots are not in a short supply, and the level of review one gets from
> "new syscall added" is higher than from "make fubar(2) recognize a new
> member in options->union_full_of_crap if it has RESOLVE_TO_WANK_WITH_RIGHT_HAND
> set in options->flags, affecting its behaviour in some odd ways".
> Which is a good thing, damnit.

You're quite right, and I don't intend openat2() to become another
ioctl-but-with-even-more-fun-semantics.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 01/14] VFS: Add additional RESOLVE_* flags [ver #18]
  2020-03-13 18:28                 ` Al Viro
  2020-03-13 18:35                   ` Jeremy Allison
@ 2020-03-16 14:21                   ` Aleksa Sarai
  2020-03-16 14:20                     ` Aleksa Sarai
  1 sibling, 1 reply; 50+ messages in thread
From: Aleksa Sarai @ 2020-03-16 14:21 UTC (permalink / raw)
  To: Al Viro
  Cc: Stefan Metzmacher, Linus Torvalds, David Howells, Ian Kent,
	Miklos Szeredi, Christian Brauner, Jann Horn, Darrick J. Wong,
	Karel Zak, jlayton, Linux API, linux-fsdevel, LSM List,
	Linux Kernel Mailing List, Jeremy Allison, Ralph Böhme,
	Volker Lendecke

[-- Attachment #1: Type: application/pgp-encrypted, Size: 11 bytes --]

[-- Attachment #2: msg.asc --]
[-- Type: application/octet-stream, Size: 2661 bytes --]

Thread overview: 50+ messages
2020-03-09 14:00 [PATCH 00/14] VFS: Filesystem information [ver #18] David Howells
2020-03-09 14:00 ` [PATCH 01/14] VFS: Add additional RESOLVE_* flags " David Howells
2020-03-09 20:56   ` Stefan Metzmacher
2020-03-09 21:13   ` David Howells
2020-03-10  0:55   ` Aleksa Sarai
2020-03-10  1:14     ` Linus Torvalds
2020-03-10  7:25     ` David Howells
2020-03-11 17:59       ` Linus Torvalds
2020-03-12  9:08         ` Stefan Metzmacher
2020-03-12 16:24           ` Linus Torvalds
2020-03-12 17:11             ` Stefan Metzmacher
2020-03-12 19:37               ` Al Viro
2020-03-12 21:48               ` Jeremy Allison
2020-03-13  9:59               ` Aleksa Sarai
2020-03-13 10:00                 ` Aleksa Sarai
2020-03-13 16:48                 ` Jeremy Allison
2020-03-13 18:28                 ` Al Viro
2020-03-13 18:35                   ` Jeremy Allison
2020-03-16 14:21                   ` Aleksa Sarai
2020-03-16 14:20                     ` Aleksa Sarai
2020-03-12 19:25             ` Al Viro
2020-03-12 16:56           ` David Howells
2020-03-12 18:09             ` Linus Torvalds
2020-03-13  9:53               ` Aleksa Sarai
2020-03-13  9:50         ` Aleksa Sarai
2020-03-09 14:01 ` [PATCH 02/14] fsinfo: Add fsinfo() syscall to query filesystem information " David Howells
2020-03-10  9:31   ` Christian Brauner
2020-03-10  9:32     ` [PATCH v19 01/14] fsinfo: Add fsinfo() syscall to query filesystem information Christian Brauner
2020-03-10  9:32       ` [PATCH v19 14/14] arch: wire up fsinfo syscall Christian Brauner
2020-03-09 14:01 ` [PATCH 03/14] fsinfo: Provide a bitmap of supported features [ver #18] David Howells
2020-03-09 14:01 ` [PATCH 04/14] fsinfo: Allow retrieval of superblock devname, options and stats " David Howells
2020-03-09 14:01 ` [PATCH 05/14] fsinfo: Allow fsinfo() to look up a mount object by ID " David Howells
2020-03-09 14:01 ` [PATCH 06/14] fsinfo: Add a uniquifier ID to struct mount " David Howells
2020-03-09 14:01 ` [PATCH 07/14] fsinfo: Allow mount information to be queried " David Howells
2020-03-10  9:04   ` Miklos Szeredi
2020-03-09 14:02 ` [PATCH 08/14] fsinfo: Allow the mount topology propogation flags to be retrieved " David Howells
2020-03-10  8:42   ` Christian Brauner
2020-03-09 14:02 ` [PATCH 09/14] fsinfo: Provide notification overrun handling support " David Howells
2020-03-09 14:02 ` [PATCH 10/14] fsinfo: sample: Mount listing program " David Howells
2020-03-09 14:02 ` [PATCH 11/14] fsinfo: Add API documentation " David Howells
2020-03-09 14:02 ` [PATCH 12/14] fsinfo: Add support for AFS " David Howells
2020-03-09 14:02 ` [PATCH 13/14] fsinfo: Example support for Ext4 " David Howells
2020-03-09 14:02 ` [PATCH 14/14] fsinfo: Example support for NFS " David Howells
2020-03-09 17:50 ` [PATCH 00/14] VFS: Filesystem information " Jeff Layton
2020-03-09 19:22   ` Andres Freund
2020-03-09 22:49     ` Jeff Layton
2020-03-10  0:18       ` Andres Freund
2020-03-09 20:02 ` Miklos Szeredi
2020-03-09 22:52 ` David Howells
2020-03-10  9:18   ` Miklos Szeredi
