LKML Archive on lore.kernel.org
* [RFC PATCH 00/27] Containers and using authenticated filesystems
@ 2019-02-15 16:07 David Howells
  2019-02-15 16:07 ` [RFC PATCH 01/27] containers: Rename linux/container.h to linux/container_dev.h David Howells
                   ` (29 more replies)
  0 siblings, 30 replies; 61+ messages in thread
From: David Howells @ 2019-02-15 16:07 UTC
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel


Here's a collection of patches that containerises the kernel keys and makes
it possible to separate keys by namespace.  This can be extended to any
filesystem that uses request_key() to obtain the pertinent authentication
token on entry to VFS or socket methods.

I have this working with AFS and AF_RXRPC so far, but it could be extended
to other filesystems, such as NFS and CIFS.

The following changes are made:

 (1) Add optional namespace tags to a key's index_key.  This allows the
     following:

     (a) Automatic invalidation of all keys with that tag when the
     	 namespace is removed.

     (b) Mixing of keys with the same description, but different areas of
     	 operation within a keyring.

     (c) Sharing of cache keyrings, such as the DNS lookup cache.

     (d) Diversion of upcalls based on namespace criteria.

 (2) Provide each network namespace with a tag that can be used with (1).
     This is used by the DNS query, rxrpc and nfs idmapper keys.

     [!] Note that it might still be better to move these keyrings into the
     	 network namespace.

 (3) Provide key ACLs.  These allow:

     (a) Permissions to be split more finely, in particular separating
         out Invalidate and Join.

     (b) Permissions to be granted to non-standard subjects.  So, for
         instance, Search permission could be granted to a container
         object, allowing a search of the container keyring by a denizen
         of the container to find a key that they can't otherwise see.

 (4) Provide a kernel container object.  Currently, this is created with a
     system call and passed flags that indicate the namespaces to be
     inherited or replaced.  It might be better to actually use something
     like fsconfig() to configure the container by setting key=val type
     options.

     The kernel container object provides the following facilities:

     (a) request_key upcall interception.  The manager of a container can
     	 intercept requests made inside the container and, using a series
     	 of filters, can cause the authkeys to be placed into keyrings that
     	 serve as queues for one or more upcall processing programs.  These
     	 upcall programs use key notifications to monitor those keyrings.

     (b) Per-container keyring.  A keyring can be attached to the container
     	 such that this is searched by a request_key() performed by a
     	 denizen of the container after searching the thread, process and
     	 session keyrings.  The keyring and the keys contained therein must
     	 be granted Search for that container.

         This allows:

         (i) Authenticated filesystems to be used transparently inside of
             the container without any cooperation from the occupant
             thereof.  All the key maintenance can be done by the manager.

         (ii) Keys to be made available to the denizens of a container (by
              granting extra permissions to the container subject).

     (c) Per-container ID that can be used in audit messages.

     (d) Container object creation gives the manager a file descriptor that
     	 can:

         (i) Be passed to a dirfd parameter to a VFS syscall, such as
             mkdirat(), allowing an operation to be done inside the
             container.

         (ii) Be passed to fsopen()/fsconfig() to indicate that the target
              filesystem is going to be created inside a container, in that
              container's namespaces.

         (iii) Be passed to the move_mount() syscall as a destination for
               setting the root filesystem inside a new mount namespace made
               upon container creation.

     (e) The ability to configure the container with namespaces or
     	 whatever, and then fork a process into that container to 'boot'
     	 it.
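
As a rough illustration of the search order described in (4)(b), the
following user-space C sketch models a request_key() lookup that falls back
to the container keyring.  The model_* types and functions are hypothetical
and are not part of the patches; permission checks on the thread, process
and session keyrings are deliberately left out, and the upcall step is not
modelled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; these are not the kernel's structures. */
struct model_key {
	const char *type;
	const char *desc;
	bool container_may_search;	/* Search granted to the container subject */
};

struct model_keyring {
	const struct model_key *keys;
	size_t nr_keys;
	bool container_may_search;	/* Search granted to the container subject */
};

struct model_task {
	const struct model_keyring *thread_ring;
	const struct model_keyring *process_ring;
	const struct model_keyring *session_ring;
	const struct model_keyring *container_ring;	/* NULL if none attached */
};

/*
 * Look for a key by type and description.  When searching on behalf of the
 * container subject, both the keyring and the key must grant Search to it.
 */
static const struct model_key *ring_find(const struct model_keyring *ring,
					 const char *type, const char *desc,
					 bool as_container)
{
	if (!ring || (as_container && !ring->container_may_search))
		return NULL;
	for (size_t i = 0; i < ring->nr_keys; i++) {
		const struct model_key *k = &ring->keys[i];

		if (as_container && !k->container_may_search)
			continue;
		if (strcmp(k->type, type) == 0 && strcmp(k->desc, desc) == 0)
			return k;
	}
	return NULL;
}

/* Search thread, process and session keyrings, then the container keyring. */
const struct model_key *model_request_key(const struct model_task *t,
					  const char *type, const char *desc)
{
	const struct model_keyring *own[] = {
		t->thread_ring, t->process_ring, t->session_ring,
	};
	const struct model_key *key;

	for (size_t i = 0; i < sizeof(own) / sizeof(own[0]); i++) {
		key = ring_find(own[i], type, desc, false);
		if (key)
			return key;
	}
	return ring_find(t->container_ring, type, desc, true);
}
```

In this model a key found in the session keyring shadows one of the same
type and description in the container keyring, which matches the statement
that a token made available by the manager can be overridden locally.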


Three sample programs are provided:

 (1) test-container.  This:

	- Creates a kernel container with a blank mount ns.
	- Creates its root mount and moves it to the container root.
	- Mounts /proc therein.
	- Creates a keyring called "_container"
	  - Sets that as the container keyring.
	  - Grants Search permission to the container on that keyring.
	  - Removes owner permission on that keyring.
	- Creates a sample user key "foobar" in the container keyring.
	  - Grants various permissions to the container on that key.
	- Creates a keyring called "upcall"
	  - Intercepts "user" key upcalls from the container to there.
	- Forks a process into the container
	  - Prints the container keyring ID if it can
	  - Exec's bash.

     This program expects to be given the device name for a partition it
     can mount as the root and expects it to contain things like /etc,
     /bin, /sbin, /lib, /usr containing programs that can be run and /proc
     to mount procfs upon.  E.g.:

	./test-container /dev/sda3

 (2) test-upcall.  This is a service program that monitors the "upcall"
     keyring created by test-container for authkeys appearing, which it
     then hands off to /sbin/request-key.  This:

	- Opens /dev/watch_queue.
	  - Sets the size to 1 page.
	  - Sets a filter to watch for "Link creation" key events.
	  - Sets a watch on the upcall keyring.
	- Polls the watch queue for events
	- When an event comes in:
	  - Gets the authkey ID from the event buffer.
	  - Queries the authkey.
	  - Forks off a handler which:
	    - Moves the authkey to its thread keyring
	    - Sets up a new session keyring with the authkey in it.
	    - Execs /sbin/request-key.

     This can be run in a shell that shares the session keyring with
     test-container, from which it will find the upcall keyring.
     Alternatively, the keyring ID can be provided on the command line:

	./test-upcall [<upcall-keyring>]

     It can be triggered from inside of the container with something like:

	keyctl request2 user debug:e a @s

     and something like:

	ptrs h=4 t=2 m=2000003
	NOTIFY[00000004-00000002] ty=0003 sy=0002 i=01000010
	KEY 78543393 change=2 aux=141053003
	Authentication key 141053003
	- create 779280685
	- uid=0 gid=0
	- rings=0,0,798528519
	- callout='a'
	RQDebug keyid: 779280685
	RQDebug desc: debug:e
	RQDebug callout: a
	RQDebug session keyring: 798528519

     will appear on stdout/stderr from it and /sbin/request-key.

 (3) test-cont-grant.  This is a program to make the nominated key
     available to a container's denizens.  It:

	- Grants search permission to the nominated key.
	- Links the nominated key into the container keyring.

     It can be run from outside of the container like so:

	./test-cont-grant <key> [<container-keyring>]

     If the keyring isn't given, it will look for one called "_container"
     in the session keyring where test-container is expected to have placed
     it.

     With kAFS, it can be used as follows:

	kinit dhowells@REDHAT.COM
	kafs-aklog redhat.com

     which would log into kerberos and then get a key for accessing an AFS
     cell called "redhat.com".  This can be seen in the session keyring by
     calling "keyctl show":

	 120378984 --alswrv      0     0  keyring: _ses
	 474754113 ---lswrv      0 65534   \_ keyring: _uid.0
	  64049961 --alswrv      0     0   \_ rxrpc: afs@redhat.com
	  78543393 --alswrv      0     0   \_ keyring: upcall
	 661655334 --alswrv      0     0   \_ keyring: _container
	 639103010 --alswrv      0     0       \_ user: foobar

     Then doing:

	./test-cont-grant 64049961

     will result in:

	 120378984 --alswrv      0     0  keyring: _ses
	 474754113 ---lswrv      0 65534   \_ keyring: _uid.0
	  64049961 --alswrv      0     0   \_ rxrpc: afs@redhat.com
	  78543393 --alswrv      0     0   \_ keyring: upcall
	 661655334 --alswrv      0     0   \_ keyring: _container
	 639103010 --alswrv      0     0       \_ user: foobar
	  64049961 --alswrv      0     0       \_ rxrpc: afs@redhat.com

     Inside the container, the cell could be mounted:

	mount -t afs "%redhat.com:root.cell" /mnt

     and then operations in /mnt will be done using the token that has been
     made available.  However, this can be overridden locally inside the
     container by doing kinit and kafs-aklog there with a different user.

     More to the point, the container manager could mount the container's
     rootfs over, say, authenticated AFS, attach the token to the container
     and then mount the rootfs into the container.  The container's
     inhabitant then need not have any means to gain a Kerberos login.

     [?] I do wonder if the possibility to use container key searches for
     	 direct mounts should be controlled by a mount option, say:

		fsconfig(fsfd, FSCONFIG_SET_CONTAINER, NULL, NULL, cfd);

         where you have to have the container handle available.

     [!] Note that test-cont-grant picks the container by name and does not
     	 require the container handle when setting the key ACL - but the
     	 name must come from the set of children of the current container.


The patches can also be found here:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=container

Note that this is dependent on the mount-api-viro, fsinfo, notifications
and keys-namespace branches.

David
---
David Howells (27):
      containers: Rename linux/container.h to linux/container_dev.h
      containers: Implement containers as kernel objects
      containers: Provide /proc/containers
      containers: Allow a process to be forked into a container
      containers: Open a socket inside a container
      containers, vfs: Allow syscall dirfd arguments to take a container fd
      containers: Make fsopen() able to create a superblock in a container
      containers, vfs: Honour CONTAINER_NEW_EMPTY_FS_NS
      vfs: Allow mounting to other namespaces
      containers: Provide fs_context op for container setting
      containers: Sample program for driving container objects
      containers: Allow a daemon to intercept request_key upcalls in a container
      keys: Provide a keyctl to query a request_key authentication key
      keys: Break bits out of key_unlink()
      keys: Make __key_link_begin() handle lockdep nesting
      keys: Grant Link permission to possessers of request_key auth keys
      keys: Add a keyctl to move a key between keyrings
      keys: Find the least-recently used unseen key in a keyring.
      containers: Sample: request_key upcall handling
      container, keys: Add a container keyring
      keys: Fix request_key() lack of Link perm check on found key
      KEYS: Replace uid/gid/perm permissions checking with an ACL
      KEYS: Provide KEYCTL_GRANT_PERMISSION
      keys: Allow a container to be specified as a subject in a key's ACL
      keys: Provide a way to ask for the container keyring
      keys: Allow containers to be included in key ACLs by name
      containers: Sample to grant access to a key in a container


 arch/x86/entry/syscalls/syscall_32.tbl             |    3 
 arch/x86/entry/syscalls/syscall_64.tbl             |    3 
 arch/x86/ia32/sys_ia32.c                           |    2 
 certs/blacklist.c                                  |    7 
 certs/system_keyring.c                             |   12 
 drivers/acpi/container.c                           |    2 
 drivers/base/container.c                           |    2 
 drivers/md/dm-crypt.c                              |    2 
 drivers/nvdimm/security.c                          |    2 
 fs/afs/security.c                                  |    2 
 fs/afs/super.c                                     |   18 +
 fs/cifs/cifs_spnego.c                              |   25 +
 fs/cifs/cifsacl.c                                  |   28 +
 fs/cifs/connect.c                                  |    4 
 fs/crypto/keyinfo.c                                |    2 
 fs/ecryptfs/ecryptfs_kernel.h                      |    2 
 fs/ecryptfs/keystore.c                             |    2 
 fs/fs_context.c                                    |   39 +
 fs/fscache/object-list.c                           |    2 
 fs/fsopen.c                                        |   54 ++
 fs/namei.c                                         |   45 +-
 fs/namespace.c                                     |  129 ++++-
 fs/nfs/nfs4idmap.c                                 |   29 +
 fs/proc/root.c                                     |   20 +
 fs/ubifs/auth.c                                    |    2 
 include/linux/container.h                          |  100 +++-
 include/linux/container_dev.h                      |   25 +
 include/linux/cred.h                               |    3 
 include/linux/fs_context.h                         |    5 
 include/linux/init_task.h                          |    1 
 include/linux/key-type.h                           |    2 
 include/linux/key.h                                |  122 +++--
 include/linux/lsm_hooks.h                          |   20 +
 include/linux/nsproxy.h                            |    7 
 include/linux/pid.h                                |    5 
 include/linux/proc_ns.h                            |    6 
 include/linux/sched.h                              |    3 
 include/linux/sched/task.h                         |    3 
 include/linux/security.h                           |   15 +
 include/linux/socket.h                             |    3 
 include/linux/syscalls.h                           |    6 
 include/uapi/linux/container.h                     |   28 +
 include/uapi/linux/keyctl.h                        |   85 +++
 include/uapi/linux/mount.h                         |    4 
 init/Kconfig                                       |    7 
 init/init_task.c                                   |    3 
 ipc/mqueue.c                                       |   10 
 kernel/Makefile                                    |    2 
 kernel/container.c                                 |  532 ++++++++++++++++++++
 kernel/cred.c                                      |   45 ++
 kernel/exit.c                                      |    1 
 kernel/fork.c                                      |  111 ++++
 kernel/namespaces.h                                |   15 +
 kernel/nsproxy.c                                   |   32 +
 kernel/pid.c                                       |    4 
 kernel/sys_ni.c                                    |    5 
 lib/digsig.c                                       |    2 
 net/ceph/ceph_common.c                             |    2 
 net/compat.c                                       |    2 
 net/dns_resolver/dns_key.c                         |   12 
 net/dns_resolver/dns_query.c                       |   15 -
 net/rxrpc/key.c                                    |   16 -
 net/socket.c                                       |   34 +
 samples/vfs/Makefile                               |   12 
 samples/vfs/test-cont-grant.c                      |   84 +++
 samples/vfs/test-container.c                       |  382 ++++++++++++++
 samples/vfs/test-upcall.c                          |  243 +++++++++
 security/integrity/digsig.c                        |   31 -
 security/integrity/digsig_asymmetric.c             |    2 
 security/integrity/evm/evm_crypto.c                |    2 
 security/integrity/ima/ima_mok.c                   |   13 
 security/integrity/integrity.h                     |    4 
 .../integrity/platform_certs/platform_keyring.c    |   13 
 security/keys/Makefile                             |    2 
 security/keys/compat.c                             |   20 +
 security/keys/container.c                          |  419 ++++++++++++++++
 security/keys/encrypted-keys/encrypted.c           |    2 
 security/keys/encrypted-keys/masterkey_trusted.c   |    2 
 security/keys/gc.c                                 |    2 
 security/keys/internal.h                           |   34 +
 security/keys/key.c                                |   35 -
 security/keys/keyctl.c                             |  176 +++++--
 security/keys/keyring.c                            |  198 ++++++-
 security/keys/permission.c                         |  446 +++++++++++++++--
 security/keys/persistent.c                         |   27 +
 security/keys/proc.c                               |   17 -
 security/keys/process_keys.c                       |  102 +++-
 security/keys/request_key.c                        |   70 ++-
 security/keys/request_key_auth.c                   |   21 +
 security/security.c                                |   12 
 security/selinux/hooks.c                           |   16 +
 security/smack/smack_lsm.c                         |    3 
 92 files changed, 3696 insertions(+), 425 deletions(-)
 create mode 100644 include/linux/container_dev.h
 create mode 100644 include/uapi/linux/container.h
 create mode 100644 kernel/container.c
 create mode 100644 kernel/namespaces.h
 create mode 100644 samples/vfs/test-cont-grant.c
 create mode 100644 samples/vfs/test-container.c
 create mode 100644 samples/vfs/test-upcall.c
 create mode 100644 security/keys/container.c



* [RFC PATCH 01/27] containers: Rename linux/container.h to linux/container_dev.h
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
@ 2019-02-15 16:07 ` David Howells
  2019-02-15 16:07 ` [RFC PATCH 02/27] containers: Implement containers as kernel objects David Howells
                   ` (28 subsequent siblings)
  29 siblings, 0 replies; 61+ messages in thread
From: David Howells @ 2019-02-15 16:07 UTC
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

Rename linux/container.h to linux/container_dev.h so that linux/container.h
can be used for containers.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 drivers/acpi/container.c      |    2 +-
 drivers/base/container.c      |    2 +-
 include/linux/container.h     |   25 -------------------------
 include/linux/container_dev.h |   25 +++++++++++++++++++++++++
 4 files changed, 27 insertions(+), 27 deletions(-)
 delete mode 100644 include/linux/container.h
 create mode 100644 include/linux/container_dev.h

diff --git a/drivers/acpi/container.c b/drivers/acpi/container.c
index 12c240903c18..435db0694405 100644
--- a/drivers/acpi/container.c
+++ b/drivers/acpi/container.c
@@ -23,7 +23,7 @@
  * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  */
 #include <linux/acpi.h>
-#include <linux/container.h>
+#include <linux/container_dev.h>
 
 #include "internal.h"
 
diff --git a/drivers/base/container.c b/drivers/base/container.c
index 1ba42d2d3532..1ff01ead2b2a 100644
--- a/drivers/base/container.c
+++ b/drivers/base/container.c
@@ -6,7 +6,7 @@
  * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  */
 
-#include <linux/container.h>
+#include <linux/container_dev.h>
 
 #include "base.h"
 
diff --git a/include/linux/container.h b/include/linux/container.h
deleted file mode 100644
index 3c03e6fd2035..000000000000
--- a/include/linux/container.h
+++ /dev/null
@@ -1,25 +0,0 @@
-/*
- * Definitions for container bus type.
- *
- * Copyright (C) 2013, Intel Corporation
- * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- */
-
-#include <linux/device.h>
-
-/* drivers/base/power/container.c */
-extern struct bus_type container_subsys;
-
-struct container_dev {
-	struct device dev;
-	int (*offline)(struct container_dev *cdev);
-};
-
-static inline struct container_dev *to_container_dev(struct device *dev)
-{
-	return container_of(dev, struct container_dev, dev);
-}
diff --git a/include/linux/container_dev.h b/include/linux/container_dev.h
new file mode 100644
index 000000000000..3c03e6fd2035
--- /dev/null
+++ b/include/linux/container_dev.h
@@ -0,0 +1,25 @@
+/*
+ * Definitions for container bus type.
+ *
+ * Copyright (C) 2013, Intel Corporation
+ * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/device.h>
+
+/* drivers/base/power/container.c */
+extern struct bus_type container_subsys;
+
+struct container_dev {
+	struct device dev;
+	int (*offline)(struct container_dev *cdev);
+};
+
+static inline struct container_dev *to_container_dev(struct device *dev)
+{
+	return container_of(dev, struct container_dev, dev);
+}



* [RFC PATCH 02/27] containers: Implement containers as kernel objects
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
  2019-02-15 16:07 ` [RFC PATCH 01/27] containers: Rename linux/container.h to linux/container_dev.h David Howells
@ 2019-02-15 16:07 ` David Howells
  2019-02-17 18:57   ` Trond Myklebust
                     ` (7 more replies)
  2019-02-15 16:07 ` [RFC PATCH 03/27] containers: Provide /proc/containers David Howells
                   ` (27 subsequent siblings)
  29 siblings, 8 replies; 61+ messages in thread
From: David Howells @ 2019-02-15 16:07 UTC
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

Implement a kernel container object such that it contains the following
things:

 (1) Namespaces.

 (2) A root directory.

 (3) A set of processes, including one designated as the 'init' process.

A container is created and attached to a file descriptor by:

	int cfd = container_create(const char *name, unsigned int flags);

This inherits all the namespaces of the parent container unless the flags
mask calls for new namespaces:

	CONTAINER_NEW_FS_NS
	CONTAINER_NEW_EMPTY_FS_NS
	CONTAINER_NEW_CGROUP_NS [root only]
	CONTAINER_NEW_UTS_NS
	CONTAINER_NEW_IPC_NS
	CONTAINER_NEW_USER_NS
	CONTAINER_NEW_PID_NS
	CONTAINER_NEW_NET_NS

Other flags include:

	CONTAINER_KILL_ON_CLOSE
	CONTAINER_CLOSE_ON_EXEC
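
As a sketch of how such a mask might be interpreted, the user-space C model
below checks a flags word and reports which namespaces would be inherited.
The flag names are the ones listed above, but the bit values here are
placeholders (the real ones are defined in include/uapi/linux/container.h,
which is not quoted in this message), and the choice of -EINVAL and -EPERM
as error codes is an assumption, as are the model_* helpers themselves.

```c
#include <errno.h>
#include <stdbool.h>

/* Placeholder bit values for illustration only; not the patch's values. */
#define CONTAINER_NEW_FS_NS		(1u << 0)
#define CONTAINER_NEW_EMPTY_FS_NS	(1u << 1)
#define CONTAINER_NEW_CGROUP_NS		(1u << 2)	/* root only */
#define CONTAINER_NEW_UTS_NS		(1u << 3)
#define CONTAINER_NEW_IPC_NS		(1u << 4)
#define CONTAINER_NEW_USER_NS		(1u << 5)
#define CONTAINER_NEW_PID_NS		(1u << 6)
#define CONTAINER_NEW_NET_NS		(1u << 7)
#define CONTAINER_KILL_ON_CLOSE		(1u << 8)
#define CONTAINER_CLOSE_ON_EXEC		(1u << 9)

#define MODEL_VALID_FLAGS		((1u << 10) - 1)

/*
 * Validate a container_create() flags word in this model.  Unknown bits are
 * rejected and a new cgroup namespace is only permitted for root.
 */
int model_check_create_flags(unsigned int flags, bool is_root)
{
	if (flags & ~MODEL_VALID_FLAGS)
		return -EINVAL;
	if ((flags & CONTAINER_NEW_CGROUP_NS) && !is_root)
		return -EPERM;
	return 0;
}

/* A namespace is inherited from the parent unless its NEW flag is set. */
bool model_inherits_ns(unsigned int flags, unsigned int new_ns_flag)
{
	return !(flags & new_ns_flag);
}
```

For example, a mask of CONTAINER_NEW_NET_NS | CONTAINER_KILL_ON_CLOSE would,
in this model, give the container a fresh network namespace and inherit
everything else from the parent.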

Note that I've added a pointer to the current container to task_struct.
This doesn't make the nsproxy pointer redundant as you can still make new
namespaces with clone().

I've also added a list_head to task_struct to form a list in the container
of its member processes.  This is convenient, but redundant since the code
could iterate over all the tasks looking for ones that have a matching
task->container.

It might make sense to use fsconfig() to configure the container:

	fsconfig(cfd, FSCONFIG_SET_NAMESPACE, "user", NULL, userns_fd);
	fsconfig(cfd, FSCONFIG_SET_NAMESPACE, "mnt", NULL, mntns_fd);
	fsconfig(cfd, FSCONFIG_SET_FD, "rootfs", NULL, root_fd);
	fsconfig(cfd, FSCONFIG_CMD_CREATE_CONTAINER, NULL, NULL, 0);


==================
FUTURE DEVELOPMENT
==================

 (1) Setting up the container.

     A container would be created with, say:

	int cfd = container_create("fred", CONTAINER_NEW_EMPTY_FS_NS);

     Once created, it should then be possible for the supervising process
     to modify the new container.  Mounts can be created inside of the
     container's namespaces:

	fsfd = fsopen("ext4", 0);
	fsconfig(fsfd, FSCONFIG_SET_CONTAINER, NULL, NULL, cfd);
	fsconfig(fsfd, FSCONFIG_SET_STRING, "source", "/dev/sda3", 0);
	fsconfig(fsfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
	fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
	mfd = fsmount(fsfd, 0, 0);

     and then mounted into the namespace:

	move_mount(mfd, "", cfd, "/",
		   MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_T_CONTAINER_ROOT);

     Further mounts can be added by:

	move_mount(mfd, "", cfd, "proc", MOVE_MOUNT_F_EMPTY_PATH);

     Files and devices can be created by supplying the container fd as the
     dirfd argument:

	mkdirat(int cfd, const char *path, mode_t mode);
	mknodat(int cfd, const char *path, mode_t mode, dev_t dev);
	int fd = openat(int cfd, const char *path,
			unsigned int flags, mode_t mode);

     [*] Note that when using cfd as dirfd, the path must not begin with
         a '/'.

     Sockets, such as netlink, can be opened inside of the container's
     namespaces:

	int fd = container_socket(int cfd, int domain, int type,
				  int protocol);

     This should allow management of the container's network namespace from
     outside.

 (2) Starting the container.

     Once all modifications are complete, the container's 'init' process
     can be started by:

	fork_into_container(int cfd);

     This precludes further external modification of the mount tree within
     the container.  Before this point, the container is simply destroyed
     if the container fd is closed.

 (3) Waiting for the container to complete.

     The container fd can then be polled to wait for the init process
     therein to complete and the exit code collected by:

	container_wait(int container_fd, int *_wstatus, unsigned int wait,
		       struct rusage *rusage);

     The container and everything in it can be terminated or killed off:

	container_kill(int container_fd, int initonly, int signal);

     If 'init' dies, all other processes in the container are preemptively
     SIGKILL'd by the kernel.

     By default, if the container is active and its fd is closed, the
     container is left running and will be cleaned up when its 'init' exits.
     The default can be changed with the CONTAINER_KILL_ON_CLOSE flag.

 (4) Supervising the container.

     Given that we have an fd attached to the container, we could make it
     such that the supervising process could monitor and override EPERM
     returns for mount and other privileged operations within the
     container.

 (5) Per-container keyring.

     Each container can point to a per-container keyring for the holding of
     integrity keys and filesystem keys for use inside the container.  This
     would be attached:

	keyctl(KEYCTL_SET_CONTAINER_KEYRING, cfd, keyring)

     This keyring would be searched by request_key() after it has searched
     the thread, process and session keyrings.

 (6) Running different LSM policies by container.  This might particularly
     make sense with something like AppArmor, where the path-based rules
     required inside a container might differ from those in the parent.
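
The lifecycle rules in (2) and (3) above can be summarised as a small state
model.  The sketch below is user-space C, not code from the patch: the three
bit numbers match the CONTAINER_FLAG_* definitions in struct container, but
the model_* names are hypothetical, and the handling of a close after 'init'
has already died is an assumption, since the text does not spell that case
out.

```c
#include <stdbool.h>

/* Bit numbers as defined alongside struct container in this patch. */
#define CONTAINER_FLAG_INIT_STARTED	0	/* Init is started */
#define CONTAINER_FLAG_DEAD		1	/* Init has died */
#define CONTAINER_FLAG_KILL_ON_CLOSE	2	/* Kill init if handle closed */

/* What closing the container fd leads to in this model. */
enum model_close_action {
	MODEL_DESTROY,		/* Never started: the container is destroyed */
	MODEL_CLEAN_UP,		/* Init already dead (assumed behaviour) */
	MODEL_KILL,		/* Running and KILL_ON_CLOSE was requested */
	MODEL_LEAVE_RUNNING,	/* Running: cleaned up when init exits */
};

static bool model_test_flag(unsigned long flags, int bit)
{
	return ((flags >> bit) & 1UL) != 0;
}

/* External changes to the mount tree are only allowed before init starts. */
bool model_may_modify_mounts(unsigned long flags)
{
	return !model_test_flag(flags, CONTAINER_FLAG_INIT_STARTED);
}

/* Decide what happens when the supervising process closes the container fd. */
enum model_close_action model_on_close(unsigned long flags)
{
	if (!model_test_flag(flags, CONTAINER_FLAG_INIT_STARTED))
		return MODEL_DESTROY;
	if (model_test_flag(flags, CONTAINER_FLAG_DEAD))
		return MODEL_CLEAN_UP;
	if (model_test_flag(flags, CONTAINER_FLAG_KILL_ON_CLOSE))
		return MODEL_KILL;
	return MODEL_LEAVE_RUNNING;
}
```

The order of the checks matters: a container that was never started is
destroyed on close regardless of whether kill-on-close was requested.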

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 fs/namespace.c                         |    5 
 include/linux/container.h              |   86 ++++++++
 include/linux/init_task.h              |    1 
 include/linux/lsm_hooks.h              |   20 ++
 include/linux/sched.h                  |    3 
 include/linux/security.h               |   15 +
 include/linux/syscalls.h               |    3 
 include/uapi/linux/container.h         |   28 +++
 init/Kconfig                           |    7 +
 init/init_task.c                       |    3 
 kernel/Makefile                        |    2 
 kernel/container.c                     |  348 ++++++++++++++++++++++++++++++++
 kernel/exit.c                          |    1 
 kernel/fork.c                          |    7 +
 kernel/namespaces.h                    |   15 +
 kernel/nsproxy.c                       |   23 +-
 kernel/sys_ni.c                        |    3 
 security/security.c                    |   12 +
 20 files changed, 571 insertions(+), 13 deletions(-)
 create mode 100644 include/linux/container.h
 create mode 100644 include/uapi/linux/container.h
 create mode 100644 kernel/container.c
 create mode 100644 kernel/namespaces.h

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index c9db9d51a7df..3564814a5d21 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -407,3 +407,4 @@
 393	i386	fsinfo			sys_fsinfo			__ia32_sys_fsinfo
 394	i386	mount_notify		sys_mount_notify		__ia32_sys_mount_notify
 395	i386	sb_notify		sys_sb_notify			__ia32_sys_sb_notify
+396	i386	container_create	sys_container_create		__ia32_sys_container_create
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 17869bf7788a..aa6cccbe5271 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -352,6 +352,7 @@
 341	common	fsinfo			__x64_sys_fsinfo
 342	common	mount_notify		__x64_sys_mount_notify
 343	common	sb_notify		__x64_sys_sb_notify
+344	common	container_create	__x64_sys_container_create
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/namespace.c b/fs/namespace.c
index f378cfc63043..ea005f55ec4c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -30,6 +30,7 @@
 #include <uapi/linux/mount.h>
 #include <linux/fs_context.h>
 #include <linux/fsinfo.h>
+#include <linux/container.h>
 
 #include "pnode.h"
 #include "internal.h"
@@ -3742,6 +3743,10 @@ static void __init init_mount_tree(void)
 
 	set_fs_pwd(current->fs, &root);
 	set_fs_root(current->fs, &root);
+#ifdef CONFIG_CONTAINERS
+	path_get(&root);
+	init_container.root = root;
+#endif
 }
 
 void __init mnt_init(void)
diff --git a/include/linux/container.h b/include/linux/container.h
new file mode 100644
index 000000000000..0a8918435097
--- /dev/null
+++ b/include/linux/container.h
@@ -0,0 +1,86 @@
+/* Container objects
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _LINUX_CONTAINER_H
+#define _LINUX_CONTAINER_H
+
+#include <uapi/linux/container.h>
+#include <linux/refcount.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/wait.h>
+#include <linux/path.h>
+#include <linux/seqlock.h>
+
+struct fs_struct;
+struct nsproxy;
+struct task_struct;
+
+/*
+ * The container object.
+ */
+struct container {
+	char			name[24];
+	u64			id;		/* Container ID */
+	refcount_t		usage;
+	int			exit_code;	/* The exit code of 'init' */
+	const struct cred	*cred;		/* Creds for this container, including userns */
+	struct nsproxy		*ns;		/* This container's namespaces */
+	struct path		root;		/* The root of the container's fs namespace */
+	struct task_struct	*init;		/* The 'init' task for this container */
+	struct container	*parent;	/* Parent of this container. */
+	void			*security;	/* LSM data */
+	struct list_head	members;	/* Member processes, guarded with ->lock */
+	struct list_head	child_link;	/* Link in parent->children */
+	struct list_head	children;	/* Child containers */
+	wait_queue_head_t	waitq;		/* Someone waiting for init to exit waits here */
+	unsigned long		flags;
+#define CONTAINER_FLAG_INIT_STARTED	0	/* Init is started - certain ops now prohibited */
+#define CONTAINER_FLAG_DEAD		1	/* Init has died */
+#define CONTAINER_FLAG_KILL_ON_CLOSE	2	/* Kill init if container handle closed */
+	spinlock_t		lock;
+	seqcount_t		seq;		/* Track changes in ->root */
+};
+
+extern struct container init_container;
+
+#ifdef CONFIG_CONTAINERS
+extern const struct file_operations container_fops;
+
+extern int copy_container(unsigned long flags, struct task_struct *tsk,
+			  struct container *container);
+extern void exit_container(struct task_struct *tsk);
+extern void put_container(struct container *c);
+
+static inline struct container *get_container(struct container *c)
+{
+	refcount_inc(&c->usage);
+	return c;
+}
+
+static inline bool is_container_file(struct file *file)
+{
+	return file->f_op == &container_fops;
+}
+
+#else
+
+static inline int copy_container(unsigned long flags, struct task_struct *tsk,
+				 struct container *container)
+{ return 0; }
+static inline void exit_container(struct task_struct *tsk) { }
+static inline void put_container(struct container *c) {}
+static inline struct container *get_container(struct container *c) { return NULL; }
+static inline bool is_container_file(struct file *file) { return false; }
+
+#endif /* CONFIG_CONTAINERS */
+
+#endif /* _LINUX_CONTAINER_H */
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index a7083a45a26c..f016cadece24 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -10,6 +10,7 @@
 #include <linux/ipc.h>
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
+#include <linux/container.h>
 #include <linux/securebits.h>
 #include <linux/seqlock.h>
 #include <linux/rbtree.h>
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 52d0f3f4c786..0f310d911815 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1460,6 +1460,16 @@
  * @bpf_prog_free_security:
  *	Clean up the security information stored inside bpf prog.
  *
+ * Security hooks for containers:
+ *
+ * @container_alloc:
+ *	Permit creation of a new container and assign security data.
+ *	@container: The new container.
+ *
+ * @container_free:
+ *	Free security data attached to a container.
+ *	@container: The container.
+ *
  */
 union security_list_options {
 	int (*binder_set_context_mgr)(struct task_struct *mgr);
@@ -1825,6 +1835,12 @@ union security_list_options {
 	int (*bpf_prog_alloc_security)(struct bpf_prog_aux *aux);
 	void (*bpf_prog_free_security)(struct bpf_prog_aux *aux);
 #endif /* CONFIG_BPF_SYSCALL */
+
+	/* Container management security hooks */
+#ifdef CONFIG_CONTAINERS
+	int (*container_alloc)(struct container *container, unsigned int flags);
+	void (*container_free)(struct container *container);
+#endif
 };
 
 struct security_hook_heads {
@@ -2069,6 +2085,10 @@ struct security_hook_heads {
 	struct hlist_head bpf_prog_alloc_security;
 	struct hlist_head bpf_prog_free_security;
 #endif /* CONFIG_BPF_SYSCALL */
+#ifdef CONFIG_CONTAINERS
+	struct hlist_head container_alloc;
+	struct hlist_head container_free;
+#endif /* CONFIG_CONTAINERS */
 } __randomize_layout;
 
 /*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2f90fa92468..073a3a930514 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -36,6 +36,7 @@ struct backing_dev_info;
 struct bio_list;
 struct blk_plug;
 struct cfs_rq;
+struct container;
 struct fs_struct;
 struct futex_pi_state;
 struct io_context;
@@ -870,6 +871,8 @@ struct task_struct {
 
 	/* Namespaces: */
 	struct nsproxy			*nsproxy;
+	struct container		*container;
+	struct list_head		container_link;
 
 	/* Signal handlers: */
 	struct signal_struct		*signal;
diff --git a/include/linux/security.h b/include/linux/security.h
index da538c06766f..acd0c14c6e95 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -70,6 +70,7 @@ struct ctl_table;
 struct audit_krule;
 struct user_namespace;
 struct timezone;
+struct container;
 
 enum lsm_event {
 	LSM_POLICY_CHANGE,
@@ -1751,6 +1752,20 @@ static inline void security_audit_rule_free(void *lsmrule)
 #endif /* CONFIG_SECURITY */
 #endif /* CONFIG_AUDIT */
 
+#ifdef CONFIG_CONTAINERS
+#ifdef CONFIG_SECURITY
+int security_container_alloc(struct container *container, unsigned int flags);
+void security_container_free(struct container *container);
+#else
+static inline int security_container_alloc(struct container *container,
+					   unsigned int flags)
+{
+	return 0;
+}
+static inline void security_container_free(struct container *container) {}
+#endif
+#endif /* CONFIG_CONTAINERS */
+
 #ifdef CONFIG_SECURITYFS
 
 extern struct dentry *securityfs_create_file(const char *name, umode_t mode,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 10127b1d923b..dac42098c2dd 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -943,6 +943,9 @@ asmlinkage long sys_mount_notify(int dfd, const char __user *path,
 				 unsigned int at_flags, int watch_fd, int watch_id);
 asmlinkage long sys_sb_notify(int dfd, const char __user *path,
 			      unsigned int at_flags, int watch_fd, int watch_id);
+asmlinkage long sys_container_create(const char __user *name, unsigned int flags,
+				     unsigned long spare3, unsigned long spare4,
+				     unsigned long spare5);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/container.h b/include/uapi/linux/container.h
new file mode 100644
index 000000000000..43748099b28d
--- /dev/null
+++ b/include/uapi/linux/container.h
@@ -0,0 +1,28 @@
+/* Container UAPI
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _UAPI_LINUX_CONTAINER_H
+#define _UAPI_LINUX_CONTAINER_H
+
+
+#define CONTAINER_NEW_FS_NS		0x00000001 /* Dup current fs namespace */
+#define CONTAINER_NEW_EMPTY_FS_NS	0x00000002 /* Provide new empty fs namespace */
+#define CONTAINER_NEW_CGROUP_NS		0x00000004 /* Dup current cgroup namespace */
+#define CONTAINER_NEW_UTS_NS		0x00000008 /* Dup current uts namespace */
+#define CONTAINER_NEW_IPC_NS		0x00000010 /* Dup current ipc namespace */
+#define CONTAINER_NEW_USER_NS		0x00000020 /* Dup current user namespace */
+#define CONTAINER_NEW_PID_NS		0x00000040 /* Dup current pid namespace */
+#define CONTAINER_NEW_NET_NS		0x00000080 /* Dup current net namespace */
+#define CONTAINER_KILL_ON_CLOSE		0x00000100 /* Kill all member processes when fd closed */
+#define CONTAINER_FD_CLOEXEC		0x00000200 /* Close the fd on exec */
+#define CONTAINER__FLAG_MASK		0x000003ff
+
+#endif /* _UAPI_LINUX_CONTAINER_H */
diff --git a/init/Kconfig b/init/Kconfig
index 5984dd7f2156..ab37c3a55aa1 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -992,6 +992,13 @@ config NET_NS
 	  Allow user space to create what appear to be multiple instances
 	  of the network stack.
 
+config CONTAINERS
+	bool "Container support"
+	default y
+	help
+	  Allow userspace to create and manipulate containers as objects that
+	  have namespaces and hold a set of processes.
+
 endif # NAMESPACES
 
 config CHECKPOINT_RESTORE
diff --git a/init/init_task.c b/init/init_task.c
index 5aebe3be4d7c..90c7439a195b 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -108,6 +108,9 @@ struct task_struct init_task
 	.signal		= &init_signals,
 	.sighand	= &init_sighand,
 	.nsproxy	= &init_nsproxy,
+	.container	= &init_container,
+	.container_link.next = &init_container.members,
+	.container_link.prev = &init_container.members,
 	.pending	= {
 		.list = LIST_HEAD_INIT(init_task.pending.list),
 		.signal = {{0}}
diff --git a/kernel/Makefile b/kernel/Makefile
index 6aa7543bcdb2..98cdd18cecef 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -8,7 +8,7 @@ obj-y     = fork.o exec_domain.o panic.o \
 	    sysctl.o sysctl_binary.o capability.o ptrace.o user.o \
 	    signal.o sys.o umh.o workqueue.o pid.o task_work.o \
 	    extable.o params.o \
-	    kthread.o sys_ni.o nsproxy.o \
+	    kthread.o sys_ni.o nsproxy.o container.o \
 	    notifier.o ksysfs.o cred.o reboot.o \
 	    async.o range.o smpboot.o ucount.o
 
diff --git a/kernel/container.c b/kernel/container.c
new file mode 100644
index 000000000000..ca4012632cfa
--- /dev/null
+++ b/kernel/container.c
@@ -0,0 +1,348 @@
+/* Implement container objects.
+ *
+ * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/poll.h>
+#include <linux/wait.h>
+#include <linux/init_task.h>
+#include <linux/fs.h>
+#include <linux/fs_struct.h>
+#include <linux/anon_inodes.h>
+#include <linux/container.h>
+#include <linux/syscalls.h>
+#include <linux/printk.h>
+#include <linux/security.h>
+#include "namespaces.h"
+
+struct container init_container = {
+	.name		= ".init",
+	.id		= 1,
+	.usage		= REFCOUNT_INIT(2),
+	.cred		= &init_cred,
+	.ns		= &init_nsproxy,
+	.init		= &init_task,
+	.members.next	= &init_task.container_link,
+	.members.prev	= &init_task.container_link,
+	.children	= LIST_HEAD_INIT(init_container.children),
+	.flags		= (1 << CONTAINER_FLAG_INIT_STARTED),
+	.lock		= __SPIN_LOCK_UNLOCKED(init_container.lock),
+	.seq		= SEQCNT_ZERO(init_container.seq),
+};
+
+#ifdef CONFIG_CONTAINERS
+
+static atomic64_t container_id_counter = ATOMIC64_INIT(1);
+
+/*
+ * Drop a ref on a container and clear it if no longer in use.
+ */
+void put_container(struct container *c)
+{
+	struct container *parent;
+
+	while (c && refcount_dec_and_test(&c->usage)) {
+		BUG_ON(!list_empty(&c->members));
+		if (c->ns)
+			put_nsproxy(c->ns);
+		path_put(&c->root);
+
+		parent = c->parent;
+		if (parent) {
+			spin_lock(&parent->lock);
+			list_del(&c->child_link);
+			spin_unlock(&parent->lock);
+		}
+
+		if (c->cred)
+			put_cred(c->cred);
+		security_container_free(c);
+		kfree(c);
+		c = parent;
+	}
+}
+
+/*
+ * Allow the user to poll for the container dying.
+ */
+static unsigned int container_poll(struct file *file, poll_table *wait)
+{
+	struct container *container = file->private_data;
+	unsigned int mask = 0;
+
+	poll_wait(file, &container->waitq, wait);
+
+	if (test_bit(CONTAINER_FLAG_DEAD, &container->flags))
+		mask |= POLLHUP;
+
+	return mask;
+}
+
+static int container_release(struct inode *inode, struct file *file)
+{
+	struct container *container = file->private_data;
+
+	put_container(container);
+	return 0;
+}
+
+const struct file_operations container_fops = {
+	.poll		= container_poll,
+	.release	= container_release,
+};
+
+/*
+ * Handle fork/clone.
+ *
+ * A process inherits its parent's container.  The first process into the
+ * container is its 'init' process and the life of everything else in there is
+ * dependent upon that.
+ */
+int copy_container(unsigned long flags, struct task_struct *tsk,
+		   struct container *container)
+{
+	struct container *c = container ?: tsk->container;
+	int ret = -ECANCELED;
+
+	spin_lock(&c->lock);
+
+	if (!test_bit(CONTAINER_FLAG_DEAD, &c->flags)) {
+		list_add_tail(&tsk->container_link, &c->members);
+		get_container(c);
+		tsk->container = c;
+		if (!c->init) {
+			set_bit(CONTAINER_FLAG_INIT_STARTED, &c->flags);
+			c->init = tsk;
+		}
+		ret = 0;
+	}
+
+	spin_unlock(&c->lock);
+	return ret;
+}
+
+/*
+ * Remove a dead process from a container.
+ *
+ * If the 'init' process in a container dies, we kill off all the other
+ * processes in the container.
+ */
+void exit_container(struct task_struct *tsk)
+{
+	struct task_struct *p;
+	struct container *c = tsk->container;
+	struct kernel_siginfo si = {
+		.si_signo = SIGKILL,
+		.si_code  = SI_KERNEL,
+	};
+
+	spin_lock(&c->lock);
+
+	list_del(&tsk->container_link);
+
+	if (c->init == tsk) {
+		c->init = NULL;
+		c->exit_code = tsk->exit_code;
+		smp_wmb(); /* Order exit_code vs CONTAINER_FLAG_DEAD. */
+		set_bit(CONTAINER_FLAG_DEAD, &c->flags);
+		wake_up_bit(&c->flags, CONTAINER_FLAG_DEAD);
+
+		list_for_each_entry(p, &c->members, container_link) {
+			si.si_pid = task_tgid_vnr(p);
+			send_sig_info(SIGKILL, &si, p);
+		}
+	}
+
+	spin_unlock(&c->lock);
+	put_container(c);
+}
+
+/*
+ * Allocate a container.
+ */
+static struct container *alloc_container(const char __user *name)
+{
+	struct container *c;
+	long len;
+	int ret;
+
+	c = kzalloc(sizeof(struct container), GFP_KERNEL);
+	if (!c)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&c->members);
+	INIT_LIST_HEAD(&c->children);
+	init_waitqueue_head(&c->waitq);
+	spin_lock_init(&c->lock);
+	refcount_set(&c->usage, 1);
+
+	ret = -EFAULT;
+	len = strncpy_from_user(c->name, name, sizeof(c->name));
+	if (len < 0)
+		goto err;
+	ret = -ENAMETOOLONG;
+	if (len >= sizeof(c->name))
+		goto err;
+	ret = -EINVAL;
+	if (strchr(c->name, '/'))
+		goto err;
+
+	c->name[len] = 0;
+	return c;
+
+err:
+	kfree(c);
+	return ERR_PTR(ret);
+}
+
+/*
+ * Create some creds for the container.  We don't want to pin things we don't
+ * have to, so drop all keyrings from the new cred.  The LSM gets to audit the
+ * cred struct when security_container_alloc() is invoked.
+ */
+static const struct cred *create_container_creds(unsigned int flags)
+{
+	struct cred *new;
+	int ret;
+
+	new = prepare_creds();
+	if (!new)
+		return ERR_PTR(-ENOMEM);
+
+#ifdef CONFIG_KEYS
+	key_put(new->thread_keyring);
+	new->thread_keyring = NULL;
+	key_put(new->process_keyring);
+	new->process_keyring = NULL;
+	key_put(new->session_keyring);
+	new->session_keyring = NULL;
+	key_put(new->request_key_auth);
+	new->request_key_auth = NULL;
+#endif
+
+	if (flags & CONTAINER_NEW_USER_NS) {
+		ret = create_user_ns(new);
+		if (ret < 0)
+			goto err;
+		new->euid = new->user_ns->owner;
+		new->egid = new->user_ns->group;
+	}
+
+	new->fsuid = new->suid = new->uid = new->euid;
+	new->fsgid = new->sgid = new->gid = new->egid;
+	return new;
+
+err:
+	abort_creds(new);
+	return ERR_PTR(ret);
+}
+
+/*
+ * Create a new container.
+ */
+static struct container *create_container(const char __user *name, unsigned int flags)
+{
+	struct container *parent, *c;
+	struct fs_struct *fs;
+	struct nsproxy *ns;
+	const struct cred *cred;
+	int ret;
+
+	c = alloc_container(name);
+	if (IS_ERR(c))
+		return c;
+
+	if (flags & CONTAINER_KILL_ON_CLOSE)
+		__set_bit(CONTAINER_FLAG_KILL_ON_CLOSE, &c->flags);
+
+	cred = create_container_creds(flags);
+	if (IS_ERR(cred)) {
+		ret = PTR_ERR(cred);
+		goto err_cont;
+	}
+	c->cred = cred;
+
+	ret = -ENOMEM;
+	fs = copy_fs_struct(current->fs);
+	if (!fs)
+		goto err_cont;
+
+	ns = create_new_namespaces(
+		(flags & CONTAINER_NEW_FS_NS	 ? CLONE_NEWNS : 0) |
+		(flags & CONTAINER_NEW_CGROUP_NS ? CLONE_NEWCGROUP : 0) |
+		(flags & CONTAINER_NEW_UTS_NS	 ? CLONE_NEWUTS : 0) |
+		(flags & CONTAINER_NEW_IPC_NS	 ? CLONE_NEWIPC : 0) |
+		(flags & CONTAINER_NEW_PID_NS	 ? CLONE_NEWPID : 0) |
+		(flags & CONTAINER_NEW_NET_NS	 ? CLONE_NEWNET : 0),
+		current->nsproxy, cred->user_ns, fs);
+	if (IS_ERR(ns)) {
+		ret = PTR_ERR(ns);
+		goto err_fs;
+	}
+
+	c->ns = ns;
+	c->root = fs->root;
+	c->seq = fs->seq;
+	fs->root.mnt = NULL;
+	fs->root.dentry = NULL;
+
+	ret = security_container_alloc(c, flags);
+	if (ret < 0)
+		goto err_fs;
+
+	parent = current->container;
+	get_container(parent);
+	c->parent = parent;
+	c->id = atomic64_inc_return(&container_id_counter);
+	spin_lock(&parent->lock);
+	list_add_tail(&c->child_link, &parent->children);
+	spin_unlock(&parent->lock);
+	return c;
+
+err_fs:
+	free_fs_struct(fs);
+err_cont:
+	put_container(c);
+	return ERR_PTR(ret);
+}
+
+/*
+ * Create a new container object.
+ */
+SYSCALL_DEFINE5(container_create,
+		const char __user *, name,
+		unsigned int, flags,
+		unsigned long, spare3,
+		unsigned long, spare4,
+		unsigned long, spare5)
+{
+	struct container *c;
+	int fd;
+
+	if (!name ||
+	    flags & ~CONTAINER__FLAG_MASK ||
+	    spare3 != 0 || spare4 != 0 || spare5 != 0)
+		return -EINVAL;
+	if ((flags & (CONTAINER_NEW_FS_NS | CONTAINER_NEW_EMPTY_FS_NS)) ==
+	    (CONTAINER_NEW_FS_NS | CONTAINER_NEW_EMPTY_FS_NS))
+		return -EINVAL;
+
+	c = create_container(name, flags);
+	if (IS_ERR(c))
+		return PTR_ERR(c);
+
+	fd = anon_inode_getfd("container", &container_fops, c,
+			      O_RDWR | (flags & CONTAINER_FD_CLOEXEC ? O_CLOEXEC : 0));
+	if (fd < 0)
+		put_container(c);
+	return fd;
+}
+
+#endif /* CONFIG_CONTAINERS */
diff --git a/kernel/exit.c b/kernel/exit.c
index 284f2fe9a293..78f6065ad799 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -864,6 +864,7 @@ void __noreturn do_exit(long code)
 	if (group_dead)
 		disassociate_ctty(1);
 	exit_task_namespaces(tsk);
+	exit_container(tsk);
 	exit_task_work(tsk);
 	exit_thread(tsk);
 	exit_umh(tsk);
diff --git a/kernel/fork.c b/kernel/fork.c
index b69248e6f0e0..009cf7e63894 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1920,9 +1920,12 @@ static __latent_entropy struct task_struct *copy_process(
 	retval = copy_namespaces(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_mm;
-	retval = copy_io(clone_flags, p);
+	retval = copy_container(clone_flags, p, NULL);
 	if (retval)
 		goto bad_fork_cleanup_namespaces;
+	retval = copy_io(clone_flags, p);
+	if (retval)
+		goto bad_fork_cleanup_container;
 	retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
 	if (retval)
 		goto bad_fork_cleanup_io;
@@ -2121,6 +2124,8 @@ static __latent_entropy struct task_struct *copy_process(
 bad_fork_cleanup_io:
 	if (p->io_context)
 		exit_io_context(p);
+bad_fork_cleanup_container:
+	exit_container(p);
 bad_fork_cleanup_namespaces:
 	exit_task_namespaces(p);
 bad_fork_cleanup_mm:
diff --git a/kernel/namespaces.h b/kernel/namespaces.h
new file mode 100644
index 000000000000..c44e3cf0e254
--- /dev/null
+++ b/kernel/namespaces.h
@@ -0,0 +1,15 @@
+/* Local namespaces defs
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+extern struct nsproxy *create_new_namespaces(unsigned long flags,
+					     struct nsproxy *nsproxy,
+					     struct user_namespace *user_ns,
+					     struct fs_struct *new_fs);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index f6c5d330059a..4bb5184b3a80 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -27,6 +27,7 @@
 #include <linux/syscalls.h>
 #include <linux/cgroup.h>
 #include <linux/perf_event.h>
+#include "namespaces.h"
 
 static struct kmem_cache *nsproxy_cachep;
 
@@ -61,8 +62,8 @@ static inline struct nsproxy *create_nsproxy(void)
  * Return the newly created nsproxy.  Do not attach this to the task,
  * leave it to the caller to do proper locking and attach it to task.
  */
-static struct nsproxy *create_new_namespaces(unsigned long flags,
-	struct task_struct *tsk, struct user_namespace *user_ns,
+struct nsproxy *create_new_namespaces(unsigned long flags,
+	struct nsproxy *nsproxy, struct user_namespace *user_ns,
 	struct fs_struct *new_fs)
 {
 	struct nsproxy *new_nsp;
@@ -72,39 +73,39 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 	if (!new_nsp)
 		return ERR_PTR(-ENOMEM);
 
-	new_nsp->mnt_ns = copy_mnt_ns(flags, tsk->nsproxy->mnt_ns, user_ns, new_fs);
+	new_nsp->mnt_ns = copy_mnt_ns(flags, nsproxy->mnt_ns, user_ns, new_fs);
 	if (IS_ERR(new_nsp->mnt_ns)) {
 		err = PTR_ERR(new_nsp->mnt_ns);
 		goto out_ns;
 	}
 
-	new_nsp->uts_ns = copy_utsname(flags, user_ns, tsk->nsproxy->uts_ns);
+	new_nsp->uts_ns = copy_utsname(flags, user_ns, nsproxy->uts_ns);
 	if (IS_ERR(new_nsp->uts_ns)) {
 		err = PTR_ERR(new_nsp->uts_ns);
 		goto out_uts;
 	}
 
-	new_nsp->ipc_ns = copy_ipcs(flags, user_ns, tsk->nsproxy->ipc_ns);
+	new_nsp->ipc_ns = copy_ipcs(flags, user_ns, nsproxy->ipc_ns);
 	if (IS_ERR(new_nsp->ipc_ns)) {
 		err = PTR_ERR(new_nsp->ipc_ns);
 		goto out_ipc;
 	}
 
 	new_nsp->pid_ns_for_children =
-		copy_pid_ns(flags, user_ns, tsk->nsproxy->pid_ns_for_children);
+		copy_pid_ns(flags, user_ns, nsproxy->pid_ns_for_children);
 	if (IS_ERR(new_nsp->pid_ns_for_children)) {
 		err = PTR_ERR(new_nsp->pid_ns_for_children);
 		goto out_pid;
 	}
 
 	new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
-					    tsk->nsproxy->cgroup_ns);
+					    nsproxy->cgroup_ns);
 	if (IS_ERR(new_nsp->cgroup_ns)) {
 		err = PTR_ERR(new_nsp->cgroup_ns);
 		goto out_cgroup;
 	}
 
-	new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
+	new_nsp->net_ns = copy_net_ns(flags, user_ns, nsproxy->net_ns);
 	if (IS_ERR(new_nsp->net_ns)) {
 		err = PTR_ERR(new_nsp->net_ns);
 		goto out_net;
@@ -162,7 +163,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
 		(CLONE_NEWIPC | CLONE_SYSVSEM)) 
 		return -EINVAL;
 
-	new_ns = create_new_namespaces(flags, tsk, user_ns, tsk->fs);
+	new_ns = create_new_namespaces(flags, tsk->nsproxy, user_ns, tsk->fs);
 	if (IS_ERR(new_ns))
 		return  PTR_ERR(new_ns);
 
@@ -203,7 +204,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
 	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 
-	*new_nsp = create_new_namespaces(unshare_flags, current, user_ns,
+	*new_nsp = create_new_namespaces(unshare_flags, current->nsproxy, user_ns,
 					 new_fs ? new_fs : current->fs);
 	if (IS_ERR(*new_nsp)) {
 		err = PTR_ERR(*new_nsp);
@@ -251,7 +252,7 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype)
 	if (nstype && (ns->ops->type != nstype))
 		goto out;
 
-	new_nsproxy = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
+	new_nsproxy = create_new_namespaces(0, tsk->nsproxy, current_user_ns(), tsk->fs);
 	if (IS_ERR(new_nsproxy)) {
 		err = PTR_ERR(new_nsproxy);
 		goto out;
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index a4e7131b2509..f0455cbb91cf 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -136,6 +136,9 @@ COND_SYSCALL(acct);
 COND_SYSCALL(capget);
 COND_SYSCALL(capset);
 
+/* kernel/container.c */
+COND_SYSCALL(container_create);
+
 /* kernel/exec_domain.c */
 
 /* kernel/exit.c */
diff --git a/security/security.c b/security/security.c
index b49732c02e21..259be9a1746c 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1864,3 +1864,15 @@ void security_bpf_prog_free(struct bpf_prog_aux *aux)
 	call_void_hook(bpf_prog_free_security, aux);
 }
 #endif /* CONFIG_BPF_SYSCALL */
+
+#ifdef CONFIG_CONTAINERS
+int security_container_alloc(struct container *container, unsigned int flags)
+{
+	return call_int_hook(container_alloc, 0, container, flags);
+}
+
+void security_container_free(struct container *container)
+{
+	call_void_hook(container_free, container);
+}
+#endif /* CONFIG_CONTAINERS */
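
As a reading aid for the argument checks in container_create() above, here is
a minimal userspace sketch of the same flag validation.  The flag values are
copied from include/uapi/linux/container.h in this patch; the helper name
container_flags_acceptable() is hypothetical and not part of the series.  It
mirrors only the two flag checks that return -EINVAL, not the name or
spare-argument checks.

```c
#include <stdbool.h>

/* Values copied from include/uapi/linux/container.h in this patch. */
#define CONTAINER_NEW_FS_NS		0x00000001
#define CONTAINER_NEW_EMPTY_FS_NS	0x00000002
#define CONTAINER__FLAG_MASK		0x000003ff

/*
 * Hypothetical helper (an assumption for illustration, not kernel code):
 * returns true if the flags would pass the checks made at the top of the
 * container_create() syscall.
 */
bool container_flags_acceptable(unsigned int flags)
{
	const unsigned int fs_both =
		CONTAINER_NEW_FS_NS | CONTAINER_NEW_EMPTY_FS_NS;

	/* Any bit outside the defined mask is rejected. */
	if (flags & ~CONTAINER__FLAG_MASK)
		return false;

	/* A duplicated and an empty fs namespace are mutually exclusive. */
	if ((flags & fs_both) == fs_both)
		return false;

	return true;
}
```

A caller could use this to fail early in userspace, though the kernel remains
the authority on what it accepts.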



* [RFC PATCH 03/27] containers: Provide /proc/containers
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
  2019-02-15 16:07 ` [RFC PATCH 01/27] containers: Rename linux/container.h to linux/container_dev.h David Howells
  2019-02-15 16:07 ` [RFC PATCH 02/27] containers: Implement containers as kernel objects David Howells
@ 2019-02-15 16:07 ` David Howells
  2019-02-15 16:07 ` [RFC PATCH 04/27] containers: Allow a process to be forked into a container David Howells
                   ` (26 subsequent siblings)
  29 siblings, 0 replies; 61+ messages in thread
From: David Howells @ 2019-02-15 16:07 UTC (permalink / raw)
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

Provide /proc/containers to view the current container and all the
containers created within it:

	# ./foo-container
	NAME                     USE FL OWNER GROUP
	<current>                141 01 0     0
	foo-test                   1 04 0     0

I'm not sure whether this is really desirable, though.
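
For anyone wanting to parse or reproduce the listing, the sketch below formats
one row in userspace with the same conversion string that
container_proc_show() uses in the diff.  Note that the code also prints an ID
column after the name, which the sample listing above does not show.  The
helper name format_container_row() is hypothetical, not part of the patch.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical userspace helper (an assumption for illustration): format one
 * /proc/containers row.  The format string is the one passed to seq_printf()
 * in container_proc_show().  Returns what snprintf() returns.
 */
int format_container_row(char *buf, size_t len, const char *name,
			 unsigned long long id, unsigned int usage,
			 unsigned long flags, int owner, int group)
{
	return snprintf(buf, len, "%-24s %12llu %3u %02lx %5d %5d\n",
			name, id, usage, flags, owner, group);
}
```

With that format each row is 57 bytes long including the newline, as long as
the name fits in 24 columns and the numeric fields do not overflow their
widths.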

Signed-off-by: David Howells <dhowells@redhat.com>
---

 kernel/container.c |  110 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 110 insertions(+)

diff --git a/kernel/container.c b/kernel/container.c
index ca4012632cfa..1d2cb1c1e9b1 100644
--- a/kernel/container.c
+++ b/kernel/container.c
@@ -20,6 +20,7 @@
 #include <linux/syscalls.h>
 #include <linux/printk.h>
 #include <linux/security.h>
+#include <linux/proc_fs.h>
 #include "namespaces.h"
 
 struct container init_container = {
@@ -69,6 +70,108 @@ void put_container(struct container *c)
 	}
 }
 
+static void *container_proc_start(struct seq_file *m, loff_t *_pos)
+{
+	struct container *c = m->private;
+	struct list_head *p;
+	loff_t pos = *_pos;
+
+	spin_lock(&c->lock);
+
+	if (pos <= 1) {
+		*_pos = 1;
+		return (void *)1UL; /* Banner on first line */
+	}
+
+	if (pos == 2)
+		return m->private; /* Current container on second line */
+
+	/* Subordinate containers thereafter */
+	p = c->children.next;
+	pos--;
+	for (pos--; pos > 0 && p != &c->children; pos--) {
+		p = p->next;
+	}
+
+	if (p == &c->children)
+		return NULL;
+	return container_of(p, struct container, child_link);
+}
+
+static void *container_proc_next(struct seq_file *m, void *v, loff_t *_pos)
+{
+	struct container *c = m->private, *vc = v;
+	struct list_head *p;
+	loff_t pos = *_pos;
+
+	pos++;
+	*_pos = pos;
+	if (pos == 2)
+		return c; /* Current container on second line */
+
+	if (pos == 3)
+		p = &c->children;
+	else
+		p = &vc->child_link;
+	p = p->next;
+	if (p == &c->children)
+		return NULL;
+	return container_of(p, struct container, child_link);
+}
+
+static void container_proc_stop(struct seq_file *m, void *v)
+{
+	struct container *c = m->private;
+
+	spin_unlock(&c->lock);
+}
+
+static int container_proc_show(struct seq_file *m, void *v)
+{
+	struct user_namespace *uns = current_user_ns();
+	struct container *c = v;
+	const char *name;
+
+	if (v == (void *)1UL) {
+		seq_puts(m, "NAME                               ID USE FL OWNER GROUP\n");
+		return 0;
+	}
+
+	name = (c == m->private) ? "<current>" : c->name;
+	seq_printf(m, "%-24s %12llu %3u %02lx %5d %5d\n",
+		   name, c->id, refcount_read(&c->usage), c->flags,
+		   from_kuid_munged(uns, c->cred->uid),
+		   from_kgid_munged(uns, c->cred->gid));
+
+	return 0;
+}
+
+static const struct seq_operations container_proc_ops = {
+	.start	= container_proc_start,
+	.next	= container_proc_next,
+	.stop	= container_proc_stop,
+	.show	= container_proc_show,
+};
+
+static int container_proc_open(struct inode *inode, struct file *file)
+{
+	struct seq_file *m;
+	int ret = seq_open(file, &container_proc_ops);
+
+	if (ret == 0) {
+		m = file->private_data;
+		m->private = current->container;
+	}
+	return ret;
+}
+
+static const struct file_operations container_proc_fops = {
+	.open		= container_proc_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
 /*
  * Allow the user to poll for the container dying.
  */
@@ -345,4 +448,11 @@ SYSCALL_DEFINE5(container_create,
 	return fd;
 }
 
+static int __init init_container_fs(void)
+{
+	proc_create("containers", 0, NULL, &container_proc_fops);
+	return 0;
+}
+fs_initcall(init_container_fs);
+
 #endif /* CONFIG_CONTAINERS */



* [RFC PATCH 04/27] containers: Allow a process to be forked into a container
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (2 preceding siblings ...)
  2019-02-15 16:07 ` [RFC PATCH 03/27] containers: Provide /proc/containers David Howells
@ 2019-02-15 16:07 ` David Howells
  2019-02-15 17:39   ` Stephen Smalley
                     ` (2 more replies)
  2019-02-15 16:07 ` [RFC PATCH 05/27] containers: Open a socket inside " David Howells
                   ` (25 subsequent siblings)
  29 siblings, 3 replies; 61+ messages in thread
From: David Howells @ 2019-02-15 16:07 UTC (permalink / raw)
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

Allow a single process to be forked directly into a container using a new
syscall, thereby 'booting' the container:

	pid_t pid = fork_into_container(int container_fd);

This process will be the 'init' process of the container.

Further attempts to fork into the container will be rejected.
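
A rough userspace sketch of how the two syscalls might be invoked follows.
There is no libc support, so it goes through syscall(2).  The numbers 344 and
345 are the x86_64 entries from the syscall tables in this RFC series; they
are not assigned in mainline, so on an unpatched kernel I would expect the
calls to fail (most likely with ENOSYS).  The wrapper names simply reuse the
syscall names, and the fork-like return convention is an assumption based on
the prototype above.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Syscall numbers from the x86_64 table in this RFC series only.  These are
 * assumptions tied to a kernel built with the series applied.
 */
#if defined(__x86_64__) && !defined(__ILP32__)
#define NR_container_create	344
#define NR_fork_into_container	345
#endif

/* Returns a container fd on success, or -1 with errno set. */
int container_create(const char *name, unsigned int flags)
{
#ifdef NR_container_create
	/* The three spare arguments must be zero or the kernel returns -EINVAL. */
	return (int)syscall(NR_container_create, name, flags, 0UL, 0UL, 0UL);
#else
	(void)name;
	(void)flags;
	errno = ENOSYS;
	return -1;
#endif
}

/*
 * Assumed to behave like fork(): the child's pid in the parent, 0 in the
 * new 'init' process, or -1 with errno set.
 */
pid_t fork_into_container(int container_fd)
{
#ifdef NR_fork_into_container
	return (pid_t)syscall(NR_fork_into_container, container_fd);
#else
	(void)container_fd;
	errno = ENOSYS;
	return -1;
#endif
}
```

A caller would typically create the container first, pass the returned fd to
fork_into_container(), and then poll the fd for POLLHUP to learn that the
container's init has exited.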

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 arch/x86/ia32/sys_ia32.c               |    2 -
 include/linux/cred.h                   |    3 +
 include/linux/nsproxy.h                |    7 ++
 include/linux/sched/task.h             |    3 +
 include/linux/syscalls.h               |    1 
 kernel/cred.c                          |   45 +++++++++++++
 kernel/fork.c                          |  110 ++++++++++++++++++++++++++------
 kernel/nsproxy.c                       |   11 +++
 kernel/sys_ni.c                        |    1 
 11 files changed, 157 insertions(+), 28 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 3564814a5d21..8666693510f9 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -408,3 +408,4 @@
 394	i386	mount_notify		sys_mount_notify		__ia32_sys_mount_notify
 395	i386	sb_notify		sys_sb_notify			__ia32_sys_sb_notify
 396	i386	container_create	sys_container_create		__ia32_sys_container_create
+397	i386	fork_into_container	sys_fork_into_container		__ia32_sys_fork_into_container
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index aa6cccbe5271..d40d4790fcb2 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -353,6 +353,7 @@
 342	common	mount_notify		__x64_sys_mount_notify
 343	common	sb_notify		__x64_sys_sb_notify
 344	common	container_create	__x64_sys_container_create
+345	common	fork_into_container	__x64_sys_fork_into_container
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/x86/ia32/sys_ia32.c b/arch/x86/ia32/sys_ia32.c
index a43212036257..080d9e21b697 100644
--- a/arch/x86/ia32/sys_ia32.c
+++ b/arch/x86/ia32/sys_ia32.c
@@ -238,5 +238,5 @@ COMPAT_SYSCALL_DEFINE5(x86_clone, unsigned long, clone_flags,
 		       unsigned long, tls_val, int __user *, child_tidptr)
 {
 	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr,
-			tls_val);
+			tls_val, NULL);
 }
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 4907c9df86b3..357e743d5d4a 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -23,6 +23,7 @@
 
 struct cred;
 struct inode;
+struct container;
 
 /*
  * COW Supplementary groups list
@@ -155,7 +156,7 @@ struct cred {
 
 extern void __put_cred(struct cred *);
 extern void exit_creds(struct task_struct *);
-extern int copy_creds(struct task_struct *, unsigned long);
+extern int copy_creds(struct task_struct *, unsigned long, struct container *);
 extern const struct cred *get_task_cred(struct task_struct *);
 extern struct cred *cred_alloc_blank(void);
 extern struct cred *prepare_creds(void);
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 2ae1b1a4d84d..81838ae24a92 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -11,6 +11,7 @@ struct ipc_namespace;
 struct pid_namespace;
 struct cgroup_namespace;
 struct fs_struct;
+struct container;
 
 /*
  * A structure to contain pointers to all per-process
@@ -63,9 +64,13 @@ extern struct nsproxy init_nsproxy;
  *         * /
  *     task_unlock(task);
  *
+ *  4. Container namespaces are set at container creation and cannot be
+ *     changed.
+ *
  */
 
-int copy_namespaces(unsigned long flags, struct task_struct *tsk);
+int copy_namespaces(unsigned long flags, struct task_struct *tsk,
+		    struct container *dest_container);
 void exit_task_namespaces(struct task_struct *tsk);
 void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new);
 void free_nsproxy(struct nsproxy *ns);
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 44c6f15800ff..bdff71b0fb66 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -73,7 +73,8 @@ extern void do_group_exit(int);
 extern void exit_files(struct task_struct *);
 extern void exit_itimers(struct signal_struct *);
 
-extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
+extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *,
+		     int __user *, unsigned long, struct container *);
 extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
 struct task_struct *fork_idle(int);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index dac42098c2dd..15e5cc704df3 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -946,6 +946,7 @@ asmlinkage long sys_sb_notify(int dfd, const char __user *path,
 asmlinkage long sys_container_create(const char __user *name, unsigned int flags,
 				     unsigned long spare3, unsigned long spare4,
 				     unsigned long spare5);
+asmlinkage long sys_fork_into_container(int containerfd);
 
 /*
  * Architecture-specific system calls
diff --git a/kernel/cred.c b/kernel/cred.c
index 21f4a97085b4..f0ee5cec533d 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -313,6 +313,43 @@ struct cred *prepare_exec_creds(void)
 	return new;
 }
 
+/*
+ * Handle forking a process into a container.
+ */
+static struct cred *copy_container_creds(struct container *dest_container)
+{
+	struct cred *new;
+
+	validate_process_creds();
+
+	new = kmem_cache_alloc(cred_jar, GFP_KERNEL);
+	if (!new)
+		return NULL;
+
+	kdebug("prepare_creds() alloc %p", new);
+
+	memcpy(new, dest_container->cred, sizeof(struct cred));
+
+	atomic_set(&new->usage, 1);
+	set_cred_subscribers(new, 0);
+	get_group_info(new->group_info);
+	get_uid(new->user);
+	get_user_ns(new->user_ns);
+
+#ifdef CONFIG_SECURITY
+	new->security = NULL;
+#endif
+
+	if (security_prepare_creds(new, dest_container->cred, GFP_KERNEL) < 0)
+		goto error;
+	validate_creds(new);
+	return new;
+
+error:
+	abort_creds(new);
+	return NULL;
+}
+
 /*
  * Copy credentials for the new process created by fork()
  *
@@ -322,7 +359,8 @@ struct cred *prepare_exec_creds(void)
  * The new process gets the current process's subjective credentials as its
  * objective and subjective credentials
  */
-int copy_creds(struct task_struct *p, unsigned long clone_flags)
+int copy_creds(struct task_struct *p, unsigned long clone_flags,
+	       struct container *dest_container)
 {
 	struct cred *new;
 	int ret;
@@ -343,7 +381,10 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
 		return 0;
 	}
 
-	new = prepare_creds();
+	if (dest_container)
+		new = copy_container_creds(dest_container);
+	else
+		new = prepare_creds();
 	if (!new)
 		return -ENOMEM;
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 009cf7e63894..71401deb4434 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1385,9 +1385,33 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
 	return retval;
 }
 
-static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
+static int copy_fs(unsigned long clone_flags, struct task_struct *tsk,
+		   struct container *dest_container)
 {
 	struct fs_struct *fs = current->fs;
+
+#ifdef CONFIG_CONTAINERS
+	if (dest_container) {
+		fs = kmem_cache_alloc(fs_cachep, GFP_KERNEL);
+		if (!fs)
+			return -ENOMEM;
+
+		fs->users = 1;
+		fs->in_exec = 0;
+		spin_lock_init(&fs->lock);
+		seqcount_init(&fs->seq);
+		fs->umask = 0022;
+
+		spin_lock(&dest_container->lock);
+		fs->pwd = fs->root = dest_container->root;
+		path_get(&fs->root);
+		path_get(&fs->pwd);
+		spin_unlock(&dest_container->lock);
+		tsk->fs = fs;
+		return 0;
+	}
+#endif
+
 	if (clone_flags & CLONE_FS) {
 		/* tsk->fs is already what we want */
 		spin_lock(&fs->lock);
@@ -1679,7 +1703,8 @@ static __latent_entropy struct task_struct *copy_process(
 					struct pid *pid,
 					int trace,
 					unsigned long tls,
-					int node)
+					int node,
+					struct container *dest_container)
 {
 	int retval;
 	struct task_struct *p;
@@ -1783,7 +1808,7 @@ static __latent_entropy struct task_struct *copy_process(
 	}
 	current->flags &= ~PF_NPROC_EXCEEDED;
 
-	retval = copy_creds(p, clone_flags);
+	retval = copy_creds(p, clone_flags, dest_container);
 	if (retval < 0)
 		goto bad_fork_free;
 
@@ -1905,7 +1930,7 @@ static __latent_entropy struct task_struct *copy_process(
 	retval = copy_files(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_semundo;
-	retval = copy_fs(clone_flags, p);
+	retval = copy_fs(clone_flags, p, dest_container);
 	if (retval)
 		goto bad_fork_cleanup_files;
 	retval = copy_sighand(clone_flags, p);
@@ -1917,15 +1942,15 @@ static __latent_entropy struct task_struct *copy_process(
 	retval = copy_mm(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_signal;
-	retval = copy_namespaces(clone_flags, p);
+	retval = copy_container(clone_flags, p, dest_container);
 	if (retval)
 		goto bad_fork_cleanup_mm;
-	retval = copy_container(clone_flags, p, NULL);
+	retval = copy_namespaces(clone_flags, p, dest_container);
 	if (retval)
-		goto bad_fork_cleanup_namespaces;
+		goto bad_fork_cleanup_container;
 	retval = copy_io(clone_flags, p);
 	if (retval)
-		goto bad_fork_cleanup_container;
+		goto bad_fork_cleanup_namespaces;
 	retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
 	if (retval)
 		goto bad_fork_cleanup_io;
@@ -2124,10 +2149,10 @@ static __latent_entropy struct task_struct *copy_process(
 bad_fork_cleanup_io:
 	if (p->io_context)
 		exit_io_context(p);
-bad_fork_cleanup_container:
-	exit_container(p);
 bad_fork_cleanup_namespaces:
 	exit_task_namespaces(p);
+bad_fork_cleanup_container:
+	exit_container(p);
 bad_fork_cleanup_mm:
 	if (p->mm)
 		mmput(p->mm);
@@ -2183,7 +2208,7 @@ struct task_struct *fork_idle(int cpu)
 {
 	struct task_struct *task;
 	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0,
-			    cpu_to_node(cpu));
+			    cpu_to_node(cpu), NULL);
 	if (!IS_ERR(task)) {
 		init_idle_pids(task);
 		init_idle(task, cpu);
@@ -2195,15 +2220,16 @@ struct task_struct *fork_idle(int cpu)
 /*
  *  Ok, this is the main fork-routine.
  *
- * It copies the process, and if successful kick-starts
- * it and waits for it to finish using the VM if required.
+ * It copies the process into the specified container, and if successful
+ * kick-starts it and waits for it to finish using the VM if required.
  */
 long _do_fork(unsigned long clone_flags,
 	      unsigned long stack_start,
 	      unsigned long stack_size,
 	      int __user *parent_tidptr,
 	      int __user *child_tidptr,
-	      unsigned long tls)
+	      unsigned long tls,
+	      struct container *dest_container)
 {
 	struct completion vfork;
 	struct pid *pid;
@@ -2229,8 +2255,32 @@ long _do_fork(unsigned long clone_flags,
 			trace = 0;
 	}
 
+	if (dest_container) {
+		/* A process spawned into a container doesn't share anything
+		 * with the parent other than namespaces.
+		 */
+		if (clone_flags & (CLONE_CHILD_CLEARTID |
+				   CLONE_CHILD_SETTID |
+				   CLONE_FILES |
+				   CLONE_FS |
+				   CLONE_IO |
+				   CLONE_PARENT |
+				   CLONE_PARENT_SETTID |
+				   CLONE_PTRACE |
+				   CLONE_SETTLS |
+				   CLONE_SIGHAND |
+				   CLONE_SYSVSEM |
+				   CLONE_THREAD))
+			return -EINVAL;
+
+		/* However, we do have to let kernel threads borrow a VM. */
+		if ((clone_flags & CLONE_VM) && current->mm)
+			return -EINVAL;
+	}
+	
 	p = copy_process(clone_flags, stack_start, stack_size,
-			 child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
+			 child_tidptr, NULL, trace, tls, NUMA_NO_NODE,
+			 dest_container);
 	add_latent_entropy();
 
 	if (IS_ERR(p))
@@ -2279,7 +2329,7 @@ long do_fork(unsigned long clone_flags,
 	      int __user *child_tidptr)
 {
 	return _do_fork(clone_flags, stack_start, stack_size,
-			parent_tidptr, child_tidptr, 0);
+			parent_tidptr, child_tidptr, 0, NULL);
 }
 #endif
 
@@ -2289,14 +2339,14 @@ long do_fork(unsigned long clone_flags,
 pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
 {
 	return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
-		(unsigned long)arg, NULL, NULL, 0);
+			(unsigned long)arg, NULL, NULL, 0, NULL);
 }
 
 #ifdef __ARCH_WANT_SYS_FORK
 SYSCALL_DEFINE0(fork)
 {
 #ifdef CONFIG_MMU
-	return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
+	return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0, NULL);
 #else
 	/* can not support in nommu mode */
 	return -EINVAL;
@@ -2308,7 +2358,26 @@ SYSCALL_DEFINE0(fork)
 SYSCALL_DEFINE0(vfork)
 {
 	return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
-			0, NULL, NULL, 0);
+			0, NULL, NULL, 0, NULL);
+}
+#endif
+
+#ifdef CONFIG_CONTAINERS
+SYSCALL_DEFINE1(fork_into_container, int, containerfd)
+{
+	struct fd f = fdget(containerfd);
+	int ret;
+
+	if (!f.file)
+		return -EBADF;
+	ret = -EINVAL;
+	if (is_container_file(f.file)) {
+		struct container *dest_container = f.file->private_data;
+
+		ret = _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0, dest_container);
+	}
+	fdput(f);
+	return ret;
 }
 #endif
 
@@ -2336,7 +2405,8 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
 		 unsigned long, tls)
 #endif
 {
-	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
+	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls,
+			NULL);
 }
 #endif
 
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 4bb5184b3a80..4031075300a4 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -136,12 +136,19 @@ struct nsproxy *create_new_namespaces(unsigned long flags,
  * called from clone.  This now handles copy for nsproxy and all
  * namespaces therein.
  */
-int copy_namespaces(unsigned long flags, struct task_struct *tsk)
+int copy_namespaces(unsigned long flags, struct task_struct *tsk,
+		    struct container *dest_container)
 {
 	struct nsproxy *old_ns = tsk->nsproxy;
 	struct user_namespace *user_ns = task_cred_xxx(tsk, user_ns);
 	struct nsproxy *new_ns;
 
+	if (dest_container) {
+		get_nsproxy(dest_container->ns);
+		tsk->nsproxy = dest_container->ns;
+		return 0;
+	}
+
 	if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
 			      CLONE_NEWPID | CLONE_NEWNET |
 			      CLONE_NEWCGROUP)))) {
@@ -163,7 +170,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
 		(CLONE_NEWIPC | CLONE_SYSVSEM)) 
 		return -EINVAL;
 
-	new_ns = create_new_namespaces(flags, tsk->nsproxy, user_ns, tsk->fs);
+	new_ns = create_new_namespaces(flags, old_ns, user_ns, tsk->fs);
 	if (IS_ERR(new_ns))
 		return  PTR_ERR(new_ns);
 
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index f0455cbb91cf..a23ad529d548 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -144,6 +144,7 @@ COND_SYSCALL(container_create);
 /* kernel/exit.c */
 
 /* kernel/fork.c */
+COND_SYSCALL(fork_into_container);
 
 /* kernel/futex.c */
 COND_SYSCALL(futex);



* [RFC PATCH 05/27] containers: Open a socket inside a container
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (3 preceding siblings ...)
  2019-02-15 16:07 ` [RFC PATCH 04/27] containers: Allow a process to be forked into a container David Howells
@ 2019-02-15 16:07 ` David Howells
  2019-02-19 16:41   ` Eric W. Biederman
  2019-02-15 16:08 ` [RFC PATCH 06/27] containers, vfs: Allow syscall dirfd arguments to take a container fd David Howells
                   ` (24 subsequent siblings)
  29 siblings, 1 reply; 61+ messages in thread
From: David Howells @ 2019-02-15 16:07 UTC (permalink / raw)
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

Provide a system call to open a socket inside of a container, using that
container's network namespace.  This allows netlink to be used to manage
the container.

	fd = container_socket(int container_fd,
			      int domain, int type, int protocol);
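
No libc wrapper exists for this call, so a test program would have to go
through syscall(2).  A minimal sketch, assuming the x86_64 number 346 that
this RFC series assigns in syscall_64.tbl; the names
container_socket_sketch() and NR_container_socket_rfc are made up for
illustration and are not part of the patch:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: x86_64 syscall number taken from this RFC's syscall_64.tbl. */
#define NR_container_socket_rfc 346

/*
 * Hypothetical userspace wrapper.  Returns a new socket fd on success, or
 * -1 with errno set.  On a kernel without this series the call is expected
 * to fail.
 */
static int container_socket_sketch(int container_fd, int domain, int type,
				   int protocol)
{
	return (int)syscall(NR_container_socket_rfc, container_fd,
			    domain, type, protocol);
}
```

A manager would presumably pass the fd returned by container_create() and
ask for an AF_NETLINK socket in order to configure the container's
interfaces from outside.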

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl |    1 +
 include/linux/socket.h                 |    3 ++-
 include/linux/syscalls.h               |    2 ++
 kernel/sys_ni.c                        |    1 +
 net/compat.c                           |    2 +-
 net/socket.c                           |   34 +++++++++++++++++++++++++++-----
 7 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 8666693510f9..f4c9beff77a6 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -409,3 +409,4 @@
 395	i386	sb_notify		sys_sb_notify			__ia32_sys_sb_notify
 396	i386	container_create	sys_container_create		__ia32_sys_container_create
 397	i386	fork_into_container	sys_fork_into_container		__ia32_sys_fork_into_container
+398	i386	container_socket	sys_container_socket		__ia32_sys_container_socket
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index d40d4790fcb2..e20cdf7b5527 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -354,6 +354,7 @@
 343	common	sb_notify		__x64_sys_sb_notify
 344	common	container_create	__x64_sys_container_create
 345	common	fork_into_container	__x64_sys_fork_into_container
+346	common	container_socket	__x64_sys_container_socket
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/socket.h b/include/linux/socket.h
index ab2041a00e01..154ac900a8a5 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -10,6 +10,7 @@
 #include <linux/compiler.h>		/* __user			*/
 #include <uapi/linux/socket.h>
 
+struct net;
 struct pid;
 struct cred;
 
@@ -376,7 +377,7 @@ extern int __sys_sendto(int fd, void __user *buff, size_t len,
 			int addr_len);
 extern int __sys_accept4(int fd, struct sockaddr __user *upeer_sockaddr,
 			 int __user *upeer_addrlen, int flags);
-extern int __sys_socket(int family, int type, int protocol);
+extern int __sys_socket(struct net *net, int family, int type, int protocol);
 extern int __sys_bind(int fd, struct sockaddr __user *umyaddr, int addrlen);
 extern int __sys_connect(int fd, struct sockaddr __user *uservaddr,
 			 int addrlen);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 15e5cc704df3..547334c6ffc2 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -947,6 +947,8 @@ asmlinkage long sys_container_create(const char __user *name, unsigned int flags
 				     unsigned long spare3, unsigned long spare4,
 				     unsigned long spare5);
 asmlinkage long sys_fork_into_container(int containerfd);
+asmlinkage long sys_container_socket(int containerfd,
+				     int domain, int type, int protocol);
 
 /*
  * Architecture-specific system calls
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index a23ad529d548..ce9c5bb30e7f 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -236,6 +236,7 @@ COND_SYSCALL(shmdt);
 /* net/socket.c */
 COND_SYSCALL(socket);
 COND_SYSCALL(socketpair);
+COND_SYSCALL(container_socket);
 COND_SYSCALL(bind);
 COND_SYSCALL(listen);
 COND_SYSCALL(accept);
diff --git a/net/compat.c b/net/compat.c
index 959d1c51826d..1b2db740fd33 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -856,7 +856,7 @@ COMPAT_SYSCALL_DEFINE2(socketcall, int, call, u32 __user *, args)
 
 	switch (call) {
 	case SYS_SOCKET:
-		ret = __sys_socket(a0, a1, a[2]);
+		ret = __sys_socket(current->nsproxy->net_ns, a0, a1, a[2]);
 		break;
 	case SYS_BIND:
 		ret = __sys_bind(a0, compat_ptr(a1), a[2]);
diff --git a/net/socket.c b/net/socket.c
index 7d271a1d0c7e..7406580598b9 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -80,6 +80,7 @@
 #include <linux/highmem.h>
 #include <linux/mount.h>
 #include <linux/fs_context.h>
+#include <linux/container.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
 #include <linux/compat.h>
@@ -1326,9 +1327,9 @@ int sock_create_kern(struct net *net, int family, int type, int protocol, struct
 }
 EXPORT_SYMBOL(sock_create_kern);
 
-int __sys_socket(int family, int type, int protocol)
+int __sys_socket(struct net *net, int family, int type, int protocol)
 {
-	int retval;
+	long retval;
 	struct socket *sock;
 	int flags;
 
@@ -1346,7 +1347,7 @@ int __sys_socket(int family, int type, int protocol)
 	if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
 		flags = (flags & ~SOCK_NONBLOCK) | O_NONBLOCK;
 
-	retval = sock_create(family, type, protocol, &sock);
+	retval = __sock_create(net, family, type, protocol, &sock, 0);
 	if (retval < 0)
 		return retval;
 
@@ -1355,9 +1356,32 @@ int __sys_socket(int family, int type, int protocol)
 
 SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
 {
-	return __sys_socket(family, type, protocol);
+	return __sys_socket(current->nsproxy->net_ns, family, type, protocol);
 }
 
+/*
+ * Create a socket inside a container.
+ */
+#ifdef CONFIG_CONTAINERS
+SYSCALL_DEFINE4(container_socket,
+		int, containerfd, int, family, int, type, int, protocol)
+{
+	struct fd f = fdget(containerfd);
+	long ret;
+
+	if (!f.file)
+		return -EBADF;
+	ret = -EINVAL;
+	if (is_container_file(f.file)) {
+		struct container *c = f.file->private_data;
+
+		ret = __sys_socket(c->ns->net_ns, family, type, protocol);
+	}
+	fdput(f);
+	return ret;
+}
+#endif
+
 /*
  *	Create a pair of connected sockets.
  */
@@ -2555,7 +2579,7 @@ SYSCALL_DEFINE2(socketcall, int, call, unsigned long __user *, args)
 
 	switch (call) {
 	case SYS_SOCKET:
-		err = __sys_socket(a0, a1, a[2]);
+		err = __sys_socket(current->nsproxy->net_ns, a0, a1, a[2]);
 		break;
 	case SYS_BIND:
 		err = __sys_bind(a0, (struct sockaddr __user *)a1, a[2]);



* [RFC PATCH 06/27] containers, vfs: Allow syscall dirfd arguments to take a container fd
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (4 preceding siblings ...)
  2019-02-15 16:07 ` [RFC PATCH 05/27] containers: Open a socket inside " David Howells
@ 2019-02-15 16:08 ` David Howells
  2019-02-19 16:45   ` Eric W. Biederman
  2019-02-19 23:24   ` David Howells
  2019-02-15 16:08 ` [RFC PATCH 07/27] containers: Make fsopen() able to create a superblock in a container David Howells
                   ` (23 subsequent siblings)
  29 siblings, 2 replies; 61+ messages in thread
From: David Howells @ 2019-02-15 16:08 UTC (permalink / raw)
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

Some filesystem system calls, such as mkdirat(), take a 'directory fd' to
specify the pathwalk origin.  This takes either AT_FDCWD or a file
descriptor that refers to an open directory.

Make it possible to supply a container fd, as obtained from
container_create(), instead, thereby specifying the container's root as the
origin.  The filesystem operation is then performed in the container's mount
namespace.  For example:

	int cfd = container_create("fred", CONTAINER_NEW_MNT_NS, 0);
	mkdirat(cfd, "fred", 0755);
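
For comparison, the dirfd behaviour that this builds on can be exercised
today with an ordinary directory fd.  The sketch below uses nothing from
this series; the helper names and the scratch directory are illustrative
only.  A relative pathname is used because, with the existing *at() calls,
an absolute pathname is resolved from the caller's root and the dirfd is
ignored:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create "name" relative to the directory open on dirfd and confirm that it
 * exists there.  Returns 0 on success, -1 on failure (errno set).
 */
static int mkdir_under(int dirfd, const char *name)
{
	struct stat st;

	if (mkdirat(dirfd, name, 0755) < 0)
		return -1;
	if (fstatat(dirfd, name, &st, 0) < 0)
		return -1;
	return S_ISDIR(st.st_mode) ? 0 : -1;
}

/*
 * Demonstration against a scratch directory made with mkdtemp().  Returns 0
 * if "fred" was created beneath the scratch directory.
 */
static int dirfd_demo(void)
{
	char tmpl[] = "/tmp/dirfd-demo-XXXXXX";
	int dfd, ret;

	if (!mkdtemp(tmpl))
		return -1;
	dfd = open(tmpl, O_RDONLY | O_DIRECTORY);
	if (dfd < 0) {
		rmdir(tmpl);
		return -1;
	}

	ret = mkdir_under(dfd, "fred");

	/* Clean up: remove the new directory, then the scratch directory. */
	unlinkat(dfd, "fred", AT_REMOVEDIR);
	close(dfd);
	rmdir(tmpl);
	return ret;
}
```

With this patch applied, a container fd would take the place of dfd and the
walk would start at the container's root instead.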

A better way to do this might be to temporarily override current->fs and
current->nsproxy, but this requires splitting those fields so that procfs
doesn't see the override.

A sequence number and lock are available to protect the root pointer in
case container_chroot() and/or container_pivot_root() are implemented.
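
The reader side of that scheme follows the usual sequence-count pattern:
sample the count, copy the protected fields, and retry if the count moved
or was odd.  A userspace sketch of the idea, using C11 atomics in place of
the kernel's seqcount_t and with two integers standing in for struct path
(all names here are hypothetical, and sequentially consistent atomics are
used for simplicity where the kernel uses lighter barriers):

```c
#include <stdatomic.h>

/* Stand-in for struct path: two ids that must always be read as a pair. */
struct root_sketch {
	atomic_uint seq;	/* Even: stable.  Odd: update in progress. */
	atomic_int mnt_id;
	atomic_int dentry_id;
};

static void root_init(struct root_sketch *r, int mnt_id, int dentry_id)
{
	atomic_init(&r->seq, 0);
	atomic_init(&r->mnt_id, mnt_id);
	atomic_init(&r->dentry_id, dentry_id);
}

/* Writer side.  Assumes writers are serialised externally (the lock). */
static void root_set(struct root_sketch *r, int mnt_id, int dentry_id)
{
	atomic_fetch_add(&r->seq, 1);		/* becomes odd */
	atomic_store(&r->mnt_id, mnt_id);
	atomic_store(&r->dentry_id, dentry_id);
	atomic_fetch_add(&r->seq, 1);		/* becomes even again */
}

/* Reader side: retry until a snapshot is taken with no writer active. */
static void root_get(struct root_sketch *r, int *mnt_id, int *dentry_id)
{
	unsigned int seq;

	do {
		seq = atomic_load(&r->seq);
		*mnt_id = atomic_load(&r->mnt_id);
		*dentry_id = atomic_load(&r->dentry_id);
	} while ((seq & 1) || atomic_load(&r->seq) != seq);
}
```

The reader never blocks the writer, which is what makes the pattern usable
from an RCU-mode path walk.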

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/namei.c |   45 ++++++++++++++++++++++++++++++++++-----------
 1 file changed, 34 insertions(+), 11 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index a85deb55d0c9..4932b5467285 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2232,20 +2232,43 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 		if (!f.file)
 			return ERR_PTR(-EBADF);
 
-		dentry = f.file->f_path.dentry;
+		if (is_container_file(f.file)) {
+			struct container *c = f.file->private_data;
+			unsigned seq;
 
-		if (*s && unlikely(!d_can_lookup(dentry))) {
-			fdput(f);
-			return ERR_PTR(-ENOTDIR);
-		}
+			if (!*s)
+				return ERR_PTR(-EINVAL);
 
-		nd->path = f.file->f_path;
-		if (flags & LOOKUP_RCU) {
-			nd->inode = nd->path.dentry->d_inode;
-			nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
+			if (flags & LOOKUP_RCU) {
+				do {
+					seq = read_seqcount_begin(&c->seq);
+					nd->path = c->root;
+					nd->inode = nd->path.dentry->d_inode;
+					nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
+				} while (read_seqcount_retry(&c->seq, seq));
+			} else {
+				spin_lock(&c->lock);
+				nd->path = c->root;
+				path_get(&nd->path);
+				spin_unlock(&c->lock);
+				nd->inode = nd->path.dentry->d_inode;
+			}
 		} else {
-			path_get(&nd->path);
-			nd->inode = nd->path.dentry->d_inode;
+			dentry = f.file->f_path.dentry;
+
+			if (*s && unlikely(!d_can_lookup(dentry))) {
+				fdput(f);
+				return ERR_PTR(-ENOTDIR);
+			}
+
+			nd->path = f.file->f_path;
+			if (flags & LOOKUP_RCU) {
+				nd->inode = nd->path.dentry->d_inode;
+				nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
+			} else {
+				path_get(&nd->path);
+				nd->inode = nd->path.dentry->d_inode;
+			}
 		}
 		fdput(f);
 		return s;



* [RFC PATCH 07/27] containers: Make fsopen() able to create a superblock in a container
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (5 preceding siblings ...)
  2019-02-15 16:08 ` [RFC PATCH 06/27] containers, vfs: Allow syscall dirfd arguments to take a container fd David Howells
@ 2019-02-15 16:08 ` David Howells
  2019-02-15 16:08 ` [RFC PATCH 08/27] containers, vfs: Honour CONTAINER_NEW_EMPTY_FS_NS David Howells
                   ` (22 subsequent siblings)
  29 siblings, 0 replies; 61+ messages in thread
From: David Howells @ 2019-02-15 16:08 UTC (permalink / raw)
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

Make it possible for fsopen() to create a superblock in a specified
container, using the namespaces associated with that container to cover UID
translation, networking and filesystem content.  This involves adding a new
fsconfig command to specify the container.

For example:

	cfd = container_create("fred", CONTAINER_NEW_FS_NS);

	fsfd = fsopen("ext4", 0);
	fsconfig(fsfd, FSCONFIG_SET_CONTAINER, NULL, NULL, cfd);
	fsconfig(fsfd, FSCONFIG_SET_STRING, "source", "/dev/sda3", 0);
	fsconfig(fsfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
	fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
	mfd = fsmount(fsfd, 0, MOUNT_ATTR_RDONLY);
	move_mount(mfd, "", cfd, "/",
		   MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_T_CONTAINER_ROOT);
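
The ordering in the example matters: the container has to be set before any
other parameter is supplied.  The phase handling that enforces this can be
modelled in userspace roughly as below.  This is a simplified sketch rather
than the kernel code: the enum and function names are hypothetical, the
reconfiguration phases are left out, and only the -EBUSY ordering checks are
represented:

```c
#include <errno.h>

/* Phases loosely mirror FS_CONTEXT_CREATE_NS, _CREATE_PARAMS, _CREATING. */
enum phase_sketch {
	PHASE_CREATE_NS,	/* Container may still be chosen */
	PHASE_CREATE_PARAMS,	/* Ordinary parameters being loaded */
	PHASE_CREATING,		/* Superblock creation started */
};

enum cmd_sketch {
	CMD_SET_CONTAINER,
	CMD_SET_PARAM,		/* Stands for SET_STRING, SET_FLAG, ... */
	CMD_CREATE,
};

/* Apply one command; returns 0, or -EBUSY if it arrives in the wrong phase. */
static int fsconfig_phase_sketch(enum phase_sketch *phase, enum cmd_sketch cmd)
{
	switch (cmd) {
	case CMD_SET_CONTAINER:
		/* Only valid before any ordinary parameter has been seen. */
		return *phase == PHASE_CREATE_NS ? 0 : -EBUSY;
	case CMD_SET_PARAM:
		/* The first ordinary parameter closes the namespace phase. */
		if (*phase == PHASE_CREATE_NS)
			*phase = PHASE_CREATE_PARAMS;
		return *phase == PHASE_CREATE_PARAMS ? 0 : -EBUSY;
	case CMD_CREATE:
		if (*phase != PHASE_CREATE_NS &&
		    *phase != PHASE_CREATE_PARAMS)
			return -EBUSY;
		*phase = PHASE_CREATING;
		return 0;
	}
	return -EINVAL;
}
```

In this model, swapping the first two fsconfig() calls of the example would
make the FSCONFIG_SET_CONTAINER step fail with EBUSY.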

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/fs_context.c            |   19 +++++++++++++++
 fs/fsopen.c                |   54 +++++++++++++++++++++++++++++++++++++-------
 fs/namespace.c             |   19 +++++++++++----
 fs/proc/root.c             |   11 +++++++--
 include/linux/container.h  |    1 +
 include/linux/fs_context.h |    3 ++
 include/linux/pid.h        |    5 +++-
 include/linux/proc_ns.h    |    6 +++--
 include/uapi/linux/mount.h |    1 +
 kernel/container.c         |    4 +++
 kernel/fork.c              |    2 +-
 kernel/pid.c               |    4 ++-
 12 files changed, 108 insertions(+), 21 deletions(-)

diff --git a/fs/fs_context.c b/fs/fs_context.c
index a47ccd5a4a78..fc76ac02d618 100644
--- a/fs/fs_context.c
+++ b/fs/fs_context.c
@@ -20,6 +20,7 @@
 #include <linux/slab.h>
 #include <linux/magic.h>
 #include <linux/security.h>
+#include <linux/container.h>
 #include <linux/mnt_namespace.h>
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
@@ -169,6 +170,21 @@ int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param)
 }
 EXPORT_SYMBOL(vfs_parse_fs_param);
 
+/*
+ * Specify a container in which a superblock will exist.
+ */
+void vfs_set_container(struct fs_context *fc, struct container *container)
+{
+	if (container) {
+		put_user_ns(fc->user_ns);
+		put_net(fc->net_ns);
+
+		fc->container = get_container(container);
+		fc->user_ns = get_user_ns(container->cred->user_ns);
+		fc->net_ns = get_net(container->ns->net_ns);
+	}
+}
+
 /**
  * vfs_parse_fs_string - Convenience function to just parse a string.
  */
@@ -364,6 +380,8 @@ struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc)
 	fc->source	= NULL;
 	fc->security	= NULL;
 	get_filesystem(fc->fs_type);
+	if (fc->container)
+		get_container(fc->container);
 	get_net(fc->net_ns);
 	get_user_ns(fc->user_ns);
 	get_cred(fc->cred);
@@ -510,6 +528,7 @@ void put_fs_context(struct fs_context *fc)
 	put_net(fc->net_ns);
 	put_user_ns(fc->user_ns);
 	put_cred(fc->cred);
+	put_container(fc->container);
 	kfree(fc->subtype);
 	put_fc_log(fc);
 	put_filesystem(fc->fs_type);
diff --git a/fs/fsopen.c b/fs/fsopen.c
index 3bb9c0c8cbcc..d0fe9e563ebb 100644
--- a/fs/fsopen.c
+++ b/fs/fsopen.c
@@ -17,11 +17,33 @@
 #include <linux/security.h>
 #include <linux/anon_inodes.h>
 #include <linux/namei.h>
+#include <linux/container.h>
 #include <linux/file.h>
 #include <uapi/linux/mount.h>
 #include "internal.h"
 #include "mount.h"
 
+/*
+ * Configure the destination container on a filesystem context.  This must be
+ * done before any other parameters are offered.  The container is specified
+ * by passing an fd that refers to it as the auxiliary parameter.
+ *
+ * For example:
+ *
+ *	fsconfig(fsfd, FSCONFIG_SET_CONTAINER, NULL, NULL, container_fd);
+ */
+static int fsconfig_set_container(struct fs_context *fc, struct fs_parameter *param)
+{
+	struct container *c;
+
+	if (!is_container_file(param->file))
+		return -EINVAL;
+
+	c = param->file->private_data;
+	vfs_set_container(fc, c);
+	return 0;
+}
+
 /*
  * Allow the user to read back any error, warning or informational messages.
  */
@@ -111,10 +133,6 @@ static int fscontext_alloc_log(struct fs_context *fc)
 
 /*
  * Open a filesystem by name so that it can be configured for mounting.
- *
- * We are allowed to specify a container in which the filesystem will be
- * opened, thereby indicating which namespaces will be used (notably, which
- * network namespace will be used for network filesystems).
  */
 SYSCALL_DEFINE2(fsopen, const char __user *, _fs_name, unsigned int, flags)
 {
@@ -143,7 +161,7 @@ SYSCALL_DEFINE2(fsopen, const char __user *, _fs_name, unsigned int, flags)
 	if (IS_ERR(fc))
 		return PTR_ERR(fc);
 
-	fc->phase = FS_CONTEXT_CREATE_PARAMS;
+	fc->phase = FS_CONTEXT_CREATE_NS;
 
 	ret = fscontext_alloc_log(fc);
 	if (ret < 0)
@@ -228,7 +246,8 @@ static int vfs_fsconfig_locked(struct fs_context *fc, int cmd,
 		return ret;
 	switch (cmd) {
 	case FSCONFIG_CMD_CREATE:
-		if (fc->phase != FS_CONTEXT_CREATE_PARAMS)
+		if (fc->phase != FS_CONTEXT_CREATE_NS &&
+		    fc->phase != FS_CONTEXT_CREATE_PARAMS)
 			return -EBUSY;
 		fc->phase = FS_CONTEXT_CREATING;
 		ret = vfs_get_tree(fc);
@@ -259,9 +278,17 @@ static int vfs_fsconfig_locked(struct fs_context *fc, int cmd,
 			break;
 		vfs_clean_context(fc);
 		return 0;
+
+	case FSCONFIG_SET_CONTAINER:
+		if (fc->phase != FS_CONTEXT_CREATE_NS)
+			return -EBUSY;
+		return fsconfig_set_container(fc, param);
+
 	default:
-		if (fc->phase != FS_CONTEXT_CREATE_PARAMS &&
-		    fc->phase != FS_CONTEXT_RECONF_PARAMS)
+		if (fc->phase == FS_CONTEXT_CREATE_NS)
+			fc->phase = FS_CONTEXT_CREATE_PARAMS;
+		else if (fc->phase != FS_CONTEXT_CREATE_PARAMS &&
+			 fc->phase != FS_CONTEXT_RECONF_PARAMS)
 			return -EBUSY;
 
 		return vfs_parse_fs_param(fc, param);
@@ -353,6 +380,10 @@ SYSCALL_DEFINE5(fsconfig,
 		if (!_key || _value || aux < 0)
 			return -EINVAL;
 		break;
+	case FSCONFIG_SET_CONTAINER:
+		if (_key || _value || aux < 0)
+			return -EINVAL;
+		break;
 	case FSCONFIG_CMD_CREATE:
 	case FSCONFIG_CMD_RECONFIGURE:
 		if (_key || _value || aux)
@@ -438,6 +469,12 @@ SYSCALL_DEFINE5(fsconfig,
 		if (!param.file)
 			goto out_key;
 		break;
+	case FSCONFIG_SET_CONTAINER:
+		ret = -EBADF;
+		param.file = fget(aux);
+		if (!param.file)
+			goto out_key;
+		break;
 	default:
 		break;
 	}
@@ -463,6 +500,7 @@ SYSCALL_DEFINE5(fsconfig,
 			putname(param.name);
 		break;
 	case FSCONFIG_SET_FD:
+	case FSCONFIG_SET_CONTAINER:
 		if (param.file)
 			fput(param.file);
 		break;
diff --git a/fs/namespace.c b/fs/namespace.c
index ea005f55ec4c..cc5d56f7ae29 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -781,9 +781,16 @@ static void put_mountpoint(struct mountpoint *mp)
 	}
 }
 
+static inline int __check_mnt(struct mount *mnt, struct mnt_namespace *mnt_ns)
+{
+	if (!mnt_ns)
+		mnt_ns = current->nsproxy->mnt_ns;
+	return mnt->mnt_ns == mnt_ns;
+}
+
 static inline int check_mnt(struct mount *mnt)
 {
-	return mnt->mnt_ns == current->nsproxy->mnt_ns;
+	return __check_mnt(mnt, NULL);
 }
 
 /*
@@ -2696,7 +2703,8 @@ static int do_move_mount_old(struct path *path, const char *old_name)
 /*
  * add a mount into a namespace's mount tree
  */
-static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
+static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags,
+			struct mnt_namespace *mnt_ns)
 {
 	struct mountpoint *mp;
 	struct mount *parent;
@@ -2710,7 +2718,7 @@ static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
 
 	parent = real_mount(path->mnt);
 	err = -EINVAL;
-	if (unlikely(!check_mnt(parent))) {
+	if (unlikely(!__check_mnt(parent, mnt_ns))) {
 		/* that's acceptable only for automounts done in private ns */
 		if (!(mnt_flags & MNT_SHRINKABLE))
 			goto unlock;
@@ -2765,7 +2773,8 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint,
 	if (IS_ERR(mnt))
 		return PTR_ERR(mnt);
 
-	error = do_add_mount(real_mount(mnt), mountpoint, mnt_flags);
+	error = do_add_mount(real_mount(mnt), mountpoint, mnt_flags,
+			     fc->container ? fc->container->ns->mnt_ns : NULL);
 	if (error < 0)
 		mntput(mnt);
 	return error;
@@ -2839,7 +2848,7 @@ int finish_automount(struct vfsmount *m, struct path *path)
 		goto fail;
 	}
 
-	err = do_add_mount(mnt, path, path->mnt->mnt_flags | MNT_SHRINKABLE);
+	err = do_add_mount(mnt, path, path->mnt->mnt_flags | MNT_SHRINKABLE, NULL);
 	if (!err)
 		return 0;
 fail:
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 6927b29ece76..aa802006d855 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -18,6 +18,7 @@
 #include <linux/sched/stat.h>
 #include <linux/module.h>
 #include <linux/bitops.h>
+#include <linux/container.h>
 #include <linux/user_namespace.h>
 #include <linux/fs_context.h>
 #include <linux/mount.h>
@@ -186,8 +187,12 @@ static int proc_init_fs_context(struct fs_context *fc)
 	ctx = kzalloc(sizeof(struct proc_fs_context), GFP_KERNEL);
 	if (!ctx)
 		return -ENOMEM;
+	
+	if (fc->container)
+		ctx->pid_ns = get_pid_ns(fc->container->pid_ns);
+	else
+		ctx->pid_ns = get_pid_ns(task_active_pid_ns(current));
 
-	ctx->pid_ns = get_pid_ns(task_active_pid_ns(current));
 	fc->fs_private = ctx;
 	fc->ops = &proc_fs_context_ops;
 	return 0;
@@ -300,7 +305,7 @@ struct proc_dir_entry proc_root = {
 	.name		= "/proc",
 };
 
-int pid_ns_prepare_proc(struct pid_namespace *ns)
+int pid_ns_prepare_proc(struct pid_namespace *ns, struct container *container)
 {
 	struct proc_fs_context *ctx;
 	struct fs_context *fc;
@@ -315,6 +320,8 @@ int pid_ns_prepare_proc(struct pid_namespace *ns)
 		fc->user_ns = get_user_ns(ns->user_ns);
 	}
 
+	vfs_set_container(fc, container);
+	
 	ctx = fc->fs_private;
 	if (ctx->pid_ns != ns) {
 		put_pid_ns(ctx->pid_ns);
diff --git a/include/linux/container.h b/include/linux/container.h
index 0a8918435097..087aa1885ef7 100644
--- a/include/linux/container.h
+++ b/include/linux/container.h
@@ -37,6 +37,7 @@ struct container {
 	struct path		root;		/* The root of the container's fs namespace */
 	struct task_struct	*init;		/* The 'init' task for this container */
 	struct container	*parent;	/* Parent of this container. */
+	struct pid_namespace	*pid_ns;	/* The process ID namespace for this container */
 	void			*security;	/* LSM data */
 	struct list_head	members;	/* Member processes, guarded with ->lock */
 	struct list_head	child_link;	/* Link in parent->children */
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index dc8c9fcba341..45486080eb84 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -40,6 +40,7 @@ enum fs_context_purpose {
  * Userspace usage phase for fsopen/fspick.
  */
 enum fs_context_phase {
+	FS_CONTEXT_CREATE_NS,		/* Set namespaces for sb creation */
 	FS_CONTEXT_CREATE_PARAMS,	/* Loading params for sb creation */
 	FS_CONTEXT_CREATING,		/* A superblock is being created */
 	FS_CONTEXT_AWAITING_MOUNT,	/* Superblock created, awaiting fsmount() */
@@ -93,6 +94,7 @@ struct fs_context {
 	struct file_system_type	*fs_type;
 	void			*fs_private;	/* The filesystem's context */
 	struct dentry		*root;		/* The root and superblock */
+	struct container	*container;	/* The container in which the mount will exist */
 	struct user_namespace	*user_ns;	/* The user namespace for this mount */
 	struct net		*net_ns;	/* The network namespace for this mount */
 	const struct cred	*cred;		/* The mounter's credentials */
@@ -136,6 +138,7 @@ extern int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param)
 extern int vfs_parse_fs_string(struct fs_context *fc, const char *key,
 			       const char *value, size_t v_size);
 extern int generic_parse_monolithic(struct fs_context *fc, void *data);
+extern void vfs_set_container(struct fs_context *fc, struct container *container);
 extern int vfs_get_tree(struct fs_context *fc);
 extern void put_fs_context(struct fs_context *fc);
 
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 14a9a39da9c7..16dc152ceef1 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -73,6 +73,8 @@ static inline struct pid *get_pid(struct pid *pid)
 	return pid;
 }
 
+struct container;
+
 extern void put_pid(struct pid *pid);
 extern struct task_struct *pid_task(struct pid *pid, enum pid_type);
 extern struct task_struct *get_pid_task(struct pid *pid, enum pid_type);
@@ -111,7 +113,8 @@ extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 int next_pidmap(struct pid_namespace *pid_ns, unsigned int last);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns,
+			     struct container *container);
 extern void free_pid(struct pid *pid);
 extern void disable_pid_allocation(struct pid_namespace *ns);
 
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index d31cb6215905..dee0881eca5c 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -47,14 +47,16 @@ enum {
 
 #ifdef CONFIG_PROC_FS
 
-extern int pid_ns_prepare_proc(struct pid_namespace *ns);
+extern int pid_ns_prepare_proc(struct pid_namespace *ns,
+			       struct container *container);
 extern void pid_ns_release_proc(struct pid_namespace *ns);
 extern int proc_alloc_inum(unsigned int *pino);
 extern void proc_free_inum(unsigned int inum);
 
 #else /* CONFIG_PROC_FS */
 
-static inline int pid_ns_prepare_proc(struct pid_namespace *ns) { return 0; }
+static inline int pid_ns_prepare_proc(struct pid_namespace *ns, struct container *container)
+{ return 0; }
 static inline void pid_ns_release_proc(struct pid_namespace *ns) {}
 
 static inline int proc_alloc_inum(unsigned int *inum)
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index 96a0240f23fe..f60bbe6f4099 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -97,6 +97,7 @@ enum fsconfig_command {
 	FSCONFIG_SET_FD		= 5,	/* Set parameter, supplying an object by fd */
 	FSCONFIG_CMD_CREATE	= 6,	/* Invoke superblock creation */
 	FSCONFIG_CMD_RECONFIGURE = 7,	/* Invoke superblock reconfiguration */
+	FSCONFIG_SET_CONTAINER	= 8,	/* Set a container, supplied by fd */
 };
 
 /*
diff --git a/kernel/container.c b/kernel/container.c
index 1d2cb1c1e9b1..fd3b2a6849a1 100644
--- a/kernel/container.c
+++ b/kernel/container.c
@@ -30,6 +30,7 @@ struct container init_container = {
 	.cred		= &init_cred,
 	.ns		= &init_nsproxy,
 	.init		= &init_task,
+	.pid_ns		= &init_pid_ns,
 	.members.next	= &init_task.container_link,
 	.members.prev	= &init_task.container_link,
 	.children	= LIST_HEAD_INIT(init_container.children),
@@ -51,6 +52,8 @@ void put_container(struct container *c)
 
 	while (c && refcount_dec_and_test(&c->usage)) {
 		BUG_ON(!list_empty(&c->members));
+		if (c->pid_ns)
+			put_pid_ns(c->pid_ns);
 		if (c->ns)
 			put_nsproxy(c->ns);
 		path_put(&c->root);
@@ -391,6 +394,7 @@ static struct container *create_container(const char __user *name, unsigned int
 	}
 
 	c->ns = ns;
+	c->pid_ns = get_pid_ns(c->ns->pid_ns_for_children);
 	c->root = fs->root;
 	c->seq = fs->seq;
 	fs->root.mnt = NULL;
diff --git a/kernel/fork.c b/kernel/fork.c
index 71401deb4434..09de5f35d312 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1958,7 +1958,7 @@ static __latent_entropy struct task_struct *copy_process(
 	stackleak_task_init(p);
 
 	if (pid != &init_struct_pid) {
-		pid = alloc_pid(p->nsproxy->pid_ns_for_children);
+		pid = alloc_pid(p->nsproxy->pid_ns_for_children, dest_container);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_thread;
diff --git a/kernel/pid.c b/kernel/pid.c
index 20881598bdfa..6528a75e6c0d 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -156,7 +156,7 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, struct container *container)
 {
 	struct pid *pid;
 	enum pid_type type;
@@ -205,7 +205,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 	}
 
 	if (unlikely(is_child_reaper(pid))) {
-		if (pid_ns_prepare_proc(ns))
+		if (pid_ns_prepare_proc(ns, container))
 			goto out_free;
 	}
 



* [RFC PATCH 08/27] containers, vfs: Honour CONTAINER_NEW_EMPTY_FS_NS
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (6 preceding siblings ...)
  2019-02-15 16:08 ` [RFC PATCH 07/27] containers: Make fsopen() able to create a superblock in a container David Howells
@ 2019-02-15 16:08 ` David Howells
  2019-02-17  0:11   ` Al Viro
  2019-02-15 16:08 ` [RFC PATCH 09/27] vfs: Allow mounting to other namespaces David Howells
                   ` (21 subsequent siblings)
  29 siblings, 1 reply; 61+ messages in thread
From: David Howells @ 2019-02-15 16:08 UTC (permalink / raw)
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

Allow a container to be created with an empty mount namespace, as specified
by passing CONTAINER_NEW_EMPTY_FS_NS to container_create(), and allow a
root filesystem to be mounted into the container:

	cfd = container_create("foo", CONTAINER_NEW_EMPTY_FS_NS);

	fsfd = fsopen("ext3", 0);
	fsconfig(fsfd, FSCONFIG_SET_CONTAINER, NULL, NULL, cfd);
	fsconfig(fsfd, FSCONFIG_SET_STRING, "source", "/dev/sda3", 0);
	fsconfig(fsfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
	fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
	...
	rfd = fsmount(fsfd, 0, 0);
	move_mount(rfd, "", cfd, "/",
		   MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_T_CONTAINER_ROOT);

	pfd = fsopen("proc", 0);
	fsconfig(pfd, FSCONFIG_SET_CONTAINER, NULL, NULL, cfd);
	...
	procfd = fsmount(pfd, 0, 0);
	move_mount(procfd, "", cfd, "proc", MOVE_MOUNT_F_EMPTY_PATH);
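
The argument checks that move_mount() applies when
MOVE_MOUNT_T_CONTAINER_ROOT is given can be sketched in userspace as below.
This is only an illustrative model, not the kernel code:
check_move_mount_args() is a hypothetical name, the flag values are copied
from the uapi hunk in this patch, and the user-copy step
(strncpy_from_user()) is replaced by a plain string argument.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Flag values as defined in include/uapi/linux/mount.h by this patch. */
#define MOVE_MOUNT_T_SYMLINKS		0x00000010
#define MOVE_MOUNT_T_AUTOMOUNTS		0x00000020
#define MOVE_MOUNT_T_EMPTY_PATH		0x00000040
#define MOVE_MOUNT_T_CONTAINER_ROOT	0x00000080
#define MOVE_MOUNT__MASK		0x000000f7

/*
 * Hypothetical userspace model of the flag and path checks.  Returns 0 if
 * the arguments would be accepted, or a negative errno otherwise.
 */
static int check_move_mount_args(unsigned int flags, const char *to_pathname)
{
	assert(to_pathname != NULL);

	/* Unknown flag bits are rejected outright. */
	if (flags & ~MOVE_MOUNT__MASK)
		return -EINVAL;

	if (flags & MOVE_MOUNT_T_CONTAINER_ROOT) {
		/* Path-walk modifiers make no sense for a container root. */
		if (flags & (MOVE_MOUNT_T_SYMLINKS |
			     MOVE_MOUNT_T_AUTOMOUNTS |
			     MOVE_MOUNT_T_EMPTY_PATH))
			return -EINVAL;
		/* The destination path must be exactly "/". */
		if (to_pathname[0] != '/' || to_pathname[1] != '\0')
			return -EINVAL;
	}
	return 0;
}
```

In this model, only a destination of exactly "/" with none of the T_* lookup
modifiers passes when the container-root flag is set; any other combination
gives -EINVAL.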

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/namespace.c             |   95 +++++++++++++++++++++++++++++++++++++++-----
 include/uapi/linux/mount.h |    3 +
 kernel/container.c         |    6 +++
 kernel/fork.c              |    6 ++-
 4 files changed, 97 insertions(+), 13 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index cc5d56f7ae29..22cf4a8f8065 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3513,6 +3513,63 @@ SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags,
 	return ret;
 }
 
+/*
+ * Create a mount namespace for a container and set the root mount in it.
+ */
+static int set_container_root(struct path *path, int fd)
+{
+	struct mnt_namespace *mnt_ns;
+	struct container *container;
+	struct mount *mnt;
+	struct fd f;
+	int ret;
+
+	f = fdget(fd);
+	if (!f.file)
+		return -EBADF;
+	ret = -EINVAL;
+	if (!is_container_file(f.file))
+		goto out_fd;
+
+	ret = -EBUSY;
+	container = f.file->private_data;
+	if (container->ns->mnt_ns)
+		goto out_fd;
+
+	mnt_ns = alloc_mnt_ns(container->cred->user_ns, false);
+	if (IS_ERR(mnt_ns)) {
+		ret = PTR_ERR(mnt_ns);
+		goto out_fd;
+	}
+
+	mnt = real_mount(path->mnt);
+	mnt_add_count(mnt, 1);
+	mnt->mnt_ns = mnt_ns;
+	mnt_ns->root = mnt;
+	mnt_ns->mounts++;
+	list_add(&mnt->mnt_list, &mnt_ns->list);
+
+	ret = -EBUSY;
+	spin_lock(&container->lock);
+	if (!container->ns->mnt_ns) {
+		container->ns->mnt_ns = mnt_ns;
+		write_seqcount_begin(&container->seq);
+		container->root.mnt = path->mnt;
+		container->root.dentry = path->dentry;
+		write_seqcount_end(&container->seq);
+		path_get(&container->root);
+		mnt_ns = NULL;
+		ret = 0;
+	}
+	spin_unlock(&container->lock);
+
+	if (ret < 0)
+		put_mnt_ns(mnt_ns);
+out_fd:
+	fdput(f);
+	return ret;
+}
+
 /*
  * Move a mount from one place to another.  In combination with
  * fsopen()/fsmount() this is used to install a new mount and in combination
@@ -3528,6 +3585,7 @@ SYSCALL_DEFINE5(move_mount,
 {
 	struct path from_path, to_path;
 	unsigned int lflags;
+	char buf[2];
 	int ret = 0;
 
 	if (!may_mount())
@@ -3536,6 +3594,17 @@ SYSCALL_DEFINE5(move_mount,
 	if (flags & ~MOVE_MOUNT__MASK)
 		return -EINVAL;
 
+	if (flags & MOVE_MOUNT_T_CONTAINER_ROOT) {
+		if (flags & (MOVE_MOUNT_T_SYMLINKS |
+			     MOVE_MOUNT_T_AUTOMOUNTS |
+			     MOVE_MOUNT_T_EMPTY_PATH))
+			return -EINVAL;
+		if (strncpy_from_user(buf, to_pathname, 2) < 0)
+			return -EFAULT;
+		if (buf[0] != '/' || buf[1] != '\0')
+			return -EINVAL;
+	}
+
 	/* If someone gives a pathname, they aren't permitted to move
 	 * from an fd that requires unmount as we can't get at the flag
 	 * to clear it afterwards.
@@ -3549,20 +3618,24 @@ SYSCALL_DEFINE5(move_mount,
 	if (ret < 0)
 		return ret;
 
-	lflags = 0;
-	if (flags & MOVE_MOUNT_T_SYMLINKS)	lflags |= LOOKUP_FOLLOW;
-	if (flags & MOVE_MOUNT_T_AUTOMOUNTS)	lflags |= LOOKUP_AUTOMOUNT;
-	if (flags & MOVE_MOUNT_T_EMPTY_PATH)	lflags |= LOOKUP_EMPTY;
+	if (flags & MOVE_MOUNT_T_CONTAINER_ROOT) {
+		ret = set_container_root(&from_path, to_dfd);
+	} else {
+		lflags = 0;
+		if (flags & MOVE_MOUNT_T_SYMLINKS)	lflags |= LOOKUP_FOLLOW;
+		if (flags & MOVE_MOUNT_T_AUTOMOUNTS)	lflags |= LOOKUP_AUTOMOUNT;
+		if (flags & MOVE_MOUNT_T_EMPTY_PATH)	lflags |= LOOKUP_EMPTY;
 
-	ret = user_path_at(to_dfd, to_pathname, lflags, &to_path);
-	if (ret < 0)
-		goto out_from;
+		ret = user_path_at(to_dfd, to_pathname, lflags, &to_path);
+		if (ret < 0)
+			goto out_from;
 
-	ret = security_move_mount(&from_path, &to_path);
-	if (ret < 0)
-		goto out_to;
+		ret = security_move_mount(&from_path, &to_path);
+		if (ret < 0)
+			goto out_to;
 
-	ret = do_move_mount(&from_path, &to_path);
+		ret = do_move_mount(&from_path, &to_path);
+	}
 
 out_to:
 	path_put(&to_path);
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index f60bbe6f4099..cfaa75fa0594 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -70,7 +70,8 @@
 #define MOVE_MOUNT_T_SYMLINKS		0x00000010 /* Follow symlinks on to path */
 #define MOVE_MOUNT_T_AUTOMOUNTS		0x00000020 /* Follow automounts on to path */
 #define MOVE_MOUNT_T_EMPTY_PATH		0x00000040 /* Empty to path permitted */
-#define MOVE_MOUNT__MASK		0x00000077
+#define MOVE_MOUNT_T_CONTAINER_ROOT	0x00000080 /* Set as container root */
+#define MOVE_MOUNT__MASK		0x000000f7
 
 /*
  * fsopen() flags.
diff --git a/kernel/container.c b/kernel/container.c
index fd3b2a6849a1..360284db959b 100644
--- a/kernel/container.c
+++ b/kernel/container.c
@@ -21,6 +21,7 @@
 #include <linux/printk.h>
 #include <linux/security.h>
 #include <linux/proc_fs.h>
+#include <linux/mnt_namespace.h>
 #include "namespaces.h"
 
 struct container init_container = {
@@ -400,6 +401,11 @@ static struct container *create_container(const char __user *name, unsigned int
 	fs->root.mnt = NULL;
 	fs->root.dentry = NULL;
 
+	if (flags & CONTAINER_NEW_EMPTY_FS_NS) {
+		put_mnt_ns(ns->mnt_ns);
+		ns->mnt_ns = NULL;
+	}
+
 	ret = security_container_alloc(c, flags);
 	if (ret < 0)
 		goto err_fs;
diff --git a/kernel/fork.c b/kernel/fork.c
index 09de5f35d312..6ec507a5f739 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2374,7 +2374,11 @@ SYSCALL_DEFINE1(fork_into_container, int, containerfd)
 	if (is_container_file(f.file)) {
 		struct container *dest_container = f.file->private_data;
 
-		ret = _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0, dest_container);
+		if (!dest_container->ns->mnt_ns)
+			ret = -ENOENT;
+		else
+			ret = _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0,
+				       dest_container);
 	}
 	fdput(f);
 	return ret;



* [RFC PATCH 09/27] vfs: Allow mounting to other namespaces
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (7 preceding siblings ...)
  2019-02-15 16:08 ` [RFC PATCH 08/27] containers, vfs: Honour CONTAINER_NEW_EMPTY_FS_NS David Howells
@ 2019-02-15 16:08 ` David Howells
  2019-02-17  0:14   ` Al Viro
  2019-02-15 16:08 ` [RFC PATCH 10/27] containers: Provide fs_context op for container setting David Howells
                   ` (20 subsequent siblings)
  29 siblings, 1 reply; 61+ messages in thread
From: David Howells @ 2019-02-15 16:08 UTC (permalink / raw)
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

Currently sys_move_mount() and sys_mount(MS_MOVE) prevent the caller from
moving a mount into a namespace other than their own.  Relax this such that
any mount can be moved onto any given mountpoint, provided that the source
mount is either detached or in the same namespace as the destination.

This permits container namespaces to be built from the outside rather than
from the inside.
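
The relaxed rule can be modelled in userspace roughly as follows.  This is a
sketch under assumptions: the struct and function names are hypothetical
stand-ins, "attached" mirrors the local variable of that name in
do_move_mount(), and the separate check that a detached mount lives in an
anonymous namespace is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's namespace and mount objects. */
struct mnt_namespace_model {
	int id;
};

struct mount_model {
	struct mnt_namespace_model *mnt_ns;	/* Namespace the mount is in */
	bool attached;				/* Mirrors 'attached' in do_move_mount() */
};

/*
 * Model of the relaxed check: an attached source mount must share the
 * destination's namespace; a detached one is not tied to any namespace
 * here.  Returns 0 if the move may proceed, -EINVAL otherwise.
 */
static int may_move_mount(const struct mount_model *old,
			  const struct mount_model *p)
{
	assert(old != NULL && p != NULL);

	if (old->attached && old->mnt_ns != p->mnt_ns)
		return -EINVAL;
	return 0;
}
```

Note that the caller's own namespace does not appear in the model at all,
which is what lets a manager outside the container populate it.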

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/namespace.c |   10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 22cf4a8f8065..804601b6297c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2627,12 +2627,10 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
 	ns = old->mnt_ns;
 
 	err = -EINVAL;
-	/* The mountpoint must be in our namespace. */
-	if (!check_mnt(p))
-		goto out;
-
-	/* The thing moved should be either ours or completely unattached. */
-	if (attached && !check_mnt(old))
+	/* The new mount must be either unattached or in the same namespace as
+	 * the mountpoint.
+	 */
+	if (attached && old->mnt_ns != p->mnt_ns)
 		goto out;
 
 	if (!attached && !is_anon_ns(ns))



* [RFC PATCH 10/27] containers: Provide fs_context op for container setting
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (8 preceding siblings ...)
  2019-02-15 16:08 ` [RFC PATCH 09/27] vfs: Allow mounting to other namespaces David Howells
@ 2019-02-15 16:08 ` David Howells
  2019-02-15 16:09 ` [RFC PATCH 11/27] containers: Sample program for driving container objects David Howells
                   ` (19 subsequent siblings)
  29 siblings, 0 replies; 61+ messages in thread
From: David Howells @ 2019-02-15 16:08 UTC (permalink / raw)
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

Provide an fs_context op to notify a filesystem that a container has been
set.  The filesystem should do whatever cleanup it needs, then call
do_set_container(), and then reinitialise any state that depends on the
container or its namespaces.

This allows the following:

 (1) proc and mqueue mounts to set the correct pid and ipc namespaces
     respectively.

 (2) afs to discard the old default cell before the net namespace is
     changed (ie. while it is still pinned), after which it can get the new
     default cell.
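
The intended calling sequence can be sketched in userspace like this.  It is
a simplified model, not the kernel code: every type and function name
carrying a _model suffix is hypothetical, namespaces are reduced to bare
reference counts, container refcounting is omitted, and the example
filesystem op follows the contract stated above (clean up, call the generic
helper, then pick up the new namespace).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted namespace stand-in. */
struct ns_model {
	int refs;
};

static struct ns_model *get_ns(struct ns_model *ns) { ns->refs++; return ns; }
static void put_ns(struct ns_model *ns) { ns->refs--; }

struct container_model {
	struct ns_model *net_ns;
	struct ns_model *pid_ns;
};

struct fs_context_model;

struct fs_context_ops_model {
	void (*set_container)(struct fs_context_model *fc);	/* Optional */
};

struct fs_context_model {
	const struct fs_context_ops_model *ops;
	struct container_model *container;
	struct ns_model *net_ns;	/* Generic state */
	struct ns_model *fs_pid_ns;	/* Filesystem-private state */
};

/* Generic helper: switch the net namespace to the container's. */
static void do_set_container_model(struct fs_context_model *fc)
{
	put_ns(fc->net_ns);
	fc->net_ns = get_ns(fc->container->net_ns);
}

/* Filesystem op: drop old private state, switch, take the new state. */
static void procish_set_container(struct fs_context_model *fc)
{
	put_ns(fc->fs_pid_ns);
	do_set_container_model(fc);
	fc->fs_pid_ns = get_ns(fc->container->pid_ns);
}

/* Dispatcher: use the filesystem's op if it has one, else the helper. */
static void vfs_set_container_model(struct fs_context_model *fc,
				    struct container_model *c)
{
	assert(fc != NULL);

	if (!c)
		return;
	fc->container = c;
	if (fc->ops && fc->ops->set_container)
		fc->ops->set_container(fc);
	else
		do_set_container_model(fc);
}
```

Splitting the generic helper out of the dispatcher is what gives the
filesystem a window in which the old namespaces are still pinned.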

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/afs/super.c             |   18 ++++++++++++++++++
 fs/fs_context.c            |   32 ++++++++++++++++++++++++++------
 fs/proc/root.c             |    9 +++++++++
 include/linux/fs_context.h |    2 ++
 ipc/mqueue.c               |   10 ++++++++++
 5 files changed, 65 insertions(+), 6 deletions(-)

diff --git a/fs/afs/super.c b/fs/afs/super.c
index 4e33a7038bc5..a349e213bdc8 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -569,6 +569,23 @@ static int afs_get_tree(struct fs_context *fc)
 	return ret;
 }
 
+static void afs_set_container(struct fs_context *fc)
+{
+	struct afs_fs_context *ctx = fc->fs_private;
+	struct afs_cell *cell;
+
+	afs_put_cell(ctx->net, ctx->cell);
+	do_set_container(fc);
+
+	/* Default to the workstation cell. */
+	rcu_read_lock();
+	cell = afs_lookup_cell_rcu(ctx->net, NULL, 0);
+	rcu_read_unlock();
+	if (IS_ERR(cell))
+		cell = NULL;
+	ctx->cell = cell;
+}
+
 static void afs_free_fc(struct fs_context *fc)
 {
 	struct afs_fs_context *ctx = fc->fs_private;
@@ -583,6 +600,7 @@ static void afs_free_fc(struct fs_context *fc)
 static const struct fs_context_operations afs_context_ops = {
 	.free		= afs_free_fc,
 	.parse_param	= afs_parse_param,
+	.set_container	= afs_set_container,
 	.get_tree	= afs_get_tree,
 };
 
diff --git a/fs/fs_context.c b/fs/fs_context.c
index fc76ac02d618..c0f333cc0e16 100644
--- a/fs/fs_context.c
+++ b/fs/fs_context.c
@@ -170,18 +170,38 @@ int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param)
 }
 EXPORT_SYMBOL(vfs_parse_fs_param);
 
+/**
+ * do_set_container - Helper to set container
+ * @fc: The fs_context to adjust
+ *
+ * This is called to effect the change of namespaces associated with the
+ * container.  The reason that this isn't rolled into vfs_set_container() is
+ * that the filesystem may need to do some cleanup on the old namespaces (which
+ * are currently pinned by the container) before calling this.
+ *
+ * The user namespace is not changed as that is used for security checks.
+ */
+void do_set_container(struct fs_context *fc)
+{
+	put_net(fc->net_ns);
+	fc->net_ns = get_net(fc->container->ns->net_ns);
+}
+EXPORT_SYMBOL(do_set_container);
+
 /*
- * Specify a container in which a superblock will exist.
+ * Specify a container in which a superblock will exist.  This should be called
+ * before calling vfs_parse_fs_param.  If ->set_container() is supplied by the
+ * filesystem, it should call do_set_container().
  */
 void vfs_set_container(struct fs_context *fc, struct container *container)
 {
 	if (container) {
-		put_user_ns(fc->user_ns);
-		put_net(fc->net_ns);
-
+		put_container(fc->container);
 		fc->container = get_container(container);
-		fc->user_ns = get_user_ns(container->cred->user_ns);
-		fc->net_ns = get_net(container->ns->net_ns);
+		if (fc->ops->set_container)
+			fc->ops->set_container(fc);
+		else
+			do_set_container(fc);
 	}
 }
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index aa802006d855..f8e124ce0888 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -164,6 +164,14 @@ static int proc_get_tree(struct fs_context *fc)
 	return vfs_get_super(fc, vfs_get_keyed_super, proc_fill_super);
 }
 
+static void proc_set_container(struct fs_context *fc)
+{
+	struct proc_fs_context *ctx = fc->fs_private;
+
+	put_pid_ns(ctx->pid_ns);
+	ctx->pid_ns = get_pid_ns(fc->container->pid_ns);
+}
+
 static void proc_fs_context_free(struct fs_context *fc)
 {
 	struct proc_fs_context *ctx = fc->fs_private;
@@ -176,6 +184,7 @@ static void proc_fs_context_free(struct fs_context *fc)
 static const struct fs_context_operations proc_fs_context_ops = {
 	.free		= proc_fs_context_free,
 	.parse_param	= proc_parse_param,
+	.set_container	= proc_set_container,
 	.get_tree	= proc_get_tree,
 	.reconfigure	= proc_reconfigure,
 };
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 45486080eb84..086e4f24705a 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -118,6 +118,7 @@ struct fs_context_operations {
 	int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
 	int (*parse_param)(struct fs_context *fc, struct fs_parameter *param);
 	int (*parse_monolithic)(struct fs_context *fc, void *data);
+	void (*set_container)(struct fs_context *fc);
 	int (*get_tree)(struct fs_context *fc);
 	int (*reconfigure)(struct fs_context *fc);
 };
@@ -138,6 +139,7 @@ extern int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param)
 extern int vfs_parse_fs_string(struct fs_context *fc, const char *key,
 			       const char *value, size_t v_size);
 extern int generic_parse_monolithic(struct fs_context *fc, void *data);
+extern void do_set_container(struct fs_context *fc);
 extern void vfs_set_container(struct fs_context *fc, struct container *container);
 extern int vfs_get_tree(struct fs_context *fc);
 extern void put_fs_context(struct fs_context *fc);
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 2a9a8be49f5b..821fb227800f 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -33,6 +33,7 @@
 #include <linux/mutex.h>
 #include <linux/nsproxy.h>
 #include <linux/pid.h>
+#include <linux/container.h>
 #include <linux/ipc_namespace.h>
 #include <linux/user_namespace.h>
 #include <linux/slab.h>
@@ -329,6 +330,14 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
 	return ERR_PTR(ret);
 }
 
+static void mqueue_set_container(struct fs_context *fc)
+{
+	struct mqueue_fs_context *ctx = fc->fs_private;
+
+	put_ipc_ns(ctx->ipc_ns);
+	ctx->ipc_ns = get_ipc_ns(fc->container->ns->ipc_ns);
+}
+
 static int mqueue_fill_super(struct super_block *sb, struct fs_context *fc)
 {
 	struct inode *inode;
@@ -1569,6 +1578,7 @@ static const struct super_operations mqueue_super_ops = {
 
 static const struct fs_context_operations mqueue_fs_context_ops = {
 	.free		= mqueue_fs_context_free,
+	.set_container	= mqueue_set_container,
 	.get_tree	= mqueue_get_tree,
 };
 



* [RFC PATCH 11/27] containers: Sample program for driving container objects
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (9 preceding siblings ...)
  2019-02-15 16:08 ` [RFC PATCH 10/27] containers: Provide fs_context op for container setting David Howells
@ 2019-02-15 16:09 ` David Howells
  2019-02-15 16:09 ` [RFC PATCH 12/27] containers: Allow a daemon to intercept request_key upcalls in a container David Howells
                   ` (18 subsequent siblings)
  29 siblings, 0 replies; 61+ messages in thread
From: David Howells @ 2019-02-15 16:09 UTC (permalink / raw)
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

Add a sample program to demonstrate driving a container object.  It can be
run with something like:

	./samples/vfs/test-container /dev/sda3

where /dev/sda3 holds an ext4 filesystem that has appropriate /etc, /bin,
/usr, /lib, /proc directories emplaced such that procfs can be mounted and
then /bin/bash can be executed within the container.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 samples/vfs/Makefile         |    5 +
 samples/vfs/test-container.c |  279 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 283 insertions(+), 1 deletion(-)
 create mode 100644 samples/vfs/test-container.c

diff --git a/samples/vfs/Makefile b/samples/vfs/Makefile
index b88655cb2f1d..25420919ee40 100644
--- a/samples/vfs/Makefile
+++ b/samples/vfs/Makefile
@@ -4,7 +4,8 @@ hostprogs-$(CONFIG_SAMPLE_VFS) := \
 	test-fs-query \
 	test-fsmount \
 	test-mntinfo \
-	test-statx
+	test-statx \
+	test-container
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -17,3 +18,5 @@ HOSTLDLIBS_test-mntinfo += -lm
 HOSTCFLAGS_test-fs-query.o += -I$(objtree)/usr/include
 HOSTCFLAGS_test-fsmount.o += -I$(objtree)/usr/include
 HOSTCFLAGS_test-statx.o += -I$(objtree)/usr/include
+HOSTCFLAGS_test-container.o += -I$(objtree)/usr/include
+HOSTLDLIBS_test-container += -lkeyutils
diff --git a/samples/vfs/test-container.c b/samples/vfs/test-container.c
new file mode 100644
index 000000000000..44ff57afb5a4
--- /dev/null
+++ b/samples/vfs/test-container.c
@@ -0,0 +1,279 @@
+/* Container test.
+ *
+ * Copyright (C) 2019 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <sys/wait.h>
+#include <linux/mount.h>
+#include <linux/unistd.h>
+#include <dirent.h>
+#include <sys/stat.h>
+#include <keyutils.h>
+
+/* Hope -1 isn't a syscall */
+#ifndef __NR_fsopen
+#define __NR_fsopen -1
+#endif
+#ifndef __NR_fsmount
+#define __NR_fsmount -1
+#endif
+#ifndef __NR_fsconfig
+#define __NR_fsconfig -1
+#endif
+#ifndef __NR_move_mount
+#define __NR_move_mount -1
+#endif
+
+
+#define E(x) do { if ((x) == -1) { perror(#x); exit(1); } } while(0)
+
+static void check_messages(int fd)
+{
+	char buf[4096];
+	int err, n;
+
+	err = errno;
+
+	for (;;) {
+		n = read(fd, buf, sizeof(buf));
+		if (n < 0)
+			break;
+		n -= 2;
+
+		switch (buf[0]) {
+		case 'e':
+			fprintf(stderr, "Error: %*.*s\n", n, n, buf + 2);
+			break;
+		case 'w':
+			fprintf(stderr, "Warning: %*.*s\n", n, n, buf + 2);
+			break;
+		case 'i':
+			fprintf(stderr, "Info: %*.*s\n", n, n, buf + 2);
+			break;
+		}
+	}
+
+	errno = err;
+}
+
+static __attribute__((noreturn))
+void mount_error(int fd, const char *s)
+{
+	check_messages(fd);
+	fprintf(stderr, "%s: %m\n", s);
+	exit(1);
+}
+
+#define CONTAINER_NEW_FS_NS		0x00000001 /* Dup current fs namespace */
+#define CONTAINER_NEW_EMPTY_FS_NS	0x00000002 /* Provide new empty fs namespace */
+#define CONTAINER_NEW_CGROUP_NS		0x00000004 /* Dup current cgroup namespace [priv] */
+#define CONTAINER_NEW_UTS_NS		0x00000008 /* Dup current uts namespace */
+#define CONTAINER_NEW_IPC_NS		0x00000010 /* Dup current ipc namespace */
+#define CONTAINER_NEW_USER_NS		0x00000020 /* Dup current user namespace */
+#define CONTAINER_NEW_PID_NS		0x00000040 /* Dup current pid namespace */
+#define CONTAINER_NEW_NET_NS		0x00000080 /* Dup current net namespace */
+#define CONTAINER_KILL_ON_CLOSE		0x00000100 /* Kill all member processes when fd closed */
+#define CONTAINER_FD_CLOEXEC		0x00000200 /* Close the fd on exec */
+#define CONTAINER__FLAG_MASK		0x000003ff
+
+static inline int fsopen(const char *fs_name, unsigned int flags)
+{
+	return syscall(__NR_fsopen, fs_name, flags);
+}
+
+static inline int fsconfig(int fsfd, unsigned int cmd,
+			   const char *key, const void *val, int aux)
+{
+	return syscall(__NR_fsconfig, fsfd, cmd, key, val, aux);
+}
+
+static inline int fsmount(int fsfd, unsigned int flags, unsigned int attr_flags)
+{
+	return syscall(__NR_fsmount, fsfd, flags, attr_flags);
+}
+
+static inline int move_mount(int from_dfd, const char *from_pathname,
+			     int to_dfd, const char *to_pathname,
+			     unsigned int flags)
+{
+	return syscall(__NR_move_mount,
+		       from_dfd, from_pathname,
+		       to_dfd, to_pathname, flags);
+}
+
+static inline int container_create(const char *name, unsigned int mask)
+{
+	return syscall(__NR_container_create, name, mask, 0, 0, 0);
+}
+
+static inline int fork_into_container(int containerfd)
+{
+	return syscall(__NR_fork_into_container, containerfd);
+}
+
+#define E_fsconfig(fd, cmd, key, val, aux)				\
+	do {								\
+		if (fsconfig(fd, cmd, key, val, aux) == -1)		\
+			mount_error(fd, key ?: "create");		\
+	} while (0)
+
+/*
+ * The container init process.
+ */
+static __attribute__((noreturn))
+void container_init(void)
+{
+	if (0) {
+		/* Do a bit of debugging on the container. */
+		struct dirent **dlist;
+		struct stat st;
+		char buf[4096];
+		int n, i;
+
+		printf("hello!\n");
+		n = scandir("/", &dlist, NULL, alphasort);
+		if (n == -1) {
+			perror("scandir");
+			exit(1);
+		}
+
+		for (i = 0; i < n; i++) {
+			struct dirent *p = dlist[i];
+
+			if (p)
+				printf("- %u %s\n", p->d_type, p->d_name);
+		}
+
+		n = readlink("/bin", buf, sizeof(buf) - 1);
+		if (n == -1) {
+			perror("readlink");
+			exit(1);
+		}
+
+		buf[n] = 0;
+		printf("/bin -> %s\n", buf);
+
+		if (stat("/lib64/ld-linux-x86-64.so.2", &st) == -1) {
+			perror("stat");
+			exit(1);
+		}
+
+		printf("mode %o\n", st.st_mode);
+	}
+
+	if (keyctl_join_session_keyring(NULL) == -1) {
+		perror("keyctl/join");
+		exit(1);
+	}
+
+	setenv("PS1", "container>", 1);
+	execl("/bin/bash", "bash", (char *)NULL);
+	perror("execl");
+	exit(1);
+}
+
+/*
+ * The container manager process.
+ */
+int main(int argc, char *argv[])
+{
+	pid_t pid;
+	int fsfd, mfd, cfd, ws;
+
+	if (argc != 2) {
+		fprintf(stderr, "Format: test-container <root-dev>\n");
+		exit(2);
+	}
+
+	cfd = container_create("foo-test",
+			       CONTAINER_NEW_EMPTY_FS_NS |
+			       //CONTAINER_NEW_UTS_NS |
+			       //CONTAINER_NEW_IPC_NS |
+			       //CONTAINER_NEW_USER_NS |
+			       CONTAINER_NEW_PID_NS |
+			       CONTAINER_KILL_ON_CLOSE |
+			       CONTAINER_FD_CLOEXEC);
+	if (cfd == -1) {
+		perror("container_create");
+		exit(1);
+	}
+
+	system("cat /proc/containers");
+
+	/* Open the filesystem that's going to form the container root. */
+	printf("Creating root...\n");
+	fsfd = fsopen("ext4", 0);
+	if (fsfd == -1) {
+		perror("fsopen/root");
+		exit(1);
+	}
+
+	E_fsconfig(fsfd, FSCONFIG_SET_CONTAINER, NULL, NULL, cfd);
+	E_fsconfig(fsfd, FSCONFIG_SET_STRING, "source", argv[1], 0);
+	E_fsconfig(fsfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
+	E_fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
+
+	/* Mount the container root */
+	printf("Mounting root...\n");
+	mfd = fsmount(fsfd, 0, 0);
+	if (mfd < 0)
+		mount_error(fsfd, "fsmount/root");
+
+	if (move_mount(mfd, "", cfd, "/",
+		       MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_T_CONTAINER_ROOT) < 0) {
+		perror("move_mount/root");
+		exit(1);
+	}
+	E(close(fsfd));
+	E(close(mfd));
+
+	/* Mount procfs within the container */
+	printf("Creating procfs...\n");
+	fsfd = fsopen("proc", 0);
+	if (fsfd == -1) {
+		perror("fsopen/proc");
+		exit(1);
+	}
+
+	E_fsconfig(fsfd, FSCONFIG_SET_CONTAINER, NULL, NULL, cfd);
+	E_fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
+
+	printf("Mounting procfs...\n");
+	mfd = fsmount(fsfd, 0, 0);
+	if (mfd < 0)
+		mount_error(fsfd, "fsmount/proc");
+	if (move_mount(mfd, "", cfd, "proc", MOVE_MOUNT_F_EMPTY_PATH) < 0) {
+		perror("move_mount/proc");
+		exit(1);
+	}
+	E(close(fsfd));
+	E(close(mfd));
+
+	/* Start the 'init' process. */
+	printf("Forking...\n");
+	switch ((pid = fork_into_container(cfd))) {
+	case -1:
+		perror("fork_into_container");
+		exit(1);
+	case 0:
+		close(cfd);
+		container_init();
+	default:
+		if (waitpid(pid, &ws, 0) < 0) {
+			perror("waitpid");
+			exit(1);
+		}
+	}
+	E(close(cfd));
+	exit(0);
+}



* [RFC PATCH 12/27] containers: Allow a daemon to intercept request_key upcalls in a container
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (10 preceding siblings ...)
  2019-02-15 16:09 ` [RFC PATCH 11/27] containers: Sample program for driving container objects David Howells
@ 2019-02-15 16:09 ` David Howells
  2019-02-15 16:09 ` [RFC PATCH 13/27] keys: Provide a keyctl to query a request_key authentication key David Howells
                   ` (17 subsequent siblings)
  29 siblings, 0 replies; 61+ messages in thread
From: David Howells @ 2019-02-15 16:09 UTC (permalink / raw)
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

Provide a mechanism by which a running daemon can intercept request_key
upcalls, filtered by namespace and key type, and service them.  The list of
active services is per-container.

Intercepts for a specific {key_type, namespace} can be installed on a
container with:

	keyctl(KEYCTL_CONTAINER_INTERCEPT,
	       int containerfd,
	       const char *type_name,
	       unsigned int ns_id,
	       key_serial_t dest_keyring);

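The authentication token keys for intercepted keys are linked into the
destination keyring.  A type_name of "*" matches all key types.  Passing a
dest_keyring of 0 removes the intercept for type_name, or all intercepts if
type_name is NULL.  Filtering on ns_id is not yet implemented.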
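
As a rough illustration of the matching order used by the code in this
patch, here is a userspace model of the per-container trap list.  It is only
a sketch: struct trap, struct trap_list, trap_add() and trap_match() are
hypothetical names made up for this example and are not part of the kernel
interface.  Specific rules are placed at the front and the "*" rule at the
back, so a specific rule is found first.

```c
#include <stddef.h>
#include <string.h>

#define MAX_TRAPS 8

/* Hypothetical userspace stand-in for struct request_key_intercept. */
struct trap {
	char type[32];		/* Key type name, or "*" for all types */
	int  dest_keyring;	/* Serial of the keyring receiving auth keys */
};

struct trap_list {
	struct trap traps[MAX_TRAPS];
	size_t      count;
};

/*
 * Add a rule.  Returns 0 on success, or -1 if the type is already trapped,
 * the name is too long or the list is full.  Specific rules go to the front
 * of the list and the "*" rule goes to the back.
 */
static int trap_add(struct trap_list *l, const char *type, int dest_keyring)
{
	size_t i, pos;

	if (l->count == MAX_TRAPS || strlen(type) >= sizeof(l->traps[0].type))
		return -1;
	for (i = 0; i < l->count; i++)
		if (strcmp(l->traps[i].type, type) == 0)
			return -1;

	pos = (strcmp(type, "*") == 0) ? l->count : 0;
	memmove(&l->traps[pos + 1], &l->traps[pos],
		(l->count - pos) * sizeof(l->traps[0]));
	strcpy(l->traps[pos].type, type);
	l->traps[pos].dest_keyring = dest_keyring;
	l->count++;
	return 0;
}

/*
 * Return the destination keyring of the first matching rule, or -1 if
 * nothing matches (where the kernel would fall back to /sbin/request-key).
 */
static int trap_match(const struct trap_list *l, const char *type)
{
	size_t i;

	for (i = 0; i < l->count; i++) {
		const struct trap *t = &l->traps[i];

		if (strcmp(t->type, "*") == 0 || strcmp(t->type, type) == 0)
			return t->dest_keyring;
	}
	return -1;
}
```

With a "*" rule pointing at keyring 100 and a "dns_resolver" rule pointing at
keyring 200, trap_match() picks 200 for "dns_resolver" and 100 for any other
type, regardless of the order in which the two rules were added.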

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/container.h        |    2 
 include/linux/key-type.h         |    2 
 include/uapi/linux/keyctl.h      |    1 
 kernel/container.c               |    4 +
 security/keys/Makefile           |    2 
 security/keys/compat.c           |    5 +
 security/keys/container.c        |  227 ++++++++++++++++++++++++++++++++++++++
 security/keys/internal.h         |   10 ++
 security/keys/keyctl.c           |   14 ++
 security/keys/request_key.c      |   18 ++-
 security/keys/request_key_auth.c |    6 +
 11 files changed, 278 insertions(+), 13 deletions(-)
 create mode 100644 security/keys/container.c

diff --git a/include/linux/container.h b/include/linux/container.h
index 087aa1885ef7..a8cac800ce75 100644
--- a/include/linux/container.h
+++ b/include/linux/container.h
@@ -42,6 +42,7 @@ struct container {
 	struct list_head	members;	/* Member processes, guarded with ->lock */
 	struct list_head	child_link;	/* Link in parent->children */
 	struct list_head	children;	/* Child containers */
+	struct list_head	req_key_traps;	/* Traps for request-key upcalls */
 	wait_queue_head_t	waitq;		/* Someone waiting for init to exit waits here */
 	unsigned long		flags;
 #define CONTAINER_FLAG_INIT_STARTED	0	/* Init is started - certain ops now prohibited */
@@ -60,6 +61,7 @@ extern int copy_container(unsigned long flags, struct task_struct *tsk,
 			  struct container *container);
 extern void exit_container(struct task_struct *tsk);
 extern void put_container(struct container *c);
+extern long key_del_intercept(struct container *c, const char *type);
 
 static inline struct container *get_container(struct container *c)
 {
diff --git a/include/linux/key-type.h b/include/linux/key-type.h
index 2148a6bf58f1..0e09dac53245 100644
--- a/include/linux/key-type.h
+++ b/include/linux/key-type.h
@@ -66,7 +66,7 @@ struct key_match_data {
  */
 struct key_type {
 	/* name of the type */
-	const char *name;
+	const char name[24];
 
 	/* default payload length for quota precalculation (optional)
 	 * - this can be used instead of calling key_payload_reserve(), that
diff --git a/include/uapi/linux/keyctl.h b/include/uapi/linux/keyctl.h
index e9e7da849619..85e8fef89bba 100644
--- a/include/uapi/linux/keyctl.h
+++ b/include/uapi/linux/keyctl.h
@@ -68,6 +68,7 @@
 #define KEYCTL_PKEY_VERIFY		28	/* Verify a public key signature */
 #define KEYCTL_RESTRICT_KEYRING		29	/* Restrict keys allowed to link to a keyring */
 #define KEYCTL_WATCH_KEY		30	/* Watch a key or ring of keys for changes */
+#define KEYCTL_CONTAINER_INTERCEPT	31	/* Intercept upcalls inside a container */
 
 /* keyctl structures */
 struct keyctl_dh_params {
diff --git a/kernel/container.c b/kernel/container.c
index 360284db959b..33e41fe5050b 100644
--- a/kernel/container.c
+++ b/kernel/container.c
@@ -35,6 +35,7 @@ struct container init_container = {
 	.members.next	= &init_task.container_link,
 	.members.prev	= &init_task.container_link,
 	.children	= LIST_HEAD_INIT(init_container.children),
+	.req_key_traps	= LIST_HEAD_INIT(init_container.req_key_traps),
 	.flags		= (1 << CONTAINER_FLAG_INIT_STARTED),
 	.lock		= __SPIN_LOCK_UNLOCKED(init_container.lock),
 	.seq		= SEQCNT_ZERO(init_fs.seq),
@@ -53,6 +54,8 @@ void put_container(struct container *c)
 
 	while (c && refcount_dec_and_test(&c->usage)) {
 		BUG_ON(!list_empty(&c->members));
+		if (!list_empty(&c->req_key_traps))
+			key_del_intercept(c, NULL);
 		if (c->pid_ns)
 			put_pid_ns(c->pid_ns);
 		if (c->ns)
@@ -286,6 +289,7 @@ static struct container *alloc_container(const char __user *name)
 
 	INIT_LIST_HEAD(&c->members);
 	INIT_LIST_HEAD(&c->children);
+	INIT_LIST_HEAD(&c->req_key_traps);
 	init_waitqueue_head(&c->waitq);
 	spin_lock_init(&c->lock);
 	refcount_set(&c->usage, 1);
diff --git a/security/keys/Makefile b/security/keys/Makefile
index 9cef54064f60..24f5df27b1c2 100644
--- a/security/keys/Makefile
+++ b/security/keys/Makefile
@@ -16,6 +16,7 @@ obj-y := \
 	request_key.o \
 	request_key_auth.o \
 	user_defined.o
+
 compat-obj-$(CONFIG_KEY_DH_OPERATIONS) += compat_dh.o
 obj-$(CONFIG_KEYS_COMPAT) += compat.o $(compat-obj-y)
 obj-$(CONFIG_PROC_FS) += proc.o
@@ -23,6 +24,7 @@ obj-$(CONFIG_SYSCTL) += sysctl.o
 obj-$(CONFIG_PERSISTENT_KEYRINGS) += persistent.o
 obj-$(CONFIG_KEY_DH_OPERATIONS) += dh.o
 obj-$(CONFIG_ASYMMETRIC_KEY_TYPE) += keyctl_pkey.o
+obj-$(CONFIG_CONTAINERS) += container.o
 
 #
 # Key types
diff --git a/security/keys/compat.c b/security/keys/compat.c
index 021d8e1c9233..6420881e5ce7 100644
--- a/security/keys/compat.c
+++ b/security/keys/compat.c
@@ -161,6 +161,11 @@ COMPAT_SYSCALL_DEFINE5(keyctl, u32, option,
 	case KEYCTL_WATCH_KEY:
 		return keyctl_watch_key(arg2, arg3, arg4);
 
+#ifdef CONFIG_CONTAINERS
+	case KEYCTL_CONTAINER_INTERCEPT:
+		return keyctl_container_intercept(arg2, compat_ptr(arg3), arg4, arg5);
+#endif
+
 	default:
 		return -EOPNOTSUPP;
 	}
diff --git a/security/keys/container.c b/security/keys/container.c
new file mode 100644
index 000000000000..c61c43658f3b
--- /dev/null
+++ b/security/keys/container.c
@@ -0,0 +1,227 @@
+/* Container intercept interface
+ *
+ * Copyright (C) 2019 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/key.h>
+#include <linux/key-type.h>
+#include <linux/container.h>
+#include <keys/request_key_auth-type.h>
+#include "internal.h"
+
+struct request_key_intercept {
+	char			type[32];	/* The type of key to be trapped */
+	struct list_head	link;		/* Link in container->req_key_traps */
+	struct key		*dest_keyring;	/* Where to place the trapped auth keys */
+	struct ns_common	*ns;		/* Namespace the key must match */
+};
+
+/*
+ * Add an intercept filter to a container.
+ */
+static long key_add_intercept(struct container *c, struct request_key_intercept *rki)
+{
+	struct request_key_intercept *p;
+
+	kenter("%p,{%s,%d}", c, rki->type, key_serial(rki->dest_keyring));
+
+	spin_lock(&c->lock);
+	list_for_each_entry(p, &c->req_key_traps, link) {
+		if (strcmp(rki->type, p->type) == 0) {
+			spin_unlock(&c->lock);
+			return -EEXIST;
+		}
+	}
+
+	/* We put all-matching rules at the back so they're checked after the
+	 * more specific rules.
+	 */
+	if (rki->type[0] == '*' && !rki->type[1])
+		list_add_tail(&rki->link, &c->req_key_traps);
+	else
+		list_add(&rki->link, &c->req_key_traps);
+
+	spin_unlock(&c->lock);
+	kleave(" = 0");
+	return 0;
+}
+
+/*
+ * Remove one or more intercept filters from a container.  Returns the number
+ * of entries removed.
+ */
+long key_del_intercept(struct container *c, const char *type)
+{
+	struct request_key_intercept *p, *q;
+	long count;
+	LIST_HEAD(graveyard);
+
+	kenter("%p,%s", c, type);
+
+	spin_lock(&c->lock);
+	list_for_each_entry_safe(p, q, &c->req_key_traps, link) {
+		if (!type || strcmp(p->type, type) == 0) {
+			kdebug("- match %d", key_serial(p->dest_keyring));
+			list_move(&p->link, &graveyard);
+		}
+	}
+	spin_unlock(&c->lock);
+
+	count = 0;
+	while (!list_empty(&graveyard)) {
+		p = list_entry(graveyard.next, struct request_key_intercept, link);
+		list_del(&p->link);
+		count++;
+
+		key_put(p->dest_keyring);
+		kfree(p);
+	}
+
+	kleave(" = %ld", count);
+	return count;
+}
+
+/*
+ * Create an intercept filter and add it to a container.
+ */
+static long key_create_intercept(struct container *c, const char *type,
+				 key_serial_t dest_ring_id)
+{
+	struct request_key_intercept *rki;
+	key_ref_t dest_ref;
+	long ret = -ENOMEM;
+
+	dest_ref = lookup_user_key(dest_ring_id, KEY_LOOKUP_CREATE,
+				   KEY_NEED_WRITE);
+	if (IS_ERR(dest_ref))
+		return PTR_ERR(dest_ref);
+
+	rki = kzalloc(sizeof(*rki), GFP_KERNEL);
+	if (!rki)
+		goto out_dest;
+
+	memcpy(rki->type, type, sizeof(rki->type));
+	rki->dest_keyring = key_ref_to_ptr(dest_ref);
+	/* TODO: set rki->ns */
+
+	ret = key_add_intercept(c, rki);
+	if (ret < 0)
+		goto out_rki;
+	return ret;
+
+out_rki:
+	kfree(rki);
+out_dest:
+	key_ref_put(dest_ref);
+	return ret;
+}
+
+/*
+ * Add or remove (if dest_ring_id==0) a request_key upcall intercept trap upon
+ * a container.  If _type points to the string "*", it matches all types.
+ */
+long keyctl_container_intercept(int containerfd,
+				const char *_type,
+				unsigned int ns_id,
+				key_serial_t dest_ring_id)
+{
+	struct container *c;
+	struct fd f;
+	char type[32] = "";
+	long ret;
+
+	if (containerfd < 0)	/* ns_id is unsigned, so needs no sign check */
+		return -EINVAL;
+	if (dest_ring_id && !_type)
+		return -EINVAL;
+
+	f = fdget(containerfd);
+	if (!f.file)
+		return -EBADF;
+	ret = -EINVAL;
+	if (!is_container_file(f.file))
+		goto out_fd;
+
+	c = f.file->private_data;
+
+	/* Find out what type we're dealing with (can be NULL to make removal
+	 * remove everything).
+	 */
+	if (_type) {
+		ret = key_get_type_from_user(type, _type, sizeof(type));
+		if (ret < 0)
+			goto out_fd;
+	}
+
+	/* TODO: Get the namespace to filter on */
+
+	/* We add a filter if a destination keyring has been specified. */
+	if (dest_ring_id) {
+		ret = key_create_intercept(c, type, dest_ring_id);
+	} else {
+		ret = key_del_intercept(c, _type ? type : NULL);
+	}
+
+out_fd:
+	fdput(f);
+	return ret;
+}
+
+/*
+ * Queue a construction record if we can find a handler.
+ *
+ * Returns -EAGAIN if no handler matches, in which case the caller should fall
+ * back to /sbin/request-key.  Otherwise the authentication key is linked into
+ * the handler's keyring and the result of key_link() is returned.
+ */
+int queue_request_key(struct key *authkey)
+{
+	struct container *c = current->container;
+	struct request_key_intercept *rki;
+	struct request_key_auth *rka = get_request_key_auth(authkey);
+	struct key *service_keyring;
+	struct key *key = rka->target_key;
+	int ret;
+
+	kenter("%p,%d,%d", c, key_serial(authkey), key_serial(key));
+
+	if (list_empty(&c->req_key_traps)) {
+		kleave(" = -EAGAIN [e]");
+		return -EAGAIN;
+	}
+
+	spin_lock(&c->lock);
+
+	list_for_each_entry(rki, &c->req_key_traps, link) {
+		if (strcmp(rki->type, "*") == 0 ||
+		    strcmp(rki->type, key->type->name) == 0)
+			goto found_match;
+	}
+
+	spin_unlock(&c->lock);
+	kleave(" = -EAGAIN [n]");
+	return -EAGAIN;
+
+found_match:
+	service_keyring = key_get(rki->dest_keyring);
+	kdebug("- match %d", key_serial(service_keyring));
+	spin_unlock(&c->lock);
+
+	/* We add the authentication key to the keyring for the service daemon
+	 * to collect.  This can be detected by means of a watch on the service
+	 * keyring.
+	 */
+	ret = key_link(service_keyring, authkey);
+	key_put(service_keyring);
+	kleave(" = %d", ret);
+	return ret;
+}
diff --git a/security/keys/internal.h b/security/keys/internal.h
index 14c5b8ad5bd6..e98fca465146 100644
--- a/security/keys/internal.h
+++ b/security/keys/internal.h
@@ -93,6 +93,7 @@ extern wait_queue_head_t request_key_conswq;
 extern void key_set_index_key(struct keyring_index_key *index_key);
 extern struct key_type *key_type_lookup(const char *type);
 extern void key_type_put(struct key_type *ktype);
+extern int key_get_type_from_user(char *, const char __user *, unsigned);
 
 extern int __key_link_begin(struct key *keyring,
 			    const struct keyring_index_key *index_key,
@@ -180,6 +181,11 @@ extern void key_gc_keytype(struct key_type *ktype);
 extern int key_task_permission(const key_ref_t key_ref,
 			       const struct cred *cred,
 			       key_perm_t perm);
+#ifdef CONFIG_CONTAINERS
+extern int queue_request_key(struct key *);
+#else
+static inline int queue_request_key(struct key *authkey) { return -EAGAIN; }
+#endif
 
 static inline void notify_key(struct key *key,
 			      enum key_notification_subtype subtype, u32 aux)
@@ -354,6 +360,10 @@ static inline long keyctl_watch_key(key_serial_t key_id, int watch_fd, int watch
 }
 #endif
 
+#ifdef CONFIG_CONTAINERS
+extern long keyctl_container_intercept(int, const char __user *, unsigned int, key_serial_t);
+#endif
+
 /*
  * Debugging key validation
  */
diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index 94b99a52b4e5..38ff33431f33 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -30,9 +30,9 @@
 
 #define KEY_MAX_DESC_SIZE 4096
 
-static int key_get_type_from_user(char *type,
-				  const char __user *_type,
-				  unsigned len)
+int key_get_type_from_user(char *type,
+			   const char __user *_type,
+			   unsigned len)
 {
 	int ret;
 
@@ -1857,6 +1857,14 @@ SYSCALL_DEFINE5(keyctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case KEYCTL_WATCH_KEY:
 		return keyctl_watch_key((key_serial_t)arg2, (int)arg3, (int)arg4);
 
+#ifdef CONFIG_CONTAINERS
+	case KEYCTL_CONTAINER_INTERCEPT:
+		return keyctl_container_intercept((int)arg2,
+						  (const char __user *)arg3,
+						  (unsigned int)arg4,
+						  (key_serial_t)arg5);
+#endif
+
 	default:
 		return -EOPNOTSUPP;
 	}
diff --git a/security/keys/request_key.c b/security/keys/request_key.c
index edfabf20bdbb..078767564283 100644
--- a/security/keys/request_key.c
+++ b/security/keys/request_key.c
@@ -17,6 +17,7 @@
 #include <linux/err.h>
 #include <linux/keyctl.h>
 #include <linux/slab.h>
+#include <linux/init_task.h>
 #include <net/net_namespace.h>
 #include "internal.h"
 #include <keys/request_key_auth-type.h>
@@ -91,11 +92,11 @@ static int call_usermodehelper_keys(const char *path, char **argv, char **envp,
  * Request userspace finish the construction of a key
  * - execute "/sbin/request-key <op> <key> <uid> <gid> <keyring> <keyring> <keyring>"
  */
-static int call_sbin_request_key(struct key *authkey, void *aux)
+static int call_sbin_request_key(struct key *authkey)
 {
 	static char const request_key[] = "/sbin/request-key";
 	struct request_key_auth *rka = get_request_key_auth(authkey);
-	const struct cred *cred = current_cred();
+	const struct cred *cred = rka->cred;
 	key_serial_t prkey, sskey;
 	struct key *key = rka->target_key, *keyring, *session;
 	char *argv[9], *envp[3], uid_str[12], gid_str[12];
@@ -203,7 +204,6 @@ static int construct_key(struct key *key, const void *callout_info,
 			 size_t callout_len, void *aux,
 			 struct key *dest_keyring)
 {
-	request_key_actor_t actor;
 	struct key *authkey;
 	int ret;
 
@@ -216,11 +216,13 @@ static int construct_key(struct key *key, const void *callout_info,
 		return PTR_ERR(authkey);
 
 	/* Make the call */
-	actor = call_sbin_request_key;
-	if (key->type->request_key)
-		actor = key->type->request_key;
-
-	ret = actor(authkey, aux);
+	if (key->type->request_key) {
+		ret = key->type->request_key(authkey, aux);
+	} else {
+		ret = queue_request_key(authkey);
+		if (ret == -EAGAIN)
+			ret = call_sbin_request_key(authkey);
+	}
 
 	/* check that the actor called complete_request_key() prior to
 	 * returning an error */
diff --git a/security/keys/request_key_auth.c b/security/keys/request_key_auth.c
index afc304e8b61e..cd75173cadad 100644
--- a/security/keys/request_key_auth.c
+++ b/security/keys/request_key_auth.c
@@ -123,6 +123,10 @@ static void free_request_key_auth(struct request_key_auth *rka)
 {
 	if (!rka)
 		return;
+
+	if (rka->target_key->state == KEY_IS_UNINSTANTIATED)
+		key_reject_and_link(rka->target_key, 0, -ENOKEY, NULL, NULL);
+
 	key_put(rka->target_key);
 	key_put(rka->dest_keyring);
 	if (rka->cred)
@@ -184,7 +188,7 @@ struct key *request_key_auth_new(struct key *target, const char *op,
 			goto error_free_rka;
 		}
 
-		irka = cred->request_key_auth->payload.data[0];
+		irka = get_request_key_auth(cred->request_key_auth);
 		rka->cred = get_cred(irka->cred);
 		rka->pid = irka->pid;
 



* [RFC PATCH 13/27] keys: Provide a keyctl to query a request_key authentication key
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (11 preceding siblings ...)
  2019-02-15 16:09 ` [RFC PATCH 12/27] containers: Allow a daemon to intercept request_key upcalls in a container David Howells
@ 2019-02-15 16:09 ` David Howells
  2019-02-15 16:09 ` [RFC PATCH 14/27] keys: Break bits out of key_unlink() David Howells
                   ` (16 subsequent siblings)
  29 siblings, 0 replies; 61+ messages in thread
From: David Howells @ 2019-02-15 16:09 UTC (permalink / raw)
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

Provide a keyctl to query a request_key authentication key for situations
where this information isn't passed on the command line (such as where the
authentication key is placed in a queue instead of /sbin/request-key being
invoked):

	struct keyctl_query_request_key_auth {
		char		operation[32];
		uid_t		fsuid;
		gid_t		fsgid;
		key_serial_t	target_key;
		key_serial_t	thread_keyring;
		key_serial_t	process_keyring;
		key_serial_t	session_keyring;
		__u64		spare[1];
	};

	keyctl(KEYCTL_QUERY_REQUEST_KEY_AUTH,
	       key_serial_t key,
	       struct keyctl_query_request_key_auth *data);
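
To show how a service daemon might use the result, here is a sketch that
turns the queried fields back into the argument string that
/sbin/request-key would otherwise have been given ("<op> <key> <uid> <gid>
<keyring> <keyring> <keyring>").  struct query_result is a local mirror of
the structure above using fixed-width types, and format_request() is a
hypothetical helper invented for this example; neither is provided by the
patch.

```c
#include <stdint.h>
#include <stdio.h>

/* Local mirror of struct keyctl_query_request_key_auth (an assumption made
 * for illustration; a real daemon would use the UAPI header instead).
 */
struct query_result {
	char     operation[32];
	uint32_t fsuid;
	uint32_t fsgid;
	uint32_t target_key;
	uint32_t thread_keyring;
	uint32_t process_keyring;
	uint32_t session_keyring;
	uint64_t spare[1];
};

/*
 * Format "<op> <key> <uid> <gid> <thread> <process> <session>" into buf.
 * Returns the string length, or -1 if the buffer is too small.
 */
static int format_request(const struct query_result *q, char *buf, size_t len)
{
	int n = snprintf(buf, len, "%.31s %u %u %u %u %u %u",
			 q->operation,
			 (unsigned int)q->target_key,
			 (unsigned int)q->fsuid,
			 (unsigned int)q->fsgid,
			 (unsigned int)q->thread_keyring,
			 (unsigned int)q->process_keyring,
			 (unsigned int)q->session_keyring);

	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}
```

The "%.31s" precision guards against an operation field that is not
NUL-terminated, though the kernel side of this patch zeroes the structure
before filling it in.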

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/uapi/linux/keyctl.h |   12 ++++++++++++
 security/keys/compat.c      |    2 ++
 security/keys/container.c   |   42 ++++++++++++++++++++++++++++++++++++++++++
 security/keys/internal.h    |    2 ++
 security/keys/keyctl.c      |    4 ++++
 5 files changed, 62 insertions(+)

diff --git a/include/uapi/linux/keyctl.h b/include/uapi/linux/keyctl.h
index 85e8fef89bba..bb075ad1827d 100644
--- a/include/uapi/linux/keyctl.h
+++ b/include/uapi/linux/keyctl.h
@@ -69,6 +69,7 @@
 #define KEYCTL_RESTRICT_KEYRING		29	/* Restrict keys allowed to link to a keyring */
 #define KEYCTL_WATCH_KEY		30	/* Watch a key or ring of keys for changes */
 #define KEYCTL_CONTAINER_INTERCEPT	31	/* Intercept upcalls inside a container */
+#define KEYCTL_QUERY_REQUEST_KEY_AUTH	32	/* Query a request_key_auth key */
 
 /* keyctl structures */
 struct keyctl_dh_params {
@@ -114,4 +115,15 @@ struct keyctl_pkey_params {
 	__u32		__spare[7];
 };
 
+struct keyctl_query_request_key_auth {
+	char		operation[32];	/* Operation name, typically "create" */
+	uid_t		fsuid;		/* UID of requester */
+	gid_t		fsgid;		/* GID of requester */
+	__u32		target_key;	/* The key being instantiated */
+	__u32		thread_keyring;	/* The requester's thread keyring */
+	__u32		process_keyring; /* The requester's process keyring */
+	__u32		session_keyring; /* The requester's session keyring */
+	__u64		spare[1];
+};
+
 #endif /*  _LINUX_KEYCTL_H */
diff --git a/security/keys/compat.c b/security/keys/compat.c
index 6420881e5ce7..30055fc2b629 100644
--- a/security/keys/compat.c
+++ b/security/keys/compat.c
@@ -164,6 +164,8 @@ COMPAT_SYSCALL_DEFINE5(keyctl, u32, option,
 #ifdef CONFIG_CONTAINERS
 	case KEYCTL_CONTAINER_INTERCEPT:
 		return keyctl_container_intercept(arg2, compat_ptr(arg3), arg4, arg5);
+	case KEYCTL_QUERY_REQUEST_KEY_AUTH:
+		return keyctl_query_request_key_auth(arg2, compat_ptr(arg3));
 #endif
 
 	default:
diff --git a/security/keys/container.c b/security/keys/container.c
index c61c43658f3b..115998e867cd 100644
--- a/security/keys/container.c
+++ b/security/keys/container.c
@@ -225,3 +225,45 @@ int queue_request_key(struct key *authkey)
 	kleave(" = %d", ret);
 	return ret;
 }
+
+/*
+ * Query information about a request_key_auth key.
+ */
+long keyctl_query_request_key_auth(key_serial_t auth_id,
+				   struct keyctl_query_request_key_auth __user *_data)
+{
+	struct keyctl_query_request_key_auth data;
+	struct request_key_auth *rka;
+	struct key *session;
+	key_ref_t authkey_ref;
+
+	if (auth_id <= 0 || !_data)
+		return -EINVAL;
+
+	authkey_ref = lookup_user_key(auth_id, 0, KEY_NEED_SEARCH);
+	if (IS_ERR(authkey_ref))
+		return PTR_ERR(authkey_ref);
+	rka = get_request_key_auth(key_ref_to_ptr(authkey_ref));
+
+	memset(&data, 0, sizeof(data));
+	strlcpy(data.operation, rka->op, sizeof(data.operation));
+	data.fsuid = from_kuid(current_user_ns(), rka->cred->fsuid);
+	data.fsgid = from_kgid(current_user_ns(), rka->cred->fsgid);
+	data.target_key = rka->target_key->serial;
+	data.thread_keyring = key_serial(rka->cred->thread_keyring);
+	data.process_keyring = key_serial(rka->cred->process_keyring);
+
+	rcu_read_lock();
+	session = rcu_dereference(rka->cred->session_keyring);
+	if (!session)
+		session = rka->cred->user->session_keyring;
+	data.session_keyring = key_serial(session);
+	rcu_read_unlock();
+
+	key_ref_put(authkey_ref);
+
+	if (copy_to_user(_data, &data, sizeof(data)))
+		return -EFAULT;
+
+	return 0;
+}
diff --git a/security/keys/internal.h b/security/keys/internal.h
index e98fca465146..9f2a6ce67d15 100644
--- a/security/keys/internal.h
+++ b/security/keys/internal.h
@@ -362,6 +362,8 @@ static inline long keyctl_watch_key(key_serial_t key_id, int watch_fd, int watch
 
 #ifdef CONFIG_CONTAINERS
 extern long keyctl_container_intercept(int, const char __user *, unsigned int, key_serial_t);
+extern long keyctl_query_request_key_auth(key_serial_t,
+					  struct keyctl_query_request_key_auth __user *);
 #endif
 
 /*
diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index 38ff33431f33..a19efc60944d 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -1863,6 +1863,10 @@ SYSCALL_DEFINE5(keyctl, int, option, unsigned long, arg2, unsigned long, arg3,
 						  (const char __user *)arg3,
 						  (unsigned int)arg4,
 						  (key_serial_t)arg5);
+	case KEYCTL_QUERY_REQUEST_KEY_AUTH:
+		return keyctl_query_request_key_auth(
+			(key_serial_t)arg2,
+			(struct keyctl_query_request_key_auth __user *)arg3);
 #endif
 
 	default:



* [RFC PATCH 14/27] keys: Break bits out of key_unlink()
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (12 preceding siblings ...)
  2019-02-15 16:09 ` [RFC PATCH 13/27] keys: Provide a keyctl to query a request_key authentication key David Howells
@ 2019-02-15 16:09 ` David Howells
  2019-02-15 16:09 ` [RFC PATCH 15/27] keys: Make __key_link_begin() handle lockdep nesting David Howells
                   ` (15 subsequent siblings)
  29 siblings, 0 replies; 61+ messages in thread
From: David Howells @ 2019-02-15 16:09 UTC (permalink / raw)
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

Break bits out of key_unlink() into helper functions so that they can be
used in implementing key_move().
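
The resulting begin/apply/end split can be modelled in userspace roughly as
follows.  This is only a sketch of the pattern: struct ring, struct
ring_edit and the ring_unlink_*() functions are hypothetical stand-ins for
the keyring, struct assoc_array_edit and the __key_unlink*() helpers, and a
plain flag stands in for keyring->sem.

```c
#include <stddef.h>

#define RING_MAX 8

/* Hypothetical stand-in for a keyring: a lock flag plus a flat array. */
struct ring {
	int    locked;		/* Models keyring->sem held for write */
	int    keys[RING_MAX];
	size_t nr;
};

/* Hypothetical stand-in for struct assoc_array_edit. */
struct ring_edit {
	size_t index;		/* Slot that will be removed on apply */
	int    pending;		/* Non-zero until applied or cancelled */
};

/* Phase 1: take the lock and prepare the edit.  The only phase that can
 * fail; on failure the lock is dropped again.
 */
static int ring_unlink_begin(struct ring *r, int key, struct ring_edit *e)
{
	size_t i;

	r->locked = 1;
	for (i = 0; i < r->nr; i++) {
		if (r->keys[i] == key) {
			e->index = i;
			e->pending = 1;
			return 0;
		}
	}
	r->locked = 0;
	return -1;		/* Models -ENOENT */
}

/* Phase 2: commit the prepared edit.  Cannot fail. */
static void ring_unlink_apply(struct ring *r, struct ring_edit *e)
{
	size_t i;

	for (i = e->index; i + 1 < r->nr; i++)
		r->keys[i] = r->keys[i + 1];
	r->nr--;
	e->pending = 0;
}

/* Phase 3: cancel the edit if it was never applied, then drop the lock. */
static void ring_unlink_end(struct ring *r, struct ring_edit *e)
{
	e->pending = 0;
	r->locked = 0;
}
```

Keeping the fallible step separate from the commit step means a caller that
has to change two keyrings can complete every step that may fail before it
applies either change, which is presumably what key_move() needs.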

Signed-off-by: David Howells <dhowells@redhat.com>
---

 security/keys/keyring.c |   89 +++++++++++++++++++++++++++++++++++------------
 1 file changed, 66 insertions(+), 23 deletions(-)

diff --git a/security/keys/keyring.c b/security/keys/keyring.c
index 062cad635edf..431094c6cd74 100644
--- a/security/keys/keyring.c
+++ b/security/keys/keyring.c
@@ -1409,6 +1409,66 @@ int key_link(struct key *keyring, struct key *key)
 }
 EXPORT_SYMBOL(key_link);
 
+/*
+ * Begin the process of unlinking a key from a keyring.
+ */
+static int __key_unlink_begin(struct key *keyring, unsigned int lock_nesting,
+			      struct key *key, struct assoc_array_edit **_edit)
+	__acquires(&keyring->sem)
+{
+	struct assoc_array_edit *edit;
+	int ret;
+
+	if (keyring->type != &key_type_keyring)
+		return -ENOTDIR;
+
+	down_write_nested(&keyring->sem, lock_nesting);
+
+	edit = assoc_array_delete(&keyring->keys, &keyring_assoc_array_ops,
+				  &key->index_key);
+	if (IS_ERR(edit)) {
+		ret = PTR_ERR(edit);
+		goto error;
+	}
+
+	if (!edit) {
+		ret = -ENOENT;
+		goto error;
+	}
+
+	*_edit = edit;
+	return 0;
+
+error:
+	up_write(&keyring->sem);
+	return ret;
+}
+
+/*
+ * Apply an unlink change.
+ */
+static void __key_unlink(struct key *keyring, struct key *key,
+			 struct assoc_array_edit **_edit)
+{
+	assoc_array_apply_edit(*_edit);
+	*_edit = NULL;
+	notify_key(keyring, NOTIFY_KEY_UNLINKED, key_serial(key));
+	key_payload_reserve(keyring, keyring->datalen - KEYQUOTA_LINK_BYTES);
+}
+
+/*
+ * Finish unlinking a key from a keyring.
+ */
+static void __key_unlink_end(struct key *keyring,
+			     struct key *key,
+			     struct assoc_array_edit *edit)
+	__releases(&keyring->sem)
+{
+	if (edit)
+		assoc_array_cancel_edit(edit);
+	up_write(&keyring->sem);
+}
+
 /**
  * key_unlink - Unlink the first link to a key from a keyring.
  * @keyring: The keyring to remove the link from.
@@ -1429,35 +1489,18 @@ EXPORT_SYMBOL(key_link);
 int key_unlink(struct key *keyring, struct key *key)
 {
 	struct assoc_array_edit *edit;
-	key_serial_t target = key_serial(key);
 	int ret;
 
 	key_check(keyring);
 	key_check(key);
 
-	if (keyring->type != &key_type_keyring)
-		return -ENOTDIR;
-
-	down_write(&keyring->sem);
-
-	edit = assoc_array_delete(&keyring->keys, &keyring_assoc_array_ops,
-				  &key->index_key);
-	if (IS_ERR(edit)) {
-		ret = PTR_ERR(edit);
-		goto error;
-	}
-	ret = -ENOENT;
-	if (edit == NULL)
-		goto error;
-
-	assoc_array_apply_edit(edit);
-	notify_key(keyring, NOTIFY_KEY_UNLINKED, target);
-	key_payload_reserve(keyring, keyring->datalen - KEYQUOTA_LINK_BYTES);
-	ret = 0;
+	ret = __key_unlink_begin(keyring, 0, key, &edit);
+	if (ret < 0)
+		return ret;
 
-error:
-	up_write(&keyring->sem);
-	return ret;
+	__key_unlink(keyring, key, &edit);
+	__key_unlink_end(keyring, key, edit);
+	return 0;
 }
 EXPORT_SYMBOL(key_unlink);
 



* [RFC PATCH 15/27] keys: Make __key_link_begin() handle lockdep nesting
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (13 preceding siblings ...)
  2019-02-15 16:09 ` [RFC PATCH 14/27] keys: Break bits out of key_unlink() David Howells
@ 2019-02-15 16:09 ` David Howells
  2019-02-15 16:09 ` [RFC PATCH 16/27] keys: Grant Link permission to possessers of request_key auth keys David Howells
                   ` (14 subsequent siblings)
  29 siblings, 0 replies; 61+ messages in thread
From: David Howells @ 2019-02-15 16:09 UTC (permalink / raw)
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

Make __key_link_begin() handle lockdep nesting for the implementation of
key_move() where we have to lock two keyrings.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 security/keys/internal.h    |    2 +-
 security/keys/key.c         |    6 +++---
 security/keys/keyring.c     |    6 +++---
 security/keys/request_key.c |    2 +-
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/security/keys/internal.h b/security/keys/internal.h
index 9f2a6ce67d15..40846657aebd 100644
--- a/security/keys/internal.h
+++ b/security/keys/internal.h
@@ -95,7 +95,7 @@ extern struct key_type *key_type_lookup(const char *type);
 extern void key_type_put(struct key_type *ktype);
 extern int key_get_type_from_user(char *, const char __user *, unsigned);
 
-extern int __key_link_begin(struct key *keyring,
+extern int __key_link_begin(struct key *keyring, unsigned int lock_nesting,
 			    const struct keyring_index_key *index_key,
 			    struct assoc_array_edit **_edit);
 extern int __key_link_check_live_key(struct key *keyring, struct key *key);
diff --git a/security/keys/key.c b/security/keys/key.c
index 2c60d6bcf8a3..63513ffcf2e8 100644
--- a/security/keys/key.c
+++ b/security/keys/key.c
@@ -518,7 +518,7 @@ int key_instantiate_and_link(struct key *key,
 	}
 
 	if (keyring) {
-		ret = __key_link_begin(keyring, &key->index_key, &edit);
+		ret = __key_link_begin(keyring, 0, &key->index_key, &edit);
 		if (ret < 0)
 			goto error;
 
@@ -586,7 +586,7 @@ int key_reject_and_link(struct key *key,
 		if (keyring->restrict_link)
 			return -EPERM;
 
-		link_ret = __key_link_begin(keyring, &key->index_key, &edit);
+		link_ret = __key_link_begin(keyring, 0, &key->index_key, &edit);
 	}
 
 	mutex_lock(&key_construction_mutex);
@@ -866,7 +866,7 @@ key_ref_t key_create_or_update(key_ref_t keyring_ref,
 	index_key.desc_len = strlen(index_key.description);
 	key_set_index_key(&index_key);
 
-	ret = __key_link_begin(keyring, &index_key, &edit);
+	ret = __key_link_begin(keyring, 0, &index_key, &edit);
 	if (ret < 0) {
 		key_ref = ERR_PTR(ret);
 		goto error_free_prep;
diff --git a/security/keys/keyring.c b/security/keys/keyring.c
index 431094c6cd74..1334ed97e530 100644
--- a/security/keys/keyring.c
+++ b/security/keys/keyring.c
@@ -1227,7 +1227,7 @@ static int keyring_detect_cycle(struct key *A, struct key *B)
 /*
  * Preallocate memory so that a key can be linked into to a keyring.
  */
-int __key_link_begin(struct key *keyring,
+int __key_link_begin(struct key *keyring, unsigned int lock_nesting,
 		     const struct keyring_index_key *index_key,
 		     struct assoc_array_edit **_edit)
 	__acquires(&keyring->sem)
@@ -1244,7 +1244,7 @@ int __key_link_begin(struct key *keyring,
 	if (keyring->type != &key_type_keyring)
 		return -ENOTDIR;
 
-	down_write(&keyring->sem);
+	down_write_nested(&keyring->sem, lock_nesting);
 
 	ret = -EKEYREVOKED;
 	if (test_bit(KEY_FLAG_REVOKED, &keyring->flags))
@@ -1393,7 +1393,7 @@ int key_link(struct key *keyring, struct key *key)
 	key_check(keyring);
 	key_check(key);
 
-	ret = __key_link_begin(keyring, &key->index_key, &edit);
+	ret = __key_link_begin(keyring, 0, &key->index_key, &edit);
 	if (ret == 0) {
 		kdebug("begun {%d,%d}", keyring->serial, refcount_read(&keyring->usage));
 		ret = __key_link_check_restriction(keyring, key);
diff --git a/security/keys/request_key.c b/security/keys/request_key.c
index 078767564283..ab1f6de9e623 100644
--- a/security/keys/request_key.c
+++ b/security/keys/request_key.c
@@ -375,7 +375,7 @@ static int construct_alloc_key(struct keyring_search_context *ctx,
 	set_bit(KEY_FLAG_USER_CONSTRUCT, &key->flags);
 
 	if (dest_keyring) {
-		ret = __key_link_begin(dest_keyring, &ctx->index_key, &edit);
+		ret = __key_link_begin(dest_keyring, 0, &ctx->index_key, &edit);
 		if (ret < 0)
 			goto link_prealloc_failed;
 	}



* [RFC PATCH 16/27] keys: Grant Link permission to possessers of request_key auth keys
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (14 preceding siblings ...)
  2019-02-15 16:09 ` [RFC PATCH 15/27] keys: Make __key_link_begin() handle lockdep nesting David Howells
@ 2019-02-15 16:09 ` David Howells
  2019-02-15 16:10 ` [RFC PATCH 17/27] keys: Add a keyctl to move a key between keyrings David Howells
                   ` (13 subsequent siblings)
  29 siblings, 0 replies; 61+ messages in thread
From: David Howells @ 2019-02-15 16:09 UTC (permalink / raw)
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

Grant Link permission to the possessors of request_key authentication keys.
This allows a daemon that is servicing upcalls to arrange things such that
only the necessary auth key is passed to the actual service program, rather
than all of the daemon's pending auth keys.
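
As a rough illustration of what the one-line change does to the permission
mask, the bits can be worked out in userspace.  This is only a sketch: the
KEY_* values are copied from include/linux/key.h so that it builds without
kernel headers, and the authkey_perm_old(), authkey_perm_new() and
possessor_may_link() helpers are hypothetical names, not kernel functions.

```c
#include <stdint.h>

/* Permission bits, copied from include/linux/key.h for illustration. */
#define KEY_POS_VIEW	0x01000000u	/* possessor can view attributes */
#define KEY_POS_READ	0x02000000u	/* possessor can read the payload */
#define KEY_POS_SEARCH	0x08000000u	/* possessor can find the key in a search */
#define KEY_POS_LINK	0x10000000u	/* possessor can link the key into a keyring */
#define KEY_USR_VIEW	0x00010000u	/* owning user can view attributes */

/* Hypothetical helper: the mask given to auth keys before this patch. */
static uint32_t authkey_perm_old(void)
{
	return KEY_POS_VIEW | KEY_POS_READ | KEY_POS_SEARCH | KEY_USR_VIEW;
}

/* Hypothetical helper: the mask given to auth keys after this patch. */
static uint32_t authkey_perm_new(void)
{
	return authkey_perm_old() | KEY_POS_LINK;
}

/* Hypothetical helper: does the mask let a possessor link the key? */
static int possessor_may_link(uint32_t perm)
{
	return (perm & KEY_POS_LINK) != 0;
}
```

Under these assumptions, only the new mask lets the possessor link the auth
key somewhere else, which is what a daemon needs in order to hand a single
auth key on to the service program.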

Signed-off-by: David Howells <dhowells@redhat.com>
---

 security/keys/request_key_auth.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/security/keys/request_key_auth.c b/security/keys/request_key_auth.c
index cd75173cadad..726555a0639c 100644
--- a/security/keys/request_key_auth.c
+++ b/security/keys/request_key_auth.c
@@ -208,7 +208,7 @@ struct key *request_key_auth_new(struct key *target, const char *op,
 
 	authkey = key_alloc(&key_type_request_key_auth, desc,
 			    cred->fsuid, cred->fsgid, cred,
-			    KEY_POS_VIEW | KEY_POS_READ | KEY_POS_SEARCH |
+			    KEY_POS_VIEW | KEY_POS_READ | KEY_POS_SEARCH | KEY_POS_LINK |
 			    KEY_USR_VIEW, KEY_ALLOC_NOT_IN_QUOTA, NULL);
 	if (IS_ERR(authkey)) {
 		ret = PTR_ERR(authkey);



* [RFC PATCH 17/27] keys: Add a keyctl to move a key between keyrings
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (15 preceding siblings ...)
  2019-02-15 16:09 ` [RFC PATCH 16/27] keys: Grant Link permission to possessers of request_key auth keys David Howells
@ 2019-02-15 16:10 ` David Howells
  2019-02-15 16:10 ` [RFC PATCH 18/27] keys: Find the least-recently used unseen key in a keyring David Howells
                   ` (12 subsequent siblings)
  29 siblings, 0 replies; 61+ messages in thread
From: David Howells @ 2019-02-15 16:10 UTC (permalink / raw)
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

Add a keyctl to atomically move a link to a key from one keyring to
another.  The key must exist in the "from" keyring, and a flag can be given
to cause the operation to fail if there's a matching key already in the
"to" keyring.

This can be done with:

	keyctl(KEYCTL_MOVE,
	       key_serial_t key,
	       key_serial_t from_keyring,
	       key_serial_t to_keyring,
	       unsigned int flags);

The key being moved must grant Link permission and both keyrings must grant
Write permission.

flags should be 0 or KEYCTL_MOVE_EXCL, with the latter preventing
displacement of a matching key from the "to" keyring.
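
A minimal userspace sketch of a wrapper for this call might look as follows.
It assumes the KEYCTL_MOVE number (33) and the KEYCTL_MOVE_EXCL value
proposed in this patch, which may differ in other kernels.  move_key() and
move_flags_valid() are hypothetical helper names.  The flag check mirrors
the one the kernel side performs, so that an obviously bad call fails before
reaching the syscall.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for syscall() */
#endif
#include <errno.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#define KEYCTL_MOVE		33		/* value proposed by this patch */
#define KEYCTL_MOVE_EXCL	0x00000001	/* do not displace from the to-keyring */

typedef int32_t key_serial_t;

/* Hypothetical helper: only 0 or KEYCTL_MOVE_EXCL is acceptable. */
static int move_flags_valid(unsigned int flags)
{
	return (flags & ~KEYCTL_MOVE_EXCL) == 0;
}

/*
 * Hypothetical helper: move the link to @key from @from to @to.
 * Returns 0 on success or -1 with errno set.
 */
static long move_key(key_serial_t key, key_serial_t from, key_serial_t to,
		     unsigned int flags)
{
	if (!move_flags_valid(flags)) {
		errno = EINVAL;
		return -1;
	}

	/* All five keyctl arguments are passed explicitly, flags included. */
	return syscall(__NR_keyctl, (long)KEYCTL_MOVE,
		       (long)key, (long)from, (long)to, (long)flags);
}
```

A caller that must not displace an existing key would pass KEYCTL_MOVE_EXCL
and, going by the description above, should expect EEXIST if the "to"
keyring already holds a matching key.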

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/key.h         |    5 ++
 include/uapi/linux/keyctl.h |    3 +
 security/keys/compat.c      |    3 +
 security/keys/internal.h    |    1 
 security/keys/keyctl.c      |   55 +++++++++++++++++++++++++++
 security/keys/keyring.c     |   88 +++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 155 insertions(+)

diff --git a/include/linux/key.h b/include/linux/key.h
index 82eb1b8d6336..165f842ec042 100644
--- a/include/linux/key.h
+++ b/include/linux/key.h
@@ -335,6 +335,11 @@ extern int key_update(key_ref_t key,
 extern int key_link(struct key *keyring,
 		    struct key *key);
 
+extern int key_move(struct key *key,
+		    struct key *from_keyring,
+		    struct key *to_keyring,
+		    unsigned int flags);
+
 extern int key_unlink(struct key *keyring,
 		      struct key *key);
 
diff --git a/include/uapi/linux/keyctl.h b/include/uapi/linux/keyctl.h
index bb075ad1827d..425bbd9612c4 100644
--- a/include/uapi/linux/keyctl.h
+++ b/include/uapi/linux/keyctl.h
@@ -70,6 +70,7 @@
 #define KEYCTL_WATCH_KEY		30	/* Watch a key or ring of keys for changes */
 #define KEYCTL_CONTAINER_INTERCEPT	31	/* Intercept upcalls inside a container */
 #define KEYCTL_QUERY_REQUEST_KEY_AUTH	32	/* Query a request_key_auth key */
+#define KEYCTL_MOVE			33	/* Move keys between keyrings */
 
 /* keyctl structures */
 struct keyctl_dh_params {
@@ -126,4 +127,6 @@ struct keyctl_query_request_key_auth {
 	__u64		spare[1];
 };
 
+#define KEYCTL_MOVE_EXCL	0x00000001 /* Do not displace from the to-keyring */
+
 #endif /*  _LINUX_KEYCTL_H */
diff --git a/security/keys/compat.c b/security/keys/compat.c
index 30055fc2b629..ed36efa13c48 100644
--- a/security/keys/compat.c
+++ b/security/keys/compat.c
@@ -168,6 +168,9 @@ COMPAT_SYSCALL_DEFINE5(keyctl, u32, option,
 		return keyctl_query_request_key_auth(arg2, compat_ptr(arg3));
 #endif
 
+	case KEYCTL_MOVE:
+		return keyctl_keyring_move(arg2, arg3, arg4, arg5);
+
 	default:
 		return -EOPNOTSUPP;
 	}
diff --git a/security/keys/internal.h b/security/keys/internal.h
index 40846657aebd..bad4a8038a99 100644
--- a/security/keys/internal.h
+++ b/security/keys/internal.h
@@ -242,6 +242,7 @@ extern long keyctl_update_key(key_serial_t, const void __user *, size_t);
 extern long keyctl_revoke_key(key_serial_t);
 extern long keyctl_keyring_clear(key_serial_t);
 extern long keyctl_keyring_link(key_serial_t, key_serial_t);
+extern long keyctl_keyring_move(key_serial_t, key_serial_t, key_serial_t, unsigned int);
 extern long keyctl_keyring_unlink(key_serial_t, key_serial_t);
 extern long keyctl_describe_key(key_serial_t, char __user *, size_t);
 extern long keyctl_keyring_search(key_serial_t, const char __user *,
diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index a19efc60944d..6057b810c611 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -572,6 +572,55 @@ long keyctl_keyring_unlink(key_serial_t id, key_serial_t ringid)
 	return ret;
 }
 
+/*
+ * Move a link to a key from one keyring to another, displacing any matching
+ * key from the destination keyring.
+ *
+ * The key must grant the caller Link permission and both keyrings must grant
+ * the caller Write permission.  There must also be a link in the from keyring
+ * to the key.  If both keyrings are the same, nothing is done.
+ *
+ * If successful, 0 will be returned.
+ */
+long keyctl_keyring_move(key_serial_t id, key_serial_t from_ringid,
+			 key_serial_t to_ringid, unsigned int flags)
+{
+	key_ref_t key_ref, from_ref, to_ref;
+	long ret;
+
+	if (flags & ~KEYCTL_MOVE_EXCL)
+		return -EINVAL;
+
+	key_ref = lookup_user_key(id, KEY_LOOKUP_CREATE, KEY_NEED_LINK);
+	if (IS_ERR(key_ref)) {
+		ret = PTR_ERR(key_ref);
+		goto error;
+	}
+
+	from_ref = lookup_user_key(from_ringid, 0, KEY_NEED_WRITE);
+	if (IS_ERR(from_ref)) {
+		ret = PTR_ERR(from_ref);
+		goto error2;
+	}
+
+	to_ref = lookup_user_key(to_ringid, KEY_LOOKUP_CREATE, KEY_NEED_WRITE);
+	if (IS_ERR(to_ref)) {
+		ret = PTR_ERR(to_ref);
+		goto error3;
+	}
+
+	ret = key_move(key_ref_to_ptr(key_ref), key_ref_to_ptr(from_ref),
+		       key_ref_to_ptr(to_ref), flags);
+
+	key_ref_put(to_ref);
+error3:
+	key_ref_put(from_ref);
+error2:
+	key_ref_put(key_ref);
+error:
+	return ret;
+}
+
 /*
  * Return a description of a key to userspace.
  *
@@ -1869,6 +1918,12 @@ SYSCALL_DEFINE5(keyctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			(struct keyctl_query_request_key_auth __user *)arg3);
 #endif
 
+	case KEYCTL_MOVE:
+		return keyctl_keyring_move((key_serial_t)arg2,
+					   (key_serial_t)arg3,
+					   (key_serial_t)arg4,
+					   (unsigned int)arg5);
+
 	default:
 		return -EOPNOTSUPP;
 	}
diff --git a/security/keys/keyring.c b/security/keys/keyring.c
index 1334ed97e530..14df79814ea0 100644
--- a/security/keys/keyring.c
+++ b/security/keys/keyring.c
@@ -1504,6 +1504,94 @@ int key_unlink(struct key *keyring, struct key *key)
 }
 EXPORT_SYMBOL(key_unlink);
 
+/**
+ * key_move - Move a key from one keyring to another
+ * @key: The key to move
+ * @from_keyring: The keyring to remove the link from.
+ * @to_keyring: The keyring to make the link in.
+ * @flags: Qualifying flags, such as KEYCTL_MOVE_EXCL.
+ *
+ * Make a link in @to_keyring to a key, such that the keyring holds a reference
+ * on that key and the key can potentially be found by searching that keyring
+ * whilst simultaneously removing a link to the key from @from_keyring.
+ *
+ * This function will write-lock both keyrings' semaphores and will consume
+ * some of the user's key data quota to hold the link on @to_keyring.
+ *
+ * Returns 0 if successful, -ENOTDIR if either keyring isn't a keyring,
+ * -EKEYREVOKED if either keyring has been revoked, -ENFILE if the second
+ * keyring is full, -EDQUOT if there is insufficient key data quota remaining
+ * to add another link or -ENOMEM if there's insufficient memory.  If
+ * KEYCTL_MOVE_EXCL is set, then -EEXIST will be returned if there's already a
+ * matching key in @to_keyring.
+ *
+ * It is assumed that the caller has checked that it is permitted for a link to
+ * be made (the keyring should have Write permission and the key Link
+ * permission).
+ */
+int key_move(struct key *key,
+	     struct key *from_keyring,
+	     struct key *to_keyring,
+	     unsigned int flags)
+{
+	struct assoc_array_edit *from_edit, *to_edit;
+	int ret;
+
+	kenter("%d,%d,%d", key->serial, from_keyring->serial, to_keyring->serial);
+
+	if (from_keyring == to_keyring)
+		return 0;
+
+	key_check(key);
+	key_check(from_keyring);
+	key_check(to_keyring);
+
+	/* We have to be very careful here to take the keyring locks in the
+	 * right order, lest we open ourselves to deadlocking against another
+	 * move operation.
+	 */
+	if (from_keyring < to_keyring) {
+		ret = __key_unlink_begin(from_keyring, 0, key, &from_edit);
+		if (ret < 0)
+			goto out;
+		ret = __key_link_begin(to_keyring, 1, &key->index_key, &to_edit);
+		if (ret < 0) {
+			assoc_array_cancel_edit(from_edit);
+			goto out;
+		}
+	} else {
+		ret = __key_link_begin(to_keyring, 0, &key->index_key, &to_edit);
+		if (ret < 0)
+			goto out;
+		ret = __key_unlink_begin(from_keyring, 1, key, &from_edit);
+		if (ret < 0) {
+			__key_link_end(to_keyring, &key->index_key, to_edit);
+			goto out;
+		}
+	}
+
+	ret = -EEXIST;
+	if (to_edit->dead_leaf && (flags & KEYCTL_MOVE_EXCL))
+		goto error;
+
+	ret = __key_link_check_restriction(to_keyring, key);
+	if (ret < 0)
+		goto error;
+	ret = __key_link_check_live_key(to_keyring, key);
+	if (ret < 0)
+		goto error;
+
+	__key_unlink(from_keyring, key, &from_edit);
+	__key_link(to_keyring, key, &to_edit);
+error:
+	__key_unlink_end(from_keyring, key, from_edit);
+	__key_link_end(to_keyring, &key->index_key, to_edit);
+out:
+	kleave(" = %d", ret);
+	return ret;
+}
+EXPORT_SYMBOL(key_move);
+
 /**
  * keyring_clear - Clear a keyring
  * @keyring: The keyring to clear.



* [RFC PATCH 18/27] keys: Find the least-recently used unseen key in a keyring.
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (16 preceding siblings ...)
  2019-02-15 16:10 ` [RFC PATCH 17/27] keys: Add a keyctl to move a key between keyrings David Howells
@ 2019-02-15 16:10 ` David Howells
  2019-02-15 16:10 ` [RFC PATCH 19/27] containers: Sample: request_key upcall handling David Howells
                   ` (11 subsequent siblings)
  29 siblings, 0 replies; 61+ messages in thread
From: David Howells @ 2019-02-15 16:10 UTC (permalink / raw)
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

Provide a keyctl by which the oldest "unseen" key in a keyring can be
found.  The "unseenness" is controlled by a flag on the key, so is shared
across all keyrings that might link to a key.  The flag is only set by this
keyctl.  The keyctl looks like:

	key = keyctl_find_lru(key_serial_t keyring,
			      const char *type_name)

It searches the nominated keyring subtree for a valid key of the specified
type and returns its serial number or -ENOKEY if no valid, unseen keys are
found.
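
The selection rule can be modelled in plain C as below.  This is a
simplified sketch, not the kernel code: struct model_key and
model_find_lru() are hypothetical stand-ins, the keyring is a flat array
rather than a keyring subtree, there is no type matching, locking or retry
loop, and -1 stands in for -ENOKEY.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the parts of struct key that matter here. */
struct model_key {
	int32_t	serial;		/* key serial number */
	int64_t	last_used_at;	/* time of last use */
	bool	seen;		/* models KEY_FLAG_SEEN */
	bool	valid;		/* models key_validate() succeeding */
};

/*
 * Return the serial of the least-recently used valid key that has not yet
 * been returned by this function, marking it as seen.  Returns -1 if there
 * is no such key.
 */
static int32_t model_find_lru(struct model_key *keys, size_t nr_keys)
{
	struct model_key *candidate = NULL;
	int64_t oldest = INT64_MAX;

	for (size_t i = 0; i < nr_keys; i++) {
		if (!keys[i].valid || keys[i].seen)
			continue;
		if (keys[i].last_used_at < oldest) {
			oldest = keys[i].last_used_at;
			candidate = &keys[i];
		}
	}

	if (!candidate)
		return -1;

	candidate->seen = true;
	return candidate->serial;
}
```

Calling the model repeatedly hands back each key once, oldest first, and
then reports that nothing unseen is left.  That is the behaviour a daemon
would rely on to recover after a notification buffer overrun.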

This is primarily intended for use with ".request_key_auth"-type keys in
container upcall management.  Ordinarily, it should be possible to just
pick the serial numbers out of the notification records from when an auth
key gets added to the upcall keyring, but if the buffer gets overrun, then
some other means must be employed.

[!] I'm not sure I need to do the "unseen" check at all.  This call is only
    really needed if there's a notification buffer overrun.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/key.h         |    1 
 include/uapi/linux/keyctl.h |    1 
 security/keys/compat.c      |    2 +
 security/keys/container.c   |  106 +++++++++++++++++++++++++++++++++++++++++++
 security/keys/internal.h    |    1 
 security/keys/keyctl.c      |    3 +
 6 files changed, 114 insertions(+)

diff --git a/include/linux/key.h b/include/linux/key.h
index 165f842ec042..de190036512b 100644
--- a/include/linux/key.h
+++ b/include/linux/key.h
@@ -219,6 +219,7 @@ struct key {
 #define KEY_FLAG_KEEP		8	/* set if key should not be removed */
 #define KEY_FLAG_UID_KEYRING	9	/* set if key is a user or user session keyring */
 #define KEY_FLAG_SET_WATCH_PROXY 10	/* Set if watch_proxy should be set on added keys */
+#define KEY_FLAG_SEEN		11	/* Set if returned by keyctl_find_lru() */
 
 	/* the key type and key description string
 	 * - the desc is used to match a key against search criteria
diff --git a/include/uapi/linux/keyctl.h b/include/uapi/linux/keyctl.h
index 425bbd9612c4..5b792303a05b 100644
--- a/include/uapi/linux/keyctl.h
+++ b/include/uapi/linux/keyctl.h
@@ -71,6 +71,7 @@
 #define KEYCTL_CONTAINER_INTERCEPT	31	/* Intercept upcalls inside a container */
 #define KEYCTL_QUERY_REQUEST_KEY_AUTH	32	/* Query a request_key_auth key */
 #define KEYCTL_MOVE			33	/* Move keys between keyrings */
+#define KEYCTL_FIND_LRU			34	/* Find the least-recently used key in a keyring */
 
 /* keyctl structures */
 struct keyctl_dh_params {
diff --git a/security/keys/compat.c b/security/keys/compat.c
index ed36efa13c48..160fb7b37352 100644
--- a/security/keys/compat.c
+++ b/security/keys/compat.c
@@ -166,6 +166,8 @@ COMPAT_SYSCALL_DEFINE5(keyctl, u32, option,
 		return keyctl_container_intercept(arg2, compat_ptr(arg3), arg4, arg5);
 	case KEYCTL_QUERY_REQUEST_KEY_AUTH:
 		return keyctl_query_request_key_auth(arg2, compat_ptr(arg3));
+	case KEYCTL_FIND_LRU:
+		return keyctl_find_lru(arg2, compat_ptr(arg3));
 #endif
 
 	case KEYCTL_MOVE:
diff --git a/security/keys/container.c b/security/keys/container.c
index 115998e867cd..8e6b3c8710e2 100644
--- a/security/keys/container.c
+++ b/security/keys/container.c
@@ -267,3 +267,109 @@ long keyctl_query_request_key_auth(key_serial_t auth_id,
 
 	return 0;
 }
+
+struct key_lru_search_state {
+	struct key	*candidate;
+	time64_t	oldest;
+};
+
+/*
+ * Iterate over all the keys in the keyring looking for the one with the oldest
+ * timestamp.
+ */
+static bool cmp_lru(const struct key *key,
+			   const struct key_match_data *match_data)
+{
+	struct key_lru_search_state *state = (void *)match_data->raw_data;
+	time64_t t;
+
+	t = READ_ONCE(key->last_used_at);
+	if (state->oldest > t && !test_bit(KEY_FLAG_SEEN, &key->flags)) {
+		state->oldest = t;
+		state->candidate = (struct key *)key;
+	}
+
+	return false;
+}
+
+/*
+ * Find the oldest key in a keyring of a particular type.
+ */
+long keyctl_find_lru(key_serial_t _keyring, const char __user *type_name)
+{
+	struct key_lru_search_state state;
+	struct keyring_search_context ctx = {
+		.index_key.description	= NULL,
+		.cred			= current_cred(),
+		.match_data.cmp		= cmp_lru,
+		.match_data.raw_data	= &state,
+		.match_data.lookup_type	= KEYRING_SEARCH_LOOKUP_ITERATE,
+		.flags			= KEYRING_SEARCH_DO_STATE_CHECK,
+	};
+	struct key_type *ktype;
+	struct key *key;
+	key_ref_t keyring_ref, ref;
+	char type[32];
+	int ret, max_iter = 10;
+
+	if (!_keyring || !type_name)
+		return -EINVAL;
+
+	/* We want to allow special types, such as ".request_key_auth" */
+	ret = strncpy_from_user(type, type_name, sizeof(type));
+	if (ret < 0)
+		return ret;
+	if (ret == 0 || ret >= sizeof(type))
+		return -EINVAL;
+	type[ret] = '\0';
+
+	keyring_ref = lookup_user_key(_keyring, 0, KEY_NEED_SEARCH);
+	if (IS_ERR(keyring_ref))
+		return PTR_ERR(keyring_ref);
+
+	if (strcmp(type, key_type_request_key_auth.name) == 0) {
+		ktype = &key_type_request_key_auth;
+	} else {
+		ktype = key_type_lookup(type);
+		if (IS_ERR(ktype)) {
+			ret = PTR_ERR(ktype);
+			goto error_ring;
+		}
+	}
+
+	ctx.index_key.type = ktype;
+
+	do {
+		state.oldest = TIME64_MAX;
+		state.candidate = NULL;
+
+		rcu_read_lock();
+
+		/* Scan the keyring.  We expect this to end in -EAGAIN as we
+		 * can't generate a result until the entire scan is completed.
+		 */
+		ret = -EAGAIN;
+		ref = keyring_search_aux(keyring_ref, &ctx);
+
+		key = state.candidate;
+		if (key &&
+		    !test_and_set_bit(KEY_FLAG_SEEN, &key->flags) &&
+		    key_validate(key) == 0) {
+			ret = key->serial;
+			goto error_unlock;
+		}
+
+
+		rcu_read_unlock();
+	} while (--max_iter > 0);
+	goto error_type;
+
+error_unlock:
+	rcu_read_unlock();
+error_type:
+	if (ktype != &key_type_request_key_auth)
+		key_type_put(ktype);
+error_ring:
+	key_ref_put(keyring_ref);
+	return ret;
+}
diff --git a/security/keys/internal.h b/security/keys/internal.h
index bad4a8038a99..fe4a4da1ff17 100644
--- a/security/keys/internal.h
+++ b/security/keys/internal.h
@@ -365,6 +365,7 @@ static inline long keyctl_watch_key(key_serial_t key_id, int watch_fd, int watch
 extern long keyctl_container_intercept(int, const char __user *, unsigned int, key_serial_t);
 extern long keyctl_query_request_key_auth(key_serial_t,
 					  struct keyctl_query_request_key_auth __user *);
+extern long keyctl_find_lru(key_serial_t, const char __user *);
 #endif
 
 /*
diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index 6057b810c611..1446bc52e369 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -1916,6 +1916,9 @@ SYSCALL_DEFINE5(keyctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		return keyctl_query_request_key_auth(
 			(key_serial_t)arg2,
 			(struct keyctl_query_request_key_auth __user *)arg3);
+	case KEYCTL_FIND_LRU:
+		return keyctl_find_lru((key_serial_t)arg2,
+				       (const char __user *)arg3);
 #endif
 
 	case KEYCTL_MOVE:



* [RFC PATCH 19/27] containers: Sample: request_key upcall handling
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (17 preceding siblings ...)
  2019-02-15 16:10 ` [RFC PATCH 18/27] keys: Find the least-recently used unseen key in a keyring David Howells
@ 2019-02-15 16:10 ` David Howells
  2019-02-15 16:10 ` [RFC PATCH 20/27] container, keys: Add a container keyring David Howells
                   ` (10 subsequent siblings)
  29 siblings, 0 replies; 61+ messages in thread
From: David Howells @ 2019-02-15 16:10 UTC (permalink / raw)
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

Implement sample upcall handling.

Firstly, the test-container sample is modified to (a) create a staging
keyring and to (b) intercept request_key calls for user-type keys inside
the container and place the authentication keys into that rather than
invoking /sbin/request-key.

Secondly, a test-upcall sample is added that will monitor the keyring for
notifications and spawn an /sbin/request-key instance for each key added.
This is run as:

	./test-upcall

to find a keyring called "upcall" in the session keyring (as created by the
./test-container program) and listen for additions to that, or it can be
run as:

	./test-upcall <keyring-id>

to listen on a specific keyring.

Note that the test-upcall sample is designed to be run separately from
test-container so that its stdout can be observed.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 samples/vfs/Makefile         |    6 +
 samples/vfs/test-container.c |   16 +++
 samples/vfs/test-upcall.c    |  243 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 264 insertions(+), 1 deletion(-)
 create mode 100644 samples/vfs/test-upcall.c

diff --git a/samples/vfs/Makefile b/samples/vfs/Makefile
index 25420919ee40..a8e9e1142ae3 100644
--- a/samples/vfs/Makefile
+++ b/samples/vfs/Makefile
@@ -5,7 +5,8 @@ hostprogs-$(CONFIG_SAMPLE_VFS) := \
 	test-fsmount \
 	test-mntinfo \
 	test-statx \
-	test-container
+	test-container \
+	test-upcall
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -18,5 +19,8 @@ HOSTLDLIBS_test-mntinfo += -lm
 HOSTCFLAGS_test-fs-query.o += -I$(objtree)/usr/include
 HOSTCFLAGS_test-fsmount.o += -I$(objtree)/usr/include
 HOSTCFLAGS_test-statx.o += -I$(objtree)/usr/include
+
 HOSTCFLAGS_test-container.o += -I$(objtree)/usr/include
 HOSTLDLIBS_test-container += -lkeyutils
+HOSTCFLAGS_test-upcall.o += -I$(objtree)/usr/include
+HOSTLDLIBS_test-upcall += -lkeyutils
diff --git a/samples/vfs/test-container.c b/samples/vfs/test-container.c
index 44ff57afb5a4..7dc9071399b2 100644
--- a/samples/vfs/test-container.c
+++ b/samples/vfs/test-container.c
@@ -20,6 +20,8 @@
 #include <sys/stat.h>
 #include <keyutils.h>
 
+#define KEYCTL_CONTAINER_INTERCEPT	31	/* Intercept upcalls inside a container */
+
 /* Hope -1 isn't a syscall */
 #ifndef __NR_fsopen
 #define __NR_fsopen -1
@@ -187,6 +189,7 @@ void container_init(void)
  */
 int main(int argc, char *argv[])
 {
+	key_serial_t keyring;
 	pid_t pid;
 	int fsfd, mfd, cfd, ws;
 
@@ -259,6 +262,19 @@ int main(int argc, char *argv[])
 	E(close(fsfd));
 	E(close(mfd));
 
+	/* Create a keyring to catch upcalls. */
+	printf("Intercepting...\n");
+	keyring = add_key("keyring", "upcall", NULL, 0, KEY_SPEC_SESSION_KEYRING);
+	if (keyring == -1) {
+		perror("add_key/u");
+		exit(1);
+	}
+
+	if (keyctl(KEYCTL_CONTAINER_INTERCEPT, cfd, "user", 0, keyring) < 0) {
+		perror("keyctl_container_intercept");
+		exit(1);
+	}
+
 	/* Start the 'init' process. */
 	printf("Forking...\n");
 	switch ((pid = fork_into_container(cfd))) {
diff --git a/samples/vfs/test-upcall.c b/samples/vfs/test-upcall.c
new file mode 100644
index 000000000000..225fa0325d1b
--- /dev/null
+++ b/samples/vfs/test-upcall.c
@@ -0,0 +1,243 @@
+/* Container keyring upcall management test.
+ *
+ * Copyright (C) 2019 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <errno.h>
+#include <poll.h>
+#include <sys/mman.h>
+#include <sys/ioctl.h>
+#include <keyutils.h>
+#include <linux/watch_queue.h>
+
+#define KEYCTL_WATCH_KEY		30	/* Watch a key or ring of keys for changes */
+#define KEYCTL_QUERY_REQUEST_KEY_AUTH	32	/* Query a request_key_auth key */
+#define KEYCTL_MOVE			33	/* Move keys between keyrings */
+#define KEYCTL_FIND_LRU			34	/* Find the least-recently used key in a keyring */
+
+struct keyctl_query_request_key_auth {
+	char		operation[32];	/* Operation name, typically "create" */
+	uid_t		fsuid;		/* UID of requester */
+	gid_t		fsgid;		/* GID of requester */
+	key_serial_t	target_key;	/* The key being instantiated */
+	key_serial_t	thread_keyring;	/* The requester's thread keyring */
+	key_serial_t	process_keyring; /* The requester's process keyring */
+	key_serial_t	session_keyring; /* The requester's session keyring */
+	long long	spare[1];
+};
+
+static void process_request(key_serial_t keyring, key_serial_t key)
+{
+	struct keyctl_query_request_key_auth info;
+	char target[32], uid[32], gid[32], thread[32], process[32], session[32];
+	void *callout;
+	long len;
+
+#if 0
+	key = keyctl(KEYCTL_FIND_LRU, keyring, ".request_key_auth");
+	if (key == -1) {
+		perror("keyctl/find");
+		exit(1);
+	}
+#endif
+
+	if (keyctl(KEYCTL_QUERY_REQUEST_KEY_AUTH, key, &info) == -1) {
+		perror("keyctl/query");
+		exit(1);
+	}
+
+	len = keyctl_read_alloc(key, &callout);
+	if (len == -1) {
+		perror("keyctl/read");
+		exit(1);
+	}
+
+	sprintf(target, "%d", info.target_key);
+	sprintf(uid, "%d", info.fsuid);
+	sprintf(gid, "%d", info.fsgid);
+	sprintf(thread, "%d", info.thread_keyring);
+	sprintf(process, "%d", info.process_keyring);
+	sprintf(session, "%d", info.session_keyring);
+
+	printf("Authentication key %d\n", key);
+	printf("- %s %s\n", info.operation, target);
+	printf("- uid=%s gid=%s\n", uid, gid);
+	printf("- rings=%s,%s,%s\n", thread, process, session);
+	printf("- callout='%s'\n", (char *)callout);
+
+	switch (fork()) {
+	case 0:
+		/* Only pass the auth token of interest onto /sbin/request-key */
+		if (keyctl(KEYCTL_MOVE, key, keyring, KEY_SPEC_THREAD_KEYRING, 0) < 0) {
+			perror("keyctl_move/1");
+			exit(1);
+		}
+
+		if (keyctl_join_session_keyring(NULL) < 0) {
+			perror("keyctl_join");
+			exit(1);
+		}
+
+		if (keyctl(KEYCTL_MOVE, key,
+			   KEY_SPEC_THREAD_KEYRING, KEY_SPEC_SESSION_KEYRING, 0) < 0) {
+			perror("keyctl_move/2");
+			exit(1);
+		}
+
+		execl("/sbin/request-key",
+		      "request-key", info.operation, target, uid, gid, thread, process, session,
+		      NULL);
+		perror("execve");
+		exit(1);
+
+	case -1:
+		perror("fork");
+		exit(1);
+
+	default:
+		return;
+	}
+}
+
+/*
+ * We saw a change on the keyring.
+ */
+static void saw_key_change(struct watch_notification *n)
+{
+	struct key_notification *k = (struct key_notification *)n;
+	unsigned int len = n->info & WATCH_INFO_LENGTH;
+
+	if (len != sizeof(struct key_notification))
+		return;
+
+	printf("KEY %d change=%u aux=%d\n", k->key_id, n->subtype, k->aux);
+
+	process_request(k->key_id, k->aux);
+}
+
+/*
+ * Consume and display events.
+ */
+static int consumer(int fd, struct watch_queue_buffer *buf)
+{
+	struct watch_notification *n;
+	struct pollfd p[1];
+	unsigned int head, tail, mask = buf->meta.mask;
+
+	for (;;) {
+		p[0].fd = fd;
+		p[0].events = POLLIN | POLLERR;
+		p[0].revents = 0;
+
+		if (poll(p, 1, -1) == -1) {
+			perror("poll");
+			break;
+		}
+
+		printf("ptrs h=%x t=%x m=%x\n",
+		       buf->meta.head, buf->meta.tail, buf->meta.mask);
+
+		while (head = __atomic_load_n(&buf->meta.head, __ATOMIC_ACQUIRE),
+		       tail = buf->meta.tail,
+		       tail != head
+		       ) {
+			n = &buf->slots[tail & mask];
+			printf("NOTIFY[%08x-%08x] ty=%04x sy=%04x i=%08x\n",
+			       head, tail, n->type, n->subtype, n->info);
+			if ((n->info & WATCH_INFO_LENGTH) == 0)
+				goto out;
+
+			switch (n->type) {
+			case WATCH_TYPE_META:
+				if (n->subtype == WATCH_META_REMOVAL_NOTIFICATION)
+					printf("REMOVAL of watchpoint %08x\n",
+					       n->info & WATCH_INFO_ID);
+				break;
+			case WATCH_TYPE_KEY_NOTIFY:
+				saw_key_change(n);
+				break;
+			}
+
+			tail += (n->info & WATCH_INFO_LENGTH) >> WATCH_LENGTH_SHIFT;
+			__atomic_store_n(&buf->meta.tail, tail, __ATOMIC_RELEASE);
+		}
+	}
+
+out:
+	return 0;
+}
+
+/*
+ * We're only interested in key insertion events.
+ */
+static struct watch_notification_filter filter = {
+	.nr_filters	= 1,
+	.filters = {
+		[0] = {
+			.type			= WATCH_TYPE_KEY_NOTIFY,
+			.subtype_filter[0]	= (1 << NOTIFY_KEY_LINKED),
+		},
+	}
+};
+
+int main(int argc, char *argv[])
+{
+	struct watch_queue_buffer *buf;
+	key_serial_t keyring;
+	size_t page_size = sysconf(_SC_PAGESIZE);
+	int fd;
+
+	if (argc == 1) {
+		keyring = keyctl_search(KEY_SPEC_SESSION_KEYRING, "keyring",
+					"upcall", 0);
+		if (keyring == -1) {
+			perror("keyctl_search");
+			exit(1);
+		}
+	} else if (argc == 2) {
+		keyring = strtoul(argv[1], NULL, 0);
+	} else {
+		fprintf(stderr, "Format: test-upcall [<keyring>]\n");
+		exit(2);
+	}
+
+	/* Create a watch on the keyring to detect the addition of keys. */
+	fd = open("/dev/watch_queue", O_RDWR | O_CLOEXEC);
+	if (fd == -1) {
+		perror("/dev/watch_queue");
+		exit(1);
+	}
+
+	if (ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, 1) == -1) {
+		perror("/dev/watch_queue(size)");
+		exit(1);
+	}
+
+	if (ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter) == -1) {
+		perror("/dev/watch_queue(filter)");
+		exit(1);
+	}
+
+	buf = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	if (buf == MAP_FAILED) {
+		perror("mmap");
+		exit(1);
+	}
+
+	if (keyctl(KEYCTL_WATCH_KEY, keyring, fd, 0x01) == -1) {
+		perror("keyctl");
+		exit(1);
+	}
+
+	return consumer(fd, buf);
+}



* [RFC PATCH 20/27] container, keys: Add a container keyring
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (18 preceding siblings ...)
  2019-02-15 16:10 ` [RFC PATCH 19/27] containers: Sample: request_key upcall handling David Howells
@ 2019-02-15 16:10 ` David Howells
  2019-02-15 21:46   ` Eric Biggers
  2019-02-15 16:11 ` [RFC PATCH 21/27] keys: Fix request_key() lack of Link perm check on found key David Howells
                   ` (9 subsequent siblings)
  29 siblings, 1 reply; 61+ messages in thread
From: David Howells @ 2019-02-15 16:10 UTC (permalink / raw)
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

Allow a container manager to attach keyrings to a container such that the
keys contained therein are searched by request_key() in addition to a
process's normal keyrings.  This allows the manager to install keys to
support filesystem decryption and authentication for superblocks inside the
container without requiring processes inside the container to play any
active role.

So, for example, a container could be created, a keyring added and then an
rxrpc-type key added to the keyring such that a container's root filesystem
and data filesystems can be brought in from secure AFS volumes.  It would
also be possible to put filesystem crypto keys in there such that Ext4
encrypted files could be decrypted - without the need to share the key
with other containers or let the key leak into the container.

Because the container manager retains control of the keyring, it can update
the contained keys as necessary to prevent expiration.  Note that the
keyring and keys in the keyring must grant Search permission directly to
the container object.

[!] Note that NFS, CIFS and other filesystems wishing to make use of this
    would each have to get the token to use by calling request_key() on
    entry to its VFS methods and retain it in its file struct.

[!] Note that request_key() called from userspace does not look in the
    container keyring.

[!] Note that keys are now tagged with a tag that identifies the network
    namespace (or other domain of operation).  This allows keys to be
    provided in one keyring that allow the same thing but in different
    network namespaces.

The keyring should be created by the container manager and then set using:

	keyctl(KEYCTL_SET_CONTAINER_KEYRING, int containerfd,
	       key_serial_t keyring);

With this, request_key() inside the kernel searches:

	thread-keyring, process-keyring, session-keyring, container-keyring
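
The effect of that ordering can be modelled with a toy search such as the
one below.  This is a sketch only: enum model_ring and model_request_key()
are hypothetical names, each keyring is reduced to "the serial of a matching
key, or 0", and permission checks and namespace tags are left out entirely.

```c
#include <stdint.h>

/* Hypothetical model of the keyrings searched, in search order. */
enum model_ring {
	MODEL_THREAD,
	MODEL_PROCESS,
	MODEL_SESSION,
	MODEL_CONTAINER,
	MODEL_NR_RINGS
};

/*
 * matches[i] holds the serial of a matching key in keyring i, or 0 if that
 * keyring has no match.  The container keyring is consulted last, and only
 * if one has been attached to the container.  Returns the serial found, or
 * 0 if nothing matched.
 */
static int32_t model_request_key(const int32_t matches[MODEL_NR_RINGS],
				 int have_container_keyring)
{
	for (int i = 0; i < MODEL_NR_RINGS; i++) {
		if (i == MODEL_CONTAINER && !have_container_keyring)
			continue;
		if (matches[i])
			return matches[i];
	}
	return 0;
}
```

In this model, a key in the session keyring shadows a matching key in the
container keyring.  That is the sort of case where a first-or-last search
policy, as floated in the note below, would change the outcome.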

[!] It may be worth setting a flag on a mountpoint to indicate whether to
    search the container keyring first or last.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/container.h    |    1 +
 include/uapi/linux/keyctl.h  |    1 +
 kernel/container.c           |    1 +
 samples/vfs/test-container.c |   14 +++++++++++++
 security/keys/compat.c       |    2 ++
 security/keys/container.c    |   44 ++++++++++++++++++++++++++++++++++++++++++
 security/keys/internal.h     |    1 +
 security/keys/keyctl.c       |    2 ++
 security/keys/process_keys.c |   23 ++++++++++++++++++++++
 9 files changed, 89 insertions(+)

diff --git a/include/linux/container.h b/include/linux/container.h
index a8cac800ce75..7424f7fb5560 100644
--- a/include/linux/container.h
+++ b/include/linux/container.h
@@ -33,6 +33,7 @@ struct container {
 	refcount_t		usage;
 	int			exit_code;	/* The exit code of 'init' */
 	const struct cred	*cred;		/* Creds for this container, including userns */
+	struct key		*keyring;	/* Externally managed container keyring */
 	struct nsproxy		*ns;		/* This container's namespaces */
 	struct path		root;		/* The root of the container's fs namespace */
 	struct task_struct	*init;		/* The 'init' task for this container */
diff --git a/include/uapi/linux/keyctl.h b/include/uapi/linux/keyctl.h
index 5b792303a05b..a2afb4512f34 100644
--- a/include/uapi/linux/keyctl.h
+++ b/include/uapi/linux/keyctl.h
@@ -72,6 +72,7 @@
 #define KEYCTL_QUERY_REQUEST_KEY_AUTH	32	/* Query a request_key_auth key */
 #define KEYCTL_MOVE			33	/* Move keys between keyrings */
 #define KEYCTL_FIND_LRU			34	/* Find the least-recently used key in a keyring */
+#define KEYCTL_SET_CONTAINER_KEYRING	35	/* Attach a keyring to a container */
 
 /* keyctl structures */
 struct keyctl_dh_params {
diff --git a/kernel/container.c b/kernel/container.c
index 33e41fe5050b..f2706a45f364 100644
--- a/kernel/container.c
+++ b/kernel/container.c
@@ -71,6 +71,7 @@ void put_container(struct container *c)
 
 		if (c->cred)
 			put_cred(c->cred);
+		key_put(c->keyring);
 		security_container_free(c);
 		kfree(c);
 		c = parent;
diff --git a/samples/vfs/test-container.c b/samples/vfs/test-container.c
index 7dc9071399b2..e24048fdbe33 100644
--- a/samples/vfs/test-container.c
+++ b/samples/vfs/test-container.c
@@ -21,6 +21,7 @@
 #include <keyutils.h>
 
 #define KEYCTL_CONTAINER_INTERCEPT	31	/* Intercept upcalls inside a container */
+#define KEYCTL_SET_CONTAINER_KEYRING	35	/* Attach a keyring to a container */
 
 /* Hope -1 isn't a syscall */
 #ifndef __NR_fsopen
@@ -262,6 +263,19 @@ int main(int argc, char *argv[])
 	E(close(fsfd));
 	E(close(mfd));
 
+	/* Create a container keyring. */
+	printf("Container keyring...\n");
+	keyring = add_key("keyring", "_container", NULL, 0, KEY_SPEC_SESSION_KEYRING);
+	if (keyring == -1) {
+		perror("add_key/c");
+		exit(1);
+	}
+
+	if (keyctl(KEYCTL_SET_CONTAINER_KEYRING, cfd, keyring) < 0) {
+		perror("keyctl_set_container_keyring");
+		exit(1);
+	}
+
 	/* Create a keyring to catch upcalls. */
 	printf("Intercepting...\n");
 	keyring = add_key("keyring", "upcall", NULL, 0, KEY_SPEC_SESSION_KEYRING);
diff --git a/security/keys/compat.c b/security/keys/compat.c
index 160fb7b37352..7990ec026237 100644
--- a/security/keys/compat.c
+++ b/security/keys/compat.c
@@ -168,6 +168,8 @@ COMPAT_SYSCALL_DEFINE5(keyctl, u32, option,
 		return keyctl_query_request_key_auth(arg2, compat_ptr(arg3));
 	case KEYCTL_FIND_LRU:
 		return keyctl_find_lru(arg2, compat_ptr(arg3));
+	case KEYCTL_SET_CONTAINER_KEYRING:
+		return keyctl_set_container_keyring(arg2, arg3);
 #endif
 
 	case KEYCTL_MOVE:
diff --git a/security/keys/container.c b/security/keys/container.c
index 8e6b3c8710e2..720600f6a318 100644
--- a/security/keys/container.c
+++ b/security/keys/container.c
@@ -373,3 +373,47 @@ long keyctl_find_lru(key_serial_t _keyring, const char __user *type_name)
 	key_ref_put(keyring_ref);
 	return ret;
 }
+
+/*
+ * Attach a keyring to a container as the container key, to be searched by
+ * request_key() after thread, process and session keyrings.  This is only
+ * permitted once per container.
+ */
+long keyctl_set_container_keyring(int containerfd, key_serial_t _keyring)
+{
+	struct container *c;
+	struct fd f;
+	key_ref_t keyring_ref = NULL;
+	long ret;
+
+	if (containerfd < 0 || _keyring <= 0)
+		return -EINVAL;
+
+	f = fdget(containerfd);
+	if (!f.file)
+		return -EBADF;
+	ret = -EINVAL;
+	if (!is_container_file(f.file))
+		goto out_fd;
+
+	c = f.file->private_data;
+
+	keyring_ref = lookup_user_key(_keyring, 0, KEY_NEED_SEARCH);
+	if (IS_ERR(keyring_ref)) {
+		ret = PTR_ERR(keyring_ref);
+		goto out_fd;
+	}
+
+	ret = -EBUSY;
+	spin_lock(&c->lock);
+	if (!c->keyring) {
+		c->keyring = key_get(key_ref_to_ptr(keyring_ref));
+		ret = 0;
+	}
+	spin_unlock(&c->lock);
+
+	key_ref_put(keyring_ref);
+out_fd:
+	fdput(f);
+	return ret;
+}
diff --git a/security/keys/internal.h b/security/keys/internal.h
index fe4a4da1ff17..6be76caee874 100644
--- a/security/keys/internal.h
+++ b/security/keys/internal.h
@@ -366,6 +366,7 @@ extern long keyctl_container_intercept(int, const char __user *, unsigned int, k
 extern long keyctl_query_request_key_auth(key_serial_t,
 					  struct keyctl_query_request_key_auth __user *);
 extern long keyctl_find_lru(key_serial_t, const char __user *);
+extern long keyctl_set_container_keyring(int, key_serial_t);
 #endif
 
 /*
diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index 1446bc52e369..a25799249b8a 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -1919,6 +1919,8 @@ SYSCALL_DEFINE5(keyctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case KEYCTL_FIND_LRU:
 		return keyctl_find_lru((key_serial_t)arg2,
 				       (const char __user *)arg3);
+	case KEYCTL_SET_CONTAINER_KEYRING:
+		return keyctl_set_container_keyring((int)arg2, (key_serial_t)arg3);
 #endif
 
 	case KEYCTL_MOVE:
diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c
index 0e0b9ccad2f8..39d3cbac920c 100644
--- a/security/keys/process_keys.c
+++ b/security/keys/process_keys.c
@@ -17,6 +17,7 @@
 #include <linux/err.h>
 #include <linux/mutex.h>
 #include <linux/security.h>
+#include <linux/container.h>
 #include <linux/user_namespace.h>
 #include <linux/uaccess.h>
 #include <keys/request_key_auth-type.h>
@@ -433,6 +434,28 @@ key_ref_t search_my_process_keyrings(struct keyring_search_context *ctx)
 		}
 	}
 
+	/* Search any container keyring on the end. */
+#ifdef CONFIG_CONTAINERS
+	if (current->container->keyring) {
+		key_ref = keyring_search_aux(
+			make_key_ref(current->container->keyring, 1), ctx);
+		if (!IS_ERR(key_ref))
+			goto found;
+
+		switch (PTR_ERR(key_ref)) {
+		case -EAGAIN: /* no key */
+			if (ret)
+				break;
+		case -ENOKEY: /* negative key */
+			ret = key_ref;
+			break;
+		default:
+			err = key_ref;
+			break;
+		}
+	}
+#endif
+
 	/* no key - decide on the error we're going to go for */
 	key_ref = ret ? ret : err;
 



* [RFC PATCH 21/27] keys: Fix request_key() lack of Link perm check on found key
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (19 preceding siblings ...)
  2019-02-15 16:10 ` [RFC PATCH 20/27] container, keys: Add a container keyring David Howells
@ 2019-02-15 16:11 ` David Howells
  2019-02-15 16:11 ` [RFC PATCH 22/27] KEYS: Replace uid/gid/perm permissions checking with an ACL David Howells
                   ` (8 subsequent siblings)
  29 siblings, 0 replies; 61+ messages in thread
From: David Howells @ 2019-02-15 16:11 UTC (permalink / raw)
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

The request_key() syscall allows a process to gain access to the 'possessor'
permits of any key that grants it Search permission, because request_key()
does not check whether a key it finds grants Link permission to the caller.

Fix this by checking for Link permission on the found key when a destination
keyring is specified.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 security/keys/request_key.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/security/keys/request_key.c b/security/keys/request_key.c
index ab1f6de9e623..10244b6fbf5d 100644
--- a/security/keys/request_key.c
+++ b/security/keys/request_key.c
@@ -564,6 +564,16 @@ struct key *request_key_and_link(struct key_type *type,
 	key_ref = search_process_keyrings(&ctx);
 
 	if (!IS_ERR(key_ref)) {
+		if (dest_keyring) {
+			ret = key_task_permission(key_ref, current_cred(),
+						  KEY_NEED_LINK);
+			if (ret < 0) {
+				key_ref_put(key_ref);
+				key = ERR_PTR(ret);
+				goto error_free;
+			}
+		}
+
 		key = key_ref_to_ptr(key_ref);
 		if (dest_keyring) {
 			ret = key_link(dest_keyring, key);



* [RFC PATCH 22/27] KEYS: Replace uid/gid/perm permissions checking with an ACL
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (20 preceding siblings ...)
  2019-02-15 16:11 ` [RFC PATCH 21/27] keys: Fix request_key() lack of Link perm check on found key David Howells
@ 2019-02-15 16:11 ` David Howells
  2019-02-15 17:32   ` Stephen Smalley
  2019-02-15 17:39   ` David Howells
  2019-02-15 16:11 ` [RFC PATCH 23/27] KEYS: Provide KEYCTL_GRANT_PERMISSION David Howells
                   ` (7 subsequent siblings)
  29 siblings, 2 replies; 61+ messages in thread
From: David Howells @ 2019-02-15 16:11 UTC (permalink / raw)
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

Replace the uid/gid/perm permissions checking on a key with an ACL to allow
the SETATTR and SEARCH permissions to be split.  This will also allow a
greater range of subjects to be represented.

============
WHY DO THIS?
============

The problem is that SETATTR and SEARCH cover a slew of actions, not all of
which should be grouped together.

For SETATTR, this includes actions that are about controlling access to a
key:

 (1) Changing a key's ownership.

 (2) Changing a key's security information.

 (3) Setting a keyring's restriction.

And actions that are about managing a key's lifetime:

 (4) Setting an expiry time.

 (5) Revoking a key.

and (proposed) managing a key as part of a cache:

 (6) Invalidating a key.

Managing a key's lifetime doesn't really have anything to do with
controlling access to that key.

Expiry time is awkward since it's more about the lifetime of the content
and so, in some ways, goes better with WRITE permission.  It can, however,
be set unconditionally by a process with an appropriate authorisation token
for instantiating a key, and can also be set by the key type driver when a
key is instantiated, so lumping it with the access-controlling actions is
probably okay.

As for SEARCH permission, that currently covers:

 (1) Finding keys in a keyring tree during a search.

 (2) Permitting keyrings to be joined.

 (3) Invalidation.

But these don't really belong together either, since these actions really
need to be controlled separately.

Finally, there are a number of special cases to do with granting the
administrator special rights to invalidate or clear keys that I would like
to handle with the ACL rather than key flags and special checks.


===============
WHAT IS CHANGED
===============

The SETATTR permission is split to create two new permissions:

 (1) SET_SECURITY - which allows the key's owner, group and ACL to be
     changed and a restriction to be placed on a keyring.

 (2) REVOKE - which allows a key to be revoked.

The SEARCH permission is split to create:

 (1) SEARCH - which allows a keyring to be searched and a key to be found.

 (2) JOIN - which allows a keyring to be joined as a session keyring.

 (3) INVAL - which allows a key to be invalidated.

The WRITE permission is also split to create:

 (1) WRITE - which allows a key's content to be altered and links to be
     added, removed and replaced in a keyring.

 (2) CLEAR - which allows a keyring to be cleared completely.  This is
     split out to make it possible to give just this to an administrator.

 (3) REVOKE - see above.


Keys acquire ACLs which consist of a series of ACEs, and all that apply are
unioned together.  An ACE specifies a subject, such as:

 (*) Possessor - permitted to anyone who 'possesses' a key
 (*) Owner - permitted to the key owner
 (*) Group - permitted to the key group
 (*) Everyone - permitted to everyone

Note that 'Other' has been replaced with 'Everyone' on the assumption that
you wouldn't grant a permit to 'Other' that you wouldn't also grant to
everyone else.

Further subjects may be made available by later patches.

The ACE also specifies a permissions mask.  The set of permissions is now:

	VIEW		Can view the key metadata
	READ		Can read the key content
	WRITE		Can update/modify the key content
	SEARCH		Can find the key by searching/requesting
	LINK		Can make a link to the key
	SET_SECURITY	Can change owner, ACL, expiry
	INVAL		Can invalidate
	REVOKE		Can revoke
	JOIN		Can join this keyring
	CLEAR		Can clear this keyring
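
The subject matching and permission union described above can be sketched
in user space as follows.  This is an illustration only: the types and
function names are hypothetical, the caller is reduced to three booleans,
and the permission bit values are placeholders chosen for the sketch rather
than values taken from the uapi header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder permission bits for the sketch (values are assumptions). */
enum {
	P_VIEW		= 0x001,
	P_READ		= 0x002,
	P_WRITE		= 0x004,
	P_SEARCH	= 0x008,
	P_LINK		= 0x010,
	P_SET_SECURITY	= 0x020,
	P_INVAL		= 0x040,
	P_REVOKE	= 0x080,
	P_JOIN		= 0x100,
	P_CLEAR		= 0x200,
};

enum subject { SUBJ_POSSESSOR, SUBJ_OWNER, SUBJ_GROUP, SUBJ_EVERYONE };

/* One access control entry: a subject plus the permits it is granted. */
struct model_ace {
	enum subject	subject;
	unsigned int	perm;
};

/* How the caller relates to the key being checked. */
struct model_caller {
	bool possesses_key;
	bool is_owner;
	bool in_group;
};

static bool ace_applies(const struct model_ace *ace,
			const struct model_caller *who)
{
	switch (ace->subject) {
	case SUBJ_POSSESSOR:	return who->possesses_key;
	case SUBJ_OWNER:	return who->is_owner;
	case SUBJ_GROUP:	return who->in_group;
	case SUBJ_EVERYONE:	return true;
	}
	return false;
}

/* Union the permits of every ACE whose subject matches the caller. */
static unsigned int effective_perms(const struct model_ace *aces,
				    size_t nr_ace,
				    const struct model_caller *who)
{
	unsigned int perm = 0;
	size_t i;

	assert(who != NULL);
	assert(aces != NULL || nr_ace == 0);

	for (i = 0; i < nr_ace; i++)
		if (ace_applies(&aces[i], who))
			perm |= aces[i].perm;
	return perm;
}
```

Given an ACL with a Possessor ACE granting VIEW, SEARCH and READ and an
Owner ACE granting VIEW, a caller that is the owner but does not possess
the key gets VIEW only, whereas a possessor gets all three.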


The KEYCTL_SETPERM function is then deprecated.

The KEYCTL_SET_TIMEOUT function is then permitted if SET_SECURITY is set,
or if the caller has a valid instantiation auth token.

The KEYCTL_INVALIDATE function then requires INVAL.

The KEYCTL_REVOKE function then requires REVOKE.

The KEYCTL_JOIN_SESSION_KEYRING function then requires JOIN to join an
existing keyring.

The JOIN permission is enabled by default for session keyrings and manually
created keyrings only.


======================
BACKWARD COMPATIBILITY
======================

To maintain backward compatibility, KEYCTL_SETPERM will translate the
permissions mask it is given into a new ACL for a key - unless
KEYCTL_SET_ACL has been called on that key, in which case an error will be
returned.

It will convert possessor, owner, group and other permissions into separate
ACEs, if each portion of the mask is non-zero.

SETATTR permission turns on all of INVAL, REVOKE and SET_SECURITY.  WRITE
permission turns on WRITE, REVOKE and, if a keyring, CLEAR.  JOIN is turned
on if a keyring is being altered.
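
The translation described above can be sketched in user space as follows.
This is a model, not the kernel implementation: the function names are
hypothetical and the ACE_* bit values are placeholders.  The OLD_* values
and the per-subject shifts follow the existing KEY_OTH_*, KEY_GRP_*,
KEY_USR_* and KEY_POS_* masks.  The sketch also assumes that JOIN is turned
on for any non-empty portion of a keyring's mask, which is one reading of
the text.

```c
#include <assert.h>
#include <stdbool.h>

/* Legacy per-subject bits, as in the low byte of the old mask. */
#define OLD_VIEW	0x01
#define OLD_READ	0x02
#define OLD_WRITE	0x04
#define OLD_SEARCH	0x08
#define OLD_LINK	0x10
#define OLD_SETATTR	0x20

/* Placeholder ACE permission bits (values are assumptions). */
#define ACE_VIEW		0x001
#define ACE_READ		0x002
#define ACE_WRITE		0x004
#define ACE_SEARCH		0x008
#define ACE_LINK		0x010
#define ACE_SET_SECURITY	0x020
#define ACE_INVAL		0x040
#define ACE_REVOKE		0x080
#define ACE_JOIN		0x100
#define ACE_CLEAR		0x200

/* Extract one subject's 6-bit portion: shift 24/16/8/0 = pos/usr/grp/oth. */
static unsigned int legacy_portion(unsigned int mask, unsigned int shift)
{
	assert(shift == 0 || shift == 8 || shift == 16 || shift == 24);
	return (mask >> shift) & 0x3f;
}

/* Translate one legacy portion into an ACE permission mask. */
static unsigned int legacy_to_ace(unsigned int legacy, bool is_keyring)
{
	unsigned int perm = 0;

	legacy &= 0x3f;
	if (!legacy)
		return 0;	/* no ACE is created for an empty portion */

	if (legacy & OLD_VIEW)
		perm |= ACE_VIEW;
	if (legacy & OLD_READ)
		perm |= ACE_READ;
	if (legacy & OLD_SEARCH)
		perm |= ACE_SEARCH;
	if (legacy & OLD_LINK)
		perm |= ACE_LINK;
	if (legacy & OLD_WRITE) {
		perm |= ACE_WRITE | ACE_REVOKE;
		if (is_keyring)
			perm |= ACE_CLEAR;
	}
	if (legacy & OLD_SETATTR)
		perm |= ACE_SET_SECURITY | ACE_INVAL | ACE_REVOKE;
	if (is_keyring)
		perm |= ACE_JOIN;	/* assumption: see lead-in */

	return perm;
}
```

For a legacy mask of 0x3f010000 on a non-keyring key, the possessor portion
(shift 24) is 0x3f and the owner portion (shift 16) is 0x01, so the owner
ACE would carry VIEW only.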

The KEYCTL_DESCRIBE function translates the ACL back into a permissions
mask to return depending on possessor, owner, group and everyone ACEs.

It will make the following mappings:

 (1) INVAL, JOIN -> SEARCH

 (2) SET_SECURITY -> SETATTR

 (3) REVOKE -> WRITE if SETATTR isn't already set

 (4) CLEAR -> WRITE

Note that the value subsequently returned by KEYCTL_DESCRIBE may not match
the value set with KEYCTL_SETPERM.
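
The reverse mappings listed above can be sketched in user space as follows.
As before, this is a model rather than the kernel code: ace_to_legacy() is
a hypothetical name and the ACE_* bit values are placeholders, while the
OLD_* values follow the existing per-subject legacy bits.

```c
#include <assert.h>

/* Legacy per-subject bits, as in the low byte of the old mask. */
#define OLD_VIEW	0x01
#define OLD_READ	0x02
#define OLD_WRITE	0x04
#define OLD_SEARCH	0x08
#define OLD_LINK	0x10
#define OLD_SETATTR	0x20

/* Placeholder ACE permission bits (values are assumptions). */
#define ACE_VIEW		0x001
#define ACE_READ		0x002
#define ACE_WRITE		0x004
#define ACE_SEARCH		0x008
#define ACE_LINK		0x010
#define ACE_SET_SECURITY	0x020
#define ACE_INVAL		0x040
#define ACE_REVOKE		0x080
#define ACE_JOIN		0x100
#define ACE_CLEAR		0x200
#define ACE_ALL			0x3ff

/* Fold one ACE permission mask back into a 6-bit legacy portion. */
static unsigned int ace_to_legacy(unsigned int perm)
{
	unsigned int legacy = 0;

	assert((perm & ~ACE_ALL) == 0);

	if (perm & ACE_VIEW)
		legacy |= OLD_VIEW;
	if (perm & ACE_READ)
		legacy |= OLD_READ;
	if (perm & (ACE_WRITE | ACE_CLEAR))
		legacy |= OLD_WRITE;		/* (4) CLEAR -> WRITE */
	if (perm & (ACE_SEARCH | ACE_INVAL | ACE_JOIN))
		legacy |= OLD_SEARCH;		/* (1) INVAL, JOIN -> SEARCH */
	if (perm & ACE_LINK)
		legacy |= OLD_LINK;
	if (perm & ACE_SET_SECURITY)
		legacy |= OLD_SETATTR;		/* (2) */
	if ((perm & ACE_REVOKE) && !(legacy & OLD_SETATTR))
		legacy |= OLD_WRITE;		/* (3) only without SETATTR */

	return legacy;
}
```

This shows why a round trip need not be the identity.  A legacy portion of
SETATTR alone (0x20) turns on SET_SECURITY, INVAL and REVOKE, and that set
folds back to SEARCH | SETATTR (0x28) in this model.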


=======
TESTING
=======

This passes the keyutils testsuite for all but a couple of tests:

 (1) tests/keyctl/dh_compute/badargs: The first wrong-key-type test now
     returns EOPNOTSUPP rather than ENOKEY as READ permission isn't removed
     if the type doesn't have ->read().  You still can't actually read the
     key.

 (2) tests/keyctl/permitting/valid: The view-other-permissions test doesn't
     work as Other has been replaced with Everyone in the ACL.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 certs/blacklist.c                                  |    7 -
 certs/system_keyring.c                             |   12 -
 drivers/md/dm-crypt.c                              |    2 
 drivers/nvdimm/security.c                          |    2 
 fs/afs/security.c                                  |    2 
 fs/cifs/cifs_spnego.c                              |   25 ++
 fs/cifs/cifsacl.c                                  |   28 ++
 fs/cifs/connect.c                                  |    4 
 fs/crypto/keyinfo.c                                |    2 
 fs/ecryptfs/ecryptfs_kernel.h                      |    2 
 fs/ecryptfs/keystore.c                             |    2 
 fs/fscache/object-list.c                           |    2 
 fs/nfs/nfs4idmap.c                                 |   29 ++
 fs/ubifs/auth.c                                    |    2 
 include/linux/key.h                                |  113 +++++----
 include/uapi/linux/keyctl.h                        |   63 +++++
 lib/digsig.c                                       |    2 
 net/ceph/ceph_common.c                             |    2 
 net/dns_resolver/dns_key.c                         |   12 +
 net/dns_resolver/dns_query.c                       |   15 +
 net/rxrpc/key.c                                    |   16 +
 security/integrity/digsig.c                        |   31 +--
 security/integrity/digsig_asymmetric.c             |    2 
 security/integrity/evm/evm_crypto.c                |    2 
 security/integrity/ima/ima_mok.c                   |   13 +
 security/integrity/integrity.h                     |    4 
 .../integrity/platform_certs/platform_keyring.c    |   13 +
 security/keys/encrypted-keys/encrypted.c           |    2 
 security/keys/encrypted-keys/masterkey_trusted.c   |    2 
 security/keys/gc.c                                 |    2 
 security/keys/internal.h                           |   12 +
 security/keys/key.c                                |   29 +-
 security/keys/keyctl.c                             |   93 +++++---
 security/keys/keyring.c                            |   27 ++
 security/keys/permission.c                         |  238 +++++++++++++++++---
 security/keys/persistent.c                         |   27 ++
 security/keys/proc.c                               |   17 +
 security/keys/process_keys.c                       |   72 +++++-
 security/keys/request_key.c                        |   40 ++-
 security/keys/request_key_auth.c                   |   15 +
 security/selinux/hooks.c                           |   16 +
 security/smack/smack_lsm.c                         |    3 
 42 files changed, 726 insertions(+), 278 deletions(-)

diff --git a/certs/blacklist.c b/certs/blacklist.c
index 3a507b9e2568..7677c3b0a147 100644
--- a/certs/blacklist.c
+++ b/certs/blacklist.c
@@ -93,8 +93,7 @@ int mark_hash_blacklisted(const char *hash)
 				   hash,
 				   NULL,
 				   0,
-				   ((KEY_POS_ALL & ~KEY_POS_SETATTR) |
-				    KEY_USR_VIEW),
+				   &internal_key_acl,
 				   KEY_ALLOC_NOT_IN_QUOTA |
 				   KEY_ALLOC_BUILT_IN);
 	if (IS_ERR(key)) {
@@ -153,9 +152,7 @@ static int __init blacklist_init(void)
 		keyring_alloc(".blacklist",
 			      KUIDT_INIT(0), KGIDT_INIT(0),
 			      current_cred(),
-			      (KEY_POS_ALL & ~KEY_POS_SETATTR) |
-			      KEY_USR_VIEW | KEY_USR_READ |
-			      KEY_USR_SEARCH,
+			      &internal_keyring_acl,
 			      KEY_ALLOC_NOT_IN_QUOTA |
 			      KEY_FLAG_KEEP,
 			      NULL, NULL);
diff --git a/certs/system_keyring.c b/certs/system_keyring.c
index 81728717523d..7b775d6028e1 100644
--- a/certs/system_keyring.c
+++ b/certs/system_keyring.c
@@ -100,9 +100,7 @@ static __init int system_trusted_keyring_init(void)
 	builtin_trusted_keys =
 		keyring_alloc(".builtin_trusted_keys",
 			      KUIDT_INIT(0), KGIDT_INIT(0), current_cred(),
-			      ((KEY_POS_ALL & ~KEY_POS_SETATTR) |
-			      KEY_USR_VIEW | KEY_USR_READ | KEY_USR_SEARCH),
-			      KEY_ALLOC_NOT_IN_QUOTA,
+			      &internal_key_acl, KEY_ALLOC_NOT_IN_QUOTA,
 			      NULL, NULL);
 	if (IS_ERR(builtin_trusted_keys))
 		panic("Can't allocate builtin trusted keyring\n");
@@ -111,10 +109,7 @@ static __init int system_trusted_keyring_init(void)
 	secondary_trusted_keys =
 		keyring_alloc(".secondary_trusted_keys",
 			      KUIDT_INIT(0), KGIDT_INIT(0), current_cred(),
-			      ((KEY_POS_ALL & ~KEY_POS_SETATTR) |
-			       KEY_USR_VIEW | KEY_USR_READ | KEY_USR_SEARCH |
-			       KEY_USR_WRITE),
-			      KEY_ALLOC_NOT_IN_QUOTA,
+			      &internal_writable_keyring_acl, KEY_ALLOC_NOT_IN_QUOTA,
 			      get_builtin_and_secondary_restriction(),
 			      NULL);
 	if (IS_ERR(secondary_trusted_keys))
@@ -164,8 +159,7 @@ static __init int load_system_certificate_list(void)
 					   NULL,
 					   p,
 					   plen,
-					   ((KEY_POS_ALL & ~KEY_POS_SETATTR) |
-					   KEY_USR_VIEW | KEY_USR_READ),
+					   &internal_key_acl,
 					   KEY_ALLOC_NOT_IN_QUOTA |
 					   KEY_ALLOC_BUILT_IN |
 					   KEY_ALLOC_BYPASS_RESTRICTION);
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 0ff22159a0ca..7f37616cd21a 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -2034,7 +2034,7 @@ static int crypt_set_keyring_key(struct crypt_config *cc, const char *key_string
 		return -ENOMEM;
 
 	key = request_key(key_string[0] == 'l' ? &key_type_logon : &key_type_user,
-			  key_desc + 1, NULL);
+			  key_desc + 1, NULL, NULL);
 	if (IS_ERR(key)) {
 		kzfree(new_key_string);
 		return PTR_ERR(key);
diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
index f8bb746a549f..db5cfd934ec8 100644
--- a/drivers/nvdimm/security.c
+++ b/drivers/nvdimm/security.c
@@ -53,7 +53,7 @@ static struct key *nvdimm_request_key(struct nvdimm *nvdimm)
 	struct device *dev = &nvdimm->dev;
 
 	sprintf(desc, "%s%s", NVDIMM_PREFIX, nvdimm->dimm_id);
-	key = request_key(&key_type_encrypted, desc, "");
+	key = request_key(&key_type_encrypted, desc, "", NULL);
 	if (IS_ERR(key)) {
 		if (PTR_ERR(key) == -ENOKEY)
 			dev_dbg(dev, "request_key() found no key\n");
diff --git a/fs/afs/security.c b/fs/afs/security.c
index 5f58a9a17e69..184274ce41e1 100644
--- a/fs/afs/security.c
+++ b/fs/afs/security.c
@@ -32,7 +32,7 @@ struct key *afs_request_key(struct afs_cell *cell)
 
 	_debug("key %s", cell->anonymous_key->description);
 	key = request_key(&key_type_rxrpc, cell->anonymous_key->description,
-			  NULL);
+			  NULL, NULL);
 	if (IS_ERR(key)) {
 		if (PTR_ERR(key) != -ENOKEY) {
 			_leave(" = %ld", PTR_ERR(key));
diff --git a/fs/cifs/cifs_spnego.c b/fs/cifs/cifs_spnego.c
index 7f01c6e60791..d1b439ad0f1a 100644
--- a/fs/cifs/cifs_spnego.c
+++ b/fs/cifs/cifs_spnego.c
@@ -32,6 +32,25 @@
 #include "cifsproto.h"
 static const struct cred *spnego_cred;
 
+static struct key_acl cifs_spnego_key_acl = {
+	.usage	= REFCOUNT_INIT(1),
+	.nr_ace	= 2,
+	.possessor_viewable = true,
+	.aces = {
+		KEY_POSSESSOR_ACE(KEY_ACE_VIEW | KEY_ACE_SEARCH | KEY_ACE_READ),
+		KEY_OWNER_ACE(KEY_ACE_VIEW),
+	}
+};
+
+static struct key_acl cifs_spnego_keyring_acl = {
+	.usage	= REFCOUNT_INIT(1),
+	.nr_ace	= 2,
+	.aces = {
+		KEY_POSSESSOR_ACE(KEY_ACE_SEARCH | KEY_ACE_WRITE),
+		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ | KEY_ACE_CLEAR),
+	}
+};
+
 /* create a new cifs key */
 static int
 cifs_spnego_key_instantiate(struct key *key, struct key_preparsed_payload *prep)
@@ -170,7 +189,8 @@ cifs_get_spnego_key(struct cifs_ses *sesInfo)
 
 	cifs_dbg(FYI, "key description = %s\n", description);
 	saved_cred = override_creds(spnego_cred);
-	spnego_key = request_key(&cifs_spnego_key_type, description, "");
+	spnego_key = request_key(&cifs_spnego_key_type, description, "",
+				 &cifs_spnego_key_acl);
 	revert_creds(saved_cred);
 
 #ifdef CONFIG_CIFS_DEBUG2
@@ -207,8 +227,7 @@ init_cifs_spnego(void)
 
 	keyring = keyring_alloc(".cifs_spnego",
 				GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, cred,
-				(KEY_POS_ALL & ~KEY_POS_SETATTR) |
-				KEY_USR_VIEW | KEY_USR_READ,
+				&cifs_spnego_keyring_acl,
 				KEY_ALLOC_NOT_IN_QUOTA, NULL, NULL);
 	if (IS_ERR(keyring)) {
 		ret = PTR_ERR(keyring);
diff --git a/fs/cifs/cifsacl.c b/fs/cifs/cifsacl.c
index 1d377b7f2860..78eed72f3af0 100644
--- a/fs/cifs/cifsacl.c
+++ b/fs/cifs/cifsacl.c
@@ -33,6 +33,25 @@
 #include "cifsproto.h"
 #include "cifs_debug.h"
 
+static struct key_acl cifs_idmap_key_acl = {
+	.usage	= REFCOUNT_INIT(1),
+	.nr_ace	= 2,
+	.possessor_viewable = true,
+	.aces = {
+		KEY_POSSESSOR_ACE(KEY_ACE_VIEW | KEY_ACE_SEARCH | KEY_ACE_READ),
+		KEY_OWNER_ACE(KEY_ACE_VIEW),
+	}
+};
+
+static struct key_acl cifs_idmap_keyring_acl = {
+	.usage	= REFCOUNT_INIT(1),
+	.nr_ace	= 2,
+	.aces = {
+		KEY_POSSESSOR_ACE(KEY_ACE_SEARCH | KEY_ACE_WRITE),
+		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ),
+	}
+};
+
 /* security id for everyone/world system group */
 static const struct cifs_sid sid_everyone = {
 	1, 1, {0, 0, 0, 0, 0, 1}, {0} };
@@ -298,7 +317,8 @@ id_to_sid(unsigned int cid, uint sidtype, struct cifs_sid *ssid)
 
 	rc = 0;
 	saved_cred = override_creds(root_cred);
-	sidkey = request_key(&cifs_idmap_key_type, desc, "");
+	sidkey = request_key(&cifs_idmap_key_type, desc, "",
+			     &cifs_idmap_key_acl);
 	if (IS_ERR(sidkey)) {
 		rc = -EINVAL;
 		cifs_dbg(FYI, "%s: Can't map %cid %u to a SID\n",
@@ -403,7 +423,8 @@ sid_to_id(struct cifs_sb_info *cifs_sb, struct cifs_sid *psid,
 		return -ENOMEM;
 
 	saved_cred = override_creds(root_cred);
-	sidkey = request_key(&cifs_idmap_key_type, sidstr, "");
+	sidkey = request_key(&cifs_idmap_key_type, sidstr, "",
+			     &cifs_idmap_key_acl);
 	if (IS_ERR(sidkey)) {
 		rc = -EINVAL;
 		cifs_dbg(FYI, "%s: Can't map SID %s to a %cid\n",
@@ -481,8 +502,7 @@ init_cifs_idmap(void)
 
 	keyring = keyring_alloc(".cifs_idmap",
 				GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, cred,
-				(KEY_POS_ALL & ~KEY_POS_SETATTR) |
-				KEY_USR_VIEW | KEY_USR_READ,
+				&cifs_idmap_keyring_acl,
 				KEY_ALLOC_NOT_IN_QUOTA, NULL, NULL);
 	if (IS_ERR(keyring)) {
 		ret = PTR_ERR(keyring);
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 683310f26171..3b946fcf025c 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -2903,7 +2903,7 @@ cifs_set_cifscreds(struct smb_vol *vol, struct cifs_ses *ses)
 	}
 
 	cifs_dbg(FYI, "%s: desc=%s\n", __func__, desc);
-	key = request_key(&key_type_logon, desc, "");
+	key = request_key(&key_type_logon, desc, "", NULL);
 	if (IS_ERR(key)) {
 		if (!ses->domainName) {
 			cifs_dbg(FYI, "domainName is NULL\n");
@@ -2914,7 +2914,7 @@ cifs_set_cifscreds(struct smb_vol *vol, struct cifs_ses *ses)
 		/* didn't work, try to find a domain key */
 		sprintf(desc, "cifs:d:%s", ses->domainName);
 		cifs_dbg(FYI, "%s: desc=%s\n", __func__, desc);
-		key = request_key(&key_type_logon, desc, "");
+		key = request_key(&key_type_logon, desc, "", NULL);
 		if (IS_ERR(key)) {
 			rc = PTR_ERR(key);
 			goto out_err;
diff --git a/fs/crypto/keyinfo.c b/fs/crypto/keyinfo.c
index 1e11a683f63d..201e8715302b 100644
--- a/fs/crypto/keyinfo.c
+++ b/fs/crypto/keyinfo.c
@@ -92,7 +92,7 @@ find_and_lock_process_key(const char *prefix,
 	if (!description)
 		return ERR_PTR(-ENOMEM);
 
-	key = request_key(&key_type_logon, description, NULL);
+	key = request_key(&key_type_logon, description, NULL, NULL);
 	kfree(description);
 	if (IS_ERR(key))
 		return key;
diff --git a/fs/ecryptfs/ecryptfs_kernel.h b/fs/ecryptfs/ecryptfs_kernel.h
index e74cb2a0b299..6460bd2a4e9d 100644
--- a/fs/ecryptfs/ecryptfs_kernel.h
+++ b/fs/ecryptfs/ecryptfs_kernel.h
@@ -105,7 +105,7 @@ ecryptfs_get_encrypted_key_payload_data(struct key *key)
 
 static inline struct key *ecryptfs_get_encrypted_key(char *sig)
 {
-	return request_key(&key_type_encrypted, sig, NULL);
+	return request_key(&key_type_encrypted, sig, NULL, NULL);
 }
 
 #else
diff --git a/fs/ecryptfs/keystore.c b/fs/ecryptfs/keystore.c
index e74fe84d0886..38f4e30ed730 100644
--- a/fs/ecryptfs/keystore.c
+++ b/fs/ecryptfs/keystore.c
@@ -1625,7 +1625,7 @@ int ecryptfs_keyring_auth_tok_for_sig(struct key **auth_tok_key,
 {
 	int rc = 0;
 
-	(*auth_tok_key) = request_key(&key_type_user, sig, NULL);
+	(*auth_tok_key) = request_key(&key_type_user, sig, NULL, NULL);
 	if (!(*auth_tok_key) || IS_ERR(*auth_tok_key)) {
 		(*auth_tok_key) = ecryptfs_get_encrypted_key(sig);
 		if (!(*auth_tok_key) || IS_ERR(*auth_tok_key)) {
diff --git a/fs/fscache/object-list.c b/fs/fscache/object-list.c
index 43e6e28c164f..6a672289e5ec 100644
--- a/fs/fscache/object-list.c
+++ b/fs/fscache/object-list.c
@@ -321,7 +321,7 @@ static void fscache_objlist_config(struct fscache_objlist_data *data)
 	const char *buf;
 	int len;
 
-	key = request_key(&key_type_user, "fscache:objlist", NULL);
+	key = request_key(&key_type_user, "fscache:objlist", NULL, NULL);
 	if (IS_ERR(key))
 		goto no_config;
 
diff --git a/fs/nfs/nfs4idmap.c b/fs/nfs/nfs4idmap.c
index bf34ddaa2ad7..25f3f2f97ce9 100644
--- a/fs/nfs/nfs4idmap.c
+++ b/fs/nfs/nfs4idmap.c
@@ -71,6 +71,25 @@ struct idmap {
 	struct mutex		idmap_mutex;
 };
 
+static struct key_acl nfs_idmap_key_acl = {
+	.usage	= REFCOUNT_INIT(1),
+	.nr_ace	= 2,
+	.possessor_viewable = true,
+	.aces = {
+		KEY_POSSESSOR_ACE(KEY_ACE_VIEW | KEY_ACE_SEARCH | KEY_ACE_READ),
+		KEY_OWNER_ACE(KEY_ACE_VIEW),
+	}
+};
+
+static struct key_acl nfs_idmap_keyring_acl = {
+	.usage	= REFCOUNT_INIT(1),
+	.nr_ace	= 2,
+	.aces = {
+		KEY_POSSESSOR_ACE(KEY_ACE_SEARCH | KEY_ACE_WRITE),
+		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ),
+	}
+};
+
 /**
  * nfs_fattr_init_names - initialise the nfs_fattr owner_name/group_name fields
  * @fattr: fully initialised struct nfs_fattr
@@ -200,8 +219,7 @@ int nfs_idmap_init(void)
 
 	keyring = keyring_alloc(".id_resolver",
 				GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, cred,
-				(KEY_POS_ALL & ~KEY_POS_SETATTR) |
-				KEY_USR_VIEW | KEY_USR_READ,
+				&nfs_idmap_keyring_acl,
 				KEY_ALLOC_NOT_IN_QUOTA, NULL, NULL);
 	if (IS_ERR(keyring)) {
 		ret = PTR_ERR(keyring);
@@ -278,11 +296,12 @@ static struct key *nfs_idmap_request_key(const char *name, size_t namelen,
 	if (ret < 0)
 		return ERR_PTR(ret);
 
-	rkey = request_key(&key_type_id_resolver, desc, "");
+	rkey = request_key(&key_type_id_resolver, desc, "", &nfs_idmap_key_acl);
 	if (IS_ERR(rkey)) {
 		mutex_lock(&idmap->idmap_mutex);
 		rkey = request_key_with_auxdata(&key_type_id_resolver_legacy,
-						desc, "", 0, idmap);
+						desc, "", 0, idmap,
+						&nfs_idmap_key_acl);
 		mutex_unlock(&idmap->idmap_mutex);
 	}
 	if (!IS_ERR(rkey))
@@ -311,8 +330,6 @@ static ssize_t nfs_idmap_get_key(const char *name, size_t namelen,
 	}
 
 	rcu_read_lock();
-	rkey->perm |= KEY_USR_VIEW;
-
 	ret = key_validate(rkey);
 	if (ret < 0)
 		goto out_up;
diff --git a/fs/ubifs/auth.c b/fs/ubifs/auth.c
index 5bf5fd08879e..38bae9737166 100644
--- a/fs/ubifs/auth.c
+++ b/fs/ubifs/auth.c
@@ -246,7 +246,7 @@ int ubifs_init_authentication(struct ubifs_info *c)
 	snprintf(hmac_name, CRYPTO_MAX_ALG_NAME, "hmac(%s)",
 		 c->auth_hash_name);
 
-	keyring_key = request_key(&key_type_logon, c->auth_key_name, NULL);
+	keyring_key = request_key(&key_type_logon, c->auth_key_name, NULL, NULL);
 
 	if (IS_ERR(keyring_key)) {
 		ubifs_err(c, "Failed to request key: %ld",
diff --git a/include/linux/key.h b/include/linux/key.h
index de190036512b..a38b89bd414c 100644
--- a/include/linux/key.h
+++ b/include/linux/key.h
@@ -32,49 +32,14 @@
 /* key handle serial number */
 typedef int32_t key_serial_t;
 
-/* key handle permissions mask */
-typedef uint32_t key_perm_t;
-
 struct key;
 struct net;
 
 #ifdef CONFIG_KEYS
 
-#undef KEY_DEBUGGING
+#include <linux/keyctl.h>
 
-#define KEY_POS_VIEW	0x01000000	/* possessor can view a key's attributes */
-#define KEY_POS_READ	0x02000000	/* possessor can read key payload / view keyring */
-#define KEY_POS_WRITE	0x04000000	/* possessor can update key payload / add link to keyring */
-#define KEY_POS_SEARCH	0x08000000	/* possessor can find a key in search / search a keyring */
-#define KEY_POS_LINK	0x10000000	/* possessor can create a link to a key/keyring */
-#define KEY_POS_SETATTR	0x20000000	/* possessor can set key attributes */
-#define KEY_POS_ALL	0x3f000000
-
-#define KEY_USR_VIEW	0x00010000	/* user permissions... */
-#define KEY_USR_READ	0x00020000
-#define KEY_USR_WRITE	0x00040000
-#define KEY_USR_SEARCH	0x00080000
-#define KEY_USR_LINK	0x00100000
-#define KEY_USR_SETATTR	0x00200000
-#define KEY_USR_ALL	0x003f0000
-
-#define KEY_GRP_VIEW	0x00000100	/* group permissions... */
-#define KEY_GRP_READ	0x00000200
-#define KEY_GRP_WRITE	0x00000400
-#define KEY_GRP_SEARCH	0x00000800
-#define KEY_GRP_LINK	0x00001000
-#define KEY_GRP_SETATTR	0x00002000
-#define KEY_GRP_ALL	0x00003f00
-
-#define KEY_OTH_VIEW	0x00000001	/* third party permissions... */
-#define KEY_OTH_READ	0x00000002
-#define KEY_OTH_WRITE	0x00000004
-#define KEY_OTH_SEARCH	0x00000008
-#define KEY_OTH_LINK	0x00000010
-#define KEY_OTH_SETATTR	0x00000020
-#define KEY_OTH_ALL	0x0000003f
-
-#define KEY_PERM_UNDEF	0xffffffff
+#undef KEY_DEBUGGING
 
 struct seq_file;
 struct user_struct;
@@ -118,6 +83,36 @@ union key_payload {
 	void			*data[4];
 };
 
+struct key_ace {
+	unsigned int		type;
+	unsigned int		perm;
+	union {
+		kuid_t		uid;
+		kgid_t		gid;
+		unsigned int	subject_id;
+	};
+};
+
+struct key_acl {
+	refcount_t		usage;
+	unsigned short		nr_ace;
+	bool			possessor_viewable;
+	struct rcu_head		rcu;
+	struct key_ace		aces[];
+};
+
+#define KEY_POSSESSOR_ACE(perms) {			\
+		.type = KEY_ACE_SUBJ_STANDARD,		\
+		.perm = perms,				\
+		.subject_id = KEY_ACE_POSSESSOR		\
+	}
+
+#define KEY_OWNER_ACE(perms) {				\
+		.type = KEY_ACE_SUBJ_STANDARD,		\
+		.perm = perms,				\
+		.subject_id = KEY_ACE_OWNER		\
+	}
+
 /*****************************************************************************/
 /*
  * key reference with possession attribute handling
@@ -187,6 +182,7 @@ struct key {
 	struct rw_semaphore	sem;		/* change vs change sem */
 	struct key_user		*user;		/* owner of this key */
 	void			*security;	/* security data for this key */
+	struct key_acl		__rcu *acl;
 	union {
 		time64_t	expiry;		/* time at which key expires (or 0) */
 		time64_t	revoked_at;	/* time at which key was revoked */
@@ -194,7 +190,6 @@ struct key {
 	time64_t		last_used_at;	/* last time used for LRU keyring discard */
 	kuid_t			uid;
 	kgid_t			gid;
-	key_perm_t		perm;		/* access permissions */
 	unsigned short		quotalen;	/* length added to quota */
 	unsigned short		datalen;	/* payload data length
 						 * - may not match RCU dereferenced payload
@@ -220,6 +215,7 @@ struct key {
 #define KEY_FLAG_UID_KEYRING	9	/* set if key is a user or user session keyring */
 #define KEY_FLAG_SET_WATCH_PROXY 10	/* Set if watch_proxy should be set on added keys */
 #define KEY_FLAG_SEEN		11	/* Set if returned by keyctl_find_oldest_key() */
+#define KEY_FLAG_HAS_ACL	12	/* Set if KEYCTL_SETACL called on key */
 
 	/* the key type and key description string
 	 * - the desc is used to match a key against search criteria
@@ -268,7 +264,7 @@ extern struct key *key_alloc(struct key_type *type,
 			     const char *desc,
 			     kuid_t uid, kgid_t gid,
 			     const struct cred *cred,
-			     key_perm_t perm,
+			     struct key_acl *acl,
 			     unsigned long flags,
 			     struct key_restriction *restrict_link);
 
@@ -304,18 +300,21 @@ static inline void key_ref_put(key_ref_t key_ref)
 
 extern struct key *request_key(struct key_type *type,
 			       const char *description,
-			       const char *callout_info);
+			       const char *callout_info,
+			       struct key_acl *acl);
 
 extern struct key *request_key_with_auxdata(struct key_type *type,
 					    const char *description,
 					    const void *callout_info,
 					    size_t callout_len,
-					    void *aux);
+					    void *aux,
+					    struct key_acl *acl);
 
 extern struct key *request_key_net(struct key_type *type,
 				   const char *description,
 				   struct net *net,
-				   const char *callout_info);
+				   const char *callout_info,
+				   struct key_acl *acl);
 
 extern int wait_for_key_construction(struct key *key, bool intr);
 
@@ -326,7 +325,7 @@ extern key_ref_t key_create_or_update(key_ref_t keyring,
 				      const char *description,
 				      const void *payload,
 				      size_t plen,
-				      key_perm_t perm,
+				      struct key_acl *acl,
 				      unsigned long flags);
 
 extern int key_update(key_ref_t key,
@@ -346,7 +345,7 @@ extern int key_unlink(struct key *keyring,
 
 extern struct key *keyring_alloc(const char *description, kuid_t uid, kgid_t gid,
 				 const struct cred *cred,
-				 key_perm_t perm,
+				 struct key_acl *acl,
 				 unsigned long flags,
 				 struct key_restriction *restrict_link,
 				 struct key *dest);
@@ -378,19 +377,29 @@ static inline key_serial_t key_serial(const struct key *key)
 extern void key_set_timeout(struct key *, unsigned);
 
 extern key_ref_t lookup_user_key(key_serial_t id, unsigned long flags,
-				 key_perm_t perm);
+				 u32 desired_perm);
 extern void key_free_user_ns(struct user_namespace *);
 
 /*
  * The permissions required on a key that we're looking up.
  */
-#define	KEY_NEED_VIEW	0x01	/* Require permission to view attributes */
-#define	KEY_NEED_READ	0x02	/* Require permission to read content */
-#define	KEY_NEED_WRITE	0x04	/* Require permission to update / modify */
-#define	KEY_NEED_SEARCH	0x08	/* Require permission to search (keyring) or find (key) */
-#define	KEY_NEED_LINK	0x10	/* Require permission to link */
-#define	KEY_NEED_SETATTR 0x20	/* Require permission to change attributes */
-#define	KEY_NEED_ALL	0x3f	/* All the above permissions */
+#define	KEY_NEED_VIEW	0x001	/* Require permission to view attributes */
+#define	KEY_NEED_READ	0x002	/* Require permission to read content */
+#define	KEY_NEED_WRITE	0x004	/* Require permission to update / modify */
+#define	KEY_NEED_SEARCH	0x008	/* Require permission to search (keyring) or find (key) */
+#define	KEY_NEED_LINK	0x010	/* Require permission to link */
+#define	KEY_NEED_SETSEC	0x020	/* Require permission to set owner, group, ACL */
+#define	KEY_NEED_INVAL	0x040	/* Require permission to invalidate key */
+#define	KEY_NEED_REVOKE	0x080	/* Require permission to revoke key */
+#define	KEY_NEED_JOIN	0x100	/* Require permission to join keyring as session */
+#define	KEY_NEED_CLEAR	0x200	/* Require permission to clear a keyring */
+#define	KEY_NEED_ALL	0x3ff	/* All the above permissions */
+
+#define OLD_KEY_NEED_SETATTR 0x20 /* Formerly: require permission to change attributes */
+
+extern struct key_acl internal_key_acl;
+extern struct key_acl internal_keyring_acl;
+extern struct key_acl internal_writable_keyring_acl;
 
 static inline short key_read_state(const struct key *key)
 {
diff --git a/include/uapi/linux/keyctl.h b/include/uapi/linux/keyctl.h
index a2afb4512f34..50d7b6ca82ab 100644
--- a/include/uapi/linux/keyctl.h
+++ b/include/uapi/linux/keyctl.h
@@ -15,6 +15,69 @@
 
 #include <linux/types.h>
 
+/*
+ * Keyring permission grant definitions
+ */
+enum key_ace_subject_type {
+	KEY_ACE_SUBJ_STANDARD	= 0,	/* subject is one of key_ace_standard_subject */
+	nr__key_ace_subject_type
+};
+
+enum key_ace_standard_subject {
+	KEY_ACE_EVERYONE	= 0,	/* Everyone, including owner and group */
+	KEY_ACE_GROUP		= 1,	/* The key's group */
+	KEY_ACE_OWNER		= 2,	/* The owner of the key */
+	KEY_ACE_POSSESSOR	= 3,	/* Any process that possesses the key */
+	nr__key_ace_standard_subject
+};
+
+#define KEY_ACE_VIEW		0x00000001 /* Can describe the key */
+#define KEY_ACE_READ		0x00000002 /* Can read the key content */
+#define KEY_ACE_WRITE		0x00000004 /* Can update/modify the key content */
+#define KEY_ACE_SEARCH		0x00000008 /* Can find the key by search */
+#define KEY_ACE_LINK		0x00000010 /* Can make a link to the key */
+#define KEY_ACE_SET_SECURITY	0x00000020 /* Can set owner, group, ACL */
+#define KEY_ACE_INVAL		0x00000040 /* Can invalidate the key */
+#define KEY_ACE_REVOKE		0x00000080 /* Can revoke the key */
+#define KEY_ACE_JOIN		0x00000100 /* Can join keyring */
+#define KEY_ACE_CLEAR		0x00000200 /* Can clear keyring */
+#define KEY_ACE__PERMS		0xffffffff
+
+/*
+ * Old-style permissions mask, deprecated in favour of ACL.
+ */
+#define KEY_POS_VIEW	0x01000000	/* possessor can view a key's attributes */
+#define KEY_POS_READ	0x02000000	/* possessor can read key payload / view keyring */
+#define KEY_POS_WRITE	0x04000000	/* possessor can update key payload / add link to keyring */
+#define KEY_POS_SEARCH	0x08000000	/* possessor can find a key in search / search a keyring */
+#define KEY_POS_LINK	0x10000000	/* possessor can create a link to a key/keyring */
+#define KEY_POS_SETATTR	0x20000000	/* possessor can set key attributes */
+#define KEY_POS_ALL	0x3f000000
+
+#define KEY_USR_VIEW	0x00010000	/* user permissions... */
+#define KEY_USR_READ	0x00020000
+#define KEY_USR_WRITE	0x00040000
+#define KEY_USR_SEARCH	0x00080000
+#define KEY_USR_LINK	0x00100000
+#define KEY_USR_SETATTR	0x00200000
+#define KEY_USR_ALL	0x003f0000
+
+#define KEY_GRP_VIEW	0x00000100	/* group permissions... */
+#define KEY_GRP_READ	0x00000200
+#define KEY_GRP_WRITE	0x00000400
+#define KEY_GRP_SEARCH	0x00000800
+#define KEY_GRP_LINK	0x00001000
+#define KEY_GRP_SETATTR	0x00002000
+#define KEY_GRP_ALL	0x00003f00
+
+#define KEY_OTH_VIEW	0x00000001	/* third party permissions... */
+#define KEY_OTH_READ	0x00000002
+#define KEY_OTH_WRITE	0x00000004
+#define KEY_OTH_SEARCH	0x00000008
+#define KEY_OTH_LINK	0x00000010
+#define KEY_OTH_SETATTR	0x00000020
+#define KEY_OTH_ALL	0x0000003f
+
 /* special process keyring shortcut IDs */
 #define KEY_SPEC_THREAD_KEYRING		-1	/* - key ID for thread-specific keyring */
 #define KEY_SPEC_PROCESS_KEYRING	-2	/* - key ID for process-specific keyring */
diff --git a/lib/digsig.c b/lib/digsig.c
index 6ba6fcd92dd1..8cfa53585267 100644
--- a/lib/digsig.c
+++ b/lib/digsig.c
@@ -227,7 +227,7 @@ int digsig_verify(struct key *keyring, const char *sig, int siglen,
 		else
 			key = key_ref_to_ptr(kref);
 	} else {
-		key = request_key(&key_type_user, name, NULL);
+		key = request_key(&key_type_user, name, NULL, NULL);
 	}
 	if (IS_ERR(key)) {
 		pr_err("key not found, id: %s\n", name);
diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c
index 9cab80207ced..c6efe800392e 100644
--- a/net/ceph/ceph_common.c
+++ b/net/ceph/ceph_common.c
@@ -305,7 +305,7 @@ static int get_secret(struct ceph_crypto_key *dst, const char *name) {
 	int err = 0;
 	struct ceph_crypto_key *ckey;
 
-	ukey = request_key(&key_type_ceph, name, NULL);
+	ukey = request_key(&key_type_ceph, name, NULL, NULL);
 	if (IS_ERR(ukey)) {
 		/* request_key errors don't map nicely to mount(2)
 		   errors; don't even try, but still printk */
diff --git a/net/dns_resolver/dns_key.c b/net/dns_resolver/dns_key.c
index 3e1a90669006..6b201531b165 100644
--- a/net/dns_resolver/dns_key.c
+++ b/net/dns_resolver/dns_key.c
@@ -46,6 +46,15 @@ const struct cred *dns_resolver_cache;
 
 #define	DNS_ERRORNO_OPTION	"dnserror"
 
+static struct key_acl dns_keyring_acl = {
+	.usage	= REFCOUNT_INIT(1),
+	.nr_ace	= 2,
+	.aces = {
+		KEY_POSSESSOR_ACE(KEY_ACE_SEARCH | KEY_ACE_WRITE),
+		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ | KEY_ACE_CLEAR),
+	}
+};
+
 /*
  * Preparse instantiation data for a dns_resolver key.
  *
@@ -343,8 +352,7 @@ static int __init init_dns_resolver(void)
 
 	keyring = keyring_alloc(".dns_resolver",
 				GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, cred,
-				(KEY_POS_ALL & ~KEY_POS_SETATTR) |
-				KEY_USR_VIEW | KEY_USR_READ,
+				&dns_keyring_acl,
 				KEY_ALLOC_NOT_IN_QUOTA, NULL, NULL);
 	if (IS_ERR(keyring)) {
 		ret = PTR_ERR(keyring);
diff --git a/net/dns_resolver/dns_query.c b/net/dns_resolver/dns_query.c
index d88ea98da63e..3a6436a7931a 100644
--- a/net/dns_resolver/dns_query.c
+++ b/net/dns_resolver/dns_query.c
@@ -46,6 +46,16 @@
 
 #include "internal.h"
 
+static struct key_acl dns_key_acl = {
+	.usage	= REFCOUNT_INIT(1),
+	.nr_ace	= 2,
+	.possessor_viewable = true,
+	.aces = {
+		KEY_POSSESSOR_ACE(KEY_ACE_VIEW | KEY_ACE_SEARCH | KEY_ACE_READ),
+		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_INVAL),
+	}
+};
+
 /**
  * dns_query - Query the DNS
  * @net: The network namespace to operate in.
@@ -124,7 +134,8 @@ int dns_query(struct net *net,
 	 * add_key() to preinstall malicious redirections
 	 */
 	saved_cred = override_creds(dns_resolver_cache);
-	rkey = request_key_net(&key_type_dns_resolver, desc, net, options);
+	rkey = request_key_net(&key_type_dns_resolver, desc, net, options,
+			       &dns_key_acl);
 	revert_creds(saved_cred);
 	kfree(desc);
 	if (IS_ERR(rkey)) {
@@ -134,8 +145,6 @@ int dns_query(struct net *net,
 
 	down_read(&rkey->sem);
 	set_bit(KEY_FLAG_ROOT_CAN_INVAL, &rkey->flags);
-	rkey->perm |= KEY_USR_VIEW;
-
 	ret = key_validate(rkey);
 	if (ret < 0)
 		goto put;
diff --git a/net/rxrpc/key.c b/net/rxrpc/key.c
index 1cc6b0c6cc42..284d7a025fbc 100644
--- a/net/rxrpc/key.c
+++ b/net/rxrpc/key.c
@@ -27,6 +27,14 @@
 #include <keys/user-type.h>
 #include "ar-internal.h"
 
+static struct key_acl rxrpc_null_key_acl = {
+	.usage	= REFCOUNT_INIT(1),
+	.nr_ace	= 1,
+	.aces = {
+		KEY_POSSESSOR_ACE(KEY_ACE_SEARCH | KEY_ACE_READ),
+	}
+};
+
 static int rxrpc_vet_description_s(const char *);
 static int rxrpc_preparse(struct key_preparsed_payload *);
 static int rxrpc_preparse_s(struct key_preparsed_payload *);
@@ -914,7 +922,8 @@ int rxrpc_request_key(struct rxrpc_sock *rx, char __user *optval, int optlen)
 	if (IS_ERR(description))
 		return PTR_ERR(description);
 
-	key = request_key_net(&key_type_rxrpc, description, sock_net(&rx->sk), NULL);
+	key = request_key_net(&key_type_rxrpc, description, sock_net(&rx->sk),
+			      NULL, NULL);
 	if (IS_ERR(key)) {
 		kfree(description);
 		_leave(" = %ld", PTR_ERR(key));
@@ -945,7 +954,8 @@ int rxrpc_server_keyring(struct rxrpc_sock *rx, char __user *optval,
 	if (IS_ERR(description))
 		return PTR_ERR(description);
 
-	key = request_key_net(&key_type_keyring, description, sock_net(&rx->sk), NULL);
+	key = request_key_net(&key_type_keyring, description, sock_net(&rx->sk),
+			      NULL, NULL);
 	if (IS_ERR(key)) {
 		kfree(description);
 		_leave(" = %ld", PTR_ERR(key));
@@ -1026,7 +1036,7 @@ struct key *rxrpc_get_null_key(const char *keyname)
 
 	key = key_alloc(&key_type_rxrpc, keyname,
 			GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, cred,
-			KEY_POS_SEARCH, KEY_ALLOC_NOT_IN_QUOTA, NULL);
+			&rxrpc_null_key_acl, KEY_ALLOC_NOT_IN_QUOTA, NULL);
 	if (IS_ERR(key))
 		return key;
 
diff --git a/security/integrity/digsig.c b/security/integrity/digsig.c
index f45d6edecf99..c666dc72006a 100644
--- a/security/integrity/digsig.c
+++ b/security/integrity/digsig.c
@@ -51,7 +51,8 @@ int integrity_digsig_verify(const unsigned int id, const char *sig, int siglen,
 
 	if (!keyring[id]) {
 		keyring[id] =
-			request_key(&key_type_keyring, keyring_name[id], NULL);
+			request_key(&key_type_keyring, keyring_name[id],
+				    NULL, NULL);
 		if (IS_ERR(keyring[id])) {
 			int err = PTR_ERR(keyring[id]);
 			pr_err("no %s keyring: %d\n", keyring_name[id], err);
@@ -73,14 +74,14 @@ int integrity_digsig_verify(const unsigned int id, const char *sig, int siglen,
 	return -EOPNOTSUPP;
 }
 
-static int __integrity_init_keyring(const unsigned int id, key_perm_t perm,
+static int __integrity_init_keyring(const unsigned int id, struct key_acl *acl,
 				    struct key_restriction *restriction)
 {
 	const struct cred *cred = current_cred();
 	int err = 0;
 
 	keyring[id] = keyring_alloc(keyring_name[id], KUIDT_INIT(0),
-				    KGIDT_INIT(0), cred, perm,
+				    KGIDT_INIT(0), cred, acl,
 				    KEY_ALLOC_NOT_IN_QUOTA, restriction, NULL);
 	if (IS_ERR(keyring[id])) {
 		err = PTR_ERR(keyring[id]);
@@ -95,10 +96,7 @@ static int __integrity_init_keyring(const unsigned int id, key_perm_t perm,
 int __init integrity_init_keyring(const unsigned int id)
 {
 	struct key_restriction *restriction;
-	key_perm_t perm;
-
-	perm = (KEY_POS_ALL & ~KEY_POS_SETATTR) | KEY_USR_VIEW
-		| KEY_USR_READ | KEY_USR_SEARCH;
+	struct key_acl *acl = &internal_keyring_acl;
 
 	if (id == INTEGRITY_KEYRING_PLATFORM) {
 		restriction = NULL;
@@ -113,14 +111,14 @@ int __init integrity_init_keyring(const unsigned int id)
 		return -ENOMEM;
 
 	restriction->check = restrict_link_to_ima;
-	perm |= KEY_USR_WRITE;
+	acl = &internal_writable_keyring_acl;
 
 out:
-	return __integrity_init_keyring(id, perm, restriction);
+	return __integrity_init_keyring(id, acl, restriction);
 }
 
-int __init integrity_add_key(const unsigned int id, const void *data,
-			     off_t size, key_perm_t perm)
+static int __init integrity_add_key(const unsigned int id, const void *data,
+				    off_t size, struct key_acl *acl)
 {
 	key_ref_t key;
 	int rc = 0;
@@ -129,7 +127,7 @@ int __init integrity_add_key(const unsigned int id, const void *data,
 		return -EINVAL;
 
 	key = key_create_or_update(make_key_ref(keyring[id], 1), "asymmetric",
-				   NULL, data, size, perm,
+				   NULL, data, size, acl ?: &internal_key_acl,
 				   KEY_ALLOC_NOT_IN_QUOTA);
 	if (IS_ERR(key)) {
 		rc = PTR_ERR(key);
@@ -149,7 +147,6 @@ int __init integrity_load_x509(const unsigned int id, const char *path)
 	void *data;
 	loff_t size;
 	int rc;
-	key_perm_t perm;
 
 	rc = kernel_read_file_from_path(path, &data, &size, 0,
 					READING_X509_CERTIFICATE);
@@ -158,21 +155,19 @@ int __init integrity_load_x509(const unsigned int id, const char *path)
 		return rc;
 	}
 
-	perm = (KEY_POS_ALL & ~KEY_POS_SETATTR) | KEY_USR_VIEW | KEY_USR_READ;
-
 	pr_info("Loading X.509 certificate: %s\n", path);
-	rc = integrity_add_key(id, (const void *)data, size, perm);
+	rc = integrity_add_key(id, data, size, NULL);
 
 	vfree(data);
 	return rc;
 }
 
 int __init integrity_load_cert(const unsigned int id, const char *source,
-			       const void *data, size_t len, key_perm_t perm)
+			       const void *data, size_t len, struct key_acl *acl)
 {
 	if (!data)
 		return -EINVAL;
 
 	pr_info("Loading X.509 certificate: %s\n", source);
-	return integrity_add_key(id, data, len, perm);
+	return integrity_add_key(id, data, len, acl);
 }
diff --git a/security/integrity/digsig_asymmetric.c b/security/integrity/digsig_asymmetric.c
index d775e03fbbcc..017cb6db521d 100644
--- a/security/integrity/digsig_asymmetric.c
+++ b/security/integrity/digsig_asymmetric.c
@@ -57,7 +57,7 @@ static struct key *request_asymmetric_key(struct key *keyring, uint32_t keyid)
 		else
 			key = key_ref_to_ptr(kref);
 	} else {
-		key = request_key(&key_type_asymmetric, name, NULL);
+		key = request_key(&key_type_asymmetric, name, NULL, NULL);
 	}
 
 	if (IS_ERR(key)) {
diff --git a/security/integrity/evm/evm_crypto.c b/security/integrity/evm/evm_crypto.c
index 43e2dc3a60d0..945f42b762e4 100644
--- a/security/integrity/evm/evm_crypto.c
+++ b/security/integrity/evm/evm_crypto.c
@@ -358,7 +358,7 @@ int evm_init_key(void)
 	struct encrypted_key_payload *ekp;
 	int rc;
 
-	evm_key = request_key(&key_type_encrypted, EVMKEY, NULL);
+	evm_key = request_key(&key_type_encrypted, EVMKEY, NULL, NULL);
 	if (IS_ERR(evm_key))
 		return -ENOENT;
 
diff --git a/security/integrity/ima/ima_mok.c b/security/integrity/ima/ima_mok.c
index 073ddc9bce5b..ce48303cfacc 100644
--- a/security/integrity/ima/ima_mok.c
+++ b/security/integrity/ima/ima_mok.c
@@ -21,6 +21,15 @@
 #include <keys/system_keyring.h>
 
 
+static struct key_acl integrity_blacklist_keyring_acl = {
+	.usage	= REFCOUNT_INIT(1),
+	.nr_ace	= 2,
+	.aces = {
+		KEY_POSSESSOR_ACE(KEY_ACE_SEARCH | KEY_ACE_WRITE),
+		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ | KEY_ACE_WRITE | KEY_ACE_SEARCH),
+	}
+};
+
 struct key *ima_blacklist_keyring;
 
 /*
@@ -40,9 +49,7 @@ __init int ima_mok_init(void)
 
 	ima_blacklist_keyring = keyring_alloc(".ima_blacklist",
 				KUIDT_INIT(0), KGIDT_INIT(0), current_cred(),
-				(KEY_POS_ALL & ~KEY_POS_SETATTR) |
-				KEY_USR_VIEW | KEY_USR_READ |
-				KEY_USR_WRITE | KEY_USR_SEARCH,
+				&integrity_blacklist_keyring_acl,
 				KEY_ALLOC_NOT_IN_QUOTA,
 				restriction, NULL);
 
diff --git a/security/integrity/integrity.h b/security/integrity/integrity.h
index 7de59f44cba3..fbc1264af55f 100644
--- a/security/integrity/integrity.h
+++ b/security/integrity/integrity.h
@@ -154,7 +154,7 @@ int integrity_digsig_verify(const unsigned int id, const char *sig, int siglen,
 int __init integrity_init_keyring(const unsigned int id);
 int __init integrity_load_x509(const unsigned int id, const char *path);
 int __init integrity_load_cert(const unsigned int id, const char *source,
-			       const void *data, size_t len, key_perm_t perm);
+			       const void *data, size_t len, struct key_acl *acl);
 #else
 
 static inline int integrity_digsig_verify(const unsigned int id,
@@ -172,7 +172,7 @@ static inline int integrity_init_keyring(const unsigned int id)
 static inline int __init integrity_load_cert(const unsigned int id,
 					     const char *source,
 					     const void *data, size_t len,
-					     key_perm_t perm)
+					     struct key_acl *acl)
 {
 	return 0;
 }
diff --git a/security/integrity/platform_certs/platform_keyring.c b/security/integrity/platform_certs/platform_keyring.c
index bcafd7387729..80bb6f750045 100644
--- a/security/integrity/platform_certs/platform_keyring.c
+++ b/security/integrity/platform_certs/platform_keyring.c
@@ -14,6 +14,15 @@
 #include <linux/slab.h>
 #include "../integrity.h"
 
+static struct key_acl platform_key_acl = {
+	.usage	= REFCOUNT_INIT(1),
+	.nr_ace	= 2,
+	.aces = {
+		KEY_POSSESSOR_ACE(KEY_ACE_SEARCH | KEY_ACE_READ),
+		KEY_OWNER_ACE(KEY_ACE_VIEW),
+	}
+};
+
 /**
  * add_to_platform_keyring - Add to platform keyring without validation.
  * @source: Source of key
@@ -29,10 +38,7 @@ void __init add_to_platform_keyring(const char *source, const void *data,
-	key_perm_t perm;
 	int rc;
 
-	perm = (KEY_POS_ALL & ~KEY_POS_SETATTR) | KEY_USR_VIEW;
-
 	rc = integrity_load_cert(INTEGRITY_KEYRING_PLATFORM, source, data, len,
-				 perm);
+				 &platform_key_acl);
 	if (rc)
 		pr_info("Error adding keys to platform keyring %s\n", source);
 }
diff --git a/security/keys/encrypted-keys/encrypted.c b/security/keys/encrypted-keys/encrypted.c
index 389a298274d3..376068ec5a4e 100644
--- a/security/keys/encrypted-keys/encrypted.c
+++ b/security/keys/encrypted-keys/encrypted.c
@@ -307,7 +307,7 @@ static struct key *request_user_key(const char *master_desc, const u8 **master_k
 	const struct user_key_payload *upayload;
 	struct key *ukey;
 
-	ukey = request_key(&key_type_user, master_desc, NULL);
+	ukey = request_key(&key_type_user, master_desc, NULL, NULL);
 	if (IS_ERR(ukey))
 		goto error;
 
diff --git a/security/keys/encrypted-keys/masterkey_trusted.c b/security/keys/encrypted-keys/masterkey_trusted.c
index dc3d18cae642..3322e7eeafce 100644
--- a/security/keys/encrypted-keys/masterkey_trusted.c
+++ b/security/keys/encrypted-keys/masterkey_trusted.c
@@ -33,7 +33,7 @@ struct key *request_trusted_key(const char *trusted_desc,
 	struct trusted_key_payload *tpayload;
 	struct key *tkey;
 
-	tkey = request_key(&key_type_trusted, trusted_desc, NULL);
+	tkey = request_key(&key_type_trusted, trusted_desc, NULL, NULL);
 	if (IS_ERR(tkey))
 		goto error;
 
diff --git a/security/keys/gc.c b/security/keys/gc.c
index c39721163d43..cb667becf224 100644
--- a/security/keys/gc.c
+++ b/security/keys/gc.c
@@ -160,6 +160,7 @@ static noinline void key_gc_unused_keys(struct list_head *keys)
 
 		key_user_put(key->user);
 		key_put_tag(key->domain_tag);
+		key_put_acl(key->acl);
 		kfree(key->description);
 
 		memzero_explicit(key, sizeof(*key));
@@ -229,7 +230,6 @@ static void key_garbage_collector(struct work_struct *work)
 			if (key->type == key_gc_dead_keytype) {
 				gc_state |= KEY_GC_FOUND_DEAD_KEY;
 				set_bit(KEY_FLAG_DEAD, &key->flags);
-				key->perm = 0;
 				goto skip_dead_key;
 			} else if (key->type == &key_type_keyring &&
 				   key->restrict_link) {
diff --git a/security/keys/internal.h b/security/keys/internal.h
index 6be76caee874..9f9ecc1810c9 100644
--- a/security/keys/internal.h
+++ b/security/keys/internal.h
@@ -89,8 +89,11 @@ extern struct rb_root key_serial_tree;
 extern spinlock_t key_serial_lock;
 extern struct mutex key_construction_mutex;
 extern wait_queue_head_t request_key_conswq;
+extern struct key_acl default_key_acl;
+extern struct key_acl joinable_keyring_acl;
 
 extern void key_set_index_key(struct keyring_index_key *index_key);
+
 extern struct key_type *key_type_lookup(const char *type);
 extern void key_type_put(struct key_type *ktype);
 extern int key_get_type_from_user(char *, const char __user *, unsigned);
@@ -157,6 +160,7 @@ extern struct key *request_key_and_link(struct key_type *type,
 					const void *callout_info,
 					size_t callout_len,
 					void *aux,
+					struct key_acl *acl,
 					struct key *dest_keyring,
 					unsigned long flags);
 
@@ -180,7 +184,11 @@ extern void key_gc_keytype(struct key_type *ktype);
 
 extern int key_task_permission(const key_ref_t key_ref,
 			       const struct cred *cred,
-			       key_perm_t perm);
+			       u32 desired_perm);
+extern unsigned int key_acl_to_perm(const struct key_acl *acl);
+extern long key_set_acl(struct key *key, struct key_acl *acl);
+extern void key_put_acl(struct key_acl *acl);
+
 #ifdef CONFIG_CONTAINERS
 extern int queue_request_key(struct key *);
 #else
@@ -249,7 +257,7 @@ extern long keyctl_keyring_search(key_serial_t, const char __user *,
 				  const char __user *, key_serial_t);
 extern long keyctl_read_key(key_serial_t, char __user *, size_t);
 extern long keyctl_chown_key(key_serial_t, uid_t, gid_t);
-extern long keyctl_setperm_key(key_serial_t, key_perm_t);
+extern long keyctl_setperm_key(key_serial_t, unsigned int);
 extern long keyctl_instantiate_key(key_serial_t, const void __user *,
 				   size_t, key_serial_t);
 extern long keyctl_negate_key(key_serial_t, unsigned, key_serial_t);
diff --git a/security/keys/key.c b/security/keys/key.c
index 63513ffcf2e8..bca9d01c05fa 100644
--- a/security/keys/key.c
+++ b/security/keys/key.c
@@ -199,7 +199,7 @@ static inline void key_alloc_serial(struct key *key)
  * @uid: The owner of the new key.
  * @gid: The group ID for the new key's group permissions.
  * @cred: The credentials specifying UID namespace.
- * @perm: The permissions mask of the new key.
+ * @acl: The ACL to attach to the new key.
  * @flags: Flags specifying quota properties.
  * @restrict_link: Optional link restriction for new keyrings.
  *
@@ -227,7 +227,7 @@ static inline void key_alloc_serial(struct key *key)
  */
 struct key *key_alloc(struct key_type *type, const char *desc,
 		      kuid_t uid, kgid_t gid, const struct cred *cred,
-		      key_perm_t perm, unsigned long flags,
+		      struct key_acl *acl, unsigned long flags,
 		      struct key_restriction *restrict_link)
 {
 	struct key_user *user = NULL;
@@ -250,6 +250,9 @@ struct key *key_alloc(struct key_type *type, const char *desc,
 	desclen = strlen(desc);
 	quotalen = desclen + 1 + type->def_datalen;
 
+	if (!acl)
+		acl = &default_key_acl;
+
 	/* get hold of the key tracking for this user */
 	user = key_user_lookup(uid);
 	if (!user)
@@ -296,7 +299,8 @@ struct key *key_alloc(struct key_type *type, const char *desc,
 	key->datalen = type->def_datalen;
 	key->uid = uid;
 	key->gid = gid;
-	key->perm = perm;
+	refcount_inc(&acl->usage);
+	rcu_assign_pointer(key->acl, acl);
 	key->restrict_link = restrict_link;
 	key->last_used_at = ktime_get_real_seconds();
 
@@ -785,7 +789,7 @@ static inline key_ref_t __key_update(key_ref_t key_ref,
  * @description: The searchable description for the key.
  * @payload: The data to use to instantiate or update the key.
  * @plen: The length of @payload.
- * @perm: The permissions mask for a new key.
+ * @acl: The ACL to attach if a key is created.
  * @flags: The quota flags for a new key.
  *
  * Search the destination keyring for a key of the same description and if one
@@ -808,7 +812,7 @@ key_ref_t key_create_or_update(key_ref_t keyring_ref,
 			       const char *description,
 			       const void *payload,
 			       size_t plen,
-			       key_perm_t perm,
+			       struct key_acl *acl,
 			       unsigned long flags)
 {
 	struct keyring_index_key index_key = {
@@ -899,22 +903,9 @@ key_ref_t key_create_or_update(key_ref_t keyring_ref,
 			goto found_matching_key;
 	}
 
-	/* if the client doesn't provide, decide on the permissions we want */
-	if (perm == KEY_PERM_UNDEF) {
-		perm = KEY_POS_VIEW | KEY_POS_SEARCH | KEY_POS_LINK | KEY_POS_SETATTR;
-		perm |= KEY_USR_VIEW;
-
-		if (index_key.type->read)
-			perm |= KEY_POS_READ;
-
-		if (index_key.type == &key_type_keyring ||
-		    index_key.type->update)
-			perm |= KEY_POS_WRITE;
-	}
-
 	/* allocate a new key */
 	key = key_alloc(index_key.type, index_key.description,
-			cred->fsuid, cred->fsgid, cred, perm, flags, NULL);
+			cred->fsuid, cred->fsgid, cred, acl, flags, NULL);
 	if (IS_ERR(key)) {
 		key_ref = ERR_CAST(key);
 		goto error_link_end;
diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index a25799249b8a..2df896bfb8e4 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -120,8 +120,7 @@ SYSCALL_DEFINE5(add_key, const char __user *, _type,
 	/* create or update the requested key and add it to the target
 	 * keyring */
 	key_ref = key_create_or_update(keyring_ref, type, description,
-				       payload, plen, KEY_PERM_UNDEF,
-				       KEY_ALLOC_IN_QUOTA);
+				       payload, plen, NULL, KEY_ALLOC_IN_QUOTA);
 	if (!IS_ERR(key_ref)) {
 		ret = key_ref_to_ptr(key_ref)->serial;
 		key_ref_put(key_ref);
@@ -211,7 +210,8 @@ SYSCALL_DEFINE4(request_key, const char __user *, _type,
 
 	/* do the search */
 	key = request_key_and_link(ktype, description, NULL, callout_info,
-				   callout_len, NULL, key_ref_to_ptr(dest_ref),
+				   callout_len, NULL, NULL,
+				   key_ref_to_ptr(dest_ref),
 				   KEY_ALLOC_IN_QUOTA);
 	if (IS_ERR(key)) {
 		ret = PTR_ERR(key);
@@ -373,16 +373,10 @@ long keyctl_revoke_key(key_serial_t id)
 	struct key *key;
 	long ret;
 
-	key_ref = lookup_user_key(id, 0, KEY_NEED_WRITE);
+	key_ref = lookup_user_key(id, 0, KEY_NEED_REVOKE);
 	if (IS_ERR(key_ref)) {
 		ret = PTR_ERR(key_ref);
-		if (ret != -EACCES)
-			goto error;
-		key_ref = lookup_user_key(id, 0, KEY_NEED_SETATTR);
-		if (IS_ERR(key_ref)) {
-			ret = PTR_ERR(key_ref);
-			goto error;
-		}
+		goto error;
 	}
 
 	key = key_ref_to_ptr(key_ref);
@@ -416,7 +410,7 @@ long keyctl_invalidate_key(key_serial_t id)
 
 	kenter("%d", id);
 
-	key_ref = lookup_user_key(id, 0, KEY_NEED_SEARCH);
+	key_ref = lookup_user_key(id, 0, KEY_NEED_INVAL);
 	if (IS_ERR(key_ref)) {
 		ret = PTR_ERR(key_ref);
 
@@ -461,7 +455,7 @@ long keyctl_keyring_clear(key_serial_t ringid)
 	struct key *keyring;
 	long ret;
 
-	keyring_ref = lookup_user_key(ringid, KEY_LOOKUP_CREATE, KEY_NEED_WRITE);
+	keyring_ref = lookup_user_key(ringid, KEY_LOOKUP_CREATE, KEY_NEED_CLEAR);
 	if (IS_ERR(keyring_ref)) {
 		ret = PTR_ERR(keyring_ref);
 
@@ -639,6 +633,7 @@ long keyctl_describe_key(key_serial_t keyid,
 			 size_t buflen)
 {
 	struct key *key, *instkey;
+	unsigned int perm;
 	key_ref_t key_ref;
 	char *infobuf;
 	long ret;
@@ -668,6 +663,10 @@ long keyctl_describe_key(key_serial_t keyid,
 	key = key_ref_to_ptr(key_ref);
 	desclen = strlen(key->description);
 
+	rcu_read_lock();
+	perm = key_acl_to_perm(rcu_dereference(key->acl));
+	rcu_read_unlock();
+
 	/* calculate how much information we're going to return */
 	ret = -ENOMEM;
 	infobuf = kasprintf(GFP_KERNEL,
@@ -675,7 +674,7 @@ long keyctl_describe_key(key_serial_t keyid,
 			    key->type->name,
 			    from_kuid_munged(current_user_ns(), key->uid),
 			    from_kgid_munged(current_user_ns(), key->gid),
-			    key->perm);
+			    perm);
 	if (!infobuf)
 		goto error2;
 	infolen = strlen(infobuf);
@@ -892,7 +891,7 @@ long keyctl_chown_key(key_serial_t id, uid_t user, gid_t group)
 		goto error;
 
 	key_ref = lookup_user_key(id, KEY_LOOKUP_CREATE | KEY_LOOKUP_PARTIAL,
-				  KEY_NEED_SETATTR);
+				  KEY_NEED_SETSEC);
 	if (IS_ERR(key_ref)) {
 		ret = PTR_ERR(key_ref);
 		goto error;
@@ -988,18 +987,25 @@ long keyctl_chown_key(key_serial_t id, uid_t user, gid_t group)
  * the key need not be fully instantiated yet.  If the caller does not have
  * sysadmin capability, it may only change the permission on keys that it owns.
  */
-long keyctl_setperm_key(key_serial_t id, key_perm_t perm)
+long keyctl_setperm_key(key_serial_t id, unsigned int perm)
 {
+	struct key_acl *acl;
 	struct key *key;
 	key_ref_t key_ref;
 	long ret;
+	int nr, i, j;
 
-	ret = -EINVAL;
 	if (perm & ~(KEY_POS_ALL | KEY_USR_ALL | KEY_GRP_ALL | KEY_OTH_ALL))
-		goto error;
+		return -EINVAL;
+
+	nr = 0;
+	if (perm & KEY_POS_ALL) nr++;
+	if (perm & KEY_USR_ALL) nr++;
+	if (perm & KEY_GRP_ALL) nr++;
+	if (perm & KEY_OTH_ALL) nr++;
 
 	key_ref = lookup_user_key(id, KEY_LOOKUP_CREATE | KEY_LOOKUP_PARTIAL,
-				  KEY_NEED_SETATTR);
+				  KEY_NEED_SETSEC);
 	if (IS_ERR(key_ref)) {
 		ret = PTR_ERR(key_ref);
 		goto error;
@@ -1007,18 +1013,45 @@ long keyctl_setperm_key(key_serial_t id, key_perm_t perm)
 
 	key = key_ref_to_ptr(key_ref);
 
-	/* make the changes with the locks held to prevent chown/chmod races */
-	ret = -EACCES;
-	down_write(&key->sem);
+	ret = -EOPNOTSUPP;
+	if (test_bit(KEY_FLAG_HAS_ACL, &key->flags))
+		goto error_key;
 
-	/* if we're not the sysadmin, we can only change a key that we own */
-	if (capable(CAP_SYS_ADMIN) || uid_eq(key->uid, current_fsuid())) {
-		key->perm = perm;
-		notify_key(key, NOTIFY_KEY_SETATTR, 0);
-		ret = 0;
+	ret = -ENOMEM;
+	acl = kzalloc(struct_size(acl, aces, nr), GFP_KERNEL);
+	if (!acl)
+		goto error_key;
+
+	refcount_set(&acl->usage, 1);
+	acl->nr_ace = nr;
+	j = 0;
+	for (i = 0; i < 4; i++) {
+		struct key_ace *ace = &acl->aces[j];
+		unsigned int subset = (perm >> (i * 8)) & KEY_OTH_ALL;
+
+		if (!subset)
+			continue;
+		ace->type = KEY_ACE_SUBJ_STANDARD;
+		ace->subject_id = KEY_ACE_EVERYONE + i;
+		ace->perm = subset;
+		if (subset & (KEY_OTH_WRITE | KEY_OTH_SETATTR))
+			ace->perm |= KEY_ACE_REVOKE;
+		if (subset & KEY_OTH_SEARCH)
+			ace->perm |= KEY_ACE_INVAL;
+		if (key->type == &key_type_keyring) {
+			if (subset & KEY_OTH_SEARCH)
+				ace->perm |= KEY_ACE_JOIN;
+			if (subset & KEY_OTH_WRITE)
+				ace->perm |= KEY_ACE_CLEAR;
+		}
+		j++;
 	}
 
+	/* make the changes with the locks held to prevent chown/chmod races */
+	down_write(&key->sem);
+	ret = key_set_acl(key, acl);
 	up_write(&key->sem);
+error_key:
 	key_put(key);
 error:
 	return ret;
@@ -1383,7 +1416,7 @@ long keyctl_set_timeout(key_serial_t id, unsigned timeout)
 	long ret;
 
 	key_ref = lookup_user_key(id, KEY_LOOKUP_CREATE | KEY_LOOKUP_PARTIAL,
-				  KEY_NEED_SETATTR);
+				  KEY_NEED_SETSEC);
 	if (IS_ERR(key_ref)) {
 		/* setting the timeout on a key under construction is permitted
 		 * if we have the authorisation token handy */
@@ -1654,7 +1687,7 @@ long keyctl_restrict_keyring(key_serial_t id, const char __user *_type,
 	char *restriction = NULL;
 	long ret;
 
-	key_ref = lookup_user_key(id, 0, KEY_NEED_SETATTR);
+	key_ref = lookup_user_key(id, 0, KEY_NEED_SETSEC);
 	if (IS_ERR(key_ref))
 		return PTR_ERR(key_ref);
 
@@ -1819,7 +1852,7 @@ SYSCALL_DEFINE5(keyctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 	case KEYCTL_SETPERM:
 		return keyctl_setperm_key((key_serial_t) arg2,
-					  (key_perm_t) arg3);
+					  (unsigned int)arg3);
 
 	case KEYCTL_INSTANTIATE:
 		return keyctl_instantiate_key((key_serial_t) arg2,
diff --git a/security/keys/keyring.c b/security/keys/keyring.c
index 14df79814ea0..64f590632891 100644
--- a/security/keys/keyring.c
+++ b/security/keys/keyring.c
@@ -518,11 +518,19 @@ static long keyring_read(const struct key *keyring,
 	return ret;
 }
 
-/*
- * Allocate a keyring and link into the destination keyring.
+/**
+ * keyring_alloc - Allocate a keyring and link into the destination
+ * @description: The key description to allow the key to be searched out.
+ * @uid: The owner of the new key.
+ * @gid: The group ID for the new key's group permissions.
+ * @cred: The credentials specifying UID namespace.
+ * @acl: The ACL to attach to the new key.
+ * @flags: Flags specifying quota properties.
+ * @restrict_link: Optional link restriction for new keyrings.
+ * @dest: Destination keyring.
  */
 struct key *keyring_alloc(const char *description, kuid_t uid, kgid_t gid,
-			  const struct cred *cred, key_perm_t perm,
+			  const struct cred *cred, struct key_acl *acl,
 			  unsigned long flags,
 			  struct key_restriction *restrict_link,
 			  struct key *dest)
@@ -531,7 +539,7 @@ struct key *keyring_alloc(const char *description, kuid_t uid, kgid_t gid,
 	int ret;
 
 	keyring = key_alloc(&key_type_keyring, description,
-			    uid, gid, cred, perm, flags, restrict_link);
+			    uid, gid, cred, acl, flags, restrict_link);
 	if (!IS_ERR(keyring)) {
 		ret = key_instantiate_and_link(keyring, NULL, 0, dest, NULL);
 		if (ret < 0) {
@@ -1125,10 +1133,11 @@ key_ref_t find_key_to_update(key_ref_t keyring_ref,
 /*
  * Find a keyring with the specified name.
  *
- * Only keyrings that have nonzero refcount, are not revoked, and are owned by a
- * user in the current user namespace are considered.  If @uid_keyring is %true,
- * the keyring additionally must have been allocated as a user or user session
- * keyring; otherwise, it must grant Search permission directly to the caller.
+ * Only keyrings that have nonzero refcount, are not revoked, and are owned by
+ * a user in the current user namespace are considered.  If @uid_keyring is
+ * %true, the keyring additionally must have been allocated as a user or user
+ * session keyring; otherwise, it must grant JOIN permission directly to the
+ * caller (ie. not through possession).
  *
  * Returns a pointer to the keyring with the keyring's refcount having being
  * incremented on success.  -ENOKEY is returned if a key could not be found.
@@ -1162,7 +1171,7 @@ struct key *find_keyring_by_name(const char *name, bool uid_keyring)
 				continue;
 		} else {
 			if (key_permission(make_key_ref(keyring, 0),
-					   KEY_NEED_SEARCH) < 0)
+					   KEY_NEED_JOIN) < 0)
 				continue;
 		}
 
diff --git a/security/keys/permission.c b/security/keys/permission.c
index 06df9d5e7572..8dc6e80f6fd0 100644
--- a/security/keys/permission.c
+++ b/security/keys/permission.c
@@ -11,13 +11,62 @@
 
 #include <linux/export.h>
 #include <linux/security.h>
+#include <linux/user_namespace.h>
+#include <linux/uaccess.h>
 #include "internal.h"
 
+struct key_acl default_key_acl = {
+	.usage	= REFCOUNT_INIT(1),
+	.nr_ace	= 2,
+	.possessor_viewable = true,
+	.aces = {
+		KEY_POSSESSOR_ACE(KEY_ACE__PERMS & ~KEY_ACE_JOIN),
+		KEY_OWNER_ACE(KEY_ACE_VIEW),
+	}
+};
+
+struct key_acl joinable_keyring_acl = {
+	.usage	= REFCOUNT_INIT(1),
+	.nr_ace	= 2,
+	.possessor_viewable = true,
+	.aces	= {
+		KEY_POSSESSOR_ACE(KEY_ACE__PERMS & ~KEY_ACE_JOIN),
+		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ | KEY_ACE_LINK | KEY_ACE_JOIN),
+	}
+};
+
+struct key_acl internal_key_acl = {
+	.usage	= REFCOUNT_INIT(1),
+	.nr_ace	= 2,
+	.aces = {
+		KEY_POSSESSOR_ACE(KEY_ACE_SEARCH),
+		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ | KEY_ACE_SEARCH),
+	}
+};
+
+struct key_acl internal_keyring_acl = {
+	.usage	= REFCOUNT_INIT(1),
+	.nr_ace	= 2,
+	.aces = {
+		KEY_POSSESSOR_ACE(KEY_ACE_SEARCH),
+		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ | KEY_ACE_SEARCH),
+	}
+};
+
+struct key_acl internal_writable_keyring_acl = {
+	.usage	= REFCOUNT_INIT(1),
+	.nr_ace	= 2,
+	.aces = {
+		KEY_POSSESSOR_ACE(KEY_ACE_SEARCH | KEY_ACE_WRITE),
+		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ | KEY_ACE_WRITE | KEY_ACE_SEARCH),
+	}
+};
+
 /**
  * key_task_permission - Check a key can be used
  * @key_ref: The key to check.
  * @cred: The credentials to use.
- * @perm: The permissions to check for.
+ * @desired_perm: The permission to check for.
  *
  * Check to see whether permission is granted to use a key in the desired way,
  * but permit the security modules to override.
@@ -28,53 +77,73 @@
  * permissions bits or the LSM check.
  */
 int key_task_permission(const key_ref_t key_ref, const struct cred *cred,
-			unsigned perm)
+			unsigned int desired_perm)
 {
-	struct key *key;
-	key_perm_t kperm;
-	int ret;
+	const struct key_acl *acl;
+	const struct key *key;
+	unsigned int allow = 0;
+	int i;
+
+	BUILD_BUG_ON(KEY_NEED_VIEW	!= KEY_ACE_VIEW		||
+		     KEY_NEED_READ	!= KEY_ACE_READ		||
+		     KEY_NEED_WRITE	!= KEY_ACE_WRITE	||
+		     KEY_NEED_SEARCH	!= KEY_ACE_SEARCH	||
+		     KEY_NEED_LINK	!= KEY_ACE_LINK		||
+		     KEY_NEED_SETSEC	!= KEY_ACE_SET_SECURITY	||
+		     KEY_NEED_INVAL	!= KEY_ACE_INVAL	||
+		     KEY_NEED_REVOKE	!= KEY_ACE_REVOKE	||
+		     KEY_NEED_JOIN	!= KEY_ACE_JOIN		||
+		     KEY_NEED_CLEAR	!= KEY_ACE_CLEAR);
 
 	key = key_ref_to_ptr(key_ref);
 
-	/* use the second 8-bits of permissions for keys the caller owns */
-	if (uid_eq(key->uid, cred->fsuid)) {
-		kperm = key->perm >> 16;
-		goto use_these_perms;
-	}
+	rcu_read_lock();
 
-	/* use the third 8-bits of permissions for keys the caller has a group
-	 * membership in common with */
-	if (gid_valid(key->gid) && key->perm & KEY_GRP_ALL) {
-		if (gid_eq(key->gid, cred->fsgid)) {
-			kperm = key->perm >> 8;
-			goto use_these_perms;
-		}
+	acl = rcu_dereference(key->acl);
+	if (!acl || acl->nr_ace == 0)
+		goto no_access_rcu;
+
+	for (i = 0; i < acl->nr_ace; i++) {
+		const struct key_ace *ace = &acl->aces[i];
 
-		ret = groups_search(cred->group_info, key->gid);
-		if (ret) {
-			kperm = key->perm >> 8;
-			goto use_these_perms;
+		switch (ace->type) {
+		case KEY_ACE_SUBJ_STANDARD:
+			switch (ace->subject_id) {
+			case KEY_ACE_POSSESSOR:
+				if (is_key_possessed(key_ref))
+					allow |= ace->perm;
+				break;
+			case KEY_ACE_OWNER:
+				if (uid_eq(key->uid, cred->fsuid))
+					allow |= ace->perm;
+				break;
+			case KEY_ACE_GROUP:
+				if (gid_valid(key->gid)) {
+					if (gid_eq(key->gid, cred->fsgid))
+						allow |= ace->perm;
+					else if (groups_search(cred->group_info, key->gid))
+						allow |= ace->perm;
+				}
+				break;
+			case KEY_ACE_EVERYONE:
+				allow |= ace->perm;
+				break;
+			}
+			break;
 		}
 	}
 
-	/* otherwise use the least-significant 8-bits */
-	kperm = key->perm;
-
-use_these_perms:
+	rcu_read_unlock();
 
-	/* use the top 8-bits of permissions for keys the caller possesses
-	 * - possessor permissions are additive with other permissions
-	 */
-	if (is_key_possessed(key_ref))
-		kperm |= key->perm >> 24;
+	if (!(allow & desired_perm))
+		goto no_access;
 
-	kperm = kperm & perm & KEY_NEED_ALL;
+	return security_key_permission(key_ref, cred, desired_perm);
 
-	if (kperm != perm)
-		return -EACCES;
-
-	/* let LSM be the final arbiter */
-	return security_key_permission(key_ref, cred, perm);
+no_access_rcu:
+	rcu_read_unlock();
+no_access:
+	return -EACCES;
 }
 EXPORT_SYMBOL(key_task_permission);
 
@@ -108,3 +177,100 @@ int key_validate(const struct key *key)
 	return 0;
 }
 EXPORT_SYMBOL(key_validate);
+
+/*
+ * Roughly render an ACL to an old-style permissions mask.  We cannot
+ * accurately render the ACL, particularly if it has ACEs that represent
+ * subjects outside of { poss, user, group, other }.
+ */
+unsigned int key_acl_to_perm(const struct key_acl *acl)
+{
+	unsigned int perm = 0, tperm;
+	int i;
+
+	BUILD_BUG_ON(KEY_OTH_VIEW	!= KEY_ACE_VIEW		||
+		     KEY_OTH_READ	!= KEY_ACE_READ		||
+		     KEY_OTH_WRITE	!= KEY_ACE_WRITE	||
+		     KEY_OTH_SEARCH	!= KEY_ACE_SEARCH	||
+		     KEY_OTH_LINK	!= KEY_ACE_LINK		||
+		     KEY_OTH_SETATTR	!= KEY_ACE_SET_SECURITY);
+
+	if (!acl || acl->nr_ace == 0)
+		return 0;
+
+	for (i = 0; i < acl->nr_ace; i++) {
+		const struct key_ace *ace = &acl->aces[i];
+
+		switch (ace->type) {
+		case KEY_ACE_SUBJ_STANDARD:
+			tperm = ace->perm & KEY_OTH_ALL;
+
+			/* Invalidation and joining were allowed by SEARCH */
+			if (ace->perm & (KEY_ACE_INVAL | KEY_ACE_JOIN))
+				tperm |= KEY_OTH_SEARCH;
+
+			/* Revocation was allowed by either SETATTR or WRITE */
+			if ((ace->perm & KEY_ACE_REVOKE) && !(tperm & KEY_OTH_SETATTR))
+				tperm |= KEY_OTH_WRITE;
+
+			/* Clearing was allowed by WRITE */
+			if (ace->perm & KEY_ACE_CLEAR)
+				tperm |= KEY_OTH_WRITE;
+
+			switch (ace->subject_id) {
+			case KEY_ACE_POSSESSOR:
+				perm |= tperm << 24;
+				break;
+			case KEY_ACE_OWNER:
+				perm |= tperm << 16;
+				break;
+			case KEY_ACE_GROUP:
+				perm |= tperm << 8;
+				break;
+			case KEY_ACE_EVERYONE:
+				perm |= tperm << 0;
+				break;
+			}
+		}
+	}
+
+	return perm;
+}
+
+/*
+ * Destroy a key's ACL.
+ */
+void key_put_acl(struct key_acl *acl)
+{
+	if (acl && refcount_dec_and_test(&acl->usage))
+		kfree_rcu(acl, rcu);
+}
+
+/*
+ * Try to set the ACL.  This either attaches or discards the proposed ACL.
+ */
+long key_set_acl(struct key *key, struct key_acl *acl)
+{
+	int i;
+
+	/* If we're not the sysadmin, we can only change a key that we own. */
+	if (!capable(CAP_SYS_ADMIN) && !uid_eq(key->uid, current_fsuid())) {
+		key_put_acl(acl);
+		return -EACCES;
+	}
+
+	for (i = 0; i < acl->nr_ace; i++) {
+		const struct key_ace *ace = &acl->aces[i];
+		if (ace->type == KEY_ACE_SUBJ_STANDARD &&
+		    ace->subject_id == KEY_ACE_POSSESSOR) {
+			if (ace->perm & KEY_ACE_VIEW)
+				acl->possessor_viewable = true;
+			break;
+		}
+	}
+
+	rcu_swap_protected(key->acl, acl, lockdep_is_held(&key->sem));
+	notify_key(key, NOTIFY_KEY_SETATTR, 0);
+	key_put_acl(acl);
+	return 0;
+}
diff --git a/security/keys/persistent.c b/security/keys/persistent.c
index c9fbe63adc58..0a115cc543df 100644
--- a/security/keys/persistent.c
+++ b/security/keys/persistent.c
@@ -16,6 +16,27 @@
 
 unsigned persistent_keyring_expiry = 3 * 24 * 3600; /* Expire after 3 days of non-use */
 
+static struct key_acl persistent_register_keyring_acl = {
+	.usage	= REFCOUNT_INIT(1),
+	.nr_ace	= 2,
+	.aces = {
+		KEY_POSSESSOR_ACE(KEY_ACE_SEARCH | KEY_ACE_WRITE),
+		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ),
+	}
+};
+
+static struct key_acl persistent_keyring_acl = {
+	.usage	= REFCOUNT_INIT(1),
+	.nr_ace	= 2,
+	.possessor_viewable = true,
+	.aces = {
+		KEY_POSSESSOR_ACE(KEY_ACE_VIEW | KEY_ACE_READ | KEY_ACE_WRITE |
+				  KEY_ACE_SEARCH | KEY_ACE_LINK |
+				  KEY_ACE_CLEAR | KEY_ACE_INVAL),
+		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ),
+	}
+};
+
 /*
  * Create the persistent keyring register for the current user namespace.
  *
@@ -26,8 +47,7 @@ static int key_create_persistent_register(struct user_namespace *ns)
 	struct key *reg = keyring_alloc(".persistent_register",
 					KUIDT_INIT(0), KGIDT_INIT(0),
 					current_cred(),
-					((KEY_POS_ALL & ~KEY_POS_SETATTR) |
-					 KEY_USR_VIEW | KEY_USR_READ),
+					&persistent_register_keyring_acl,
 					KEY_ALLOC_NOT_IN_QUOTA, NULL, NULL);
 	if (IS_ERR(reg))
 		return PTR_ERR(reg);
@@ -60,8 +80,7 @@ static key_ref_t key_create_persistent(struct user_namespace *ns, kuid_t uid,
 
 	persistent = keyring_alloc(index_key->description,
 				   uid, INVALID_GID, current_cred(),
-				   ((KEY_POS_ALL & ~KEY_POS_SETATTR) |
-				    KEY_USR_VIEW | KEY_USR_READ),
+				   &persistent_keyring_acl,
 				   KEY_ALLOC_NOT_IN_QUOTA, NULL,
 				   ns->persistent_keyring_register);
 	if (IS_ERR(persistent))
diff --git a/security/keys/proc.c b/security/keys/proc.c
index d2b802072693..d697a2e95217 100644
--- a/security/keys/proc.c
+++ b/security/keys/proc.c
@@ -154,6 +154,7 @@ static void proc_keys_stop(struct seq_file *p, void *v)
 
 static int proc_keys_show(struct seq_file *m, void *v)
 {
+	const struct key_acl *acl;
 	struct rb_node *_p = v;
 	struct key *key = rb_entry(_p, struct key, serial_node);
 	unsigned long flags;
@@ -161,6 +162,7 @@ static int proc_keys_show(struct seq_file *m, void *v)
 	time64_t now, expiry;
 	char xbuf[16];
 	short state;
+	bool check_pos;
 	u64 timo;
 	int rc;
 
@@ -174,12 +176,16 @@ static int proc_keys_show(struct seq_file *m, void *v)
 		.flags			= KEYRING_SEARCH_NO_STATE_CHECK,
 	};
 
-	key_ref = make_key_ref(key, 0);
+	rcu_read_lock();
+
+	acl = rcu_dereference(key->acl);
+	check_pos = acl->possessor_viewable;
 
 	/* determine if the key is possessed by this process (a test we can
 	 * skip if the key does not indicate the possessor can view it
 	 */
-	if (key->perm & KEY_POS_VIEW) {
+	key_ref = make_key_ref(key, 0);
+	if (check_pos) {
 		skey_ref = search_my_process_keyrings(&ctx);
 		if (!IS_ERR(skey_ref)) {
 			key_ref_put(skey_ref);
@@ -190,12 +196,10 @@ static int proc_keys_show(struct seq_file *m, void *v)
 	/* check whether the current task is allowed to view the key */
 	rc = key_task_permission(key_ref, ctx.cred, KEY_NEED_VIEW);
 	if (rc < 0)
-		return 0;
+		goto out;
 
 	now = ktime_get_real_seconds();
 
-	rcu_read_lock();
-
 	/* come up with a suitable timeout value */
 	expiry = READ_ONCE(key->expiry);
 	if (expiry == 0) {
@@ -234,7 +238,7 @@ static int proc_keys_show(struct seq_file *m, void *v)
 		   showflag(flags, 'i', KEY_FLAG_INVALIDATED),
 		   refcount_read(&key->usage),
 		   xbuf,
-		   key->perm,
+		   key_acl_to_perm(acl),
 		   from_kuid_munged(seq_user_ns(m), key->uid),
 		   from_kgid_munged(seq_user_ns(m), key->gid),
 		   key->type->name);
@@ -245,6 +249,7 @@ static int proc_keys_show(struct seq_file *m, void *v)
 		key->type->describe(key, m);
 	seq_putc(m, '\n');
 
+out:
 	rcu_read_unlock();
 	return 0;
 }
diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c
index 39d3cbac920c..0a231ede4d2b 100644
--- a/security/keys/process_keys.c
+++ b/security/keys/process_keys.c
@@ -39,6 +39,37 @@ struct key_user root_key_user = {
 	.uid		= GLOBAL_ROOT_UID,
 };
 
+static struct key_acl user_keyring_acl = {
+	.usage	= REFCOUNT_INIT(1),
+	.nr_ace	= 2,
+	.possessor_viewable = true,
+	.aces = {
+		KEY_POSSESSOR_ACE(KEY_ACE_VIEW | KEY_ACE_READ | KEY_ACE_WRITE |
+				  KEY_ACE_SEARCH | KEY_ACE_LINK),
+		KEY_OWNER_ACE(KEY_ACE__PERMS & ~(KEY_ACE_JOIN | KEY_ACE_SET_SECURITY)),
+	}
+};
+
+static struct key_acl session_keyring_acl = {
+	.usage	= REFCOUNT_INIT(1),
+	.nr_ace	= 2,
+	.possessor_viewable = true,
+	.aces = {
+		KEY_POSSESSOR_ACE(KEY_ACE__PERMS & ~KEY_ACE_JOIN),
+		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ),
+	}
+};
+
+static struct key_acl thread_and_process_keyring_acl = {
+	.usage	= REFCOUNT_INIT(1),
+	.nr_ace	= 2,
+	.possessor_viewable = true,
+	.aces = {
+		KEY_POSSESSOR_ACE(KEY_ACE__PERMS & ~(KEY_ACE_JOIN | KEY_ACE_SET_SECURITY)),
+		KEY_OWNER_ACE(KEY_ACE_VIEW),
+	}
+};
+
 /*
  * Install the user and user session keyrings for the current process's UID.
  */
@@ -47,12 +78,10 @@ int install_user_keyrings(void)
 	struct user_struct *user;
 	const struct cred *cred;
 	struct key *uid_keyring, *session_keyring;
-	key_perm_t user_keyring_perm;
 	char buf[20];
 	int ret;
 	uid_t uid;
 
-	user_keyring_perm = (KEY_POS_ALL & ~KEY_POS_SETATTR) | KEY_USR_ALL;
 	cred = current_cred();
 	user = cred->user;
 	uid = from_kuid(cred->user_ns, user->uid);
@@ -77,9 +106,9 @@ int install_user_keyrings(void)
 		uid_keyring = find_keyring_by_name(buf, true);
 		if (IS_ERR(uid_keyring)) {
 			uid_keyring = keyring_alloc(buf, user->uid, INVALID_GID,
-						    cred, user_keyring_perm,
+						    cred, &user_keyring_acl,
 						    KEY_ALLOC_UID_KEYRING |
-							KEY_ALLOC_IN_QUOTA,
+						    KEY_ALLOC_IN_QUOTA,
 						    NULL, NULL);
 			if (IS_ERR(uid_keyring)) {
 				ret = PTR_ERR(uid_keyring);
@@ -95,9 +124,9 @@ int install_user_keyrings(void)
 		if (IS_ERR(session_keyring)) {
 			session_keyring =
 				keyring_alloc(buf, user->uid, INVALID_GID,
-					      cred, user_keyring_perm,
+					      cred, &user_keyring_acl,
 					      KEY_ALLOC_UID_KEYRING |
-						  KEY_ALLOC_IN_QUOTA,
+					      KEY_ALLOC_IN_QUOTA,
 					      NULL, NULL);
 			if (IS_ERR(session_keyring)) {
 				ret = PTR_ERR(session_keyring);
@@ -144,7 +173,7 @@ int install_thread_keyring_to_cred(struct cred *new)
 		return 0;
 
 	keyring = keyring_alloc("_tid", new->uid, new->gid, new,
-				KEY_POS_ALL | KEY_USR_VIEW,
+				&thread_and_process_keyring_acl,
 				KEY_ALLOC_QUOTA_OVERRUN,
 				NULL, NULL);
 	if (IS_ERR(keyring))
@@ -191,7 +220,7 @@ int install_process_keyring_to_cred(struct cred *new)
 		return 0;
 
 	keyring = keyring_alloc("_pid", new->uid, new->gid, new,
-				KEY_POS_ALL | KEY_USR_VIEW,
+				&thread_and_process_keyring_acl,
 				KEY_ALLOC_QUOTA_OVERRUN,
 				NULL, NULL);
 	if (IS_ERR(keyring))
@@ -245,8 +274,7 @@ int install_session_keyring_to_cred(struct cred *cred, struct key *keyring)
 			flags = KEY_ALLOC_IN_QUOTA;
 
 		keyring = keyring_alloc("_ses", cred->uid, cred->gid, cred,
-					KEY_POS_ALL | KEY_USR_VIEW | KEY_USR_READ,
-					flags, NULL, NULL);
+					&session_keyring_acl, flags, NULL, NULL);
 		if (IS_ERR(keyring))
 			return PTR_ERR(keyring);
 	} else {
@@ -554,7 +582,7 @@ bool lookup_user_key_possessed(const struct key *key,
  * returned key reference.
  */
 key_ref_t lookup_user_key(key_serial_t id, unsigned long lflags,
-			  key_perm_t perm)
+			  unsigned int desired_perm)
 {
 	struct keyring_search_context ctx = {
 		.match_data.cmp		= lookup_user_key_possessed,
@@ -740,12 +768,12 @@ key_ref_t lookup_user_key(key_serial_t id, unsigned long lflags,
 		case -ERESTARTSYS:
 			goto invalid_key;
 		default:
-			if (perm)
+			if (desired_perm)
 				goto invalid_key;
 		case 0:
 			break;
 		}
-	} else if (perm) {
+	} else if (desired_perm) {
 		ret = key_validate(key);
 		if (ret < 0)
 			goto invalid_key;
@@ -757,9 +785,11 @@ key_ref_t lookup_user_key(key_serial_t id, unsigned long lflags,
 		goto invalid_key;
 
 	/* check the permissions */
-	ret = key_task_permission(key_ref, ctx.cred, perm);
-	if (ret < 0)
-		goto invalid_key;
+	if (desired_perm) {
+		ret = key_task_permission(key_ref, ctx.cred, desired_perm);
+		if (ret < 0)
+			goto invalid_key;
+	}
 
 	key->last_used_at = ktime_get_real_seconds();
 
@@ -824,13 +854,13 @@ long join_session_keyring(const char *name)
 	if (PTR_ERR(keyring) == -ENOKEY) {
 		/* not found - try and create a new one */
 		keyring = keyring_alloc(
-			name, old->uid, old->gid, old,
-			KEY_POS_ALL | KEY_USR_VIEW | KEY_USR_READ | KEY_USR_LINK,
+			name, old->uid, old->gid, old, &joinable_keyring_acl,
 			KEY_ALLOC_IN_QUOTA, NULL, NULL);
 		if (IS_ERR(keyring)) {
 			ret = PTR_ERR(keyring);
 			goto error2;
 		}
+		goto no_perm_test;
 	} else if (IS_ERR(keyring)) {
 		ret = PTR_ERR(keyring);
 		goto error2;
@@ -839,6 +869,12 @@ long join_session_keyring(const char *name)
 		goto error3;
 	}
 
+	ret = key_task_permission(make_key_ref(keyring, false), old,
+				  KEY_NEED_JOIN);
+	if (ret < 0)
+		goto error3;
+
+no_perm_test:
 	/* we've got a keyring - now to install it */
 	ret = install_session_keyring_to_cred(new, keyring);
 	if (ret < 0)
diff --git a/security/keys/request_key.c b/security/keys/request_key.c
index 10244b6fbf5d..0d609c1efece 100644
--- a/security/keys/request_key.c
+++ b/security/keys/request_key.c
@@ -115,8 +115,7 @@ static int call_sbin_request_key(struct key *authkey)
 
 	cred = get_current_cred();
 	keyring = keyring_alloc(desc, cred->fsuid, cred->fsgid, cred,
-				KEY_POS_ALL | KEY_USR_VIEW | KEY_USR_READ,
-				KEY_ALLOC_QUOTA_OVERRUN, NULL, NULL);
+				NULL, KEY_ALLOC_QUOTA_OVERRUN, NULL, NULL);
 	put_cred(cred);
 	if (IS_ERR(keyring)) {
 		ret = PTR_ERR(keyring);
@@ -344,11 +343,11 @@ static int construct_alloc_key(struct keyring_search_context *ctx,
 			       struct key *dest_keyring,
 			       unsigned long flags,
 			       struct key_user *user,
+			       struct key_acl *acl,
 			       struct key **_key)
 {
 	struct assoc_array_edit *edit;
 	struct key *key;
-	key_perm_t perm;
 	key_ref_t key_ref;
 	int ret;
 
@@ -358,17 +357,9 @@ static int construct_alloc_key(struct keyring_search_context *ctx,
 	*_key = NULL;
 	mutex_lock(&user->cons_lock);
 
-	perm = KEY_POS_VIEW | KEY_POS_SEARCH | KEY_POS_LINK | KEY_POS_SETATTR;
-	perm |= KEY_USR_VIEW;
-	if (ctx->index_key.type->read)
-		perm |= KEY_POS_READ;
-	if (ctx->index_key.type == &key_type_keyring ||
-	    ctx->index_key.type->update)
-		perm |= KEY_POS_WRITE;
-
 	key = key_alloc(ctx->index_key.type, ctx->index_key.description,
 			ctx->cred->fsuid, ctx->cred->fsgid, ctx->cred,
-			perm, flags, NULL);
+			acl, flags, NULL);
 	if (IS_ERR(key))
 		goto alloc_failed;
 
@@ -444,6 +435,7 @@ static struct key *construct_key_and_link(struct keyring_search_context *ctx,
 					  const char *callout_info,
 					  size_t callout_len,
 					  void *aux,
+					  struct key_acl *acl,
 					  struct key *dest_keyring,
 					  unsigned long flags)
 {
@@ -466,7 +458,7 @@ static struct key *construct_key_and_link(struct keyring_search_context *ctx,
 		goto error_put_dest_keyring;
 	}
 
-	ret = construct_alloc_key(ctx, dest_keyring, flags, user, &key);
+	ret = construct_alloc_key(ctx, dest_keyring, flags, user, acl, &key);
 	key_user_put(user);
 
 	if (ret == 0) {
@@ -504,6 +496,7 @@ static struct key *construct_key_and_link(struct keyring_search_context *ctx,
  * @callout_info: The data to pass to the instantiation upcall (or NULL).
  * @callout_len: The length of callout_info.
  * @aux: Auxiliary data for the upcall.
+ * @acl: The ACL to attach if a new key is created.
  * @dest_keyring: Where to cache the key.
  * @flags: Flags to key_alloc().
  *
@@ -531,6 +524,7 @@ struct key *request_key_and_link(struct key_type *type,
 				 const void *callout_info,
 				 size_t callout_len,
 				 void *aux,
+				 struct key_acl *acl,
 				 struct key *dest_keyring,
 				 unsigned long flags)
 {
@@ -593,7 +587,7 @@ struct key *request_key_and_link(struct key_type *type,
 			goto error_free;
 
 		key = construct_key_and_link(&ctx, callout_info, callout_len,
-					     aux, dest_keyring, flags);
+					     aux, acl, dest_keyring, flags);
 	}
 
 error_free:
@@ -635,6 +629,7 @@ EXPORT_SYMBOL(wait_for_key_construction);
  * @type: Type of key.
  * @description: The searchable description of the key.
  * @callout_info: The data to pass to the instantiation upcall (or NULL).
+ * @acl: The ACL to attach if a new key is created.
  *
  * As for request_key_and_link() except that it does not add the returned key
  * to a keyring if found, new keys are always allocated in the user's quota,
@@ -646,7 +641,8 @@ EXPORT_SYMBOL(wait_for_key_construction);
  */
 struct key *request_key(struct key_type *type,
 			const char *description,
-			const char *callout_info)
+			const char *callout_info,
+			struct key_acl *acl)
 {
 	struct key *key;
 	size_t callout_len = 0;
@@ -656,7 +652,7 @@ struct key *request_key(struct key_type *type,
 		callout_len = strlen(callout_info);
 	key = request_key_and_link(type, description, NULL,
 				   callout_info, callout_len,
-				   NULL, NULL, KEY_ALLOC_IN_QUOTA);
+				   NULL, acl, NULL, KEY_ALLOC_IN_QUOTA);
 	if (!IS_ERR(key)) {
 		ret = wait_for_key_construction(key, false);
 		if (ret < 0) {
@@ -675,6 +671,7 @@ EXPORT_SYMBOL(request_key);
  * @callout_info: The data to pass to the instantiation upcall (or NULL).
  * @callout_len: The length of callout_info.
  * @aux: Auxiliary data for the upcall.
+ * @acl: The ACL to attach if a new key is created.
  *
  * As for request_key_and_link() except that it does not add the returned key
  * to a keyring if found and new keys are always allocated in the user's quota.
@@ -686,14 +683,15 @@ struct key *request_key_with_auxdata(struct key_type *type,
 				     const char *description,
 				     const void *callout_info,
 				     size_t callout_len,
-				     void *aux)
+				     void *aux,
+				     struct key_acl *acl)
 {
 	struct key *key;
 	int ret;
 
 	key = request_key_and_link(type, description, NULL,
 				   callout_info, callout_len,
-				   aux, NULL, KEY_ALLOC_IN_QUOTA);
+				   aux, acl, NULL, KEY_ALLOC_IN_QUOTA);
 	if (!IS_ERR(key)) {
 		ret = wait_for_key_construction(key, false);
 		if (ret < 0) {
@@ -711,6 +709,7 @@ EXPORT_SYMBOL(request_key_with_auxdata);
  * @description: The searchable description of the key.
  * @net: The network namespace that is the key's domain of operation.
  * @callout_info: The data to pass to the instantiation upcall (or NULL).
+ * @acl: The ACL to attach if a new key is created.
  *
  * As for request_key() except that it does not add the returned key to a
  * keyring if found, new keys are always allocated in the user's quota, the
@@ -723,7 +722,8 @@ EXPORT_SYMBOL(request_key_with_auxdata);
 struct key *request_key_net(struct key_type *type,
 			    const char *description,
 			    struct net *net,
-			    const char *callout_info)
+			    const char *callout_info,
+			    struct key_acl *acl)
 {
 	struct key *key;
 	size_t callout_len = 0;
@@ -733,7 +733,7 @@ struct key *request_key_net(struct key_type *type,
 		callout_len = strlen(callout_info);
 	key = request_key_and_link(type, description, net->key_domain,
 				   callout_info, callout_len,
-				   NULL, NULL, KEY_ALLOC_IN_QUOTA);
+				   NULL, acl, NULL, KEY_ALLOC_IN_QUOTA);
 	if (!IS_ERR(key)) {
 		ret = wait_for_key_construction(key, false);
 		if (ret < 0) {
diff --git a/security/keys/request_key_auth.c b/security/keys/request_key_auth.c
index 726555a0639c..790c809844ac 100644
--- a/security/keys/request_key_auth.c
+++ b/security/keys/request_key_auth.c
@@ -28,6 +28,17 @@ static void request_key_auth_revoke(struct key *);
 static void request_key_auth_destroy(struct key *);
 static long request_key_auth_read(const struct key *, char __user *, size_t);
 
+static struct key_acl request_key_auth_acl = {
+	.usage	= REFCOUNT_INIT(1),
+	.nr_ace	= 2,
+	.possessor_viewable = true,
+	.aces = {
+		KEY_POSSESSOR_ACE(KEY_ACE_VIEW | KEY_ACE_READ | KEY_ACE_SEARCH |
+				  KEY_ACE_LINK),
+		KEY_OWNER_ACE(KEY_ACE_VIEW),
+	}
+};
+
 /*
  * The request-key authorisation key type definition.
  */
@@ -208,8 +219,8 @@ struct key *request_key_auth_new(struct key *target, const char *op,
 
 	authkey = key_alloc(&key_type_request_key_auth, desc,
 			    cred->fsuid, cred->fsgid, cred,
-			    KEY_POS_VIEW | KEY_POS_READ | KEY_POS_SEARCH | KEY_POS_LINK |
-			    KEY_USR_VIEW, KEY_ALLOC_NOT_IN_QUOTA, NULL);
+			    &request_key_auth_acl,
+			    KEY_ALLOC_NOT_IN_QUOTA, NULL);
 	if (IS_ERR(authkey)) {
 		ret = PTR_ERR(authkey);
 		goto error_free_rka;
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index fd845063b692..616b7c292eb6 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -6560,6 +6560,7 @@ static int selinux_key_permission(key_ref_t key_ref,
 {
 	struct key *key;
 	struct key_security_struct *ksec;
+	unsigned oldstyle_perm;
 	u32 sid;
 
 	/* if no specific permissions are requested, we skip the
@@ -6568,13 +6569,26 @@ static int selinux_key_permission(key_ref_t key_ref,
 	if (perm == 0)
 		return 0;
 
+	oldstyle_perm = perm & (KEY_NEED_VIEW | KEY_NEED_READ | KEY_NEED_WRITE |
+				KEY_NEED_SEARCH | KEY_NEED_LINK);
+	if (perm & KEY_NEED_SETSEC)
+		oldstyle_perm |= OLD_KEY_NEED_SETATTR;
+	if (perm & KEY_NEED_INVAL)
+		oldstyle_perm |= KEY_NEED_SEARCH;
+	if (perm & KEY_NEED_REVOKE && !(perm & OLD_KEY_NEED_SETATTR))
+		oldstyle_perm |= KEY_NEED_WRITE;
+	if (perm & KEY_NEED_JOIN)
+		oldstyle_perm |= KEY_NEED_SEARCH;
+	if (perm & KEY_NEED_CLEAR)
+		oldstyle_perm |= KEY_NEED_WRITE;
+
 	sid = cred_sid(cred);
 
 	key = key_ref_to_ptr(key_ref);
 	ksec = key->security;
 
 	return avc_has_perm(&selinux_state,
-			    sid, ksec->sid, SECCLASS_KEY, perm, NULL);
+			    sid, ksec->sid, SECCLASS_KEY, oldstyle_perm, NULL);
 }
 
 static int selinux_key_getsecurity(struct key *key, char **_buffer)
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index feaace1c24a2..c09133115769 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -4407,7 +4407,8 @@ static int smack_key_permission(key_ref_t key_ref,
 #endif
 	if (perm & (KEY_NEED_READ | KEY_NEED_SEARCH | KEY_NEED_VIEW))
 		request |= MAY_READ;
-	if (perm & (KEY_NEED_WRITE | KEY_NEED_LINK | KEY_NEED_SETATTR))
+	if (perm & (KEY_NEED_WRITE | KEY_NEED_LINK | KEY_NEED_SETSEC |
+		    KEY_NEED_INVAL | KEY_NEED_REVOKE | KEY_NEED_CLEAR))
 		request |= MAY_WRITE;
 	rc = smk_access(tkp, keyp->security, request, &ad);
 	rc = smk_bu_note("key access", tkp, keyp->security, request, rc);



* [RFC PATCH 23/27] KEYS: Provide KEYCTL_GRANT_PERMISSION
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (21 preceding siblings ...)
  2019-02-15 16:11 ` [RFC PATCH 22/27] KEYS: Replace uid/gid/perm permissions checking with an ACL David Howells
@ 2019-02-15 16:11 ` David Howells
  2019-02-15 16:11 ` [RFC PATCH 24/27] keys: Allow a container to be specified as a subject in a key's ACL David Howells
                   ` (6 subsequent siblings)
  29 siblings, 0 replies; 61+ messages in thread
From: David Howells @ 2019-02-15 16:11 UTC (permalink / raw)
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

Provide a keyctl() operation to grant/remove permissions.  The grant
operation, wrapped by libkeyutils, looks like:

	int ret = keyctl_grant_permission(key_serial_t key,
					  enum key_ace_subject_type type,
					  unsigned int subject,
					  unsigned int perm);

Where key is the key to be modified, type and subject represent the subject
to which permission is to be granted (or removed) and perm is the set of
permissions to be granted.  0 is returned on success.  SET_SECURITY
permission is required for this.

The subject type must currently be KEY_ACE_SUBJ_STANDARD (other subject
types will come along later).

For subject type KEY_ACE_SUBJ_STANDARD, the following subject values are
available:

	KEY_ACE_POSSESSOR	The possessor of the key
	KEY_ACE_OWNER		The owner of the key
	KEY_ACE_GROUP		The key's group
	KEY_ACE_EVERYONE	Everyone

perm lists the permissions to be granted:

	KEY_ACE_VIEW		Can view the key metadata
	KEY_ACE_READ		Can read the key content
	KEY_ACE_WRITE		Can update/modify the key content
	KEY_ACE_SEARCH		Can find the key by searching/requesting
	KEY_ACE_LINK		Can make a link to the key
	KEY_ACE_SET_SECURITY	Can set security
	KEY_ACE_INVAL		Can invalidate
	KEY_ACE_REVOKE		Can revoke
	KEY_ACE_JOIN		Can join this keyring
	KEY_ACE_CLEAR		Can clear this keyring

If an ACE already exists for the subject, then its permissions mask will be
overwritten; if perm is 0, the ACE will be deleted.

Currently, the internal ACL is limited to a maximum of 16 entries.

For example:

	int ret = keyctl_grant_permission(key,
					  KEY_ACE_SUBJ_STANDARD,
					  KEY_ACE_OWNER,
					  KEY_ACE_VIEW | KEY_ACE_READ);
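
The add/overwrite/delete behaviour described above can be modelled in
userspace.  This is only a sketch: the names model_ace, model_acl and
model_grant are made up for illustration, and the model edits the ACL in
place, whereas the kernel code in this patch builds a replacement ACL and
installs it with key_set_acl().  The KEY_ACE_* values are those defined by
the test-container sample later in this series.

```c
#include <errno.h>

/* Permission bits, as defined in the test-container sample. */
#define KEY_ACE_VIEW	0x00000001
#define KEY_ACE_READ	0x00000002
#define KEY_ACE_SEARCH	0x00000008

#define MODEL_ACL_MAX	16	/* the limit stated in the description */

/* Hypothetical userspace stand-ins for struct key_ace / struct key_acl. */
struct model_ace {
	unsigned int type;	/* subject type */
	unsigned int subject;	/* subject within that type */
	unsigned int perm;	/* mask of KEY_ACE_* bits */
};

struct model_acl {
	unsigned int nr_ace;
	struct model_ace aces[MODEL_ACL_MAX];
};

/*
 * Add, overwrite or (if perm == 0) delete the ACE for (type, subject).
 * Returns 0 on success or -EINVAL if the ACL is already full.
 */
static int model_grant(struct model_acl *acl, unsigned int type,
		       unsigned int subject, unsigned int perm)
{
	unsigned int i;

	for (i = 0; i < acl->nr_ace; i++)
		if (acl->aces[i].type == type &&
		    acl->aces[i].subject == subject)
			break;

	if (i < acl->nr_ace) {
		if (perm != 0) {
			/* The mask is overwritten, not ORed in. */
			acl->aces[i].perm = perm;
			return 0;
		}
		/* perm == 0: delete the ACE and close the gap. */
		for (; i + 1 < acl->nr_ace; i++)
			acl->aces[i] = acl->aces[i + 1];
		acl->nr_ace--;
		return 0;
	}

	if (perm == 0)
		return 0;	/* No matching ACE, so nothing to remove. */
	if (acl->nr_ace >= MODEL_ACL_MAX)
		return -EINVAL;

	acl->aces[acl->nr_ace].type = type;
	acl->aces[acl->nr_ace].subject = subject;
	acl->aces[acl->nr_ace].perm = perm;
	acl->nr_ace++;
	return 0;
}
```

In this model, granting KEY_ACE_VIEW | KEY_ACE_READ and then KEY_ACE_SEARCH
to the same subject leaves that subject with KEY_ACE_SEARCH only.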

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/uapi/linux/keyctl.h |    1 
 security/keys/compat.c      |    2 +
 security/keys/internal.h    |    5 ++
 security/keys/keyctl.c      |    5 ++
 security/keys/permission.c  |  119 +++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 132 insertions(+)

diff --git a/include/uapi/linux/keyctl.h b/include/uapi/linux/keyctl.h
index 50d7b6ca82ab..045dcbb6bb8d 100644
--- a/include/uapi/linux/keyctl.h
+++ b/include/uapi/linux/keyctl.h
@@ -136,6 +136,7 @@ enum key_ace_standard_subject {
 #define KEYCTL_MOVE			33	/* Move keys between keyrings */
 #define KEYCTL_FIND_LRU			34	/* Find the least-recently used key in a keyring */
 #define KEYCTL_SET_CONTAINER_KEYRING	35	/* Attach a keyring to a container */
+#define KEYCTL_GRANT_PERMISSION		36	/* Grant a permit to a key */
 
 /* keyctl structures */
 struct keyctl_dh_params {
diff --git a/security/keys/compat.c b/security/keys/compat.c
index 7990ec026237..953156f94320 100644
--- a/security/keys/compat.c
+++ b/security/keys/compat.c
@@ -174,6 +174,8 @@ COMPAT_SYSCALL_DEFINE5(keyctl, u32, option,
 
 	case KEYCTL_MOVE:
 		return keyctl_keyring_move(arg2, arg3, arg4, arg5);
+	case KEYCTL_GRANT_PERMISSION:
+		return keyctl_grant_permission(arg2, arg3, arg4, arg5);
 
 	default:
 		return -EOPNOTSUPP;
diff --git a/security/keys/internal.h b/security/keys/internal.h
index 9f9ecc1810c9..6cd7b5c17298 100644
--- a/security/keys/internal.h
+++ b/security/keys/internal.h
@@ -377,6 +377,11 @@ extern long keyctl_find_lru(key_serial_t, const char __user *);
 extern long keyctl_set_container_keyring(int, key_serial_t);
 #endif
 
+extern long keyctl_grant_permission(key_serial_t keyid,
+				    enum key_ace_subject_type type,
+				    unsigned int subject,
+				    unsigned int perm);
+
 /*
  * Debugging key validation
  */
diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index 2df896bfb8e4..02bd73d5a05a 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -1961,6 +1961,11 @@ SYSCALL_DEFINE5(keyctl, int, option, unsigned long, arg2, unsigned long, arg3,
 					   (key_serial_t)arg3,
 					   (key_serial_t)arg4,
 					   (unsigned int)arg5);
+	case KEYCTL_GRANT_PERMISSION:
+		return keyctl_grant_permission((key_serial_t)arg2,
+					       (enum key_ace_subject_type)arg3,
+					       (unsigned int)arg4,
+					       (unsigned int)arg5);
 
 	default:
 		return -EOPNOTSUPP;
diff --git a/security/keys/permission.c b/security/keys/permission.c
index 8dc6e80f6fd0..cb1359f6c668 100644
--- a/security/keys/permission.c
+++ b/security/keys/permission.c
@@ -274,3 +274,122 @@ long key_set_acl(struct key *key, struct key_acl *acl)
 	key_put_acl(acl);
 	return 0;
 }
+
+/*
+ * Allocate a new ACL with an extra ACE slot.
+ */
+static struct key_acl *key_alloc_acl(const struct key_acl *old_acl, int nr, int skip)
+{
+	struct key_acl *acl;
+	int nr_ace, i, j = 0;
+
+	nr_ace = old_acl->nr_ace + nr;
+	if (nr_ace > 16)
+		return ERR_PTR(-EINVAL);
+
+	acl = kzalloc(struct_size(acl, aces, nr_ace), GFP_KERNEL);
+	if (!acl)
+		return ERR_PTR(-ENOMEM);
+
+	refcount_set(&acl->usage, 1);
+	acl->nr_ace = nr_ace;
+	for (i = 0; i < old_acl->nr_ace; i++) {
+		if (i == skip)
+			continue;
+		acl->aces[j] = old_acl->aces[i];
+		j++;
+	}
+	return acl;
+}
+
+/*
+ * Generate the revised ACL.
+ */
+static long key_change_acl(struct key *key, struct key_ace *new_ace)
+{
+	struct key_acl *acl, *old;
+	int i;
+
+	old = rcu_dereference_protected(key->acl, lockdep_is_held(&key->sem));
+
+	for (i = 0; i < old->nr_ace; i++)
+		if (old->aces[i].type == new_ace->type &&
+		    old->aces[i].subject_id == new_ace->subject_id)
+			goto found_match;
+
+	if (new_ace->perm == 0)
+		return 0; /* No permissions to remove.  Add deny record? */
+
+	acl = key_alloc_acl(old, 1, -1);
+	if (IS_ERR(acl))
+		return PTR_ERR(acl);
+	acl->aces[i] = *new_ace;
+	goto change;
+
+found_match:
+	if (new_ace->perm == 0)
+		goto delete_ace;
+	if (new_ace->perm == old->aces[i].perm)
+		return 0;
+	acl = key_alloc_acl(old, 0, -1);
+	if (IS_ERR(acl))
+		return PTR_ERR(acl);
+	acl->aces[i].perm = new_ace->perm;
+	goto change;
+
+delete_ace:
+	acl = key_alloc_acl(old, -1, i);
+	if (IS_ERR(acl))
+		return PTR_ERR(acl);
+	goto change;
+
+change:
+	return key_set_acl(key, acl);
+}
+
+/*
+ * Add, alter or remove (if perm == 0) an ACE in a key's ACL.
+ */
+long keyctl_grant_permission(key_serial_t keyid,
+			     enum key_ace_subject_type type,
+			     unsigned int subject,
+			     unsigned int perm)
+{
+	struct key_ace new_ace;
+	struct key *key;
+	key_ref_t key_ref;
+	long ret;
+
+	new_ace.type = type;
+	new_ace.perm = perm;
+
+	switch (type) {
+	case KEY_ACE_SUBJ_STANDARD:
+		if (subject >= nr__key_ace_standard_subject)
+			return -ENOENT;
+		new_ace.subject_id = subject;
+		break;
+
+	default:
+		return -ENOENT;
+	}
+
+	key_ref = lookup_user_key(keyid, KEY_LOOKUP_PARTIAL, KEY_NEED_SETSEC);
+	if (IS_ERR(key_ref)) {
+		ret = PTR_ERR(key_ref);
+		goto error;
+	}
+
+	key = key_ref_to_ptr(key_ref);
+
+	down_write(&key->sem);
+
+	/* If we're not the sysadmin, we can only change a key that we own */
+	ret = -EACCES;
+	if (capable(CAP_SYS_ADMIN) || uid_eq(key->uid, current_fsuid()))
+		ret = key_change_acl(key, &new_ace);
+	up_write(&key->sem);
+	key_put(key);
+error:
+	return ret;
+}



* [RFC PATCH 24/27] keys: Allow a container to be specified as a subject in a key's ACL
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (22 preceding siblings ...)
  2019-02-15 16:11 ` [RFC PATCH 23/27] KEYS: Provide KEYCTL_GRANT_PERMISSION David Howells
@ 2019-02-15 16:11 ` David Howells
  2019-02-15 16:11 ` [RFC PATCH 25/27] keys: Provide a way to ask for the container keyring David Howells
                   ` (5 subsequent siblings)
  29 siblings, 0 replies; 61+ messages in thread
From: David Howells @ 2019-02-15 16:11 UTC (permalink / raw)
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

Allow the ACL attached to a key to grant permissions to the denizens of a
container object when request_key() is called.  This allows permissions to
be granted separately from those in the possessor set.

	int cfd = container_create("foo", 0);

	int ret = keyctl_grant_permission(key,
					  KEY_ACE_SUBJ_CONTAINER,
					  cfd,
					  KEY_ACE_SEARCH);

To allow request_key() to find a key, KEY_ACE_SEARCH must be included in
the ACE.  This will allow filesystems and network protocols (e.g. AFS and
AF_RXRPC) to use the key.  For the request_key() system call to be able to
find a key for a process inside the container, KEY_ACE_LINK must also be
granted.

Keys on the container keyring (and the container keyring itself) can be
accessed directly by ID from inside the container if other KEY_ACE_*
permits are granted.
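
The container check that this patch adds to key_task_permission() can be
pictured with a small userspace model.  This is a sketch only: the model_*
names are invented for illustration and there is no RCU or refcounting
here.  It shows the two conditions the kernel code tests, namely that the
ACE's tag is the caller's container tag and that the tag has not been
marked removed.

```c
#include <stdbool.h>
#include <stddef.h>

#define KEY_ACE_SEARCH	0x00000008	/* as in the test-container sample */
#define KEY_ACE_LINK	0x00000010

/* Hypothetical stand-ins for struct key_tag and struct container. */
struct model_tag {
	bool removed;		/* set when the container goes away */
};

struct model_container {
	struct model_tag *tag;
};

/* A container-subject ACE: identifies the container by its tag. */
struct model_container_ace {
	const struct model_tag *tag;
	unsigned int perm;
};

/*
 * Collect the permissions that the container ACEs grant to a caller
 * living in the given container.
 */
static unsigned int
model_container_perms(const struct model_container_ace *aces, size_t nr,
		      const struct model_container *caller)
{
	unsigned int allow = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		if (!aces[i].tag->removed && aces[i].tag == caller->tag)
			allow |= aces[i].perm;
	return allow;
}
```

Because the ACE holds the tag rather than the container itself, marking the
tag as removed is enough to stop the ACE from matching, even though the ACL
still carries the entry.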

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/container.h    |    6 ++-
 include/linux/key.h          |    3 +
 include/uapi/linux/keyctl.h  |    1 
 kernel/container.c           |   41 ++++++++++++++++++-
 samples/vfs/test-container.c |   60 ++++++++++++++++++++++++++++
 security/keys/permission.c   |   90 ++++++++++++++++++++++++++++++++++++++----
 security/keys/process_keys.c |    2 -
 7 files changed, 188 insertions(+), 15 deletions(-)

diff --git a/include/linux/container.h b/include/linux/container.h
index 7424f7fb5560..cd82074c26a3 100644
--- a/include/linux/container.h
+++ b/include/linux/container.h
@@ -33,7 +33,11 @@ struct container {
 	refcount_t		usage;
 	int			exit_code;	/* The exit code of 'init' */
 	const struct cred	*cred;		/* Creds for this container, including userns */
+#ifdef CONFIG_KEYS
 	struct key		*keyring;	/* Externally managed container keyring */
+	struct key_tag		*tag;		/* Container ID for key ACL */
+	struct list_head	req_key_traps;	/* Traps for request-key upcalls */
+#endif
 	struct nsproxy		*ns;		/* This container's namespaces */
 	struct path		root;		/* The root of the container's fs namespace */
 	struct task_struct	*init;		/* The 'init' task for this container */
@@ -43,7 +47,6 @@ struct container {
 	struct list_head	members;	/* Member processes, guarded with ->lock */
 	struct list_head	child_link;	/* Link in parent->children */
 	struct list_head	children;	/* Child containers */
-	struct list_head	req_key_traps;	/* Traps for request-key upcalls */
 	wait_queue_head_t	waitq;		/* Someone waiting for init to exit waits here */
 	unsigned long		flags;
 #define CONTAINER_FLAG_INIT_STARTED	0	/* Init is started - certain ops now prohibited */
@@ -63,6 +66,7 @@ extern int copy_container(unsigned long flags, struct task_struct *tsk,
 extern void exit_container(struct task_struct *tsk);
 extern void put_container(struct container *c);
 extern long key_del_intercept(struct container *c, const char *type);
+extern struct container *fd_to_container(int fd);
 
 static inline struct container *get_container(struct container *c)
 {
diff --git a/include/linux/key.h b/include/linux/key.h
index a38b89bd414c..01bccaa40047 100644
--- a/include/linux/key.h
+++ b/include/linux/key.h
@@ -90,6 +90,9 @@ struct key_ace {
 		kuid_t		uid;
 		kgid_t		gid;
 		unsigned int	subject_id;
+#ifdef CONFIG_CONTAINERS
+		struct key_tag __rcu *container_tag;
+#endif
 	};
 };
 
diff --git a/include/uapi/linux/keyctl.h b/include/uapi/linux/keyctl.h
index 045dcbb6bb8d..7136d14dd4d7 100644
--- a/include/uapi/linux/keyctl.h
+++ b/include/uapi/linux/keyctl.h
@@ -20,6 +20,7 @@
  */
 enum key_ace_subject_type {
 	KEY_ACE_SUBJ_STANDARD	= 0,	/* subject is one of key_ace_standard_subject */
+	KEY_ACE_SUBJ_CONTAINER	= 1,	/* subject is a container fd */
 	nr__key_ace_subject_type
 };
 
diff --git a/kernel/container.c b/kernel/container.c
index f2706a45f364..81be4ed915c2 100644
--- a/kernel/container.c
+++ b/kernel/container.c
@@ -35,7 +35,9 @@ struct container init_container = {
 	.members.next	= &init_task.container_link,
 	.members.prev	= &init_task.container_link,
 	.children	= LIST_HEAD_INIT(init_container.children),
+#ifdef CONFIG_KEYS
 	.req_key_traps	= LIST_HEAD_INIT(init_container.req_key_traps),
+#endif
 	.flags		= (1 << CONTAINER_FLAG_INIT_STARTED),
 	.lock		= __SPIN_LOCK_UNLOCKED(init_container.lock),
 	.seq		= SEQCNT_ZERO(init_fs.seq),
@@ -54,8 +56,6 @@ void put_container(struct container *c)
 
 	while (c && refcount_dec_and_test(&c->usage)) {
 		BUG_ON(!list_empty(&c->members));
-		if (!list_empty(&c->req_key_traps))
-			key_del_intercept(c, NULL);
 		if (c->pid_ns)
 			put_pid_ns(c->pid_ns);
 		if (c->ns)
@@ -71,7 +71,15 @@ void put_container(struct container *c)
 
 		if (c->cred)
 			put_cred(c->cred);
+#ifdef CONFIG_KEYS
+		if (!list_empty(&c->req_key_traps))
+			key_del_intercept(c, NULL);
+		if (c->tag) {
+			c->tag->removed = true;
+			key_put_tag(c->tag);
+		}
 		key_put(c->keyring);
+#endif
 		security_container_free(c);
 		kfree(c);
 		c = parent;
@@ -209,6 +217,24 @@ const struct file_operations container_fops = {
 	.release	= container_release,
 };
 
+/**
+ * fd_to_container - Get the container attached to an fd.
+ */
+struct container *fd_to_container(int fd)
+{
+	struct container *c = ERR_PTR(-EINVAL);
+	struct fd f = fdget(fd);
+
+	if (!f.file)
+		return ERR_PTR(-EBADF);
+
+	if (is_container_file(f.file))
+		c = get_container(f.file->private_data);
+
+	fdput(f);
+	return c;
+}
+
 /*
  * Handle fork/clone.
  *
@@ -290,7 +316,9 @@ static struct container *alloc_container(const char __user *name)
 
 	INIT_LIST_HEAD(&c->members);
 	INIT_LIST_HEAD(&c->children);
+#ifdef CONFIG_KEYS
 	INIT_LIST_HEAD(&c->req_key_traps);
+#endif
 	init_waitqueue_head(&c->waitq);
 	spin_lock_init(&c->lock);
 	refcount_set(&c->usage, 1);
@@ -305,8 +333,15 @@ static struct container *alloc_container(const char __user *name)
 	ret = -EINVAL;
 	if (strchr(c->name, '/'))
 		goto err;
-
 	c->name[len] = 0;
+
+#ifdef CONFIG_KEYS
+	ret = -ENOMEM;
+	c->tag = kzalloc(sizeof(*c->tag), GFP_KERNEL);
+	if (!c->tag)
+		goto err;
+	refcount_set(&c->tag->usage, 1);
+#endif
 	return c;
 
 err:
diff --git a/samples/vfs/test-container.c b/samples/vfs/test-container.c
index e24048fdbe33..7b2081693fce 100644
--- a/samples/vfs/test-container.c
+++ b/samples/vfs/test-container.c
@@ -22,6 +22,30 @@
 
 #define KEYCTL_CONTAINER_INTERCEPT	31	/* Intercept upcalls inside a container */
 #define KEYCTL_SET_CONTAINER_KEYRING	35	/* Attach a keyring to a container */
+#define KEYCTL_GRANT_PERMISSION		36	/* Grant a permit to a key */
+
+enum key_ace_subject_type {
+	KEY_ACE_SUBJ_STANDARD	= 0,	/* subject is one of key_ace_standard_subject */
+	KEY_ACE_SUBJ_CONTAINER	= 1,	/* subject is a container fd */
+};
+
+enum key_ace_standard_subject {
+	KEY_ACE_EVERYONE	= 0,	/* Everyone, including owner and group */
+	KEY_ACE_GROUP		= 1,	/* The key's group */
+	KEY_ACE_OWNER		= 2,	/* The owner of the key */
+	KEY_ACE_POSSESSOR	= 3,	/* Any process that possesses the key */
+};
+
+#define KEY_ACE_VIEW		0x00000001 /* Can describe the key */
+#define KEY_ACE_READ		0x00000002 /* Can read the key content */
+#define KEY_ACE_WRITE		0x00000004 /* Can update/modify the key content */
+#define KEY_ACE_SEARCH		0x00000008 /* Can find the key by search */
+#define KEY_ACE_LINK		0x00000010 /* Can make a link to the key */
+#define KEY_ACE_SET_SECURITY	0x00000020 /* Can set owner, group, ACL */
+#define KEY_ACE_INVAL		0x00000040 /* Can invalidate the key */
+#define KEY_ACE_REVOKE		0x00000080 /* Can revoke the key */
+#define KEY_ACE_JOIN		0x00000100 /* Can join keyring */
+#define KEY_ACE_CLEAR		0x00000200 /* Can clear keyring */
 
 /* Hope -1 isn't a syscall */
 #ifndef __NR_fsopen
@@ -190,7 +214,7 @@ void container_init(void)
  */
 int main(int argc, char *argv[])
 {
-	key_serial_t keyring;
+	key_serial_t keyring, key;
 	pid_t pid;
 	int fsfd, mfd, cfd, ws;
 
@@ -271,11 +295,45 @@ int main(int argc, char *argv[])
 		exit(1);
 	}
 
+	/* We need to grant the container permission to search for keys in the
+	 * container keyring.
+	 */
+	if (keyctl(KEYCTL_GRANT_PERMISSION, keyring, KEY_ACE_SUBJ_CONTAINER, cfd,
+		   KEY_ACE_SEARCH) < 0) {
+		perror("keyctl_grant/s");
+		exit(1);
+	}
+
+	if (keyctl(KEYCTL_GRANT_PERMISSION, keyring,
+		   KEY_ACE_SUBJ_STANDARD, KEY_ACE_OWNER, 0) < 0) {
+		perror("keyctl_grant/s");
+		exit(1);
+	}
+
 	if (keyctl(KEYCTL_SET_CONTAINER_KEYRING, cfd, keyring) < 0) {
 		perror("keyctl_set_container_keyring");
 		exit(1);
 	}
 
+	/* Create a key that can be accessed from within the container */
+	printf("Sample key...\n");
+	key = add_key("user", "foobar", "wibble", 6, keyring);
+	if (key == -1) {
+		perror("add_key/s");
+		exit(1);
+	}
+
+	if (keyctl(KEYCTL_GRANT_PERMISSION, key, KEY_ACE_SUBJ_CONTAINER, cfd,
+		   KEY_ACE_VIEW | KEY_ACE_SEARCH | KEY_ACE_READ | KEY_ACE_LINK) < 0) {
+		perror("keyctl_grant/s");
+		exit(1);
+	}
+
+	if (keyctl_link(key, keyring) < 0) {
+		perror("keyctl_link");
+		exit(1);
+	}
+
 	/* Create a keyring to catch upcalls. */
 	printf("Intercepting...\n");
 	keyring = add_key("keyring", "upcall", NULL, 0, KEY_SPEC_SESSION_KEYRING);
diff --git a/security/keys/permission.c b/security/keys/permission.c
index cb1359f6c668..f16d1665885f 100644
--- a/security/keys/permission.c
+++ b/security/keys/permission.c
@@ -13,6 +13,7 @@
 #include <linux/security.h>
 #include <linux/user_namespace.h>
 #include <linux/uaccess.h>
+#include <linux/container.h>
 #include "internal.h"
 
 struct key_acl default_key_acl = {
@@ -130,6 +131,15 @@ int key_task_permission(const key_ref_t key_ref, const struct cred *cred,
 				break;
 			}
 			break;
+#ifdef CONFIG_CONTAINERS
+		case KEY_ACE_SUBJ_CONTAINER: {
+			const struct key_tag *tag = rcu_dereference(ace->container_tag);
+
+			if (!tag->removed && current->container->tag == tag)
+				allow |= ace->perm;
+			break;
+		}
+#endif
 		}
 	}
 
@@ -185,8 +195,7 @@ EXPORT_SYMBOL(key_validate);
  */
 unsigned int key_acl_to_perm(const struct key_acl *acl)
 {
-	unsigned int perm = 0, tperm;
-	int i;
+	unsigned int perm = 0, tperm, i;
 
 	BUILD_BUG_ON(KEY_OTH_VIEW	!= KEY_ACE_VIEW		||
 		     KEY_OTH_READ	!= KEY_ACE_READ		||
@@ -237,13 +246,37 @@ unsigned int key_acl_to_perm(const struct key_acl *acl)
 	return perm;
 }
 
+/*
+ * Clean up an ACL.
+ */
+static void key_free_acl(struct rcu_head *rcu)
+{
+	struct key_acl *acl = container_of(rcu, struct key_acl, rcu);
+#ifdef CONFIG_CONTAINERS
+	struct key_tag *tag;
+	unsigned int i;
+
+	for (i = 0; i < acl->nr_ace; i++) {
+		const struct key_ace *ace = &acl->aces[i];
+		switch (ace->type) {
+		case KEY_ACE_SUBJ_CONTAINER:
+			tag = rcu_access_pointer(ace->container_tag);
+			key_put_tag(ace->container_tag);
+			break;
+		}
+	}
+#endif
+
+	kfree(acl);
+}
+
 /*
  * Destroy a key's ACL.
  */
 void key_put_acl(struct key_acl *acl)
 {
 	if (acl && refcount_dec_and_test(&acl->usage))
-		kfree_rcu(acl, rcu);
+		call_rcu(&acl->rcu, key_free_acl);
 }
 
 /*
@@ -297,6 +330,10 @@ static struct key_acl *key_alloc_acl(const struct key_acl *old_acl, int nr, int
 		if (i == skip)
 			continue;
 		acl->aces[j] = old_acl->aces[i];
+#ifdef CONFIG_CONTAINERS
+		if (acl->aces[j].type == KEY_ACE_SUBJ_CONTAINER)
+			refcount_inc(&acl->aces[j].container_tag->usage);
+#endif
 		j++;
 	}
 	return acl;
@@ -312,21 +349,39 @@ static long key_change_acl(struct key *key, struct key_ace *new_ace)
 
 	old = rcu_dereference_protected(key->acl, lockdep_is_held(&key->sem));
 
-	for (i = 0; i < old->nr_ace; i++)
-		if (old->aces[i].type == new_ace->type &&
-		    old->aces[i].subject_id == new_ace->subject_id)
-			goto found_match;
+	for (i = 0; i < old->nr_ace; i++) {
+		if (old->aces[i].type != new_ace->type)
+			continue;
+		switch (old->aces[i].type) {
+		case KEY_ACE_SUBJ_STANDARD:
+			if (old->aces[i].subject_id == new_ace->subject_id)
+				goto replace_ace;
+			break;
+#ifdef CONFIG_CONTAINERS
+		case KEY_ACE_SUBJ_CONTAINER:
+			if (old->aces[i].container_tag == new_ace->container_tag)
+				goto replace_ace;
+			break;
+#endif
+		default:
+			break;
+		}
+	}
 
 	if (new_ace->perm == 0)
-		return 0; /* No permissions to remove.  Add deny record? */
+		return 0; /* No permissions to cancel.  Add deny record? */
 
 	acl = key_alloc_acl(old, 1, -1);
 	if (IS_ERR(acl))
 		return PTR_ERR(acl);
 	acl->aces[i] = *new_ace;
+#ifdef CONFIG_CONTAINERS
+	if (acl->aces[i].type == KEY_ACE_SUBJ_CONTAINER)
+		refcount_inc(&acl->aces[i].container_tag->usage);
+#endif
 	goto change;
 
-found_match:
+replace_ace:
 	if (new_ace->perm == 0)
 		goto delete_ace;
 	if (new_ace->perm == old->aces[i].perm)
@@ -360,6 +415,7 @@ long keyctl_grant_permission(key_serial_t keyid,
 	key_ref_t key_ref;
 	long ret;
 
+	memset(&new_ace, 0, sizeof(new_ace));
 	new_ace.type = type;
 	new_ace.perm = perm;
 
@@ -370,6 +426,18 @@ long keyctl_grant_permission(key_serial_t keyid,
 		new_ace.subject_id = subject;
 		break;
 
+#ifdef CONFIG_CONTAINERS
+	case KEY_ACE_SUBJ_CONTAINER: {
+		struct container *c = fd_to_container(subject);
+		if (IS_ERR(c))
+			return -EINVAL;
+		refcount_inc(&c->tag->usage);
+		new_ace.container_tag = c->tag;
+		put_container(c);
+		break;
+	}
+#endif
+
 	default:
 		return -ENOENT;
 	}
@@ -391,5 +459,9 @@ long keyctl_grant_permission(key_serial_t keyid,
 	up_write(&key->sem);
 	key_put(key);
 error:
+#ifdef CONFIG_CONTAINERS
+	if (new_ace.type == KEY_ACE_SUBJ_CONTAINER && new_ace.container_tag)
+		key_put_tag(new_ace.container_tag);
+#endif
 	return ret;
 }
diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c
index 0a231ede4d2b..f296a1cc979a 100644
--- a/security/keys/process_keys.c
+++ b/security/keys/process_keys.c
@@ -466,7 +466,7 @@ key_ref_t search_my_process_keyrings(struct keyring_search_context *ctx)
 #ifdef CONFIG_CONTAINERS
 	if (current->container->keyring) {
 		key_ref = keyring_search_aux(
-			make_key_ref(current->container->keyring, 1), ctx);
+			make_key_ref(current->container->keyring, false), ctx);
 		if (!IS_ERR(key_ref))
 			goto found;
 



* [RFC PATCH 25/27] keys: Provide a way to ask for the container keyring
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (23 preceding siblings ...)
  2019-02-15 16:11 ` [RFC PATCH 24/27] keys: Allow a container to be specified as a subject in a key's ACL David Howells
@ 2019-02-15 16:11 ` David Howells
  2019-02-15 16:12 ` [RFC PATCH 26/27] keys: Allow containers to be included in key ACLs by name David Howells
                   ` (4 subsequent siblings)
  29 siblings, 0 replies; 61+ messages in thread
From: David Howells @ 2019-02-15 16:11 UTC (permalink / raw)
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

Provide a constant that can be used in place of a key ID to indicate the
keyring belonging to the current process's container.  Used as:

	key_serial_t container_keyring =
		keyctl_get_keyring_ID(KEY_SPEC_CONTAINER_KEYRING, 0);

Note that this is merely a 'macro' for the ID of the keyring.  To be able
to actually do anything with it requires the keyring to grant appropriate
permissions to the denizens of the container.
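
Without libkeyutils, the same lookup can be sketched as a raw keyctl()
system call.  This assumes a Linux build environment; the helper name
get_container_keyring() is made up for illustration.
KEYCTL_GET_KEYRING_ID is the existing keyctl operation and -9 is the value
proposed by this patch, so on a kernel without the patch the call is
expected to fail and return -1.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

#define KEYCTL_GET_KEYRING_ID		0	/* existing keyctl operation */
#define KEY_SPEC_CONTAINER_KEYRING	-9	/* value proposed by this patch */

/*
 * Ask the kernel for the ID of the current container's keyring.
 * Returns the keyring ID, or -1 with errno set on failure.
 */
static long get_container_keyring(void)
{
	return syscall(__NR_keyctl,
		       (long)KEYCTL_GET_KEYRING_ID,
		       (long)KEY_SPEC_CONTAINER_KEYRING,
		       0L);	/* don't create the keyring */
}
```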

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/uapi/linux/keyctl.h  |    1 +
 samples/vfs/test-container.c |   15 +++++++++++++++
 security/keys/process_keys.c |    7 +++++++
 3 files changed, 23 insertions(+)

diff --git a/include/uapi/linux/keyctl.h b/include/uapi/linux/keyctl.h
index 7136d14dd4d7..89ab609f774c 100644
--- a/include/uapi/linux/keyctl.h
+++ b/include/uapi/linux/keyctl.h
@@ -88,6 +88,7 @@ enum key_ace_standard_subject {
 #define KEY_SPEC_GROUP_KEYRING		-6	/* - key ID for GID-specific keyring */
 #define KEY_SPEC_REQKEY_AUTH_KEY	-7	/* - key ID for assumed request_key auth key */
 #define KEY_SPEC_REQUESTOR_KEYRING	-8	/* - key ID for request_key() dest keyring */
+#define KEY_SPEC_CONTAINER_KEYRING	-9	/* - key ID for current->container's keyring */
 
 /* request-key default keyrings */
 #define KEY_REQKEY_DEFL_NO_CHANGE		-1
diff --git a/samples/vfs/test-container.c b/samples/vfs/test-container.c
index 7b2081693fce..4716dd50b696 100644
--- a/samples/vfs/test-container.c
+++ b/samples/vfs/test-container.c
@@ -20,6 +20,7 @@
 #include <sys/stat.h>
 #include <keyutils.h>
 
+#define KEY_SPEC_CONTAINER_KEYRING	-9	/* - key ID for current->container's keyring */
 #define KEYCTL_CONTAINER_INTERCEPT	31	/* Intercept upcalls inside a container */
 #define KEYCTL_SET_CONTAINER_KEYRING	35	/* Attach a keyring to a container */
 #define KEYCTL_GRANT_PERMISSION		36	/* Grant a permit to a key */
@@ -160,6 +161,8 @@ static inline int fork_into_container(int containerfd)
 static __attribute__((noreturn))
 void container_init(void)
 {
+	key_serial_t ckey;
+
 	if (0) {
 		/* Do a bit of debugging on the container. */
 		struct dirent **dlist;
@@ -203,6 +206,12 @@ void container_init(void)
 		exit(1);
 	}
 
+	ckey = keyctl_get_keyring_ID(KEY_SPEC_CONTAINER_KEYRING, 0);
+	if (ckey == -1)
+		perror("keyctl_get_keyring_ID");
+	else
+		printf("Container keyring %d\n", ckey);
+	
 	setenv("PS1", "container>", 1);
 	execl("/bin/bash", "bash", NULL);
 	perror("execl");
@@ -310,6 +319,12 @@ int main(int argc, char *argv[])
 		exit(1);
 	}
 
+	if (keyctl(KEYCTL_GRANT_PERMISSION, keyring,
+		   KEY_ACE_SUBJ_STANDARD, KEY_ACE_OWNER, 0) < 0) {
+		perror("keyctl_grant/s");
+		exit(1);
+	}
+
 	if (keyctl(KEYCTL_SET_CONTAINER_KEYRING, cfd, keyring) < 0) {
 		perror("keyctl_set_container_keyring");
 		exit(1);
diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c
index f296a1cc979a..f8f580a760c9 100644
--- a/security/keys/process_keys.c
+++ b/security/keys/process_keys.c
@@ -725,6 +725,13 @@ key_ref_t lookup_user_key(key_serial_t id, unsigned long lflags,
 		key_ref = make_key_ref(key, 1);
 		break;
 
+	case KEY_SPEC_CONTAINER_KEYRING:
+		key = current->container->keyring;
+		if (!key)
+			goto error;
+		key_ref = make_key_ref(key, 0);
+		goto error;
+
 	default:
 		key_ref = ERR_PTR(-EINVAL);
 		if (id < 1)



* [RFC PATCH 26/27] keys: Allow containers to be included in key ACLs by name
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (24 preceding siblings ...)
  2019-02-15 16:11 ` [RFC PATCH 25/27] keys: Provide a way to ask for the container keyring David Howells
@ 2019-02-15 16:12 ` David Howells
  2019-02-15 16:12 ` [RFC PATCH 27/27] containers: Sample to grant access to a key in a container David Howells
                   ` (3 subsequent siblings)
  29 siblings, 0 replies; 61+ messages in thread
From: David Howells @ 2019-02-15 16:12 UTC (permalink / raw)
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

Allow a container to be specified to KEYCTL_GRANT_PERMISSION by name.  This
allows processes that don't have access to the container fd to grant
permission on a key to a container.  This is restricted to the containers
that are children of the current container.

This can be effected with something like:

	keyctl(KEYCTL_GRANT_PERMISSION, key,
	       KEY_ACE_SUBJ_CONTAINER_NAME, "foo-test",
	       KEY_ACE_SEARCH);
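
The restriction to children of the current container can be illustrated
with a userspace model of the lookup.  This is a sketch under assumptions:
named_container and find_child() are invented names, the children are kept
on a simple singly-linked list, and there is no locking or refcounting as
there is in the kernel's find_container().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct container. */
struct named_container {
	const char *name;
	struct named_container *children;	/* first child, or NULL */
	struct named_container *next_sibling;	/* next child of the parent */
};

/*
 * Find a direct child of @parent by name.  Grandchildren are deliberately
 * not searched, mirroring the "children of the current container" rule.
 */
static struct named_container *
find_child(struct named_container *parent, const char *name)
{
	struct named_container *c;

	for (c = parent->children; c; c = c->next_sibling)
		if (strcmp(c->name, name) == 0)
			return c;
	return NULL;
}
```

A name that matches only a grandchild is not found from the grandparent; it
has to be looked up from the intermediate container.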


Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/container.h   |    1 +
 include/uapi/linux/keyctl.h |    1 +
 kernel/container.c          |   24 ++++++++++++++++++++++++
 security/keys/compat.c      |    4 ++++
 security/keys/internal.h    |    2 +-
 security/keys/keyctl.c      |    2 +-
 security/keys/permission.c  |   19 ++++++++++++++++++-
 7 files changed, 50 insertions(+), 3 deletions(-)

diff --git a/include/linux/container.h b/include/linux/container.h
index cd82074c26a3..fd49ce23467d 100644
--- a/include/linux/container.h
+++ b/include/linux/container.h
@@ -61,6 +61,7 @@ extern struct container init_container;
 #ifdef CONFIG_CONTAINERS
 extern const struct file_operations container_fops;
 
+extern struct container *find_container(const char *name);
 extern int copy_container(unsigned long flags, struct task_struct *tsk,
 			  struct container *container);
 extern void exit_container(struct task_struct *tsk);
diff --git a/include/uapi/linux/keyctl.h b/include/uapi/linux/keyctl.h
index 89ab609f774c..31520da17f37 100644
--- a/include/uapi/linux/keyctl.h
+++ b/include/uapi/linux/keyctl.h
@@ -21,6 +21,7 @@
 enum key_ace_subject_type {
 	KEY_ACE_SUBJ_STANDARD	= 0,	/* subject is one of key_ace_standard_subject */
 	KEY_ACE_SUBJ_CONTAINER	= 1,	/* subject is a container fd */
+	KEY_ACE_SUBJ_CONTAINER_NAME = 2, /* subject is a container name pointer */
 	nr__key_ace_subject_type
 };
 
diff --git a/kernel/container.c b/kernel/container.c
index 81be4ed915c2..c164c16328d6 100644
--- a/kernel/container.c
+++ b/kernel/container.c
@@ -235,6 +235,30 @@ struct container *fd_to_container(int fd)
 	return c;
 }
 
+/**
+ * find_container - Find a child container by name.
+ * @name: The name of the container to find.
+ *
+ * Find a child of the current container by name.
+ */
+struct container *find_container(const char *name)
+{
+	struct container *c = current->container, *p;
+
+	spin_lock(&c->lock);
+	list_for_each_entry(p, &c->children, child_link) {
+		if (strcmp(p->name, name) == 0) {
+			get_container(p);
+			goto found;
+		}
+	}
+
+	p = NULL;
+found:
+	spin_unlock(&c->lock);
+	return p;
+}
+
 /*
  * Handle fork/clone.
  *
diff --git a/security/keys/compat.c b/security/keys/compat.c
index 953156f94320..78c6c0e0eb59 100644
--- a/security/keys/compat.c
+++ b/security/keys/compat.c
@@ -175,6 +175,10 @@ COMPAT_SYSCALL_DEFINE5(keyctl, u32, option,
 	case KEYCTL_MOVE:
 		return keyctl_keyring_move(arg2, arg3, arg4, arg5);
 	case KEYCTL_GRANT_PERMISSION:
+		if (arg3 == KEY_ACE_SUBJ_CONTAINER_NAME)
+			return keyctl_grant_permission(arg2, arg3,
+						       (unsigned long)compat_ptr(arg4),
+						       arg5);
 		return keyctl_grant_permission(arg2, arg3, arg4, arg5);
 
 	default:
diff --git a/security/keys/internal.h b/security/keys/internal.h
index 6cd7b5c17298..aa4ad9c8002e 100644
--- a/security/keys/internal.h
+++ b/security/keys/internal.h
@@ -379,7 +379,7 @@ extern long keyctl_set_container_keyring(int, key_serial_t);
 
 extern long keyctl_grant_permission(key_serial_t keyid,
 				    enum key_ace_subject_type type,
-				    unsigned int subject,
+				    unsigned long subject,
 				    unsigned int perm);
 
 /*
diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index 02bd73d5a05a..978c9008c3b2 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -1964,7 +1964,7 @@ SYSCALL_DEFINE5(keyctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case KEYCTL_GRANT_PERMISSION:
 		return keyctl_grant_permission((key_serial_t)arg2,
 					       (enum key_ace_subject_type)arg3,
-					       (unsigned int)arg4,
+					       (unsigned long)arg4,
 					       (unsigned int)arg5);
 
 	default:
diff --git a/security/keys/permission.c b/security/keys/permission.c
index f16d1665885f..b0e94ccc4635 100644
--- a/security/keys/permission.c
+++ b/security/keys/permission.c
@@ -407,7 +407,7 @@ static long key_change_acl(struct key *key, struct key_ace *new_ace)
  */
 long keyctl_grant_permission(key_serial_t keyid,
 			     enum key_ace_subject_type type,
-			     unsigned int subject,
+			     unsigned long subject,
 			     unsigned int perm)
 {
 	struct key_ace new_ace;
@@ -436,6 +436,23 @@ long keyctl_grant_permission(key_serial_t keyid,
 		put_container(c);
 		break;
 	}
+	case KEY_ACE_SUBJ_CONTAINER_NAME: {
+		struct container *c;
+		char *name;
+
+		name = strndup_user((const char __user *)subject, 23);
+		if (IS_ERR(name))
+			return PTR_ERR(name);
+		c = find_container(name);
+		kfree(name);
+		if (!c)
+			return -EINVAL;
+		new_ace.type = KEY_ACE_SUBJ_CONTAINER;
+		refcount_inc(&c->tag->usage);
+		new_ace.container_tag = c->tag;
+		put_container(c);
+		break;
+	}
 #endif
 
 	default:



* [RFC PATCH 27/27] containers: Sample to grant access to a key in a container
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (25 preceding siblings ...)
  2019-02-15 16:12 ` [RFC PATCH 26/27] keys: Allow containers to be included in key ACLs by name David Howells
@ 2019-02-15 16:12 ` David Howells
  2019-02-15 22:36 ` [RFC PATCH 00/27] Containers and using authenticated filesystems James Morris
                   ` (2 subsequent siblings)
  29 siblings, 0 replies; 61+ messages in thread
From: David Howells @ 2019-02-15 16:12 UTC (permalink / raw)
  To: keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	dhowells, linux-kernel

Provide a sample program that will grant access to the specified key for a
container named "foo-test" (as created by the test-container sample) and
then link the key into the container keyring.  The keyring is either given
on the command line or found by searching the session keyring for a keyring
called "_container", as placed there by the test-container sample.

So, for example, this could be used to place an rxrpc key in the container
keyring for kAFS inside the container to use:

 (1) Poke Kerberos to get a ticket for accessing AFS.

	# kinit
	# aklog-kafs redhat.com

 (2) Find the rxrpc key ID:

	# keyctl show
	Session Keyring
	1071328996 --alswrv      0     0  keyring: _ses
	 574060623 ---lswrv      0 65534   \_ keyring: _uid.0
	1004048468 --alswrv      0     0   \_ rxrpc: afs@redhat.com
	 918328787 --alswrv      0     0   \_ keyring: upcall
	 996275498 --alswrv      0     0   \_ keyring: _container
	 785497401 --alswrv      0     0       \_ user: foobar

     which would be 1004048468 in this example.

 (3) Invoke the sample:

	# test-cont-grant 1004048468

     The rxrpc key can now be seen in the container keyring:

	# keyctl show
	Session Keyring
	1071328996 --alswrv      0     0  keyring: _ses
	 574060623 ---lswrv      0 65534   \_ keyring: _uid.0
	1004048468 --alswrv      0     0   \_ rxrpc: afs@redhat.com
	 918328787 --alswrv      0     0   \_ keyring: upcall
	 996275498 --alswrv      0     0   \_ keyring: _container
	 785497401 --alswrv      0     0       \_ user: foobar
	1004048468 --alswrv      0     0       \_ rxrpc: afs@redhat.com

 (4) Mount the kAFS filesystem inside the container:

	> mount -t afs "%redhat.com:root.cell" /mnt

The contents of /mnt can then be used from inside the container using the
key placed into the container keyring.
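
The sample reads the key ID from the command line with atoi(), which accepts
malformed input without reporting an error.  A minimal sketch of a stricter
parser follows.  parse_key_id() is a hypothetical helper that is not part of
this patch, and rejecting 0 is an assumption about what is sensible for this
tool rather than a rule taken from the kernel.

```c
#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, not part of the patch: strictly parse a decimal key
 * serial number.  Returns 0 and stores the ID in *id on success; returns -1
 * if the string is empty, has trailing junk, is out of int32 range or is 0.
 */
int parse_key_id(const char *s, int32_t *id)
{
	char *end;
	long v;

	if (!s || !*s)
		return -1;

	errno = 0;
	v = strtol(s, &end, 10);
	if (errno || *end != '\0')
		return -1;	/* overflow or trailing characters */

	if (v == 0 || v < INT32_MIN || v > INT32_MAX)
		return -1;	/* does not fit a 32-bit key serial */

	*id = (int32_t)v;
	return 0;
}
```

With something like this, main() could print the usage message and exit when
the helper returns -1 rather than passing a bogus ID on to keyctl().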

Signed-off-by: David Howells <dhowells@redhat.com>
---

 samples/vfs/Makefile          |    3 +
 samples/vfs/test-cont-grant.c |   84 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 87 insertions(+)
 create mode 100644 samples/vfs/test-cont-grant.c

diff --git a/samples/vfs/Makefile b/samples/vfs/Makefile
index a8e9e1142ae3..c8eea193a856 100644
--- a/samples/vfs/Makefile
+++ b/samples/vfs/Makefile
@@ -6,6 +6,7 @@ hostprogs-$(CONFIG_SAMPLE_VFS) := \
 	test-mntinfo \
 	test-statx \
 	test-container \
+	test-cont-grant \
 	test-upcall
 
 # Tell kbuild to always build the programs
@@ -22,5 +23,7 @@ HOSTCFLAGS_test-statx.o += -I$(objtree)/usr/include
 
 HOSTCFLAGS_test-container.o += -I$(objtree)/usr/include
 HOSTLDLIBS_test-container += -lkeyutils
+HOSTCFLAGS_test-cont-grant.o += -I$(objtree)/usr/include
+HOSTLDLIBS_test-cont-grant += -lkeyutils
 HOSTCFLAGS_test-upcall.o += -I$(objtree)/usr/include
 HOSTLDLIBS_test-upcall += -lkeyutils
diff --git a/samples/vfs/test-cont-grant.c b/samples/vfs/test-cont-grant.c
new file mode 100644
index 000000000000..da4a60bc71fa
--- /dev/null
+++ b/samples/vfs/test-cont-grant.c
@@ -0,0 +1,84 @@
+/* Link a key into a container keyring and grant perms to the container.
+ *
+ * Copyright (C) 2019 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <sys/wait.h>
+#include <linux/mount.h>
+#include <linux/unistd.h>
+#include <dirent.h>
+#include <sys/stat.h>
+#include <keyutils.h>
+
+#define KEYCTL_GRANT_PERMISSION		36	/* Grant a permit to a key */
+
+enum key_ace_subject_type {
+	KEY_ACE_SUBJ_STANDARD	= 0,	/* subject is one of key_ace_standard_subject */
+	KEY_ACE_SUBJ_CONTAINER	= 1,	/* subject is a container fd */
+	KEY_ACE_SUBJ_CONTAINER_NAME = 2, /* subject is a container name pointer */
+};
+
+enum key_ace_standard_subject {
+	KEY_ACE_EVERYONE	= 0,	/* Everyone, including owner and group */
+	KEY_ACE_GROUP		= 1,	/* The key's group */
+	KEY_ACE_OWNER		= 2,	/* The owner of the key */
+	KEY_ACE_POSSESSOR	= 3,	/* Any process that possesses the key */
+};
+
+#define KEY_ACE_VIEW		0x00000001 /* Can describe the key */
+#define KEY_ACE_READ		0x00000002 /* Can read the key content */
+#define KEY_ACE_WRITE		0x00000004 /* Can update/modify the key content */
+#define KEY_ACE_SEARCH		0x00000008 /* Can find the key by search */
+#define KEY_ACE_LINK		0x00000010 /* Can make a link to the key */
+#define KEY_ACE_SET_SECURITY	0x00000020 /* Can set owner, group, ACL */
+#define KEY_ACE_INVAL		0x00000040 /* Can invalidate the key */
+#define KEY_ACE_REVOKE		0x00000080 /* Can revoke the key */
+#define KEY_ACE_JOIN		0x00000100 /* Can join keyring */
+#define KEY_ACE_CLEAR		0x00000200 /* Can clear keyring */
+
+int main(int argc, char *argv[])
+{
+	key_serial_t key, keyring;
+
+	if (argc == 2) {
+		printf("Find keyring '_container'...\n");
+		keyring = keyctl_search(KEY_SPEC_SESSION_KEYRING, "keyring", "_container", 0);
+		if (keyring == -1) {
+			perror("keyctl_search");
+			exit(1);
+		}
+
+		key = atoi(argv[1]);
+	} else if (argc == 3) {
+		printf("Use specified keyring...\n");
+		keyring = atoi(argv[2]);
+		key = atoi(argv[1]);
+	} else {
+		fprintf(stderr, "Format: test-cont-grant <key> [<cont-keyring>]\n");
+		exit(2);
+	}
+
+	if (keyctl(KEYCTL_GRANT_PERMISSION, key,
+		   KEY_ACE_SUBJ_CONTAINER_NAME, "foo-test",
+		   KEY_ACE_SEARCH) < 0) {
+		perror("keyctl_grant/s");
+		exit(1);
+	}
+
+	if (keyctl_link(key, keyring) < 0) {
+		perror("keyctl_link");
+		exit(1);
+	}
+
+	exit(0);
+}



* Re: [RFC PATCH 22/27] KEYS: Replace uid/gid/perm permissions checking with an ACL
  2019-02-15 16:11 ` [RFC PATCH 22/27] KEYS: Replace uid/gid/perm permissions checking with an ACL David Howells
@ 2019-02-15 17:32   ` Stephen Smalley
  2019-02-15 17:39   ` David Howells
  1 sibling, 0 replies; 61+ messages in thread
From: Stephen Smalley @ 2019-02-15 17:32 UTC (permalink / raw)
  To: David Howells, keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	linux-kernel, Paul Moore, SELinux

On 2/15/19 11:11 AM, David Howells wrote:
> Replace the uid/gid/perm permissions checking on a key with an ACL to allow
> the SETATTR and SEARCH permissions to be split.  This will also allow a
> greater range of subjects to be represented.
> 
> ============
> WHY DO THIS?
> ============
> 
> The problem is that SETATTR and SEARCH cover a slew of actions, not all of
> which should be grouped together.
> 
> For SETATTR, this includes actions that are about controlling access to a
> key:
> 
>   (1) Changing a key's ownership.
> 
>   (2) Changing a key's security information.
> 
>   (3) Setting a keyring's restriction.
> 
> And actions that are about managing a key's lifetime:
> 
>   (4) Setting an expiry time.
> 
>   (5) Revoking a key.
> 
> and (proposed) managing a key as part of a cache:
> 
>   (6) Invalidating a key.
> 
> Managing a key's lifetime doesn't really have anything to do with
> controlling access to that key.
> 
> Expiry time is awkward since it's more about the lifetime of the content
> and so, in some ways goes better with WRITE permission.  It can, however,
> be set unconditionally by a process with an appropriate authorisation token
> for instantiating a key, and can also be set by the key type driver when a
> key is instantiated, so lumping it with the access-controlling actions is
> probably okay.
> 
> As for SEARCH permission, that currently covers:
> 
>   (1) Finding keys in a keyring tree during a search.
> 
>   (2) Permitting keyrings to be joined.
> 
>   (3) Invalidation.
> 
> But these don't really belong together either, since these actions really
> need to be controlled separately.
> 
> Finally, there are a number of special cases to do with granting the
> administrator special rights to invalidate or clear keys that I would like
> to handle with the ACL rather than key flags and special checks.
> 
> 
> ===============
> WHAT IS CHANGED
> ===============
> 
> The SETATTR permission is split to create two new permissions:
> 
>   (1) SET_SECURITY - which allows the key's owner, group and ACL to be
>       changed and a restriction to be placed on a keyring.
> 
>   (2) REVOKE - which allows a key to be revoked.
> 
> The SEARCH permission is split to create:
> 
>   (1) SEARCH - which allows a keyring to be searched and a key to be found.
> 
>   (2) JOIN - which allows a keyring to be joined as a session keyring.
> 
>   (3) INVAL - which allows a key to be invalidated.
> 
> The WRITE permission is also split to create:
> 
>   (1) WRITE - which allows a key's content to be altered and links to be
>       added, removed and replaced in a keyring.
> 
>   (2) CLEAR - which allows a keyring to be cleared completely.  This is
>       split out to make it possible to give just this to an administrator.
> 
>   (3) REVOKE - see above.
> 
> 
> Keys acquire ACLs which consist of a series of ACEs, and all that apply are
> unioned together.  An ACE specifies a subject, such as:
> 
>   (*) Possessor - permitted to anyone who 'possesses' a key
>   (*) Owner - permitted to the key owner
>   (*) Group - permitted to the key group
>   (*) Everyone - permitted to everyone
> 
> Note that 'Other' has been replaced with 'Everyone' on the assumption that
> you wouldn't grant a permit to 'Other' that you wouldn't also grant to
> everyone else.
> 
> Further subjects may be made available by later patches.
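
The union rule described above can be sketched in plain C as follows.  This
is only an illustration under stated assumptions: struct ace, struct caller
and effective_perms() are hypothetical names, the subject numbering follows
enum key_ace_standard_subject from the test-cont-grant sample in this
series, and the real kernel check may involve more than this (LSM hooks,
for instance).

```c
#include <stdbool.h>
#include <stddef.h>

/* Standard subjects, numbered as in the test-cont-grant sample. */
enum subject {
	SUBJ_EVERYONE	= 0,
	SUBJ_GROUP	= 1,
	SUBJ_OWNER	= 2,
	SUBJ_POSSESSOR	= 3,
};

/* Hypothetical, simplified ACE: one subject and one permissions mask. */
struct ace {
	enum subject	subject;
	unsigned int	perm;
};

/* Hypothetical description of how the caller relates to the key. */
struct caller {
	bool possesses;		/* caller possesses the key */
	bool is_owner;		/* caller's UID matches the key owner */
	bool in_group;		/* caller is in the key's group */
};

/* OR together the masks of every ACE whose subject applies to the caller. */
unsigned int effective_perms(const struct ace *aces, size_t n,
			     const struct caller *who)
{
	unsigned int allowed = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		bool applies = false;

		switch (aces[i].subject) {
		case SUBJ_EVERYONE:	applies = true;			break;
		case SUBJ_GROUP:	applies = who->in_group;	break;
		case SUBJ_OWNER:	applies = who->is_owner;	break;
		case SUBJ_POSSESSOR:	applies = who->possesses;	break;
		}
		if (applies)
			allowed |= aces[i].perm;
	}
	return allowed;
}
```

An owner who also possesses the key therefore gets the Possessor, Owner and
Everyone masks ORed together, not just the first ACE that matches.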
> 
> The ACE also specifies a permissions mask.  The set of permissions is now:
> 
> 	VIEW		Can view the key metadata
> 	READ		Can read the key content
> 	WRITE		Can update/modify the key content
> 	SEARCH		Can find the key by searching/requesting
> 	LINK		Can make a link to the key
> 	SET_SECURITY	Can change owner, ACL, expiry
> 	INVAL		Can invalidate
> 	REVOKE		Can revoke
> 	JOIN		Can join this keyring
> 	CLEAR		Can clear this keyring
> 
> 
> The KEYCTL_SETPERM function is then deprecated.
> 
> The KEYCTL_SET_TIMEOUT function then is permitted if SET_SECURITY is set,
> or if the caller has a valid instantiation auth token.
> 
> The KEYCTL_INVALIDATE function then requires INVAL.
> 
> The KEYCTL_REVOKE function then requires REVOKE.
> 
> The KEYCTL_JOIN_SESSION_KEYRING function then requires JOIN to join an
> existing keyring.
> 
> The JOIN permission is enabled by default for session keyrings and manually
> created keyrings only.
> 
> 
> ======================
> BACKWARD COMPATIBILITY
> ======================
> 
> To maintain backward compatibility, KEYCTL_SETPERM will translate the
> permissions mask it is given into a new ACL for a key - unless
> KEYCTL_SET_ACL has been called on that key, in which case an error will be
> returned.
> 
> It will convert possessor, owner, group and other permissions into separate
> ACEs, if each portion of the mask is non-zero.
> 
> SETATTR permission turns on all of INVAL, REVOKE and SET_SECURITY.  WRITE
> permission turns on WRITE, REVOKE and, if a keyring, CLEAR.  JOIN is turned
> on if a keyring is being altered.
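
A minimal sketch of that translation for one portion of the old mask
(possessor, owner, group or other), written as standalone C.  The name
legacy_to_ace_perm() is hypothetical.  The legacy bit values are the
KEY_OTH_* ones removed from include/linux/key.h by this patch and the ACE
bit values are those listed in the test-cont-grant sample.  Passing
VIEW/READ/WRITE/SEARCH/LINK straight through, and adding JOIN to every
non-empty portion for a keyring, are assumptions read from the text above
rather than taken from the kernel code.

```c
#include <stdbool.h>

/* One 6-bit portion of the legacy permissions mask. */
#define OLD_VIEW		0x01
#define OLD_READ		0x02
#define OLD_WRITE		0x04
#define OLD_SEARCH		0x08
#define OLD_LINK		0x10
#define OLD_SETATTR		0x20

/* ACE permission bits, values as in the test-cont-grant sample. */
#define KEY_ACE_VIEW		0x00000001
#define KEY_ACE_READ		0x00000002
#define KEY_ACE_WRITE		0x00000004
#define KEY_ACE_SEARCH		0x00000008
#define KEY_ACE_LINK		0x00000010
#define KEY_ACE_SET_SECURITY	0x00000020
#define KEY_ACE_INVAL		0x00000040
#define KEY_ACE_REVOKE		0x00000080
#define KEY_ACE_JOIN		0x00000100
#define KEY_ACE_CLEAR		0x00000200

/* Translate one legacy portion into an ACE mask; 0 means "emit no ACE". */
unsigned int legacy_to_ace_perm(unsigned int old, bool is_keyring)
{
	unsigned int perm;

	if (!old)
		return 0;

	/* The five basic bits share numeric values in both schemes. */
	perm = old & (OLD_VIEW | OLD_READ | OLD_WRITE | OLD_SEARCH | OLD_LINK);

	if (old & OLD_SETATTR)
		perm |= KEY_ACE_INVAL | KEY_ACE_REVOKE | KEY_ACE_SET_SECURITY;
	if (old & OLD_WRITE) {
		perm |= KEY_ACE_REVOKE;
		if (is_keyring)
			perm |= KEY_ACE_CLEAR;
	}
	if (is_keyring)
		perm |= KEY_ACE_JOIN;	/* assumption: applied to every portion */

	return perm;
}
```

Each non-zero result would become one ACE for the corresponding subject.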
> 
> The KEYCTL_DESCRIBE function translates the ACL back into a permissions
> mask to return depending on possessor, owner, group and everyone ACEs.
> 
> It will make the following mappings:
> 
>   (1) INVAL, JOIN -> SEARCH
> 
>   (2) SET_SECURITY -> SETATTR
> 
>   (3) REVOKE -> WRITE if SETATTR isn't already set
> 
>   (4) CLEAR -> WRITE
> 
> Note that the value subsequently returned by KEYCTL_DESCRIBE may not match
> the value set with KEYCTL_SETPERM.
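
The four mappings listed above can be sketched as a standalone C function
operating on a single ACE mask.  ace_perm_to_legacy() is a hypothetical
name.  The legacy bit values are the KEY_OTH_* ones from include/linux/key.h
and the ACE bit values are those in the test-cont-grant sample.  Passing the
five basic bits straight through is an assumption, as is applying rule (3)
after rule (2).

```c
/* One 6-bit portion of the legacy permissions mask. */
#define OLD_VIEW		0x01
#define OLD_READ		0x02
#define OLD_WRITE		0x04
#define OLD_SEARCH		0x08
#define OLD_LINK		0x10
#define OLD_SETATTR		0x20

/* ACE permission bits, values as in the test-cont-grant sample. */
#define KEY_ACE_VIEW		0x00000001
#define KEY_ACE_READ		0x00000002
#define KEY_ACE_WRITE		0x00000004
#define KEY_ACE_SEARCH		0x00000008
#define KEY_ACE_LINK		0x00000010
#define KEY_ACE_SET_SECURITY	0x00000020
#define KEY_ACE_INVAL		0x00000040
#define KEY_ACE_REVOKE		0x00000080
#define KEY_ACE_JOIN		0x00000100
#define KEY_ACE_CLEAR		0x00000200

/* Fold one ACE mask back into a legacy 6-bit portion. */
unsigned int ace_perm_to_legacy(unsigned int perm)
{
	/* Assumption: the five basic bits map one-to-one. */
	unsigned int old = perm & (KEY_ACE_VIEW | KEY_ACE_READ | KEY_ACE_WRITE |
				   KEY_ACE_SEARCH | KEY_ACE_LINK);

	if (perm & (KEY_ACE_INVAL | KEY_ACE_JOIN))
		old |= OLD_SEARCH;			/* rule (1) */
	if (perm & KEY_ACE_SET_SECURITY)
		old |= OLD_SETATTR;			/* rule (2) */
	if ((perm & KEY_ACE_REVOKE) && !(old & OLD_SETATTR))
		old |= OLD_WRITE;			/* rule (3) */
	if (perm & KEY_ACE_CLEAR)
		old |= OLD_WRITE;			/* rule (4) */

	return old;
}
```

This also illustrates why the described mask need not match what was
originally set: an ACE holding INVAL, REVOKE and SET_SECURITY (what a legacy
SETATTR bit expands to) folds back to SEARCH plus SETATTR.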
> 
> 
> =======
> TESTING
> =======
> 
> This passes the keyutils testsuite for all but a couple of tests:
> 
>   (1) tests/keyctl/dh_compute/badargs: The first wrong-key-type test now
>       returns EOPNOTSUPP rather than ENOKEY as READ permission isn't removed
>       if the type doesn't have ->read().  You still can't actually read the
>       key.
> 
>   (2) tests/keyctl/permitting/valid: The view-other-permissions test doesn't
>       work as Other has been replaced with Everyone in the ACL.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>   certs/blacklist.c                                  |    7 -
>   certs/system_keyring.c                             |   12 -
>   drivers/md/dm-crypt.c                              |    2
>   drivers/nvdimm/security.c                          |    2
>   fs/afs/security.c                                  |    2
>   fs/cifs/cifs_spnego.c                              |   25 ++
>   fs/cifs/cifsacl.c                                  |   28 ++
>   fs/cifs/connect.c                                  |    4
>   fs/crypto/keyinfo.c                                |    2
>   fs/ecryptfs/ecryptfs_kernel.h                      |    2
>   fs/ecryptfs/keystore.c                             |    2
>   fs/fscache/object-list.c                           |    2
>   fs/nfs/nfs4idmap.c                                 |   29 ++
>   fs/ubifs/auth.c                                    |    2
>   include/linux/key.h                                |  113 +++++----
>   include/uapi/linux/keyctl.h                        |   63 +++++
>   lib/digsig.c                                       |    2
>   net/ceph/ceph_common.c                             |    2
>   net/dns_resolver/dns_key.c                         |   12 +
>   net/dns_resolver/dns_query.c                       |   15 +
>   net/rxrpc/key.c                                    |   16 +
>   security/integrity/digsig.c                        |   31 +--
>   security/integrity/digsig_asymmetric.c             |    2
>   security/integrity/evm/evm_crypto.c                |    2
>   security/integrity/ima/ima_mok.c                   |   13 +
>   security/integrity/integrity.h                     |    4
>   .../integrity/platform_certs/platform_keyring.c    |   13 +
>   security/keys/encrypted-keys/encrypted.c           |    2
>   security/keys/encrypted-keys/masterkey_trusted.c   |    2
>   security/keys/gc.c                                 |    2
>   security/keys/internal.h                           |   12 +
>   security/keys/key.c                                |   29 +-
>   security/keys/keyctl.c                             |   93 +++++---
>   security/keys/keyring.c                            |   27 ++
>   security/keys/permission.c                         |  238 +++++++++++++++++---
>   security/keys/persistent.c                         |   27 ++
>   security/keys/proc.c                               |   17 +
>   security/keys/process_keys.c                       |   72 +++++-
>   security/keys/request_key.c                        |   40 ++-
>   security/keys/request_key_auth.c                   |   15 +
>   security/selinux/hooks.c                           |   16 +
>   security/smack/smack_lsm.c                         |    3
>   42 files changed, 726 insertions(+), 278 deletions(-)
> 
> diff --git a/certs/blacklist.c b/certs/blacklist.c
> index 3a507b9e2568..7677c3b0a147 100644
> --- a/certs/blacklist.c
> +++ b/certs/blacklist.c
> @@ -93,8 +93,7 @@ int mark_hash_blacklisted(const char *hash)
>   				   hash,
>   				   NULL,
>   				   0,
> -				   ((KEY_POS_ALL & ~KEY_POS_SETATTR) |
> -				    KEY_USR_VIEW),
> +				   &internal_key_acl,
>   				   KEY_ALLOC_NOT_IN_QUOTA |
>   				   KEY_ALLOC_BUILT_IN);
>   	if (IS_ERR(key)) {
> @@ -153,9 +152,7 @@ static int __init blacklist_init(void)
>   		keyring_alloc(".blacklist",
>   			      KUIDT_INIT(0), KGIDT_INIT(0),
>   			      current_cred(),
> -			      (KEY_POS_ALL & ~KEY_POS_SETATTR) |
> -			      KEY_USR_VIEW | KEY_USR_READ |
> -			      KEY_USR_SEARCH,
> +			      &internal_keyring_acl,
>   			      KEY_ALLOC_NOT_IN_QUOTA |
>   			      KEY_FLAG_KEEP,
>   			      NULL, NULL);
> diff --git a/certs/system_keyring.c b/certs/system_keyring.c
> index 81728717523d..7b775d6028e1 100644
> --- a/certs/system_keyring.c
> +++ b/certs/system_keyring.c
> @@ -100,9 +100,7 @@ static __init int system_trusted_keyring_init(void)
>   	builtin_trusted_keys =
>   		keyring_alloc(".builtin_trusted_keys",
>   			      KUIDT_INIT(0), KGIDT_INIT(0), current_cred(),
> -			      ((KEY_POS_ALL & ~KEY_POS_SETATTR) |
> -			      KEY_USR_VIEW | KEY_USR_READ | KEY_USR_SEARCH),
> -			      KEY_ALLOC_NOT_IN_QUOTA,
> +			      &internal_key_acl, KEY_ALLOC_NOT_IN_QUOTA,
>   			      NULL, NULL);
>   	if (IS_ERR(builtin_trusted_keys))
>   		panic("Can't allocate builtin trusted keyring\n");
> @@ -111,10 +109,7 @@ static __init int system_trusted_keyring_init(void)
>   	secondary_trusted_keys =
>   		keyring_alloc(".secondary_trusted_keys",
>   			      KUIDT_INIT(0), KGIDT_INIT(0), current_cred(),
> -			      ((KEY_POS_ALL & ~KEY_POS_SETATTR) |
> -			       KEY_USR_VIEW | KEY_USR_READ | KEY_USR_SEARCH |
> -			       KEY_USR_WRITE),
> -			      KEY_ALLOC_NOT_IN_QUOTA,
> +			      &internal_writable_keyring_acl, KEY_ALLOC_NOT_IN_QUOTA,
>   			      get_builtin_and_secondary_restriction(),
>   			      NULL);
>   	if (IS_ERR(secondary_trusted_keys))
> @@ -164,8 +159,7 @@ static __init int load_system_certificate_list(void)
>   					   NULL,
>   					   p,
>   					   plen,
> -					   ((KEY_POS_ALL & ~KEY_POS_SETATTR) |
> -					   KEY_USR_VIEW | KEY_USR_READ),
> +					   &internal_key_acl,
>   					   KEY_ALLOC_NOT_IN_QUOTA |
>   					   KEY_ALLOC_BUILT_IN |
>   					   KEY_ALLOC_BYPASS_RESTRICTION);
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 0ff22159a0ca..7f37616cd21a 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -2034,7 +2034,7 @@ static int crypt_set_keyring_key(struct crypt_config *cc, const char *key_string
>   		return -ENOMEM;
>   
>   	key = request_key(key_string[0] == 'l' ? &key_type_logon : &key_type_user,
> -			  key_desc + 1, NULL);
> +			  key_desc + 1, NULL, NULL);
>   	if (IS_ERR(key)) {
>   		kzfree(new_key_string);
>   		return PTR_ERR(key);
> diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
> index f8bb746a549f..db5cfd934ec8 100644
> --- a/drivers/nvdimm/security.c
> +++ b/drivers/nvdimm/security.c
> @@ -53,7 +53,7 @@ static struct key *nvdimm_request_key(struct nvdimm *nvdimm)
>   	struct device *dev = &nvdimm->dev;
>   
>   	sprintf(desc, "%s%s", NVDIMM_PREFIX, nvdimm->dimm_id);
> -	key = request_key(&key_type_encrypted, desc, "");
> +	key = request_key(&key_type_encrypted, desc, "", NULL);
>   	if (IS_ERR(key)) {
>   		if (PTR_ERR(key) == -ENOKEY)
>   			dev_dbg(dev, "request_key() found no key\n");
> diff --git a/fs/afs/security.c b/fs/afs/security.c
> index 5f58a9a17e69..184274ce41e1 100644
> --- a/fs/afs/security.c
> +++ b/fs/afs/security.c
> @@ -32,7 +32,7 @@ struct key *afs_request_key(struct afs_cell *cell)
>   
>   	_debug("key %s", cell->anonymous_key->description);
>   	key = request_key(&key_type_rxrpc, cell->anonymous_key->description,
> -			  NULL);
> +			  NULL, NULL);
>   	if (IS_ERR(key)) {
>   		if (PTR_ERR(key) != -ENOKEY) {
>   			_leave(" = %ld", PTR_ERR(key));
> diff --git a/fs/cifs/cifs_spnego.c b/fs/cifs/cifs_spnego.c
> index 7f01c6e60791..d1b439ad0f1a 100644
> --- a/fs/cifs/cifs_spnego.c
> +++ b/fs/cifs/cifs_spnego.c
> @@ -32,6 +32,25 @@
>   #include "cifsproto.h"
>   static const struct cred *spnego_cred;
>   
> +static struct key_acl cifs_spnego_key_acl = {
> +	.usage	= REFCOUNT_INIT(1),
> +	.nr_ace	= 2,
> +	.possessor_viewable = true,
> +	.aces = {
> +		KEY_POSSESSOR_ACE(KEY_ACE_VIEW | KEY_ACE_SEARCH | KEY_ACE_READ),
> +		KEY_OWNER_ACE(KEY_ACE_VIEW),
> +	}
> +};
> +
> +static struct key_acl cifs_spnego_keyring_acl = {
> +	.usage	= REFCOUNT_INIT(1),
> +	.nr_ace	= 2,
> +	.aces = {
> +		KEY_POSSESSOR_ACE(KEY_ACE_SEARCH | KEY_ACE_WRITE),
> +		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ | KEY_ACE_CLEAR),
> +	}
> +};
> +
>   /* create a new cifs key */
>   static int
>   cifs_spnego_key_instantiate(struct key *key, struct key_preparsed_payload *prep)
> @@ -170,7 +189,8 @@ cifs_get_spnego_key(struct cifs_ses *sesInfo)
>   
>   	cifs_dbg(FYI, "key description = %s\n", description);
>   	saved_cred = override_creds(spnego_cred);
> -	spnego_key = request_key(&cifs_spnego_key_type, description, "");
> +	spnego_key = request_key(&cifs_spnego_key_type, description, "",
> +				 &cifs_spnego_key_acl);
>   	revert_creds(saved_cred);
>   
>   #ifdef CONFIG_CIFS_DEBUG2
> @@ -207,8 +227,7 @@ init_cifs_spnego(void)
>   
>   	keyring = keyring_alloc(".cifs_spnego",
>   				GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, cred,
> -				(KEY_POS_ALL & ~KEY_POS_SETATTR) |
> -				KEY_USR_VIEW | KEY_USR_READ,
> +				&cifs_spnego_keyring_acl,
>   				KEY_ALLOC_NOT_IN_QUOTA, NULL, NULL);
>   	if (IS_ERR(keyring)) {
>   		ret = PTR_ERR(keyring);
> diff --git a/fs/cifs/cifsacl.c b/fs/cifs/cifsacl.c
> index 1d377b7f2860..78eed72f3af0 100644
> --- a/fs/cifs/cifsacl.c
> +++ b/fs/cifs/cifsacl.c
> @@ -33,6 +33,25 @@
>   #include "cifsproto.h"
>   #include "cifs_debug.h"
>   
> +static struct key_acl cifs_idmap_key_acl = {
> +	.usage	= REFCOUNT_INIT(1),
> +	.nr_ace	= 2,
> +	.possessor_viewable = true,
> +	.aces = {
> +		KEY_POSSESSOR_ACE(KEY_ACE_VIEW | KEY_ACE_SEARCH | KEY_ACE_READ),
> +		KEY_OWNER_ACE(KEY_ACE_VIEW),
> +	}
> +};
> +
> +static struct key_acl cifs_idmap_keyring_acl = {
> +	.usage	= REFCOUNT_INIT(1),
> +	.nr_ace	= 2,
> +	.aces = {
> +		KEY_POSSESSOR_ACE(KEY_ACE_SEARCH | KEY_ACE_WRITE),
> +		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ),
> +	}
> +};
> +
>   /* security id for everyone/world system group */
>   static const struct cifs_sid sid_everyone = {
>   	1, 1, {0, 0, 0, 0, 0, 1}, {0} };
> @@ -298,7 +317,8 @@ id_to_sid(unsigned int cid, uint sidtype, struct cifs_sid *ssid)
>   
>   	rc = 0;
>   	saved_cred = override_creds(root_cred);
> -	sidkey = request_key(&cifs_idmap_key_type, desc, "");
> +	sidkey = request_key(&cifs_idmap_key_type, desc, "",
> +			     &cifs_idmap_key_acl);
>   	if (IS_ERR(sidkey)) {
>   		rc = -EINVAL;
>   		cifs_dbg(FYI, "%s: Can't map %cid %u to a SID\n",
> @@ -403,7 +423,8 @@ sid_to_id(struct cifs_sb_info *cifs_sb, struct cifs_sid *psid,
>   		return -ENOMEM;
>   
>   	saved_cred = override_creds(root_cred);
> -	sidkey = request_key(&cifs_idmap_key_type, sidstr, "");
> +	sidkey = request_key(&cifs_idmap_key_type, sidstr, "",
> +			     &cifs_idmap_key_acl);
>   	if (IS_ERR(sidkey)) {
>   		rc = -EINVAL;
>   		cifs_dbg(FYI, "%s: Can't map SID %s to a %cid\n",
> @@ -481,8 +502,7 @@ init_cifs_idmap(void)
>   
>   	keyring = keyring_alloc(".cifs_idmap",
>   				GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, cred,
> -				(KEY_POS_ALL & ~KEY_POS_SETATTR) |
> -				KEY_USR_VIEW | KEY_USR_READ,
> +				&cifs_idmap_keyring_acl,
>   				KEY_ALLOC_NOT_IN_QUOTA, NULL, NULL);
>   	if (IS_ERR(keyring)) {
>   		ret = PTR_ERR(keyring);
> diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
> index 683310f26171..3b946fcf025c 100644
> --- a/fs/cifs/connect.c
> +++ b/fs/cifs/connect.c
> @@ -2903,7 +2903,7 @@ cifs_set_cifscreds(struct smb_vol *vol, struct cifs_ses *ses)
>   	}
>   
>   	cifs_dbg(FYI, "%s: desc=%s\n", __func__, desc);
> -	key = request_key(&key_type_logon, desc, "");
> +	key = request_key(&key_type_logon, desc, "", NULL);
>   	if (IS_ERR(key)) {
>   		if (!ses->domainName) {
>   			cifs_dbg(FYI, "domainName is NULL\n");
> @@ -2914,7 +2914,7 @@ cifs_set_cifscreds(struct smb_vol *vol, struct cifs_ses *ses)
>   		/* didn't work, try to find a domain key */
>   		sprintf(desc, "cifs:d:%s", ses->domainName);
>   		cifs_dbg(FYI, "%s: desc=%s\n", __func__, desc);
> -		key = request_key(&key_type_logon, desc, "");
> +		key = request_key(&key_type_logon, desc, "", NULL);
>   		if (IS_ERR(key)) {
>   			rc = PTR_ERR(key);
>   			goto out_err;
> diff --git a/fs/crypto/keyinfo.c b/fs/crypto/keyinfo.c
> index 1e11a683f63d..201e8715302b 100644
> --- a/fs/crypto/keyinfo.c
> +++ b/fs/crypto/keyinfo.c
> @@ -92,7 +92,7 @@ find_and_lock_process_key(const char *prefix,
>   	if (!description)
>   		return ERR_PTR(-ENOMEM);
>   
> -	key = request_key(&key_type_logon, description, NULL);
> +	key = request_key(&key_type_logon, description, NULL, NULL);
>   	kfree(description);
>   	if (IS_ERR(key))
>   		return key;
> diff --git a/fs/ecryptfs/ecryptfs_kernel.h b/fs/ecryptfs/ecryptfs_kernel.h
> index e74cb2a0b299..6460bd2a4e9d 100644
> --- a/fs/ecryptfs/ecryptfs_kernel.h
> +++ b/fs/ecryptfs/ecryptfs_kernel.h
> @@ -105,7 +105,7 @@ ecryptfs_get_encrypted_key_payload_data(struct key *key)
>   
>   static inline struct key *ecryptfs_get_encrypted_key(char *sig)
>   {
> -	return request_key(&key_type_encrypted, sig, NULL);
> +	return request_key(&key_type_encrypted, sig, NULL, NULL);
>   }
>   
>   #else
> diff --git a/fs/ecryptfs/keystore.c b/fs/ecryptfs/keystore.c
> index e74fe84d0886..38f4e30ed730 100644
> --- a/fs/ecryptfs/keystore.c
> +++ b/fs/ecryptfs/keystore.c
> @@ -1625,7 +1625,7 @@ int ecryptfs_keyring_auth_tok_for_sig(struct key **auth_tok_key,
>   {
>   	int rc = 0;
>   
> -	(*auth_tok_key) = request_key(&key_type_user, sig, NULL);
> +	(*auth_tok_key) = request_key(&key_type_user, sig, NULL, NULL);
>   	if (!(*auth_tok_key) || IS_ERR(*auth_tok_key)) {
>   		(*auth_tok_key) = ecryptfs_get_encrypted_key(sig);
>   		if (!(*auth_tok_key) || IS_ERR(*auth_tok_key)) {
> diff --git a/fs/fscache/object-list.c b/fs/fscache/object-list.c
> index 43e6e28c164f..6a672289e5ec 100644
> --- a/fs/fscache/object-list.c
> +++ b/fs/fscache/object-list.c
> @@ -321,7 +321,7 @@ static void fscache_objlist_config(struct fscache_objlist_data *data)
>   	const char *buf;
>   	int len;
>   
> -	key = request_key(&key_type_user, "fscache:objlist", NULL);
> +	key = request_key(&key_type_user, "fscache:objlist", NULL, NULL);
>   	if (IS_ERR(key))
>   		goto no_config;
>   
> diff --git a/fs/nfs/nfs4idmap.c b/fs/nfs/nfs4idmap.c
> index bf34ddaa2ad7..25f3f2f97ce9 100644
> --- a/fs/nfs/nfs4idmap.c
> +++ b/fs/nfs/nfs4idmap.c
> @@ -71,6 +71,25 @@ struct idmap {
>   	struct mutex		idmap_mutex;
>   };
>   
> +static struct key_acl nfs_idmap_key_acl = {
> +	.usage	= REFCOUNT_INIT(1),
> +	.nr_ace	= 2,
> +	.possessor_viewable = true,
> +	.aces = {
> +		KEY_POSSESSOR_ACE(KEY_ACE_VIEW | KEY_ACE_SEARCH | KEY_ACE_READ),
> +		KEY_OWNER_ACE(KEY_ACE_VIEW),
> +	}
> +};
> +
> +static struct key_acl nfs_idmap_keyring_acl = {
> +	.usage	= REFCOUNT_INIT(1),
> +	.nr_ace	= 2,
> +	.aces = {
> +		KEY_POSSESSOR_ACE(KEY_ACE_SEARCH | KEY_ACE_WRITE),
> +		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ),
> +	}
> +};
> +
>   /**
>    * nfs_fattr_init_names - initialise the nfs_fattr owner_name/group_name fields
>    * @fattr: fully initialised struct nfs_fattr
> @@ -200,8 +219,7 @@ int nfs_idmap_init(void)
>   
>   	keyring = keyring_alloc(".id_resolver",
>   				GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, cred,
> -				(KEY_POS_ALL & ~KEY_POS_SETATTR) |
> -				KEY_USR_VIEW | KEY_USR_READ,
> +				&nfs_idmap_keyring_acl,
>   				KEY_ALLOC_NOT_IN_QUOTA, NULL, NULL);
>   	if (IS_ERR(keyring)) {
>   		ret = PTR_ERR(keyring);
> @@ -278,11 +296,12 @@ static struct key *nfs_idmap_request_key(const char *name, size_t namelen,
>   	if (ret < 0)
>   		return ERR_PTR(ret);
>   
> -	rkey = request_key(&key_type_id_resolver, desc, "");
> +	rkey = request_key(&key_type_id_resolver, desc, "", &nfs_idmap_key_acl);
>   	if (IS_ERR(rkey)) {
>   		mutex_lock(&idmap->idmap_mutex);
>   		rkey = request_key_with_auxdata(&key_type_id_resolver_legacy,
> -						desc, "", 0, idmap);
> +						desc, "", 0, idmap,
> +						&nfs_idmap_key_acl);
>   		mutex_unlock(&idmap->idmap_mutex);
>   	}
>   	if (!IS_ERR(rkey))
> @@ -311,8 +330,6 @@ static ssize_t nfs_idmap_get_key(const char *name, size_t namelen,
>   	}
>   
>   	rcu_read_lock();
> -	rkey->perm |= KEY_USR_VIEW;
> -
>   	ret = key_validate(rkey);
>   	if (ret < 0)
>   		goto out_up;
> diff --git a/fs/ubifs/auth.c b/fs/ubifs/auth.c
> index 5bf5fd08879e..38bae9737166 100644
> --- a/fs/ubifs/auth.c
> +++ b/fs/ubifs/auth.c
> @@ -246,7 +246,7 @@ int ubifs_init_authentication(struct ubifs_info *c)
>   	snprintf(hmac_name, CRYPTO_MAX_ALG_NAME, "hmac(%s)",
>   		 c->auth_hash_name);
>   
> -	keyring_key = request_key(&key_type_logon, c->auth_key_name, NULL);
> +	keyring_key = request_key(&key_type_logon, c->auth_key_name, NULL, NULL);
>   
>   	if (IS_ERR(keyring_key)) {
>   		ubifs_err(c, "Failed to request key: %ld",
> diff --git a/include/linux/key.h b/include/linux/key.h
> index de190036512b..a38b89bd414c 100644
> --- a/include/linux/key.h
> +++ b/include/linux/key.h
> @@ -32,49 +32,14 @@
>   /* key handle serial number */
>   typedef int32_t key_serial_t;
>   
> -/* key handle permissions mask */
> -typedef uint32_t key_perm_t;
> -
>   struct key;
>   struct net;
>   
>   #ifdef CONFIG_KEYS
>   
> -#undef KEY_DEBUGGING
> +#include <linux/keyctl.h>
>   
> -#define KEY_POS_VIEW	0x01000000	/* possessor can view a key's attributes */
> -#define KEY_POS_READ	0x02000000	/* possessor can read key payload / view keyring */
> -#define KEY_POS_WRITE	0x04000000	/* possessor can update key payload / add link to keyring */
> -#define KEY_POS_SEARCH	0x08000000	/* possessor can find a key in search / search a keyring */
> -#define KEY_POS_LINK	0x10000000	/* possessor can create a link to a key/keyring */
> -#define KEY_POS_SETATTR	0x20000000	/* possessor can set key attributes */
> -#define KEY_POS_ALL	0x3f000000
> -
> -#define KEY_USR_VIEW	0x00010000	/* user permissions... */
> -#define KEY_USR_READ	0x00020000
> -#define KEY_USR_WRITE	0x00040000
> -#define KEY_USR_SEARCH	0x00080000
> -#define KEY_USR_LINK	0x00100000
> -#define KEY_USR_SETATTR	0x00200000
> -#define KEY_USR_ALL	0x003f0000
> -
> -#define KEY_GRP_VIEW	0x00000100	/* group permissions... */
> -#define KEY_GRP_READ	0x00000200
> -#define KEY_GRP_WRITE	0x00000400
> -#define KEY_GRP_SEARCH	0x00000800
> -#define KEY_GRP_LINK	0x00001000
> -#define KEY_GRP_SETATTR	0x00002000
> -#define KEY_GRP_ALL	0x00003f00
> -
> -#define KEY_OTH_VIEW	0x00000001	/* third party permissions... */
> -#define KEY_OTH_READ	0x00000002
> -#define KEY_OTH_WRITE	0x00000004
> -#define KEY_OTH_SEARCH	0x00000008
> -#define KEY_OTH_LINK	0x00000010
> -#define KEY_OTH_SETATTR	0x00000020
> -#define KEY_OTH_ALL	0x0000003f
> -
> -#define KEY_PERM_UNDEF	0xffffffff
> +#undef KEY_DEBUGGING
>   
>   struct seq_file;
>   struct user_struct;
> @@ -118,6 +83,36 @@ union key_payload {
>   	void			*data[4];
>   };
>   
> +struct key_ace {
> +	unsigned int		type;
> +	unsigned int		perm;
> +	union {
> +		kuid_t		uid;
> +		kgid_t		gid;
> +		unsigned int	subject_id;
> +	};
> +};
> +
> +struct key_acl {
> +	refcount_t		usage;
> +	unsigned short		nr_ace;
> +	bool			possessor_viewable;
> +	struct rcu_head		rcu;
> +	struct key_ace		aces[];
> +};
> +
> +#define KEY_POSSESSOR_ACE(perms) {			\
> +		.type = KEY_ACE_SUBJ_STANDARD,		\
> +		.perm = perms,				\
> +		.subject_id = KEY_ACE_POSSESSOR		\
> +	}
> +
> +#define KEY_OWNER_ACE(perms) {				\
> +		.type = KEY_ACE_SUBJ_STANDARD,		\
> +		.perm = perms,				\
> +		.subject_id = KEY_ACE_OWNER		\
> +	}
> +
>   /*****************************************************************************/
>   /*
>    * key reference with possession attribute handling
> @@ -187,6 +182,7 @@ struct key {
>   	struct rw_semaphore	sem;		/* change vs change sem */
>   	struct key_user		*user;		/* owner of this key */
>   	void			*security;	/* security data for this key */
> +	struct key_acl		__rcu *acl;
>   	union {
>   		time64_t	expiry;		/* time at which key expires (or 0) */
>   		time64_t	revoked_at;	/* time at which key was revoked */
> @@ -194,7 +190,6 @@ struct key {
>   	time64_t		last_used_at;	/* last time used for LRU keyring discard */
>   	kuid_t			uid;
>   	kgid_t			gid;
> -	key_perm_t		perm;		/* access permissions */
>   	unsigned short		quotalen;	/* length added to quota */
>   	unsigned short		datalen;	/* payload data length
>   						 * - may not match RCU dereferenced payload
> @@ -220,6 +215,7 @@ struct key {
>   #define KEY_FLAG_UID_KEYRING	9	/* set if key is a user or user session keyring */
>   #define KEY_FLAG_SET_WATCH_PROXY 10	/* Set if watch_proxy should be set on added keys */
>   #define KEY_FLAG_SEEN		11	/* Set if returned by keyctl_find_oldest_key() */
> +#define KEY_FLAG_HAS_ACL	12	/* Set if KEYCTL_SETACL called on key */
>   
>   	/* the key type and key description string
>   	 * - the desc is used to match a key against search criteria
> @@ -268,7 +264,7 @@ extern struct key *key_alloc(struct key_type *type,
>   			     const char *desc,
>   			     kuid_t uid, kgid_t gid,
>   			     const struct cred *cred,
> -			     key_perm_t perm,
> +			     struct key_acl *acl,
>   			     unsigned long flags,
>   			     struct key_restriction *restrict_link);
>   
> @@ -304,18 +300,21 @@ static inline void key_ref_put(key_ref_t key_ref)
>   
>   extern struct key *request_key(struct key_type *type,
>   			       const char *description,
> -			       const char *callout_info);
> +			       const char *callout_info,
> +			       struct key_acl *acl);
>   
>   extern struct key *request_key_with_auxdata(struct key_type *type,
>   					    const char *description,
>   					    const void *callout_info,
>   					    size_t callout_len,
> -					    void *aux);
> +					    void *aux,
> +					    struct key_acl *acl);
>   
>   extern struct key *request_key_net(struct key_type *type,
>   				   const char *description,
>   				   struct net *net,
> -				   const char *callout_info);
> +				   const char *callout_info,
> +				   struct key_acl *acl);
>   
>   extern int wait_for_key_construction(struct key *key, bool intr);
>   
> @@ -326,7 +325,7 @@ extern key_ref_t key_create_or_update(key_ref_t keyring,
>   				      const char *description,
>   				      const void *payload,
>   				      size_t plen,
> -				      key_perm_t perm,
> +				      struct key_acl *acl,
>   				      unsigned long flags);
>   
>   extern int key_update(key_ref_t key,
> @@ -346,7 +345,7 @@ extern int key_unlink(struct key *keyring,
>   
>   extern struct key *keyring_alloc(const char *description, kuid_t uid, kgid_t gid,
>   				 const struct cred *cred,
> -				 key_perm_t perm,
> +				 struct key_acl *acl,
>   				 unsigned long flags,
>   				 struct key_restriction *restrict_link,
>   				 struct key *dest);
> @@ -378,19 +377,29 @@ static inline key_serial_t key_serial(const struct key *key)
>   extern void key_set_timeout(struct key *, unsigned);
>   
>   extern key_ref_t lookup_user_key(key_serial_t id, unsigned long flags,
> -				 key_perm_t perm);
> +				 u32 desired_perm);
>   extern void key_free_user_ns(struct user_namespace *);
>   
>   /*
>    * The permissions required on a key that we're looking up.
>    */
> -#define	KEY_NEED_VIEW	0x01	/* Require permission to view attributes */
> -#define	KEY_NEED_READ	0x02	/* Require permission to read content */
> -#define	KEY_NEED_WRITE	0x04	/* Require permission to update / modify */
> -#define	KEY_NEED_SEARCH	0x08	/* Require permission to search (keyring) or find (key) */
> -#define	KEY_NEED_LINK	0x10	/* Require permission to link */
> -#define	KEY_NEED_SETATTR 0x20	/* Require permission to change attributes */
> -#define	KEY_NEED_ALL	0x3f	/* All the above permissions */
> +#define	KEY_NEED_VIEW	0x001	/* Require permission to view attributes */
> +#define	KEY_NEED_READ	0x002	/* Require permission to read content */
> +#define	KEY_NEED_WRITE	0x004	/* Require permission to update / modify */
> +#define	KEY_NEED_SEARCH	0x008	/* Require permission to search (keyring) or find (key) */
> +#define	KEY_NEED_LINK	0x010	/* Require permission to link */
> +#define	KEY_NEED_SETSEC	0x020	/* Require permission to set owner, group, ACL */
> +#define	KEY_NEED_INVAL	0x040	/* Require permission to invalidate key */
> +#define	KEY_NEED_REVOKE	0x080	/* Require permission to revoke key */
> +#define	KEY_NEED_JOIN	0x100	/* Require permission to join keyring as session */
> +#define	KEY_NEED_CLEAR	0x200	/* Require permission to clear a keyring */
> +#define KEY_NEED_ALL	0x3ff
> +
> +#define OLD_KEY_NEED_SETATTR 0x20 /* Used to be Require permission to change attributes */
> +
> +extern struct key_acl internal_key_acl;
> +extern struct key_acl internal_keyring_acl;
> +extern struct key_acl internal_writable_keyring_acl;
>   
>   static inline short key_read_state(const struct key *key)
>   {
> diff --git a/include/uapi/linux/keyctl.h b/include/uapi/linux/keyctl.h
> index a2afb4512f34..50d7b6ca82ab 100644
> --- a/include/uapi/linux/keyctl.h
> +++ b/include/uapi/linux/keyctl.h
> @@ -15,6 +15,69 @@
>   
>   #include <linux/types.h>
>   
> +/*
> + * Keyring permission grant definitions
> + */
> +enum key_ace_subject_type {
> +	KEY_ACE_SUBJ_STANDARD	= 0,	/* subject is one of key_ace_standard_subject */
> +	nr__key_ace_subject_type
> +};
> +
> +enum key_ace_standard_subject {
> +	KEY_ACE_EVERYONE	= 0,	/* Everyone, including owner and group */
> +	KEY_ACE_GROUP		= 1,	/* The key's group */
> +	KEY_ACE_OWNER		= 2,	/* The owner of the key */
> +	KEY_ACE_POSSESSOR	= 3,	/* Any process that possesses the key */
> +	nr__key_ace_standard_subject
> +};
> +
> +#define KEY_ACE_VIEW		0x00000001 /* Can describe the key */
> +#define KEY_ACE_READ		0x00000002 /* Can read the key content */
> +#define KEY_ACE_WRITE		0x00000004 /* Can update/modify the key content */
> +#define KEY_ACE_SEARCH		0x00000008 /* Can find the key by search */
> +#define KEY_ACE_LINK		0x00000010 /* Can make a link to the key */
> +#define KEY_ACE_SET_SECURITY	0x00000020 /* Can set owner, group, ACL */
> +#define KEY_ACE_INVAL		0x00000040 /* Can invalidate the key */
> +#define KEY_ACE_REVOKE		0x00000080 /* Can revoke the key */
> +#define KEY_ACE_JOIN		0x00000100 /* Can join keyring */
> +#define KEY_ACE_CLEAR		0x00000200 /* Can clear keyring */
> +#define KEY_ACE__PERMS		0xffffffff
> +
> +/*
> + * Old-style permissions mask, deprecated in favour of ACL.
> + */
> +#define KEY_POS_VIEW	0x01000000	/* possessor can view a key's attributes */
> +#define KEY_POS_READ	0x02000000	/* possessor can read key payload / view keyring */
> +#define KEY_POS_WRITE	0x04000000	/* possessor can update key payload / add link to keyring */
> +#define KEY_POS_SEARCH	0x08000000	/* possessor can find a key in search / search a keyring */
> +#define KEY_POS_LINK	0x10000000	/* possessor can create a link to a key/keyring */
> +#define KEY_POS_SETATTR	0x20000000	/* possessor can set key attributes */
> +#define KEY_POS_ALL	0x3f000000
> +
> +#define KEY_USR_VIEW	0x00010000	/* user permissions... */
> +#define KEY_USR_READ	0x00020000
> +#define KEY_USR_WRITE	0x00040000
> +#define KEY_USR_SEARCH	0x00080000
> +#define KEY_USR_LINK	0x00100000
> +#define KEY_USR_SETATTR	0x00200000
> +#define KEY_USR_ALL	0x003f0000
> +
> +#define KEY_GRP_VIEW	0x00000100	/* group permissions... */
> +#define KEY_GRP_READ	0x00000200
> +#define KEY_GRP_WRITE	0x00000400
> +#define KEY_GRP_SEARCH	0x00000800
> +#define KEY_GRP_LINK	0x00001000
> +#define KEY_GRP_SETATTR	0x00002000
> +#define KEY_GRP_ALL	0x00003f00
> +
> +#define KEY_OTH_VIEW	0x00000001	/* third party permissions... */
> +#define KEY_OTH_READ	0x00000002
> +#define KEY_OTH_WRITE	0x00000004
> +#define KEY_OTH_SEARCH	0x00000008
> +#define KEY_OTH_LINK	0x00000010
> +#define KEY_OTH_SETATTR	0x00000020
> +#define KEY_OTH_ALL	0x0000003f
> +
>   /* special process keyring shortcut IDs */
>   #define KEY_SPEC_THREAD_KEYRING		-1	/* - key ID for thread-specific keyring */
>   #define KEY_SPEC_PROCESS_KEYRING	-2	/* - key ID for process-specific keyring */
> diff --git a/lib/digsig.c b/lib/digsig.c
> index 6ba6fcd92dd1..8cfa53585267 100644
> --- a/lib/digsig.c
> +++ b/lib/digsig.c
> @@ -227,7 +227,7 @@ int digsig_verify(struct key *keyring, const char *sig, int siglen,
>   		else
>   			key = key_ref_to_ptr(kref);
>   	} else {
> -		key = request_key(&key_type_user, name, NULL);
> +		key = request_key(&key_type_user, name, NULL, NULL);
>   	}
>   	if (IS_ERR(key)) {
>   		pr_err("key not found, id: %s\n", name);
> diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c
> index 9cab80207ced..c6efe800392e 100644
> --- a/net/ceph/ceph_common.c
> +++ b/net/ceph/ceph_common.c
> @@ -305,7 +305,7 @@ static int get_secret(struct ceph_crypto_key *dst, const char *name) {
>   	int err = 0;
>   	struct ceph_crypto_key *ckey;
>   
> -	ukey = request_key(&key_type_ceph, name, NULL);
> +	ukey = request_key(&key_type_ceph, name, NULL, NULL);
>   	if (IS_ERR(ukey)) {
>   		/* request_key errors don't map nicely to mount(2)
>   		   errors; don't even try, but still printk */
> diff --git a/net/dns_resolver/dns_key.c b/net/dns_resolver/dns_key.c
> index 3e1a90669006..6b201531b165 100644
> --- a/net/dns_resolver/dns_key.c
> +++ b/net/dns_resolver/dns_key.c
> @@ -46,6 +46,15 @@ const struct cred *dns_resolver_cache;
>   
>   #define	DNS_ERRORNO_OPTION	"dnserror"
>   
> +static struct key_acl dns_keyring_acl = {
> +	.usage	= REFCOUNT_INIT(1),
> +	.nr_ace	= 2,
> +	.aces = {
> +		KEY_POSSESSOR_ACE(KEY_ACE_SEARCH | KEY_ACE_WRITE),
> +		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ | KEY_ACE_CLEAR),
> +	}
> +};
> +
>   /*
>    * Preparse instantiation data for a dns_resolver key.
>    *
> @@ -343,8 +352,7 @@ static int __init init_dns_resolver(void)
>   
>   	keyring = keyring_alloc(".dns_resolver",
>   				GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, cred,
> -				(KEY_POS_ALL & ~KEY_POS_SETATTR) |
> -				KEY_USR_VIEW | KEY_USR_READ,
> +				&dns_keyring_acl,
>   				KEY_ALLOC_NOT_IN_QUOTA, NULL, NULL);
>   	if (IS_ERR(keyring)) {
>   		ret = PTR_ERR(keyring);
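
For anyone trying to read the static ACL tables above, here is a rough
userspace sketch of how I understand them.  The ACE bit values and subject
numbers are copied from the quoted uapi hunk.  Everything with a mock_
prefix is my own stand-in, not something from the patch: the refcount and
RCU head are dropped, the flexible array is replaced by a fixed-size one so
the static initialiser is portable, and mock_acl_grants() is a hypothetical
helper rather than the patch's key_task_permission(), which is not shown in
this part of the diff.

```c
#include <stdbool.h>
#include <stddef.h>

/* Values copied from the quoted include/uapi/linux/keyctl.h hunk. */
#define KEY_ACE_VIEW	0x00000001
#define KEY_ACE_READ	0x00000002
#define KEY_ACE_WRITE	0x00000004
#define KEY_ACE_SEARCH	0x00000008
#define KEY_ACE_CLEAR	0x00000200

enum { KEY_ACE_SUBJ_STANDARD = 0 };
enum { KEY_ACE_EVERYONE, KEY_ACE_GROUP, KEY_ACE_OWNER, KEY_ACE_POSSESSOR };

/* Userspace stand-ins (assumption): no refcount, no RCU, fixed-size array. */
struct mock_key_ace {
	unsigned int type;
	unsigned int perm;
	unsigned int subject_id;
};

struct mock_key_acl {
	unsigned short nr_ace;
	struct mock_key_ace aces[4];
};

#define MOCK_POSSESSOR_ACE(perms) \
	{ .type = KEY_ACE_SUBJ_STANDARD, .perm = (perms), \
	  .subject_id = KEY_ACE_POSSESSOR }
#define MOCK_OWNER_ACE(perms) \
	{ .type = KEY_ACE_SUBJ_STANDARD, .perm = (perms), \
	  .subject_id = KEY_ACE_OWNER }

/* Same grants as dns_keyring_acl in the quoted hunk. */
static const struct mock_key_acl mock_dns_keyring_acl = {
	.nr_ace = 2,
	.aces = {
		MOCK_POSSESSOR_ACE(KEY_ACE_SEARCH | KEY_ACE_WRITE),
		MOCK_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ | KEY_ACE_CLEAR),
	},
};

/* Hypothetical helper: OR together the grants of every standard ACE that
 * names @subject. */
static unsigned int mock_acl_grants(const struct mock_key_acl *acl,
				    unsigned int subject)
{
	unsigned int allow = 0;
	size_t i;

	for (i = 0; i < acl->nr_ace; i++)
		if (acl->aces[i].type == KEY_ACE_SUBJ_STANDARD &&
		    acl->aces[i].subject_id == subject)
			allow |= acl->aces[i].perm;
	return allow;
}

/* True only if every bit in @desired is granted to @subject. */
static bool mock_acl_allows(const struct mock_key_acl *acl,
			    unsigned int subject, unsigned int desired)
{
	return (mock_acl_grants(acl, subject) & desired) == desired;
}
```

Read this way, the possessor of .dns_resolver gets Search and Write but not
Clear, while the owner gets View, Read and Clear.  That is narrower than the
old (KEY_POS_ALL & ~KEY_POS_SETATTR) mask, which I assume is intended.
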
> diff --git a/net/dns_resolver/dns_query.c b/net/dns_resolver/dns_query.c
> index d88ea98da63e..3a6436a7931a 100644
> --- a/net/dns_resolver/dns_query.c
> +++ b/net/dns_resolver/dns_query.c
> @@ -46,6 +46,16 @@
>   
>   #include "internal.h"
>   
> +static struct key_acl dns_key_acl = {
> +	.usage	= REFCOUNT_INIT(1),
> +	.nr_ace	= 2,
> +	.possessor_viewable = true,
> +	.aces = {
> +		KEY_POSSESSOR_ACE(KEY_ACE_VIEW | KEY_ACE_SEARCH | KEY_ACE_READ),
> +		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_INVAL),
> +	}
> +};
> +
>   /**
>    * dns_query - Query the DNS
>    * @net: The network namespace to operate in.
> @@ -124,7 +134,8 @@ int dns_query(struct net *net,
>   	 * add_key() to preinstall malicious redirections
>   	 */
>   	saved_cred = override_creds(dns_resolver_cache);
> -	rkey = request_key_net(&key_type_dns_resolver, desc, net, options);
> +	rkey = request_key_net(&key_type_dns_resolver, desc, net, options,
> +			       &dns_key_acl);
>   	revert_creds(saved_cred);
>   	kfree(desc);
>   	if (IS_ERR(rkey)) {
> @@ -134,8 +145,6 @@ int dns_query(struct net *net,
>   
>   	down_read(&rkey->sem);
>   	set_bit(KEY_FLAG_ROOT_CAN_INVAL, &rkey->flags);
> -	rkey->perm |= KEY_USR_VIEW;
> -
>   	ret = key_validate(rkey);
>   	if (ret < 0)
>   		goto put;
> diff --git a/net/rxrpc/key.c b/net/rxrpc/key.c
> index 1cc6b0c6cc42..284d7a025fbc 100644
> --- a/net/rxrpc/key.c
> +++ b/net/rxrpc/key.c
> @@ -27,6 +27,14 @@
>   #include <keys/user-type.h>
>   #include "ar-internal.h"
>   
> +static struct key_acl rxrpc_null_key_acl = {
> +	.usage	= REFCOUNT_INIT(1),
> +	.nr_ace	= 1,
> +	.aces = {
> +		KEY_POSSESSOR_ACE(KEY_ACE_SEARCH | KEY_ACE_READ),
> +	}
> +};
> +
>   static int rxrpc_vet_description_s(const char *);
>   static int rxrpc_preparse(struct key_preparsed_payload *);
>   static int rxrpc_preparse_s(struct key_preparsed_payload *);
> @@ -914,7 +922,8 @@ int rxrpc_request_key(struct rxrpc_sock *rx, char __user *optval, int optlen)
>   	if (IS_ERR(description))
>   		return PTR_ERR(description);
>   
> -	key = request_key_net(&key_type_rxrpc, description, sock_net(&rx->sk), NULL);
> +	key = request_key_net(&key_type_rxrpc, description, sock_net(&rx->sk),
> +			      NULL, NULL);
>   	if (IS_ERR(key)) {
>   		kfree(description);
>   		_leave(" = %ld", PTR_ERR(key));
> @@ -945,7 +954,8 @@ int rxrpc_server_keyring(struct rxrpc_sock *rx, char __user *optval,
>   	if (IS_ERR(description))
>   		return PTR_ERR(description);
>   
> -	key = request_key_net(&key_type_keyring, description, sock_net(&rx->sk), NULL);
> +	key = request_key_net(&key_type_keyring, description, sock_net(&rx->sk),
> +			      NULL, NULL);
>   	if (IS_ERR(key)) {
>   		kfree(description);
>   		_leave(" = %ld", PTR_ERR(key));
> @@ -1026,7 +1036,7 @@ struct key *rxrpc_get_null_key(const char *keyname)
>   
>   	key = key_alloc(&key_type_rxrpc, keyname,
>   			GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, cred,
> -			KEY_POS_SEARCH, KEY_ALLOC_NOT_IN_QUOTA, NULL);
> +			&rxrpc_null_key_acl, KEY_ALLOC_NOT_IN_QUOTA, NULL);
>   	if (IS_ERR(key))
>   		return key;
>   
> diff --git a/security/integrity/digsig.c b/security/integrity/digsig.c
> index f45d6edecf99..c666dc72006a 100644
> --- a/security/integrity/digsig.c
> +++ b/security/integrity/digsig.c
> @@ -51,7 +51,8 @@ int integrity_digsig_verify(const unsigned int id, const char *sig, int siglen,
>   
>   	if (!keyring[id]) {
>   		keyring[id] =
> -			request_key(&key_type_keyring, keyring_name[id], NULL);
> +			request_key(&key_type_keyring, keyring_name[id],
> +				    NULL, NULL);
>   		if (IS_ERR(keyring[id])) {
>   			int err = PTR_ERR(keyring[id]);
>   			pr_err("no %s keyring: %d\n", keyring_name[id], err);
> @@ -73,14 +74,14 @@ int integrity_digsig_verify(const unsigned int id, const char *sig, int siglen,
>   	return -EOPNOTSUPP;
>   }
>   
> -static int __integrity_init_keyring(const unsigned int id, key_perm_t perm,
> +static int __integrity_init_keyring(const unsigned int id, struct key_acl *acl,
>   				    struct key_restriction *restriction)
>   {
>   	const struct cred *cred = current_cred();
>   	int err = 0;
>   
>   	keyring[id] = keyring_alloc(keyring_name[id], KUIDT_INIT(0),
> -				    KGIDT_INIT(0), cred, perm,
> +				    KGIDT_INIT(0), cred, acl,
>   				    KEY_ALLOC_NOT_IN_QUOTA, restriction, NULL);
>   	if (IS_ERR(keyring[id])) {
>   		err = PTR_ERR(keyring[id]);
> @@ -95,10 +96,7 @@ static int __integrity_init_keyring(const unsigned int id, key_perm_t perm,
>   int __init integrity_init_keyring(const unsigned int id)
>   {
>   	struct key_restriction *restriction;
> -	key_perm_t perm;
> -
> -	perm = (KEY_POS_ALL & ~KEY_POS_SETATTR) | KEY_USR_VIEW
> -		| KEY_USR_READ | KEY_USR_SEARCH;
> +	struct key_acl *acl = &internal_keyring_acl;
>   
>   	if (id == INTEGRITY_KEYRING_PLATFORM) {
>   		restriction = NULL;
> @@ -113,14 +111,14 @@ int __init integrity_init_keyring(const unsigned int id)
>   		return -ENOMEM;
>   
>   	restriction->check = restrict_link_to_ima;
> -	perm |= KEY_USR_WRITE;
> +	acl = &internal_writable_keyring_acl;
>   
>   out:
> -	return __integrity_init_keyring(id, perm, restriction);
> +	return __integrity_init_keyring(id, acl, restriction);
>   }
>   
> -int __init integrity_add_key(const unsigned int id, const void *data,
> -			     off_t size, key_perm_t perm)
> +static int __init integrity_add_key(const unsigned int id, const void *data,
> +				    off_t size, struct key_acl *acl)
>   {
>   	key_ref_t key;
>   	int rc = 0;
> @@ -129,7 +127,7 @@ int __init integrity_add_key(const unsigned int id, const void *data,
>   		return -EINVAL;
>   
>   	key = key_create_or_update(make_key_ref(keyring[id], 1), "asymmetric",
> -				   NULL, data, size, perm,
> +				   NULL, data, size, acl ?: &internal_key_acl,
>   				   KEY_ALLOC_NOT_IN_QUOTA);
>   	if (IS_ERR(key)) {
>   		rc = PTR_ERR(key);
> @@ -149,7 +147,6 @@ int __init integrity_load_x509(const unsigned int id, const char *path)
>   	void *data;
>   	loff_t size;
>   	int rc;
> -	key_perm_t perm;
>   
>   	rc = kernel_read_file_from_path(path, &data, &size, 0,
>   					READING_X509_CERTIFICATE);
> @@ -158,21 +155,19 @@ int __init integrity_load_x509(const unsigned int id, const char *path)
>   		return rc;
>   	}
>   
> -	perm = (KEY_POS_ALL & ~KEY_POS_SETATTR) | KEY_USR_VIEW | KEY_USR_READ;
> -
>   	pr_info("Loading X.509 certificate: %s\n", path);
> -	rc = integrity_add_key(id, (const void *)data, size, perm);
> +	rc = integrity_add_key(id, data, size, NULL);
>   
>   	vfree(data);
>   	return rc;
>   }
>   
>   int __init integrity_load_cert(const unsigned int id, const char *source,
> -			       const void *data, size_t len, key_perm_t perm)
> +			       const void *data, size_t len, struct key_acl *acl)
>   {
>   	if (!data)
>   		return -EINVAL;
>   
>   	pr_info("Loading X.509 certificate: %s\n", source);
> -	return integrity_add_key(id, data, len, perm);
> +	return integrity_add_key(id, data, len, acl);
>   }
> diff --git a/security/integrity/digsig_asymmetric.c b/security/integrity/digsig_asymmetric.c
> index d775e03fbbcc..017cb6db521d 100644
> --- a/security/integrity/digsig_asymmetric.c
> +++ b/security/integrity/digsig_asymmetric.c
> @@ -57,7 +57,7 @@ static struct key *request_asymmetric_key(struct key *keyring, uint32_t keyid)
>   		else
>   			key = key_ref_to_ptr(kref);
>   	} else {
> -		key = request_key(&key_type_asymmetric, name, NULL);
> +		key = request_key(&key_type_asymmetric, name, NULL, NULL);
>   	}
>   
>   	if (IS_ERR(key)) {
> diff --git a/security/integrity/evm/evm_crypto.c b/security/integrity/evm/evm_crypto.c
> index 43e2dc3a60d0..945f42b762e4 100644
> --- a/security/integrity/evm/evm_crypto.c
> +++ b/security/integrity/evm/evm_crypto.c
> @@ -358,7 +358,7 @@ int evm_init_key(void)
>   	struct encrypted_key_payload *ekp;
>   	int rc;
>   
> -	evm_key = request_key(&key_type_encrypted, EVMKEY, NULL);
> +	evm_key = request_key(&key_type_encrypted, EVMKEY, NULL, NULL);
>   	if (IS_ERR(evm_key))
>   		return -ENOENT;
>   
> diff --git a/security/integrity/ima/ima_mok.c b/security/integrity/ima/ima_mok.c
> index 073ddc9bce5b..ce48303cfacc 100644
> --- a/security/integrity/ima/ima_mok.c
> +++ b/security/integrity/ima/ima_mok.c
> @@ -21,6 +21,15 @@
>   #include <keys/system_keyring.h>
>   
>   
> +static struct key_acl integrity_blacklist_keyring_acl = {
> +	.usage	= REFCOUNT_INIT(1),
> +	.nr_ace	= 2,
> +	.aces = {
> +		KEY_POSSESSOR_ACE(KEY_ACE_SEARCH | KEY_ACE_WRITE),
> +		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ | KEY_ACE_WRITE | KEY_ACE_SEARCH),
> +	}
> +};
> +
>   struct key *ima_blacklist_keyring;
>   
>   /*
> @@ -40,9 +49,7 @@ __init int ima_mok_init(void)
>   
>   	ima_blacklist_keyring = keyring_alloc(".ima_blacklist",
>   				KUIDT_INIT(0), KGIDT_INIT(0), current_cred(),
> -				(KEY_POS_ALL & ~KEY_POS_SETATTR) |
> -				KEY_USR_VIEW | KEY_USR_READ |
> -				KEY_USR_WRITE | KEY_USR_SEARCH,
> +				&integrity_blacklist_keyring_acl,
>   				KEY_ALLOC_NOT_IN_QUOTA,
>   				restriction, NULL);
>   
> diff --git a/security/integrity/integrity.h b/security/integrity/integrity.h
> index 7de59f44cba3..fbc1264af55f 100644
> --- a/security/integrity/integrity.h
> +++ b/security/integrity/integrity.h
> @@ -154,7 +154,7 @@ int integrity_digsig_verify(const unsigned int id, const char *sig, int siglen,
>   int __init integrity_init_keyring(const unsigned int id);
>   int __init integrity_load_x509(const unsigned int id, const char *path);
>   int __init integrity_load_cert(const unsigned int id, const char *source,
> -			       const void *data, size_t len, key_perm_t perm);
> +			       const void *data, size_t len, struct key_acl *acl);
>   #else
>   
>   static inline int integrity_digsig_verify(const unsigned int id,
> @@ -172,7 +172,7 @@ static inline int integrity_init_keyring(const unsigned int id)
>   static inline int __init integrity_load_cert(const unsigned int id,
>   					     const char *source,
>   					     const void *data, size_t len,
> -					     key_perm_t perm)
> +					     struct key_acl *acl)
>   {
>   	return 0;
>   }
> diff --git a/security/integrity/platform_certs/platform_keyring.c b/security/integrity/platform_certs/platform_keyring.c
> index bcafd7387729..80bb6f750045 100644
> --- a/security/integrity/platform_certs/platform_keyring.c
> +++ b/security/integrity/platform_certs/platform_keyring.c
> @@ -14,6 +14,15 @@
>   #include <linux/slab.h>
>   #include "../integrity.h"
>   
> +static struct key_acl platform_key_acl = {
> +	.usage	= REFCOUNT_INIT(1),
> +	.nr_ace	= 2,
> +	.aces = {
> +		KEY_POSSESSOR_ACE(KEY_ACE_SEARCH | KEY_ACE_READ),
> +		KEY_OWNER_ACE(KEY_ACE_VIEW),
> +	}
> +};
> +
>   /**
>    * add_to_platform_keyring - Add to platform keyring without validation.
>    * @source: Source of key
> @@ -29,10 +38,8 @@ void __init add_to_platform_keyring(const char *source, const void *data,
>   	key_perm_t perm;
>   	int rc;
>   
> -	perm = (KEY_POS_ALL & ~KEY_POS_SETATTR) | KEY_USR_VIEW;
> -
>   	rc = integrity_load_cert(INTEGRITY_KEYRING_PLATFORM, source, data, len,
> -				 perm);
> +				 &platform_key_acl);
>   	if (rc)
>   		pr_info("Error adding keys to platform keyring %s\n", source);
>   }
> diff --git a/security/keys/encrypted-keys/encrypted.c b/security/keys/encrypted-keys/encrypted.c
> index 389a298274d3..376068ec5a4e 100644
> --- a/security/keys/encrypted-keys/encrypted.c
> +++ b/security/keys/encrypted-keys/encrypted.c
> @@ -307,7 +307,7 @@ static struct key *request_user_key(const char *master_desc, const u8 **master_k
>   	const struct user_key_payload *upayload;
>   	struct key *ukey;
>   
> -	ukey = request_key(&key_type_user, master_desc, NULL);
> +	ukey = request_key(&key_type_user, master_desc, NULL, NULL);
>   	if (IS_ERR(ukey))
>   		goto error;
>   
> diff --git a/security/keys/encrypted-keys/masterkey_trusted.c b/security/keys/encrypted-keys/masterkey_trusted.c
> index dc3d18cae642..3322e7eeafce 100644
> --- a/security/keys/encrypted-keys/masterkey_trusted.c
> +++ b/security/keys/encrypted-keys/masterkey_trusted.c
> @@ -33,7 +33,7 @@ struct key *request_trusted_key(const char *trusted_desc,
>   	struct trusted_key_payload *tpayload;
>   	struct key *tkey;
>   
> -	tkey = request_key(&key_type_trusted, trusted_desc, NULL);
> +	tkey = request_key(&key_type_trusted, trusted_desc, NULL, NULL);
>   	if (IS_ERR(tkey))
>   		goto error;
>   
> diff --git a/security/keys/gc.c b/security/keys/gc.c
> index c39721163d43..cb667becf224 100644
> --- a/security/keys/gc.c
> +++ b/security/keys/gc.c
> @@ -160,6 +160,7 @@ static noinline void key_gc_unused_keys(struct list_head *keys)
>   
>   		key_user_put(key->user);
>   		key_put_tag(key->domain_tag);
> +		key_put_acl(key->acl);
>   		kfree(key->description);
>   
>   		memzero_explicit(key, sizeof(*key));
> @@ -229,7 +230,6 @@ static void key_garbage_collector(struct work_struct *work)
>   			if (key->type == key_gc_dead_keytype) {
>   				gc_state |= KEY_GC_FOUND_DEAD_KEY;
>   				set_bit(KEY_FLAG_DEAD, &key->flags);
> -				key->perm = 0;
>   				goto skip_dead_key;
>   			} else if (key->type == &key_type_keyring &&
>   				   key->restrict_link) {
> diff --git a/security/keys/internal.h b/security/keys/internal.h
> index 6be76caee874..9f9ecc1810c9 100644
> --- a/security/keys/internal.h
> +++ b/security/keys/internal.h
> @@ -89,8 +89,11 @@ extern struct rb_root key_serial_tree;
>   extern spinlock_t key_serial_lock;
>   extern struct mutex key_construction_mutex;
>   extern wait_queue_head_t request_key_conswq;
> +extern struct key_acl default_key_acl;
> +extern struct key_acl joinable_keyring_acl;
>   
>   extern void key_set_index_key(struct keyring_index_key *index_key);
> +
>   extern struct key_type *key_type_lookup(const char *type);
>   extern void key_type_put(struct key_type *ktype);
>   extern int key_get_type_from_user(char *, const char __user *, unsigned);
> @@ -157,6 +160,7 @@ extern struct key *request_key_and_link(struct key_type *type,
>   					const void *callout_info,
>   					size_t callout_len,
>   					void *aux,
> +					struct key_acl *acl,
>   					struct key *dest_keyring,
>   					unsigned long flags);
>   
> @@ -180,7 +184,11 @@ extern void key_gc_keytype(struct key_type *ktype);
>   
>   extern int key_task_permission(const key_ref_t key_ref,
>   			       const struct cred *cred,
> -			       key_perm_t perm);
> +			       u32 desired_perm);
> +extern unsigned int key_acl_to_perm(const struct key_acl *acl);
> +extern long key_set_acl(struct key *key, struct key_acl *acl);
> +extern void key_put_acl(struct key_acl *acl);
> +
>   #ifdef CONFIG_CONTAINERS
>   extern int queue_request_key(struct key *);
>   #else
> @@ -249,7 +257,7 @@ extern long keyctl_keyring_search(key_serial_t, const char __user *,
>   				  const char __user *, key_serial_t);
>   extern long keyctl_read_key(key_serial_t, char __user *, size_t);
>   extern long keyctl_chown_key(key_serial_t, uid_t, gid_t);
> -extern long keyctl_setperm_key(key_serial_t, key_perm_t);
> +extern long keyctl_setperm_key(key_serial_t, unsigned int);
>   extern long keyctl_instantiate_key(key_serial_t, const void __user *,
>   				   size_t, key_serial_t);
>   extern long keyctl_negate_key(key_serial_t, unsigned, key_serial_t);
> diff --git a/security/keys/key.c b/security/keys/key.c
> index 63513ffcf2e8..bca9d01c05fa 100644
> --- a/security/keys/key.c
> +++ b/security/keys/key.c
> @@ -199,7 +199,7 @@ static inline void key_alloc_serial(struct key *key)
>    * @uid: The owner of the new key.
>    * @gid: The group ID for the new key's group permissions.
>    * @cred: The credentials specifying UID namespace.
> - * @perm: The permissions mask of the new key.
> + * @acl: The ACL to attach to the new key.
>    * @flags: Flags specifying quota properties.
>    * @restrict_link: Optional link restriction for new keyrings.
>    *
> @@ -227,7 +227,7 @@ static inline void key_alloc_serial(struct key *key)
>    */
>   struct key *key_alloc(struct key_type *type, const char *desc,
>   		      kuid_t uid, kgid_t gid, const struct cred *cred,
> -		      key_perm_t perm, unsigned long flags,
> +		      struct key_acl *acl, unsigned long flags,
>   		      struct key_restriction *restrict_link)
>   {
>   	struct key_user *user = NULL;
> @@ -250,6 +250,9 @@ struct key *key_alloc(struct key_type *type, const char *desc,
>   	desclen = strlen(desc);
>   	quotalen = desclen + 1 + type->def_datalen;
>   
> +	if (!acl)
> +		acl = &default_key_acl;
> +
>   	/* get hold of the key tracking for this user */
>   	user = key_user_lookup(uid);
>   	if (!user)
> @@ -296,7 +299,8 @@ struct key *key_alloc(struct key_type *type, const char *desc,
>   	key->datalen = type->def_datalen;
>   	key->uid = uid;
>   	key->gid = gid;
> -	key->perm = perm;
> +	refcount_inc(&acl->usage);
> +	rcu_assign_pointer(key->acl, acl);
>   	key->restrict_link = restrict_link;
>   	key->last_used_at = ktime_get_real_seconds();
>   
> @@ -785,7 +789,7 @@ static inline key_ref_t __key_update(key_ref_t key_ref,
>    * @description: The searchable description for the key.
>    * @payload: The data to use to instantiate or update the key.
>    * @plen: The length of @payload.
> - * @perm: The permissions mask for a new key.
> + * @acl: The ACL to attach if a key is created.
>    * @flags: The quota flags for a new key.
>    *
>    * Search the destination keyring for a key of the same description and if one
> @@ -808,7 +812,7 @@ key_ref_t key_create_or_update(key_ref_t keyring_ref,
>   			       const char *description,
>   			       const void *payload,
>   			       size_t plen,
> -			       key_perm_t perm,
> +			       struct key_acl *acl,
>   			       unsigned long flags)
>   {
>   	struct keyring_index_key index_key = {
> @@ -899,22 +903,9 @@ key_ref_t key_create_or_update(key_ref_t keyring_ref,
>   			goto found_matching_key;
>   	}
>   
> -	/* if the client doesn't provide, decide on the permissions we want */
> -	if (perm == KEY_PERM_UNDEF) {
> -		perm = KEY_POS_VIEW | KEY_POS_SEARCH | KEY_POS_LINK | KEY_POS_SETATTR;
> -		perm |= KEY_USR_VIEW;
> -
> -		if (index_key.type->read)
> -			perm |= KEY_POS_READ;
> -
> -		if (index_key.type == &key_type_keyring ||
> -		    index_key.type->update)
> -			perm |= KEY_POS_WRITE;
> -	}
> -
>   	/* allocate a new key */
>   	key = key_alloc(index_key.type, index_key.description,
> -			cred->fsuid, cred->fsgid, cred, perm, flags, NULL);
> +			cred->fsuid, cred->fsgid, cred, acl, flags, NULL);
>   	if (IS_ERR(key)) {
>   		key_ref = ERR_CAST(key);
>   		goto error_link_end;
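
The ownership rule in key_alloc() seems to be: a NULL acl means "use
default_key_acl", and the key always takes its own reference, which
key_gc_unused_keys() later drops through key_put_acl().  A minimal sketch of
that lifecycle, assuming a plain counter in place of refcount_t and leaving
out RCU and the free-at-zero path (the body of key_put_acl() is not in this
part of the diff).  All mock_ names are mine:

```c
#include <stddef.h>

struct mock_acl {
	unsigned int usage;	/* plain counter standing in for refcount_t */
};

struct mock_key {
	struct mock_acl *acl;
};

/* Stand-in for default_key_acl; statically allocated ACLs start at 1. */
static struct mock_acl mock_default_acl = { .usage = 1 };

/* Mirrors the two added hunks in key_alloc(): fall back to the default
 * ACL, then pin whichever ACL was chosen before publishing it on the key. */
static void mock_key_attach_acl(struct mock_key *key, struct mock_acl *acl)
{
	if (!acl)
		acl = &mock_default_acl;
	acl->usage++;
	key->acl = acl;
}

/* Assumed counterpart of key_put_acl(): drop the key's reference.  Freeing
 * at zero is omitted; a static ACL never gets there because of its initial
 * count of 1. */
static void mock_key_detach_acl(struct mock_key *key)
{
	key->acl->usage--;
	key->acl = NULL;
}
```

If that reading is right, callers passing a static ACL never need to take or
drop a reference themselves.
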
> diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
> index a25799249b8a..2df896bfb8e4 100644
> --- a/security/keys/keyctl.c
> +++ b/security/keys/keyctl.c
> @@ -120,8 +120,7 @@ SYSCALL_DEFINE5(add_key, const char __user *, _type,
>   	/* create or update the requested key and add it to the target
>   	 * keyring */
>   	key_ref = key_create_or_update(keyring_ref, type, description,
> -				       payload, plen, KEY_PERM_UNDEF,
> -				       KEY_ALLOC_IN_QUOTA);
> +				       payload, plen, NULL, KEY_ALLOC_IN_QUOTA);
>   	if (!IS_ERR(key_ref)) {
>   		ret = key_ref_to_ptr(key_ref)->serial;
>   		key_ref_put(key_ref);
> @@ -211,7 +210,8 @@ SYSCALL_DEFINE4(request_key, const char __user *, _type,
>   
>   	/* do the search */
>   	key = request_key_and_link(ktype, description, NULL, callout_info,
> -				   callout_len, NULL, key_ref_to_ptr(dest_ref),
> +				   callout_len, NULL, NULL,
> +				   key_ref_to_ptr(dest_ref),
>   				   KEY_ALLOC_IN_QUOTA);
>   	if (IS_ERR(key)) {
>   		ret = PTR_ERR(key);
> @@ -373,16 +373,10 @@ long keyctl_revoke_key(key_serial_t id)
>   	struct key *key;
>   	long ret;
>   
> -	key_ref = lookup_user_key(id, 0, KEY_NEED_WRITE);
> +	key_ref = lookup_user_key(id, 0, KEY_NEED_REVOKE);
>   	if (IS_ERR(key_ref)) {
>   		ret = PTR_ERR(key_ref);
> -		if (ret != -EACCES)
> -			goto error;
> -		key_ref = lookup_user_key(id, 0, KEY_NEED_SETATTR);
> -		if (IS_ERR(key_ref)) {
> -			ret = PTR_ERR(key_ref);
> -			goto error;
> -		}
> +		goto error;
>   	}
>   
>   	key = key_ref_to_ptr(key_ref);
> @@ -416,7 +410,7 @@ long keyctl_invalidate_key(key_serial_t id)
>   
>   	kenter("%d", id);
>   
> -	key_ref = lookup_user_key(id, 0, KEY_NEED_SEARCH);
> +	key_ref = lookup_user_key(id, 0, KEY_NEED_INVAL);
>   	if (IS_ERR(key_ref)) {
>   		ret = PTR_ERR(key_ref);
>   
> @@ -461,7 +455,7 @@ long keyctl_keyring_clear(key_serial_t ringid)
>   	struct key *keyring;
>   	long ret;
>   
> -	keyring_ref = lookup_user_key(ringid, KEY_LOOKUP_CREATE, KEY_NEED_WRITE);
> +	keyring_ref = lookup_user_key(ringid, KEY_LOOKUP_CREATE, KEY_NEED_CLEAR);
>   	if (IS_ERR(keyring_ref)) {
>   		ret = PTR_ERR(keyring_ref);
>   
> @@ -639,6 +633,7 @@ long keyctl_describe_key(key_serial_t keyid,
>   			 size_t buflen)
>   {
>   	struct key *key, *instkey;
> +	unsigned int perm;
>   	key_ref_t key_ref;
>   	char *infobuf;
>   	long ret;
> @@ -668,6 +663,10 @@ long keyctl_describe_key(key_serial_t keyid,
>   	key = key_ref_to_ptr(key_ref);
>   	desclen = strlen(key->description);
>   
> +	rcu_read_lock();
> +	perm = key_acl_to_perm(rcu_dereference(key->acl));
> +	rcu_read_unlock();
> +
>   	/* calculate how much information we're going to return */
>   	ret = -ENOMEM;
>   	infobuf = kasprintf(GFP_KERNEL,
> @@ -675,7 +674,7 @@ long keyctl_describe_key(key_serial_t keyid,
>   			    key->type->name,
>   			    from_kuid_munged(current_user_ns(), key->uid),
>   			    from_kgid_munged(current_user_ns(), key->gid),
> -			    key->perm);
> +			    perm);
>   	if (!infobuf)
>   		goto error2;
>   	infolen = strlen(infobuf);
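
key_acl_to_perm() is declared in internal.h, but its body is not in this
part of the patch, so I can only guess at what KEYCTL_DESCRIBE will now
report.  The simplest possible mapping, and purely an assumption on my part,
would fold the low six bits of each standard ACE back into the legacy byte
for that subject.  The mock_ names below are hypothetical:

```c
#define KEY_ACE_SUBJ_STANDARD	0
#define MOCK_LEGACY_BITS	0x3f	/* View..SetAttr, as in KEY_OTH_ALL */

struct mock_std_ace {
	unsigned int type;
	unsigned int subject_id;	/* 0=everyone 1=group 2=owner 3=possessor */
	unsigned int perm;
};

/* Assumed inverse of the KEYCTL_SETPERM conversion: ACE bits above 0x20
 * (Inval, Revoke, Join, Clear) have no legacy equivalent and are dropped. */
static unsigned int mock_aces_to_perm(const struct mock_std_ace *aces, int nr)
{
	unsigned int perm = 0;
	int i;

	for (i = 0; i < nr; i++) {
		if (aces[i].type != KEY_ACE_SUBJ_STANDARD ||
		    aces[i].subject_id > 3)
			continue;
		perm |= (aces[i].perm & MOCK_LEGACY_BITS)
			<< (aces[i].subject_id * 8);
	}
	return perm;
}
```

With the grants from dns_key_acl (possessor View|Search|Read, owner
View|Inval) this would describe the key as 0x0b010000.  The real function
may well do something smarter with the new bits.
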
> @@ -892,7 +891,7 @@ long keyctl_chown_key(key_serial_t id, uid_t user, gid_t group)
>   		goto error;
>   
>   	key_ref = lookup_user_key(id, KEY_LOOKUP_CREATE | KEY_LOOKUP_PARTIAL,
> -				  KEY_NEED_SETATTR);
> +				  KEY_NEED_SETSEC);
>   	if (IS_ERR(key_ref)) {
>   		ret = PTR_ERR(key_ref);
>   		goto error;
> @@ -988,18 +987,25 @@ long keyctl_chown_key(key_serial_t id, uid_t user, gid_t group)
>    * the key need not be fully instantiated yet.  If the caller does not have
>    * sysadmin capability, it may only change the permission on keys that it owns.
>    */
> -long keyctl_setperm_key(key_serial_t id, key_perm_t perm)
> +long keyctl_setperm_key(key_serial_t id, unsigned int perm)
>   {
> +	struct key_acl *acl;
>   	struct key *key;
>   	key_ref_t key_ref;
>   	long ret;
> +	int nr, i, j;
>   
> -	ret = -EINVAL;
>   	if (perm & ~(KEY_POS_ALL | KEY_USR_ALL | KEY_GRP_ALL | KEY_OTH_ALL))
> -		goto error;
> +		return -EINVAL;
> +
> +	nr = 0;
> +	if (perm & KEY_POS_ALL) nr++;
> +	if (perm & KEY_USR_ALL) nr++;
> +	if (perm & KEY_GRP_ALL) nr++;
> +	if (perm & KEY_OTH_ALL) nr++;
>   
>   	key_ref = lookup_user_key(id, KEY_LOOKUP_CREATE | KEY_LOOKUP_PARTIAL,
> -				  KEY_NEED_SETATTR);
> +				  KEY_NEED_SETSEC);
>   	if (IS_ERR(key_ref)) {
>   		ret = PTR_ERR(key_ref);
>   		goto error;
> @@ -1007,18 +1013,45 @@ long keyctl_setperm_key(key_serial_t id, key_perm_t perm)
>   
>   	key = key_ref_to_ptr(key_ref);
>   
> -	/* make the changes with the locks held to prevent chown/chmod races */
> -	ret = -EACCES;
> -	down_write(&key->sem);
> +	ret = -EOPNOTSUPP;
> +	if (test_bit(KEY_FLAG_HAS_ACL, &key->flags))
> +		goto error_key;
>   
> -	/* if we're not the sysadmin, we can only change a key that we own */
> -	if (capable(CAP_SYS_ADMIN) || uid_eq(key->uid, current_fsuid())) {
> -		key->perm = perm;
> -		notify_key(key, NOTIFY_KEY_SETATTR, 0);
> -		ret = 0;
> +	ret = -ENOMEM;
> +	acl = kzalloc(struct_size(acl, aces, nr), GFP_KERNEL);
> +	if (!acl)
> +		goto error_key;
> +
> +	refcount_set(&acl->usage, 1);
> +	acl->nr_ace = nr;
> +	j = 0;
> +	for (i = 0; i < 4; i++) {
> +		struct key_ace *ace = &acl->aces[j];
> +		unsigned int subset = (perm >> (i * 8)) & KEY_OTH_ALL;
> +
> +		if (!subset)
> +			continue;
> +		ace->type = KEY_ACE_SUBJ_STANDARD;
> +		ace->subject_id = KEY_ACE_EVERYONE + i;
> +		ace->perm = subset;
> +		if (subset & (KEY_OTH_WRITE | KEY_OTH_SETATTR))
> +			ace->perm |= KEY_ACE_REVOKE;
> +		if (subset & KEY_OTH_SEARCH)
> +			ace->perm |= KEY_ACE_INVAL;
> +		if (key->type == &key_type_keyring) {
> +			if (subset & KEY_OTH_SEARCH)
> +				ace->perm |= KEY_ACE_JOIN;
> +			if (subset & KEY_OTH_WRITE)
> +				ace->perm |= KEY_ACE_CLEAR;
> +		}
> +		j++;
>   	}
>   
> +	/* make the changes with the locks held to prevent chown/chmod races */
> +	down_write(&key->sem);
> +	ret = key_set_acl(key, acl);
>   	up_write(&key->sem);
> +error_key:
>   	key_put(key);
>   error:
>   	return ret;
> @@ -1383,7 +1416,7 @@ long keyctl_set_timeout(key_serial_t id, unsigned timeout)
>   	long ret;
>   
>   	key_ref = lookup_user_key(id, KEY_LOOKUP_CREATE | KEY_LOOKUP_PARTIAL,
> -				  KEY_NEED_SETATTR);
> +				  KEY_NEED_SETSEC);
>   	if (IS_ERR(key_ref)) {
>   		/* setting the timeout on a key under construction is permitted
>   		 * if we have the authorisation token handy */
> @@ -1654,7 +1687,7 @@ long keyctl_restrict_keyring(key_serial_t id, const char __user *_type,
>   	char *restriction = NULL;
>   	long ret;
>   
> -	key_ref = lookup_user_key(id, 0, KEY_NEED_SETATTR);
> +	key_ref = lookup_user_key(id, 0, KEY_NEED_SETSEC);
>   	if (IS_ERR(key_ref))
>   		return PTR_ERR(key_ref);
>   
> @@ -1819,7 +1852,7 @@ SYSCALL_DEFINE5(keyctl, int, option, unsigned long, arg2, unsigned long, arg3,
>   
>   	case KEYCTL_SETPERM:
>   		return keyctl_setperm_key((key_serial_t) arg2,
> -					  (key_perm_t) arg3);
> +					  (unsigned int)arg3);
>   
>   	case KEYCTL_INSTANTIATE:
>   		return keyctl_instantiate_key((key_serial_t) arg2,
> diff --git a/security/keys/keyring.c b/security/keys/keyring.c
> index 14df79814ea0..64f590632891 100644
> --- a/security/keys/keyring.c
> +++ b/security/keys/keyring.c
> @@ -518,11 +518,19 @@ static long keyring_read(const struct key *keyring,
>   	return ret;
>   }
>   
> -/*
> - * Allocate a keyring and link into the destination keyring.
> +/**
> + * keyring_alloc - Allocate a keyring and link into the destination
> + * @description: The key description to allow the key to be searched out.
> + * @uid: The owner of the new key.
> + * @gid: The group ID for the new key's group permissions.
> + * @cred: The credentials specifying UID namespace.
> + * @acl: The ACL to attach to the new key.
> + * @flags: Flags specifying quota properties.
> + * @restrict_link: Optional link restriction for new keyrings.
> + * @dest: Destination keyring.
>    */
>   struct key *keyring_alloc(const char *description, kuid_t uid, kgid_t gid,
> -			  const struct cred *cred, key_perm_t perm,
> +			  const struct cred *cred, struct key_acl *acl,
>   			  unsigned long flags,
>   			  struct key_restriction *restrict_link,
>   			  struct key *dest)
> @@ -531,7 +539,7 @@ struct key *keyring_alloc(const char *description, kuid_t uid, kgid_t gid,
>   	int ret;
>   
>   	keyring = key_alloc(&key_type_keyring, description,
> -			    uid, gid, cred, perm, flags, restrict_link);
> +			    uid, gid, cred, acl, flags, restrict_link);
>   	if (!IS_ERR(keyring)) {
>   		ret = key_instantiate_and_link(keyring, NULL, 0, dest, NULL);
>   		if (ret < 0) {
> @@ -1125,10 +1133,11 @@ key_ref_t find_key_to_update(key_ref_t keyring_ref,
>   /*
>    * Find a keyring with the specified name.
>    *
> - * Only keyrings that have nonzero refcount, are not revoked, and are owned by a
> - * user in the current user namespace are considered.  If @uid_keyring is %true,
> - * the keyring additionally must have been allocated as a user or user session
> - * keyring; otherwise, it must grant Search permission directly to the caller.
> + * Only keyrings that have nonzero refcount, are not revoked, and are owned by
> + * a user in the current user namespace are considered.  If @uid_keyring is
> + * %true, the keyring additionally must have been allocated as a user or user
> + * session keyring; otherwise, it must grant JOIN permission directly to the
> + * caller (ie. not through possession).
>    *
>    * Returns a pointer to the keyring with the keyring's refcount having being
>    * incremented on success.  -ENOKEY is returned if a key could not be found.
> @@ -1162,7 +1171,7 @@ struct key *find_keyring_by_name(const char *name, bool uid_keyring)
>   				continue;
>   		} else {
>   			if (key_permission(make_key_ref(keyring, 0),
> -					   KEY_NEED_SEARCH) < 0)
> +					   KEY_NEED_JOIN) < 0)
>   				continue;
>   		}
>   
> diff --git a/security/keys/permission.c b/security/keys/permission.c
> index 06df9d5e7572..8dc6e80f6fd0 100644
> --- a/security/keys/permission.c
> +++ b/security/keys/permission.c
> @@ -11,13 +11,62 @@
>   
>   #include <linux/export.h>
>   #include <linux/security.h>
> +#include <linux/user_namespace.h>
> +#include <linux/uaccess.h>
>   #include "internal.h"
>   
> +struct key_acl default_key_acl = {
> +	.usage	= REFCOUNT_INIT(1),
> +	.nr_ace	= 2,
> +	.possessor_viewable = true,
> +	.aces = {
> +		KEY_POSSESSOR_ACE(KEY_ACE__PERMS & ~KEY_ACE_JOIN),
> +		KEY_OWNER_ACE(KEY_ACE_VIEW),
> +	}
> +};
> +
> +struct key_acl joinable_keyring_acl = {
> +	.usage	= REFCOUNT_INIT(1),
> +	.nr_ace	= 2,
> +	.possessor_viewable = true,
> +	.aces	= {
> +		KEY_POSSESSOR_ACE(KEY_ACE__PERMS & ~KEY_ACE_JOIN),
> +		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ | KEY_ACE_LINK | KEY_ACE_JOIN),
> +	}
> +};
> +
> +struct key_acl internal_key_acl = {
> +	.usage	= REFCOUNT_INIT(1),
> +	.nr_ace	= 2,
> +	.aces = {
> +		KEY_POSSESSOR_ACE(KEY_ACE_SEARCH),
> +		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ | KEY_ACE_SEARCH),
> +	}
> +};
> +
> +struct key_acl internal_keyring_acl = {
> +	.usage	= REFCOUNT_INIT(1),
> +	.nr_ace	= 2,
> +	.aces = {
> +		KEY_POSSESSOR_ACE(KEY_ACE_SEARCH),
> +		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ | KEY_ACE_SEARCH),
> +	}
> +};
> +
> +struct key_acl internal_writable_keyring_acl = {
> +	.usage	= REFCOUNT_INIT(1),
> +	.nr_ace	= 2,
> +	.aces = {
> +		KEY_POSSESSOR_ACE(KEY_ACE_SEARCH | KEY_ACE_WRITE),
> +		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ | KEY_ACE_WRITE | KEY_ACE_SEARCH),
> +	}
> +};
> +
>   /**
>    * key_task_permission - Check a key can be used
>    * @key_ref: The key to check.
>    * @cred: The credentials to use.
> - * @perm: The permissions to check for.
> + * @desired_perm: The permission to check for.
>    *
>    * Check to see whether permission is granted to use a key in the desired way,
>    * but permit the security modules to override.
> @@ -28,53 +77,73 @@
>    * permissions bits or the LSM check.
>    */
>   int key_task_permission(const key_ref_t key_ref, const struct cred *cred,
> -			unsigned perm)
> +			unsigned int desired_perm)
>   {
> -	struct key *key;
> -	key_perm_t kperm;
> -	int ret;
> +	const struct key_acl *acl;
> +	const struct key *key;
> +	unsigned int allow = 0;
> +	int i;
> +
> +	BUILD_BUG_ON(KEY_NEED_VIEW	!= KEY_ACE_VIEW		||
> +		     KEY_NEED_READ	!= KEY_ACE_READ		||
> +		     KEY_NEED_WRITE	!= KEY_ACE_WRITE	||
> +		     KEY_NEED_SEARCH	!= KEY_ACE_SEARCH	||
> +		     KEY_NEED_LINK	!= KEY_ACE_LINK		||
> +		     KEY_NEED_SETSEC	!= KEY_ACE_SET_SECURITY	||
> +		     KEY_NEED_INVAL	!= KEY_ACE_INVAL	||
> +		     KEY_NEED_REVOKE	!= KEY_ACE_REVOKE	||
> +		     KEY_NEED_JOIN	!= KEY_ACE_JOIN		||
> +		     KEY_NEED_CLEAR	!= KEY_ACE_CLEAR);
>   
>   	key = key_ref_to_ptr(key_ref);
>   
> -	/* use the second 8-bits of permissions for keys the caller owns */
> -	if (uid_eq(key->uid, cred->fsuid)) {
> -		kperm = key->perm >> 16;
> -		goto use_these_perms;
> -	}
> +	rcu_read_lock();
>   
> -	/* use the third 8-bits of permissions for keys the caller has a group
> -	 * membership in common with */
> -	if (gid_valid(key->gid) && key->perm & KEY_GRP_ALL) {
> -		if (gid_eq(key->gid, cred->fsgid)) {
> -			kperm = key->perm >> 8;
> -			goto use_these_perms;
> -		}
> +	acl = rcu_dereference(key->acl);
> +	if (!acl || acl->nr_ace == 0)
> +		goto no_access_rcu;
> +
> +	for (i = 0; i < acl->nr_ace; i++) {
> +		const struct key_ace *ace = &acl->aces[i];
>   
> -		ret = groups_search(cred->group_info, key->gid);
> -		if (ret) {
> -			kperm = key->perm >> 8;
> -			goto use_these_perms;
> +		switch (ace->type) {
> +		case KEY_ACE_SUBJ_STANDARD:
> +			switch (ace->subject_id) {
> +			case KEY_ACE_POSSESSOR:
> +				if (is_key_possessed(key_ref))
> +					allow |= ace->perm;
> +				break;
> +			case KEY_ACE_OWNER:
> +				if (uid_eq(key->uid, cred->fsuid))
> +					allow |= ace->perm;
> +				break;
> +			case KEY_ACE_GROUP:
> +				if (gid_valid(key->gid)) {
> +					if (gid_eq(key->gid, cred->fsgid))
> +						allow |= ace->perm;
> +					else if (groups_search(cred->group_info, key->gid))
> +						allow |= ace->perm;
> +				}
> +				break;
> +			case KEY_ACE_EVERYONE:
> +				allow |= ace->perm;
> +				break;
> +			}
> +			break;
>   		}
>   	}
>   
> -	/* otherwise use the least-significant 8-bits */
> -	kperm = key->perm;
> -
> -use_these_perms:
> +	rcu_read_unlock();
>   
> -	/* use the top 8-bits of permissions for keys the caller possesses
> -	 * - possessor permissions are additive with other permissions
> -	 */
> -	if (is_key_possessed(key_ref))
> -		kperm |= key->perm >> 24;
> +	if (!(allow & desired_perm))
> +		goto no_access;
>   
> -	kperm = kperm & perm & KEY_NEED_ALL;
> +	return security_key_permission(key_ref, cred, desired_perm);
>   
> -	if (kperm != perm)
> -		return -EACCES;
> -
> -	/* let LSM be the final arbiter */
> -	return security_key_permission(key_ref, cred, perm);
> +no_access_rcu:
> +	rcu_read_unlock();
> +no_access:
> +	return -EACCES;
>   }
>   EXPORT_SYMBOL(key_task_permission);
>   
> @@ -108,3 +177,100 @@ int key_validate(const struct key *key)
>   	return 0;
>   }
>   EXPORT_SYMBOL(key_validate);
> +
> +/*
> + * Roughly render an ACL to an old-style permissions mask.  We cannot
> + * accurately render what the ACL grants, particularly if it has ACEs that
> + * represent subjects outside of { poss, user, group, other }.
> + */
> +unsigned int key_acl_to_perm(const struct key_acl *acl)
> +{
> +	unsigned int perm = 0, tperm;
> +	int i;
> +
> +	BUILD_BUG_ON(KEY_OTH_VIEW	!= KEY_ACE_VIEW		||
> +		     KEY_OTH_READ	!= KEY_ACE_READ		||
> +		     KEY_OTH_WRITE	!= KEY_ACE_WRITE	||
> +		     KEY_OTH_SEARCH	!= KEY_ACE_SEARCH	||
> +		     KEY_OTH_LINK	!= KEY_ACE_LINK		||
> +		     KEY_OTH_SETATTR	!= KEY_ACE_SET_SECURITY);
> +
> +	if (!acl || acl->nr_ace == 0)
> +		return 0;
> +
> +	for (i = 0; i < acl->nr_ace; i++) {
> +		const struct key_ace *ace = &acl->aces[i];
> +
> +		switch (ace->type) {
> +		case KEY_ACE_SUBJ_STANDARD:
> +			tperm = ace->perm & KEY_OTH_ALL;
> +
> +			/* Invalidation and joining were allowed by SEARCH */
> +			if (ace->perm & (KEY_ACE_INVAL | KEY_ACE_JOIN))
> +				tperm |= KEY_OTH_SEARCH;
> +
> +			/* Revocation was allowed by either SETATTR or WRITE */
> +			if ((ace->perm & KEY_ACE_REVOKE) && !(tperm & KEY_OTH_SETATTR))
> +				tperm |= KEY_OTH_WRITE;
> +
> +			/* Clearing was allowed by WRITE */
> +			if (ace->perm & KEY_ACE_CLEAR)
> +				tperm |= KEY_OTH_WRITE;
> +
> +			switch (ace->subject_id) {
> +			case KEY_ACE_POSSESSOR:
> +				perm |= tperm << 24;
> +				break;
> +			case KEY_ACE_OWNER:
> +				perm |= tperm << 16;
> +				break;
> +			case KEY_ACE_GROUP:
> +				perm |= tperm << 8;
> +				break;
> +			case KEY_ACE_EVERYONE:
> +				perm |= tperm << 0;
> +				break;
> +			}
> +		}
> +	}
> +
> +	return perm;
> +}
> +
> +/*
> + * Destroy a key's ACL.
> + */
> +void key_put_acl(struct key_acl *acl)
> +{
> +	if (acl && refcount_dec_and_test(&acl->usage))
> +		kfree_rcu(acl, rcu);
> +}
> +
> +/*
> + * Try to set the ACL.  This either attaches or discards the proposed ACL.
> + */
> +long key_set_acl(struct key *key, struct key_acl *acl)
> +{
> +	int i;
> +
> +	/* If we're not the sysadmin, we can only change a key that we own. */
> +	if (!capable(CAP_SYS_ADMIN) && !uid_eq(key->uid, current_fsuid())) {
> +		key_put_acl(acl);
> +		return -EACCES;
> +	}
> +
> +	for (i = 0; i < acl->nr_ace; i++) {
> +		const struct key_ace *ace = &acl->aces[i];
> +		if (ace->type == KEY_ACE_SUBJ_STANDARD &&
> +		    ace->subject_id == KEY_ACE_POSSESSOR) {
> +			if (ace->perm & KEY_ACE_VIEW)
> +				acl->possessor_viewable = true;
> +			break;
> +		}
> +	}
> +
> +	rcu_swap_protected(key->acl, acl, lockdep_is_held(&key->sem));
> +	notify_key(key, NOTIFY_KEY_SETATTR, 0);
> +	key_put_acl(acl);
> +	return 0;
> +}
> diff --git a/security/keys/persistent.c b/security/keys/persistent.c
> index c9fbe63adc58..0a115cc543df 100644
> --- a/security/keys/persistent.c
> +++ b/security/keys/persistent.c
> @@ -16,6 +16,27 @@
>   
>   unsigned persistent_keyring_expiry = 3 * 24 * 3600; /* Expire after 3 days of non-use */
>   
> +static struct key_acl persistent_register_keyring_acl = {
> +	.usage	= REFCOUNT_INIT(1),
> +	.nr_ace	= 2,
> +	.aces = {
> +		KEY_POSSESSOR_ACE(KEY_ACE_SEARCH | KEY_ACE_WRITE),
> +		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ),
> +	}
> +};
> +
> +static struct key_acl persistent_keyring_acl = {
> +	.usage	= REFCOUNT_INIT(1),
> +	.nr_ace	= 2,
> +	.possessor_viewable = true,
> +	.aces = {
> +		KEY_POSSESSOR_ACE(KEY_ACE_VIEW | KEY_ACE_READ | KEY_ACE_WRITE |
> +				  KEY_ACE_SEARCH | KEY_ACE_LINK |
> +				  KEY_ACE_CLEAR | KEY_ACE_INVAL),
> +		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ),
> +	}
> +};
> +
>   /*
>    * Create the persistent keyring register for the current user namespace.
>    *
> @@ -26,8 +47,7 @@ static int key_create_persistent_register(struct user_namespace *ns)
>   	struct key *reg = keyring_alloc(".persistent_register",
>   					KUIDT_INIT(0), KGIDT_INIT(0),
>   					current_cred(),
> -					((KEY_POS_ALL & ~KEY_POS_SETATTR) |
> -					 KEY_USR_VIEW | KEY_USR_READ),
> +					&persistent_register_keyring_acl,
>   					KEY_ALLOC_NOT_IN_QUOTA, NULL, NULL);
>   	if (IS_ERR(reg))
>   		return PTR_ERR(reg);
> @@ -60,8 +80,7 @@ static key_ref_t key_create_persistent(struct user_namespace *ns, kuid_t uid,
>   
>   	persistent = keyring_alloc(index_key->description,
>   				   uid, INVALID_GID, current_cred(),
> -				   ((KEY_POS_ALL & ~KEY_POS_SETATTR) |
> -				    KEY_USR_VIEW | KEY_USR_READ),
> +				   &persistent_keyring_acl,
>   				   KEY_ALLOC_NOT_IN_QUOTA, NULL,
>   				   ns->persistent_keyring_register);
>   	if (IS_ERR(persistent))
> diff --git a/security/keys/proc.c b/security/keys/proc.c
> index d2b802072693..d697a2e95217 100644
> --- a/security/keys/proc.c
> +++ b/security/keys/proc.c
> @@ -154,6 +154,7 @@ static void proc_keys_stop(struct seq_file *p, void *v)
>   
>   static int proc_keys_show(struct seq_file *m, void *v)
>   {
> +	const struct key_acl *acl;
>   	struct rb_node *_p = v;
>   	struct key *key = rb_entry(_p, struct key, serial_node);
>   	unsigned long flags;
> @@ -161,6 +162,7 @@ static int proc_keys_show(struct seq_file *m, void *v)
>   	time64_t now, expiry;
>   	char xbuf[16];
>   	short state;
> +	bool check_pos;
>   	u64 timo;
>   	int rc;
>   
> @@ -174,12 +176,16 @@ static int proc_keys_show(struct seq_file *m, void *v)
>   		.flags			= KEYRING_SEARCH_NO_STATE_CHECK,
>   	};
>   
> -	key_ref = make_key_ref(key, 0);
> +	rcu_read_lock();
> +
> +	acl = rcu_dereference(key->acl);
> +	check_pos = acl->possessor_viewable;
>   
>   	/* determine if the key is possessed by this process (a test we can
>   	 * skip if the key does not indicate the possessor can view it
>   	 */
> -	if (key->perm & KEY_POS_VIEW) {
> +	key_ref = make_key_ref(key, 0);
> +	if (check_pos) {
>   		skey_ref = search_my_process_keyrings(&ctx);
>   		if (!IS_ERR(skey_ref)) {
>   			key_ref_put(skey_ref);
> @@ -190,12 +196,10 @@ static int proc_keys_show(struct seq_file *m, void *v)
>   	/* check whether the current task is allowed to view the key */
>   	rc = key_task_permission(key_ref, ctx.cred, KEY_NEED_VIEW);
>   	if (rc < 0)
> -		return 0;
> +		goto out;
>   
>   	now = ktime_get_real_seconds();
>   
> -	rcu_read_lock();
> -
>   	/* come up with a suitable timeout value */
>   	expiry = READ_ONCE(key->expiry);
>   	if (expiry == 0) {
> @@ -234,7 +238,7 @@ static int proc_keys_show(struct seq_file *m, void *v)
>   		   showflag(flags, 'i', KEY_FLAG_INVALIDATED),
>   		   refcount_read(&key->usage),
>   		   xbuf,
> -		   key->perm,
> +		   key_acl_to_perm(acl),
>   		   from_kuid_munged(seq_user_ns(m), key->uid),
>   		   from_kgid_munged(seq_user_ns(m), key->gid),
>   		   key->type->name);
> @@ -245,6 +249,7 @@ static int proc_keys_show(struct seq_file *m, void *v)
>   		key->type->describe(key, m);
>   	seq_putc(m, '\n');
>   
> +out:
>   	rcu_read_unlock();
>   	return 0;
>   }
> diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c
> index 39d3cbac920c..0a231ede4d2b 100644
> --- a/security/keys/process_keys.c
> +++ b/security/keys/process_keys.c
> @@ -39,6 +39,37 @@ struct key_user root_key_user = {
>   	.uid		= GLOBAL_ROOT_UID,
>   };
>   
> +static struct key_acl user_keyring_acl = {
> +	.usage	= REFCOUNT_INIT(1),
> +	.nr_ace	= 2,
> +	.possessor_viewable = true,
> +	.aces = {
> +		KEY_POSSESSOR_ACE(KEY_ACE_VIEW | KEY_ACE_READ | KEY_ACE_WRITE |
> +				  KEY_ACE_SEARCH | KEY_ACE_LINK),
> +		KEY_OWNER_ACE(KEY_ACE__PERMS & ~(KEY_ACE_JOIN | KEY_ACE_SET_SECURITY)),
> +	}
> +};
> +
> +static struct key_acl session_keyring_acl = {
> +	.usage	= REFCOUNT_INIT(1),
> +	.nr_ace	= 2,
> +	.possessor_viewable = true,
> +	.aces = {
> +		KEY_POSSESSOR_ACE(KEY_ACE__PERMS & ~KEY_ACE_JOIN),
> +		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ),
> +	}
> +};
> +
> +static struct key_acl thread_and_process_keyring_acl = {
> +	.usage	= REFCOUNT_INIT(1),
> +	.nr_ace	= 2,
> +	.possessor_viewable = true,
> +	.aces = {
> +		KEY_POSSESSOR_ACE(KEY_ACE__PERMS & ~(KEY_ACE_JOIN | KEY_ACE_SET_SECURITY)),
> +		KEY_OWNER_ACE(KEY_ACE_VIEW),
> +	}
> +};
> +
>   /*
>    * Install the user and user session keyrings for the current process's UID.
>    */
> @@ -47,12 +78,10 @@ int install_user_keyrings(void)
>   	struct user_struct *user;
>   	const struct cred *cred;
>   	struct key *uid_keyring, *session_keyring;
> -	key_perm_t user_keyring_perm;
>   	char buf[20];
>   	int ret;
>   	uid_t uid;
>   
> -	user_keyring_perm = (KEY_POS_ALL & ~KEY_POS_SETATTR) | KEY_USR_ALL;
>   	cred = current_cred();
>   	user = cred->user;
>   	uid = from_kuid(cred->user_ns, user->uid);
> @@ -77,9 +106,9 @@ int install_user_keyrings(void)
>   		uid_keyring = find_keyring_by_name(buf, true);
>   		if (IS_ERR(uid_keyring)) {
>   			uid_keyring = keyring_alloc(buf, user->uid, INVALID_GID,
> -						    cred, user_keyring_perm,
> +						    cred, &user_keyring_acl,
>   						    KEY_ALLOC_UID_KEYRING |
> -							KEY_ALLOC_IN_QUOTA,
> +						    KEY_ALLOC_IN_QUOTA,
>   						    NULL, NULL);
>   			if (IS_ERR(uid_keyring)) {
>   				ret = PTR_ERR(uid_keyring);
> @@ -95,9 +124,9 @@ int install_user_keyrings(void)
>   		if (IS_ERR(session_keyring)) {
>   			session_keyring =
>   				keyring_alloc(buf, user->uid, INVALID_GID,
> -					      cred, user_keyring_perm,
> +					      cred, &user_keyring_acl,
>   					      KEY_ALLOC_UID_KEYRING |
> -						  KEY_ALLOC_IN_QUOTA,
> +					      KEY_ALLOC_IN_QUOTA,
>   					      NULL, NULL);
>   			if (IS_ERR(session_keyring)) {
>   				ret = PTR_ERR(session_keyring);
> @@ -144,7 +173,7 @@ int install_thread_keyring_to_cred(struct cred *new)
>   		return 0;
>   
>   	keyring = keyring_alloc("_tid", new->uid, new->gid, new,
> -				KEY_POS_ALL | KEY_USR_VIEW,
> +				&thread_and_process_keyring_acl,
>   				KEY_ALLOC_QUOTA_OVERRUN,
>   				NULL, NULL);
>   	if (IS_ERR(keyring))
> @@ -191,7 +220,7 @@ int install_process_keyring_to_cred(struct cred *new)
>   		return 0;
>   
>   	keyring = keyring_alloc("_pid", new->uid, new->gid, new,
> -				KEY_POS_ALL | KEY_USR_VIEW,
> +				&thread_and_process_keyring_acl,
>   				KEY_ALLOC_QUOTA_OVERRUN,
>   				NULL, NULL);
>   	if (IS_ERR(keyring))
> @@ -245,8 +274,7 @@ int install_session_keyring_to_cred(struct cred *cred, struct key *keyring)
>   			flags = KEY_ALLOC_IN_QUOTA;
>   
>   		keyring = keyring_alloc("_ses", cred->uid, cred->gid, cred,
> -					KEY_POS_ALL | KEY_USR_VIEW | KEY_USR_READ,
> -					flags, NULL, NULL);
> +					&session_keyring_acl, flags, NULL, NULL);
>   		if (IS_ERR(keyring))
>   			return PTR_ERR(keyring);
>   	} else {
> @@ -554,7 +582,7 @@ bool lookup_user_key_possessed(const struct key *key,
>    * returned key reference.
>    */
>   key_ref_t lookup_user_key(key_serial_t id, unsigned long lflags,
> -			  key_perm_t perm)
> +			  unsigned int desired_perm)
>   {
>   	struct keyring_search_context ctx = {
>   		.match_data.cmp		= lookup_user_key_possessed,
> @@ -740,12 +768,12 @@ key_ref_t lookup_user_key(key_serial_t id, unsigned long lflags,
>   		case -ERESTARTSYS:
>   			goto invalid_key;
>   		default:
> -			if (perm)
> +			if (desired_perm)
>   				goto invalid_key;
>   		case 0:
>   			break;
>   		}
> -	} else if (perm) {
> +	} else if (desired_perm) {
>   		ret = key_validate(key);
>   		if (ret < 0)
>   			goto invalid_key;
> @@ -757,9 +785,11 @@ key_ref_t lookup_user_key(key_serial_t id, unsigned long lflags,
>   		goto invalid_key;
>   
>   	/* check the permissions */
> -	ret = key_task_permission(key_ref, ctx.cred, perm);
> -	if (ret < 0)
> -		goto invalid_key;
> +	if (desired_perm) {
> +		ret = key_task_permission(key_ref, ctx.cred, desired_perm);
> +		if (ret < 0)
> +			goto invalid_key;
> +	}
>   
>   	key->last_used_at = ktime_get_real_seconds();
>   
> @@ -824,13 +854,13 @@ long join_session_keyring(const char *name)
>   	if (PTR_ERR(keyring) == -ENOKEY) {
>   		/* not found - try and create a new one */
>   		keyring = keyring_alloc(
> -			name, old->uid, old->gid, old,
> -			KEY_POS_ALL | KEY_USR_VIEW | KEY_USR_READ | KEY_USR_LINK,
> +			name, old->uid, old->gid, old, &joinable_keyring_acl,
>   			KEY_ALLOC_IN_QUOTA, NULL, NULL);
>   		if (IS_ERR(keyring)) {
>   			ret = PTR_ERR(keyring);
>   			goto error2;
>   		}
> +		goto no_perm_test;
>   	} else if (IS_ERR(keyring)) {
>   		ret = PTR_ERR(keyring);
>   		goto error2;
> @@ -839,6 +869,12 @@ long join_session_keyring(const char *name)
>   		goto error3;
>   	}
>   
> +	ret = key_task_permission(make_key_ref(keyring, false), old,
> +				  KEY_NEED_JOIN);
> +	if (ret < 0)
> +		goto error3;
> +
> +no_perm_test:
>   	/* we've got a keyring - now to install it */
>   	ret = install_session_keyring_to_cred(new, keyring);
>   	if (ret < 0)
> diff --git a/security/keys/request_key.c b/security/keys/request_key.c
> index 10244b6fbf5d..0d609c1efece 100644
> --- a/security/keys/request_key.c
> +++ b/security/keys/request_key.c
> @@ -115,8 +115,7 @@ static int call_sbin_request_key(struct key *authkey)
>   
>   	cred = get_current_cred();
>   	keyring = keyring_alloc(desc, cred->fsuid, cred->fsgid, cred,
> -				KEY_POS_ALL | KEY_USR_VIEW | KEY_USR_READ,
> -				KEY_ALLOC_QUOTA_OVERRUN, NULL, NULL);
> +				NULL, KEY_ALLOC_QUOTA_OVERRUN, NULL, NULL);
>   	put_cred(cred);
>   	if (IS_ERR(keyring)) {
>   		ret = PTR_ERR(keyring);
> @@ -344,11 +343,11 @@ static int construct_alloc_key(struct keyring_search_context *ctx,
>   			       struct key *dest_keyring,
>   			       unsigned long flags,
>   			       struct key_user *user,
> +			       struct key_acl *acl,
>   			       struct key **_key)
>   {
>   	struct assoc_array_edit *edit;
>   	struct key *key;
> -	key_perm_t perm;
>   	key_ref_t key_ref;
>   	int ret;
>   
> @@ -358,17 +357,9 @@ static int construct_alloc_key(struct keyring_search_context *ctx,
>   	*_key = NULL;
>   	mutex_lock(&user->cons_lock);
>   
> -	perm = KEY_POS_VIEW | KEY_POS_SEARCH | KEY_POS_LINK | KEY_POS_SETATTR;
> -	perm |= KEY_USR_VIEW;
> -	if (ctx->index_key.type->read)
> -		perm |= KEY_POS_READ;
> -	if (ctx->index_key.type == &key_type_keyring ||
> -	    ctx->index_key.type->update)
> -		perm |= KEY_POS_WRITE;
> -
>   	key = key_alloc(ctx->index_key.type, ctx->index_key.description,
>   			ctx->cred->fsuid, ctx->cred->fsgid, ctx->cred,
> -			perm, flags, NULL);
> +			acl, flags, NULL);
>   	if (IS_ERR(key))
>   		goto alloc_failed;
>   
> @@ -444,6 +435,7 @@ static struct key *construct_key_and_link(struct keyring_search_context *ctx,
>   					  const char *callout_info,
>   					  size_t callout_len,
>   					  void *aux,
> +					  struct key_acl *acl,
>   					  struct key *dest_keyring,
>   					  unsigned long flags)
>   {
> @@ -466,7 +458,7 @@ static struct key *construct_key_and_link(struct keyring_search_context *ctx,
>   		goto error_put_dest_keyring;
>   	}
>   
> -	ret = construct_alloc_key(ctx, dest_keyring, flags, user, &key);
> +	ret = construct_alloc_key(ctx, dest_keyring, flags, user, acl, &key);
>   	key_user_put(user);
>   
>   	if (ret == 0) {
> @@ -504,6 +496,7 @@ static struct key *construct_key_and_link(struct keyring_search_context *ctx,
>    * @callout_info: The data to pass to the instantiation upcall (or NULL).
>    * @callout_len: The length of callout_info.
>    * @aux: Auxiliary data for the upcall.
> + * @acl: The ACL to attach if a new key is created.
>    * @dest_keyring: Where to cache the key.
>    * @flags: Flags to key_alloc().
>    *
> @@ -531,6 +524,7 @@ struct key *request_key_and_link(struct key_type *type,
>   				 const void *callout_info,
>   				 size_t callout_len,
>   				 void *aux,
> +				 struct key_acl *acl,
>   				 struct key *dest_keyring,
>   				 unsigned long flags)
>   {
> @@ -593,7 +587,7 @@ struct key *request_key_and_link(struct key_type *type,
>   			goto error_free;
>   
>   		key = construct_key_and_link(&ctx, callout_info, callout_len,
> -					     aux, dest_keyring, flags);
> +					     aux, acl, dest_keyring, flags);
>   	}
>   
>   error_free:
> @@ -635,6 +629,7 @@ EXPORT_SYMBOL(wait_for_key_construction);
>    * @type: Type of key.
>    * @description: The searchable description of the key.
>    * @callout_info: The data to pass to the instantiation upcall (or NULL).
> + * @acl: The ACL to attach if a new key is created.
>    *
>    * As for request_key_and_link() except that it does not add the returned key
>    * to a keyring if found, new keys are always allocated in the user's quota,
> @@ -646,7 +641,8 @@ EXPORT_SYMBOL(wait_for_key_construction);
>    */
>   struct key *request_key(struct key_type *type,
>   			const char *description,
> -			const char *callout_info)
> +			const char *callout_info,
> +			struct key_acl *acl)
>   {
>   	struct key *key;
>   	size_t callout_len = 0;
> @@ -656,7 +652,7 @@ struct key *request_key(struct key_type *type,
>   		callout_len = strlen(callout_info);
>   	key = request_key_and_link(type, description, NULL,
>   				   callout_info, callout_len,
> -				   NULL, NULL, KEY_ALLOC_IN_QUOTA);
> +				   NULL, acl, NULL, KEY_ALLOC_IN_QUOTA);
>   	if (!IS_ERR(key)) {
>   		ret = wait_for_key_construction(key, false);
>   		if (ret < 0) {
> @@ -675,6 +671,7 @@ EXPORT_SYMBOL(request_key);
>    * @callout_info: The data to pass to the instantiation upcall (or NULL).
>    * @callout_len: The length of callout_info.
>    * @aux: Auxiliary data for the upcall.
> + * @acl: The ACL to attach if a new key is created.
>    *
>    * As for request_key_and_link() except that it does not add the returned key
>    * to a keyring if found and new keys are always allocated in the user's quota.
> @@ -686,14 +683,15 @@ struct key *request_key_with_auxdata(struct key_type *type,
>   				     const char *description,
>   				     const void *callout_info,
>   				     size_t callout_len,
> -				     void *aux)
> +				     void *aux,
> +				     struct key_acl *acl)
>   {
>   	struct key *key;
>   	int ret;
>   
>   	key = request_key_and_link(type, description, NULL,
>   				   callout_info, callout_len,
> -				   aux, NULL, KEY_ALLOC_IN_QUOTA);
> +				   aux, acl, NULL, KEY_ALLOC_IN_QUOTA);
>   	if (!IS_ERR(key)) {
>   		ret = wait_for_key_construction(key, false);
>   		if (ret < 0) {
> @@ -711,6 +709,7 @@ EXPORT_SYMBOL(request_key_with_auxdata);
>    * @description: The searchable description of the key.
>    * @net: The network namespace that is the key's domain of operation.
>    * @callout_info: The data to pass to the instantiation upcall (or NULL).
> + * @acl: The ACL to attach if a new key is created.
>    *
>    * As for request_key() except that it does not add the returned key to a
>    * keyring if found, new keys are always allocated in the user's quota, the
> @@ -723,7 +722,8 @@ EXPORT_SYMBOL(request_key_with_auxdata);
>   struct key *request_key_net(struct key_type *type,
>   			    const char *description,
>   			    struct net *net,
> -			    const char *callout_info)
> +			    const char *callout_info,
> +			    struct key_acl *acl)
>   {
>   	struct key *key;
>   	size_t callout_len = 0;
> @@ -733,7 +733,7 @@ struct key *request_key_net(struct key_type *type,
>   		callout_len = strlen(callout_info);
>   	key = request_key_and_link(type, description, net->key_domain,
>   				   callout_info, callout_len,
> -				   NULL, NULL, KEY_ALLOC_IN_QUOTA);
> +				   NULL, acl, NULL, KEY_ALLOC_IN_QUOTA);
>   	if (!IS_ERR(key)) {
>   		ret = wait_for_key_construction(key, false);
>   		if (ret < 0) {
> diff --git a/security/keys/request_key_auth.c b/security/keys/request_key_auth.c
> index 726555a0639c..790c809844ac 100644
> --- a/security/keys/request_key_auth.c
> +++ b/security/keys/request_key_auth.c
> @@ -28,6 +28,17 @@ static void request_key_auth_revoke(struct key *);
>   static void request_key_auth_destroy(struct key *);
>   static long request_key_auth_read(const struct key *, char __user *, size_t);
>   
> +static struct key_acl request_key_auth_acl = {
> +	.usage	= REFCOUNT_INIT(1),
> +	.nr_ace	= 2,
> +	.possessor_viewable = true,
> +	.aces = {
> +		KEY_POSSESSOR_ACE(KEY_ACE_VIEW | KEY_ACE_READ | KEY_ACE_SEARCH |
> +				  KEY_ACE_LINK),
> +		KEY_OWNER_ACE(KEY_ACE_VIEW),
> +	}
> +};
> +
>   /*
>    * The request-key authorisation key type definition.
>    */
> @@ -208,8 +219,8 @@ struct key *request_key_auth_new(struct key *target, const char *op,
>   
>   	authkey = key_alloc(&key_type_request_key_auth, desc,
>   			    cred->fsuid, cred->fsgid, cred,
> -			    KEY_POS_VIEW | KEY_POS_READ | KEY_POS_SEARCH | KEY_POS_LINK |
> -			    KEY_USR_VIEW, KEY_ALLOC_NOT_IN_QUOTA, NULL);
> +			    &request_key_auth_acl,
> +			    KEY_ALLOC_NOT_IN_QUOTA, NULL);
>   	if (IS_ERR(authkey)) {
>   		ret = PTR_ERR(authkey);
>   		goto error_free_rka;
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index fd845063b692..616b7c292eb6 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -6560,6 +6560,7 @@ static int selinux_key_permission(key_ref_t key_ref,
>   {
>   	struct key *key;
>   	struct key_security_struct *ksec;
> +	unsigned oldstyle_perm;
>   	u32 sid;
>   
>   	/* if no specific permissions are requested, we skip the
> @@ -6568,13 +6569,26 @@ static int selinux_key_permission(key_ref_t key_ref,
>   	if (perm == 0)
>   		return 0;
>   
> +	oldstyle_perm = perm & (KEY_NEED_VIEW | KEY_NEED_READ | KEY_NEED_WRITE |
> +				KEY_NEED_SEARCH | KEY_NEED_LINK);
> +	if (perm & KEY_NEED_SETSEC)
> +		oldstyle_perm |= OLD_KEY_NEED_SETATTR;
> +	if (perm & KEY_NEED_INVAL)
> +		oldstyle_perm |= KEY_NEED_SEARCH;
> +	if (perm & KEY_NEED_REVOKE && !(perm & OLD_KEY_NEED_SETATTR))
> +		oldstyle_perm |= KEY_NEED_WRITE;
> +	if (perm & KEY_NEED_JOIN)
> +		oldstyle_perm |= KEY_NEED_SEARCH;
> +	if (perm & KEY_NEED_CLEAR)
> +		oldstyle_perm |= KEY_NEED_WRITE;
> +
>   	sid = cred_sid(cred);
>   
>   	key = key_ref_to_ptr(key_ref);
>   	ksec = key->security;
>   
>   	return avc_has_perm(&selinux_state,
> -			    sid, ksec->sid, SECCLASS_KEY, perm, NULL);
> +			    sid, ksec->sid, SECCLASS_KEY, oldstyle_perm, NULL);

This might be ok temporarily for compatibility but we'll want to 
ultimately define the new permissions in SELinux and switch over to 
using them if a new policy capability bit is set to indicate that the 
policy supports them.  We should probably decouple the SELinux 
permission bits from the KEY_NEED_* values and explicitly map them all 
at the same time.
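
As a rough illustration of what such an explicit mapping might look like, here is a userspace sketch.  Every constant below is a made-up stand-in (neither the kernel's KEY_NEED_* values nor SELinux's real key class permissions), and policy_has_new_key_perms is a hypothetical placeholder for the policy capability bit.  The old-style column loosely follows the translation in the hunk above, with the revoke special case simplified.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the KEY_NEED_* request bits (values invented). */
enum {
	NEED_VIEW   = 0x001,
	NEED_READ   = 0x002,
	NEED_WRITE  = 0x004,
	NEED_SEARCH = 0x008,
	NEED_LINK   = 0x010,
	NEED_SETSEC = 0x020,
	NEED_INVAL  = 0x040,
	NEED_REVOKE = 0x080,
	NEED_JOIN   = 0x100,
	NEED_CLEAR  = 0x200,
};

/*
 * Hypothetical SELinux key-class permission bits, numbered differently on
 * purpose so nothing can rely on the two sets of values lining up.
 */
enum {
	SEL_KEY__VIEW    = 0x00100,
	SEL_KEY__READ    = 0x00200,
	SEL_KEY__WRITE   = 0x00400,
	SEL_KEY__SEARCH  = 0x00800,
	SEL_KEY__LINK    = 0x01000,
	SEL_KEY__SETATTR = 0x02000,
	SEL_KEY__INVAL   = 0x04000,
	SEL_KEY__REVOKE  = 0x08000,
	SEL_KEY__JOIN    = 0x10000,
	SEL_KEY__CLEAR   = 0x20000,
	SEL_KEY__SETSEC  = 0x40000,
};

/* One row per request bit: what to check under a new or an old policy. */
struct perm_map {
	unsigned int need;
	unsigned int newstyle;
	unsigned int oldstyle;
};

static const struct perm_map key_perm_map[] = {
	{ NEED_VIEW,   SEL_KEY__VIEW,   SEL_KEY__VIEW    },
	{ NEED_READ,   SEL_KEY__READ,   SEL_KEY__READ    },
	{ NEED_WRITE,  SEL_KEY__WRITE,  SEL_KEY__WRITE   },
	{ NEED_SEARCH, SEL_KEY__SEARCH, SEL_KEY__SEARCH  },
	{ NEED_LINK,   SEL_KEY__LINK,   SEL_KEY__LINK    },
	{ NEED_SETSEC, SEL_KEY__SETSEC, SEL_KEY__SETATTR },
	{ NEED_INVAL,  SEL_KEY__INVAL,  SEL_KEY__SEARCH  },
	{ NEED_REVOKE, SEL_KEY__REVOKE, SEL_KEY__WRITE   },
	{ NEED_JOIN,   SEL_KEY__JOIN,   SEL_KEY__SEARCH  },
	{ NEED_CLEAR,  SEL_KEY__CLEAR,  SEL_KEY__WRITE   },
};

/* Translate requested key permissions into the access vector to check. */
unsigned int map_key_perm(unsigned int need, bool policy_has_new_key_perms)
{
	unsigned int av = 0;
	size_t i;

	for (i = 0; i < sizeof(key_perm_map) / sizeof(key_perm_map[0]); i++) {
		if (need & key_perm_map[i].need)
			av |= policy_has_new_key_perms ?
				key_perm_map[i].newstyle :
				key_perm_map[i].oldstyle;
	}
	return av;
}
```

With a table like this, every request bit is mapped explicitly in one place, and the hook would pass the result of map_key_perm() to the access check instead of the raw perm value.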

>   }
>   
>   static int selinux_key_getsecurity(struct key *key, char **_buffer)
> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
> index feaace1c24a2..c09133115769 100644
> --- a/security/smack/smack_lsm.c
> +++ b/security/smack/smack_lsm.c
> @@ -4407,7 +4407,8 @@ static int smack_key_permission(key_ref_t key_ref,
>   #endif
>   	if (perm & (KEY_NEED_READ | KEY_NEED_SEARCH | KEY_NEED_VIEW))
>   		request |= MAY_READ;
> -	if (perm & (KEY_NEED_WRITE | KEY_NEED_LINK | KEY_NEED_SETATTR))
> +	if (perm & (KEY_NEED_WRITE | KEY_NEED_LINK | KEY_NEED_SETSEC |
> +		    KEY_NEED_INVAL | KEY_NEED_REVOKE | KEY_NEED_CLEAR))
>   		request |= MAY_WRITE;
>   	rc = smk_access(tkp, keyp->security, request, &ad);
>   	rc = smk_bu_note("key access", tkp, keyp->security, request, rc);
> 



* Re: [RFC PATCH 22/27] KEYS: Replace uid/gid/perm permissions checking with an ACL
  2019-02-15 16:11 ` [RFC PATCH 22/27] KEYS: Replace uid/gid/perm permissions checking with an ACL David Howells
  2019-02-15 17:32   ` Stephen Smalley
@ 2019-02-15 17:39   ` David Howells
  1 sibling, 0 replies; 61+ messages in thread
From: David Howells @ 2019-02-15 17:39 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: dhowells, keyrings, trond.myklebust, sfrench,
	linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	linux-kernel, Paul Moore, SELinux

Stephen Smalley <sds@tycho.nsa.gov> wrote:

> > --- a/security/selinux/hooks.c
> > +++ b/security/selinux/hooks.c
> > @@ -6560,6 +6560,7 @@ static int selinux_key_permission(key_ref_t key_ref,
> >   {
> >   	struct key *key;
> >   	struct key_security_struct *ksec;
> > +	unsigned oldstyle_perm;
> >   	u32 sid;
> >   
> >   	/* if no specific permissions are requested, we skip the
> > @@ -6568,13 +6569,26 @@ static int selinux_key_permission(key_ref_t key_ref,
> >   	if (perm == 0)
> >   		return 0;
> >   
> > +	oldstyle_perm = perm & (KEY_NEED_VIEW | KEY_NEED_READ | KEY_NEED_WRITE |
> > +				KEY_NEED_SEARCH | KEY_NEED_LINK);
> > +	if (perm & KEY_NEED_SETSEC)
> > +		oldstyle_perm |= OLD_KEY_NEED_SETATTR;
> > +	if (perm & KEY_NEED_INVAL)
> > +		oldstyle_perm |= KEY_NEED_SEARCH;
> > +	if (perm & KEY_NEED_REVOKE && !(perm & OLD_KEY_NEED_SETATTR))
> > +		oldstyle_perm |= KEY_NEED_WRITE;
> > +	if (perm & KEY_NEED_JOIN)
> > +		oldstyle_perm |= KEY_NEED_SEARCH;
> > +	if (perm & KEY_NEED_CLEAR)
> > +		oldstyle_perm |= KEY_NEED_WRITE;
> > +
> >   	sid = cred_sid(cred);
> >   
> >   	key = key_ref_to_ptr(key_ref);
> >   	ksec = key->security;
> >   
> >   	return avc_has_perm(&selinux_state,
> > -			    sid, ksec->sid, SECCLASS_KEY, perm, NULL);
> > +			    sid, ksec->sid, SECCLASS_KEY, oldstyle_perm, NULL);
> 
> This might be ok temporarily for compatibility but we'll want to ultimately
> define the new permissions in SELinux and switch over to using them if a new
> policy capability bit is set to indicate that the policy supports them.  We
> should probably decouple the SELinux permission bits from the KEY_NEED_*
> values and explicitly map them all at the same time.

Sounds reasonable.  I should probably detach the first two ACL patches from
the set and push them separately.

David


* Re: [RFC PATCH 04/27] containers: Allow a process to be forked into a container
  2019-02-15 16:07 ` [RFC PATCH 04/27] containers: Allow a process to be forked into a container David Howells
@ 2019-02-15 17:39   ` Stephen Smalley
  2019-02-19 16:39   ` Eric W. Biederman
  2019-02-19 23:16   ` David Howells
  2 siblings, 0 replies; 61+ messages in thread
From: Stephen Smalley @ 2019-02-15 17:39 UTC (permalink / raw)
  To: David Howells, keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	linux-kernel

On 2/15/19 11:07 AM, David Howells wrote:
> Allow a single process to be forked directly into a container using a new
> syscall, thereby 'booting' the container:
> 
> 	pid_t pid = fork_into_container(int container_fd);
> 
> This process will be the 'init' process of the container.
> 
> Further attempts to fork into the container will be rejected.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>   arch/x86/entry/syscalls/syscall_32.tbl |    1
>   arch/x86/entry/syscalls/syscall_64.tbl |    1
>   arch/x86/ia32/sys_ia32.c               |    2 -
>   include/linux/cred.h                   |    3 +
>   include/linux/nsproxy.h                |    7 ++
>   include/linux/sched/task.h             |    3 +
>   include/linux/syscalls.h               |    1
>   kernel/cred.c                          |   45 +++++++++++++
>   kernel/fork.c                          |  110 ++++++++++++++++++++++++++------
>   kernel/nsproxy.c                       |   11 +++
>   kernel/sys_ni.c                        |    1
>   11 files changed, 157 insertions(+), 28 deletions(-)
> 
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 3564814a5d21..8666693510f9 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -408,3 +408,4 @@
>   394	i386	mount_notify		sys_mount_notify		__ia32_sys_mount_notify
>   395	i386	sb_notify		sys_sb_notify			__ia32_sys_sb_notify
>   396	i386	container_create	sys_container_create		__ia32_sys_container_create
> +397	i386	fork_into_container	sys_fork_into_container		__ia32_sys_fork_into_container
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index aa6cccbe5271..d40d4790fcb2 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -353,6 +353,7 @@
>   342	common	mount_notify		__x64_sys_mount_notify
>   343	common	sb_notify		__x64_sys_sb_notify
>   344	common	container_create	__x64_sys_container_create
> +345	common	fork_into_container	__x64_sys_fork_into_container
>   
>   #
>   # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/arch/x86/ia32/sys_ia32.c b/arch/x86/ia32/sys_ia32.c
> index a43212036257..080d9e21b697 100644
> --- a/arch/x86/ia32/sys_ia32.c
> +++ b/arch/x86/ia32/sys_ia32.c
> @@ -238,5 +238,5 @@ COMPAT_SYSCALL_DEFINE5(x86_clone, unsigned long, clone_flags,
>   		       unsigned long, tls_val, int __user *, child_tidptr)
>   {
>   	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr,
> -			tls_val);
> +			tls_val, NULL);
>   }
> diff --git a/include/linux/cred.h b/include/linux/cred.h
> index 4907c9df86b3..357e743d5d4a 100644
> --- a/include/linux/cred.h
> +++ b/include/linux/cred.h
> @@ -23,6 +23,7 @@
>   
>   struct cred;
>   struct inode;
> +struct container;
>   
>   /*
>    * COW Supplementary groups list
> @@ -155,7 +156,7 @@ struct cred {
>   
>   extern void __put_cred(struct cred *);
>   extern void exit_creds(struct task_struct *);
> -extern int copy_creds(struct task_struct *, unsigned long);
> +extern int copy_creds(struct task_struct *, unsigned long, struct container *);
>   extern const struct cred *get_task_cred(struct task_struct *);
>   extern struct cred *cred_alloc_blank(void);
>   extern struct cred *prepare_creds(void);
> diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
> index 2ae1b1a4d84d..81838ae24a92 100644
> --- a/include/linux/nsproxy.h
> +++ b/include/linux/nsproxy.h
> @@ -11,6 +11,7 @@ struct ipc_namespace;
>   struct pid_namespace;
>   struct cgroup_namespace;
>   struct fs_struct;
> +struct container;
>   
>   /*
>    * A structure to contain pointers to all per-process
> @@ -63,9 +64,13 @@ extern struct nsproxy init_nsproxy;
>    *         * /
>    *     task_unlock(task);
>    *
> + *  4. Container namespaces are set at container creation and cannot be
> + *     changed.
> + *
>    */
>   
> -int copy_namespaces(unsigned long flags, struct task_struct *tsk);
> +int copy_namespaces(unsigned long flags, struct task_struct *tsk,
> +		    struct container *dest_container);
>   void exit_task_namespaces(struct task_struct *tsk);
>   void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new);
>   void free_nsproxy(struct nsproxy *ns);
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 44c6f15800ff..bdff71b0fb66 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -73,7 +73,8 @@ extern void do_group_exit(int);
>   extern void exit_files(struct task_struct *);
>   extern void exit_itimers(struct signal_struct *);
>   
> -extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
> +extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *,
> +		     int __user *, unsigned long, struct container *);
>   extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
>   struct task_struct *fork_idle(int);
>   extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index dac42098c2dd..15e5cc704df3 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -946,6 +946,7 @@ asmlinkage long sys_sb_notify(int dfd, const char __user *path,
>   asmlinkage long sys_container_create(const char __user *name, unsigned int flags,
>   				     unsigned long spare3, unsigned long spare4,
>   				     unsigned long spare5);
> +asmlinkage long sys_fork_into_container(int containerfd);
>   
>   /*
>    * Architecture-specific system calls
> diff --git a/kernel/cred.c b/kernel/cred.c
> index 21f4a97085b4..f0ee5cec533d 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -313,6 +313,43 @@ struct cred *prepare_exec_creds(void)
>   	return new;
>   }
>   
> +/*
> + * Handle forking a process into a container.
> + */
> +static struct cred *copy_container_creds(struct container *dest_container)
> +{
> +	struct cred *new;
> +
> +	validate_process_creds();
> +
> +	new = kmem_cache_alloc(cred_jar, GFP_KERNEL);
> +	if (!new)
> +		return NULL;
> +
> +	kdebug("prepare_creds() alloc %p", new);
> +
> +	memcpy(new, dest_container->cred, sizeof(struct cred));
> +
> +	atomic_set(&new->usage, 1);
> +	set_cred_subscribers(new, 0);
> +	get_group_info(new->group_info);
> +	get_uid(new->user);
> +	get_user_ns(new->user_ns);
> +
> +#ifdef CONFIG_SECURITY
> +	new->security = NULL;
> +#endif
> +
> +	if (security_prepare_creds(new, dest_container->cred, GFP_KERNEL) < 0)
> +		goto error;
> +	validate_creds(new);
> +	return new;
> +
> +error:
> +	abort_creds(new);
> +	return NULL;
> +}
> +
>   /*
>    * Copy credentials for the new process created by fork()
>    *
> @@ -322,7 +359,8 @@ struct cred *prepare_exec_creds(void)
>    * The new process gets the current process's subjective credentials as its
>    * objective and subjective credentials
>    */
> -int copy_creds(struct task_struct *p, unsigned long clone_flags)
> +int copy_creds(struct task_struct *p, unsigned long clone_flags,
> +	       struct container *dest_container)
>   {
>   	struct cred *new;
>   	int ret;
> @@ -343,7 +381,10 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
>   		return 0;
>   	}
>   
> -	new = prepare_creds();
> +	if (dest_container)
> +		new = copy_container_creds(dest_container);

Shouldn't there be a check between the current process' credentials and 
the destination container's credentials before allowing this to occur?

> +	else
> +		new = prepare_creds();
>   	if (!new)
>   		return -ENOMEM;
>   
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 009cf7e63894..71401deb4434 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1385,9 +1385,33 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
>   	return retval;
>   }
>   
> -static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
> +static int copy_fs(unsigned long clone_flags, struct task_struct *tsk,
> +		   struct container *dest_container)
>   {
>   	struct fs_struct *fs = current->fs;
> +
> +#ifdef CONFIG_CONTAINERS
> +	if (dest_container) {
> +		fs = kmem_cache_alloc(fs_cachep, GFP_KERNEL);
> +		if (!fs)
> +			return -ENOMEM;
> +
> +		fs->users = 1;
> +		fs->in_exec = 0;
> +		spin_lock_init(&fs->lock);
> +		seqcount_init(&fs->seq);
> +		fs->umask = 0022;
> +
> +		spin_lock(&dest_container->lock);
> +		fs->pwd = fs->root = dest_container->root;
> +		path_get(&fs->root);
> +		path_get(&fs->pwd);
> +		spin_unlock(&dest_container->lock);
> +		tsk->fs = fs;
> +		return 0;
> +	}
> +#endif
> +
>   	if (clone_flags & CLONE_FS) {
>   		/* tsk->fs is already what we want */
>   		spin_lock(&fs->lock);
> @@ -1679,7 +1703,8 @@ static __latent_entropy struct task_struct *copy_process(
>   					struct pid *pid,
>   					int trace,
>   					unsigned long tls,
> -					int node)
> +					int node,
> +					struct container *dest_container)
>   {
>   	int retval;
>   	struct task_struct *p;
> @@ -1783,7 +1808,7 @@ static __latent_entropy struct task_struct *copy_process(
>   	}
>   	current->flags &= ~PF_NPROC_EXCEEDED;
>   
> -	retval = copy_creds(p, clone_flags);
> +	retval = copy_creds(p, clone_flags, dest_container);
>   	if (retval < 0)
>   		goto bad_fork_free;
>   
> @@ -1905,7 +1930,7 @@ static __latent_entropy struct task_struct *copy_process(
>   	retval = copy_files(clone_flags, p);
>   	if (retval)
>   		goto bad_fork_cleanup_semundo;
> -	retval = copy_fs(clone_flags, p);
> +	retval = copy_fs(clone_flags, p, dest_container);
>   	if (retval)
>   		goto bad_fork_cleanup_files;
>   	retval = copy_sighand(clone_flags, p);
> @@ -1917,15 +1942,15 @@ static __latent_entropy struct task_struct *copy_process(
>   	retval = copy_mm(clone_flags, p);
>   	if (retval)
>   		goto bad_fork_cleanup_signal;
> -	retval = copy_namespaces(clone_flags, p);
> +	retval = copy_container(clone_flags, p, dest_container);
>   	if (retval)
>   		goto bad_fork_cleanup_mm;
> -	retval = copy_container(clone_flags, p, NULL);
> +	retval = copy_namespaces(clone_flags, p, dest_container);
>   	if (retval)
> -		goto bad_fork_cleanup_namespaces;
> +		goto bad_fork_cleanup_container;
>   	retval = copy_io(clone_flags, p);
>   	if (retval)
> -		goto bad_fork_cleanup_container;
> +		goto bad_fork_cleanup_namespaces;
>   	retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
>   	if (retval)
>   		goto bad_fork_cleanup_io;
> @@ -2124,10 +2149,10 @@ static __latent_entropy struct task_struct *copy_process(
>   bad_fork_cleanup_io:
>   	if (p->io_context)
>   		exit_io_context(p);
> -bad_fork_cleanup_container:
> -	exit_container(p);
>   bad_fork_cleanup_namespaces:
>   	exit_task_namespaces(p);
> +bad_fork_cleanup_container:
> +	exit_container(p);
>   bad_fork_cleanup_mm:
>   	if (p->mm)
>   		mmput(p->mm);
> @@ -2183,7 +2208,7 @@ struct task_struct *fork_idle(int cpu)
>   {
>   	struct task_struct *task;
>   	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0,
> -			    cpu_to_node(cpu));
> +			    cpu_to_node(cpu), NULL);
>   	if (!IS_ERR(task)) {
>   		init_idle_pids(task);
>   		init_idle(task, cpu);
> @@ -2195,15 +2220,16 @@ struct task_struct *fork_idle(int cpu)
>   /*
>    *  Ok, this is the main fork-routine.
>    *
> - * It copies the process, and if successful kick-starts
> - * it and waits for it to finish using the VM if required.
> + * It copies the process into the specified container, and if successful
> + * kick-starts it and waits for it to finish using the VM if required.
>    */
>   long _do_fork(unsigned long clone_flags,
>   	      unsigned long stack_start,
>   	      unsigned long stack_size,
>   	      int __user *parent_tidptr,
>   	      int __user *child_tidptr,
> -	      unsigned long tls)
> +	      unsigned long tls,
> +	      struct container *dest_container)
>   {
>   	struct completion vfork;
>   	struct pid *pid;
> @@ -2229,8 +2255,32 @@ long _do_fork(unsigned long clone_flags,
>   			trace = 0;
>   	}
>   
> +	if (dest_container) {
> +		/* A process spawned into a container doesn't share anything
> +		 * with the parent other than namespaces.
> +		 */
> +		if (clone_flags & (CLONE_CHILD_CLEARTID |
> +				   CLONE_CHILD_SETTID |
> +				   CLONE_FILES |
> +				   CLONE_FS |
> +				   CLONE_IO |
> +				   CLONE_PARENT |
> +				   CLONE_PARENT_SETTID |
> +				   CLONE_PTRACE |
> +				   CLONE_SETTLS |
> +				   CLONE_SIGHAND |
> +				   CLONE_SYSVSEM |
> +				   CLONE_THREAD))
> +			return -EINVAL;
> +
> +		/* However, we do have to let kernel threads borrow a VM. */
> +		if ((clone_flags & CLONE_VM) && current->mm)
> +			return -EINVAL;
> +	}
> +	
>   	p = copy_process(clone_flags, stack_start, stack_size,
> -			 child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
> +			 child_tidptr, NULL, trace, tls, NUMA_NO_NODE,
> +			 dest_container);
>   	add_latent_entropy();
>   
>   	if (IS_ERR(p))
> @@ -2279,7 +2329,7 @@ long do_fork(unsigned long clone_flags,
>   	      int __user *child_tidptr)
>   {
>   	return _do_fork(clone_flags, stack_start, stack_size,
> -			parent_tidptr, child_tidptr, 0);
> +			parent_tidptr, child_tidptr, 0, NULL);
>   }
>   #endif
>   
> @@ -2289,14 +2339,14 @@ long do_fork(unsigned long clone_flags,
>   pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
>   {
>   	return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
> -		(unsigned long)arg, NULL, NULL, 0);
> +			(unsigned long)arg, NULL, NULL, 0, NULL);
>   }
>   
>   #ifdef __ARCH_WANT_SYS_FORK
>   SYSCALL_DEFINE0(fork)
>   {
>   #ifdef CONFIG_MMU
> -	return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
> +	return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0, NULL);
>   #else
>   	/* can not support in nommu mode */
>   	return -EINVAL;
> @@ -2308,7 +2358,26 @@ SYSCALL_DEFINE0(fork)
>   SYSCALL_DEFINE0(vfork)
>   {
>   	return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
> -			0, NULL, NULL, 0);
> +			0, NULL, NULL, 0, NULL);
> +}
> +#endif
> +
> +#ifdef CONFIG_CONTAINERS
> +SYSCALL_DEFINE1(fork_into_container, int, containerfd)
> +{
> +	struct fd f = fdget(containerfd);
> +	int ret;
> +
> +	if (!f.file)
> +		return -EBADF;
> +	ret = -EINVAL;
> +	if (is_container_file(f.file)) {
> +		struct container *dest_container = f.file->private_data;
> +
> +		ret = _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0, dest_container);
> +	}
> +	fdput(f);
> +	return ret;
>   }
>   #endif
>   
> @@ -2336,7 +2405,8 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
>   		 unsigned long, tls)
>   #endif
>   {
> -	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
> +	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls,
> +			NULL);
>   }
>   #endif
>   
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index 4bb5184b3a80..4031075300a4 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -136,12 +136,19 @@ struct nsproxy *create_new_namespaces(unsigned long flags,
>    * called from clone.  This now handles copy for nsproxy and all
>    * namespaces therein.
>    */
> -int copy_namespaces(unsigned long flags, struct task_struct *tsk)
> +int copy_namespaces(unsigned long flags, struct task_struct *tsk,
> +		    struct container *dest_container)
>   {
>   	struct nsproxy *old_ns = tsk->nsproxy;
>   	struct user_namespace *user_ns = task_cred_xxx(tsk, user_ns);
>   	struct nsproxy *new_ns;
>   
> +	if (dest_container) {
> +		get_nsproxy(dest_container->ns);
> +		tsk->nsproxy = dest_container->ns;
> +		return 0;
> +	}
> +
>   	if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
>   			      CLONE_NEWPID | CLONE_NEWNET |
>   			      CLONE_NEWCGROUP)))) {
> @@ -163,7 +170,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
>   		(CLONE_NEWIPC | CLONE_SYSVSEM))
>   		return -EINVAL;
>   
> -	new_ns = create_new_namespaces(flags, tsk->nsproxy, user_ns, tsk->fs);
> +	new_ns = create_new_namespaces(flags, old_ns, user_ns, tsk->fs);
>   	if (IS_ERR(new_ns))
>   		return  PTR_ERR(new_ns);
>   
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index f0455cbb91cf..a23ad529d548 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -144,6 +144,7 @@ COND_SYSCALL(container_create);
>   /* kernel/exit.c */
>   
>   /* kernel/fork.c */
> +COND_SYSCALL(fork_into_container);
>   
>   /* kernel/futex.c */
>   COND_SYSCALL(futex);
> 



* Re: [RFC PATCH 20/27] container, keys: Add a container keyring
  2019-02-15 16:10 ` [RFC PATCH 20/27] container, keys: Add a container keyring David Howells
@ 2019-02-15 21:46   ` Eric Biggers
  0 siblings, 0 replies; 61+ messages in thread
From: Eric Biggers @ 2019-02-15 21:46 UTC (permalink / raw)
  To: David Howells
  Cc: keyrings, trond.myklebust, sfrench, linux-security-module,
	linux-nfs, linux-cifs, linux-fsdevel, rgb, linux-kernel,
	linux-fscrypt

[+Cc linux-fscrypt]

Hi David,

On Fri, Feb 15, 2019 at 04:10:45PM +0000, David Howells wrote:
> Allow a container manager to attach keyrings to a container such that the
> keys contained therein are searched by request_key() in addition to a
> process's normal keyrings.  This allows the manager to install keys to
> support filesystem decryption and authentication for superblocks inside the
> container without requiring any active role being played by processes
> inside of the container.
> 
> So, for example, a container could be created, a keyring added and then an
> rxrpc-type key added to the keyring such that a container's root filesystem
> and data filesystems can be brought in from secure AFS volumes.  It would
> also be possible to put filesystem crypto keys in there such that Ext4
> encrypted files could be decrypted - without the need to share the key
> between other containers or let the key leak into the container.

For fscrypt (aka ext4/f2fs/ubifs encryption), rather than a "container keyring",
I think it's much better served by ioctls to add/remove keys directly to/from
the filesystem, as I'm proposing here:
https://patchwork.kernel.org/cover/10806425/.  My proposed API implements all
the semantics people actually need for fscrypt, including:

- Making the filesystem's ability to use keys match the locked/unlocked state of
  encrypted files, which is a filesystem-wide thing not a per-process thing.

- Allowing a key to be removed and wiped, *and* the corresponding encrypted
  files locked efficiently.

- Still permitting non-root users to use fscrypt, subject to limitations; e.g.
  keys are identified by cryptographic hash, users are limited by the keys
  quotas, and a user can't directly remove a key another user has added or
  create a new encrypted directory without proving they know/knew the key.

A "container keyring" would only address the first problem.

I don't think it's the right semantics to have the kernel's ability to use
fscrypt keys be conditional on which process is doing the filesystem access --
even if the processes are divided into different sessions, users, or containers.
Doing so may sound good, but it plays into common misconceptions about the
purpose of storage encryption.  It would actually be an OS-level access control
policy that has nothing to do with the encryption itself.  The kernel already
has a wide variety of file access control mechanisms to choose from: file mode
bits, ACLs, SELinux, mount namespaces, etc...

The purpose of fscrypt is actually very different.  It's designed to protect
data locally stored on-disk from two classes of attackers: (1) attackers who can
read directly from disk, and (2) attackers who fully compromise the system
on-line including all memory, provided that the key isn't currently added.

In these cases, the notion of a "container" is meaningless as the operating
system is already out of the picture...

I also don't see much benefit to namespacing fscrypt keys for container
isolation purposes.  If it's at all computationally feasible for keys to
collide, then the encryption has already been massively screwed up.
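
As a back-of-the-envelope check on that claim: assuming keys (or hash-derived key identifiers) are uniformly random k-bit values, which is an assumption made for this estimate rather than something stated above, a union bound over all pairs limits the chance that any two of n of them collide.  The choice of k = 128 and n = 2^32 below is purely illustrative.

```latex
\[
  P_{\text{collision}} \;\le\; \frac{n(n-1)}{2} \cdot 2^{-k} \;<\; \frac{n^{2}}{2^{k+1}}
\]
\[
  k = 128,\quad n = 2^{32}: \qquad
  P_{\text{collision}} \;<\; \frac{2^{64}}{2^{129}} \;=\; 2^{-65}
\]
```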

Also, I don't think that fscrypt should have a de-facto dependency on
CONFIG_CONTAINERS in order to have sane semantics.  fscrypt is used on many
systems where containers support would be unnecessary bloat and attack surface.

So while there probably are still good arguments for adding a container keyring,
I don't think it's the best way forward for fscrypt.

- Eric

> 
> Because the container manager retains control of the keyring, it can update
> the contained keys as necessary to prevent expiration.  Note that the
> keyring and keys in the keyring must grant Search permission directly to
> the container object.
> 
> [!] Note that NFS, CIFS and other filesystems wishing to make use of this
>     would have to get the token to use by calling request_key() on entry to
>     its VFS methods and retain it in its file struct.
> 
> [!] Note that request_key() called from userspace does not look in the
>     container keyring.
> 
> [!] Note that keys are now tagged with a tag that identifies the network
>     namespace (or other domain of operation).  This allows keys to be
>     provided in one keyring that allow the same thing but in different
>     network namespaces.
> 
> The keyring should be created by the container manager and then set using:
> 
> 	keyctl(KEYCTL_SET_CONTAINER_KEYRING, int containerfd,
> 	       key_serial_t keyring);
> 
> With this, request_key() inside the kernel searches:
> 
> 	thread-keyring, process-keyring, session-keyring, container-keyring
> 
> [!] It may be worth setting a flag on a mountpoint to indicate whether to
>     search the container keyring first or last.
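
The search order quoted above can be modelled in a few lines of userspace C.  This is only a sketch: struct model_keyring, struct model_ctx, model_request_key() and the container_first switch are hypothetical names invented for illustration (the switch stands in for the possible mountpoint flag mentioned in the quoted text), and none of it is the kernel's actual request_key() implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_KEYS 4

/* Toy keyring: a name plus a few key descriptions (NULL-terminated). */
struct model_keyring {
	const char *name;
	const char *keys[MAX_KEYS];
};

/* Toy search context: the three per-process keyrings plus the container one. */
struct model_ctx {
	const struct model_keyring *thread;
	const struct model_keyring *process;
	const struct model_keyring *session;
	const struct model_keyring *container;
};

/* Does this keyring hold a key with the given description? */
static bool keyring_has(const struct model_keyring *kr, const char *desc)
{
	size_t i;

	if (!kr)
		return false;
	for (i = 0; i < MAX_KEYS && kr->keys[i]; i++)
		if (strcmp(kr->keys[i], desc) == 0)
			return true;
	return false;
}

/*
 * Return the name of the first keyring holding a matching key, or NULL if
 * none does.  The default order is thread, process, session, container;
 * container_first moves the container keyring to the front instead.
 */
const char *model_request_key(const struct model_ctx *ctx, const char *desc,
			      bool container_first)
{
	const struct model_keyring *order[4];
	size_t n = 0, i;

	if (container_first)
		order[n++] = ctx->container;
	order[n++] = ctx->thread;
	order[n++] = ctx->process;
	order[n++] = ctx->session;
	if (!container_first)
		order[n++] = ctx->container;

	for (i = 0; i < n; i++)
		if (keyring_has(order[i], desc))
			return order[i]->name;
	return NULL;
}
```

In this model a key present in both the session and container keyrings resolves to the session copy by default, and to the container copy only when container_first is set, which is the trade-off the proposed flag would control.
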
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---


* Re: [RFC PATCH 00/27] Containers and using authenticated filesystems
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (26 preceding siblings ...)
  2019-02-15 16:12 ` [RFC PATCH 27/27] containers: Sample to grant access to a key in a container David Howells
@ 2019-02-15 22:36 ` James Morris
  2019-02-19 16:35 ` Eric W. Biederman
  2019-02-19 23:42 ` David Howells
  29 siblings, 0 replies; 61+ messages in thread
From: James Morris @ 2019-02-15 22:36 UTC (permalink / raw)
  To: David Howells
  Cc: keyrings, trond.myklebust, sfrench, linux-security-module,
	linux-nfs, linux-cifs, linux-fsdevel, rgb, linux-kernel,
	ebiederm

On Fri, 15 Feb 2019, David Howells wrote:

> 
> Here's a collection of patches that containerises the kernel keys and makes
> it possible to separate keys by namespace.  This can be extended to any
> filesystem that uses request_key() to obtain the pertinent authentication
> token on entry to VFS or socket methods.

Shouldn't Eric Biederman be cc'd on this?

-- 
James Morris
<jmorris@namei.org>



* Re: [RFC PATCH 08/27] containers, vfs: Honour CONTAINER_NEW_EMPTY_FS_NS
  2019-02-15 16:08 ` [RFC PATCH 08/27] containers, vfs: Honour CONTAINER_NEW_EMPTY_FS_NS David Howells
@ 2019-02-17  0:11   ` Al Viro
  0 siblings, 0 replies; 61+ messages in thread
From: Al Viro @ 2019-02-17  0:11 UTC (permalink / raw)
  To: David Howells
  Cc: keyrings, trond.myklebust, sfrench, linux-security-module,
	linux-nfs, linux-cifs, linux-fsdevel, rgb, linux-kernel

On Fri, Feb 15, 2019 at 04:08:29PM +0000, David Howells wrote:

> +	mnt_ns = alloc_mnt_ns(container->cred->user_ns, false);
> +	if (IS_ERR(mnt_ns)) {
> +		ret = PTR_ERR(mnt_ns);
> +		goto out_fd;
> +	}
> +
> +	mnt = real_mount(path->mnt);
> +	mnt_add_count(mnt, 1);
> +	mnt->mnt_ns = mnt_ns;
> +	mnt_ns->root = mnt;
> +	mnt_ns->mounts++;
> +	list_add(&mnt->mnt_list, &mnt_ns->list);
> +
> +	ret = -EBUSY;
> +	spin_lock(&container->lock);
> +	if (!container->ns->mnt_ns) {
> +		container->ns->mnt_ns = mnt_ns;
> +		write_seqcount_begin(&container->seq);
> +		container->root.mnt = path->mnt;
> +		container->root.dentry = path->dentry;
> +		write_seqcount_end(&container->seq);
> +		path_get(&container->root);
> +		mnt_ns = NULL;
> +		ret = 0;
> +	}

Almost certainly buggered.  Assumptions that we _won't_ get
to absolute root of namespace (it's overmounted and we are
chrooted into it, basically) had been made in quite a few
places.  The thing you are creating is *not* like normal
namespaces in that respect.


* Re: [RFC PATCH 09/27] vfs: Allow mounting to other namespaces
  2019-02-15 16:08 ` [RFC PATCH 09/27] vfs: Allow mounting to other namespaces David Howells
@ 2019-02-17  0:14   ` Al Viro
  0 siblings, 0 replies; 61+ messages in thread
From: Al Viro @ 2019-02-17  0:14 UTC (permalink / raw)
  To: David Howells
  Cc: keyrings, trond.myklebust, sfrench, linux-security-module,
	linux-nfs, linux-cifs, linux-fsdevel, rgb, linux-kernel

On Fri, Feb 15, 2019 at 04:08:46PM +0000, David Howells wrote:
> Currently sys_move_mount() and sys_mount(MS_MOVE) prevent the caller from
> moving a mount into a namespace not their own.  Relax this such that any
> mount can be mounted onto any given mountpoint provided that the source
> mount is either detached or the same namespace as the destination.
> 
> This permits container namespaces to be built from the outside rather than
> from the inside.

I'm looking forward to your analysis of security implications, as well as
the proof that attach_recursive_mnt() won't get confused by that...


* Re: [RFC PATCH 02/27] containers: Implement containers as kernel objects
  2019-02-15 16:07 ` [RFC PATCH 02/27] containers: Implement containers as kernel objects David Howells
@ 2019-02-17 18:57   ` Trond Myklebust
  2019-02-17 19:39   ` James Bottomley
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 61+ messages in thread
From: Trond Myklebust @ 2019-02-17 18:57 UTC (permalink / raw)
  To: sfrench, dhowells, keyrings
  Cc: rgb, linux-kernel, linux-security-module, linux-nfs, linux-cifs,
	linux-fsdevel

Hi David,

On Fri, 2019-02-15 at 16:07 +0000, David Howells wrote:
> Implement a kernel container object such that it contains the following
> things:
> 
>  (1) Namespaces.
> 
>  (2) A root directory.
> 
>  (3) A set of processes, including one designated as the 'init' process.
> 
> A container is created and attached to a file descriptor by:
> 
> 	int cfd = container_create(const char *name, unsigned int flags);
> 
> this inherits all the namespaces of the parent container unless otherwise
> the mask calls for new namespaces.
> 
> 	CONTAINER_NEW_FS_NS
> 	CONTAINER_NEW_EMPTY_FS_NS
> 	CONTAINER_NEW_CGROUP_NS [root only]
> 	CONTAINER_NEW_UTS_NS
> 	CONTAINER_NEW_IPC_NS
> 	CONTAINER_NEW_USER_NS
> 	CONTAINER_NEW_PID_NS
> 	CONTAINER_NEW_NET_NS
> 
> Other flags include:
> 
> 	CONTAINER_KILL_ON_CLOSE
> 	CONTAINER_CLOSE_ON_EXEC
> 
> Note that I've added a pointer to the current container to task_struct.
> This doesn't make the nsproxy pointer redundant as you can still make new
> namespaces with clone().
> 
> I've also added a list_head to task_struct to form a list in the container
> of its member processes.  This is convenient, but redundant since the code
> could iterate over all the tasks looking for ones that have a matching
> task->container.
> 
> It might make sense to use fsconfig() to configure the container:
> 
> 	fsconfig(cfd, FSCONFIG_SET_NAMESPACE, "user", NULL, userns_fd);
> 	fsconfig(cfd, FSCONFIG_SET_NAMESPACE, "mnt", NULL, mntns_fd);
> 	fsconfig(cfd, FSCONFIG_SET_FD, "rootfs", NULL, root_fd);
> 	fsconfig(cfd, FSCONFIG_CMD_CREATE_CONTAINER, NULL, NULL, 0);
> 
> 
> ==================
> FUTURE DEVELOPMENT
> ==================
> 
>  (1) Setting up the container.
> 
>      A container would be created with, say:
> 
> 	int cfd = container_create("fred", CONTAINER_NEW_EMPTY_FS_NS);
> 
>      Once created, it should then be possible for the supervising process
>      to modify the new container.  Mounts can be created inside of the
>      container's namespaces:
> 
> 	fsfd = fsopen("ext4", 0);
> 	fsconfig(fsfd, FSCONFIG_SET_CONTAINER, NULL, NULL, cfd);
> 	fsconfig(fsfd, FSCONFIG_SET_STRING, "source", "/dev/sda3", 0);
> 	fsconfig(fsfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
> 	fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> 	mfd = fsmount(fsfd, 0, 0);
> 
>      and then mounted into the namespace:
> 
> 	move_mount(mfd, "", cfd, "/",
> 		   MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_T_CONTAINER_ROOT);
> 
>      Further mounts can be added by:
> 
> 	move_mount(mfd, "", cfd, "proc", MOVE_MOUNT_F_EMPTY_PATH);
> 
>      Files and devices can be created by supplying the container fd as the
>      dirfd argument:
> 
> 	mkdirat(int cfd, const char *path, mode_t mode);
> 	mknodat(int cfd, const char *path, mode_t mode, dev_t dev);
> 	int fd = openat(int cfd, const char *path,
> 			unsigned int flags, mode_t mode);
> 
>      [*] Note that when using cfd as dirfd, the path must not contain a '/'
>      	 at the front.
> 
>      Sockets, such as netlink, can be opened inside of the container's
>      namespaces:
> 
> 	int fd = container_socket(int cfd, int domain, int type,
> 				  int protocol);
> 
>      This should allow management of the container's network namespace from
>      outside.
> 
>  (2) Starting the container.
> 
>      Once all modifications are complete, the container's 'init' process
>      can be started by:
> 
> 	fork_into_container(int cfd);
> 
>      This precludes further external modification of the mount tree within
>      the container.  Before this point, the container is simply destroyed
>      if the container fd is closed.
> 
>  (3) Waiting for the container to complete.
> 
>      The container fd can then be polled to wait for init process therein
>      to complete and the exit code collected by:
> 
> 	container_wait(int container_fd, int *_wstatus, unsigned int wait,
> 		       struct rusage *rusage);
> 
>      The container and everything in it can be terminated or killed off:
> 
> 	container_kill(int container_fd, int initonly, int signal);
> 
>      If 'init' dies, all other processes in the container are preemptively
>      SIGKILL'd by the kernel.
> 
>      By default, if the container is active and its fd is closed, the
>      container is left running and will be cleaned up when its 'init' exits.
>      The default can be changed with the CONTAINER_KILL_ON_CLOSE flag.
> 
>  (4) Supervising the container.
> 
>      Given that we have an fd attached to the container, we could make it
>      such that the supervising process could monitor and override EPERM
>      returns for mount and other privileged operations within the
>      container.
> 
>  (5) Per-container keyring.
> 
>      Each container can point to a per-container keyring for the holding of
>      integrity keys and filesystem keys for use inside the container.  This
>      would be attached:
> 
> 	keyctl(KEYCTL_SET_CONTAINER_KEYRING, cfd, keyring)
> 
>      This keyring would be searched by request_key() after it has searched
>      the thread, process and session keyrings.
> 
>  (6) Running different LSM policies by container.  This might particularly
>      make sense with something like AppArmor where different path-based
>      rules might be required inside a container to inside the parent.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---

Do we really need a new system call to set up containers? That would
force changes to all existing orchestration software.

Given that the main thing we want to achieve is to direct messages from
the kernel to an appropriate handler, why not focus on adding
functionality to do just that?

Is there any reason why a syscall to allow an appropriately privileged
process to add a keyring-specific message queue to its own
user_namespace and obtain a file descriptor to that message queue might
not work? That forces the container to use a daemon if it cares to
intercept keyring traffic, rather than worrying about the kernel
running request_key (in fact, it might make sense to allow a trivial
implementation of the daemon to be to just read the messages, parse
them and run request_key).

With such an implementation, the fallback mechanism could be to walk
back up the hierarchy of user_namespaces until a message queue is
found, and to invoke the existing request_key mechanism if not.
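
The fallback walk described above can be modelled in a few lines of
userspace C.  This is only a sketch of the lookup order, not kernel code:
struct userns_model, struct key_msg_queue and find_key_queue() are
hypothetical names invented for illustration, and the real user_namespace
has no such queue pointer today.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model types; nothing here is an existing kernel structure. */
struct key_msg_queue {
	int id;				/* stand-in for a real queue object */
};

struct userns_model {
	struct userns_model *parent;	/* NULL for the initial user namespace */
	struct key_msg_queue *queue;	/* NULL if no daemon has registered */
};

/*
 * Walk from @ns towards the root and return the first queue found.
 * NULL means no namespace in the chain has one, in which case the caller
 * would fall back to the existing request_key upcall.
 */
struct key_msg_queue *find_key_queue(struct userns_model *ns)
{
	for (; ns; ns = ns->parent)
		if (ns->queue)
			return ns->queue;
	return NULL;
}
```

A namespace without its own daemon thus inherits its nearest ancestor's
handler, and only a chain with no queue at all reaches the legacy path.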

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 02/27] containers: Implement containers as kernel objects
  2019-02-15 16:07 ` [RFC PATCH 02/27] containers: Implement containers as kernel objects David Howells
  2019-02-17 18:57   ` Trond Myklebust
@ 2019-02-17 19:39   ` James Bottomley
  2019-02-19 16:56   ` Eric W. Biederman
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 61+ messages in thread
From: James Bottomley @ 2019-02-17 19:39 UTC (permalink / raw)
  To: David Howells, keyrings, trond.myklebust, sfrench
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	linux-kernel, containers, cgroups

Added containers and cgroups list, which somehow got lost since they
might have a slight interest in a complete rewrite of the container
API.

On Fri, 2019-02-15 at 16:07 +0000, David Howells wrote:
> Implement a kernel container object such that it contains the
> following things:
> 
>  (1) Namespaces.
> 
>  (2) A root directory.

Doesn't this conflict with how the mount namespace works today?  It
contains the notion of unescapable root and we shouldn't have two of
those in different locations.

>  (3) A set of processes, including one designated as the 'init'
> process.

This is a violation of a fundamental tenet: I can create a "container"
as simply a set of unoccupied namespaces and bind them into the
filesystem with a mount.  This mechanism is what I use for
architectural emulation containers and how network namespaces currently
work.  For all of these cases, the container is empty of processes when
it is created and is selectively filled and emptied of processes as you
use it.

If I create a container without a PID namespace, I definitely wouldn't
want the notion of an "init" process because I'm deliberately avoiding
that.
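
For reference, the "unoccupied namespace bound into the filesystem"
pattern can be sketched roughly as follows, assuming a caller with
CAP_SYS_ADMIN.  pin_new_netns() is a hypothetical helper name; the calls
it makes (unshare(2), mount(2) with MS_BIND on /proc/self/ns/net) are the
existing interfaces.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for unshare() and CLONE_NEWNET */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/mount.h>
#include <unistd.h>

/*
 * Create a new network namespace and pin it at @path by bind-mounting
 * /proc/self/ns/net there.  The caller itself moves into the new
 * namespace; once it exits (or setns()es elsewhere) the namespace holds
 * no processes but stays alive because of the mount.
 * Returns 0 on success or a negative errno.
 */
int pin_new_netns(const char *path)
{
	int fd, err;

	if (!path)
		return -EINVAL;

	/* The bind-mount target must already exist as a regular file. */
	fd = open(path, O_RDONLY | O_CREAT | O_EXCL, 0444);
	if (fd < 0)
		return -errno;
	close(fd);

	if (unshare(CLONE_NEWNET) < 0)
		goto err;
	if (mount("/proc/self/ns/net", path, NULL, MS_BIND, NULL) < 0)
		goto err;
	return 0;

err:
	err = errno;
	unlink(path);		/* do not leave a stale target behind */
	return -err;
}
```

Processes can later enter the pinned namespace by opening the path and
passing the fd to setns(), so no 'init' is ever required.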

> A container is created and attached to a file descriptor by:
> 
> 	int cfd = container_create(const char *name, unsigned int flags);

I thought we got agreement years ago that containers don't exist in
Linux as a single entity: they're currently a collection of cgroups and
namespaces some of which may and some of which may not be local to the
entity the orchestration system thinks of as a "container".
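
As a small illustration of that per-namespace view, two tasks can be
compared one namespace type at a time through /proc/<pid>/ns/, where two
links that resolve to the same device and inode refer to the same
namespace.  same_namespace() below is a hypothetical helper written for
this note, not an existing API.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Return 1 if tasks @a and @b share the namespace named @ns ("mnt", "net",
 * "pid", ...), 0 if they do not, and -1 if either link cannot be examined.
 */
int same_namespace(pid_t a, pid_t b, const char *ns)
{
	char pa[64], pb[64];
	struct stat sa, sb;

	snprintf(pa, sizeof(pa), "/proc/%ld/ns/%s", (long)a, ns);
	snprintf(pb, sizeof(pb), "/proc/%ld/ns/%s", (long)b, ns);

	/* stat() follows the magic link to the namespace object itself. */
	if (stat(pa, &sa) < 0 || stat(pb, &sb) < 0)
		return -1;

	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

Two tasks may well share "net" while differing in "mnt", which is why a
single container identity does not fall out of the current model.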

> this inherits all the namespaces of the parent container unless
> otherwise the mask calls for new namespaces.
> 
> 	CONTAINER_NEW_FS_NS
> 	CONTAINER_NEW_EMPTY_FS_NS
> 	CONTAINER_NEW_CGROUP_NS [root only]
> 	CONTAINER_NEW_UTS_NS
> 	CONTAINER_NEW_IPC_NS
> 	CONTAINER_NEW_USER_NS
> 	CONTAINER_NEW_PID_NS
> 	CONTAINER_NEW_NET_NS
> 
> Other flags include:
> 
> 	CONTAINER_KILL_ON_CLOSE
> 	CONTAINER_FD_CLOEXEC
> 
> Note that I've added a pointer to the current container to
> task_struct. This doesn't make the nsproxy pointer redundant as you
> can still make new namespaces with clone().
> 
> I've also added a list_head to task_struct to form a list in the
> container of its member processes.  This is convenient, but redundant
> since the code could iterate over all the tasks looking for ones that
> have a matching task->container.
> 
> It might make sense to use fsconfig() to configure the container:
> 
> 	fsconfig(cfd, FSCONFIG_SET_NAMESPACE, "user", NULL, userns_fd);
> 	fsconfig(cfd, FSCONFIG_SET_NAMESPACE, "mnt", NULL, mntns_fd);
> 	fsconfig(cfd, FSCONFIG_SET_FD, "rootfs", NULL, root_fd);
> 	fsconfig(cfd, FSCONFIG_CMD_CREATE_CONTAINER, NULL, NULL, 0);

You're trying to introduce a new set of container APIs that don't quite
align with how containers work today.  If I look at the justification
below the whole thing seems to require the notion of a container as an
atomic entity with an exclusive process list.  You can argue that's how
you want it to work, but it looks like this notion would have
difficulty working with the standard kubernetes pod/container notion,
let alone all of the other esoteric ways we use containers today.

James

> 
> ==================
> FUTURE DEVELOPMENT
> ==================
> 
>  (1) Setting up the container.
> 
>      A container would be created with, say:
> 
> 	int cfd = container_create("fred", CONTAINER_NEW_EMPTY_FS_NS);
> 
>      Once created, it should then be possible for the supervising process
>      to modify the new container.  Mounts can be created inside of the
>      container's namespaces:
> 
> 	fsfd = fsopen("ext4", 0);
> 	fsconfig(fsfd, FSCONFIG_SET_CONTAINER, NULL, NULL, cfd);
> 	fsconfig(fsfd, FSCONFIG_SET_STRING, "source", "/dev/sda3", 0);
> 	fsconfig(fsfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
> 	fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> 	mfd = fsmount(fsfd, 0, 0);
> 
>      and then mounted into the namespace:
> 
> 	move_mount(mfd, "", cfd, "/",
> 		   MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_T_CONTAINER_ROOT);
> 
>      Further mounts can be added by:
> 
> 	move_mount(mfd, "", cfd, "proc", MOVE_MOUNT_F_EMPTY_PATH);
> 
>      Files and devices can be created by supplying the container fd as the
>      dirfd argument:
> 
> 	mkdirat(int cfd, const char *path, mode_t mode);
> 	mknodat(int cfd, const char *path, mode_t mode, dev_t dev);
> 	int fd = openat(int cfd, const char *path,
> 			unsigned int flags, mode_t mode);
> 
>      [*] Note that when using cfd as dirfd, the path must not contain a '/'
>      	 at the front.
> 
>      Sockets, such as netlink, can be opened inside of the container's
>      namespaces:
> 
> 	int fd = container_socket(int cfd, int domain, int type,
> 				  int protocol);
> 
>      This should allow management of the container's network namespace from
>      outside.
> 
>  (2) Starting the container.
> 
>      Once all modifications are complete, the container's 'init' process
>      can be started by:
> 
> 	fork_into_container(int cfd);
> 
>      This precludes further external modification of the mount tree within
>      the container.  Before this point, the container is simply destroyed
>      if the container fd is closed.
> 
>  (3) Waiting for the container to complete.
> 
>      The container fd can then be polled to wait for init process therein
>      to complete and the exit code collected by:
> 
> 	container_wait(int container_fd, int *_wstatus, unsigned int wait,
> 		       struct rusage *rusage);
> 
>      The container and everything in it can be terminated or killed off:
> 
> 	container_kill(int container_fd, int initonly, int signal);
> 
>      If 'init' dies, all other processes in the container are preemptively
>      SIGKILL'd by the kernel.
> 
>      By default, if the container is active and its fd is closed, the
>      container is left running and will be cleaned up when its 'init' exits.
>      The default can be changed with the CONTAINER_KILL_ON_CLOSE flag.
> 
>  (4) Supervising the container.
> 
>      Given that we have an fd attached to the container, we could make it
>      such that the supervising process could monitor and override EPERM
>      returns for mount and other privileged operations within the
>      container.
> 
>  (5) Per-container keyring.
> 
>      Each container can point to a per-container keyring for the holding of
>      integrity keys and filesystem keys for use inside the container.  This
>      would be attached:
> 
> 	keyctl(KEYCTL_SET_CONTAINER_KEYRING, cfd, keyring)
> 
>      This keyring would be searched by request_key() after it has searched
>      the thread, process and session keyrings.
> 
>  (6) Running different LSM policies by container.  This might particularly
>      make sense with something like AppArmor where different path-based
>      rules might be required inside a container to inside the parent.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  arch/x86/entry/syscalls/syscall_32.tbl |    1 
>  arch/x86/entry/syscalls/syscall_64.tbl |    1 
>  fs/namespace.c                         |    5 
>  include/linux/container.h              |   86 ++++++++
>  include/linux/init_task.h              |    1 
>  include/linux/lsm_hooks.h              |   20 ++
>  include/linux/sched.h                  |    3 
>  include/linux/security.h               |   15 +
>  include/linux/syscalls.h               |    3 
>  include/uapi/linux/container.h         |   28 +++
>  init/Kconfig                           |    7 +
>  init/init_task.c                       |    3 
>  kernel/Makefile                        |    2 
>  kernel/container.c                     |  348 ++++++++++++++++++++++++++++++++
>  kernel/exit.c                          |    1 
>  kernel/fork.c                          |    7 +
>  kernel/namespaces.h                    |   15 +
>  kernel/nsproxy.c                       |   23 +-
>  kernel/sys_ni.c                        |    3 
>  security/security.c                    |   12 +
>  20 files changed, 571 insertions(+), 13 deletions(-)
>  create mode 100644 include/linux/container.h
>  create mode 100644 include/uapi/linux/container.h
>  create mode 100644 kernel/container.c
>  create mode 100644 kernel/namespaces.h
> 
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index c9db9d51a7df..3564814a5d21 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -407,3 +407,4 @@
>  393	i386	fsinfo			sys_fsinfo			__ia32_sys_fsinfo
>  394	i386	mount_notify		sys_mount_notify		__ia32_sys_mount_notify
>  395	i386	sb_notify		sys_sb_notify			__ia32_sys_sb_notify
> +396	i386	container_create	sys_container_create		__ia32_sys_container_create
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 17869bf7788a..aa6cccbe5271 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -352,6 +352,7 @@
>  341	common	fsinfo			__x64_sys_fsinfo
>  342	common	mount_notify		__x64_sys_mount_notify
>  343	common	sb_notify		__x64_sys_sb_notify
> +344	common	container_create	__x64_sys_container_create
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/namespace.c b/fs/namespace.c
> index f378cfc63043..ea005f55ec4c 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -30,6 +30,7 @@
>  #include <uapi/linux/mount.h>
>  #include <linux/fs_context.h>
>  #include <linux/fsinfo.h>
> +#include <linux/container.h>
>  
>  #include "pnode.h"
>  #include "internal.h"
> @@ -3742,6 +3743,10 @@ static void __init init_mount_tree(void)
>  
>  	set_fs_pwd(current->fs, &root);
>  	set_fs_root(current->fs, &root);
> +#ifdef CONFIG_CONTAINERS
> +	path_get(&root);
> +	init_container.root = root;
> +#endif
>  }
>  
>  void __init mnt_init(void)
> diff --git a/include/linux/container.h b/include/linux/container.h
> new file mode 100644
> index 000000000000..0a8918435097
> --- /dev/null
> +++ b/include/linux/container.h
> @@ -0,0 +1,86 @@
> +/* Container objects
> + *
> + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +#ifndef _LINUX_CONTAINER_H
> +#define _LINUX_CONTAINER_H
> +
> +#include <uapi/linux/container.h>
> +#include <linux/refcount.h>
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/wait.h>
> +#include <linux/path.h>
> +#include <linux/seqlock.h>
> +
> +struct fs_struct;
> +struct nsproxy;
> +struct task_struct;
> +
> +/*
> + * The container object.
> + */
> +struct container {
> +	char			name[24];
> +	u64			id;		/* Container ID */
> +	refcount_t		usage;
> +	int			exit_code;	/* The exit code of 'init' */
> +	const struct cred	*cred;		/* Creds for this container, including userns */
> +	struct nsproxy		*ns;		/* This container's namespaces */
> +	struct path		root;		/* The root of the container's fs namespace */
> +	struct task_struct	*init;		/* The 'init' task for this container */
> +	struct container	*parent;	/* Parent of this container. */
> +	void			*security;	/* LSM data */
> +	struct list_head	members;	/* Member processes, guarded with ->lock */
> +	struct list_head	child_link;	/* Link in parent->children */
> +	struct list_head	children;	/* Child containers */
> +	wait_queue_head_t	waitq;		/* Someone waiting for init to exit waits here */
> +	unsigned long		flags;
> +#define CONTAINER_FLAG_INIT_STARTED	0	/* Init is started - certain ops now prohibited */
> +#define CONTAINER_FLAG_DEAD		1	/* Init has died */
> +#define CONTAINER_FLAG_KILL_ON_CLOSE	2	/* Kill init if container handle closed */
> +	spinlock_t		lock;
> +	seqcount_t		seq;		/* Track changes in ->root */
> +};
> +
> +extern struct container init_container;
> +
> +#ifdef CONFIG_CONTAINERS
> +extern const struct file_operations container_fops;
> +
> +extern int copy_container(unsigned long flags, struct task_struct *tsk,
> +			  struct container *container);
> +extern void exit_container(struct task_struct *tsk);
> +extern void put_container(struct container *c);
> +
> +static inline struct container *get_container(struct container *c)
> +{
> +	refcount_inc(&c->usage);
> +	return c;
> +}
> +
> +static inline bool is_container_file(struct file *file)
> +{
> +	return file->f_op == &container_fops;
> +}
> +
> +#else
> +
> +static inline int copy_container(unsigned long flags, struct task_struct *tsk,
> +				 struct container *container)
> +{ return 0; }
> +static inline void exit_container(struct task_struct *tsk) { }
> +static inline void put_container(struct container *c) {}
> +static inline struct container *get_container(struct container *c) { return NULL; }
> +static inline bool is_container_file(struct file *file) { return false; }
> +
> +#endif /* CONFIG_CONTAINERS */
> +
> +#endif /* _LINUX_CONTAINER_H */
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index a7083a45a26c..f016cadece24 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -10,6 +10,7 @@
>  #include <linux/ipc.h>
>  #include <linux/pid_namespace.h>
>  #include <linux/user_namespace.h>
> +#include <linux/container.h>
>  #include <linux/securebits.h>
>  #include <linux/seqlock.h>
>  #include <linux/rbtree.h>
> diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
> index 52d0f3f4c786..0f310d911815 100644
> --- a/include/linux/lsm_hooks.h
> +++ b/include/linux/lsm_hooks.h
> @@ -1460,6 +1460,16 @@
>   * @bpf_prog_free_security:
>   *	Clean up the security information stored inside bpf prog.
>   *
> + * Security hooks for containers:
> + *
> + * @container_alloc:
> + *	Permit creation of a new container and assign security data.
> + *	@container: The new container.
> + *
> + * @container_free:
> + *	Free security data attached to a container.
> + *	@container: The container.
> + *
>   */
>  union security_list_options {
>  	int (*binder_set_context_mgr)(struct task_struct *mgr);
> @@ -1825,6 +1835,12 @@ union security_list_options {
>  	int (*bpf_prog_alloc_security)(struct bpf_prog_aux *aux);
>  	void (*bpf_prog_free_security)(struct bpf_prog_aux *aux);
>  #endif /* CONFIG_BPF_SYSCALL */
> +
> +	/* Container management security hooks */
> +#ifdef CONFIG_CONTAINERS
> +	int (*container_alloc)(struct container *container, unsigned int flags);
> +	void (*container_free)(struct container *container);
> +#endif
>  };
>  
>  struct security_hook_heads {
> @@ -2069,6 +2085,10 @@ struct security_hook_heads {
>  	struct hlist_head bpf_prog_alloc_security;
>  	struct hlist_head bpf_prog_free_security;
>  #endif /* CONFIG_BPF_SYSCALL */
> +#ifdef CONFIG_CONTAINERS
> +	struct hlist_head container_alloc;
> +	struct hlist_head container_free;
> +#endif /* CONFIG_CONTAINERS */
>  } __randomize_layout;
>  
>  /*
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d2f90fa92468..073a3a930514 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -36,6 +36,7 @@ struct backing_dev_info;
>  struct bio_list;
>  struct blk_plug;
>  struct cfs_rq;
> +struct container;
>  struct fs_struct;
>  struct futex_pi_state;
>  struct io_context;
> @@ -870,6 +871,8 @@ struct task_struct {
>  
>  	/* Namespaces: */
>  	struct nsproxy			*nsproxy;
> +	struct container		*container;
> +	struct list_head		container_link;
>  
>  	/* Signal handlers: */
>  	struct signal_struct		*signal;
> diff --git a/include/linux/security.h b/include/linux/security.h
> index da538c06766f..acd0c14c6e95 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -70,6 +70,7 @@ struct ctl_table;
>  struct audit_krule;
>  struct user_namespace;
>  struct timezone;
> +struct container;
>  
>  enum lsm_event {
>  	LSM_POLICY_CHANGE,
> @@ -1751,6 +1752,20 @@ static inline void security_audit_rule_free(void *lsmrule)
>  #endif /* CONFIG_SECURITY */
>  #endif /* CONFIG_AUDIT */
>  
> +#ifdef CONFIG_CONTAINERS
> +#ifdef CONFIG_SECURITY
> +int security_container_alloc(struct container *container, unsigned int flags);
> +void security_container_free(struct container *container);
> +#else
> +static inline int security_container_alloc(struct container *container,
> +					   unsigned int flags)
> +{
> +	return 0;
> +}
> +static inline void security_container_free(struct container *container) {}
> +#endif
> +#endif /* CONFIG_CONTAINERS */
> +
>  #ifdef CONFIG_SECURITYFS
>  
>  extern struct dentry *securityfs_create_file(const char *name, umode_t mode,
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 10127b1d923b..dac42098c2dd 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -943,6 +943,9 @@ asmlinkage long sys_mount_notify(int dfd, const char __user *path,
>  				 unsigned int at_flags, int watch_fd, int watch_id);
>  asmlinkage long sys_sb_notify(int dfd, const char __user *path,
>  			      unsigned int at_flags, int watch_fd, int watch_id);
> +asmlinkage long sys_container_create(const char __user *name, unsigned int flags,
> +				     unsigned long spare3, unsigned long spare4,
> +				     unsigned long spare5);
>  
>  /*
>   * Architecture-specific system calls
> diff --git a/include/uapi/linux/container.h b/include/uapi/linux/container.h
> new file mode 100644
> index 000000000000..43748099b28d
> --- /dev/null
> +++ b/include/uapi/linux/container.h
> @@ -0,0 +1,28 @@
> +/* Container UAPI
> + *
> + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +#ifndef _UAPI_LINUX_CONTAINER_H
> +#define _UAPI_LINUX_CONTAINER_H
> +
> +
> +#define CONTAINER_NEW_FS_NS		0x00000001 /* Dup current fs namespace */
> +#define CONTAINER_NEW_EMPTY_FS_NS	0x00000002 /* Provide new empty fs namespace */
> +#define CONTAINER_NEW_CGROUP_NS		0x00000004 /* Dup current cgroup namespace */
> +#define CONTAINER_NEW_UTS_NS		0x00000008 /* Dup current uts namespace */
> +#define CONTAINER_NEW_IPC_NS		0x00000010 /* Dup current ipc namespace */
> +#define CONTAINER_NEW_USER_NS		0x00000020 /* Dup current user namespace */
> +#define CONTAINER_NEW_PID_NS		0x00000040 /* Dup current pid namespace */
> +#define CONTAINER_NEW_NET_NS		0x00000080 /* Dup current net namespace */
> +#define CONTAINER_KILL_ON_CLOSE		0x00000100 /* Kill all member processes when fd closed */
> +#define CONTAINER_FD_CLOEXEC		0x00000200 /* Close the fd on exec */
> +#define CONTAINER__FLAG_MASK		0x000003ff
> +
> +#endif /* _UAPI_LINUX_CONTAINER_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index 5984dd7f2156..ab37c3a55aa1 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -992,6 +992,13 @@ config NET_NS
>  	  Allow user space to create what appear to be multiple instances
>  	  of the network stack.
>  
> +config CONTAINERS
> +	bool "Container support"
> +	default y
> +	help
> +	  Allow userspace to create and manipulate containers as objects that
> +	  have namespaces and hold a set of processes.
> +
>  endif # NAMESPACES
>  
>  config CHECKPOINT_RESTORE
> diff --git a/init/init_task.c b/init/init_task.c
> index 5aebe3be4d7c..90c7439a195b 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -108,6 +108,9 @@ struct task_struct init_task
>  	.signal		= &init_signals,
>  	.sighand	= &init_sighand,
>  	.nsproxy	= &init_nsproxy,
> +	.container	= &init_container,
> +	.container_link.next = &init_container.members,
> +	.container_link.prev = &init_container.members,
>  	.pending	= {
>  		.list = LIST_HEAD_INIT(init_task.pending.list),
>  		.signal = {{0}}
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 6aa7543bcdb2..98cdd18cecef 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -8,7 +8,7 @@ obj-y     = fork.o exec_domain.o panic.o \
>  	    sysctl.o sysctl_binary.o capability.o ptrace.o user.o \
>  	    signal.o sys.o umh.o workqueue.o pid.o task_work.o \
>  	    extable.o params.o \
> -	    kthread.o sys_ni.o nsproxy.o \
> +	    kthread.o sys_ni.o nsproxy.o container.o \
>  	    notifier.o ksysfs.o cred.o reboot.o \
>  	    async.o range.o smpboot.o ucount.o
>  
> diff --git a/kernel/container.c b/kernel/container.c
> new file mode 100644
> index 000000000000..ca4012632cfa
> --- /dev/null
> +++ b/kernel/container.c
> @@ -0,0 +1,348 @@
> +/* Implement container objects.
> + *
> + * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +#include <linux/poll.h>
> +#include <linux/wait.h>
> +#include <linux/init_task.h>
> +#include <linux/fs.h>
> +#include <linux/fs_struct.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/container.h>
> +#include <linux/syscalls.h>
> +#include <linux/printk.h>
> +#include <linux/security.h>
> +#include "namespaces.h"
> +
> +struct container init_container = {
> +	.name		= ".init",
> +	.id		= 1,
> +	.usage		= REFCOUNT_INIT(2),
> +	.cred		= &init_cred,
> +	.ns		= &init_nsproxy,
> +	.init		= &init_task,
> +	.members.next	= &init_task.container_link,
> +	.members.prev	= &init_task.container_link,
> +	.children	= LIST_HEAD_INIT(init_container.children),
> +	.flags		= (1 << CONTAINER_FLAG_INIT_STARTED),
> +	.lock		= __SPIN_LOCK_UNLOCKED(init_container.lock),
> +	.seq		= SEQCNT_ZERO(init_fs.seq),
> +};
> +
> +#ifdef CONFIG_CONTAINERS
> +
> +static atomic64_t container_id_counter = ATOMIC_INIT(1);
> +
> +/*
> + * Drop a ref on a container and clear it if no longer in use.
> + */
> +void put_container(struct container *c)
> +{
> +	struct container *parent;
> +
> +	while (c && refcount_dec_and_test(&c->usage)) {
> +		BUG_ON(!list_empty(&c->members));
> +		if (c->ns)
> +			put_nsproxy(c->ns);
> +		path_put(&c->root);
> +
> +		parent = c->parent;
> +		if (parent) {
> +			spin_lock(&parent->lock);
> +			list_del(&c->child_link);
> +			spin_unlock(&parent->lock);
> +		}
> +
> +		if (c->cred)
> +			put_cred(c->cred);
> +		security_container_free(c);
> +		kfree(c);
> +		c = parent;
> +	}
> +}
> +
> +/*
> + * Allow the user to poll for the container dying.
> + */
> +static unsigned int container_poll(struct file *file, poll_table *wait)
> +{
> +	struct container *container = file->private_data;
> +	unsigned int mask = 0;
> +
> +	poll_wait(file, &container->waitq, wait);
> +
> +	if (test_bit(CONTAINER_FLAG_DEAD, &container->flags))
> +		mask |= POLLHUP;
> +
> +	return mask;
> +}
> +
> +static int container_release(struct inode *inode, struct file *file)
> +{
> +	struct container *container = file->private_data;
> +
> +	put_container(container);
> +	return 0;
> +}
> +
> +const struct file_operations container_fops = {
> +	.poll		= container_poll,
> +	.release	= container_release,
> +};
> +
> +/*
> + * Handle fork/clone.
> + *
> + * A process inherits its parent's container.  The first process into the
> + * container is its 'init' process and the life of everything else in there is
> + * dependent upon that.
> + */
> +int copy_container(unsigned long flags, struct task_struct *tsk,
> +		   struct container *container)
> +{
> +	struct container *c = container ?: tsk->container;
> +	int ret = -ECANCELED;
> +
> +	spin_lock(&c->lock);
> +
> +	if (!test_bit(CONTAINER_FLAG_DEAD, &c->flags)) {
> +		list_add_tail(&tsk->container_link, &c->members);
> +		get_container(c);
> +		tsk->container = c;
> +		if (!c->init) {
> +			set_bit(CONTAINER_FLAG_INIT_STARTED, &c->flags);
> +			c->init = tsk;
> +		}
> +		ret = 0;
> +	}
> +
> +	spin_unlock(&c->lock);
> +	return ret;
> +}
> +
> +/*
> + * Remove a dead process from a container.
> + *
> + * If the 'init' process in a container dies, we kill off all the other
> + * processes in the container.
> + */
> +void exit_container(struct task_struct *tsk)
> +{
> +	struct task_struct *p;
> +	struct container *c = tsk->container;
> +	struct kernel_siginfo si = {
> +		.si_signo = SIGKILL,
> +		.si_code  = SI_KERNEL,
> +	};
> +
> +	spin_lock(&c->lock);
> +
> +	list_del(&tsk->container_link);
> +
> +	if (c->init == tsk) {
> +		c->init = NULL;
> +		c->exit_code = tsk->exit_code;
> +		smp_wmb(); /* Order exit_code vs CONTAINER_DEAD. */
> +		set_bit(CONTAINER_FLAG_DEAD, &c->flags);
> +		wake_up_bit(&c->flags, CONTAINER_FLAG_DEAD);
> +
> +		list_for_each_entry(p, &c->members, container_link) {
> +			si.si_pid = task_tgid_vnr(p);
> +			send_sig_info(SIGKILL, &si, p);
> +		}
> +	}
> +
> +	spin_unlock(&c->lock);
> +	put_container(c);
> +}
> +
> +/*
> + * Allocate a container.
> + */
> +static struct container *alloc_container(const char __user *name)
> +{
> +	struct container *c;
> +	long len;
> +	int ret;
> +
> +	c = kzalloc(sizeof(struct container), GFP_KERNEL);
> +	if (!c)
> +		return ERR_PTR(-ENOMEM);
> +
> +	INIT_LIST_HEAD(&c->members);
> +	INIT_LIST_HEAD(&c->children);
> +	init_waitqueue_head(&c->waitq);
> +	spin_lock_init(&c->lock);
> +	refcount_set(&c->usage, 1);
> +
> +	ret = -EFAULT;
> +	len = strncpy_from_user(c->name, name, sizeof(c->name));
> +	if (len < 0)
> +		goto err;
> +	ret = -ENAMETOOLONG;
> +	if (len >= sizeof(c->name))
> +		goto err;
> +	ret = -EINVAL;
> +	if (strchr(c->name, '/'))
> +		goto err;
> +
> +	c->name[len] = 0;
> +	return c;
> +
> +err:
> +	kfree(c);
> +	return ERR_PTR(ret);
> +}
> +
> +/*
> + * Create some creds for the container.  We don't want to pin things we don't
> + * have to, so drop all keyrings from the new cred.  The LSM gets to audit the
> + * cred struct when security_container_alloc() is invoked.
> + */
> +static const struct cred *create_container_creds(unsigned int flags)
> +{
> +	struct cred *new;
> +	int ret;
> +
> +	new = prepare_creds();
> +	if (!new)
> +		return ERR_PTR(-ENOMEM);
> +
> +#ifdef CONFIG_KEYS
> +	key_put(new->thread_keyring);
> +	new->thread_keyring = NULL;
> +	key_put(new->process_keyring);
> +	new->process_keyring = NULL;
> +	key_put(new->session_keyring);
> +	new->session_keyring = NULL;
> +	key_put(new->request_key_auth);
> +	new->request_key_auth = NULL;
> +#endif
> +
> +	if (flags & CONTAINER_NEW_USER_NS) {
> +		ret = create_user_ns(new);
> +		if (ret < 0)
> +			goto err;
> +		new->euid = new->user_ns->owner;
> +		new->egid = new->user_ns->group;
> +	}
> +
> +	new->fsuid = new->suid = new->uid = new->euid;
> +	new->fsgid = new->sgid = new->gid = new->egid;
> +	return new;
> +
> +err:
> +	abort_creds(new);
> +	return ERR_PTR(ret);
> +}
> +
> +/*
> + * Create a new container.
> + */
> +static struct container *create_container(const char __user *name, unsigned int flags)
> +{
> +	struct container *parent, *c;
> +	struct fs_struct *fs;
> +	struct nsproxy *ns;
> +	const struct cred *cred;
> +	int ret;
> +
> +	c = alloc_container(name);
> +	if (IS_ERR(c))
> +		return c;
> +
> +	if (flags & CONTAINER_KILL_ON_CLOSE)
> +		__set_bit(CONTAINER_FLAG_KILL_ON_CLOSE, &c->flags);
> +
> +	cred = create_container_creds(flags);
> +	if (IS_ERR(cred)) {
> +		ret = PTR_ERR(cred);
> +		goto err_cont;
> +	}
> +	c->cred = cred;
> +
> +	ret = -ENOMEM;
> +	fs = copy_fs_struct(current->fs);
> +	if (!fs)
> +		goto err_cont;
> +
> +	ns = create_new_namespaces(
> +		(flags & CONTAINER_NEW_FS_NS	 ? CLONE_NEWNS : 0) |
> +		(flags & CONTAINER_NEW_CGROUP_NS ? CLONE_NEWCGROUP : 0) |
> +		(flags & CONTAINER_NEW_UTS_NS	 ? CLONE_NEWUTS : 0) |
> +		(flags & CONTAINER_NEW_IPC_NS	 ? CLONE_NEWIPC : 0) |
> +		(flags & CONTAINER_NEW_PID_NS	 ? CLONE_NEWPID : 0) |
> +		(flags & CONTAINER_NEW_NET_NS	 ? CLONE_NEWNET : 0),
> +		current->nsproxy, cred->user_ns, fs);
> +	if (IS_ERR(ns)) {
> +		ret = PTR_ERR(ns);
> +		goto err_fs;
> +	}
> +
> +	c->ns = ns;
> +	c->root = fs->root;
> +	c->seq = fs->seq;
> +	fs->root.mnt = NULL;
> +	fs->root.dentry = NULL;
> +
> +	ret = security_container_alloc(c, flags);
> +	if (ret < 0)
> +		goto err_fs;
> +
> +	parent = current->container;
> +	get_container(parent);
> +	c->parent = parent;
> +	c->id = atomic64_inc_return(&container_id_counter);
> +	spin_lock(&parent->lock);
> +	list_add_tail(&c->child_link, &parent->children);
> +	spin_unlock(&parent->lock);
> +	return c;
> +
> +err_fs:
> +	free_fs_struct(fs);
> +err_cont:
> +	put_container(c);
> +	return ERR_PTR(ret);
> +}
> +
> +/*
> + * Create a new container object.
> + */
> +SYSCALL_DEFINE5(container_create,
> +		const char __user *, name,
> +		unsigned int, flags,
> +		unsigned long, spare3,
> +		unsigned long, spare4,
> +		unsigned long, spare5)
> +{
> +	struct container *c;
> +	int fd;
> +
> +	if (!name ||
> +	    flags & ~CONTAINER__FLAG_MASK ||
> +	    spare3 != 0 || spare4 != 0 || spare5 != 0)
> +		return -EINVAL;
> +	if ((flags & (CONTAINER_NEW_FS_NS | CONTAINER_NEW_EMPTY_FS_NS)) ==
> +	    (CONTAINER_NEW_FS_NS | CONTAINER_NEW_EMPTY_FS_NS))
> +		return -EINVAL;
> +
> +	c = create_container(name, flags);
> +	if (IS_ERR(c))
> +		return PTR_ERR(c);
> +
> +	fd = anon_inode_getfd("container", &container_fops, c,
> +			      O_RDWR | (flags & CONTAINER_FD_CLOEXEC ? O_CLOEXEC : 0));
> +	if (fd < 0)
> +		put_container(c);
> +	return fd;
> +}
> +
> +#endif /* CONFIG_CONTAINERS */
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 284f2fe9a293..78f6065ad799 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -864,6 +864,7 @@ void __noreturn do_exit(long code)
>  	if (group_dead)
>  		disassociate_ctty(1);
>  	exit_task_namespaces(tsk);
> +	exit_container(tsk);
>  	exit_task_work(tsk);
>  	exit_thread(tsk);
>  	exit_umh(tsk);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b69248e6f0e0..009cf7e63894 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1920,9 +1920,12 @@ static __latent_entropy struct task_struct *copy_process(
>  	retval = copy_namespaces(clone_flags, p);
>  	if (retval)
>  		goto bad_fork_cleanup_mm;
> -	retval = copy_io(clone_flags, p);
> +	retval = copy_container(clone_flags, p, NULL);
>  	if (retval)
>  		goto bad_fork_cleanup_namespaces;
> +	retval = copy_io(clone_flags, p);
> +	if (retval)
> +		goto bad_fork_cleanup_container;
>  	retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
>  	if (retval)
>  		goto bad_fork_cleanup_io;
> @@ -2121,6 +2124,8 @@ static __latent_entropy struct task_struct *copy_process(
>  bad_fork_cleanup_io:
>  	if (p->io_context)
>  		exit_io_context(p);
> +bad_fork_cleanup_container:
> +	exit_container(p);
>  bad_fork_cleanup_namespaces:
>  	exit_task_namespaces(p);
>  bad_fork_cleanup_mm:
> diff --git a/kernel/namespaces.h b/kernel/namespaces.h
> new file mode 100644
> index 000000000000..c44e3cf0e254
> --- /dev/null
> +++ b/kernel/namespaces.h
> @@ -0,0 +1,15 @@
> +/* Local namespaces defs
> + *
> + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +extern struct nsproxy *create_new_namespaces(unsigned long flags,
> +					     struct nsproxy *nsproxy,
> +					     struct user_namespace *user_ns,
> +					     struct fs_struct *new_fs);
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index f6c5d330059a..4bb5184b3a80 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -27,6 +27,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/cgroup.h>
>  #include <linux/perf_event.h>
> +#include "namespaces.h"
>  
>  static struct kmem_cache *nsproxy_cachep;
>  
> @@ -61,8 +62,8 @@ static inline struct nsproxy *create_nsproxy(void)
>   * Return the newly created nsproxy.  Do not attach this to the task,
>   * leave it to the caller to do proper locking and attach it to task.
>   */
> -static struct nsproxy *create_new_namespaces(unsigned long flags,
> -	struct task_struct *tsk, struct user_namespace *user_ns,
> +struct nsproxy *create_new_namespaces(unsigned long flags,
> +	struct nsproxy *nsproxy, struct user_namespace *user_ns,
>  	struct fs_struct *new_fs)
>  {
>  	struct nsproxy *new_nsp;
> @@ -72,39 +73,39 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
>  	if (!new_nsp)
>  		return ERR_PTR(-ENOMEM);
>  
> -	new_nsp->mnt_ns = copy_mnt_ns(flags, tsk->nsproxy->mnt_ns, user_ns, new_fs);
> +	new_nsp->mnt_ns = copy_mnt_ns(flags, nsproxy->mnt_ns, user_ns, new_fs);
>  	if (IS_ERR(new_nsp->mnt_ns)) {
>  		err = PTR_ERR(new_nsp->mnt_ns);
>  		goto out_ns;
>  	}
>  
> -	new_nsp->uts_ns = copy_utsname(flags, user_ns, tsk->nsproxy->uts_ns);
> +	new_nsp->uts_ns = copy_utsname(flags, user_ns, nsproxy->uts_ns);
>  	if (IS_ERR(new_nsp->uts_ns)) {
>  		err = PTR_ERR(new_nsp->uts_ns);
>  		goto out_uts;
>  	}
>  
> -	new_nsp->ipc_ns = copy_ipcs(flags, user_ns, tsk->nsproxy->ipc_ns);
> +	new_nsp->ipc_ns = copy_ipcs(flags, user_ns, nsproxy->ipc_ns);
>  	if (IS_ERR(new_nsp->ipc_ns)) {
>  		err = PTR_ERR(new_nsp->ipc_ns);
>  		goto out_ipc;
>  	}
>  
>  	new_nsp->pid_ns_for_children =
> -		copy_pid_ns(flags, user_ns, tsk->nsproxy->pid_ns_for_children);
> +		copy_pid_ns(flags, user_ns, nsproxy->pid_ns_for_children);
>  	if (IS_ERR(new_nsp->pid_ns_for_children)) {
>  		err = PTR_ERR(new_nsp->pid_ns_for_children);
>  		goto out_pid;
>  	}
>  
>  	new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
> -					    tsk->nsproxy->cgroup_ns);
> +					    nsproxy->cgroup_ns);
>  	if (IS_ERR(new_nsp->cgroup_ns)) {
>  		err = PTR_ERR(new_nsp->cgroup_ns);
>  		goto out_cgroup;
>  	}
>  
> -	new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
> +	new_nsp->net_ns = copy_net_ns(flags, user_ns, nsproxy->net_ns);
>  	if (IS_ERR(new_nsp->net_ns)) {
>  		err = PTR_ERR(new_nsp->net_ns);
>  		goto out_net;
> @@ -162,7 +163,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
>  		(CLONE_NEWIPC | CLONE_SYSVSEM)) 
>  		return -EINVAL;
>  
> -	new_ns = create_new_namespaces(flags, tsk, user_ns, tsk->fs);
> +	new_ns = create_new_namespaces(flags, tsk->nsproxy, user_ns, tsk->fs);
>  	if (IS_ERR(new_ns))
>  		return  PTR_ERR(new_ns);
>  
> @@ -203,7 +204,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
>  	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
>  		return -EPERM;
>  
> -	*new_nsp = create_new_namespaces(unshare_flags, current, user_ns,
> +	*new_nsp = create_new_namespaces(unshare_flags, current->nsproxy, user_ns,
>  					 new_fs ? new_fs : current->fs);
>  	if (IS_ERR(*new_nsp)) {
>  		err = PTR_ERR(*new_nsp);
> @@ -251,7 +252,7 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype)
>  	if (nstype && (ns->ops->type != nstype))
>  		goto out;
>  
> -	new_nsproxy = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
> +	new_nsproxy = create_new_namespaces(0, tsk->nsproxy, current_user_ns(), tsk->fs);
>  	if (IS_ERR(new_nsproxy)) {
>  		err = PTR_ERR(new_nsproxy);
>  		goto out;
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index a4e7131b2509..f0455cbb91cf 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -136,6 +136,9 @@ COND_SYSCALL(acct);
>  COND_SYSCALL(capget);
>  COND_SYSCALL(capset);
>  
> +/* kernel/container.c */
> +COND_SYSCALL(container_create);
> +
>  /* kernel/exec_domain.c */
>  
>  /* kernel/exit.c */
> diff --git a/security/security.c b/security/security.c
> index b49732c02e21..259be9a1746c 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -1864,3 +1864,15 @@ void security_bpf_prog_free(struct bpf_prog_aux *aux)
>  	call_void_hook(bpf_prog_free_security, aux);
>  }
>  #endif /* CONFIG_BPF_SYSCALL */
> +
> +#ifdef CONFIG_CONTAINERS
> +int security_container_alloc(struct container *container, unsigned int flags)
> +{
> +	return call_int_hook(container_alloc, 0, container, flags);
> +}
> +
> +void security_container_free(struct container *container)
> +{
> +	call_void_hook(container_free, container);
> +}
> +#endif /* CONFIG_CONTAINERS */
> 
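
For readers following the flag handling in the quoted create_container()
and container_create(): the two -EINVAL checks and the CONTAINER_NEW_* to
CLONE_NEW* translation can be modelled in userspace roughly as below.  This
is only a sketch.  The CONTAINER_* bit values and the mask are placeholders
(the real definitions are in include/uapi/linux/container.h, which is not
quoted here), container_flags_to_clone() is a made-up helper name, and only
the flags that the quoted create_new_namespaces() call translates are
covered.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stddef.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000	/* fallback for older libc headers */
#endif

/* Placeholder bit values, NOT the ones from the patch's uapi header. */
#define CONTAINER_NEW_FS_NS		0x01u
#define CONTAINER_NEW_EMPTY_FS_NS	0x02u
#define CONTAINER_NEW_CGROUP_NS		0x04u
#define CONTAINER_NEW_UTS_NS		0x08u
#define CONTAINER_NEW_IPC_NS		0x10u
#define CONTAINER_NEW_PID_NS		0x20u
#define CONTAINER_NEW_NET_NS		0x40u
#define CONTAINER__FLAG_MASK		0x7fu	/* placeholder mask */

/*
 * Hypothetical helper: validate container flags the way the quoted syscall
 * does, then compute the CLONE_NEW* mask passed to create_new_namespaces().
 * Returns 0 on success or -EINVAL on a rejected flag combination.
 */
static int container_flags_to_clone(unsigned int flags,
				    unsigned long *clone_flags)
{
	const unsigned int both_fs =
		CONTAINER_NEW_FS_NS | CONTAINER_NEW_EMPTY_FS_NS;

	assert(clone_flags != NULL);

	/* Unknown bits are rejected. */
	if (flags & ~CONTAINER__FLAG_MASK)
		return -EINVAL;
	/* A copied and an empty mount namespace are mutually exclusive. */
	if ((flags & both_fs) == both_fs)
		return -EINVAL;

	*clone_flags =
		(flags & CONTAINER_NEW_FS_NS	 ? CLONE_NEWNS : 0) |
		(flags & CONTAINER_NEW_CGROUP_NS ? CLONE_NEWCGROUP : 0) |
		(flags & CONTAINER_NEW_UTS_NS	 ? CLONE_NEWUTS : 0) |
		(flags & CONTAINER_NEW_IPC_NS	 ? CLONE_NEWIPC : 0) |
		(flags & CONTAINER_NEW_PID_NS	 ? CLONE_NEWPID : 0) |
		(flags & CONTAINER_NEW_NET_NS	 ? CLONE_NEWNET : 0);
	return 0;
}
```

Note that the quoted patch handles CONTAINER_NEW_USER_NS separately, in
create_container_creds(), which is why it does not appear in the mask here.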



* Re: [RFC PATCH 00/27] Containers and using authenticated filesystems
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (27 preceding siblings ...)
  2019-02-15 22:36 ` [RFC PATCH 00/27] Containers and using authenticated filesystems James Morris
@ 2019-02-19 16:35 ` Eric W. Biederman
  2019-02-20 14:18   ` Christian Brauner
  2019-02-19 23:42 ` David Howells
  29 siblings, 1 reply; 61+ messages in thread
From: Eric W. Biederman @ 2019-02-19 16:35 UTC (permalink / raw)
  To: David Howells
  Cc: keyrings, trond.myklebust, sfrench, linux-security-module,
	linux-nfs, linux-cifs, linux-fsdevel, rgb, linux-kernel,
	Linux Containers, linux-api


So you missed the main mailing lists for discussion of this kind of
thing, and the maintainer.  So I have reservations about the quality of
your due diligence already.

Looking at your description you are introducing a container id.
You don't describe which namespace your container id lives in.
Without the container id living in a container this breaks
nested containers and process migration aka CRIU.

So, based on your description:

Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
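
One way to read the objection above, as a toy userspace model: the patch
assigns c->id from a single global counter, so a restore tool cannot ask
for a particular value, whereas an ID that is scoped to a namespace can be
re-created with its original value in a fresh namespace on the destination.
This is an interpretation for illustration only, not something the message
spells out.  struct id_ns, MAX_IDS and both helpers are hypothetical names
and do not correspond to any kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_IDS 8	/* hypothetical table size, illustration only */

/* One ID table per (hypothetical) namespace; used[i] marks ID i + 1. */
struct id_ns {
	bool used[MAX_IDS];
};

/* System-wide counter, analogous to the patch's container_id_counter. */
static unsigned long long global_next_id;

/* Global allocation: the caller gets the next value and cannot pick one. */
static unsigned long long alloc_global_id(void)
{
	return ++global_next_id;
}

/*
 * Namespaced allocation: claim a specific ID inside one namespace, which is
 * what a checkpoint/restore tool needs.  Returns false if the ID is out of
 * range or already taken in that namespace.
 */
static bool restore_id_in_ns(struct id_ns *ns, unsigned int id)
{
	assert(ns != NULL);
	if (id == 0 || id > MAX_IDS || ns->used[id - 1])
		return false;
	ns->used[id - 1] = true;
	return true;
}
```

In this model the same ID can exist once per namespace, so a nested or
migrated container keeps the IDs it saw before, while the global counter
only ever hands out the next unused number.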



David Howells <dhowells@redhat.com> writes:

> Here's a collection of patches that containerises the kernel keys and makes
> it possible to separate keys by namespace.  This can be extended to any
> filesystem that uses request_key() to obtain the pertinent authentication
> token on entry to VFS or socket methods.
>
> I have this working with AFS and AF_RXRPC so far, but it could be extended
> to other filesystems, such as NFS and CIFS.
>
> The following changes are made:
>
>  (1) Add optional namespace tags to a key's index_key.  This allows the
>      following:
>
>      (a) Automatic invalidation of all keys with that tag when the
>      	 namespace is removed.
>
>      (b) Mixing of keys with the same description, but different areas of
>      	 operation within a keyring.
>
>      (c) Sharing of cache keyrings, such as the DNS lookup cache.
>
>      (d) Diversion of upcalls based on namespace criteria.
>
>  (2) Provide each network namespace with a tag that can be used with (1).
>      This is used by the DNS query, rxrpc, nfs idmapper keys.
>
>      [!] Note that it might still be better to move these keyrings into the
>      	 network namespace.
>
>  (3) Provide key ACLs.  These allow:
>
>      (a) The permissions can be split more finely, in particular separating
>      	 out Invalidate and Join.
>
>      (b) Permits to be granted to non-standard subjects.  So, for instance,
>      	 Search permission could be granted to a container object, allowing
>      	 a search of the container keyring by a denizen of the container to
>      	 find a key that they can't otherwise see.
>
>  (4) Provide a kernel container object.  Currently, this is created with a
>      system call and passed flags that indicate the namespaces to be
>      inherited or replaced.  It might be better to actually use something
>      like fsconfig() to configure the container by setting key=val type
>      options.
>
>      The kernel container object provides the following facilities:
>
>      (a) request_key upcall interception.  The manager of a container can
>      	 intercept requests made inside the container and, using a series
>      	 of filters, can cause the authkeys to be placed into keyrings that
>      	 serve as queues for one or more upcall processing programs.  These
>      	 upcall programs use key notifications to monitor those keyrings.
>
>      (b) Per-container keyring.  A keyring can be attached to the container
>      	 such that this is searched by a request_key() performed by a
>      	 denizen of the container after searching the thread, process and
>      	 session keyrings.  The keyring and the keys contained therein must
>      	 be granted Search for that container.
>
> 	 This allows:
>
>  	 (i) Authenticated filesystems to be used transparently inside of
> 	     the container without any cooperation from the occupant
> 	     thereof.  All the key maintenance can be done by the manager.
>
>          (ii) Keys to be made available to the denizens of a container (by
>              granting extra permissions to the container subject).
>
>      (c) Per-container ID that can be used in audit messages.
>
>      (d) Container object creation gives the manager a file descriptor that
>      	 can:
>
> 	 (i) Be passed to a dirfd parameter to a VFS syscall, such as
>      	     mkdirat(), allowing an operation to be done inside the
>      	     container.
>
>          (ii) Be passed to fsopen()/fsconfig() to indicate that the target
>              filesystem is going to be created inside a container, in that
>              container's namespaces.
>
>          (iii) Be passed to the move_mount() syscall as a destination for
>              setting the root filesystem inside a new mount namespace made
>              upon container creation.
>
>      (e) The ability to configure the container with namespaces or
>      	 whatever, and then fork a process into that container to 'boot'
>      	 it.
>
>
> Three sample programs are provided:
>
>  (1) test-container.  This:
>
> 	- Creates a kernel container with a blank mount ns.
> 	- Creates its root mount and moves it to the container root.
> 	- Mounts /proc therein.
> 	- Creates a keyring called "_container"
> 	  - Sets that as the container keyring.
> 	  - Grants Search permission to the container on that keyring.
> 	  - Removes owner permission on that keyring.
> 	- Creates a sample user key "foobar" in the container keyring.
> 	  - Grants various permissions to the container on that key.
> 	- Creates a keyring called "upcall"
> 	  - Intercepts "user" key upcalls from the container to there.
> 	- Forks a process into the container
> 	  - Prints the container keyring ID if it can
> 	  - Exec's bash.
>
>      This program expects to be given the device name for a partition it
>      can mount as the root and expects it to contain things like /etc,
>      /bin, /sbin, /lib, /usr containing programs that can be run and /proc
>      to mount procfs upon.  E.g.:
>
> 	./test-container /dev/sda3
>
>  (2) test-upcall.  This is a service program that monitors the "upcall"
>      keyring created by test-container for authkeys appearing, which it
>      then hands off to /sbin/request-key.  This:
>
> 	- Opens /dev/watch_queue.
> 	  - Sets the size to 1 page.
> 	  - Sets a filter to watch for "Link creation" key events.
> 	  - Sets a watch on the upcall keyring.
> 	- Polls the watch queue for events
> 	- When an event comes in:
> 	  - Gets the authkey ID from the event buffer.
> 	  - Queries the authkey.
> 	  - Forks off a handler which:
> 	    - Moves the authkey to its thread keyring
> 	    - Sets up a new session keyring with the authkey in it.
> 	    - Execs /sbin/request-key.
>
>      This can be run in a shell that shares the session keyring with
>      test-container, from which it will find the upcall keyring.
>      Alternatively, the keyring ID can be provided on the command line:
>
> 	./test-upcall [<upcall-keyring>]
>
>      It can be triggered from inside of the container with something like:
>
> 	keyctl request2 user debug:e a @s
>
>      and something like:
>
> 	ptrs h=4 t=2 m=2000003
> 	NOTIFY[00000004-00000002] ty=0003 sy=0002 i=01000010
> 	KEY 78543393 change=2 aux=141053003
> 	Authentication key 141053003
> 	- create 779280685
> 	- uid=0 gid=0
> 	- rings=0,0,798528519
> 	- callout='a'
> 	RQDebug keyid: 779280685
> 	RQDebug desc: debug:e
> 	RQDebug callout: a
> 	RQDebug session keyring: 798528519
>
>      will appear on stdout/stderr from it and /sbin/request-key.
>
>  (3) test-cont-grant.  This is a program to make the nominated key
>      available to a container's denizens.  It:
>
> 	- Grants search permission to the nominated key.
> 	- Links the nominated key into the container keyring.
>
>      It can be run from outside of the container like so:
>
> 	./test-cont-grant <key> [<container-keyring>]
>
>      If the keyring isn't given, it will look for one called "_container"
>      in the session keyring where test-container is expected to have placed
>      it.
>
>      With kAFS, it can be used as follows:
>
> 	kinit dhowells@REDHAT.COM
> 	kafs-aklog redhat.com
>
>      which would log into kerberos and then get a key for accessing an AFS
>      cell called "redhat.com".  This can be seen in the session keyring by
>      calling "keyctl show":
>
> 	 120378984 --alswrv      0     0  keyring: _ses
> 	 474754113 ---lswrv      0 65534   \_ keyring: _uid.0
> 	  64049961 --alswrv      0     0   \_ rxrpc: afs@redhat.com
> 	  78543393 --alswrv      0     0   \_ keyring: upcall
> 	 661655334 --alswrv      0     0   \_ keyring: _container
> 	 639103010 --alswrv      0     0       \_ user: foobar
>
>      Then doing:
>
> 	./test-cont-grant 64049961
>
>      will result in:
>
> 	 120378984 --alswrv      0     0  keyring: _ses
> 	 474754113 ---lswrv      0 65534   \_ keyring: _uid.0
> 	  64049961 --alswrv      0     0   \_ rxrpc: afs@redhat.com
> 	  78543393 --alswrv      0     0   \_ keyring: upcall
> 	 661655334 --alswrv      0     0   \_ keyring: _container
> 	 639103010 --alswrv      0     0       \_ user: foobar
> 	  64049961 --alswrv      0     0       \_ rxrpc: afs@redhat.com
>
>      Inside the container, the cell could be mounted:
>
> 	mount -t afs "%redhat.com:root.cell" /mnt
>
>      and then operations in /mnt will be done using the token that has been
>      made available.  However, this can be overridden locally inside the
>      container by doing kinit and kafs-aklog there with a different user.
>
>      More to the point, the container manager could mount the container's
>      rootfs, say, over authenticated AFS and then attach the token to the
>      container and mount the rootfs into the container and the container's
>      inhabitant need not have any means to gain a kerberos login.
>
>      [?] I do wonder if the possibility to use container key searches for
>      	 direct mounts should be controlled by a mount option, say:
>
> 		fsconfig(fsfd, FSCONFIG_SET_CONTAINER, NULL, NULL, cfd);
>
>          where you have to have the container handle available.
>
>      [!] Note that test-cont-grant picks the container by name and does not
>      	 require the container handle when setting the key ACL - but the
>      	 name must come from the set of children of the current container.
>
>
> The patches can be found here also:
>
> 	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=container
>
> Note that this is dependent on the mount-api-viro, fsinfo, notifications
> and keys-namespace branches.
>
> David
> ---
> David Howells (27):
>       containers: Rename linux/container.h to linux/container_dev.h
>       containers: Implement containers as kernel objects
>       containers: Provide /proc/containers
>       containers: Allow a process to be forked into a container
>       containers: Open a socket inside a container
>       containers, vfs: Allow syscall dirfd arguments to take a container fd
>       containers: Make fsopen() able to create a superblock in a container
>       containers, vfs: Honour CONTAINER_NEW_EMPTY_FS_NS
>       vfs: Allow mounting to other namespaces
>       containers: Provide fs_context op for container setting
>       containers: Sample program for driving container objects
>       containers: Allow a daemon to intercept request_key upcalls in a container
>       keys: Provide a keyctl to query a request_key authentication key
>       keys: Break bits out of key_unlink()
>       keys: Make __key_link_begin() handle lockdep nesting
>       keys: Grant Link permission to possessers of request_key auth keys
>       keys: Add a keyctl to move a key between keyrings
>       keys: Find the least-recently used unseen key in a keyring.
>       containers: Sample: request_key upcall handling
>       container, keys: Add a container keyring
>       keys: Fix request_key() lack of Link perm check on found key
>       KEYS: Replace uid/gid/perm permissions checking with an ACL
>       KEYS: Provide KEYCTL_GRANT_PERMISSION
>       keys: Allow a container to be specified as a subject in a key's ACL
>       keys: Provide a way to ask for the container keyring
>       keys: Allow containers to be included in key ACLs by name
>       containers: Sample to grant access to a key in a container
>
>
>  arch/x86/entry/syscalls/syscall_32.tbl             |    3 
>  arch/x86/entry/syscalls/syscall_64.tbl             |    3 
>  arch/x86/ia32/sys_ia32.c                           |    2 
>  certs/blacklist.c                                  |    7 
>  certs/system_keyring.c                             |   12 
>  drivers/acpi/container.c                           |    2 
>  drivers/base/container.c                           |    2 
>  drivers/md/dm-crypt.c                              |    2 
>  drivers/nvdimm/security.c                          |    2 
>  fs/afs/security.c                                  |    2 
>  fs/afs/super.c                                     |   18 +
>  fs/cifs/cifs_spnego.c                              |   25 +
>  fs/cifs/cifsacl.c                                  |   28 +
>  fs/cifs/connect.c                                  |    4 
>  fs/crypto/keyinfo.c                                |    2 
>  fs/ecryptfs/ecryptfs_kernel.h                      |    2 
>  fs/ecryptfs/keystore.c                             |    2 
>  fs/fs_context.c                                    |   39 +
>  fs/fscache/object-list.c                           |    2 
>  fs/fsopen.c                                        |   54 ++
>  fs/namei.c                                         |   45 +-
>  fs/namespace.c                                     |  129 ++++-
>  fs/nfs/nfs4idmap.c                                 |   29 +
>  fs/proc/root.c                                     |   20 +
>  fs/ubifs/auth.c                                    |    2 
>  include/linux/container.h                          |  100 +++-
>  include/linux/container_dev.h                      |   25 +
>  include/linux/cred.h                               |    3 
>  include/linux/fs_context.h                         |    5 
>  include/linux/init_task.h                          |    1 
>  include/linux/key-type.h                           |    2 
>  include/linux/key.h                                |  122 +++--
>  include/linux/lsm_hooks.h                          |   20 +
>  include/linux/nsproxy.h                            |    7 
>  include/linux/pid.h                                |    5 
>  include/linux/proc_ns.h                            |    6 
>  include/linux/sched.h                              |    3 
>  include/linux/sched/task.h                         |    3 
>  include/linux/security.h                           |   15 +
>  include/linux/socket.h                             |    3 
>  include/linux/syscalls.h                           |    6 
>  include/uapi/linux/container.h                     |   28 +
>  include/uapi/linux/keyctl.h                        |   85 +++
>  include/uapi/linux/mount.h                         |    4 
>  init/Kconfig                                       |    7 
>  init/init_task.c                                   |    3 
>  ipc/mqueue.c                                       |   10 
>  kernel/Makefile                                    |    2 
>  kernel/container.c                                 |  532 ++++++++++++++++++++
>  kernel/cred.c                                      |   45 ++
>  kernel/exit.c                                      |    1 
>  kernel/fork.c                                      |  111 ++++
>  kernel/namespaces.h                                |   15 +
>  kernel/nsproxy.c                                   |   32 +
>  kernel/pid.c                                       |    4 
>  kernel/sys_ni.c                                    |    5 
>  lib/digsig.c                                       |    2 
>  net/ceph/ceph_common.c                             |    2 
>  net/compat.c                                       |    2 
>  net/dns_resolver/dns_key.c                         |   12 
>  net/dns_resolver/dns_query.c                       |   15 -
>  net/rxrpc/key.c                                    |   16 -
>  net/socket.c                                       |   34 +
>  samples/vfs/Makefile                               |   12 
>  samples/vfs/test-cont-grant.c                      |   84 +++
>  samples/vfs/test-container.c                       |  382 ++++++++++++++
>  samples/vfs/test-upcall.c                          |  243 +++++++++
>  security/integrity/digsig.c                        |   31 -
>  security/integrity/digsig_asymmetric.c             |    2 
>  security/integrity/evm/evm_crypto.c                |    2 
>  security/integrity/ima/ima_mok.c                   |   13 
>  security/integrity/integrity.h                     |    4 
>  .../integrity/platform_certs/platform_keyring.c    |   13 
>  security/keys/Makefile                             |    2 
>  security/keys/compat.c                             |   20 +
>  security/keys/container.c                          |  419 ++++++++++++++++
>  security/keys/encrypted-keys/encrypted.c           |    2 
>  security/keys/encrypted-keys/masterkey_trusted.c   |    2 
>  security/keys/gc.c                                 |    2 
>  security/keys/internal.h                           |   34 +
>  security/keys/key.c                                |   35 -
>  security/keys/keyctl.c                             |  176 +++++--
>  security/keys/keyring.c                            |  198 ++++++-
>  security/keys/permission.c                         |  446 +++++++++++++++--
>  security/keys/persistent.c                         |   27 +
>  security/keys/proc.c                               |   17 -
>  security/keys/process_keys.c                       |  102 +++-
>  security/keys/request_key.c                        |   70 ++-
>  security/keys/request_key_auth.c                   |   21 +
>  security/security.c                                |   12 
>  security/selinux/hooks.c                           |   16 +
>  security/smack/smack_lsm.c                         |    3 
>  92 files changed, 3696 insertions(+), 425 deletions(-)
>  create mode 100644 include/linux/container_dev.h
>  create mode 100644 include/uapi/linux/container.h
>  create mode 100644 kernel/container.c
>  create mode 100644 kernel/namespaces.h
>  create mode 100644 samples/vfs/test-cont-grant.c
>  create mode 100644 samples/vfs/test-container.c
>  create mode 100644 samples/vfs/test-upcall.c
>  create mode 100644 security/keys/container.c


* Re: [RFC PATCH 04/27] containers: Allow a process to be forked into a container
  2019-02-15 16:07 ` [RFC PATCH 04/27] containers: Allow a process to be forked into a container David Howells
  2019-02-15 17:39   ` Stephen Smalley
@ 2019-02-19 16:39   ` Eric W. Biederman
  2019-02-19 23:16   ` David Howells
  2 siblings, 0 replies; 61+ messages in thread
From: Eric W. Biederman @ 2019-02-19 16:39 UTC (permalink / raw)
  To: David Howells
  Cc: keyrings, trond.myklebust, sfrench, linux-security-module,
	linux-nfs, linux-cifs, linux-fsdevel, rgb, linux-kernel

David Howells <dhowells@redhat.com> writes:

> Allow a single process to be forked directly into a container using a new
> syscall, thereby 'booting' the container:
>
> 	pid_t pid = fork_into_container(int container_fd);
>
> This process will be the 'init' process of the container.
>
> Further attempts to fork into the container will be rejected.

So you are breaking nsenter and its ilk.

There are no technical reasons to disallow this, and many good practical
reasons to allow this.

Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
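
For context on what "nsenter and its ilk" do today: such tools open a
running process's /proc/<pid>/ns/* files and call setns(2) on each before
executing a command, so new processes routinely join a container long after
its init has started.  A minimal sketch of that one step follows.
join_namespace() is a made-up helper name, and the error handling is
deliberately simplified.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Join the namespace referred to by ns_path, e.g. "/proc/1234/ns/net".
 * nstype is a CLONE_NEW* constant, or 0 to accept any namespace type.
 * Returns 0 on success, or -1 with errno set by open(2) or setns(2).
 */
static int join_namespace(const char *ns_path, int nstype)
{
	int fd, ret, saved_errno;

	assert(ns_path != NULL);

	fd = open(ns_path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;

	ret = setns(fd, nstype);
	saved_errno = errno;	/* close() may overwrite errno */
	close(fd);
	errno = saved_errno;
	return ret;
}
```

A caller would typically invoke this once per namespace type, for example
join_namespace("/proc/1234/ns/net", CLONE_NEWNET), and then fork and exec
the target command.  Joining most namespace types requires CAP_SYS_ADMIN.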

> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
>
>  arch/x86/entry/syscalls/syscall_32.tbl |    1 
>  arch/x86/entry/syscalls/syscall_64.tbl |    1 
>  arch/x86/ia32/sys_ia32.c               |    2 -
>  include/linux/cred.h                   |    3 +
>  include/linux/nsproxy.h                |    7 ++
>  include/linux/sched/task.h             |    3 +
>  include/linux/syscalls.h               |    1 
>  kernel/cred.c                          |   45 +++++++++++++
>  kernel/fork.c                          |  110 ++++++++++++++++++++++++++------
>  kernel/nsproxy.c                       |   11 +++
>  kernel/sys_ni.c                        |    1 
>  11 files changed, 157 insertions(+), 28 deletions(-)
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 3564814a5d21..8666693510f9 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -408,3 +408,4 @@
>  394	i386	mount_notify		sys_mount_notify		__ia32_sys_mount_notify
>  395	i386	sb_notify		sys_sb_notify			__ia32_sys_sb_notify
>  396	i386	container_create	sys_container_create		__ia32_sys_container_create
> +397	i386	fork_into_container	sys_fork_into_container		__ia32_sys_fork_into_container
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index aa6cccbe5271..d40d4790fcb2 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -353,6 +353,7 @@
>  342	common	mount_notify		__x64_sys_mount_notify
>  343	common	sb_notify		__x64_sys_sb_notify
>  344	common	container_create	__x64_sys_container_create
> +345	common	fork_into_container	__x64_sys_fork_into_container
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/arch/x86/ia32/sys_ia32.c b/arch/x86/ia32/sys_ia32.c
> index a43212036257..080d9e21b697 100644
> --- a/arch/x86/ia32/sys_ia32.c
> +++ b/arch/x86/ia32/sys_ia32.c
> @@ -238,5 +238,5 @@ COMPAT_SYSCALL_DEFINE5(x86_clone, unsigned long, clone_flags,
>  		       unsigned long, tls_val, int __user *, child_tidptr)
>  {
>  	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr,
> -			tls_val);
> +			tls_val, NULL);
>  }
> diff --git a/include/linux/cred.h b/include/linux/cred.h
> index 4907c9df86b3..357e743d5d4a 100644
> --- a/include/linux/cred.h
> +++ b/include/linux/cred.h
> @@ -23,6 +23,7 @@
>  
>  struct cred;
>  struct inode;
> +struct container;
>  
>  /*
>   * COW Supplementary groups list
> @@ -155,7 +156,7 @@ struct cred {
>  
>  extern void __put_cred(struct cred *);
>  extern void exit_creds(struct task_struct *);
> -extern int copy_creds(struct task_struct *, unsigned long);
> +extern int copy_creds(struct task_struct *, unsigned long, struct container *);
>  extern const struct cred *get_task_cred(struct task_struct *);
>  extern struct cred *cred_alloc_blank(void);
>  extern struct cred *prepare_creds(void);
> diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
> index 2ae1b1a4d84d..81838ae24a92 100644
> --- a/include/linux/nsproxy.h
> +++ b/include/linux/nsproxy.h
> @@ -11,6 +11,7 @@ struct ipc_namespace;
>  struct pid_namespace;
>  struct cgroup_namespace;
>  struct fs_struct;
> +struct container;
>  
>  /*
>   * A structure to contain pointers to all per-process
> @@ -63,9 +64,13 @@ extern struct nsproxy init_nsproxy;
>   *         * /
>   *     task_unlock(task);
>   *
> + *  4. Container namespaces are set at container creation and cannot be
> + *     changed.
> + *
>   */
>  
> -int copy_namespaces(unsigned long flags, struct task_struct *tsk);
> +int copy_namespaces(unsigned long flags, struct task_struct *tsk,
> +		    struct container *dest_container);
>  void exit_task_namespaces(struct task_struct *tsk);
>  void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new);
>  void free_nsproxy(struct nsproxy *ns);
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 44c6f15800ff..bdff71b0fb66 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -73,7 +73,8 @@ extern void do_group_exit(int);
>  extern void exit_files(struct task_struct *);
>  extern void exit_itimers(struct signal_struct *);
>  
> -extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
> +extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *,
> +		     int __user *, unsigned long, struct container *);
>  extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
>  struct task_struct *fork_idle(int);
>  extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index dac42098c2dd..15e5cc704df3 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -946,6 +946,7 @@ asmlinkage long sys_sb_notify(int dfd, const char __user *path,
>  asmlinkage long sys_container_create(const char __user *name, unsigned int flags,
>  				     unsigned long spare3, unsigned long spare4,
>  				     unsigned long spare5);
> +asmlinkage long sys_fork_into_container(int containerfd);
>  
>  /*
>   * Architecture-specific system calls
> diff --git a/kernel/cred.c b/kernel/cred.c
> index 21f4a97085b4..f0ee5cec533d 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -313,6 +313,43 @@ struct cred *prepare_exec_creds(void)
>  	return new;
>  }
>  
> +/*
> + * Handle forking a process into a container.
> + */
> +static struct cred *copy_container_creds(struct container *dest_container)
> +{
> +	struct cred *new;
> +
> +	validate_process_creds();
> +
> +	new = kmem_cache_alloc(cred_jar, GFP_KERNEL);
> +	if (!new)
> +		return NULL;
> +
> +	kdebug("prepare_creds() alloc %p", new);
> +
> +	memcpy(new, dest_container->cred, sizeof(struct cred));
> +
> +	atomic_set(&new->usage, 1);
> +	set_cred_subscribers(new, 0);
> +	get_group_info(new->group_info);
> +	get_uid(new->user);
> +	get_user_ns(new->user_ns);
> +
> +#ifdef CONFIG_SECURITY
> +	new->security = NULL;
> +#endif
> +
> +	if (security_prepare_creds(new, dest_container->cred, GFP_KERNEL) < 0)
> +		goto error;
> +	validate_creds(new);
> +	return new;
> +
> +error:
> +	abort_creds(new);
> +	return NULL;
> +}
> +
>  /*
>   * Copy credentials for the new process created by fork()
>   *
> @@ -322,7 +359,8 @@ struct cred *prepare_exec_creds(void)
>   * The new process gets the current process's subjective credentials as its
>   * objective and subjective credentials
>   */
> -int copy_creds(struct task_struct *p, unsigned long clone_flags)
> +int copy_creds(struct task_struct *p, unsigned long clone_flags,
> +	       struct container *dest_container)
>  {
>  	struct cred *new;
>  	int ret;
> @@ -343,7 +381,10 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
>  		return 0;
>  	}
>  
> -	new = prepare_creds();
> +	if (dest_container)
> +		new = copy_container_creds(dest_container);
> +	else
> +		new = prepare_creds();
>  	if (!new)
>  		return -ENOMEM;
>  
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 009cf7e63894..71401deb4434 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1385,9 +1385,33 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
>  	return retval;
>  }
>  
> -static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
> +static int copy_fs(unsigned long clone_flags, struct task_struct *tsk,
> +		   struct container *dest_container)
>  {
>  	struct fs_struct *fs = current->fs;
> +
> +#ifdef CONFIG_CONTAINERS
> +	if (dest_container) {
> +		fs = kmem_cache_alloc(fs_cachep, GFP_KERNEL);
> +		if (!fs)
> +			return -ENOMEM;
> +
> +		fs->users = 1;
> +		fs->in_exec = 0;
> +		spin_lock_init(&fs->lock);
> +		seqcount_init(&fs->seq);
> +		fs->umask = 0022;
> +
> +		spin_lock(&dest_container->lock);
> +		fs->pwd = fs->root = dest_container->root;
> +		path_get(&fs->root);
> +		path_get(&fs->pwd);
> +		spin_unlock(&dest_container->lock);
> +		tsk->fs = fs;
> +		return 0;
> +	}
> +#endif
> +
>  	if (clone_flags & CLONE_FS) {
>  		/* tsk->fs is already what we want */
>  		spin_lock(&fs->lock);
> @@ -1679,7 +1703,8 @@ static __latent_entropy struct task_struct *copy_process(
>  					struct pid *pid,
>  					int trace,
>  					unsigned long tls,
> -					int node)
> +					int node,
> +					struct container *dest_container)
>  {
>  	int retval;
>  	struct task_struct *p;
> @@ -1783,7 +1808,7 @@ static __latent_entropy struct task_struct *copy_process(
>  	}
>  	current->flags &= ~PF_NPROC_EXCEEDED;
>  
> -	retval = copy_creds(p, clone_flags);
> +	retval = copy_creds(p, clone_flags, dest_container);
>  	if (retval < 0)
>  		goto bad_fork_free;
>  
> @@ -1905,7 +1930,7 @@ static __latent_entropy struct task_struct *copy_process(
>  	retval = copy_files(clone_flags, p);
>  	if (retval)
>  		goto bad_fork_cleanup_semundo;
> -	retval = copy_fs(clone_flags, p);
> +	retval = copy_fs(clone_flags, p, dest_container);
>  	if (retval)
>  		goto bad_fork_cleanup_files;
>  	retval = copy_sighand(clone_flags, p);
> @@ -1917,15 +1942,15 @@ static __latent_entropy struct task_struct *copy_process(
>  	retval = copy_mm(clone_flags, p);
>  	if (retval)
>  		goto bad_fork_cleanup_signal;
> -	retval = copy_namespaces(clone_flags, p);
> +	retval = copy_container(clone_flags, p, dest_container);
>  	if (retval)
>  		goto bad_fork_cleanup_mm;
> -	retval = copy_container(clone_flags, p, NULL);
> +	retval = copy_namespaces(clone_flags, p, dest_container);
>  	if (retval)
> -		goto bad_fork_cleanup_namespaces;
> +		goto bad_fork_cleanup_container;
>  	retval = copy_io(clone_flags, p);
>  	if (retval)
> -		goto bad_fork_cleanup_container;
> +		goto bad_fork_cleanup_namespaces;
>  	retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
>  	if (retval)
>  		goto bad_fork_cleanup_io;
> @@ -2124,10 +2149,10 @@ static __latent_entropy struct task_struct *copy_process(
>  bad_fork_cleanup_io:
>  	if (p->io_context)
>  		exit_io_context(p);
> -bad_fork_cleanup_container:
> -	exit_container(p);
>  bad_fork_cleanup_namespaces:
>  	exit_task_namespaces(p);
> +bad_fork_cleanup_container:
> +	exit_container(p);
>  bad_fork_cleanup_mm:
>  	if (p->mm)
>  		mmput(p->mm);
> @@ -2183,7 +2208,7 @@ struct task_struct *fork_idle(int cpu)
>  {
>  	struct task_struct *task;
>  	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0,
> -			    cpu_to_node(cpu));
> +			    cpu_to_node(cpu), NULL);
>  	if (!IS_ERR(task)) {
>  		init_idle_pids(task);
>  		init_idle(task, cpu);
> @@ -2195,15 +2220,16 @@ struct task_struct *fork_idle(int cpu)
>  /*
>   *  Ok, this is the main fork-routine.
>   *
> - * It copies the process, and if successful kick-starts
> - * it and waits for it to finish using the VM if required.
> + * It copies the process into the specified container, and if successful
> + * kick-starts it and waits for it to finish using the VM if required.
>   */
>  long _do_fork(unsigned long clone_flags,
>  	      unsigned long stack_start,
>  	      unsigned long stack_size,
>  	      int __user *parent_tidptr,
>  	      int __user *child_tidptr,
> -	      unsigned long tls)
> +	      unsigned long tls,
> +	      struct container *dest_container)
>  {
>  	struct completion vfork;
>  	struct pid *pid;
> @@ -2229,8 +2255,32 @@ long _do_fork(unsigned long clone_flags,
>  			trace = 0;
>  	}
>  
> +	if (dest_container) {
> +		/* A process spawned into a container doesn't share anything
> +		 * with the parent other than namespaces.
> +		 */
> +		if (clone_flags & (CLONE_CHILD_CLEARTID |
> +				   CLONE_CHILD_SETTID |
> +				   CLONE_FILES |
> +				   CLONE_FS |
> +				   CLONE_IO |
> +				   CLONE_PARENT |
> +				   CLONE_PARENT_SETTID |
> +				   CLONE_PTRACE |
> +				   CLONE_SETTLS |
> +				   CLONE_SIGHAND |
> +				   CLONE_SYSVSEM |
> +				   CLONE_THREAD))
> +			return -EINVAL;
> +
> +		/* However, we do have to let kernel threads borrow a VM. */
> +		if ((clone_flags & CLONE_VM) && current->mm)
> +			return -EINVAL;
> +	}
> +	
>  	p = copy_process(clone_flags, stack_start, stack_size,
> -			 child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
> +			 child_tidptr, NULL, trace, tls, NUMA_NO_NODE,
> +			 dest_container);
>  	add_latent_entropy();
>  
>  	if (IS_ERR(p))
> @@ -2279,7 +2329,7 @@ long do_fork(unsigned long clone_flags,
>  	      int __user *child_tidptr)
>  {
>  	return _do_fork(clone_flags, stack_start, stack_size,
> -			parent_tidptr, child_tidptr, 0);
> +			parent_tidptr, child_tidptr, 0, NULL);
>  }
>  #endif
>  
> @@ -2289,14 +2339,14 @@ long do_fork(unsigned long clone_flags,
>  pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
>  {
>  	return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
> -		(unsigned long)arg, NULL, NULL, 0);
> +			(unsigned long)arg, NULL, NULL, 0, NULL);
>  }
>  
>  #ifdef __ARCH_WANT_SYS_FORK
>  SYSCALL_DEFINE0(fork)
>  {
>  #ifdef CONFIG_MMU
> -	return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
> +	return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0, NULL);
>  #else
>  	/* can not support in nommu mode */
>  	return -EINVAL;
> @@ -2308,7 +2358,26 @@ SYSCALL_DEFINE0(fork)
>  SYSCALL_DEFINE0(vfork)
>  {
>  	return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
> -			0, NULL, NULL, 0);
> +			0, NULL, NULL, 0, NULL);
> +}
> +#endif
> +
> +#ifdef CONFIG_CONTAINERS
> +SYSCALL_DEFINE1(fork_into_container, int, containerfd)
> +{
> +	struct fd f = fdget(containerfd);
> +	int ret;
> +
> +	if (!f.file)
> +		return -EBADF;
> +	ret = -EINVAL;
> +	if (is_container_file(f.file)) {
> +		struct container *dest_container = f.file->private_data;
> +
> +		ret = _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0, dest_container);
> +	}
> +	fdput(f);
> +	return ret;
>  }
>  #endif
>  
> @@ -2336,7 +2405,8 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
>  		 unsigned long, tls)
>  #endif
>  {
> -	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
> +	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls,
> +			NULL);
>  }
>  #endif
>  
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index 4bb5184b3a80..4031075300a4 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -136,12 +136,19 @@ struct nsproxy *create_new_namespaces(unsigned long flags,
>   * called from clone.  This now handles copy for nsproxy and all
>   * namespaces therein.
>   */
> -int copy_namespaces(unsigned long flags, struct task_struct *tsk)
> +int copy_namespaces(unsigned long flags, struct task_struct *tsk,
> +		    struct container *dest_container)
>  {
>  	struct nsproxy *old_ns = tsk->nsproxy;
>  	struct user_namespace *user_ns = task_cred_xxx(tsk, user_ns);
>  	struct nsproxy *new_ns;
>  
> +	if (dest_container) {
> +		get_nsproxy(dest_container->ns);
> +		tsk->nsproxy = dest_container->ns;
> +		return 0;
> +	}
> +
>  	if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
>  			      CLONE_NEWPID | CLONE_NEWNET |
>  			      CLONE_NEWCGROUP)))) {
> @@ -163,7 +170,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
>  		(CLONE_NEWIPC | CLONE_SYSVSEM)) 
>  		return -EINVAL;
>  
> -	new_ns = create_new_namespaces(flags, tsk->nsproxy, user_ns, tsk->fs);
> +	new_ns = create_new_namespaces(flags, old_ns, user_ns, tsk->fs);
>  	if (IS_ERR(new_ns))
>  		return  PTR_ERR(new_ns);
>  
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index f0455cbb91cf..a23ad529d548 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -144,6 +144,7 @@ COND_SYSCALL(container_create);
>  /* kernel/exit.c */
>  
>  /* kernel/fork.c */
> +COND_SYSCALL(fork_into_container);
>  
>  /* kernel/futex.c */
>  COND_SYSCALL(futex);

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 05/27] containers: Open a socket inside a container
  2019-02-15 16:07 ` [RFC PATCH 05/27] containers: Open a socket inside " David Howells
@ 2019-02-19 16:41   ` Eric W. Biederman
  0 siblings, 0 replies; 61+ messages in thread
From: Eric W. Biederman @ 2019-02-19 16:41 UTC (permalink / raw)
  To: David Howells
  Cc: keyrings, trond.myklebust, sfrench, linux-security-module,
	linux-nfs, linux-cifs, linux-fsdevel, rgb, linux-kernel

David Howells <dhowells@redhat.com> writes:

> Provide a system call to open a socket inside of a container, using that
> container's network namespace.  This allows netlink to be used to manage
> the container.
>
> 	fd = container_socket(int container_fd,
> 			      int domain, int type, int protocol);
>

Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>

Use a namespace file descriptor if you need this.  So far we have not
added this system call as it is just a performance optimization.  And it
has been too niche to matter.

If that has changed we can add this separately from everything else
you are doing here.
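
For reference, the namespace-file-descriptor route might look roughly like
the sketch below: open the target's network namespace file, setns() into
it, create the socket, then switch back.  The helper name
socket_in_netns() and the /proc paths are illustrative assumptions and do
not come from this series.  setns() needs CAP_SYS_ADMIN over the target
namespace, and a single-threaded caller is assumed.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for setns() and CLONE_NEWNET */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a socket inside the network namespace
 * named by ns_path (for example "/proc/<pid>/ns/net").
 *
 * Returns the socket fd, or -1 with errno set.  The calling thread is
 * moved back to its original network namespace before returning.
 */
int socket_in_netns(const char *ns_path, int domain, int type, int protocol)
{
	int self_ns, target_ns, sock = -1, err = 0;

	assert(ns_path != NULL);

	/* Keep a handle on our own namespace so we can switch back. */
	self_ns = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);
	if (self_ns < 0)
		return -1;

	target_ns = open(ns_path, O_RDONLY | O_CLOEXEC);
	if (target_ns < 0) {
		err = errno;
		goto out_self;
	}

	if (setns(target_ns, CLONE_NEWNET) < 0) {
		err = errno;
		goto out_target;
	}

	/* A socket stays in the namespace it was created in. */
	sock = socket(domain, type, protocol);
	if (sock < 0)
		err = errno;

	/* Return to the original namespace; give up the socket if we cannot. */
	if (setns(self_ns, CLONE_NEWNET) < 0) {
		err = errno;
		if (sock >= 0)
			close(sock);
		sock = -1;
	}

out_target:
	close(target_ns);
out_self:
	close(self_ns);
	if (sock < 0)
		errno = err;
	return sock;
}
```

A caller would pass something like "/proc/1234/ns/net" (the pid is made
up) together with the usual socket() arguments.  The cost relative to a
dedicated system call is the two extra setns() round trips.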


> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
>
>  arch/x86/entry/syscalls/syscall_32.tbl |    1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |    1 +
>  include/linux/socket.h                 |    3 ++-
>  include/linux/syscalls.h               |    2 ++
>  kernel/sys_ni.c                        |    1 +
>  net/compat.c                           |    2 +-
>  net/socket.c                           |   34 +++++++++++++++++++++++++++-----
>  7 files changed, 37 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 8666693510f9..f4c9beff77a6 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -409,3 +409,4 @@
>  395	i386	sb_notify		sys_sb_notify			__ia32_sys_sb_notify
>  396	i386	container_create	sys_container_create		__ia32_sys_container_create
>  397	i386	fork_into_container	sys_fork_into_container		__ia32_sys_fork_into_container
> +398	i386	container_socket	sys_container_socket		__ia32_sys_container_socket
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index d40d4790fcb2..e20cdf7b5527 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -354,6 +354,7 @@
>  343	common	sb_notify		__x64_sys_sb_notify
>  344	common	container_create	__x64_sys_container_create
>  345	common	fork_into_container	__x64_sys_fork_into_container
> +346	common	container_socket	__x64_sys_container_socket
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/socket.h b/include/linux/socket.h
> index ab2041a00e01..154ac900a8a5 100644
> --- a/include/linux/socket.h
> +++ b/include/linux/socket.h
> @@ -10,6 +10,7 @@
>  #include <linux/compiler.h>		/* __user			*/
>  #include <uapi/linux/socket.h>
>  
> +struct net;
>  struct pid;
>  struct cred;
>  
> @@ -376,7 +377,7 @@ extern int __sys_sendto(int fd, void __user *buff, size_t len,
>  			int addr_len);
>  extern int __sys_accept4(int fd, struct sockaddr __user *upeer_sockaddr,
>  			 int __user *upeer_addrlen, int flags);
> -extern int __sys_socket(int family, int type, int protocol);
> +extern int __sys_socket(struct net *net, int family, int type, int protocol);
>  extern int __sys_bind(int fd, struct sockaddr __user *umyaddr, int addrlen);
>  extern int __sys_connect(int fd, struct sockaddr __user *uservaddr,
>  			 int addrlen);
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 15e5cc704df3..547334c6ffc2 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -947,6 +947,8 @@ asmlinkage long sys_container_create(const char __user *name, unsigned int flags
>  				     unsigned long spare3, unsigned long spare4,
>  				     unsigned long spare5);
>  asmlinkage long sys_fork_into_container(int containerfd);
> +asmlinkage long sys_container_socket(int containerfd,
> +				     int domain, int type, int protocol);
>  
>  /*
>   * Architecture-specific system calls
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index a23ad529d548..ce9c5bb30e7f 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -236,6 +236,7 @@ COND_SYSCALL(shmdt);
>  /* net/socket.c */
>  COND_SYSCALL(socket);
>  COND_SYSCALL(socketpair);
> +COND_SYSCALL(container_socket);
>  COND_SYSCALL(bind);
>  COND_SYSCALL(listen);
>  COND_SYSCALL(accept);
> diff --git a/net/compat.c b/net/compat.c
> index 959d1c51826d..1b2db740fd33 100644
> --- a/net/compat.c
> +++ b/net/compat.c
> @@ -856,7 +856,7 @@ COMPAT_SYSCALL_DEFINE2(socketcall, int, call, u32 __user *, args)
>  
>  	switch (call) {
>  	case SYS_SOCKET:
> -		ret = __sys_socket(a0, a1, a[2]);
> +		ret = __sys_socket(current->nsproxy->net_ns, a0, a1, a[2]);
>  		break;
>  	case SYS_BIND:
>  		ret = __sys_bind(a0, compat_ptr(a1), a[2]);
> diff --git a/net/socket.c b/net/socket.c
> index 7d271a1d0c7e..7406580598b9 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -80,6 +80,7 @@
>  #include <linux/highmem.h>
>  #include <linux/mount.h>
>  #include <linux/fs_context.h>
> +#include <linux/container.h>
>  #include <linux/security.h>
>  #include <linux/syscalls.h>
>  #include <linux/compat.h>
> @@ -1326,9 +1327,9 @@ int sock_create_kern(struct net *net, int family, int type, int protocol, struct
>  }
>  EXPORT_SYMBOL(sock_create_kern);
>  
> -int __sys_socket(int family, int type, int protocol)
> +int __sys_socket(struct net *net, int family, int type, int protocol)
>  {
> -	int retval;
> +	long retval;
>  	struct socket *sock;
>  	int flags;
>  
> @@ -1346,7 +1347,7 @@ int __sys_socket(int family, int type, int protocol)
>  	if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
>  		flags = (flags & ~SOCK_NONBLOCK) | O_NONBLOCK;
>  
> -	retval = sock_create(family, type, protocol, &sock);
> +	retval = __sock_create(net, family, type, protocol, &sock, 0);
>  	if (retval < 0)
>  		return retval;
>  
> @@ -1355,9 +1356,32 @@ int __sys_socket(int family, int type, int protocol)
>  
>  SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
>  {
> -	return __sys_socket(family, type, protocol);
> +	return __sys_socket(current->nsproxy->net_ns, family, type, protocol);
>  }
>  
> +/*
> + * Create a socket inside a container.
> + */
> +#ifdef CONFIG_CONTAINERS
> +SYSCALL_DEFINE4(container_socket,
> +		int, containerfd, int, family, int, type, int, protocol)
> +{
> +	struct fd f = fdget(containerfd);
> +	long ret;
> +
> +	if (!f.file)
> +		return -EBADF;
> +	ret = -EINVAL;
> +	if (is_container_file(f.file)) {
> +		struct container *c = f.file->private_data;
> +
> +		ret = __sys_socket(c->ns->net_ns, family, type, protocol);
> +	}
> +	fdput(f);
> +	return ret;
> +}
> +#endif
> +
>  /*
>   *	Create a pair of connected sockets.
>   */
> @@ -2555,7 +2579,7 @@ SYSCALL_DEFINE2(socketcall, int, call, unsigned long __user *, args)
>  
>  	switch (call) {
>  	case SYS_SOCKET:
> -		err = __sys_socket(a0, a1, a[2]);
> +		err = __sys_socket(current->nsproxy->net_ns, a0, a1, a[2]);
>  		break;
>  	case SYS_BIND:
>  		err = __sys_bind(a0, (struct sockaddr __user *)a1, a[2]);

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 06/27] containers, vfs: Allow syscall dirfd arguments to take a container fd
  2019-02-15 16:08 ` [RFC PATCH 06/27] containers, vfs: Allow syscall dirfd arguments to take a container fd David Howells
@ 2019-02-19 16:45   ` Eric W. Biederman
  2019-02-19 23:24   ` David Howells
  1 sibling, 0 replies; 61+ messages in thread
From: Eric W. Biederman @ 2019-02-19 16:45 UTC (permalink / raw)
  To: David Howells
  Cc: keyrings, trond.myklebust, sfrench, linux-security-module,
	linux-nfs, linux-cifs, linux-fsdevel, rgb, linux-kernel

David Howells <dhowells@redhat.com> writes:

> Some filesystem system calls, such as mkdirat(), take a 'directory fd' to
> specify the pathwalk origin.  This takes either AT_FDCWD or a file
> descriptor that refers to an open directory.
>
> Make it possible to supply a container fd, as obtained from
> container_create(), instead thereby specifying the container's root as the
> origin.  This performs the filesystem operation into the container's mount
> namespace.  For example:
>
> 	int cfd = container_create("fred", CONTAINER_NEW_MNT_NS, 0);
> 	mkdirat(cfd, "/fred", 0755);
>
> A better way to do this might be to temporarily override current->fs and
> current->nsproxy, but this requires splitting those fields so that procfs
> doesn't see the override.
>
> A sequence number and lock are available to protect the root pointer in
> case container_chroot() and/or container_pivot_root() are implemented.

If this is desirable we can do this without a ``container''.  We already
have mount namespaces.

Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>

In fact if you take care to use a path that starts with '/' the normal
dirfd based operations work just fine.

So I don't see the point of this system call at all.
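
As a point of comparison, here is a minimal sketch of an ordinary
dirfd-based operation with mkdirat().  The helper name mkdir_under() is
made up for illustration.  The sketch passes a relative name, since the
*at() calls ignore the dirfd when given an absolute path; using a
directory reached through another mount namespace as dir_path is an
assumption of the sketch, not something spelled out above.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for mkdirat(), O_DIRECTORY, O_CLOEXEC */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create directory "name" beneath dir_path,
 * resolving "name" relative to a directory fd instead of the CWD.
 *
 * Returns 0 on success, -1 with errno set on failure.
 */
int mkdir_under(const char *dir_path, const char *name, mode_t mode)
{
	int dfd, ret, err;

	assert(dir_path != NULL && name != NULL);

	/* O_DIRECTORY makes the open fail if dir_path is not a directory. */
	dfd = open(dir_path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
	if (dfd < 0)
		return -1;

	/* "name" is looked up starting from dfd, not from the CWD. */
	ret = mkdirat(dfd, name, mode);
	err = errno;

	close(dfd);
	errno = err;	/* close() must not clobber the mkdirat() error */
	return ret;
}
```

For example, mkdir_under("/tmp", "fred", 0755) creates /tmp/fred without
changing the caller's working directory.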


> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
>
>  fs/namei.c |   45 ++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 34 insertions(+), 11 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index a85deb55d0c9..4932b5467285 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2232,20 +2232,43 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
>  		if (!f.file)
>  			return ERR_PTR(-EBADF);
>  
> -		dentry = f.file->f_path.dentry;
> +		if (is_container_file(f.file)) {
> +			struct container *c = f.file->private_data;
> +			unsigned seq;
>  
> -		if (*s && unlikely(!d_can_lookup(dentry))) {
> -			fdput(f);
> -			return ERR_PTR(-ENOTDIR);
> -		}
> +			if (!*s)
> +				return ERR_PTR(-EINVAL);
>  
> -		nd->path = f.file->f_path;
> -		if (flags & LOOKUP_RCU) {
> -			nd->inode = nd->path.dentry->d_inode;
> -			nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
> +			if (flags & LOOKUP_RCU) {
> +				do {
> +					seq = read_seqcount_begin(&c->seq);
> +					nd->path = c->root;
> +					nd->inode = nd->path.dentry->d_inode;
> +					nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
> +				} while (read_seqcount_retry(&c->seq, seq));
> +			} else {
> +				spin_lock(&c->lock);
> +				nd->path = c->root;
> +				path_get(&nd->path);
> +				spin_unlock(&c->lock);
> +				nd->inode = nd->path.dentry->d_inode;
> +			}
>  		} else {
> -			path_get(&nd->path);
> -			nd->inode = nd->path.dentry->d_inode;
> +			dentry = f.file->f_path.dentry;
> +
> +			if (*s && unlikely(!d_can_lookup(dentry))) {
> +				fdput(f);
> +				return ERR_PTR(-ENOTDIR);
> +			}
> +
> +			nd->path = f.file->f_path;
> +			if (flags & LOOKUP_RCU) {
> +				nd->inode = nd->path.dentry->d_inode;
> +				nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
> +			} else {
> +				path_get(&nd->path);
> +				nd->inode = nd->path.dentry->d_inode;
> +			}
>  		}
>  		fdput(f);
>  		return s;

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 02/27] containers: Implement containers as kernel objects
  2019-02-15 16:07 ` [RFC PATCH 02/27] containers: Implement containers as kernel objects David Howells
  2019-02-17 18:57   ` Trond Myklebust
  2019-02-17 19:39   ` James Bottomley
@ 2019-02-19 16:56   ` Eric W. Biederman
  2019-02-19 23:03   ` David Howells
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 61+ messages in thread
From: Eric W. Biederman @ 2019-02-19 16:56 UTC (permalink / raw)
  To: David Howells
  Cc: keyrings, trond.myklebust, sfrench, linux-security-module,
	linux-nfs, linux-cifs, linux-fsdevel, rgb, linux-kernel

David Howells <dhowells@redhat.com> writes:

The container id details are ludicrous and will break practically
every use case.  This is completely unacceptable.

Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>

> diff --git a/include/linux/container.h b/include/linux/container.h
> new file mode 100644
> index 000000000000..0a8918435097
> --- /dev/null
> +++ b/include/linux/container.h
> +/*
> + * The container object.
> + */
> +struct container {
> +	u64			id;		/* Container ID */
...

No.  This is absolutely unacceptable, as this breaks nested containers
and process migration.

> +};
> +
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d2f90fa92468..073a3a930514 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -36,6 +36,7 @@ struct backing_dev_info;
>  struct bio_list;
>  struct blk_plug;
>  struct cfs_rq;
> +struct container;
>  struct fs_struct;
>  struct futex_pi_state;
>  struct io_context;
> @@ -870,6 +871,8 @@ struct task_struct {
>  
>  	/* Namespaces: */
>  	struct nsproxy			*nsproxy;
> +	struct container		*container;
> +	struct list_head		container_link;

Why?  nsproxy would be a much cheaper location to put this.
Less space and less foobar.

>  	/* Signal handlers: */
>  	struct signal_struct		*signal;
> diff --git a/kernel/container.c b/kernel/container.c
> new file mode 100644
> index 000000000000..ca4012632cfa
> --- /dev/null
> +++ b/kernel/container.c
> @@ -0,0 +1,348 @@
[...]
> +
> +	c->id = atomic64_inc_return(&container_id_counter);

This id is not in a namespace, and it doesn't have enough bits
of entropy to be globally unique.   Not that 64 bits is enough
to have a chance at being globally unique.


Eric

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 02/27] containers: Implement containers as kernel objects
  2019-02-15 16:07 ` [RFC PATCH 02/27] containers: Implement containers as kernel objects David Howells
                     ` (2 preceding siblings ...)
  2019-02-19 16:56   ` Eric W. Biederman
@ 2019-02-19 23:03   ` David Howells
  2019-02-20 14:23     ` Trond Myklebust
  2019-02-19 23:06   ` David Howells
                     ` (3 subsequent siblings)
  7 siblings, 1 reply; 61+ messages in thread
From: David Howells @ 2019-02-19 23:03 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: dhowells, sfrench, keyrings, rgb, linux-kernel,
	linux-security-module, linux-nfs, linux-cifs, linux-fsdevel

Trond Myklebust <trondmy@hammerspace.com> wrote:

> Do we really need a new system call to set up containers? That would
> force changes to all existing orchestration software.

No, it wouldn't.  Nothing in my patches forces existing orchestration software
to change, unless it wants to use the new facilities - then it would have to
be changed anyway, right?  I will grant, though, that the extent of the change
might vary.

> Given that the main thing we want to achieve is to direct messages from
> the kernel to an appropriate handler, why not focus on adding
> functionality to do just that?

Because it's *not* just that that is added here.  There are a number of things
this patchset (and one it depends on) provides:

 (1) The ability to intercept request_key() upcalls that happen inside a
     container, filtered by operative namespace.

 (2) The ability to provide a per-container keyring that can hold keys that
     can be used inside the container without any action on behalf of the
     denizens of the container.

 (3) The ability to grant permissions to a *container* as a subject, allowing
     it and its denizens to use, but not necessarily read, modify, link or
     invalidate a key.

 (4) The ability to create superblocks inside a container with a separate
     mount namespace from outside, such that they can use the container keys,
     thereby allowing the root of a container to be on an authenticated
     filesystem.

> Is there any reason why a syscall to allow an appropriately privileged
> process to add a keyring-specific message queue to its own
> user_namespace and obtain a file descriptor to that message queue might
> not work?

Yes.  That forces the use of a new user_namespace for every container in which
you want to use any of the above features.  The user_namespace is already way
too big and intrusive a hammer as it is.

> With such an implementation, the fallback mechanism could be to walk
> back up the hierarchy of user_namespaces until a message queue is
> found, and to invoke the existing request_key mechanism if not.

That's definitely wrong.  /sbin/request-key should *not* be spawned if the key
to be instantiated is not in all the init namespaces.

I went with a container object with namespaces for a reason: initially, it was
so that the upcall could take place inside of the container's namespaces, but
now it's so that any request that doesn't match the namespaces on the
container gets rejected at the boundary - so that some daemon up the chain
doesn't try servicing a request for which it can't access the config data or
would end up talking out of the wrong NIC.
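
The difference between the two policies discussed above (walking up the
hierarchy until a handler is found, versus rejecting a mismatched request
at the boundary) can be illustrated with a small user-space sketch.  Every
type and function name in it is hypothetical and only models the idea;
none of it is taken from the patches, and the boundary check shown is just
one possible reading of the rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from the patches. */
struct ns_set {
	int net_ns;	/* opaque namespace identities */
	int mnt_ns;
};

struct upcall_handler {
	const char *name;
};

struct scope {
	struct scope *parent;		/* NULL at the init level */
	struct ns_set ns;		/* namespaces this scope runs in */
	struct upcall_handler *handler;	/* NULL if nothing listens here */
};

static bool ns_equal(const struct ns_set *a, const struct ns_set *b)
{
	return a->net_ns == b->net_ns && a->mnt_ns == b->mnt_ns;
}

/* Policy A: walk towards the root until some scope has a handler. */
struct upcall_handler *find_handler_walk(const struct scope *s)
{
	for (; s; s = s->parent)
		if (s->handler)
			return s->handler;
	return NULL;	/* nobody listens; caller uses its default */
}

/*
 * Policy B: the same walk, but every scope first compares the request's
 * operative namespaces with its own.  A mismatch stops the walk, so a
 * handler further up never sees a request it cannot service.
 *
 * Returns the handler or NULL; *rejected distinguishes "refused at a
 * boundary" from "nobody listens".
 */
struct upcall_handler *find_handler_checked(const struct scope *s,
					    const struct ns_set *req,
					    bool *rejected)
{
	assert(req != NULL && rejected != NULL);

	*rejected = false;
	for (; s; s = s->parent) {
		if (!ns_equal(&s->ns, req)) {
			*rejected = true;
			return NULL;
		}
		if (s->handler)
			return s->handler;
	}
	return NULL;
}
```

With a handler only at the init level and a request issued from a child
scope in different namespaces, policy A hands the request to the init
level handler, whereas policy B refuses it once the walk reaches the init
scope.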

I can drop the container object part of it for the moment.

I could instead create 1-3 new namespaces:

 (1) A namespace with an upcall-interception point.

 (2) A namespace with a container keyring.

 (3) A namespace with a subject ID for use in key ACLs.

I think I should also consider adding:

 (4) A namespace with keyring names in it.  I'm leaning towards this not being
     part of user_namespace because these probably should not be visible
     between containers.

David


* Re: [RFC PATCH 02/27] containers: Implement containers as kernel objects
  2019-02-15 16:07 ` [RFC PATCH 02/27] containers: Implement containers as kernel objects David Howells
                     ` (3 preceding siblings ...)
  2019-02-19 23:03   ` David Howells
@ 2019-02-19 23:06   ` David Howells
  2019-02-20  2:20     ` James Bottomley
  2019-02-19 23:13   ` David Howells
                     ` (2 subsequent siblings)
  7 siblings, 1 reply; 61+ messages in thread
From: David Howells @ 2019-02-19 23:06 UTC (permalink / raw)
  To: James Bottomley
  Cc: dhowells, keyrings, trond.myklebust, sfrench,
	linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	linux-kernel, containers, cgroups

James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

> I thought we got agreement years ago that containers don't exist in
> Linux as a single entity: they're currently a collection of cgroups and
> namespaces some of which may and some of which may not be local to the
> entity the orchestration system thinks of as a "container".

I wasn't party to that agreement and don't feel particularly bound by it.

David


* Re: [RFC PATCH 02/27] containers: Implement containers as kernel objects
  2019-02-15 16:07 ` [RFC PATCH 02/27] containers: Implement containers as kernel objects David Howells
                     ` (4 preceding siblings ...)
  2019-02-19 23:06   ` David Howells
@ 2019-02-19 23:13   ` David Howells
  2019-02-19 23:55   ` Tycho Andersen
  2019-02-20  2:46   ` Ian Kent
  7 siblings, 0 replies; 61+ messages in thread
From: David Howells @ 2019-02-19 23:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: dhowells, keyrings, trond.myklebust, sfrench,
	linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	linux-kernel

Eric W. Biederman <ebiederm@xmission.com> wrote:

> > +	c->id = atomic64_inc_return(&container_id_counter);
> 
> This id is not in a namespace, and it doesn't have enough bits
> of entropy to be globally unique.   Not that 64bit is enough
> to have a chance at being globally unique.

It's in a container, so it doesn't need to be in a namespace.  The intended
purpose is for annotating audit messages.  Globally unique wasn't particularly
in mind.  It could be turned into, say, a uuid, so that isn't really a problem
at this point.

You are right, though, it really should be globally unique as far as possible -
even the one in init_container should be.  Ideally, it would look the same
inside the root container as any subcontainer.
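
To make the two options concrete, here is a userspace sketch contrasting a
monotonic 64-bit counter (the analogue of the quoted atomic64_inc_return())
with a random 128-bit, UUID-style ID.  The function and struct names are
made up for the example, and getrandom() is only a userspace stand-in for
whatever the kernel would actually use.

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <stdint.h>
#include <sys/random.h>
#include <sys/types.h>

/* Userspace analogue of the counter in the quoted patch hunk. */
static _Atomic uint64_t container_id_counter;

/* Unique within one boot of one machine, but trivially repeats elsewhere. */
static uint64_t next_counter_id(void)
{
	/* atomic_fetch_add() returns the old value; +1 mirrors inc_return. */
	return atomic_fetch_add(&container_id_counter, 1) + 1;
}

/* Hypothetical 128-bit ID type for the example. */
struct container_uuid {
	unsigned char b[16];
};

/* Fill in a random (version 4 style) ID.  Returns 0 on success, -1 on error. */
static int new_random_id(struct container_uuid *id)
{
	ssize_t n = getrandom(id->b, sizeof(id->b), 0);

	if (n != (ssize_t)sizeof(id->b))
		return -1;
	id->b[6] = (id->b[6] & 0x0f) | 0x40;	/* version 4 */
	id->b[8] = (id->b[8] & 0x3f) | 0x80;	/* RFC 4122 variant */
	return 0;
}
```

The counter is cheap and ordered but only unique per boot; the random ID
carries 122 bits of entropy, which is what makes a collision between
unrelated machines improbable.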

David


* Re: [RFC PATCH 04/27] containers: Allow a process to be forked into a container
  2019-02-15 16:07 ` [RFC PATCH 04/27] containers: Allow a process to be forked into a container David Howells
  2019-02-15 17:39   ` Stephen Smalley
  2019-02-19 16:39   ` Eric W. Biederman
@ 2019-02-19 23:16   ` David Howells
  2 siblings, 0 replies; 61+ messages in thread
From: David Howells @ 2019-02-19 23:16 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: dhowells, keyrings, trond.myklebust, sfrench,
	linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	linux-kernel

Eric W. Biederman <ebiederm@xmission.com> wrote:

> > Further attempts to fork into the container will be rejected.
>
> There are no technical reasons to disallow this, and many good practical
> reasons to allow this.

Fair enough; that can be done.  Could even emulate /sbin/request-key upcalling
that way, with the manager spawning the daemon into the container with it.

> So you are breaking nsenter, and its like.

It shouldn't stop nsenter() from working.

David


* Re: [RFC PATCH 06/27] containers, vfs: Allow syscall dirfd arguments to take a container fd
  2019-02-15 16:08 ` [RFC PATCH 06/27] containers, vfs: Allow syscall dirfd arguments to take a container fd David Howells
  2019-02-19 16:45   ` Eric W. Biederman
@ 2019-02-19 23:24   ` David Howells
  1 sibling, 0 replies; 61+ messages in thread
From: David Howells @ 2019-02-19 23:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: dhowells, keyrings, trond.myklebust, sfrench,
	linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	linux-kernel

Eric W. Biederman <ebiederm@xmission.com> wrote:

> In fact if you take care to use a path that starts with '/' the normal
> dirfd based operations work just fine.

If the path starts with '/', dirfd is ignored.  And there's an error in my
patch description - it should be:

 	mkdirat(cfd, "fred", 0755);
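
The dirfd behaviour can be checked with an ordinary directory fd.  The
sketch below uses only existing calls; the helper names and the /tmp scratch
directory are assumptions for the example, and a container fd as proposed in
this series would merely take the place of dfd.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return 1 if path exists and is a directory, 0 otherwise. */
static int is_dir(const char *path)
{
	struct stat st;

	return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

/*
 * Exercise both cases: a relative path is resolved against dirfd, whereas
 * an absolute path ignores dirfd.  Returns 0 if both behave that way.
 */
static int demo_dirfd_semantics(void)
{
	char base[] = "/tmp/dirfd-demo-XXXXXX";
	char rel_result[PATH_MAX], abs_path[PATH_MAX];
	int dfd, ok;

	if (!mkdtemp(base))
		return -1;
	snprintf(rel_result, sizeof(rel_result), "%s/fred", base);
	snprintf(abs_path, sizeof(abs_path), "%s/jim", base);

	dfd = open(base, O_RDONLY | O_DIRECTORY);
	if (dfd < 0) {
		rmdir(base);
		return -1;
	}

	/* Relative path: "fred" is created inside the directory behind dfd. */
	ok = mkdirat(dfd, "fred", 0755) == 0 && is_dir(rel_result);

	/*
	 * Absolute path: dfd is ignored.  Were it honoured, the lookup would
	 * go to <base>/tmp/... and fail with ENOENT.
	 */
	ok = ok && mkdirat(dfd, abs_path, 0755) == 0 && is_dir(abs_path);

	rmdir(rel_result);
	rmdir(abs_path);
	close(dfd);
	rmdir(base);
	return ok ? 0 : -1;
}
```

So only the relative form is steered by the fd, which is why the example in
the patch description has to be "fred" and not "/fred".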

David


* Re: [RFC PATCH 00/27] Containers and using authenticated filesystems
  2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
                   ` (28 preceding siblings ...)
  2019-02-19 16:35 ` Eric W. Biederman
@ 2019-02-19 23:42 ` David Howells
  2019-02-20  7:00   ` Paul Moore
  2019-02-20 18:54   ` Steve French
  29 siblings, 2 replies; 61+ messages in thread
From: David Howells @ 2019-02-19 23:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: dhowells, keyrings, trond.myklebust, sfrench,
	linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	linux-kernel, Linux Containers, linux-api

Eric W. Biederman <ebiederm@xmission.com> wrote:

> So you missed the main mailing lists for discussion of this kind of
> thing

Yeah, sorry about that.  I was primarily aiming it at Trond and Steve as I'd
like to consider how to go about interpolating request_key() into NFS and CIFS
so that they can make use of the key-related facilities that this makes
available with AFS.  And I was a bit tight for time to mail it out before
having to go out.  I know, excuses... ;-)

> and the maintainer.

That would be me.  I maintain keyrings.

No one is listed in MAINTAINERS as owning namespaces.  If you feel that should
be you, please add a record.

> Looking at your description you are introducing a container id.

Yes.  For audit logging, which was why I cc'd Richard.

> You don't describe which namespace your container id lives in.

It doesn't.  Not everything has to have a namespace.  As you yourself pointed
out, it should be globally unique, in which case the world is the namespace,
maybe even the universe;-).

> Without the container id living in a container this breaks
> nested containers and process migration aka CRIU.

As long as IDs are globally unique, why should that break container migration?
Having a kernel container object might even make CRIU easier.

And what does "Without the container id living in a container" mean anyway?  I
have IDs attached to containers.  A container can see the IDs of its child
containers.  There should be no problem with nesting.

David


* Re: [RFC PATCH 02/27] containers: Implement containers as kernel objects
  2019-02-15 16:07 ` [RFC PATCH 02/27] containers: Implement containers as kernel objects David Howells
                     ` (5 preceding siblings ...)
  2019-02-19 23:13   ` David Howells
@ 2019-02-19 23:55   ` Tycho Andersen
  2019-02-20  2:46   ` Ian Kent
  7 siblings, 0 replies; 61+ messages in thread
From: Tycho Andersen @ 2019-02-19 23:55 UTC (permalink / raw)
  To: David Howells
  Cc: keyrings, trond.myklebust, sfrench, linux-security-module,
	linux-nfs, linux-cifs, linux-fsdevel, rgb, linux-kernel

On Fri, Feb 15, 2019 at 04:07:33PM +0000, David Howells wrote:
> ==================
> FUTURE DEVELOPMENT
> ==================
> 
>  (1) Setting up the container.
> 
>      A container would be created with, say:
> 
> 	int cfd = container_create("fred", CONTAINER_NEW_EMPTY_FS_NS);
> 

...

>      Further mounts can be added by:
> 
> 	move_mount(mfd, "", cfd, "proc", MOVE_MOUNT_F_EMPTY_PATH);
> 

...

>  (2) Starting the container.
> 
>      Once all modifications are complete, the container's 'init' process
>      can be started by:
> 
> 	fork_into_container(int cfd);
> 
>      This precludes further external modification of the mount tree within
>      the container.

Is there a technical reason for this? In particular, there are some
container runtimes that do this today via clever use of bind mounts
and MS_MOVE, for things like dynamically attaching volumes. It would
be useful to be able to mount things into the container after the
fact.

>  (3) Waiting for the container to complete.
> 
>      The container fd can then be polled to wait for init process therein
>      to complete and the exit code collected by:
> 
> 	container_wait(int container_fd, int *_wstatus, unsigned int wait,
> 		       struct rusage *rusage);
> 
>      The container and everything in it can be terminated or killed off:
> 
> 	container_kill(int container_fd, int initonly, int signal);
> 
>      If 'init' dies, all other processes in the container are preemptively
>      SIGKILL'd by the kernel.

Isn't this essentially how the pid ns works today? I'm not sure what
the container fd offers here (of course if it lands, then having the
same semantics makes sense).

>  (6) Running different LSM policies by container.  This might particularly
>      make sense with something like Apparmor where different path-based
>      rules might be required inside a container to inside the parent.

Apparmor supports this today, as long as the host is also running
Apparmor. For the more general case, Casey (and others) have been
working on LSM stacking for a long time.

Tycho


* Re: [RFC PATCH 02/27] containers: Implement containers as kernel objects
  2019-02-19 23:06   ` David Howells
@ 2019-02-20  2:20     ` James Bottomley
  2019-02-20  3:04       ` Ian Kent
  0 siblings, 1 reply; 61+ messages in thread
From: James Bottomley @ 2019-02-20  2:20 UTC (permalink / raw)
  To: David Howells
  Cc: keyrings, trond.myklebust, sfrench, linux-security-module,
	linux-nfs, linux-cifs, linux-fsdevel, rgb, linux-kernel,
	containers, cgroups

On Tue, 2019-02-19 at 23:06 +0000, David Howells wrote:
> James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> 
> > I thought we got agreement years ago that containers don't exist in
> > Linux as a single entity: they're currently a collection of cgroups
> > and namespaces some of which may and some of which may not be local
> > to the entity the orchestration system thinks of as a "container".
> 
> I wasn't party to that agreement and don't feel particularly bound by
> it.

That's not at all relevant, is it?  The point is we have widespread
uses of namespaces and cgroups that span containers today meaning that
a "container id" becomes a problematic concept.  What we finally got to
with the audit people was an unmodifiable label which the orchestration
system can set ... can't you just use that?

James



* Re: [RFC PATCH 02/27] containers: Implement containers as kernel objects
  2019-02-15 16:07 ` [RFC PATCH 02/27] containers: Implement containers as kernel objects David Howells
                     ` (6 preceding siblings ...)
  2019-02-19 23:55   ` Tycho Andersen
@ 2019-02-20  2:46   ` Ian Kent
  2019-02-20 13:26     ` Christian Brauner
  7 siblings, 1 reply; 61+ messages in thread
From: Ian Kent @ 2019-02-20  2:46 UTC (permalink / raw)
  To: David Howells, keyrings, trond.myklebust, sfrench, James Bottomley
  Cc: linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	linux-kernel, containers, cgroups

On Fri, 2019-02-15 at 16:07 +0000, David Howells wrote:
> Implement a kernel container object such that it contains the following
> things:
> 
>  (1) Namespaces.
> 
>  (2) A root directory.
> 
>  (3) A set of processes, including one designated as the 'init' process.

Yeah, I think a name other than init needs to be used for this
process.

The problem being that there is no requirement for container
process 1 to behave in any way like an "init" process is
expected to behave and that leads to confusion (at least
it certainly did for me).

Admittedly I haven't yet worked through the series but in the
light of the comments from James I wanted to chime in (probably
too early to be useful not having read the series but ...).

I believe what you're trying to do here is so badly needed it
would be great if the needs of James could be met to some
(as yet undefined) satisfactory extent.

Would there be any possibility of introducing a concept of
inactive and active containers where the creation is a two
(maybe more) step procedure, first the creation of (if you like
a "true") container that's essentially empty, basically a shell
(not the program "shell" of course), inert wrt. events and such
and implement the ability to make the container active by adding
various things, like processes, to it?

Clearly the concepts of inactive and active require a definition
of what they mean and I don't have that, perhaps a starting point
could be a container that has a process 1 (which should also require
a root fs and namespaces) is active otherwise it's considered inactive.

Ian



* Re: [RFC PATCH 02/27] containers: Implement containers as kernel objects
  2019-02-20  2:20     ` James Bottomley
@ 2019-02-20  3:04       ` Ian Kent
  2019-02-20  3:46         ` James Bottomley
  0 siblings, 1 reply; 61+ messages in thread
From: Ian Kent @ 2019-02-20  3:04 UTC (permalink / raw)
  To: James Bottomley, David Howells
  Cc: keyrings, trond.myklebust, sfrench, linux-security-module,
	linux-nfs, linux-cifs, linux-fsdevel, rgb, linux-kernel,
	containers, cgroups

On Tue, 2019-02-19 at 18:20 -0800, James Bottomley wrote:
> On Tue, 2019-02-19 at 23:06 +0000, David Howells wrote:
> > James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> > 
> > > I thought we got agreement years ago that containers don't exist in
> > > Linux as a single entity: they're currently a collection of cgroups
> > > and namespaces some of which may and some of which may not be local
> > > to the entity the orchestration system thinks of as a "container".
> > 
> > I wasn't party to that agreement and don't feel particularly bound by
> > it.
> 
> That's not at all relevant, is it?  The point is we have widespread
> uses of namespaces and cgroups that span containers today meaning that
> a "container id" becomes a problematic concept.  What we finally got to
> with the audit people was an unmodifiable label which the orchestration
> system can set ... can't you just use that?

Sorry James, I fail to see how assigning an id to a collection of objects
constitutes a problem or how that could restrict the way a container is
used.

Isn't the only problem here the current restrictions on the way objects
need to be combined as a set and the ability to be able to add or subtract
from that set?

Then again the notion of active vs. inactive might not be sufficient to
allow for the needed flexibility ...

Ian



* Re: [RFC PATCH 02/27] containers: Implement containers as kernel objects
  2019-02-20  3:04       ` Ian Kent
@ 2019-02-20  3:46         ` James Bottomley
  2019-02-20  4:42           ` Ian Kent
  2019-02-20  6:57           ` Paul Moore
  0 siblings, 2 replies; 61+ messages in thread
From: James Bottomley @ 2019-02-20  3:46 UTC (permalink / raw)
  To: Ian Kent, David Howells
  Cc: keyrings, trond.myklebust, sfrench, linux-security-module,
	linux-nfs, linux-cifs, linux-fsdevel, rgb, linux-kernel,
	containers, cgroups

On Wed, 2019-02-20 at 11:04 +0800, Ian Kent wrote:
> On Tue, 2019-02-19 at 18:20 -0800, James Bottomley wrote:
> > On Tue, 2019-02-19 at 23:06 +0000, David Howells wrote:
> > > James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> > > 
> > > > I thought we got agreement years ago that containers don't
> > > > exist in Linux as a single entity: they're currently a
> > > > collection of cgroups and namespaces some of which may and some
> > > > of which may not be local to the entity the orchestration
> > > > system thinks of as a "container".
> > > 
> > > I wasn't party to that agreement and don't feel particularly
> > > bound by it.
> > 
> > That's not at all relevant, is it?  The point is we have widespread
> > uses of namespaces and cgroups that span containers today meaning
> > that a "container id" becomes a problematic concept.  What we
> > finally got to with the audit people was an unmodifiable label
> > which the orchestration system can set ... can't you just use that?
> 
> Sorry James, I fail to see how assigning an id to a collection of
> objects constitutes a problem or how that could restrict the way a
> container is used.

Rather than rehash the whole argument again, what's the reason you
can't use the audit label?  It seems to do what you want in a way that
doesn't cause problems.  If you can just use it there's little point
arguing over what is effectively a moot issue.

James


> Isn't the only problem here the current restrictions on the way
> objects need to be combined as a set and the ability to be able to add
> or subtract from that set.
> 
> Then again the notion of active vs. inactive might not be sufficient
> to allow for the needed flexibility ...
> 
> Ian
> 



* Re: [RFC PATCH 02/27] containers: Implement containers as kernel objects
  2019-02-20  3:46         ` James Bottomley
@ 2019-02-20  4:42           ` Ian Kent
  2019-02-20  6:57           ` Paul Moore
  1 sibling, 0 replies; 61+ messages in thread
From: Ian Kent @ 2019-02-20  4:42 UTC (permalink / raw)
  To: James Bottomley, David Howells
  Cc: keyrings, trond.myklebust, sfrench, linux-security-module,
	linux-nfs, linux-cifs, linux-fsdevel, rgb, linux-kernel,
	containers, cgroups

On Tue, 2019-02-19 at 19:46 -0800, James Bottomley wrote:
> On Wed, 2019-02-20 at 11:04 +0800, Ian Kent wrote:
> > On Tue, 2019-02-19 at 18:20 -0800, James Bottomley wrote:
> > > On Tue, 2019-02-19 at 23:06 +0000, David Howells wrote:
> > > > James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> > > > 
> > > > > I thought we got agreement years ago that containers don't
> > > > > exist in Linux as a single entity: they're currently a
> > > > > collection of cgroups and namespaces some of which may and some
> > > > > of which may not be local to the entity the orchestration
> > > > > system thinks of as a "container".
> > > > 
> > > > I wasn't party to that agreement and don't feel particularly
> > > > bound by it.
> > > 
> > > That's not at all relevant, is it?  The point is we have widespread
> > > uses of namespaces and cgroups that span containers today meaning
> > > that a "container id" becomes a problematic concept.  What we
> > > finally got to with the audit people was an unmodifiable label
> > > which the orchestration system can set ... can't you just use that?
> > 
> > Sorry James, I fail to see how assigning an id to a collection of
> > objects constitutes a problem or how that could restrict the way a
> > container is used.
> 
> Rather than rehash the whole argument again, what's the reason you
> can't use the audit label?  It seems to do what you want in a way that
> doesn't cause problems.  If you can just use it there's little point
> arguing over what is effectively a moot issue.

David might want to use the audit label for this, I don't know.
And maybe that's a good choice initially.

But going way off topic.

Because there is a need to not clutter kernel space with logging,
leaving it to user space to handle but also without providing user
space with sufficient information to do so there will need to be
some sort of globally unique (sub-system) identifiers of kernel
objects for which user space needs logging information so that
if or when that kernel to user space information flow is
implemented the consistent identifiers that will be needed will
at least exist for some kernel objects.

Yes, that's way off topic for this series but I think it's something
that needs at least some consideration for new implementation work.

Unfortunately properly implementing such an encoding scheme probably
warrants a completely separate project so, as you say moot wrt. this
series.

> 
> James
> 
> 
> > Isn't the only problem here the current restrictions on the way
> > objects need to be combined as a set and the ability to be able to add
> > or subtract from that set.
> > 
> > Then again the notion of active vs. inactive might not be sufficient
> > to allow for the needed flexibility ...
> > 
> > Ian
> > 
> 
> 



* Re: [RFC PATCH 02/27] containers: Implement containers as kernel objects
  2019-02-20  3:46         ` James Bottomley
  2019-02-20  4:42           ` Ian Kent
@ 2019-02-20  6:57           ` Paul Moore
  1 sibling, 0 replies; 61+ messages in thread
From: Paul Moore @ 2019-02-20  6:57 UTC (permalink / raw)
  To: James Bottomley
  Cc: Ian Kent, David Howells, keyrings, trond.myklebust, sfrench,
	linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	linux-kernel, containers, cgroups

On Tue, Feb 19, 2019 at 10:46 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> On Wed, 2019-02-20 at 11:04 +0800, Ian Kent wrote:
> > On Tue, 2019-02-19 at 18:20 -0800, James Bottomley wrote:
> > > On Tue, 2019-02-19 at 23:06 +0000, David Howells wrote:
> > > > James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> > > >
> > > > > I thought we got agreement years ago that containers don't
> > > > > exist in Linux as a single entity: they're currently a
> > > > > collection of cgroups and namespaces some of which may and some
> > > > > of which may not be local to the entity the orchestration
> > > > > system thinks of as a "container".
> > > >
> > > > I wasn't party to that agreement and don't feel particularly
> > > > bound by it.
> > >
> > > That's not at all relevant, is it?  The point is we have widespread
> > > uses of namespaces and cgroups that span containers today meaning
> > > that a "container id" becomes a problematic concept.  What we
> > > finally got to with the audit people was an unmodifiable label
> > > which the orchestration system can set ... can't you just use that?
> >
> > Sorry James, I fail to see how assigning an id to a collection of
> > objects constitutes a problem or how that could restrict the way a
> > container is used.
>
> Rather than rehash the whole argument again, what's the reason you
> can't use the audit label?  It seems to do what you want in a way that
> doesn't cause problems.  If you can just use it there's little point
> arguing over what is effectively a moot issue.

Ignoring for a moment whether or not the audit container ID is
applicable here, one of the things I've been focused on with the audit
container ID work is trying to make it difficult for other subsystems
to use it.  I've taken this stance not because I don't think something
like a container ID would be useful outside the audit subsystem, but
rather because I'm afraid of how it might be abused by other
subsystems and that abuse might threaten the existence of the audit
container ID.

If there is a willingness to implement a general kernel container ID
that behaves similarly to how the audit container ID is envisioned,
I'd much rather do that than implement something which is audit
specific.

-- 
paul moore
www.paul-moore.com


* Re: [RFC PATCH 00/27] Containers and using authenticated filesystems
  2019-02-19 23:42 ` David Howells
@ 2019-02-20  7:00   ` Paul Moore
  2019-02-20 18:54   ` Steve French
  1 sibling, 0 replies; 61+ messages in thread
From: Paul Moore @ 2019-02-20  7:00 UTC (permalink / raw)
  To: David Howells
  Cc: Eric W. Biederman, keyrings, trond.myklebust, sfrench,
	linux-security-module, linux-nfs, linux-cifs, linux-fsdevel, rgb,
	linux-kernel, Linux Containers, linux-api

On Tue, Feb 19, 2019 at 6:42 PM David Howells <dhowells@redhat.com> wrote:
> Eric W. Biederman <ebiederm@xmission.com> wrote:

...

> > Looking at your description you are introducing a container id.
>
> Yes.  For audit logging, which was why I cc'd Richard.

Not to pile on, but it is more important to CC the audit mailing list.
You can obviously still CC Richard, but you should send it to the
entire mailing list.

-- 
paul moore
www.paul-moore.com


* Re: [RFC PATCH 02/27] containers: Implement containers as kernel objects
  2019-02-20  2:46   ` Ian Kent
@ 2019-02-20 13:26     ` Christian Brauner
  2019-02-21 10:39       ` Ian Kent
  0 siblings, 1 reply; 61+ messages in thread
From: Christian Brauner @ 2019-02-20 13:26 UTC (permalink / raw)
  To: Ian Kent
  Cc: David Howells, keyrings, trond.myklebust, sfrench,
	James Bottomley, linux-cifs, linux-nfs, containers, linux-kernel,
	linux-security-module, linux-fsdevel, cgroups

On Wed, Feb 20, 2019 at 10:46:24AM +0800, Ian Kent wrote:
> On Fri, 2019-02-15 at 16:07 +0000, David Howells wrote:
> > Implement a kernel container object such that it contains the following
> > things:
> > 
> >  (1) Namespaces.
> > 
> >  (2) A root directory.
> > 
> >  (3) A set of processes, including one designated as the 'init' process.
> 
> Yeah, I think a name other than init needs to be used for this
> process.
> 
> The problem being that there is no requirement for container
> process 1 to behave in any way like an "init" process is
> expected to behave and that leads to confusion (at least
> it certainly did for me).

If you look at the documentation for pid namespaces(7) you can see that
the pid 1 inside a pid namespace is expected to behave like an init
process:
- "The first process created in a new namespace [...] has the PID 1,
  and is the "init" process for the namespace (see init(1))."
- "[...] child process that is orphaned within the namespace will be
  reparented to this process rather than init(1) [...]"
- "If the "init" process of a PID namespace terminates, the kernel
  terminates all of the processes in the namespace via a SIGKILL
  signal. This behavior reflects the fact that the "init" process is
  essential for the correct operation of a PID namespace."
- "Only signals for which the "init" process has established a signal
  handler can be sent to the "init" process by other members of the
  PID namespace."
- "[...] the reboot(2) system call causes a signal to be sent to the
  namespace "init" process."

This is one of the reasons why all major current container runtimes
finally, after years of failing to realize this, run a stub init process
that mimics a dumb init. Sure, you get away with not having an init
that behaves like an init, but this is inherently broken or at least
against the way pid namespaces were designed.
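
As a rough sketch of the core duty of such a stub init, here is a
non-blocking reaper for orphans that get reparented to pid 1.  The function
name is made up for the example and this is not taken from any particular
runtime; a real stub init would typically call it from a SIGCHLD-driven loop
and also forward signals to its main child.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Reap every child that has already exited, without blocking.
 * Returns the number of children reaped.
 */
static int reap_exited_children(void)
{
	int reaped = 0;
	int status;
	pid_t pid;

	for (;;) {
		pid = waitpid(-1, &status, WNOHANG);
		if (pid > 0) {
			reaped++;	/* collected one zombie, look for more */
			continue;
		}
		if (pid < 0 && errno == EINTR)
			continue;	/* interrupted by a signal, retry */
		break;	/* 0: children still running; ECHILD: none left */
	}
	return reaped;
}
```

Without something like this running as pid 1, exited orphans stay around as
zombies for the lifetime of the pid namespace.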


* Re: [RFC PATCH 00/27] Containers and using authenticated filesystems
  2019-02-19 16:35 ` Eric W. Biederman
@ 2019-02-20 14:18   ` Christian Brauner
  0 siblings, 0 replies; 61+ messages in thread
From: Christian Brauner @ 2019-02-20 14:18 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Howells, linux-cifs, linux-nfs, linux-api,
	Linux Containers, linux-kernel, sfrench, linux-security-module,
	keyrings, linux-fsdevel, trond.myklebust

On Tue, Feb 19, 2019 at 10:35:20AM -0600, Eric W. Biederman wrote:
> 
> So you missed the main mailing lists for discussion of this kind of
> thing, and the maintainer.  So I have reservations about the quality of
> your due diligence already.
> 
> Looking at your description you are introducing a container id.
> You don't describe which namespace your container id lives in.
> Without the container id living in a container this breaks
> nested containers and process migration aka CRIU.
> 
> So based on your description.
> 
> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> 
> 
> David Howells <dhowells@redhat.com> writes:
> 
> > Here's a collection of patches that containerises the kernel keys and makes
> > it possible to separate keys by namespace.  This can be extended to any
> > filesystem that uses request_key() to obtain the pertinent authentication
> > token on entry to VFS or socket methods.

/me puts on kernel hat:
I'm not necessarily opposed to making containers kernel objects, even
though I have been for quite a while (for brevity I'll use "kcontainers"
for this). But I think the approach taken here is a little misguided.
This patchset pushes the argument that kcontainers are needed because
of keyrings and authenticated filesystems and is designed around this
use-case. Imho, that is bound to fall short of requirements and
use-cases that have been piling up over the years.
If we want to make kcontainers a thing we need to have a separate
discussion and a separate patchset that is *solely* concerned with
creating a kcontainer api. And frankly, that is likely going to take a
long time.
At this point containers have become a real "thing" on Linux - like it
or not. So justifying making them in-kernel citizens doesn't need
the detour over keyrings or something else. We should just discuss
whether we think that the benefits of kcontainers (e.g. security)
outweigh the costs (e.g. maintenance).

/me puts on runtime maintainer hat:
One thing that is true is that userspace containers (let's call them
"ucontainers") as implemented by runtimes today will not go away. We
have been living with this ad-hoc concept and its various
implementations on upstream Linux at least since 2008. And kernels
without kcontainers will be with us until the end of (Linux)time
probably. So anyone who thinks that kcontainers will replace ucontainers
and that'll be it will be thoroughly disappointed in the end.
It is also very likely that not all use-cases we can currently cover
with ucontainers can be covered by kcontainers. Now that might be ok but
if we ever introduce kcontainers through a proper kernel api we will end
up maintaining ucontainers and kcontainers simultaneously. That's a
burden we shouldn't underestimate.


* Re: [RFC PATCH 02/27] containers: Implement containers as kernel objects
  2019-02-19 23:03   ` David Howells
@ 2019-02-20 14:23     ` Trond Myklebust
  0 siblings, 0 replies; 61+ messages in thread
From: Trond Myklebust @ 2019-02-20 14:23 UTC (permalink / raw)
  To: dhowells
  Cc: linux-kernel, keyrings, linux-nfs, linux-cifs,
	linux-security-module, containers, rgb, linux-fsdevel, sfrench

On Tue, 2019-02-19 at 23:03 +0000, David Howells wrote:
> Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> > Do we really need a new system call to set up containers? That would
> > force changes to all existing orchestration software.
> 
> No, it wouldn't.  Nothing in my patches forces existing orchestration
> software to change, unless it wants to use the new facilities - then it
> would have to be changed anyway, right?  I will grant, though, that the
> extent of the change might vary.

Right. It depends on what you want the orchestrator to do. If you
want it to manage authenticated storage for you, then I grant that you
may need to change the existing orchestrator. However if you just want
the containerised software to be able to manage AFS/CIFS/... keys for
its own processes, then it's not obvious to me why you would need a new
orchestrator.

> > Given that the main thing we want to achieve is to direct messages from
> > the kernel to an appropriate handler, why not focus on adding
> > functionality to do just that?
> 
> Because it's *not* just that that is added here.  There are a number of
> things this patchset (and one it depends on) provides:
> 
>  (1) The ability to intercept request_key() upcalls that happen inside a
>      container, filtered by operative namespace.

The requirement that you need to filter derives from the fact that the
kernel is being forced to run an untrusted executable in user space.
That may be acceptable when running in an uncontainerised environment,
where the executable can be vetted by the sysadmin, but it clearly
isn't in an environment where containers can be set up by untrusted
users.

If we replace the executable with a daemon that is started from inside
the container (presumably by the init process running there), then
there should be no requirement for the orchestrator to filter.

>  (2) The ability to provide a per-container keyring that can hold keys that
>      can be used inside the container without any action on behalf of the
>      denizens of the container.

Keyrings already define some inheritance semantics for child processes.
Why can't we tweak those semantics to do what is needed?

IOW: instead of adding a container syscall and a new keyring type, why
can't we just define the required keyring type and let it be inherited
through the existing clone() mechanism?
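
As a point of reference for the inheritance semantics mentioned above, here
is a minimal sketch (not from the patchset) showing that a fork()ed child
resolves the same session keyring as its parent. It calls the raw keyctl(2)
syscall with KEYCTL_GET_KEYRING_ID from <linux/keyctl.h>; the helper names
session_keyring_id() and child_inherits_session_keyring() are made up for
illustration, and the function returns -1 where keyctl() is unavailable
(for example when a seccomp profile blocks it).

```c
#define _GNU_SOURCE
#include <linux/keyctl.h>   /* KEYCTL_GET_KEYRING_ID, KEY_SPEC_SESSION_KEYRING */
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel for the ID of the caller's session keyring. */
long session_keyring_id(void)
{
	return syscall(SYS_keyctl, (long)KEYCTL_GET_KEYRING_ID,
		       (long)KEY_SPEC_SESSION_KEYRING, 0L /* do not create */);
}

/*
 * Hypothetical helper: returns 1 if a fork()ed child sees the same session
 * keyring ID as its parent, 0 if it sees a different one, and -1 if the
 * check could not be carried out (keyctl unavailable, fork failed, ...).
 */
int child_inherits_session_keyring(void)
{
	long parent_id = session_keyring_id();
	pid_t pid;
	int status;

	if (parent_id < 0)
		return -1;

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(session_keyring_id() == parent_id ? 0 : 1);

	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status) == 0;
}
```

On a kernel with CONFIG_KEYS, calling child_inherits_session_keyring() from
main() should return 1. This only demonstrates the existing fork-time
inheritance; it says nothing about what a new keyring type would look like.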

>  (3) The ability to grant permissions to a *container* as a subject, allowing
>      it and its denizens to use, but not necessarily read, modify, link or
>      invalidate a key.

Again, this sounds like a child process keyring inheritance issue.
Right now, the session keyring does not appear to match the semantics
that you describe, but why couldn't we set up a new keyring type that
can provide them?

>  (4) The ability to create superblocks inside a container with a separate
>      mount namespace from outside, such that they can use the container keys,
>      thereby allowing the root of a container to be on an authenticated
>      filesystem.
> 

I'm not sure that I understand the premise. If the orchestrator is
setting up and managing that authenticated root filesystem, then why do
the containerised processes need to be involved at all?

If, OTOH, the intention is to allow the containerised processes to
manage the filesystems without knowledge of the keyring contents, then
again isn't that really the same issue as (3)?

> > Is there any reason why a syscall to allow an appropriately privileged
> > process to add a keyring-specific message queue to its own
> > user_namespace and obtain a file descriptor to that message queue might
> > not work?
> 
> Yes.  That forces the use of a new user_namespace for every container in which
> you want to use any of the above features.  The user_namespace is already way
> too big and intrusive a hammer as it is.

No. I would need a user_namespace if I want to allow child processes to
handle request upcalls. Is that unreasonable?

> > With such an implementation, the fallback mechanism could be to walk
> > back up the hierarchy of user_namespaces until a message queue is
> > found, and to invoke the existing request_key mechanism if not.
> 
> That's definitely wrong.  /sbin/request-key should *not* be spawned if the key
> to be instantiated is not in all the init namespaces.
> 
> I went with a container object with namespaces for a reason: initially, it was
> so that the upcall could take place inside of the container's namespaces, but
> now it's so that any request that doesn't match the namespaces on the
> container gets rejected at the boundary - so that some daemon up the chain
> doesn't try servicing a request for which it can't access the config data or
> would end up talking out of the wrong NIC.
> 
> I can drop the container object part of it for the moment.
> 
> I could instead create 1-3 new namespaces:
> 
>  (1) A namespace with an upcall-interception point.
> 
>  (2) A namespace with a container keyring.
> 
>  (3) A namespace with a subject ID for use in key ACLs.
> 
> I think I should also consider adding:
> 
>  (4) A namespace with keyring names in it.  I'm leaning towards this not being
>      part of user_namespace because these probably should not be visible
>      between containers.
> 
> David
-- 
Trond Myklebust
CTO, Hammerspace Inc
4300 El Camino Real, Suite 105
Los Altos, CA 94022
www.hammer.space
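
The two routing ideas debated in the message above (walking back up the
user_namespace hierarchy until a message queue is found, versus rejecting a
request at the boundary when the namespaces no longer match) can be put side
by side in a toy model. This is only a sketch under stated assumptions:
struct ns_node, its fields, the single net_ns_id tag and route_upcall() are
all hypothetical and are not the kernel's struct user_namespace or anything
in the patchset.

```c
#include <stddef.h>

/* Hypothetical stand-in for one level of the namespace hierarchy. */
struct ns_node {
	struct ns_node *parent; /* NULL for the initial namespace */
	int has_queue;          /* non-zero if a handler registered a queue here */
	int net_ns_id;          /* illustrative tag for the network namespace */
};

enum upcall_route {
	ROUTE_QUEUE,            /* deliver to a message queue found on the walk */
	ROUTE_LEGACY_HELPER,    /* fall back to spawning /sbin/request-key */
	ROUTE_REJECT            /* namespaces stopped matching: refuse the request */
};

/*
 * Walk from the requesting namespace towards the root.  A queue is only
 * used while every level visited still shares the requester's network
 * namespace tag; the legacy helper is only reached if the whole chain up
 * to the initial namespace matched.
 */
enum upcall_route route_upcall(const struct ns_node *origin)
{
	const struct ns_node *ns;

	for (ns = origin; ns; ns = ns->parent) {
		if (ns->net_ns_id != origin->net_ns_id)
			return ROUTE_REJECT;
		if (ns->has_queue)
			return ROUTE_QUEUE;
	}
	return ROUTE_LEGACY_HELPER;
}
```

In this model a request from a namespace that shares init's network tag and
finds no queue ends up at the legacy helper, while one from a namespace with
a different tag is rejected as soon as the walk leaves it. Dropping the tag
comparison gives the plain walk-up fallback.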




* Re: [RFC PATCH 00/27] Containers and using authenticated filesystems
  2019-02-19 23:42 ` David Howells
  2019-02-20  7:00   ` Paul Moore
@ 2019-02-20 18:54   ` Steve French
  1 sibling, 0 replies; 61+ messages in thread
From: Steve French @ 2019-02-20 18:54 UTC (permalink / raw)
  To: David Howells
  Cc: Eric W. Biederman, keyrings, trond.myklebust, Steve French,
	linux-security-module, linux-nfs, CIFS, linux-fsdevel, rgb, LKML,
	Linux Containers, Linux API, samba-technical

On Tue, Feb 19, 2019 at 5:42 PM David Howells <dhowells@redhat.com> wrote:
>
> Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> > So you missed the main mailing lists for discussion of this kind of
> > thing
>
> Yeah, sorry about that.  I was primarily aiming it at Trond and Steve as I'd
> like to consider how to go about interpolating request_key() into NFS and CIFS
> so that they can make use of the key-related facilities that this makes
> available with AFS.

I am interested in this discussion because I have gotten various questions
about using containers better on SMB3 mounts, and the question of
doing request_key better comes up **a lot** on SMB3 mounts (not just
for Kerberos, Active Directory), and the usability of some of the
cifs-utils that cifs.ko depends on could be improved.

Note that various virtualization/container identity features were added to the
protocol a few years ago (which we don't yet implement in Linux), and it would
probably be **very** useful to follow up on how these could be exposed
to help containers on network mounts in Linux.  See in particular this
new protocol feature (implemented by various servers including Windows
but not by the Linux client yet) described in the protocol spec (MS-SMB2 section
2.2.9.2.1) - the "SMB2_REMOTED_IDENTITY_TREE_CONNECT context"
which can be sent at mount time:
https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-smb2/ee7ff411-93e0-484f-9f73-31916fee4cb8

This may be of interest to Samba server developers as well.

> > and the maintainer.
>
> That would be me.  I maintain keyrings.


-- 
Thanks,

Steve


* Re: [RFC PATCH 02/27] containers: Implement containers as kernel objects
  2019-02-20 13:26     ` Christian Brauner
@ 2019-02-21 10:39       ` Ian Kent
  0 siblings, 0 replies; 61+ messages in thread
From: Ian Kent @ 2019-02-21 10:39 UTC (permalink / raw)
  To: Christian Brauner
  Cc: David Howells, keyrings, trond.myklebust, sfrench,
	James Bottomley, linux-cifs, linux-nfs, containers, linux-kernel,
	linux-security-module, linux-fsdevel, cgroups

On Wed, 2019-02-20 at 14:26 +0100, Christian Brauner wrote:
> On Wed, Feb 20, 2019 at 10:46:24AM +0800, Ian Kent wrote:
> > On Fri, 2019-02-15 at 16:07 +0000, David Howells wrote:
> > > Implement a kernel container object such that it contains the following
> > > things:
> > > 
> > >  (1) Namespaces.
> > > 
> > >  (2) A root directory.
> > > 
> > >  (3) A set of processes, including one designated as the 'init' process.
> > 
> > Yeah, I think a name other than init needs to be used for this
> > process.
> > 
> > The problem being that there is no requirement for container
> > process 1 to behave in any way like an "init" process is
> > expected to behave and that leads to confusion (at least
> > it certainly did for me).
> 
> If you look at the documentation for pid namespaces(7) you can see that
> the pid 1 inside a pid namespace is expected to behave like an init
> process:
> - "The first process created in a new namespace [...] has the PID 1,
>   and is the "init" process for the namespace (see init(1))."
> - "[...] child process that is orphaned within the namespace will be
>   reparented to this process rather than init(1) [...]"
> - "If the "init" process of a PID namespace terminates, the kernel
>   terminates all of the processes in the namespace via a SIGKILL
>   signal. This behavior reflects the fact that the "init" process is
>   essential for the correct operation of a PID namespace."
> - "Only signals for which the "init" process has established a signal
>   handler can be sent to the "init" process by other members of the
>   PID namespace."
> - "[...] the reboot(2) system call causes a signal to be sent to the
>   namespace "init" process."
> 
> This is one of the reasons why all major current container runtimes
> finally, after years of failing to realize this, run a stub init process
> that mimics a dumb init. Sure, you get away with not having an init
> that behaves like an init, but this is inherently broken or at least
> against the way pid namespaces were designed.

TBH I wasn't sure why the signal I sent didn't arrive, AFAICS
it should have regardless of what signals the container init
process was accepting. But it could have been due to a
different problem in my kernel code (that's very likely).

In any case it wasn't worth pursuing because even if I did work
it out I had already found that the request_key sub-system wasn't
playing well with others when trying to run something within a
container's namespaces, so no point in going further ...

Ian
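
A stub init of the kind described in the quoted text mostly has to do two
things from the pid_namespaces(7) rules listed there: install handlers for
the signals it wants to be able to receive, and reap whatever children end
up under it. The following is a minimal sketch of those two pieces in an
ordinary process. It does not create a PID namespace (that would need
something like clone() with CLONE_NEWPID, and privilege), and the function
names are illustrative only, not taken from any container runtime.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t got_term;

static void on_term(int sig)
{
	(void)sig;
	got_term = 1;
}

/*
 * Inside a PID namespace, other members can only deliver signals for which
 * "init" has installed a handler, so a stub init installs one explicitly.
 */
int install_term_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_term;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGTERM, &sa, NULL);
}

/* Reap children until none are left (ECHILD) or SIGTERM was seen. */
int reap_all_children(void)
{
	int reaped = 0;

	for (;;) {
		pid_t pid = waitpid(-1, NULL, 0);

		if (pid > 0) {
			reaped++;
			continue;
		}
		if (errno == EINTR && !got_term)
			continue;
		break;
	}
	return reaped;
}

/* Demo driver: fork n short-lived children, then reap them all. */
int spawn_and_reap(int n)
{
	for (int i = 0; i < n; i++) {
		pid_t pid = fork();

		if (pid < 0)
			break;
		if (pid == 0)
			_exit(0);
	}
	return reap_all_children();
}
```

Calling install_term_handler() and then spawn_and_reap(3) from main() returns
3 when all three forks succeed. A real stub init would instead start the
container payload as a child and stay in reap_all_children() until that
payload exits.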



end of thread, other threads:[~2019-02-21 10:39 UTC | newest]

Thread overview: 61+ messages
2019-02-15 16:07 [RFC PATCH 00/27] Containers and using authenticated filesystems David Howells
2019-02-15 16:07 ` [RFC PATCH 01/27] containers: Rename linux/container.h to linux/container_dev.h David Howells
2019-02-15 16:07 ` [RFC PATCH 02/27] containers: Implement containers as kernel objects David Howells
2019-02-17 18:57   ` Trond Myklebust
2019-02-17 19:39   ` James Bottomley
2019-02-19 16:56   ` Eric W. Biederman
2019-02-19 23:03   ` David Howells
2019-02-20 14:23     ` Trond Myklebust
2019-02-19 23:06   ` David Howells
2019-02-20  2:20     ` James Bottomley
2019-02-20  3:04       ` Ian Kent
2019-02-20  3:46         ` James Bottomley
2019-02-20  4:42           ` Ian Kent
2019-02-20  6:57           ` Paul Moore
2019-02-19 23:13   ` David Howells
2019-02-19 23:55   ` Tycho Andersen
2019-02-20  2:46   ` Ian Kent
2019-02-20 13:26     ` Christian Brauner
2019-02-21 10:39       ` Ian Kent
2019-02-15 16:07 ` [RFC PATCH 03/27] containers: Provide /proc/containers David Howells
2019-02-15 16:07 ` [RFC PATCH 04/27] containers: Allow a process to be forked into a container David Howells
2019-02-15 17:39   ` Stephen Smalley
2019-02-19 16:39   ` Eric W. Biederman
2019-02-19 23:16   ` David Howells
2019-02-15 16:07 ` [RFC PATCH 05/27] containers: Open a socket inside " David Howells
2019-02-19 16:41   ` Eric W. Biederman
2019-02-15 16:08 ` [RFC PATCH 06/27] containers, vfs: Allow syscall dirfd arguments to take a container fd David Howells
2019-02-19 16:45   ` Eric W. Biederman
2019-02-19 23:24   ` David Howells
2019-02-15 16:08 ` [RFC PATCH 07/27] containers: Make fsopen() able to create a superblock in a container David Howells
2019-02-15 16:08 ` [RFC PATCH 08/27] containers, vfs: Honour CONTAINER_NEW_EMPTY_FS_NS David Howells
2019-02-17  0:11   ` Al Viro
2019-02-15 16:08 ` [RFC PATCH 09/27] vfs: Allow mounting to other namespaces David Howells
2019-02-17  0:14   ` Al Viro
2019-02-15 16:08 ` [RFC PATCH 10/27] containers: Provide fs_context op for container setting David Howells
2019-02-15 16:09 ` [RFC PATCH 11/27] containers: Sample program for driving container objects David Howells
2019-02-15 16:09 ` [RFC PATCH 12/27] containers: Allow a daemon to intercept request_key upcalls in a container David Howells
2019-02-15 16:09 ` [RFC PATCH 13/27] keys: Provide a keyctl to query a request_key authentication key David Howells
2019-02-15 16:09 ` [RFC PATCH 14/27] keys: Break bits out of key_unlink() David Howells
2019-02-15 16:09 ` [RFC PATCH 15/27] keys: Make __key_link_begin() handle lockdep nesting David Howells
2019-02-15 16:09 ` [RFC PATCH 16/27] keys: Grant Link permission to possessers of request_key auth keys David Howells
2019-02-15 16:10 ` [RFC PATCH 17/27] keys: Add a keyctl to move a key between keyrings David Howells
2019-02-15 16:10 ` [RFC PATCH 18/27] keys: Find the least-recently used unseen key in a keyring David Howells
2019-02-15 16:10 ` [RFC PATCH 19/27] containers: Sample: request_key upcall handling David Howells
2019-02-15 16:10 ` [RFC PATCH 20/27] container, keys: Add a container keyring David Howells
2019-02-15 21:46   ` Eric Biggers
2019-02-15 16:11 ` [RFC PATCH 21/27] keys: Fix request_key() lack of Link perm check on found key David Howells
2019-02-15 16:11 ` [RFC PATCH 22/27] KEYS: Replace uid/gid/perm permissions checking with an ACL David Howells
2019-02-15 17:32   ` Stephen Smalley
2019-02-15 17:39   ` David Howells
2019-02-15 16:11 ` [RFC PATCH 23/27] KEYS: Provide KEYCTL_GRANT_PERMISSION David Howells
2019-02-15 16:11 ` [RFC PATCH 24/27] keys: Allow a container to be specified as a subject in a key's ACL David Howells
2019-02-15 16:11 ` [RFC PATCH 25/27] keys: Provide a way to ask for the container keyring David Howells
2019-02-15 16:12 ` [RFC PATCH 26/27] keys: Allow containers to be included in key ACLs by name David Howells
2019-02-15 16:12 ` [RFC PATCH 27/27] containers: Sample to grant access to a key in a container David Howells
2019-02-15 22:36 ` [RFC PATCH 00/27] Containers and using authenticated filesystems James Morris
2019-02-19 16:35 ` Eric W. Biederman
2019-02-20 14:18   ` Christian Brauner
2019-02-19 23:42 ` David Howells
2019-02-20  7:00   ` Paul Moore
2019-02-20 18:54   ` Steve French
