Linux-Fsdevel Archive on lore.kernel.org
help / color / mirror / Atom feed
* User-visible context-mount API
@ 2018-01-15 16:07 David Howells
  2018-01-15 17:31 ` Eric W. Biederman
                   ` (5 more replies)
  0 siblings, 6 replies; 17+ messages in thread
From: David Howells @ 2018-01-15 16:07 UTC (permalink / raw)
  To: mszeredi, viro, jlayton, ebiederm; +Cc: dhowells, linux-fsdevel

I've been looking at the context-mount API visible to userspace as I need to
adjust the security ops to handle it.  I'm thinking I probably need something
like the following system calls.  Note that:

 topology_flags are MS_PRIVATE, MS_SLAVE, MS_SHARED, MS_UNBINDABLE.

 mount_flags are things like MS_NOSUID, MS_NODEV, MS_NOEXEC that get
 translated to MNT_* flags in the kernel.

 (1) Open a filesystem and create a blank context from it:

	fd = fsopen(const char *fs_name, unsigned int flags, ...);

     where flags includes FSOPEN_CLOEXEC, FSOPEN_CREATE_ONLY (don't reuse
     superblock).

 (2) Access and change the context:

	write(fd, "<command>", ...);
	read(fd, ...);
	ioctl(fd, ...);

 (3) Create and set up a context for an existing mountpoint:

	fd = fspick(int dfd, const char *path, unsigned int flags);

     where flags includes FSPICK_CLOEXEC.

 (4) Create a mountpoint on a path, using a context to supply the superblock
     details:

	mount_create(int fd, int dfd, const char *path,
		     unsigned int topology_flags,
		     unsigned int mount_flags);

 (5) Move a mount:

	mount_move(int from_dfd, const char *frompath,
		   int to_dfd, const char *topath);

     This might want to take new topology flags algo.

 (6) Adjust a mountpoint's topology flags:

	mount_set_topology(int dfd, const char *path,
			   unsigned int topology_flags);

 (7) Reconfigure a mountpoint:

	mount_reconfigure(int dfd, const char *path,
			  unsigned int mount_flags);

 (8) Change R/O protection on a mountpoint:

	mount_protect(int dfd, const char *path,
		      bool read_only);

     This involves changing the R/O protection on the superblock also, but
     might be mergeable with mount_reconfigure().

Note that two things are missing from the list:

 (1) Bind mount.  This is done by:

	fd = fspick("/mnt/a");
	mount_create(fd, ..., "/mnt/b", ...);
	mount_create(fd, ..., "/mnt/c", ...);
	mount_create(fd, ..., "/mnt/d", ...);

 (2) Remount.  Superblock reconfiguration is done by something like:

	fd = fspick("/mnt/a");
	write(fd, "? fs");
	read(fd, filesystem_type);
	write(fd, "o user_xattr"); // Indicate changes to be made
	write(fd, "x reconfigure"); // Perform the reconfiguration

Thinking further on this, maybe I should make a mountpoint-context also, so
that it can be loaded up with target namespace information and other goodies.
This would vastly expand the parameter space for a mountpoint beyond the few
syscall args available.  Creating a new mount might then look like:

	sbfd = fsopen("ext4");
	write(sbfd, "d /dev/sda1");
	write(sbfd, "o user_xattr");
	write(sbfd, "x commit");

	mfd = mount_new();
	write(mfd, "ns mnt 123"); // where fd 123 refers to a mount namesapce
	write(mfd, "o bind=1"); // Set MS_BIND
	write(mfd, "o nosuid=1"); // Set MS_NOSUID

	mount_create(mfd, AT_FDCWD, "/mnt/a", sbfd);

David

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: User-visible context-mount API
  2018-01-15 16:07 User-visible context-mount API David Howells
@ 2018-01-15 17:31 ` Eric W. Biederman
  2018-01-15 17:32 ` Eric W. Biederman
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 17+ messages in thread
From: Eric W. Biederman @ 2018-01-15 17:31 UTC (permalink / raw)
  To: David Howells; +Cc: mszeredi, viro, jlayton, ebiederm, linux-fsdevel



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: User-visible context-mount API
  2018-01-15 16:07 User-visible context-mount API David Howells
  2018-01-15 17:31 ` Eric W. Biederman
@ 2018-01-15 17:32 ` Eric W. Biederman
  2018-01-16  9:01 ` Miklos Szeredi
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 17+ messages in thread
From: Eric W. Biederman @ 2018-01-15 17:32 UTC (permalink / raw)
  To: David Howells; +Cc: mszeredi, viro, jlayton, ebiederm, linux-fsdevel

David Howells <dhowells@redhat.com> writes:

> I've been looking at the context-mount API visible to userspace as I need to
> adjust the security ops to handle it.  I'm thinking I probably need something
> like the following system calls.  Note that:

Is there supposed to be a question in this?
Are you asking for review of your basic design?

Eric

>  topology_flags are MS_PRIVATE, MS_SLAVE, MS_SHARED, MS_UNBINDABLE.
>
>  mount_flags are things like MS_NOSUID, MS_NODEV, MS_NOEXEC that get
>  translated to MNT_* flags in the kernel.
>
>  (1) Open a filesystem and create a blank context from it:
>
> 	fd = fsopen(const char *fs_name, unsigned int flags, ...);
>
>      where flags includes FSOPEN_CLOEXEC, FSOPEN_CREATE_ONLY (don't reuse
>      superblock).
>
>  (2) Access and change the context:
>
> 	write(fd, "<command>", ...);
> 	read(fd, ...);
> 	ioctl(fd, ...);
>
>  (3) Create and set up a context for an existing mountpoint:
>
> 	fd = fspick(int dfd, const char *path, unsigned int flags);
>
>      where flags includes FSPICK_CLOEXEC.
>
>  (4) Create a mountpoint on a path, using a context to supply the superblock
>      details:
>
> 	mount_create(int fd, int dfd, const char *path,
> 		     unsigned int topology_flags,
> 		     unsigned int mount_flags);
>
>  (5) Move a mount:
>
> 	mount_move(int from_dfd, const char *frompath,
> 		   int to_dfd, const char *topath);
>
>      This might want to take new topology flags algo.
>
>  (6) Adjust a mountpoint's topology flags:
>
> 	mount_set_topology(int dfd, const char *path,
> 			   unsigned int topology_flags);
>
>  (7) Reconfigure a mountpoint:
>
> 	mount_reconfigure(int dfd, const char *path,
> 			  unsigned int mount_flags);
>
>  (8) Change R/O protection on a mountpoint:
>
> 	mount_protect(int dfd, const char *path,
> 		      bool read_only);
>
>      This involves changing the R/O protection on the superblock also, but
>      might be mergeable with mount_reconfigure().
>
> Note that two things are missing from the list:
>
>  (1) Bind mount.  This is done by:
>
> 	fd = fspick("/mnt/a");
> 	mount_create(fd, ..., "/mnt/b", ...);
> 	mount_create(fd, ..., "/mnt/c", ...);
> 	mount_create(fd, ..., "/mnt/d", ...);
>
>  (2) Remount.  Superblock reconfiguration is done by something like:
>
> 	fd = fspick("/mnt/a");
> 	write(fd, "? fs");
> 	read(fd, filesystem_type);
> 	write(fd, "o user_xattr"); // Indicate changes to be made
> 	write(fd, "x reconfigure"); // Perform the reconfiguration
>
> Thinking further on this, maybe I should make a mountpoint-context also, so
> that it can be loaded up with target namespace information and other goodies.
> This would vastly expand the parameter space for a mountpoint beyond the few
> syscall args available.  Creating a new mount might then look like:
>
> 	sbfd = fsopen("ext4");
> 	write(sbfd, "d /dev/sda1");
> 	write(sbfd, "o user_xattr");
> 	write(sbfd, "x commit");
>
> 	mfd = mount_new();
> 	write(mfd, "ns mnt 123"); // where fd 123 refers to a mount namesapce
> 	write(mfd, "o bind=1"); // Set MS_BIND
> 	write(mfd, "o nosuid=1"); // Set MS_NOSUID
>
> 	mount_create(mfd, AT_FDCWD, "/mnt/a", sbfd);
>
> David

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: User-visible context-mount API
  2018-01-15 16:07 User-visible context-mount API David Howells
  2018-01-15 17:31 ` Eric W. Biederman
  2018-01-15 17:32 ` Eric W. Biederman
@ 2018-01-16  9:01 ` Miklos Szeredi
  2018-01-16 10:10 ` David Howells
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 17+ messages in thread
From: Miklos Szeredi @ 2018-01-16  9:01 UTC (permalink / raw)
  To: David Howells
  Cc: viro, Jeff Layton, Eric W. Biederman, linux-fsdevel, Linux API

[Adding linux-api@vger]

On Mon, Jan 15, 2018 at 5:07 PM, David Howells <dhowells@redhat.com> wrote:
> I've been looking at the context-mount API visible to userspace as I need to
> adjust the security ops to handle it.  I'm thinking I probably need something
> like the following system calls.  Note that:
>
>  topology_flags are MS_PRIVATE, MS_SLAVE, MS_SHARED, MS_UNBINDABLE.
>
>  mount_flags are things like MS_NOSUID, MS_NODEV, MS_NOEXEC that get
>  translated to MNT_* flags in the kernel.
>
>  (1) Open a filesystem and create a blank context from it:
>
>         fd = fsopen(const char *fs_name, unsigned int flags, ...);
>
>      where flags includes FSOPEN_CLOEXEC, FSOPEN_CREATE_ONLY (don't reuse
>      superblock).
>
>  (2) Access and change the context:
>
>         write(fd, "<command>", ...);
>         read(fd, ...);
>         ioctl(fd, ...);
>
>  (3) Create and set up a context for an existing mountpoint:
>
>         fd = fspick(int dfd, const char *path, unsigned int flags);
>
>      where flags includes FSPICK_CLOEXEC.
>
>  (4) Create a mountpoint on a path, using a context to supply the superblock
>      details:
>
>         mount_create(int fd, int dfd, const char *path,
>                      unsigned int topology_flags,
>                      unsigned int mount_flags);
>
>  (5) Move a mount:
>
>         mount_move(int from_dfd, const char *frompath,
>                    int to_dfd, const char *topath);
>
>      This might want to take new topology flags algo.
>
>  (6) Adjust a mountpoint's topology flags:
>
>         mount_set_topology(int dfd, const char *path,
>                            unsigned int topology_flags);
>
>  (7) Reconfigure a mountpoint:
>
>         mount_reconfigure(int dfd, const char *path,
>                           unsigned int mount_flags);


What's the fundamental  difference between topology flags and other
flags?  Why two syscalls?

Also I think we need a "mask" argument telling the kernel which flags
need to be changed.

>
>  (8) Change R/O protection on a mountpoint:
>
>         mount_protect(int dfd, const char *path,
>                       bool read_only);
>
>      This involves changing the R/O protection on the superblock also, but
>      might be mergeable with mount_reconfigure().

Methinks this should be merged with mount_reconfigure(), and if
superblock state needs to be changed, than that should be done with
the "remount" procedure below.


> Note that two things are missing from the list:
>
>  (1) Bind mount.  This is done by:
>
>         fd = fspick("/mnt/a");
>         mount_create(fd, ..., "/mnt/b", ...);
>         mount_create(fd, ..., "/mnt/c", ...);
>         mount_create(fd, ..., "/mnt/d", ...);
>
>  (2) Remount.  Superblock reconfiguration is done by something like:
>
>         fd = fspick("/mnt/a");
>         write(fd, "? fs");
>         read(fd, filesystem_type);
>         write(fd, "o user_xattr"); // Indicate changes to be made
>         write(fd, "x reconfigure"); // Perform the reconfiguration
>
> Thinking further on this, maybe I should make a mountpoint-context also, so
> that it can be loaded up with target namespace information and other goodies.
> This would vastly expand the parameter space for a mountpoint beyond the few
> syscall args available.  Creating a new mount might then look like:
>
>         sbfd = fsopen("ext4");
>         write(sbfd, "d /dev/sda1");
>         write(sbfd, "o user_xattr");
>         write(sbfd, "x commit");
>
>         mfd = mount_new();
>         write(mfd, "ns mnt 123"); // where fd 123 refers to a mount namesapce
>         write(mfd, "o bind=1"); // Set MS_BIND

What does MS_BIND mean here?

>         write(mfd, "o nosuid=1"); // Set MS_NOSUID
>
>         mount_create(mfd, AT_FDCWD, "/mnt/a", sbfd);

Yeah, more flexible, but also more complicated, with mount_create()
now taking 3 file descriptors, ugh...

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: User-visible context-mount API
  2018-01-15 16:07 User-visible context-mount API David Howells
                   ` (2 preceding siblings ...)
  2018-01-16  9:01 ` Miklos Szeredi
@ 2018-01-16 10:10 ` David Howells
  2018-01-16 10:35   ` Miklos Szeredi
                     ` (2 more replies)
  2018-01-16 14:55 ` David Howells
  2018-01-16 15:40 ` David Howells
  5 siblings, 3 replies; 17+ messages in thread
From: David Howells @ 2018-01-16 10:10 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dhowells, viro, Jeff Layton, Eric W. Biederman, linux-fsdevel, Linux API

Miklos Szeredi <mszeredi@redhat.com> wrote:

> >  (6) Adjust a mountpoint's topology flags:
> >
> >         mount_set_topology(int dfd, const char *path,
> >                            unsigned int topology_flags);
> >
> >  (7) Reconfigure a mountpoint:
> >
> >         mount_reconfigure(int dfd, const char *path,
> >                           unsigned int mount_flags);
> 
> 
> What's the fundamental  difference between topology flags and other
> flags?  Why two syscalls?

Inside the kernel the MS_* flags appear to belong to a number of fundamentally
different classes:

 (1) Things like MS_SILENT and MS_REMOUNT which affect the behaviour of the
     mount process, but aren't persistent beyond that.

 (2) Inter-namespace topology management, controlling how mounts are shared
     and duplicated between namespaces.

 (3) Restrictions on accesses through a particular mountpoint, eg. MS_NODEV,
     MS_NOEXEC.

 (4) Instructions to a filesystem on how a superblock is to behave.

I think the classes are fundamentally different - and we've already separated
(4) from the others inside the kernel.  However, I've no great objection to
keeping (2) and (3) together in the same mask.  It just sounds cleaner to
separate them.  Do we foresee adding any extra flags to these classes?

> Also I think we need a "mask" argument telling the kernel which flags
> need to be changed.

Sounds reasonable.

> >  (8) Change R/O protection on a mountpoint:
> >
> >         mount_protect(int dfd, const char *path,
> >                       bool read_only);
> >
> >      This involves changing the R/O protection on the superblock also, but
> >      might be mergeable with mount_reconfigure().
> 
> Methinks this should be merged with mount_reconfigure(), and if
> superblock state needs to be changed, than that should be done with
> the "remount" procedure below.

Maybe - the problem is that it's harder to manage if you've got multiple
mounts attached to a single superblock as you can only change the superblock
state if all the mounts are R/O.

> >         write(mfd, "o bind=1"); // Set MS_BIND
> 
> What does MS_BIND mean here?

Sorry, bad example; MS_BIND wouldn't be allowed there.  Consider the following
instead:

     write(mfd, "o nodev=1"); // Set 'MS_NODEV'

> >         write(mfd, "o nosuid=1"); // Set MS_NOSUID
> >
> >         mount_create(mfd, AT_FDCWD, "/mnt/a", sbfd);
> 
> Yeah, more flexible, but also more complicated, with mount_create()
> now taking 3 file descriptors, ugh...

Yeah, I know:-/ ... but there are more parameters that I can foresee adding
(such as [ug]id mapping tables), and a syscall just doesn't have enough
argument space.  Also, I think that we need to set all the parameters on a
mountpoint at the time of creation and that doing this retroactively isn't a
good idea, since it's live as soon as it's created.

David

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: User-visible context-mount API
  2018-01-16 10:10 ` David Howells
@ 2018-01-16 10:35   ` Miklos Szeredi
  2018-01-16 14:18   ` David Howells
  2018-01-17 10:43   ` Karel Zak
  2 siblings, 0 replies; 17+ messages in thread
From: Miklos Szeredi @ 2018-01-16 10:35 UTC (permalink / raw)
  To: David Howells
  Cc: viro, Jeff Layton, Eric W. Biederman, linux-fsdevel, Linux API

On Tue, Jan 16, 2018 at 11:10 AM, David Howells <dhowells@redhat.com> wrote:

>> >  (8) Change R/O protection on a mountpoint:
>> >
>> >         mount_protect(int dfd, const char *path,
>> >                       bool read_only);
>> >
>> >      This involves changing the R/O protection on the superblock also, but
>> >      might be mergeable with mount_reconfigure().
>>
>> Methinks this should be merged with mount_reconfigure(), and if
>> superblock state needs to be changed, than that should be done with
>> the "remount" procedure below.
>
> Maybe - the problem is that it's harder to manage if you've got multiple
> mounts attached to a single superblock as you can only change the superblock
> state if all the mounts are R/O.

Not true:

 [root@veci ~]# cat /proc/self/mountinfo | grep 46
212 78 0:46 / /tmp/1 rw,relatime shared:153 - tmpfs tmpfs rw,seclabel
219 78 0:46 / /tmp/2 rw,relatime shared:153 - tmpfs tmpfs rw,seclabel
[root@veci ~]# mount -oremount,ro /tmp/1
[root@veci ~]# cat /proc/self/mountinfo | grep 46
212 78 0:46 / /tmp/1 ro,relatime shared:153 - tmpfs tmpfs ro,seclabel
219 78 0:46 / /tmp/2 rw,relatime shared:153 - tmpfs tmpfs ro,seclabel
[root@veci ~]# mount -oremount,bind,rw /tmp/1
[root@veci ~]# cat /proc/self/mountinfo | grep 46
212 78 0:46 / /tmp/1 rw,relatime shared:153 - tmpfs tmpfs ro,seclabel
219 78 0:46 / /tmp/2 rw,relatime shared:153 - tmpfs tmpfs ro,seclabel
[root@veci ~]#

R/O flag on mount and on superblock can be changed independently.

>
>> >         write(mfd, "o bind=1"); // Set MS_BIND
>>
>> What does MS_BIND mean here?
>
> Sorry, bad example; MS_BIND wouldn't be allowed there.  Consider the following
> instead:
>
>      write(mfd, "o nodev=1"); // Set 'MS_NODEV'
>
>> >         write(mfd, "o nosuid=1"); // Set MS_NOSUID
>> >
>> >         mount_create(mfd, AT_FDCWD, "/mnt/a", sbfd);
>>
>> Yeah, more flexible, but also more complicated, with mount_create()
>> now taking 3 file descriptors, ugh...
>
> Yeah, I know:-/ ... but there are more parameters that I can foresee adding
> (such as [ug]id mapping tables), and a syscall just doesn't have enough
> argument space.

Okay.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: User-visible context-mount API
  2018-01-16 10:10 ` David Howells
  2018-01-16 10:35   ` Miklos Szeredi
@ 2018-01-16 14:18   ` David Howells
  2018-01-17 10:43   ` Karel Zak
  2 siblings, 0 replies; 17+ messages in thread
From: David Howells @ 2018-01-16 14:18 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dhowells, viro, Jeff Layton, Eric W. Biederman, linux-fsdevel, Linux API

Miklos Szeredi <mszeredi@redhat.com> wrote:

> > Maybe - the problem is that it's harder to manage if you've got multiple
> > mounts attached to a single superblock as you can only change the superblock
> > state if all the mounts are R/O.
> 
> Not true:
> 
>  [root@veci ~]# cat /proc/self/mountinfo | grep 46
> 212 78 0:46 / /tmp/1 rw,relatime shared:153 - tmpfs tmpfs rw,seclabel
> 219 78 0:46 / /tmp/2 rw,relatime shared:153 - tmpfs tmpfs rw,seclabel
> [root@veci ~]# mount -oremount,ro /tmp/1
> [root@veci ~]# cat /proc/self/mountinfo | grep 46
> 212 78 0:46 / /tmp/1 ro,relatime shared:153 - tmpfs tmpfs ro,seclabel
> 219 78 0:46 / /tmp/2 rw,relatime shared:153 - tmpfs tmpfs ro,seclabel
> [root@veci ~]# mount -oremount,bind,rw /tmp/1
> [root@veci ~]# cat /proc/self/mountinfo | grep 46
> 212 78 0:46 / /tmp/1 rw,relatime shared:153 - tmpfs tmpfs ro,seclabel
> 219 78 0:46 / /tmp/2 rw,relatime shared:153 - tmpfs tmpfs ro,seclabel
> [root@veci ~]#
> 
> R/O flag on mount and on superblock can be changed independently.

Ah...  I misread the code.  I was looking at the check that there are no
writers on a mount.

David

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: User-visible context-mount API
  2018-01-15 16:07 User-visible context-mount API David Howells
                   ` (3 preceding siblings ...)
  2018-01-16 10:10 ` David Howells
@ 2018-01-16 14:55 ` David Howells
  2018-01-16 15:40 ` David Howells
  5 siblings, 0 replies; 17+ messages in thread
From: David Howells @ 2018-01-16 14:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: dhowells, mszeredi, viro, jlayton, ebiederm, linux-fsdevel

Eric W. Biederman <ebiederm@xmission.com> wrote:

> > I've been looking at the context-mount API visible to userspace as I need
> > to adjust the security ops to handle it.  I'm thinking I probably need
> > something like the following system calls.  Note that:
> 
> Is there supposed to be a question in this?
> Are you asking for review of your basic design?

Yeah - mainly from Miklós and Al, but anyone who cares to weigh in...

David

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: User-visible context-mount API
  2018-01-15 16:07 User-visible context-mount API David Howells
                   ` (4 preceding siblings ...)
  2018-01-16 14:55 ` David Howells
@ 2018-01-16 15:40 ` David Howells
  2018-01-16 16:41   ` Miklos Szeredi
  5 siblings, 1 reply; 17+ messages in thread
From: David Howells @ 2018-01-16 15:40 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dhowells, viro, Jeff Layton, Eric W. Biederman, linux-fsdevel, Linux API

Miklos Szeredi <mszeredi@redhat.com> wrote:

> >  (6) Adjust a mountpoint's topology flags:
> >
> >         mount_set_topology(int dfd, const char *path,
> >                            unsigned int topology_flags);
> >
> >  (7) Reconfigure a mountpoint:
> >
> >         mount_reconfigure(int dfd, const char *path,
> >                           unsigned int mount_flags);
> 
> 
> What's the fundamental  difference between topology flags and other
> flags?  Why two syscalls?

Actually, if you look at the do_mount() function, these *are* separate
things.  If you look at the code:

	if (flags & MS_REMOUNT)
		retval = do_remount(&path, flags, sb_flags, mnt_flags,
				    data_page);

^^^ that is mount_reconfigure() and superblock reconfiguration.

	else if (flags & MS_BIND)
		retval = do_loopback(&path, dev_name, flags & MS_REC);
	else if (flags & (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
		retval = do_change_type(&path, flags);

^^^ that is mount_set_topology().

	else if (flags & MS_MOVE)
		retval = do_move_mount(&path, dev_name);
	else
		retval = do_new_mount(&path, type_page, sb_flags, mnt_flags,
				      dev_name, data_page);

The mount() syscall is actually a function switch with five functions.

David

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: User-visible context-mount API
  2018-01-16 15:40 ` David Howells
@ 2018-01-16 16:41   ` Miklos Szeredi
  2018-01-17  4:17     ` Al Viro
  0 siblings, 1 reply; 17+ messages in thread
From: Miklos Szeredi @ 2018-01-16 16:41 UTC (permalink / raw)
  To: David Howells
  Cc: viro, Jeff Layton, Eric W. Biederman, linux-fsdevel, Linux API

On Tue, Jan 16, 2018 at 4:40 PM, David Howells <dhowells@redhat.com> wrote:
> Miklos Szeredi <mszeredi@redhat.com> wrote:
>
>> >  (6) Adjust a mountpoint's topology flags:
>> >
>> >         mount_set_topology(int dfd, const char *path,
>> >                            unsigned int topology_flags);
>> >
>> >  (7) Reconfigure a mountpoint:
>> >
>> >         mount_reconfigure(int dfd, const char *path,
>> >                           unsigned int mount_flags);
>>
>>
>> What's the fundamental  difference between topology flags and other
>> flags?  Why two syscalls?
>
> Actually, if you look at the do_mount() function, these *are* separate
> things.  If you look at the code:
>
>         if (flags & MS_REMOUNT)
>                 retval = do_remount(&path, flags, sb_flags, mnt_flags,
>                                     data_page);
>
> ^^^ that is mount_reconfigure() and superblock reconfiguration.
>
>         else if (flags & MS_BIND)
>                 retval = do_loopback(&path, dev_name, flags & MS_REC);
>         else if (flags & (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
>                 retval = do_change_type(&path, flags);
>
> ^^^ that is mount_set_topology().
>
>         else if (flags & MS_MOVE)
>                 retval = do_move_mount(&path, dev_name);
>         else
>                 retval = do_new_mount(&path, type_page, sb_flags, mnt_flags,
>                                       dev_name, data_page);
>
> The mount() syscall is actually a function switch with five functions.

Right.

Still, those two (propagation and flags) are properties of the mount.
No fundamental difference in how to handle them, that I see.  Okay, we
have MS_REC handling in the propagation and not in the flags, but
that's something that might make sense for flags as well.

What's more interesting is how MS_PRIVATE + MS_REC semantics are
complete failure in the real world: the logical thing would be to mark
a mount private on the supplied mount AND propagate an umount event to
everywhere else.  But that's not what we do, we completely mess up
propagation that can result in millions of "leaked" mounts.  I've seen
multiple bug reports where this illogical behavior fooled people.  It
can be worked around, of course, but maybe we should fix this on the
new interface and allow a sane private setting.

Thanks,
Miklos


>
> David

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: User-visible context-mount API
  2018-01-16 16:41   ` Miklos Szeredi
@ 2018-01-17  4:17     ` Al Viro
  2018-01-17  9:53       ` Miklos Szeredi
  0 siblings, 1 reply; 17+ messages in thread
From: Al Viro @ 2018-01-17  4:17 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: David Howells, Jeff Layton, Eric W. Biederman, linux-fsdevel, Linux API

On Tue, Jan 16, 2018 at 05:41:46PM +0100, Miklos Szeredi wrote:

> Right.
> 
> Still, those two (propagation and flags) are properties of the mount.
> No fundamental difference in how to handle them, that I see.  Okay, we
> have MS_REC handling in the propagation and not in the flags, but
> that's something that might make sense for flags as well.
> 
> What's more interesting is how MS_PRIVATE + MS_REC semantics are
> complete failure in the real world: the logical thing would be to mark
> a mount private on the supplied mount AND propagate an umount event to
> everywhere else.

This is utter nonsense.  Most of the time it's "Fedora, in its infinite
bogo^Wwisdom has made everything shared; I don't fucking need that
idiocy, so please unshare this, this and that".  You really don't want
(or have permissions for) unmounting e.g. /mnt in namespace of init
when you do that.

Sure, we get tons of bug reports.  Due to idiotic Fedora setup, with
everything shared.  The same setup that would go up in flames on the
semantics change you propose.

If anything, "private bind on itself" would be a useful operation.
Turning given location into a mountpoint, and having everything
under it looking as it used to, but with no propagation at all.
Without bothering anybody else, even if location currently happens
to be on a shared/master mount.

I can slap that together for mount(2), but I'm not sure what a sane
combination of flags for that would look like ;-)  For fsmount
I think it would be very useful thing to have.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: User-visible context-mount API
  2018-01-17  4:17     ` Al Viro
@ 2018-01-17  9:53       ` Miklos Szeredi
  2018-01-17 11:06         ` Karel Zak
  2018-01-19  6:32         ` Al Viro
  0 siblings, 2 replies; 17+ messages in thread
From: Miklos Szeredi @ 2018-01-17  9:53 UTC (permalink / raw)
  To: Al Viro
  Cc: David Howells, Jeff Layton, Eric W. Biederman, linux-fsdevel,
	Linux API, util-linux, Michael Kerrisk (man-pages)

[Adding util-linux@vger and Michael Kerrisk]

On Wed, Jan 17, 2018 at 5:17 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Jan 16, 2018 at 05:41:46PM +0100, Miklos Szeredi wrote:
>
>> Right.
>>
>> Still, those two (propagation and flags) are properties of the mount.
>> No fundamental difference in how to handle them, that I see.  Okay, we
>> have MS_REC handling in the propagation and not in the flags, but
>> that's something that might make sense for flags as well.
>>
>> What's more interesting is how MS_PRIVATE + MS_REC semantics are
>> complete failure in the real world: the logical thing would be to mark
>> a mount private on the supplied mount AND propagate an umount event to
>> everywhere else.
>
> This is utter nonsense.  Most of the time it's "Fedora, in its infinite
> bogo^Wwisdom has made everything shared; I don't fucking need that
> idiocy, so please unshare this, this and that".  You really don't want
> (or have permissions for) unmounting e.g. /mnt in namespace of init
> when you do that.
>
> Sure, we get tons of bug reports.  Due to idiotic Fedora setup, with
> everything shared.  The same setup that would go up in flames on the
> semantics change you propose.

I wouldn't propose to change existing --make-private, as this would
not be backward compatible. The new semantics would mean a new op,
obviously.

Documenting  --make-private thing properly would also help.  To me the
wording "make private" strongly implies "I want to make submounts
private to this instance".  See for example rhbz#1432211.

> If anything, "private bind on itself" would be a useful operation.
> Turning given location into a mountpoint, and having everything
> under it looking as it used to, but with no propagation at all.
> Without bothering anybody else, even if location currently happens
> to be on a shared/master mount.
>
> I can slap that together for mount(2), but I'm not sure what a sane
> combination of flags for that would look like ;-)  For fsmount
> I think it would be very useful thing to have.

Yes, I think such an operation would be pretty useful.   Not sure if
it's the whole story, though.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: User-visible context-mount API
  2018-01-16 10:10 ` David Howells
  2018-01-16 10:35   ` Miklos Szeredi
  2018-01-16 14:18   ` David Howells
@ 2018-01-17 10:43   ` Karel Zak
  2 siblings, 0 replies; 17+ messages in thread
From: Karel Zak @ 2018-01-17 10:43 UTC (permalink / raw)
  To: David Howells
  Cc: Miklos Szeredi, viro, Jeff Layton, Eric W. Biederman,
	linux-fsdevel, Linux API

On Tue, Jan 16, 2018 at 10:10:12AM +0000, David Howells wrote:
> Inside the kernel the MS_* flags appear to belong to a number of fundamentally
> different classes:

Good point, but I'm not sure about your terminology -- for example
"topology" sounds strange if we use "propagation" for years.

>  (1) Things like MS_SILENT and MS_REMOUNT which affect the behaviour of the
>      mount process, but aren't persistent beyond that.

 mount-operation flags  (now including MS_BIND too)

>  (2) Inter-namespace topology management, controlling how mounts are shared
>      and duplicated between namespaces.

 propagation flags

>  (3) Restrictions on accesses through a particular mountpoint, eg. MS_NODEV,
>      MS_NOEXEC.

 VFS flags (now including MS_BIND|MS_REMOUNT|MS_RDONLY too)

>  (4) Instructions to a filesystem on how a superblock is to behave.

 FS flags

> I think the classes are fundamentally different - and we've already separated
> (4) from the others inside the kernel.  However, I've no great objection to
> keeping (2) and (3) together in the same mask.  It just sounds cleaner to
> separate them. 

 I agree.

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: User-visible context-mount API
  2018-01-17  9:53       ` Miklos Szeredi
@ 2018-01-17 11:06         ` Karel Zak
  2018-01-18  9:48           ` Miklos Szeredi
  2018-01-19  2:27           ` Al Viro
  2018-01-19  6:32         ` Al Viro
  1 sibling, 2 replies; 17+ messages in thread
From: Karel Zak @ 2018-01-17 11:06 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Al Viro, David Howells, Jeff Layton, Eric W. Biederman,
	linux-fsdevel, Linux API, util-linux, Michael Kerrisk (man-pages)

On Wed, Jan 17, 2018 at 10:53:36AM +0100, Miklos Szeredi wrote:
> [Adding util-linux@vger and Michael Kerrisk]
> 
> On Wed, Jan 17, 2018 at 5:17 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Tue, Jan 16, 2018 at 05:41:46PM +0100, Miklos Szeredi wrote:
> >
> >> Right.
> >>
> >> Still, those two (propagation and flags) are properties of the mount.
> >> No fundamental difference in how to handle them, that I see.  Okay, we
> >> have MS_REC handling in the propagation and not in the flags, but
> >> that's something that might make sense for flags as well.
> >>
> >> What's more interesting is how MS_PRIVATE + MS_REC semantics are
> >> complete failure in the real world: the logical thing would be to mark
> >> a mount private on the supplied mount AND propagate an umount event to
> >> everywhere else.
> >
> > This is utter nonsense.  Most of the time it's "Fedora, in its infinite
> > bogo^Wwisdom has made everything shared; I don't fucking need that
> > idiocy, so please unshare this, this and that".  You really don't want
> > (or have permissions for) unmounting e.g. /mnt in namespace of init
> > when you do that.
> >
> > Sure, we get tons of bug reports.  Due to idiotic Fedora setup, with
> > everything shared.  The same setup that would go up in flames on the
> > semantics change you propose.

I guess "all shared" is systemd requirement, so I guess it's not
Fedora specific, right?

> I wouldn't propose to change existing --make-private, as this would
> not be backward compatible. The new semantics would mean a new op,
> obviously.

Definitely.

> Documenting  --make-private thing properly would also help.  To me the
> wording "make private" strongly implies "I want to make submounts
> private to this instance".  See for example rhbz#1432211.

All propagation stuff is poorly documented in mount.8. It would be
nice to add section about it to the man page. Volunteer? (My skills to
explain this topic to end-users is pretty limited...)
 
> > If anything, "private bind on itself" would be a useful operation.
> > Turning given location into a mountpoint, and having everything
> > under it looking as it used to, but with no propagation at all.
> > Without bothering anybody else, even if location currently happens
> > to be on a shared/master mount.

Good idea.

> > I can slap that together for mount(2), but I'm not sure what a sane
> > combination of flags for that would look like ;-)

What about new flag (for the API) rather than try to be smart with the
current flags? But I have doubts that invest time to new mount(2)
features is a good idea.

> For fsmount I think it would be very useful thing to have.

Yes.

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: User-visible context-mount API
  2018-01-17 11:06         ` Karel Zak
@ 2018-01-18  9:48           ` Miklos Szeredi
  2018-01-19  2:27           ` Al Viro
  1 sibling, 0 replies; 17+ messages in thread
From: Miklos Szeredi @ 2018-01-18  9:48 UTC (permalink / raw)
  To: Karel Zak
  Cc: Al Viro, David Howells, Jeff Layton, Eric W. Biederman,
	linux-fsdevel, Linux API, util-linux, Michael Kerrisk (man-pages)

On Wed, Jan 17, 2018 at 12:06 PM, Karel Zak <kzak@redhat.com> wrote:

>> Documenting  --make-private thing properly would also help.  To me the
>> wording "make private" strongly implies "I want to make submounts
>> private to this instance".  See for example rhbz#1432211.
>
> All propagation stuff is poorly documented in mount.8. It would be
> nice to add section about it to the man page. Volunteer? (My skills to
> explain this topic to end-users is pretty limited...)

Propagation is common to mount(2) and there's some info in there
already, but fine points like this are not explained.

Maybe a new page in section 7?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: User-visible context-mount API
  2018-01-17 11:06         ` Karel Zak
  2018-01-18  9:48           ` Miklos Szeredi
@ 2018-01-19  2:27           ` Al Viro
  1 sibling, 0 replies; 17+ messages in thread
From: Al Viro @ 2018-01-19  2:27 UTC (permalink / raw)
  To: Karel Zak
  Cc: Miklos Szeredi, David Howells, Jeff Layton, Eric W. Biederman,
	linux-fsdevel, Linux API, util-linux, Michael Kerrisk (man-pages)

On Wed, Jan 17, 2018 at 12:06:33PM +0100, Karel Zak wrote:

> What about new flag (for the API) rather than try to be smart with the
> current flags? But I have doubts that invest time to new mount(2)
> features is a good idea.

Would be nice, if we had any spare bits left...  We could, in principle,
turn
#define MS_BIND         4096
#define MS_MOVE         8192
into
#define MS_BIND         0x1000
#define MS_MOVE         0x2000
#define MS_SOMETHING    0x3000
seeing that they should never be used together, but... mount(2)
doesn't reject MS_BIND|MS_MOVE and treats it as MS_BIND instead.
_Probably_ nothing would care, but it risks breaking userland.

We could use one of the internal-only bits for that instead, but
they are also quietly ignored and not rejected, so that would
have the same problem.

mount(2) ABI sucks, film at 11...

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: User-visible context-mount API
  2018-01-17  9:53       ` Miklos Szeredi
  2018-01-17 11:06         ` Karel Zak
@ 2018-01-19  6:32         ` Al Viro
  1 sibling, 0 replies; 17+ messages in thread
From: Al Viro @ 2018-01-19  6:32 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: David Howells, Jeff Layton, Eric W. Biederman, linux-fsdevel,
	Linux API, util-linux, Michael Kerrisk (man-pages)

On Wed, Jan 17, 2018 at 10:53:36AM +0100, Miklos Szeredi wrote:

> Documenting  --make-private thing properly would also help.  To me the
> wording "make private" strongly implies "I want to make submounts
> private to this instance".  See for example rhbz#1432211.
> 
> > If anything, "private bind on itself" would be a useful operation.
> > Turning given location into a mountpoint, and having everything
> > under it looking as it used to, but with no propagation at all.
> > Without bothering anybody else, even if location currently happens
> > to be on a shared/master mount.
> >
> > I can slap that together for mount(2), but I'm not sure what a sane
> > combination of flags for that would look like ;-)  For fsmount
> > I think it would be very useful thing to have.
> 
> Yes, I think such an operation would be pretty useful.   Not sure if
> it's the whole story, though.

FWIW, there's a fun variant of the API:

* fsopen(): string -> fsfd; takes fs type name, returns a file descriptor
connected to fs driver.  Subsequent read/write on it is used to pass
options, authenticate, etc. - all you need to talk the driver into
creating an active instance.

* fspick(): location -> fsfd; fsfd connected to filesystem mounted at given
place.  After that you can talk to the driver to get superblock-level
remount.

* new_mount(): fsfd x string -> fd.  Creates a vfsmount and gives a file
descriptor for given relative pathname.

* clone_mount(): location x bool -> fd.  Copies a vfsmount or an entire
subtree (depending upon the second parameter) and returns a file descriptor.
Basically, bind or rbind sans attaching it anywhere.

* change_flags(): fd x (propagation or vfsmount flags) x bool -> int
fd should point to root of some vfsmount (O_PATH, or either of the previous
two.  Flag is "do we want it to affect the entire subtree"; the tricky
question is what to do with vfsmount flags - for those we might want
things like "here's the full set" or "change those flags thus".
Hell knows - there might be two primitives there; the second one
would be fd x mask x new_flags x bool -> int, as in "set the bits
present in mask to values as in new_flags".  Not sure.

* move_mount(): fd x location x bool -> int.  fd - what to move, location -
where to put it, bool - do we want to suppress propagation.  Potentially
hacky part is that if fd is not attached to anything, we simply attach it;
otherwise - move.

Normal mount: fsopen, talk to driver, new_mount, move_mount, close descriptors
mount --bind: fd = clone_mount(old, false); move_mount(fd, new, false); close
mount --rbind: clone_mount(old, true); move_mount; close
mount --make-shared et.al.: open(..., O_PATH); change_flags; close
mount --move: open; mount_move; close
vfsmount-level remount: open; change_flags (or change_mount_flags, if we keep
it separate from topology ones); close
sb-level remount: fspick; talk to driver; close
make an arbitrary subtree really private (as discussed upthread):
	fd = clone_mount(old, true); change_flags (or change_propagation_flags);
	mount_move(fd, old, true); close(fd);

The tricky part in terms of implementation is that we want a
tree created by clone_mount() and never attached anywhere to be
dissolved on the final close() of the result of clone_mount().
It's not quite O_PATH - we want file_operations for that sucker
that would have ->release() doing that.

It would do namespace_lock(), check ->f_path.mnt for a flag and do
umount_tree() if not set, then namespace_unlock().  move_mount()
would set the flag.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2018-01-19  6:32 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-15 16:07 User-visible context-mount API David Howells
2018-01-15 17:31 ` Eric W. Biederman
2018-01-15 17:32 ` Eric W. Biederman
2018-01-16  9:01 ` Miklos Szeredi
2018-01-16 10:10 ` David Howells
2018-01-16 10:35   ` Miklos Szeredi
2018-01-16 14:18   ` David Howells
2018-01-17 10:43   ` Karel Zak
2018-01-16 14:55 ` David Howells
2018-01-16 15:40 ` David Howells
2018-01-16 16:41   ` Miklos Szeredi
2018-01-17  4:17     ` Al Viro
2018-01-17  9:53       ` Miklos Szeredi
2018-01-17 11:06         ` Karel Zak
2018-01-18  9:48           ` Miklos Szeredi
2018-01-19  2:27           ` Al Viro
2018-01-19  6:32         ` Al Viro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).