Linux-Fsdevel Archive on lore.kernel.org
help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: Christian Schoenebeck <qemu_oss@crudebyte.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	Greg Kurz <groug@kaod.org>,
	linux-fsdevel@vger.kernel.org, stefanha@redhat.com,
	mszeredi@redhat.com, vgoyal@redhat.com, gscrivan@redhat.com,
	dwalsh@redhat.com, chirantan@chromium.org
Subject: Re: xattr names for unprivileged stacking?
Date: Fri, 28 Aug 2020 08:24:57 +1000	[thread overview]
Message-ID: <20200827222457.GB12096@dread.disaster.area> (raw)
In-Reply-To: <20200827152207.GJ14765@casper.infradead.org>

On Thu, Aug 27, 2020 at 04:22:07PM +0100, Matthew Wilcox wrote:
> On Mon, Aug 17, 2020 at 10:29:30AM +1000, Dave Chinner wrote:
> > To implement ADS, we'd likely consider adding a new physical inode
> > "ADS fork" which, internally, maps to a separate directory
> > structure. This provides us with the ADS namespace for each inode
> > and a mechanism that instantiates a physical inode per ADS. IOWs,
> > each ADS can be referenced by the VFS natively and independently as
> > an inode (native "file as a directory" semantics). Hence existing
> > create/unlink APIs work for managing ADS, readdir() can list all
> > your ADS, you can keep per ADS xattrs, etc....
> > 
> > IOWs, with a filesystem inode fork implementation like this for ADS,
> > all we really need is for the VFS to pass a magic command to
> > ->lookup() to tell us to use the ADS namespace attached to the inode
> > rather than use the primary inode type/state to perform the
> > operation.
> > 
> > Hence all the ADS support infrastructure is essentially dentry cache
> > infrastructure allowing a dentry to be both a file and directory,
> > and providing the pathname resolution that recognises an ADS
> > redirection. Name that however you want - we've got to do an on-disk
> > format change to support ADS, so we can tell the VFS we support ADS
> > or not. And we have no cares about existing names in the filesystem
> > conflicting with the ADS pathname identifier because it's a mkfs
> > time decision. Given that special flags are needed for the openat()
> > call to resolve an ADS (e.g. O_ALT), we know if we should parse the
> > ADS identifier as an ADS the moment it is seen...
> 
> I think this is equivalent to saying "Linux will never support ADS".
> Al has some choice words on having the dentry cache support objects which
> are both files and directories.  You run into some "fun" locking issues.
> And there's lots of things you just don't want to permit, like mounting
> a new filesystem on top of some ADS, or chrooting a process into an ADS,
> or renaming an ADS into a different file.

I know all this. My point is that the behaviour required by ADS
objects is that of a seekable data file. That requires a struct file
that points at a struct inode, page cache mapping, etc to all work
as they currently do. It also means that how ADS are managed and
presented to userspace is entirely a VFS construct. Indeed,
everything you mention above is functionality controlled/implemented
by the VFS via the dentry cache...

> I think what would satisfy people is allowing actual "alternate data
> streams" to exist in files.  You always start out by opening a file,
> then the presentation layer is a syscall that lets you enumerate the
> data streams available for this file, and another syscall that returns
> an fd for one of those streams.

You could do this with a getdents_at() syscall that has an AT_ALT
flag or something like that. i.e. iterate the streams on the inode
(whether it be a regular file or a directory!) and report them as
dirents to userspace. Userspace can then openat2(fd, name, O_ALT)
and there is your user API.

The VFS can deal with openat2(fd, stream_name, O_ALT) however it
wants - it doesn't need the dentry cache pathwalk here - just vector
straight to the filesystem's ->lookup mechanism on the inode
attached to the "dirfd" passed in. 

AFAICT, the dentry cache only needs to be involved if we want to
-cache- the ADS namespace. I don't think we need to cache the ADS
namespace as long as the inode is cached by the filesystem - just
let the fs and let it do an inode cache lookup and instantiation for
ADS inodes (eg as XFS already does for internal inode accesses
during bulkstat, quotacheck, etc). We don't cache the xattr
namespaces in the VFS - the filesystem is responsible for doing that
if required - so I don't think this would be a problem for ADS
access...

The fact that ADS inodes would not be in the dentry cache and hence
not visible to pathwalks at all then means that all of the issues
such as mounting over them, chroot, etc don't exist in the first
place...

> As long as nobody gets the bright idea to be able to link that fd into
> the directory structure somewhere, this should avoid any problems with
> unwanted things being done to an ADS.  Chunks of your implementation
> described above should be fine for this.

I can see the need for rename and linkat linking O_TMPFILE fd's into
ADS names, though. e.g. to be able to do safe overwrites of ADS
data.

From a fs management POV, we'll also want to be able to do things
like defrag ADS inodes, which means we'll need to be able to do
atomic inode operations (e.g. swap extents) between O_TMPFILE inodes
and ADS inodes, etc. So in addition to the VFS interfaces, there's a
bunch of filesystem admin stuff that will need to be made ADS aware,
and it's likely there will be fs specific ioctls that need to be
modifed/added to manipulate ADS inodes directly...

> I thought through some of this a while back, and came up with this list:
> 
> > Work as expected:
> > mmap(), read(), write(), close(), splice(), sendfile(), fallocate(),
> > ftruncate(), dup(), dup2(), dup3(), utimensat(), futimens(), select(),
> > poll(), lseek(), fcntl(): F_DUPFD, F_GETFD, F_GETFL, F_SETFL, F_SETLK,
> > F_SETLKW, F_GETLK, F_GETOWN, F_SETOWN, F_GETSIG, F_SETSIG, F_SETLEASE,
> > F_GETLEASE)
> >
> > Return error if fd refers to the non-default stream:
> > linkat(), symlinkat(), mknodat(), mkdirat()
> >
> > Remove a stream from a file:
> > unlinkat()
> >
> > Open an existing stream in a file or create a new stream in a file:
> > openat()
> >
> > fstat()
> > st_ino may be different for different names.  st_dev may be different.
> > st_mode will match the object for files, even if it is changed after
> > creation.  For directories, it will match except that execute permission
> > will be removed and S_IFMT will be S_ISREG (do we want to define a
> > new S_ISSTRM?).  st_nlink will be 1.  st_uid and st_gid will match.
> > It will have its own st_atime/st_mtime/st_ctime.  Accessing a stream
> > will not update its parent's atime/mtime/ctime.
> >
> > renameat()
> > If olddirfd + oldpath refers to a stream then newdirfd + newpath must
> > refer to a stream within the same parent object.  If that stream exists,
> > it is removed.  If olddirfd + oldpath does not refer to a stream,
> > then newdirfd + newpath must not refer to a stream.
> >
> > The two file specifications must resolve to the same parent object.
> > It is possible to use renameat() to rename a stream within an object,
> > but not to move a stream from one object to another.  If newpath refers
> > to an existing named stream, it is removed.
> 
> I don't seem to have come up with an actual syscall for enumerating the
> stream names.  I kind of think a fresh syscall might be the right way to
> go.

Why reinvent the wheel? getdentsat() seems like the right interface
to use here because it matches up with all the other *at(AT_ALT)
style interfaces we'd be using to operate on ADS... :P

> For the benefit of shell scripts, I think an argument to 'cat' to open
> an ADS and an lsads command should be enough.
> 
> Oh, and I would think we might want i_blocks of the 'host' inode to
> reflect the blocks allocated to all the data streams attached to the
> inode.  That should address at least parts of the data exfiltration
> concern.

I think that's a problem, because metadata blocks that are invisible
to userspace are also accounted to the inode block count, so a user
cannot know if the difference between the data file size and the
block count stat() reports is block mapping metadata, xattrs,
speculative delayed allocation reservations, etc. It's just not a
useful signal because it's already so overloaded with invisible
stuff....

It also means that every block map modification to an ADS inode also
has to lock and modify the host inode. That's going to mean adding a
heap of complexity to the filesystem transaction models because now
there are two independent inodes that have to be locked we doing a
single inode operations instead of largely being a simple drop in...

IOWs, if ADS visibility is required (which I don't think anyone will
argue against) I'd suggest that statx() has a flag added to indicate
ADS exist on the inode. Then it's easy to discover through a
standard interface....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2020-08-27 22:25 UTC|newest]

Thread overview: 62+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-28 10:55 Dr. David Alan Gilbert
2020-07-28 13:08 ` Greg Kurz
2020-07-28 13:55   ` Christian Schoenebeck
2020-08-04 11:28     ` Dr. David Alan Gilbert
2020-08-04 13:51       ` Christian Schoenebeck
2020-08-12 11:18         ` Dr. David Alan Gilbert
2020-08-12 13:34           ` Christian Schoenebeck
2020-08-12 14:33             ` Dr. David Alan Gilbert
2020-08-13  9:01               ` Christian Schoenebeck
2020-08-16 22:56                 ` Dave Chinner
2020-08-16 23:09                   ` Matthew Wilcox
2020-08-17  0:29                     ` Dave Chinner
2020-08-17 10:37                       ` file forks vs. xattr (was: xattr names for unprivileged stacking?) Christian Schoenebeck
2020-08-23 23:40                         ` Dave Chinner
2020-08-24 15:30                           ` Christian Schoenebeck
2020-08-24 20:01                             ` Miklos Szeredi
2020-08-24 21:26                             ` Frank van der Linden
2020-08-24 22:29                             ` Theodore Y. Ts'o
2020-08-25 15:12                               ` Christian Schoenebeck
2020-08-25 15:32                                 ` Miklos Szeredi
2020-08-27 12:02                                   ` Christian Schoenebeck
2020-08-27 12:25                                     ` Matthew Wilcox
2020-08-27 13:48                                       ` Christian Schoenebeck
2020-08-27 14:01                                         ` Matthew Wilcox
2020-08-27 14:23                                           ` Christian Schoenebeck
2020-08-27 14:25                                             ` Matthew Wilcox
2020-08-27 14:44                                             ` Al Viro
2020-08-27 16:29                                               ` Dr. David Alan Gilbert
2020-08-27 16:35                                                 ` Matthew Wilcox
2020-08-28  9:11                                                 ` Christian Schoenebeck
2020-08-28 14:46                                                   ` Theodore Y. Ts'o
2020-08-27 15:22                       ` xattr names for unprivileged stacking? Matthew Wilcox
2020-08-27 22:24                         ` Dave Chinner [this message]
2020-08-29 16:07                           ` Matthew Wilcox
2020-08-29 16:13                             ` Al Viro
2020-08-29 17:51                               ` Miklos Szeredi
2020-08-29 18:04                                 ` Al Viro
2020-08-29 18:22                                   ` Christian Schoenebeck
2020-08-29 19:13                                   ` Miklos Szeredi
2020-08-29 19:25                                     ` Al Viro
2020-08-30 19:05                                       ` Miklos Szeredi
2020-08-30 19:10                                         ` Matthew Wilcox
2020-08-31  7:34                                           ` Miklos Szeredi
2020-08-31 11:37                                             ` Matthew Wilcox
2020-08-31 11:51                                               ` Miklos Szeredi
2020-08-31 13:23                                                 ` Matthew Wilcox
2020-08-31 14:21                                                   ` Miklos Szeredi
2020-08-31 14:25                                                   ` Theodore Y. Ts'o
2020-08-31 14:45                                                     ` Matthew Wilcox
2020-08-31 14:49                                                       ` Miklos Szeredi
2020-09-01  3:34                                                     ` Dave Chinner
2020-09-01 14:52                                                       ` Theodore Y. Ts'o
2020-09-01 15:14                                                         ` Theodore Y. Ts'o
2020-09-02  5:19                                                           ` Dave Chinner
2020-08-31 18:02                                                   ` Andreas Dilger
2020-09-01  3:48                                                     ` Dave Chinner
2020-08-29 19:17                               ` Matthew Wilcox
2020-08-29 19:40                                 ` Al Viro
2020-08-29 20:12                                   ` Matthew Wilcox
2020-08-31 14:23                                     ` Theodore Y. Ts'o
2020-08-31 14:40                                       ` Matthew Wilcox
2020-08-31 16:11                                       ` Christian Schoenebeck

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200827222457.GB12096@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=chirantan@chromium.org \
    --cc=dgilbert@redhat.com \
    --cc=dwalsh@redhat.com \
    --cc=groug@kaod.org \
    --cc=gscrivan@redhat.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=mszeredi@redhat.com \
    --cc=qemu_oss@crudebyte.com \
    --cc=stefanha@redhat.com \
    --cc=vgoyal@redhat.com \
    --cc=willy@infradead.org \
    --subject='Re: xattr names for unprivileged stacking?' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).