Linux-Fsdevel Archive on lore.kernel.org
help / color / mirror / Atom feed
From: Matthew Wilcox <willy@infradead.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Christian Schoenebeck <qemu_oss@crudebyte.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	Greg Kurz <groug@kaod.org>,
	linux-fsdevel@vger.kernel.org, stefanha@redhat.com,
	mszeredi@redhat.com, vgoyal@redhat.com, gscrivan@redhat.com,
	dwalsh@redhat.com, chirantan@chromium.org
Subject: Re: xattr names for unprivileged stacking?
Date: Sat, 29 Aug 2020 17:07:17 +0100	[thread overview]
Message-ID: <20200829160717.GS14765@casper.infradead.org> (raw)
In-Reply-To: <20200827222457.GB12096@dread.disaster.area>

On Fri, Aug 28, 2020 at 08:24:57AM +1000, Dave Chinner wrote:
> On Thu, Aug 27, 2020 at 04:22:07PM +0100, Matthew Wilcox wrote:
> > On Mon, Aug 17, 2020 at 10:29:30AM +1000, Dave Chinner wrote:
> > > To implement ADS, we'd likely consider adding a new physical inode
> > > "ADS fork" which, internally, maps to a separate directory
> > > structure. This provides us with the ADS namespace for each inode
> > > and a mechanism that instantiates a physical inode per ADS. IOWs,
> > > each ADS can be referenced by the VFS natively and independently as
> > > an inode (native "file as a directory" semantics). Hence existing
> > > create/unlink APIs work for managing ADS, readdir() can list all
> > > your ADS, you can keep per ADS xattrs, etc....
> > > 
> > > IOWs, with a filesystem inode fork implementation like this for ADS,
> > > all we really need is for the VFS to pass a magic command to
> > > ->lookup() to tell us to use the ADS namespace attached to the inode
> > > rather than use the primary inode type/state to perform the
> > > operation.
> > > 
> > > Hence all the ADS support infrastructure is essentially dentry cache
> > > infrastructure allowing a dentry to be both a file and directory,
> > > and providing the pathname resolution that recognises an ADS
> > > redirection. Name that however you want - we've got to do an on-disk
> > > format change to support ADS, so we can tell the VFS we support ADS
> > > or not. And we have no cares about existing names in the filesystem
> > > conflicting with the ADS pathname identifier because it's a mkfs
> > > time decision. Given that special flags are needed for the openat()
> > > call to resolve an ADS (e.g. O_ALT), we know if we should parse the
> > > ADS identifier as an ADS the moment it is seen...
> > 
> > I think this is equivalent to saying "Linux will never support ADS".
> > Al has some choice words on having the dentry cache support objects which
> > are both files and directories.  You run into some "fun" locking issues.
> > And there's lots of things you just don't want to permit, like mounting
> > a new filesystem on top of some ADS, or chrooting a process into an ADS,
> > or renaming an ADS into a different file.
> 
> I know all this. My point is that the behaviour required by ADS
> objects is that of a seekable data file. That requires a struct file
> that points at a struct inode, page cache mapping, etc to all work
> as they currently do. It also means that how ADS are managed and
> presented to userspace is entirely a VFS construct. Indeed,
> everything you mention above is functionality controlled/implemented
> by the VFS via the dentry cache...

I agree with you that supporting named streams within a file requires
an independent inode for each stream.  I disagree with you that this is
dentry cache infrastructure.  I do not believe in giving each stream
its own dentry.  Either they share the default stream's dentry, or they
have no dentry (mild preference for no dentry).

> > I think what would satisfy people is allowing actual "alternate data
> > streams" to exist in files.  You always start out by opening a file,
> > then the presentation layer is a syscall that lets you enumerate the
> > data streams available for this file, and another syscall that returns
> > an fd for one of those streams.
> 
> You could do this with a getdents_at() syscall that has an AT_ALT
> flag or something like that. i.e. iterate the streams on the inode
> (whether it be a regular file or a directory!) and report them as
> dirents to userspace. Userspace can then openat2(fd, name, O_ALT)
> and there is your user API.

Maybe.  getdents is a little overkill; these things don't have inode
numbers (at least not ones which are meaningful to userspace), or
d_type.  I might be tempted by just read() on an fd like v7 unix.

> The VFS can deal with openat2(fd, stream_name, O_ALT) however it
> wants - it doesn't need the dentry cache pathwalk here - just vector
> straight to the filesystem's ->lookup mechanism on the inode
> attached to the "dirfd" passed in. 
>
> AFAICT, the dentry cache only needs to be involved if we want to
> -cache- the ADS namespace. I don't think we need to cache the ADS
> namespace as long as the inode is cached by the filesystem - just
> let the fs and let it do an inode cache lookup and instantiation for
> ADS inodes (eg as XFS already does for internal inode accesses
> during bulkstat, quotacheck, etc). We don't cache the xattr
> namespaces in the VFS - the filesystem is responsible for doing that
> if required - so I don't think this would be a problem for ADS
> access...
> 
> The fact that ADS inodes would not be in the dentry cache and hence
> not visible to pathwalks at all then means that all of the issues
> such as mounting over them, chroot, etc don't exist in the first
> place...

Wait, you've now switched from "this is dentry cache infrastructure"
to "it should not be in the dentry cache".  So I don't understand what
you're arguing for.

> > As long as nobody gets the bright idea to be able to link that fd into
> > the directory structure somewhere, this should avoid any problems with
> > unwanted things being done to an ADS.  Chunks of your implementation
> > described above should be fine for this.
> 
> I can see the need for rename and linkat linking O_TMPFILE fd's into
> ADS names, though. e.g. to be able to do safe overwrites of ADS
> data.

I don't have a problem with being able to create unnamed streams and
then atomically linking them into their containing file.

> From a fs management POV, we'll also want to be able to do things
> like defrag ADS inodes, which means we'll need to be able to do
> atomic inode operations (e.g. swap extents) between O_TMPFILE inodes
> and ADS inodes, etc. So in addition to the VFS interfaces, there's a
> bunch of filesystem admin stuff that will need to be made ADS aware,
> and it's likely there will be fs specific ioctls that need to be
> modifed/added to manipulate ADS inodes directly...

Yes, probably.

> > For the benefit of shell scripts, I think an argument to 'cat' to open
> > an ADS and an lsads command should be enough.
> > 
> > Oh, and I would think we might want i_blocks of the 'host' inode to
> > reflect the blocks allocated to all the data streams attached to the
> > inode.  That should address at least parts of the data exfiltration
> > concern.
> 
> I think that's a problem, because metadata blocks that are invisible
> to userspace are also accounted to the inode block count, so a user
> cannot know if the difference between the data file size and the
> block count stat() reports is block mapping metadata, xattrs,
> speculative delayed allocation reservations, etc. It's just not a
> useful signal because it's already so overloaded with invisible
> stuff....

My concern is that 'du' should not have to be made stream-aware to
continue to be accurate.  Yes, all these other things also contribute
to the space being used by a file, so it's not a very reliable signal,
but if you see a vast discrepancy (several gigabytes being used by a
file which is notionally a few hundred bytes), it's suspicious.

> It also means that every block map modification to an ADS inode also
> has to lock and modify the host inode. That's going to mean adding a
> heap of complexity to the filesystem transaction models because now
> there are two independent inodes that have to be locked we doing a
> single inode operations instead of largely being a simple drop in...

It doesn't have to be reflected in the on-disk inode.  As long as the
calling stat() returns the number of blocks allocated to all streams
contained in the file, you can implement that any way you want.

> IOWs, if ADS visibility is required (which I don't think anyone will
> argue against) I'd suggest that statx() has a flag added to indicate
> ADS exist on the inode. Then it's easy to discover through a
> standard interface....

The amount of space used has to be visible to unmodified utilities.
We could have an implementation where unmodified utilities walk all
the sub-streams at stat() time while statx() with the appropriate flag
reports disaggregated data (and is more efficient).


I think we have a group of people contributing to this thread who want
the plain "named streams" functionality that you and I are currently
discussing.  And then another group who want something more complex
where the "alternate" contents of the file could be a directory tree
with files and subdirectories and permissions ... essentially mounting
the contents of a ZIP file on top of itself.  And I think that's a level
of complexity we have to step away from.

  reply	other threads:[~2020-08-29 16:07 UTC|newest]

Thread overview: 62+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-28 10:55 Dr. David Alan Gilbert
2020-07-28 13:08 ` Greg Kurz
2020-07-28 13:55   ` Christian Schoenebeck
2020-08-04 11:28     ` Dr. David Alan Gilbert
2020-08-04 13:51       ` Christian Schoenebeck
2020-08-12 11:18         ` Dr. David Alan Gilbert
2020-08-12 13:34           ` Christian Schoenebeck
2020-08-12 14:33             ` Dr. David Alan Gilbert
2020-08-13  9:01               ` Christian Schoenebeck
2020-08-16 22:56                 ` Dave Chinner
2020-08-16 23:09                   ` Matthew Wilcox
2020-08-17  0:29                     ` Dave Chinner
2020-08-17 10:37                       ` file forks vs. xattr (was: xattr names for unprivileged stacking?) Christian Schoenebeck
2020-08-23 23:40                         ` Dave Chinner
2020-08-24 15:30                           ` Christian Schoenebeck
2020-08-24 20:01                             ` Miklos Szeredi
2020-08-24 21:26                             ` Frank van der Linden
2020-08-24 22:29                             ` Theodore Y. Ts'o
2020-08-25 15:12                               ` Christian Schoenebeck
2020-08-25 15:32                                 ` Miklos Szeredi
2020-08-27 12:02                                   ` Christian Schoenebeck
2020-08-27 12:25                                     ` Matthew Wilcox
2020-08-27 13:48                                       ` Christian Schoenebeck
2020-08-27 14:01                                         ` Matthew Wilcox
2020-08-27 14:23                                           ` Christian Schoenebeck
2020-08-27 14:25                                             ` Matthew Wilcox
2020-08-27 14:44                                             ` Al Viro
2020-08-27 16:29                                               ` Dr. David Alan Gilbert
2020-08-27 16:35                                                 ` Matthew Wilcox
2020-08-28  9:11                                                 ` Christian Schoenebeck
2020-08-28 14:46                                                   ` Theodore Y. Ts'o
2020-08-27 15:22                       ` xattr names for unprivileged stacking? Matthew Wilcox
2020-08-27 22:24                         ` Dave Chinner
2020-08-29 16:07                           ` Matthew Wilcox [this message]
2020-08-29 16:13                             ` Al Viro
2020-08-29 17:51                               ` Miklos Szeredi
2020-08-29 18:04                                 ` Al Viro
2020-08-29 18:22                                   ` Christian Schoenebeck
2020-08-29 19:13                                   ` Miklos Szeredi
2020-08-29 19:25                                     ` Al Viro
2020-08-30 19:05                                       ` Miklos Szeredi
2020-08-30 19:10                                         ` Matthew Wilcox
2020-08-31  7:34                                           ` Miklos Szeredi
2020-08-31 11:37                                             ` Matthew Wilcox
2020-08-31 11:51                                               ` Miklos Szeredi
2020-08-31 13:23                                                 ` Matthew Wilcox
2020-08-31 14:21                                                   ` Miklos Szeredi
2020-08-31 14:25                                                   ` Theodore Y. Ts'o
2020-08-31 14:45                                                     ` Matthew Wilcox
2020-08-31 14:49                                                       ` Miklos Szeredi
2020-09-01  3:34                                                     ` Dave Chinner
2020-09-01 14:52                                                       ` Theodore Y. Ts'o
2020-09-01 15:14                                                         ` Theodore Y. Ts'o
2020-09-02  5:19                                                           ` Dave Chinner
2020-08-31 18:02                                                   ` Andreas Dilger
2020-09-01  3:48                                                     ` Dave Chinner
2020-08-29 19:17                               ` Matthew Wilcox
2020-08-29 19:40                                 ` Al Viro
2020-08-29 20:12                                   ` Matthew Wilcox
2020-08-31 14:23                                     ` Theodore Y. Ts'o
2020-08-31 14:40                                       ` Matthew Wilcox
2020-08-31 16:11                                       ` Christian Schoenebeck

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200829160717.GS14765@casper.infradead.org \
    --to=willy@infradead.org \
    --cc=chirantan@chromium.org \
    --cc=david@fromorbit.com \
    --cc=dgilbert@redhat.com \
    --cc=dwalsh@redhat.com \
    --cc=groug@kaod.org \
    --cc=gscrivan@redhat.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=mszeredi@redhat.com \
    --cc=qemu_oss@crudebyte.com \
    --cc=stefanha@redhat.com \
    --cc=vgoyal@redhat.com \
    --subject='Re: xattr names for unprivileged stacking?' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).