LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
From: Al Viro <viro@zeniv.linux.org.uk>
To: Christian Brauner <christian@brauner.io>
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-api@vger.kernel.org, jannh@google.com, fweimer@redhat.com,
oleg@redhat.com, tglx@linutronix.de,
torvalds@linux-foundation.org, arnd@arndb.de, shuah@kernel.org,
dhowells@redhat.com, tkjos@android.com, ldv@altlinux.org,
miklos@szeredi.hu, linux-alpha@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, linux-ia64@vger.kernel.org,
linux-m68k@lists.linux-m68k.org, linux-mips@vger.kernel.org,
linux-parisc@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
linux-s390@vger.kernel.org, linux-sh@vger.kernel.org,
sparclinux@vger.kernel.org, linux-xtensa@linux-xtensa.org,
linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org,
x86@kernel.org
Subject: Re: [PATCH 1/2] open: add close_range()
Date: Tue, 21 May 2019 16:00:06 +0100 [thread overview]
Message-ID: <20190521150006.GJ17978@ZenIV.linux.org.uk> (raw)
In-Reply-To: <20190521113448.20654-1-christian@brauner.io>
On Tue, May 21, 2019 at 01:34:47PM +0200, Christian Brauner wrote:
> This adds the close_range() syscall. It allows to efficiently close a range
> of file descriptors up to all file descriptors of a calling task.
>
> The syscall came up in a recent discussion around the new mount API and
> making new file descriptor types cloexec by default. During this
> discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
> syscall in this manner has been requested by various people over time.
>
> First, it helps to close all file descriptors of an exec()ing task. This
> can be done safely via (quoting Al's example from [1] verbatim):
>
> /* that exec is sensitive */
> unshare(CLONE_FILES);
> /* we don't want anything past stderr here */
> close_range(3, ~0U);
> execve(....);
>
> The code snippet above is one way of working around the problem that file
> descriptors are not cloexec by default. This is aggravated by the fact that
> we can't just switch them over without massively regressing userspace. For
> a whole class of programs having an in-kernel method of closing all file
> descriptors is very helpful (e.g. demons, service managers, programming
> language standard libraries, container managers etc.).
> (Please note, unshare(CLONE_FILES) should only be needed if the calling
> task is multi-threaded and shares the file descriptor table with another
> thread in which case two threads could race with one thread allocating
> file descriptors and the other one closing them via close_range(). For the
> general case close_range() before the execve() is sufficient.)
>
> Second, it allows userspace to avoid implementing closing all file
> descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
> file descriptor. From looking at various large(ish) userspace code bases
> this or similar patterns are very common in:
> - service managers (cf. [4])
> - libcs (cf. [6])
> - container runtimes (cf. [5])
> - programming language runtimes/standard libraries
> - Python (cf. [2])
> - Rust (cf. [7], [8])
> As Dmitry pointed out there's even a long-standing glibc bug about missing
> kernel support for this task (cf. [3]).
> In addition, the syscall will also work for tasks that do not have procfs
> mounted and on kernels that do not have procfs support compiled in. In such
> situations the only way to make sure that all file descriptors are closed
> is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
> OPEN_MAX trickery (cf. comment [8] on Rust).
>
> The performance is striking. For good measure, comparing the following
> simple close_all_fds() userspace implementation that is essentially just
> glibc's version in [6]:
>
> static int close_all_fds(void)
> {
> DIR *dir;
> struct dirent *direntp;
>
> dir = opendir("/proc/self/fd");
> if (!dir)
> return -1;
>
> while ((direntp = readdir(dir))) {
> int fd;
> if (strcmp(direntp->d_name, ".") == 0)
> continue;
> if (strcmp(direntp->d_name, "..") == 0)
> continue;
> fd = atoi(direntp->d_name);
> if (fd == 0 || fd == 1 || fd == 2)
> continue;
> close(fd);
> }
>
> closedir(dir); /* cannot fail */
> return 0;
> }
>
> to close_range() yields:
> 1. closing 4 open files:
> - close_all_fds(): ~280 us
> - close_range(): ~24 us
>
> 2. closing 1000 open files:
> - close_all_fds(): ~5000 us
> - close_range(): ~800 us
>
> close_range() is designed to allow for some flexibility. Specifically, it
> does not simply always close all open file descriptors of a task. Instead,
> callers can specify an upper bound.
> This is e.g. useful for scenarios where specific file descriptors are
> created with well-known numbers that are supposed to be excluded from
> getting closed.
> For extra paranoia close_range() comes with a flags argument. This can e.g.
> be used to implement extension. Once can imagine userspace wanting to stop
> at the first error instead of ignoring errors under certain circumstances.
> There might be other valid ideas in the future. In any case, a flag
> argument doesn't hurt and keeps us on the safe side.
>
> >From an implementation side this is kept rather dumb. It saw some input
> from David and Jann but all nonsense is obviously my own!
> - Errors to close file descriptors are currently ignored. (Could be changed
> by setting a flag in the future if needed.)
> - __close_range() is a rather simplistic wrapper around __close_fd().
> My reasoning behind this is based on the nature of how __close_fd() needs
> to release an fd. But maybe I misunderstood specifics:
> We take the files_lock and rcu-dereference the fdtable of the calling
> task, we find the entry in the fdtable, get the file and need to release
> files_lock before calling filp_close().
> In the meantime the fdtable might have been altered so we can't just
> retake the spinlock and keep the old rcu-reference of the fdtable
> around. Instead we need to grab a fresh reference to the fdtable.
> If my reasoning is correct then there's really no point in fancyfying
> __close_range(): We just need to rcu-dereference the fdtable of the
> calling task once to cap the max_fd value correctly and then go on
> calling __close_fd() in a loop.
> +/**
> + * __close_range() - Close all file descriptors in a given range.
> + *
> + * @fd: starting file descriptor to close
> + * @max_fd: last file descriptor to close
> + *
> + * This closes a range of file descriptors. All file descriptors
> + * from @fd up to and including @max_fd are closed.
> + */
> +int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> +{
> + unsigned int cur_max;
> +
> + if (fd > max_fd)
> + return -EINVAL;
> +
> + rcu_read_lock();
> + cur_max = files_fdtable(files)->max_fds;
> + rcu_read_unlock();
> +
> + /* cap to last valid index into fdtable */
> + if (max_fd >= cur_max)
> + max_fd = cur_max - 1;
> +
> + while (fd <= max_fd)
> + __close_fd(files, fd++);
> +
> + return 0;
> +}
Umm... That's going to be very painful if you dup2() something to MAX_INT and
then run that; roughly 2G iterations of bouncing ->file_lock up and down,
without anything that would yield CPU in process.
If anything, I would suggest something like
fd = *start_fd;
grab the lock
fdt = files_fdtable(files);
more:
look for the next eviction candidate in ->open_fds, starting at fd
if there's none up to max_fd
drop the lock
return NULL
*start_fd = fd + 1;
if the fscker is really opened and not just reserved
rcu_assign_pointer(fdt->fd[fd], NULL);
__put_unused_fd(files, fd);
drop the lock
return the file we'd got
if (unlikely(need_resched()))
drop lock
cond_resched();
grab lock
fdt = files_fdtable(files);
goto more;
with the main loop being basically
while ((file = pick_next(files, &start_fd, max_fd)) != NULL)
filp_close(file, files);
next prev parent reply other threads:[~2019-05-21 15:00 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-05-21 11:34 Christian Brauner
2019-05-21 11:34 ` [PATCH 2/2] tests: add close_range() tests Christian Brauner
2019-05-21 12:09 ` [PATCH 1/2] open: add close_range() Florian Weimer
2019-05-21 13:04 ` Christian Brauner
2019-05-21 13:10 ` Florian Weimer
2019-05-21 13:18 ` Christian Brauner
2019-05-21 13:23 ` Christian Brauner
2019-05-21 13:39 ` Rasmus Villemoes
2019-05-21 15:00 ` Al Viro [this message]
2019-05-21 16:53 ` Christian Brauner
2019-05-21 16:30 ` David Howells
2019-05-21 16:41 ` Christian Brauner
2019-05-21 20:23 ` Linus Torvalds
2019-05-22 8:12 ` Christian Brauner
2019-05-21 19:20 ` Al Viro
2019-05-21 19:59 ` Matthew Wilcox
2019-05-24 20:32 ` Michael Tirado
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190521150006.GJ17978@ZenIV.linux.org.uk \
--to=viro@zeniv.linux.org.uk \
--cc=arnd@arndb.de \
--cc=christian@brauner.io \
--cc=dhowells@redhat.com \
--cc=fweimer@redhat.com \
--cc=jannh@google.com \
--cc=ldv@altlinux.org \
--cc=linux-alpha@vger.kernel.org \
--cc=linux-api@vger.kernel.org \
--cc=linux-arch@vger.kernel.org \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-ia64@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-kselftest@vger.kernel.org \
--cc=linux-m68k@lists.linux-m68k.org \
--cc=linux-mips@vger.kernel.org \
--cc=linux-parisc@vger.kernel.org \
--cc=linux-s390@vger.kernel.org \
--cc=linux-sh@vger.kernel.org \
--cc=linux-xtensa@linux-xtensa.org \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=miklos@szeredi.hu \
--cc=oleg@redhat.com \
--cc=shuah@kernel.org \
--cc=sparclinux@vger.kernel.org \
--cc=tglx@linutronix.de \
--cc=tkjos@android.com \
--cc=torvalds@linux-foundation.org \
--cc=x86@kernel.org \
--subject='Re: [PATCH 1/2] open: add close_range()' \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).