Linux-Fsdevel Archive on
help / color / mirror / Atom feed
From: Lokesh Gidra <>
To: Andrea Arcangeli <>
Cc: "Michael S. Tsirkin" <>,
	Nick Kralevich <>,
	Jeffrey Vander Stoep <>,
	Suren Baghdasaryan <>,
	Kees Cook <>,
	Jonathan Corbet <>,
	Alexander Viro <>,
	Luis Chamberlain <>,
	Iurii Zaikin <>,
	Mauro Carvalho Chehab <>,
	Andrew Morton <>,
	Andy Shevchenko <>,
	Vlastimil Babka <>,
	Mel Gorman <>,
	Sebastian Andrzej Siewior <>,
	Peter Xu <>, Mike Rapoport <>,
	Jerome Glisse <>, Shaohua Li <>,, LKML <>,
	Linux FS Devel <>,
	Tim Murray <>,
	Minchan Kim <>,
	Sandeep Patil <>,, Daniel Colascione <>,
	Kalesh Singh <>,
	"Dr. David Alan Gilbert" <>,
	Tetsuo Handa <>,
	Dmitry Vyukov <>
Subject: Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only
Date: Fri, 4 Sep 2020 17:36:02 -0700	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <>

On Thu, Sep 3, 2020 at 8:34 PM Andrea Arcangeli <> wrote:
> Hello,
> On Mon, Aug 17, 2020 at 03:11:16PM -0700, Lokesh Gidra wrote:
> > There has been an emphasis that Android is probably the only user for
> > the restriction of userfaults from kernel-space and that it wouldn’t
> > be useful anywhere else. I humbly disagree! There are various areas
> > where the PROT_NONE+SIGSEGV trick is (and can be) used in a purely
> > user-space setting. Basically, any lazy, on-demand,
> For the record what I said is quoted below
> :
> """It all boils down of how peculiar it is to be able to leverage only
> the acceleration [..] Right now there's a single user that can cope
> with that limitation [..] If there will be more users [..]  it'd be
> fine to add a value "2" later."""
> Specifically I never said "that it wouldn’t be useful anywhere else.".
Thanks a lot for clarifying.

> Also I'm only arguing about the sysctl visible kABI change in patch
> 2/2: the flag passed as parameter to the syscall in patch 1/2 is all
> great, because seccomp needs it in the scalar parameter of the syscall
> to implement a filter equivalent to your sysctl "2" policy with only
> patch 1/2 applied.
> I've two more questions now:
> 1) why don't you enforce the block of kernel initiated faults with
>    seccomp-bpf instead of adding a sysctl value 2? Is the sysctl just
>    an optimization to remove a few instructions per syscall in the bpf
>    execution of Android unprivileged apps? You should block a lot of
>    other syscalls by default to all unprivileged processes, including
>    vmsplice.
>    In other words if it's just for Android, why can't Android solve it
>    with only patch 1/2 by tweaking the seccomp filter?

I would let Nick (nnk@) and Jeff (jeffv@) respond to this.

The previous responses from both of them on this email thread
suggest that the performance overhead of seccomp-bpf is too much. Kees
also objected to it

I'm not familiar with how seccomp-bpf works. All that I can add here
is that userfaultfd syscall is usually not invoked in a performance
critical code path. So, if the performance overhead of seccomp-bpf (if
enabled) is observed on all syscalls originating from a process, then
I'd say patch 2/2 is essential. Otherwise, it should be ok to let
seccomp perform the same functionality instead.

> 2) given that Android is secure enough with the sysctl at value 2, why
>    should we even retain the current sysctl 0 semantics? Why can't
>    more secure systems just use seccomp and block userfaultfd, as it
>    is already happens by default in the podman default seccomp
>    whitelist (for those containers that don't define a new json
>    whitelist in the OCI schema)? Shouldn't we focus our energy in
>    making containers more secure by preventing the OCI schema of a
>    random container to re-enable userfaultfd in the container seccomp
>    filter instead of trying to solve this with a global sysctl?
>    What's missing in my view is a kubernetes hard allowlist/denylist
>    that cannot be overridden with the OCI schema in case people has
>    the bad idea of running containers downloaded from a not fully
>    trusted source, without adding virt isolation and that's an
>    userland problem to be solved in the container runtime, not a
>    kernel issue. Then you'd just add userfaultfd to the json of the
>    k8s hard seccomp denylist instead of going around tweaking sysctl.
> What's your take in changing your 2/2 patch to just replace value "0"
> and avoid introducing a new value "2"?

SGTM. Disabling uffd completely for unprivileged processes can be
achieved either using seccomp-bpf, or via SELinux, once the following
patch series is upstreamed

> The value "0" was motivated by the concern that uffd can enlarge the
> race window for use after free by providing one more additional way to
> block kernel faults, but value "2" is already enough to solve that
> concern completely and it'll be the default on all Android.
> In other words by adding "2" you're effectively doing a more
> finegrined and more optimal implementation of "0" that remains useful
> and available to unprivileged apps and it already resolves all
> "robustness against side effects other kernel bugs" concerns. Clearly
> "0" is even more secure statistically but that would apply to every
> other syscall including vmsplice, and there's no
> /proc/sys/vm/unprivileged_vmsplice sysctl out there.
> The next issue we have now is with the pipe mutex (which is not a
> major concern but we need to solve it somehow for correctness). So I
> wonder if should make the default value to be "0" (or "2" if think we
> should not replace "0") and to allow only user initiated faults by
> default.
> Changing the sysctl default to be 0, will make live migration fail to
> switch to postcopy which will be (unnoticeable to the guest), instead
> of risking the VM to be killed because of network latency
> outlier. Then we wouldn't need to change the pipe code at all.
SGTM. I can change the default value to '0' (or '2') in the next
revision of patch 2/2, unless somebody objects to this.

> Alternatively we could still fix the pipe code so it runs better (but
> it'll be more complex) or to disable uffd faults only in the pipe
> code.
> One thing to keep in mind is that if we change the default, then
> hypervisor hosts running QEMU would need to set:
> vm.userfaultfd = 1
> in /etc/sysctl.conf if postcopy live migration is required, that's not
> particularly concerning constraint for qemu (there are likely other
> tweaks required and it looks less risky than an arbitrary timeout
> which could kill the VM: if the above is forgotten the postcopy live
> migration won't even start and it'll be unnoticeable to the guest).
> The main concern really are future apps that may want to use uffd for
> kernel initiated faults won't be allowed to do so by default anymore,
> those apps will be heavily incentivated to use bounce buffers before
> passing data to syscalls, similarly to the current use case of patch 2/2.
> Comments welcome,
> Andrea
> PS. Another usage of uffd that remains possible without privilege with
> the 2/2 patch sysctl "2" behavior (besides the strict SIGSEGV
> acceleration) is the UFFD_FEATURE_SIGBUS. That's good so a malloc lib
> will remain possible without requiring extra privileges, by adding a
> UFFDIO_POPULATE to use in combination with UFFD_FEATURE_SIGBUS
> (UFFDIO_POPULATE just needs to zero out a page and map it, it'll be
> indistinguishable to UFFDIO_ZEROPAGE but it will solve the last
> performance bottleneck by avoiding a wrprotect fault after the
> allocation and it will be THP capable too). Memory will be freed with
> MADV_DONTNEED, without ever having to call mmap/mumap. It could move
> memory around with UFFDIO_COPY+MADV_DONTNEED or by adding UFFDIO_REMAP
> which already exists.
UFFDIO_POPULATE sounds like a really useful feature. I don't see it in
the kernel yet. Is there a patch under work on this? If so, kindly

  reply	other threads:[~2020-09-05  0:36 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-23  0:26 [PATCH 0/2] Control over userfaultfd kernel-fault handling Daniel Colascione
2020-04-23  0:26 ` [PATCH 1/2] Add UFFD_USER_MODE_ONLY Daniel Colascione
2020-07-24 14:28   ` Michael S. Tsirkin
2020-07-24 14:46     ` Lokesh Gidra
2020-07-26 10:09       ` Michael S. Tsirkin
2020-04-23  0:26 ` [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only Daniel Colascione
2020-05-06 19:38   ` Peter Xu
2020-05-07 19:15     ` Jonathan Corbet
2020-05-20  4:06       ` Andrea Arcangeli
2020-05-08 16:52   ` Michael S. Tsirkin
2020-05-08 16:54     ` Michael S. Tsirkin
2020-05-20  4:59       ` Andrea Arcangeli
2020-05-20 18:03         ` Kees Cook
2020-05-20 19:48           ` Andrea Arcangeli
2020-05-20 19:51             ` Andrea Arcangeli
2020-05-20 20:17               ` Lokesh Gidra
2020-05-20 21:16                 ` Andrea Arcangeli
2020-07-17 12:57                   ` Jeffrey Vander Stoep
2020-07-23 17:30                     ` Lokesh Gidra
2020-07-24  0:13                       ` Nick Kralevich
2020-07-24 13:40                         ` Michael S. Tsirkin
2020-08-06  0:43                           ` Nick Kralevich
2020-08-06  5:44                             ` Michael S. Tsirkin
2020-08-17 22:11                               ` Lokesh Gidra
2020-09-04  3:34                                 ` Andrea Arcangeli
2020-09-05  0:36                                   ` Lokesh Gidra [this message]
2020-09-19 18:14                                     ` Nick Kralevich
2020-07-24 14:01 ` [PATCH 0/2] Control over userfaultfd kernel-fault handling Michael S. Tsirkin
2020-07-24 14:41   ` Lokesh Gidra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \
    --subject='Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only' \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).