LKML Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH] prctl: add PR_[GS]ET_KILLABLE
@ 2018-07-30  7:52 Jürg Billeter
  2018-07-30 10:17 ` Oleg Nesterov
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Jürg Billeter @ 2018-07-30  7:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oleg Nesterov, Eric Biederman, linux-api, linux-kernel,
	Jürg Billeter

PR_SET_KILLABLE clears the SIGNAL_UNKILLABLE flag. This allows
CLONE_NEWPID tasks to restore normal signal behavior, opting out of the
special signal protection for init processes.

This is required for job control in a shell that uses CLONE_NEWPID for
child processes.

This prctl does not allow setting the SIGNAL_UNKILLABLE flag, only
clearing.

Signed-off-by: Jürg Billeter <j@bitron.ch>
---
 include/uapi/linux/prctl.h |  4 ++++
 kernel/sys.c               | 11 +++++++++++
 2 files changed, 15 insertions(+)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c0d7ea0bf5b6..92afb63da727 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -219,4 +219,8 @@ struct prctl_mm_map {
 # define PR_SPEC_DISABLE		(1UL << 2)
 # define PR_SPEC_FORCE_DISABLE		(1UL << 3)
 
+/* Control SIGNAL_UNKILLABLE */
+#define PR_GET_KILLABLE			54
+#define PR_SET_KILLABLE			55
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 38509dc1f77b..264de630d548 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2484,6 +2484,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			return -EINVAL;
 		error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
 		break;
+	case PR_GET_KILLABLE:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = put_user(!(me->signal->flags & SIGNAL_UNKILLABLE),
+				 (int __user *)arg2);
+		break;
+	case PR_SET_KILLABLE:
+		if (arg2 != 1 || arg3 || arg4 || arg5)
+			return -EINVAL;
+		me->signal->flags &= ~SIGNAL_UNKILLABLE;
+		break;
 	default:
 		error = -EINVAL;
 		break;
-- 
2.18.0


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] prctl: add PR_[GS]ET_KILLABLE
  2018-07-30  7:52 [PATCH] prctl: add PR_[GS]ET_KILLABLE Jürg Billeter
@ 2018-07-30 10:17 ` Oleg Nesterov
  2018-07-30 19:32   ` Jürg Billeter
  2018-07-31  7:03 ` [PATCH v2] " Jürg Billeter
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Oleg Nesterov @ 2018-07-30 10:17 UTC (permalink / raw)
  To: Jürg Billeter; +Cc: Andrew Morton, Eric Biederman, linux-api, linux-kernel

On 07/30, Jürg Billeter wrote:
>
> This is required for job control in a shell that uses CLONE_NEWPID for
> child processes.

Could you explain in more details?

> +	case PR_SET_KILLABLE:
> +		if (arg2 != 1 || arg3 || arg4 || arg5)
> +			return -EINVAL;
> +		me->signal->flags &= ~SIGNAL_UNKILLABLE;

this needs spin_lock_irq(me->sighand->siglock).

Oleg.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] prctl: add PR_[GS]ET_KILLABLE
  2018-07-30 10:17 ` Oleg Nesterov
@ 2018-07-30 19:32   ` Jürg Billeter
  2018-07-30 19:39     ` Thomas Gleixner
  0 siblings, 1 reply; 18+ messages in thread
From: Jürg Billeter @ 2018-07-30 19:32 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Andrew Morton, Eric Biederman, linux-api, linux-kernel

On Mon, 2018-07-30 at 12:17 +0200, Oleg Nesterov wrote:
> On 07/30, Jürg Billeter wrote:
> > 
> > This is required for job control in a shell that uses CLONE_NEWPID for
> > child processes.
> 
> Could you explain in more details?

The SIGNAL_UNKILLABLE flag, which is implicitly set for tasks cloned
with CLONE_NEWPID, has the effect of ignoring all signals (from
userspace) if the corresponding handler is set to SIG_DFL. The only
exceptions are SIGKILL and SIGSTOP and they are only accepted if raised
from an ancestor namespace.

SIGINT, SIGQUIT and SIGTSTP are used in job control for ^C, ^\, ^Z.
While a task with the SIGNAL_UNKILLABLE flag could install handlers for
these signals, this is not sufficient to implement a shell that uses
CLONE_NEWPID for child processes:

 * As SIGSTOP is ignored when raised from the SIGNAL_UNKILLABLE process
   itself, I don't think it's possible to implement the stop action in
   a custom SIGTSTP handler.
 * Many applications do not install handlers for these signals and
   thus, job control won't work properly with unmodified applications.

Job control in a shell is just an example. There are other scenarios,
of course, where applications rely on the default actions as described
in signal(7), and PID isolation may be useful. In my opinion, the
kernel support for preventing accidental killing of the "init" process
should really be optional and this new prctl provides this without
breaking backward compatibility.

> > +	case PR_SET_KILLABLE:
> > +		if (arg2 != 1 || arg3 || arg4 || arg5)
> > +			return -EINVAL;
> > +		me->signal->flags &= ~SIGNAL_UNKILLABLE;
> 
> this needs spin_lock_irq(me->sighand->siglock).

Thanks for the review, will fix this for v2.

Jürg


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] prctl: add PR_[GS]ET_KILLABLE
  2018-07-30 19:32   ` Jürg Billeter
@ 2018-07-30 19:39     ` Thomas Gleixner
  0 siblings, 0 replies; 18+ messages in thread
From: Thomas Gleixner @ 2018-07-30 19:39 UTC (permalink / raw)
  To: Jürg Billeter
  Cc: Oleg Nesterov, Andrew Morton, Eric Biederman, linux-api, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1665 bytes --]

On Mon, 30 Jul 2018, Jürg Billeter wrote:

> On Mon, 2018-07-30 at 12:17 +0200, Oleg Nesterov wrote:
> > On 07/30, Jürg Billeter wrote:
> > > 
> > > This is required for job control in a shell that uses CLONE_NEWPID for
> > > child processes.
> > 
> > Could you explain in more details?
> 
> The SIGNAL_UNKILLABLE flag, which is implicitly set for tasks cloned
> with CLONE_NEWPID, has the effect of ignoring all signals (from
> userspace) if the corresponding handler is set to SIG_DFL. The only
> exceptions are SIGKILL and SIGSTOP and they are only accepted if raised
> from an ancestor namespace.
> 
> SIGINT, SIGQUIT and SIGTSTP are used in job control for ^C, ^\, ^Z.
> While a task with the SIGNAL_UNKILLABLE flag could install handlers for
> these signals, this is not sufficient to implement a shell that uses
> CLONE_NEWPID for child processes:
> 
>  * As SIGSTOP is ignored when raised from the SIGNAL_UNKILLABLE process
>    itself, I don't think it's possible to implement the stop action in
>    a custom SIGTSTP handler.
>  * Many applications do not install handlers for these signals and
>    thus, job control won't work properly with unmodified applications.
> 
> Job control in a shell is just an example. There are other scenarios,
> of course, where applications rely on the default actions as described
> in signal(7), and PID isolation may be useful. In my opinion, the
> kernel support for preventing accidental killing of the "init" process
> should really be optional and this new prctl provides this without
> breaking backward compatibility.

This makes sense and exactly that information needs to be in the changelog.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2] prctl: add PR_[GS]ET_KILLABLE
  2018-07-30  7:52 [PATCH] prctl: add PR_[GS]ET_KILLABLE Jürg Billeter
  2018-07-30 10:17 ` Oleg Nesterov
@ 2018-07-31  7:03 ` " Jürg Billeter
  2018-07-31 14:39   ` Oleg Nesterov
  2018-07-31 16:26 ` [PATCH] " Jann Horn
  2018-08-03 14:40 ` [PATCH v3 1/2] fork: do not rely on SIGNAL_UNKILLABLE for init check Jürg Billeter
  3 siblings, 1 reply; 18+ messages in thread
From: Jürg Billeter @ 2018-07-31  7:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oleg Nesterov, Thomas Gleixner, Eric Biederman, linux-api,
	linux-kernel, Jürg Billeter

PR_SET_KILLABLE clears the SIGNAL_UNKILLABLE flag. This allows
CLONE_NEWPID tasks to restore normal signal behavior, opting out of the
special signal protection for init processes. This prctl does not allow
setting the SIGNAL_UNKILLABLE flag, only clearing.

The SIGNAL_UNKILLABLE flag, which is implicitly set for tasks cloned
with CLONE_NEWPID, has the effect of ignoring all signals (from
userspace) if the corresponding handler is set to SIG_DFL. The only
exceptions are SIGKILL and SIGSTOP and they are only accepted if raised
from an ancestor namespace.

SIGINT, SIGQUIT and SIGTSTP are used in job control for ^C, ^\, ^Z.
While a task with the SIGNAL_UNKILLABLE flag could install handlers for
these signals, this is not sufficient to implement a shell that uses
CLONE_NEWPID for child processes:

 * As SIGSTOP is ignored when raised from the SIGNAL_UNKILLABLE process
   itself, it's not possible to implement the stop action in a custom
   SIGTSTP handler.
 * Many applications do not install handlers for these signals and
   thus, job control won't work properly with unmodified applications.

There are other scenarios besides job control in a shell where
applications rely on the default actions as described in signal(7) and
PID isolation may be useful. This new prctl makes the signal protection
for "init" processes optional, without breaking backward compatibility.

Signed-off-by: Jürg Billeter <j@bitron.ch>
---
 v2: Hold siglock for PR_SET_KILLABLE, expand commit message.

 include/uapi/linux/prctl.h |  4 ++++
 kernel/sys.c               | 13 +++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c0d7ea0bf5b6..92afb63da727 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -219,4 +219,8 @@ struct prctl_mm_map {
 # define PR_SPEC_DISABLE		(1UL << 2)
 # define PR_SPEC_FORCE_DISABLE		(1UL << 3)
 
+/* Control SIGNAL_UNKILLABLE */
+#define PR_GET_KILLABLE			54
+#define PR_SET_KILLABLE			55
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 38509dc1f77b..92c9322cfb98 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2484,6 +2484,19 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			return -EINVAL;
 		error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
 		break;
+	case PR_GET_KILLABLE:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = put_user(!(me->signal->flags & SIGNAL_UNKILLABLE),
+				 (int __user *)arg2);
+		break;
+	case PR_SET_KILLABLE:
+		if (arg2 != 1 || arg3 || arg4 || arg5)
+			return -EINVAL;
+		spin_lock_irq(&me->sighand->siglock);
+		me->signal->flags &= ~SIGNAL_UNKILLABLE;
+		spin_unlock_irq(&me->sighand->siglock);
+		break;
 	default:
 		error = -EINVAL;
 		break;
-- 
2.18.0


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] prctl: add PR_[GS]ET_KILLABLE
  2018-07-31  7:03 ` [PATCH v2] " Jürg Billeter
@ 2018-07-31 14:39   ` Oleg Nesterov
  2018-07-31 16:12     ` Jürg Billeter
  0 siblings, 1 reply; 18+ messages in thread
From: Oleg Nesterov @ 2018-07-31 14:39 UTC (permalink / raw)
  To: Jürg Billeter
  Cc: Andrew Morton, Thomas Gleixner, Eric Biederman, linux-api, linux-kernel

On 07/31, Jürg Billeter wrote:
>
> PR_SET_KILLABLE clears the SIGNAL_UNKILLABLE flag. This allows
> CLONE_NEWPID tasks to restore normal signal behavior, opting out of the
> special signal protection for init processes. This prctl does not allow
> setting the SIGNAL_UNKILLABLE flag, only clearing.
>
> The SIGNAL_UNKILLABLE flag, which is implicitly set for tasks cloned
> with CLONE_NEWPID, has the effect of ignoring all signals (from
> userspace) if the corresponding handler is set to SIG_DFL. The only
> exceptions are SIGKILL and SIGSTOP and they are only accepted if raised
> from an ancestor namespace.
>
> SIGINT, SIGQUIT and SIGTSTP are used in job control for ^C, ^\, ^Z.
> While a task with the SIGNAL_UNKILLABLE flag could install handlers for
> these signals, this is not sufficient to implement a shell that uses
> CLONE_NEWPID for child processes:

Ah. My question wasn't clear, sorry.

Could you explain your use-case? Why a shell wants to use CLONE_NEWPID?
And what do we actually want in, say, ^Z case? Just stop the child reaper
or may be it would be better to stop the whole pid namespace?

>  * As SIGSTOP is ignored when raised from the SIGNAL_UNKILLABLE process
>    itself, it's not possible to implement the stop action in a custom
>    SIGTSTP handler.

Yes. So may be we actually want to change __isig() paths to use
SEND_SIG_FORCED (this is not that simple), or perhaps we can change
__send_signal() to not drop SIGSTOP sent to itself, or may be we can even
introduce SIG_DFL_EVEN_IF_INIT, I dunno.

>  * Many applications do not install handlers for these signals and
>    thus, job control won't work properly with unmodified applications.

I can't understand this. An application should be changed anyway to do
PR_SET_KILLABLE?


Let me clarify. I am not arguing with this patch, probably it makes sense in
any case. I am just trying to understand your real motivation for this change.



> +	case PR_SET_KILLABLE:
> +		if (arg2 != 1 || arg3 || arg4 || arg5)
> +			return -EINVAL;
> +		spin_lock_irq(&me->sighand->siglock);
> +		me->signal->flags &= ~SIGNAL_UNKILLABLE;
> +		spin_unlock_irq(&me->sighand->siglock);

OK, but then you need to change the CLONE_PARENT/SIGNAL_UNKILLABLE check
in copy_process().

Oleg.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] prctl: add PR_[GS]ET_KILLABLE
  2018-07-31 14:39   ` Oleg Nesterov
@ 2018-07-31 16:12     ` Jürg Billeter
  2018-08-01 14:19       ` Oleg Nesterov
  0 siblings, 1 reply; 18+ messages in thread
From: Jürg Billeter @ 2018-07-31 16:12 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Thomas Gleixner, Eric Biederman, linux-api, linux-kernel

On Tue, 2018-07-31 at 16:39 +0200, Oleg Nesterov wrote:
> On 07/31, Jürg Billeter wrote:
> > SIGINT, SIGQUIT and SIGTSTP are used in job control for ^C, ^\, ^Z.
> > While a task with the SIGNAL_UNKILLABLE flag could install handlers for
> > these signals, this is not sufficient to implement a shell that uses
> > CLONE_NEWPID for child processes:
> 
> Ah. My question wasn't clear, sorry.
> 
> Could you explain your use-case? Why a shell wants to use
> CLONE_NEWPID?

To guarantee that there won't be any runaway processes, i.e., ensure
that no descendants (background helper daemons or misbehaving
processes) survive when the child process is terminated. And to prevent
children from killing their ancestors. This is not something that can
be always-on in all shells, but it could be an option for users that
want this control/isolation.

> And what do we actually want in, say, ^Z case? Just stop the child reaper
> or may be it would be better to stop the whole pid namespace?

Stopping the whole PID namespace would be interesting, however, I think
this should be discussed separately if and when there is a proposal to
support this. For now the process group is stopped, same as without PID
namespaces.

> >  * As SIGSTOP is ignored when raised from the SIGNAL_UNKILLABLE process
> >    itself, it's not possible to implement the stop action in a custom
> >    SIGTSTP handler.
> 
> Yes. So may be we actually want to change __isig() paths to use
> SEND_SIG_FORCED (this is not that simple), or perhaps we can change
> __send_signal() to not drop SIGSTOP sent to itself, or may be we can even
> introduce SIG_DFL_EVEN_IF_INIT, I dunno.

In my opinion, my patch is much simpler and also more general as it
covers all scenarios where regular signal handling is required or
desired for "init" processes, with minimal code changes (after
PR_SET_KILLABLE, binaries that expect SIG_DFL to work can be executed
without changes).

> >  * Many applications do not install handlers for these signals and
> >    thus, job control won't work properly with unmodified applications.
> 
> I can't understand this. An application should be changed anyway to do
> PR_SET_KILLABLE?

PR_SET_KILLABLE can be called (e.g., by the shell) between clone() and
execve(). (Some applications may have issues running as subreaper but
that's a separate matter, signal handling will work as expected).

> > +	case PR_SET_KILLABLE:
> > +		if (arg2 != 1 || arg3 || arg4 || arg5)
> > +			return -EINVAL;
> > +		spin_lock_irq(&me->sighand->siglock);
> > +		me->signal->flags &= ~SIGNAL_UNKILLABLE;
> > +		spin_unlock_irq(&me->sighand->siglock);
> 
> OK, but then you need to change the CLONE_PARENT/SIGNAL_UNKILLABLE check
> in copy_process().

Good point, need a different check for the PID namespace root process
in copy_process().

Thanks,
Jürg


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] prctl: add PR_[GS]ET_KILLABLE
  2018-07-30  7:52 [PATCH] prctl: add PR_[GS]ET_KILLABLE Jürg Billeter
  2018-07-30 10:17 ` Oleg Nesterov
  2018-07-31  7:03 ` [PATCH v2] " Jürg Billeter
@ 2018-07-31 16:26 ` " Jann Horn
  2018-08-01  7:43   ` Jürg Billeter
  2018-08-03 14:40 ` [PATCH v3 1/2] fork: do not rely on SIGNAL_UNKILLABLE for init check Jürg Billeter
  3 siblings, 1 reply; 18+ messages in thread
From: Jann Horn @ 2018-07-31 16:26 UTC (permalink / raw)
  To: j; +Cc: Andrew Morton, Oleg Nesterov, Eric W. Biederman, Linux API, kernel list

On Mon, Jul 30, 2018 at 10:01 AM Jürg Billeter <j@bitron.ch> wrote:
>
> PR_SET_KILLABLE clears the SIGNAL_UNKILLABLE flag. This allows
> CLONE_NEWPID tasks to restore normal signal behavior, opting out of the
> special signal protection for init processes.
>
> This is required for job control in a shell that uses CLONE_NEWPID for
> child processes.
>
> This prctl does not allow setting the SIGNAL_UNKILLABLE flag, only
> clearing.
>
> Signed-off-by: Jürg Billeter <j@bitron.ch>
> ---
[...]
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 38509dc1f77b..264de630d548 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
[...]
> +       case PR_SET_KILLABLE:
> +               if (arg2 != 1 || arg3 || arg4 || arg5)
> +                       return -EINVAL;
> +               me->signal->flags &= ~SIGNAL_UNKILLABLE;
> +               break;

I don't have an opinion on this patchset otherwise, but should this
prctl maybe block PR_SET_KILLABLE if you're actually the real init
process? This seems like it could potentially lead to weird things.

This code in kernel/fork.c seems to rely on the fact that global init
is SIGNAL_UNKILLABLE, and probably also leads to weirdness if
container init is non-SIGNAL_UNKILLABLE:

        /*
         * Siblings of global init remain as zombies on exit since they are
         * not reaped by their parent (swapper). To solve this and to avoid
         * multi-rooted process trees, prevent global and container-inits
         * from creating siblings.
         */
        if ((clone_flags & CLONE_PARENT) &&
                                current->signal->flags & SIGNAL_UNKILLABLE)
                return ERR_PTR(-EINVAL);

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] prctl: add PR_[GS]ET_KILLABLE
  2018-07-31 16:26 ` [PATCH] " Jann Horn
@ 2018-08-01  7:43   ` Jürg Billeter
  2018-08-01  7:56     ` Jann Horn
  0 siblings, 1 reply; 18+ messages in thread
From: Jürg Billeter @ 2018-08-01  7:43 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Oleg Nesterov, Eric W. Biederman, Linux API, kernel list

On Tue, 2018-07-31 at 18:26 +0200, Jann Horn wrote:
> On Mon, Jul 30, 2018 at 10:01 AM Jürg Billeter <j@bitron.ch> wrote:
> 
> [...]
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index 38509dc1f77b..264de630d548 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> 
> [...]
> > +       case PR_SET_KILLABLE:
> > +               if (arg2 != 1 || arg3 || arg4 || arg5)
> > +                       return -EINVAL;
> > +               me->signal->flags &= ~SIGNAL_UNKILLABLE;
> > +               break;
> 
> I don't have an opinion on this patchset otherwise, but should this
> prctl maybe block PR_SET_KILLABLE if you're actually the real init
> process? This seems like it could potentially lead to weird things.

While I don't expect global init to use this, I can't think of a good
reason to disallow it in the kernel. Do you have specific concerns or
is the code in kernel/fork.c the only reason? I prefer avoiding special
cases unless really required.

> This code in kernel/fork.c seems to rely on the fact that global init
> is SIGNAL_UNKILLABLE, and probably also leads to weirdness if
> container init is non-SIGNAL_UNKILLABLE:

Yes, Oleg has mentioned this as well. I have to change copy_process()
to directly check for the PID namespace root process instead of
checking for SIGNAL_UNKILLABLE.

Jürg


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] prctl: add PR_[GS]ET_KILLABLE
  2018-08-01  7:43   ` Jürg Billeter
@ 2018-08-01  7:56     ` Jann Horn
  0 siblings, 0 replies; 18+ messages in thread
From: Jann Horn @ 2018-08-01  7:56 UTC (permalink / raw)
  To: j; +Cc: Andrew Morton, Oleg Nesterov, Eric W. Biederman, Linux API, kernel list

On Wed, Aug 1, 2018 at 9:44 AM Jürg Billeter <j@bitron.ch> wrote:
>
> On Tue, 2018-07-31 at 18:26 +0200, Jann Horn wrote:
> > On Mon, Jul 30, 2018 at 10:01 AM Jürg Billeter <j@bitron.ch> wrote:
> >
> > [...]
> > > diff --git a/kernel/sys.c b/kernel/sys.c
> > > index 38509dc1f77b..264de630d548 100644
> > > --- a/kernel/sys.c
> > > +++ b/kernel/sys.c
> >
> > [...]
> > > +       case PR_SET_KILLABLE:
> > > +               if (arg2 != 1 || arg3 || arg4 || arg5)
> > > +                       return -EINVAL;
> > > +               me->signal->flags &= ~SIGNAL_UNKILLABLE;
> > > +               break;
> >
> > I don't have an opinion on this patchset otherwise, but should this
> > prctl maybe block PR_SET_KILLABLE if you're actually the real init
> > process? This seems like it could potentially lead to weird things.
>
> While I don't expect global init to use this, I can't think of a good
> reason to disallow it in the kernel. Do you have specific concerns or
> is the code in kernel/fork.c the only reason?

No, I don't have any other specific concerns.

> I prefer avoiding special cases unless really required.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] prctl: add PR_[GS]ET_KILLABLE
  2018-07-31 16:12     ` Jürg Billeter
@ 2018-08-01 14:19       ` Oleg Nesterov
  2018-08-03 10:15         ` Jürg Billeter
  0 siblings, 1 reply; 18+ messages in thread
From: Oleg Nesterov @ 2018-08-01 14:19 UTC (permalink / raw)
  To: Jürg Billeter
  Cc: Andrew Morton, Thomas Gleixner, Eric Biederman, linux-api, linux-kernel

On 07/31, Jürg Billeter wrote:
>
> > Could you explain your use-case? Why a shell wants to use
> > CLONE_NEWPID?
>
> To guarantee that there won't be any runaway processes, i.e., ensure
> that no descendants (background helper daemons or misbehaving
> processes) survive when the child process is terminated.

We already have PR_SET_CHILD_SUBREAPER.

Perhaps we can finally add PR_KILL_MY_DESCENDANTS_ON_EXIT? This was already
discussed some time ago, but I can't find the previous discussion... Simple
to implement.

> And to prevent
> children from killing their ancestors.

OK, this is the only reason for CLONE_NEWPID which I can understand so far.
Not that I understand why this is that useful ;)

> > >  * As SIGSTOP is ignored when raised from the SIGNAL_UNKILLABLE process
> > >    itself, it's not possible to implement the stop action in a custom
> > >    SIGTSTP handler.
> >
> > Yes. So may be we actually want to change __isig() paths to use
> > SEND_SIG_FORCED (this is not that simple), or perhaps we can change
> > __send_signal() to not drop SIGSTOP sent to itself, or may be we can even
> > introduce SIG_DFL_EVEN_IF_INIT, I dunno.
>
> In my opinion, my patch is much simpler and also more general as it

Yes, yes, let me repeat that I am not arguing with your patch, I am just trying
to understand what

> > I can't understand this. An application should be changed anyway to do
> > PR_SET_KILLABLE?
>
> PR_SET_KILLABLE can be called (e.g., by the shell) between clone() and
> execve().

OK, this is true.

Oleg.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] prctl: add PR_[GS]ET_KILLABLE
  2018-08-01 14:19       ` Oleg Nesterov
@ 2018-08-03 10:15         ` Jürg Billeter
  2018-08-03 12:14           ` Oleg Nesterov
  2018-08-03 13:34           ` Eric W. Biederman
  0 siblings, 2 replies; 18+ messages in thread
From: Jürg Billeter @ 2018-08-03 10:15 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Thomas Gleixner, Eric Biederman, linux-api, linux-kernel

On Wed, 2018-08-01 at 16:19 +0200, Oleg Nesterov wrote:
> On 07/31, Jürg Billeter wrote:
> > 
> > > Could you explain your use-case? Why a shell wants to use
> > > CLONE_NEWPID?
> > 
> > To guarantee that there won't be any runaway processes, i.e., ensure
> > that no descendants (background helper daemons or misbehaving
> > processes) survive when the child process is terminated.
> 
> We already have PR_SET_CHILD_SUBREAPER.
> 
> Perhaps we can finally add PR_KILL_MY_DESCENDANTS_ON_EXIT? This was already
> discussed some time ago, but I can't find the previous discussion... Simple
> to implement.

This would definitely be an option. You mentioned it last October in
the PR_SET_PDEATHSIG_PROC discussion¹. However, as PID namespaces
already exist and appear to be a good fit for the most part, I think it
makes sense to just add the missing pieces to PID namespaces instead of
duplicating part of the PID namespace functionality.

Also, based on Eric's comment in that other discussion about
no_new_privs not being allowed to increase the attack surface,
PR_KILL_MY_DESCENDANTS_ON_EXIT might require CAP_SYS_ADMIN as well (due
to setuid children). In which case the only potential benefit would be
that it still allows the child to kill arbitrary processes, as far as I
can tell.

> > And to prevent children from killing their ancestors.
> 
> OK, this is the only reason for CLONE_NEWPID which I can understand so far.
> Not that I understand why this is that useful ;)

The overall goal is increasing isolation between (some) child processes
and the rest of the system. Isolation from runaway processes and
isolation from signals are independent aspects and it could be useful
to control them independently. However, I also expect it to be common
that both are wanted at the same time.

Jürg

¹ https://lkml.org/lkml/2017/10/5/546


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] prctl: add PR_[GS]ET_KILLABLE
  2018-08-03 10:15         ` Jürg Billeter
@ 2018-08-03 12:14           ` Oleg Nesterov
  2018-08-03 13:34           ` Eric W. Biederman
  1 sibling, 0 replies; 18+ messages in thread
From: Oleg Nesterov @ 2018-08-03 12:14 UTC (permalink / raw)
  To: Jürg Billeter
  Cc: Andrew Morton, Thomas Gleixner, Eric Biederman, linux-api, linux-kernel

On 08/03, Jürg Billeter wrote:
>
> On Wed, 2018-08-01 at 16:19 +0200, Oleg Nesterov wrote:
> > On 07/31, Jürg Billeter wrote:
> > >
> > > > Could you explain your use-case? Why a shell wants to use
> > > > CLONE_NEWPID?
> > >
> > > To guarantee that there won't be any runaway processes, i.e., ensure
> > > that no descendants (background helper daemons or misbehaving
> > > processes) survive when the child process is terminated.
> >
> > We already have PR_SET_CHILD_SUBREAPER.
> >
> > Perhaps we can finally add PR_KILL_MY_DESCENDANTS_ON_EXIT? This was already
> > discussed some time ago, but I can't find the previous discussion... Simple
> > to implement.
>
> This would definitely be an option. You mentioned it last October in
> the PR_SET_PDEATHSIG_PROC discussion¹. However, as PID namespaces
> already exist and appear to be a good fit for the most part,

Sure, if CLONE_NEWPID fits your needs you can use it,

> I think it
> makes sense to just add the missing pieces to PID namespaces instead of
> duplicating part of the PID namespace functionality.

Again, I am not arguing with your change.

PR_KILL_MY_DESCENDANTS_ON_EXIT can make sense just like PR_SET_CHILD_SUBREAPER
even if PID namespace functionality implies both. Simply because CLONE_NEWPID
is not necessarily the best tool, if nothing else you do not necessarily want
the pid isolation.

> Also, based on Eric's comment in that other discussion about
> no_new_privs not being allowed to increase the attack surface,
> PR_KILL_MY_DESCENDANTS_ON_EXIT might require CAP_SYS_ADMIN as well (due
> to setuid children).

No, no, the exiting parent should simply do group_send_sig_info(SIGKILL)
for every descendant and rely on check_kill_permission().

OK, lets forget it for now.

Oleg.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] prctl: add PR_[GS]ET_KILLABLE
  2018-08-03 10:15         ` Jürg Billeter
  2018-08-03 12:14           ` Oleg Nesterov
@ 2018-08-03 13:34           ` Eric W. Biederman
  2018-08-03 14:39             ` Jürg Billeter
  1 sibling, 1 reply; 18+ messages in thread
From: Eric W. Biederman @ 2018-08-03 13:34 UTC (permalink / raw)
  To: Jürg Billeter
  Cc: Oleg Nesterov, Andrew Morton, Thomas Gleixner, linux-api, linux-kernel

Jürg Billeter <j@bitron.ch> writes:

> On Wed, 2018-08-01 at 16:19 +0200, Oleg Nesterov wrote:
>> On 07/31, Jürg Billeter wrote:
>> > 
>> > > Could you explain your use-case? Why a shell wants to use
>> > > CLONE_NEWPID?
>> > 
>> > To guarantee that there won't be any runaway processes, i.e., ensure
>> > that no descendants (background helper daemons or misbehaving
>> > processes) survive when the child process is terminated.
>> 
>> We already have PR_SET_CHILD_SUBREAPER.
>> 
>> Perhaps we can finally add PR_KILL_MY_DESCENDANTS_ON_EXIT? This was already
>> discussed some time ago, but I can't find the previous discussion... Simple
>> to implement.
>
> This would definitely be an option. You mentioned it last October in
> the PR_SET_PDEATHSIG_PROC discussion¹. However, as PID namespaces
> already exist and appear to be a good fit for the most part, I think it
> makes sense to just add the missing pieces to PID namespaces instead of
> duplicating part of the PID namespace functionality.
>
> Also, based on Eric's comment in that other discussion about
> no_new_privs not being allowed to increase the attack surface,
> PR_KILL_MY_DESCENDANTS_ON_EXIT might require CAP_SYS_ADMIN as well (due
> to setuid children). In which case the only potential benefit would be
> that it still allows the child to kill arbitrary processes, as far as I
> can tell.

We don't require CAP_SYS_ADMIN if it is a session and so I think a
similar allowance can be made for PR_KILL_MY_DESCENDANTS_ON_EXIT.  There
is a long standing tradition of being able to kill your own descendants
in linux.  I don't think this allows anything that the tranditional
session allowance for killing process won't.


From the other direction I think we can just go ahead and fix handling
of the job control stop signals as well.  As far as I understand it
there is a legitimate complaint that SIGTSTP SIGTTIN SIGTTOU do not work
on a pid namespace leader.

The current implementation actual overshoots.  We only need to ignore
signals from the descendants in the pid namespace.  Ideally signals from
other processes are treated like normal.  We have only been able to
apply that ideal to SIGSTOP and SIGKILL as we can handle them in
prepare_signal.  Other signals can be blocked which means the logic to
handle them needs to live in get_signal where we may have no sender
information.


Signals with signal handlers we treat as normal.
Signals with whose default action is to ignore the signal we treat as
normal.

If a process is not in a context where job control has been set up then
SIGTSTP SIGTTIN and SIGTTOU are ignored.  I believe a typical init
process lives in just such an environment.  So I think we can safely
remove the special handling for the job control stops and not have
anyone care.

The rule is that the process group of the process must have a parent in
the same session, or the job control signals are ignored.

A typical init processes calls setsid, which guarantees it has no
parents in the same session.  So the default action of the job control
stops will be to ignore the signal.

A process once a session leader will always be a session leader, and
will never have any parents in a different pgrp in the same session.

So I think this gives us wiggle room needed to just fix this behavior.

Let's see.

For the signals SIGTSTP SIGTTIN and SIGTTOU if we are the typical init
process and we are a session leader we simply don't care who sends those
signals they will be ignored.


So I say we double check my assumption.  Look at sysv init, busy box,
upstart, systemd, whatever android uses, and the container runtimes
light weight inits.  Document it in a change log and just remove the
special case.

If except when handling job control signals is interesting init always
winds up a signal group leader I can't see the point in forcing init
to ignore the job control stop signals.

> ¹ https://lkml.org/lkml/2017/10/5/546

In the future please use mesage-id based links to email disccussions.
That way people can look up the conversations in other email archives.


Eric


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] prctl: add PR_[GS]ET_KILLABLE
  2018-08-03 13:34           ` Eric W. Biederman
@ 2018-08-03 14:39             ` Jürg Billeter
  0 siblings, 0 replies; 18+ messages in thread
From: Jürg Billeter @ 2018-08-03 14:39 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, Andrew Morton, Thomas Gleixner, linux-api, linux-kernel

On Fri, 2018-08-03 at 08:34 -0500, Eric W. Biederman wrote:
> From the other direction I think we can just go ahead and fix handling
> of the job control stop signals as well.  As far as I understand it
> there is a legitimate complaint that SIGTSTP SIGTTIN SIGTTOU do not work
> on a pid namespace leader.
> 
> The current implementation actual overshoots.  We only need to ignore
> signals from the descendants in the pid namespace.  Ideally signals from
> other processes are treated like normal.  We have only been able to
> apply that ideal to SIGSTOP and SIGKILL as we can handle them in
> prepare_signal.  Other signals can be blocked which means the logic to
> handle them needs to live in get_signal where we may have no sender
> information.

SIGINT and SIGQUIT are also relevant for job control. Would the same
approach be possible for them?

And I would like to allow regular POSIX signal behavior also for
signals used outside job control, e.g., SIGTERM, for maximum
compatibility with existing applications. Furthermore, it would also be
good to allow a PID namespace leader to send a signal to itself.

Do you think we can and should cover all of the above without a prctl
by loosening the restrictions imposed by SIGNAL_UNKILLABLE (with
reasonable effort)?

In my opinion, my patch still makes sense as it simply allows regular
POSIX signal behavior for PID namespace leaders and it doesn't risk any
compatibility issues as the behavior doesn't change at all for
processes that don't invoke the new prctl. I.e., simple patch, low
risk, and covers all signals.

In the meantime I've tested the missing patch for copy_process() and
will send out v3 of the patch in case the new prctl makes sense after
all.

Jürg


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v3 1/2] fork: do not rely on SIGNAL_UNKILLABLE for init check
  2018-07-30  7:52 [PATCH] prctl: add PR_[GS]ET_KILLABLE Jürg Billeter
                   ` (2 preceding siblings ...)
  2018-07-31 16:26 ` [PATCH] " Jann Horn
@ 2018-08-03 14:40 ` Jürg Billeter
  2018-08-03 14:40   ` [PATCH v3 2/2] prctl: add PR_[GS]ET_KILLABLE Jürg Billeter
  3 siblings, 1 reply; 18+ messages in thread
From: Jürg Billeter @ 2018-08-03 14:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oleg Nesterov, Thomas Gleixner, Eric Biederman, linux-api,
	linux-kernel, Jürg Billeter

copy_process() currently checks the SIGNAL_UNKILLABLE flag to determine
whether to accept CLONE_PARENT. In preparation for allowing init
processes to opt out of SIGNAL_UNKILLABLE, directly check whether the
process is an init process with is_child_reaper().

Signed-off-by: Jürg Billeter <j@bitron.ch>
---
 kernel/fork.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 1b27babc4c78..c019ce461556 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1646,7 +1646,7 @@ static __latent_entropy struct task_struct *copy_process(
 	 * from creating siblings.
 	 */
 	if ((clone_flags & CLONE_PARENT) &&
-				current->signal->flags & SIGNAL_UNKILLABLE)
+				is_child_reaper(task_tgid(current)))
 		return ERR_PTR(-EINVAL);
 
 	/*
-- 
2.18.0


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v3 2/2] prctl: add PR_[GS]ET_KILLABLE
  2018-08-03 14:40 ` [PATCH v3 1/2] fork: do not rely on SIGNAL_UNKILLABLE for init check Jürg Billeter
@ 2018-08-03 14:40   ` Jürg Billeter
  2018-09-06 22:42     ` Andrew Morton
  0 siblings, 1 reply; 18+ messages in thread
From: Jürg Billeter @ 2018-08-03 14:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oleg Nesterov, Thomas Gleixner, Eric Biederman, linux-api,
	linux-kernel, Jürg Billeter

PR_SET_KILLABLE clears the SIGNAL_UNKILLABLE flag. This allows
CLONE_NEWPID tasks to restore normal signal behavior, opting out of the
special signal protection for init processes. This prctl does not allow
setting the SIGNAL_UNKILLABLE flag, only clearing.

The SIGNAL_UNKILLABLE flag, which is implicitly set for tasks cloned
with CLONE_NEWPID, has the effect of ignoring all signals (from
userspace) if the corresponding handler is set to SIG_DFL. The only
exceptions are SIGKILL and SIGSTOP and they are only accepted if raised
from an ancestor namespace.

SIGINT, SIGQUIT and SIGTSTP are used in job control for ^C, ^\, ^Z.
While a task with the SIGNAL_UNKILLABLE flag could install handlers for
these signals, this is not sufficient to implement a shell that uses
CLONE_NEWPID for child processes:

 * As SIGSTOP is ignored when raised from the SIGNAL_UNKILLABLE process
   itself, it's not possible to implement the stop action in a custom
   SIGTSTP handler.
 * Many applications do not install handlers for these signals and
   thus, job control won't work properly with unmodified applications.

There are other scenarios besides job control in a shell where
applications rely on the default actions as described in signal(7) and
PID isolation may be useful. This new prctl makes the signal protection
for "init" processes optional, without breaking backward compatibility.

Signed-off-by: Jürg Billeter <j@bitron.ch>
---
 include/uapi/linux/prctl.h |  4 ++++
 kernel/sys.c               | 13 +++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c0d7ea0bf5b6..92afb63da727 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -219,4 +219,8 @@ struct prctl_mm_map {
 # define PR_SPEC_DISABLE		(1UL << 2)
 # define PR_SPEC_FORCE_DISABLE		(1UL << 3)
 
+/* Control SIGNAL_UNKILLABLE */
+#define PR_GET_KILLABLE			54
+#define PR_SET_KILLABLE			55
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 38509dc1f77b..92c9322cfb98 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2484,6 +2484,19 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			return -EINVAL;
 		error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
 		break;
+	case PR_GET_KILLABLE:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = put_user(!(me->signal->flags & SIGNAL_UNKILLABLE),
+				 (int __user *)arg2);
+		break;
+	case PR_SET_KILLABLE:
+		if (arg2 != 1 || arg3 || arg4 || arg5)
+			return -EINVAL;
+		spin_lock_irq(&me->sighand->siglock);
+		me->signal->flags &= ~SIGNAL_UNKILLABLE;
+		spin_unlock_irq(&me->sighand->siglock);
+		break;
 	default:
 		error = -EINVAL;
 		break;
-- 
2.18.0


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/2] prctl: add PR_[GS]ET_KILLABLE
  2018-08-03 14:40   ` [PATCH v3 2/2] prctl: add PR_[GS]ET_KILLABLE Jürg Billeter
@ 2018-09-06 22:42     ` Andrew Morton
  0 siblings, 0 replies; 18+ messages in thread
From: Andrew Morton @ 2018-09-06 22:42 UTC (permalink / raw)
  To: Jürg Billeter
  Cc: Oleg Nesterov, Thomas Gleixner, Eric Biederman, linux-api, linux-kernel

On Fri,  3 Aug 2018 16:40:21 +0200 Jürg Billeter <j@bitron.ch> wrote:

> PR_SET_KILLABLE clears the SIGNAL_UNKILLABLE flag. This allows
> CLONE_NEWPID tasks to restore normal signal behavior, opting out of the
> special signal protection for init processes. This prctl does not allow
> setting the SIGNAL_UNKILLABLE flag, only clearing.
> 
> The SIGNAL_UNKILLABLE flag, which is implicitly set for tasks cloned
> with CLONE_NEWPID, has the effect of ignoring all signals (from
> userspace) if the corresponding handler is set to SIG_DFL. The only
> exceptions are SIGKILL and SIGSTOP and they are only accepted if raised
> from an ancestor namespace.
> 
> SIGINT, SIGQUIT and SIGTSTP are used in job control for ^C, ^\, ^Z.
> While a task with the SIGNAL_UNKILLABLE flag could install handlers for
> these signals, this is not sufficient to implement a shell that uses
> CLONE_NEWPID for child processes:
> 
>  * As SIGSTOP is ignored when raised from the SIGNAL_UNKILLABLE process
>    itself, it's not possible to implement the stop action in a custom
>    SIGTSTP handler.
>  * Many applications do not install handlers for these signals and
>    thus, job control won't work properly with unmodified applications.
> 
> There are other scenarios besides job control in a shell where
> applications rely on the default actions as described in signal(7) and
> PID isolation may be useful. This new prctl makes the signal protection
> for "init" processes optional, without breaking backward compatibility.

This one is above my pay grade.  Eric & Oleg: could you please provide
input?


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, back to index

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-30  7:52 [PATCH] prctl: add PR_[GS]ET_KILLABLE Jürg Billeter
2018-07-30 10:17 ` Oleg Nesterov
2018-07-30 19:32   ` Jürg Billeter
2018-07-30 19:39     ` Thomas Gleixner
2018-07-31  7:03 ` [PATCH v2] " Jürg Billeter
2018-07-31 14:39   ` Oleg Nesterov
2018-07-31 16:12     ` Jürg Billeter
2018-08-01 14:19       ` Oleg Nesterov
2018-08-03 10:15         ` Jürg Billeter
2018-08-03 12:14           ` Oleg Nesterov
2018-08-03 13:34           ` Eric W. Biederman
2018-08-03 14:39             ` Jürg Billeter
2018-07-31 16:26 ` [PATCH] " Jann Horn
2018-08-01  7:43   ` Jürg Billeter
2018-08-01  7:56     ` Jann Horn
2018-08-03 14:40 ` [PATCH v3 1/2] fork: do not rely on SIGNAL_UNKILLABLE for init check Jürg Billeter
2018-08-03 14:40   ` [PATCH v3 2/2] prctl: add PR_[GS]ET_KILLABLE Jürg Billeter
2018-09-06 22:42     ` Andrew Morton

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lkml.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lkml.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lkml.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lkml.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lkml.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lkml.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lkml.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lkml.kernel.org/lkml/7 lkml/git/7.git
	git clone --mirror https://lkml.kernel.org/lkml/8 lkml/git/8.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lkml.kernel.org/lkml \
		linux-kernel@vger.kernel.org
	public-inbox-index lkml

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git