LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [bisected] Stack overflow after fs: "switch the IO-triggering parts of umount to fs_pin" (was net namespaces kernel stack overflow)
       [not found] ` <ef45eb94-4480-0683-ccc0-eb1efaf7e96c@virtuozzo.com>
@ 2018-04-19 12:50   ` Kirill Tkhai
  2018-04-19 15:34     ` Al Viro
  0 siblings, 1 reply; 5+ messages in thread
From: Kirill Tkhai @ 2018-04-19 12:50 UTC (permalink / raw)
  To: Alexander Aring, Al Viro, linux-kernel; +Cc: netdev, Jamal Hadi Salim

Hi, Al,

commit 87b95ce0964c016ede92763be9c164e49f1019e9 is the first after which the below test crashes the kernel:

    Author: Al Viro <viro@zeniv.linux.org.uk>
    Date:   Sat Jan 10 19:01:08 2015 -0500

    switch the IO-triggering parts of umount to fs_pin
    
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

$modprobe dummy

$while true
 do
     mkdir /var/run/netns
     touch /var/run/netns/init_net
     mount --bind /proc/1/ns/net /var/run/netns/init_net

     ip netns add foo
     ip netns exec foo ip link add dummy0 type dummy
     ip netns delete foo
done

[   22.058349] ip (3249) used greatest stack depth: 8 bytes left
[   22.182195] BUG: unable to handle kernel paging request at 000000035bb1f080
[   22.183065] IP: [<ffffffff810718e4>] kick_process+0x34/0x80
[   22.183065] PGD 0 
[   22.183065] Thread overran stack, or stack corrupted
[   22.183065] Oops: 0000 [#1] PREEMPT SMP 
[   22.183065] CPU: 1 PID: 3255 Comm: ip Not tainted 3.19.0-rc5+ #111
[   22.183065] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014
[   22.183065] task: ffff88007c475100 ti: ffff88007b3cc000 task.ti: ffff88007b3cc000
[   22.183065] RIP: 0010:[<ffffffff810718e4>]  [<ffffffff810718e4>] kick_process+0x34/0x80
[   22.183065] RSP: 0018:ffff88007b3cfcf8  EFLAGS: 00010293
[   22.183065] RAX: 0000000000012900 RBX: ffff88007c475100 RCX: ffff88007b20e7b8
[   22.183065] RDX: 000000007b3cc028 RSI: ffffffff819b05f8 RDI: ffffffff819cb999
[   22.183065] RBP: ffff88007b3cfd08 R08: ffffffff81cbf688 R09: ffff88007d3d0810
[   22.183065] R10: ffff88007fc933c8 R11: 0000000000000000 R12: 000000007b3cc028
[   22.183065] R13: ffff88007c475100 R14: 0000000000000000 R15: 00007fff7793a448
[   22.183065] FS:  00007fc987546700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
[   22.183065] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   22.183065] CR2: 000000035bb1f080 CR3: 0000000001c11000 CR4: 00000000000006e0
[   22.183065] Stack:
[   22.183065]  ffff88007c3b67b8 ffff88007b3cfd98 ffff88007b3cfd18 ffffffff81066b05
[   22.183065]  ffff88007b3cfd38 ffffffff81176f4c ffff88007b3cfd48 ffff88007c3b68a0
[   22.183065]  ffff88007b3cfd48 ffffffff8117777f ffff88007b3cfd68 ffffffff81177a49
[   22.183065] Call Trace:
[   22.183065]  [<ffffffff81066b05>] task_work_add+0x45/0x60
[   22.183065]  [<ffffffff81176f4c>] mntput_no_expire+0xdc/0x150
[   22.183065]  [<ffffffff8117777f>] mntput+0x1f/0x30
[   22.183065]  [<ffffffff81177a49>] drop_mountpoint+0x29/0x30
[   22.183065]  [<ffffffff81188df6>] pin_kill+0x66/0xf0
[   22.183065]  [<ffffffff81082c60>] ? __wake_up_common+0x90/0x90
[   22.183065]  [<ffffffff81188ed9>] group_pin_kill+0x19/0x40
[   22.183065]  [<ffffffff811761d8>] namespace_unlock+0x58/0x60
[   22.183065]  [<ffffffff81178cae>] drop_collected_mounts+0x4e/0x60
[   22.183065]  [<ffffffff8117a3ed>] put_mnt_ns+0x2d/0x50
[   22.183065]  [<ffffffff81068b0a>] free_nsproxy+0x1a/0x80
[   22.183065]  [<ffffffff81068c68>] switch_task_namespaces+0x58/0x70
[   22.183065]  [<ffffffff81068c8b>] exit_task_namespaces+0xb/0x10
[   22.183065]  [<ffffffff8104eb57>] do_exit+0x2c7/0xc00
[   22.183065]  [<ffffffff8104f50a>] do_group_exit+0x3a/0xa0
[   22.183065]  [<ffffffff8104f57f>] SyS_exit_group+0xf/0x10
[   22.183065]  [<ffffffff817ad092>] system_call_fastpath+0x12/0x17

Kirill

On 19.04.2018 01:08, Kirill Tkhai wrote:
> Hi, Alexander!
> 
> On 18.04.2018 22:45, Alexander Aring wrote:
>> I currently can crash my net/master kernel by execute the following script:
>>
>> --- snip
>>
>> modprobe dummy
>>
>> #mkdir /var/run/netns
>> #touch /var/run/netns/init_net
>> #mount --bind /proc/1/ns/net /var/run/netns/init_net
>>
>> while true
>> do
>>     mkdir /var/run/netns
>>     touch /var/run/netns/init_net
>>     mount --bind /proc/1/ns/net /var/run/netns/init_net
>>
>>     ip netns add foo
>>     ip netns exec foo ip link add dummy0 type dummy
>>     ip netns delete foo
>> done
> 
> Fast answer is the best, so I tried your test on my not-for-work computer.
> There is old kernel without asynchronous pernet operations:
> 
> $uname -a
> Linux localhost.localdomain 4.15.0-2-amd64 #1 SMP Debian 4.15.11-1 (2018-03-20) x86_64 GNU/Linux
> 
> After approximately 15 seconds of your test execution it died :(
> (Hopefully, I executed it in "init 1" with all partitions RO as usual).
> 
> There is no serial console, so I can't say that the first stack is exactly
> the same as you see. But it crashed. So, it seems, the problem have been
> existing long ago.
> 
> Have you tried to reproduce it in older kernels or to bisect the problem commit?
> Or maybe it doesn't reproduce on old kernels in your environment?
> 
>> --- snap
>>
>> After max ~1 minute the kernel will crash.
>> Doing my hack of saving init_net outside the loop it will run fine...
>> So the mount bind is necessary.
>>
>> The last message which I see is:
>>
>> BUG: stack guard page was hit at 00000000f0751759 (stack is
>> 0000000069363195..0000000073ddc474)
>> kernel stack overflow (double-fault): 0000 [#1] SMP PTI
>> Modules linked in:
>> CPU: 0 PID: 13917 Comm: ip Not tainted 4.16.0-11878-gef9d066f6808 #32
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
>> RIP: 0010:validate_chain.isra.23+0x44/0xc40
>> RSP: 0018:ffffc900002cbff8 EFLAGS: 00010002
>> RAX: 0000000000040000 RBX: 0e58b88e1d4d15da RCX: 0e58b88e1d4d15da
>> RDX: 0000000000000000 RSI: ffff8802b25ee2a0 RDI: ffff8802b25edb00
>> RBP: 0e58b88e1d4d15da R08: 0000000000000000 R09: 0000000000000004
>> R10: ffffc900002cc050 R11: ffff8802b1054be8 R12: 0000000000000001
>> R13: ffff8802b25ee268 R14: ffff8802b25edb00 R15: 0000000000000000
>> FS:  0000000000000000(0000) GS:ffff8802bfc00000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: ffffc900002cbfe8 CR3: 0000000002024000 CR4: 00000000000006f0
>> Call Trace:
>>  ? get_max_files+0x10/0x10
>>  __lock_acquire+0x332/0x710
>>  lock_acquire+0x67/0xb0
>>  ? lockref_put_or_lock+0x9/0x30
>>  ? dput.part.7+0x17/0x2d0
>>  _raw_spin_lock+0x2b/0x60
>>  ? lockref_put_or_lock+0x9/0x30
>>  lockref_put_or_lock+0x9/0x30
>>  dput.part.7+0x1ec/0x2d0
>>  drop_mountpoint+0x10/0x40
>>  pin_kill+0x9b/0x3a0
>>  ? wait_woken+0x90/0x90
>>  ? mnt_pin_kill+0x2d/0x100
>>  mnt_pin_kill+0x2d/0x100
>>  cleanup_mnt+0x66/0x70
>>  pin_kill+0x9b/0x3a0
>>  ? wait_woken+0x90/0x90
>>  ? mnt_pin_kill+0x2d/0x100
>>  mnt_pin_kill+0x2d/0x100
>>  cleanup_mnt+0x66/0x70
>> ...
>>
>> I guess maybe it has something to do with recently switching to
>> migrate per-net ops to async.
>>
>> - Alex
> 
> Kirill
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [bisected] Stack overflow after fs: "switch the IO-triggering parts of umount to fs_pin" (was net namespaces kernel stack overflow)
  2018-04-19 12:50   ` [bisected] Stack overflow after fs: "switch the IO-triggering parts of umount to fs_pin" (was net namespaces kernel stack overflow) Kirill Tkhai
@ 2018-04-19 15:34     ` Al Viro
  2018-04-19 16:44       ` Al Viro
  0 siblings, 1 reply; 5+ messages in thread
From: Al Viro @ 2018-04-19 15:34 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: Alexander Aring, linux-kernel, netdev, Jamal Hadi Salim

On Thu, Apr 19, 2018 at 03:50:25PM +0300, Kirill Tkhai wrote:
> Hi, Al,
> 
> commit 87b95ce0964c016ede92763be9c164e49f1019e9 is the first after which the below test crashes the kernel:
> 
>     Author: Al Viro <viro@zeniv.linux.org.uk>
>     Date:   Sat Jan 10 19:01:08 2015 -0500
> 
>     switch the IO-triggering parts of umount to fs_pin
>     
>     Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> 
> $modprobe dummy
> 
> $while true
>  do
>      mkdir /var/run/netns
>      touch /var/run/netns/init_net
>      mount --bind /proc/1/ns/net /var/run/netns/init_net
> 
>      ip netns add foo
>      ip netns exec foo ip link add dummy0 type dummy
>      ip netns delete foo
> done

I can reproduce that, all right, and with a stack chain that
looks like this:
[77132.414912]  pin_kill+0x81/0x150
[77132.424362]  ? finish_wait+0x80/0x80
[77132.433917]  mnt_pin_kill+0x1e/0x30
[77132.443829]  cleanup_mnt+0x6b/0x70
[77132.453477]  pin_kill+0x81/0x150
[77132.463064]  ? finish_wait+0x80/0x80
[77132.472553]  group_pin_kill+0x1a/0x30
[77132.481973]  namespace_unlock+0x6f/0x80
[77132.491801]  put_mnt_ns+0x1d/0x30
[77132.501258]  free_nsproxy+0x17/0x90
[77132.510604]  do_exit+0x2dc/0xb40
[77132.520146]  ? handle_mm_fault+0xaa/0x1e0
[77132.529725]  do_group_exit+0x3a/0xa0
[77132.539506]  SyS_exit_group+0x10/0x10
with the top 4 entries repeated a lot.  Those cleanup_mnt()
could be called from __cleanup_mnt(), delayed_mntput() or
mntput_no_expire().

__cleanup_mnt() is only fed to task_work_add(); no way in hell
would you get the call stack similar to that; it would be
called by task_work_run() from exit_task_work() from
do_exit().  Not in the evidence.

delayed_mntput() is only fed to schedule_delayed_work();
again, not a chance of having the call chain like that.

The one in mntput_no_expire() is a tail-call, with
mntput_no_expire() called from umount(2) and mntput()
(tail-calls both of them).  The former is never called
from exit(2), so that call chain reads

pin_kill -> mntput or something tail-calling mntput -> mntput_no_expire ->
cleanup_mnt -> mnt_pin_kill -> pin_kill

Now, the thing called by pin_kill must be something passed to
init_fs_pin(), i.e. acct_pin_kill() or drop_mountpoint().
acct_pin_kill() ends with
        pin_remove(pin);
        acct_put(acct);
}
with
static void acct_put(struct bsd_acct_struct *p)
{
        if (atomic_long_dec_and_test(&p->count))
                kfree_rcu(p, rcu);
}

IOW, no tail-call of mntput() in there.  OTOH,
static void drop_mountpoint(struct fs_pin *p)
{
        struct mount *m = container_of(p, struct mount, mnt_umount);
        dput(m->mnt_ex_mountpoint);
        pin_remove(p);
        mntput(&m->mnt);
}
*does* have the tail-call, so this call chain must be
pin_kill -> drop_mountpoint -> mntput -> mntput_no_expire ->
cleanup_mnt -> mnt_pin_kill -> pin_kill

So far, so good, but if you look into mntput_no_expire() you see
        if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
                struct task_struct *task = current;
                if (likely(!(task->flags & PF_KTHREAD))) {
                        init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
                        if (!task_work_add(task, &mnt->mnt_rcu, true))
                                return;
                }
                if (llist_add(&mnt->mnt_llist, &delayed_mntput_list))
                        schedule_delayed_work(&delayed_mntput_work, 1);
                return;
        }
        cleanup_mnt(mnt);

IOW, we only get there if our vfsmount was an MNT_INTERNAL one.
So we have mnt->mnt_umount of some MNT_INTERNAL mount found in
->mnt_pins of some other mount.  Which, AFAICS, means that
it used to be mounted on that other mount.  How the hell can
that happen?

It looks like you somehow get a long chain of MNT_INTERNAL mounts
stacked on top of each other, which ought to be prevented by
        mnt_flags &= ~MNT_INTERNAL_FLAGS;
in do_add_mount().  Nuts...

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [bisected] Stack overflow after fs: "switch the IO-triggering parts of umount to fs_pin" (was net namespaces kernel stack overflow)
  2018-04-19 15:34     ` Al Viro
@ 2018-04-19 16:44       ` Al Viro
  2018-04-19 16:56         ` Kirill Tkhai
  0 siblings, 1 reply; 5+ messages in thread
From: Al Viro @ 2018-04-19 16:44 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: Alexander Aring, linux-kernel, netdev, Jamal Hadi Salim

On Thu, Apr 19, 2018 at 04:34:48PM +0100, Al Viro wrote:

> IOW, we only get there if our vfsmount was an MNT_INTERNAL one.
> So we have mnt->mnt_umount of some MNT_INTERNAL mount found in
> ->mnt_pins of some other mount.  Which, AFAICS, means that
> it used to be mounted on that other mount.  How the hell can
> that happen?
> 
> It looks like you somehow get a long chain of MNT_INTERNAL mounts
> stacked on top of each other, which ought to be prevented by
>         mnt_flags &= ~MNT_INTERNAL_FLAGS;
> in do_add_mount().  Nuts...

Arrrrrgh...  Nuts is right - clone_mnt() preserves the sodding
MNT_INTERNAL, with obvious results.

netns is related to the problem, by exposing MNT_INTERNAL mounts
(in /proc/*/ns/*) for mount --bind to copy and attach to the
tree.  AFAICS, the minimal reproducer is

touch /tmp/a
unshare -m sh -c 'for i in `seq 10000`; do mount --bind /proc/1/ns/net /tmp/a; done'

(and it can be anything in /proc/*/ns/*, really)

I think the fix should be along the lines of the following:

Don't leak MNT_INTERNAL away from internal mounts

We want it only for the stuff created by SB_KERNMOUNT mounts, *not* for
their copies.

Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/namespace.c b/fs/namespace.c
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1089,7 +1089,8 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 			goto out_free;
 	}
 
-	mnt->mnt.mnt_flags = old->mnt.mnt_flags & ~(MNT_WRITE_HOLD|MNT_MARKED);
+	mnt->mnt.mnt_flags = old->mnt.mnt_flags;
+	mnt->mnt.mnt_flags &= ~(MNT_WRITE_HOLD|MNT_MARKED|MNT_INTERNAL);
 	/* Don't allow unprivileged users to change mount flags */
 	if (flag & CL_UNPRIVILEGED) {
 		mnt->mnt.mnt_flags |= MNT_LOCK_ATIME;

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [bisected] Stack overflow after fs: "switch the IO-triggering parts of umount to fs_pin" (was net namespaces kernel stack overflow)
  2018-04-19 16:44       ` Al Viro
@ 2018-04-19 16:56         ` Kirill Tkhai
  2018-04-19 18:37           ` Alexander Aring
  0 siblings, 1 reply; 5+ messages in thread
From: Kirill Tkhai @ 2018-04-19 16:56 UTC (permalink / raw)
  To: Al Viro; +Cc: Alexander Aring, linux-kernel, netdev, Jamal Hadi Salim

On 19.04.2018 19:44, Al Viro wrote:
> On Thu, Apr 19, 2018 at 04:34:48PM +0100, Al Viro wrote:
> 
>> IOW, we only get there if our vfsmount was an MNT_INTERNAL one.
>> So we have mnt->mnt_umount of some MNT_INTERNAL mount found in
>> ->mnt_pins of some other mount.  Which, AFAICS, means that
>> it used to be mounted on that other mount.  How the hell can
>> that happen?
>>
>> It looks like you somehow get a long chain of MNT_INTERNAL mounts
>> stacked on top of each other, which ought to be prevented by
>>         mnt_flags &= ~MNT_INTERNAL_FLAGS;
>> in do_add_mount().  Nuts...
> 
> Arrrrrgh...  Nuts is right - clone_mnt() preserves the sodding
> MNT_INTERNAL, with obvious results.
> 
> netns is related to the problem, by exposing MNT_INTERNAL mounts
> (in /proc/*/ns/*) for mount --bind to copy and attach to the
> tree.  AFAICS, the minimal reproducer is
> 
> touch /tmp/a
> unshare -m sh -c 'for i in `seq 10000`; do mount --bind /proc/1/ns/net /tmp/a; done'
> 
> (and it can be anything in /proc/*/ns/*, really)
> 
> I think the fix should be along the lines of the following:
> 
> Don't leak MNT_INTERNAL away from internal mounts
> 
> We want it only for the stuff created by SB_KERNMOUNT mounts, *not* for
> their copies.
> 
> Cc: stable@kernel.org
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Flawless victory! Thanks.

Tested-by: Kirill Tkhai <ktkhai@virtuozzo.com>

> ---
> diff --git a/fs/namespace.c b/fs/namespace.c
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1089,7 +1089,8 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
>  			goto out_free;
>  	}
>  
> -	mnt->mnt.mnt_flags = old->mnt.mnt_flags & ~(MNT_WRITE_HOLD|MNT_MARKED);
> +	mnt->mnt.mnt_flags = old->mnt.mnt_flags;
> +	mnt->mnt.mnt_flags &= ~(MNT_WRITE_HOLD|MNT_MARKED|MNT_INTERNAL);
>  	/* Don't allow unprivileged users to change mount flags */
>  	if (flag & CL_UNPRIVILEGED) {
>  		mnt->mnt.mnt_flags |= MNT_LOCK_ATIME;
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [bisected] Stack overflow after fs: "switch the IO-triggering parts of umount to fs_pin" (was net namespaces kernel stack overflow)
  2018-04-19 16:56         ` Kirill Tkhai
@ 2018-04-19 18:37           ` Alexander Aring
  0 siblings, 0 replies; 5+ messages in thread
From: Alexander Aring @ 2018-04-19 18:37 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: Al Viro, linux-kernel, netdev, Jamal Hadi Salim

Hi,

On Thu, Apr 19, 2018 at 12:56 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> On 19.04.2018 19:44, Al Viro wrote:
>> On Thu, Apr 19, 2018 at 04:34:48PM +0100, Al Viro wrote:
>>
>>> IOW, we only get there if our vfsmount was an MNT_INTERNAL one.
>>> So we have mnt->mnt_umount of some MNT_INTERNAL mount found in
>>> ->mnt_pins of some other mount.  Which, AFAICS, means that
>>> it used to be mounted on that other mount.  How the hell can
>>> that happen?
>>>
>>> It looks like you somehow get a long chain of MNT_INTERNAL mounts
>>> stacked on top of each other, which ought to be prevented by
>>>         mnt_flags &= ~MNT_INTERNAL_FLAGS;
>>> in do_add_mount().  Nuts...
>>
>> Arrrrrgh...  Nuts is right - clone_mnt() preserves the sodding
>> MNT_INTERNAL, with obvious results.
>>
>> netns is related to the problem, by exposing MNT_INTERNAL mounts
>> (in /proc/*/ns/*) for mount --bind to copy and attach to the
>> tree.  AFAICS, the minimal reproducer is
>>
>> touch /tmp/a
>> unshare -m sh -c 'for i in `seq 10000`; do mount --bind /proc/1/ns/net /tmp/a; done'
>>
>> (and it can be anything in /proc/*/ns/*, really)
>>
>> I think the fix should be along the lines of the following:
>>
>> Don't leak MNT_INTERNAL away from internal mounts
>>
>> We want it only for the stuff created by SB_KERNMOUNT mounts, *not* for
>> their copies.
>>
>> Cc: stable@kernel.org
>> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
>
> Flawless victory! Thanks.
>

Thanks to all.

Also thanks to Kirill for helping me here and doing the main part by
bisecting this issue.

Finally, my testing stuff which produced this bug also works well now.

Tested-by: Alexander Aring <aring@mojatatu.com>

- Alex

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2018-04-19 18:37 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAOHTApjPwWDxHmGrfMwFvszWubfjJ7YfBgZL-DQ0mdv00UtQEQ@mail.gmail.com>
     [not found] ` <ef45eb94-4480-0683-ccc0-eb1efaf7e96c@virtuozzo.com>
2018-04-19 12:50   ` [bisected] Stack overflow after fs: "switch the IO-triggering parts of umount to fs_pin" (was net namespaces kernel stack overflow) Kirill Tkhai
2018-04-19 15:34     ` Al Viro
2018-04-19 16:44       ` Al Viro
2018-04-19 16:56         ` Kirill Tkhai
2018-04-19 18:37           ` Alexander Aring

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).