LKML Archive on lore.kernel.org
From: Dmitry Vyukov <dvyukov@google.com>
To: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: syzbot <syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com>,
	Christian Brauner <christian.brauner@ubuntu.com>,
	David Miller <davem@davemloft.net>,
	David Ahern <dsahern@gmail.com>, Florian Westphal <fw@strlen.de>,
	Jiri Benc <jbenc@redhat.com>, LKML <linux-kernel@vger.kernel.org>,
	Xin Long <lucien.xin@gmail.com>,
	mschiffer@universe-factory.net, netdev <netdev@vger.kernel.org>,
	syzkaller-bugs <syzkaller-bugs@googlegroups.com>,
	Vladislav Yasevich <vyasevich@gmail.com>
Subject: Re: INFO: task hung in ip6gre_exit_batch_net
Date: Fri, 8 Jun 2018 10:18:18 +0200
Message-ID: <CACT4Y+a57xaVfph_u78Z+kM8xUF=8fVk-VSA+xqok38Zm4bhFw@mail.gmail.com>
In-Reply-To: <6b14e8b9-c335-dd46-98a6-7f58a624fcea@virtuozzo.com>

On Thu, Jun 7, 2018 at 9:59 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> On 07.06.2018 22:03, Dmitry Vyukov wrote:
>> On Thu, Jun 7, 2018 at 8:54 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>>>>> Hi, Dmitry!
>>>>>>>
>>>>>>> On 04.06.2018 18:22, Dmitry Vyukov wrote:
>>>>>>>> On Mon, Jun 4, 2018 at 5:03 PM, syzbot
>>>>>>>> <syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com> wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> syzbot found the following crash on:
>>>>>>>>>
>>>>>>>>> HEAD commit:    bc2dbc5420e8 Merge branch 'akpm' (patches from Andrew)
>>>>>>>>> git tree:       upstream
>>>>>>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=164e42b7800000
>>>>>>>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=982e2df1b9e60b02
>>>>>>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=bf78a74f82c1cf19069e
>>>>>>>>> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>>>>>>>>>
>>>>>>>>> Unfortunately, I don't have any reproducer for this crash yet.
>>>>>>>>>
>>>>>>>>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>>>>>>>>> Reported-by: syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com
>>>>>>>>
>>>>>>>> Another hang on rtnl lock:
>>>>>>>>
>>>>>>>> #syz dup: INFO: task hung in netdev_run_todo
>>>>>>>>
>>>>>>>> May be related to "unregister_netdevice: waiting for DEV to become free":
>>>>>>>> https://syzkaller.appspot.com/bug?id=1a97a5bd119fd97995f752819fd87840ab9479a9
>>>>>>
>>>>>> netdev_wait_allrefs does not hold rtnl lock during waiting, so it must
>>>>>> be something different.
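
For reference, the wait loop in netdev_wait_allrefs() looks roughly like this
(paraphrased from memory of the v4.17-era net/core/dev.c, not verbatim; the
exact constants and surrounding details may be off):

	while (netdev_refcnt_read(dev) != 0) {
		if (time_after(jiffies, rebroadcast_time + 1 * HZ)) {
			rtnl_lock();
			/* Rebroadcast NETDEV_UNREGISTER so that holders of
			 * stale references get another chance to drop them.
			 */
			call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
			__rtnl_unlock();
			rebroadcast_time = jiffies;
		}

		msleep(250);	/* sleeps with rtnl_mutex released */

		if (time_after(jiffies, warning_time + 10 * HZ)) {
			pr_emerg("unregister_netdevice: waiting for %s to become free. Usage count = %d\n",
				 dev->name, netdev_refcnt_read(dev));
			warning_time = jiffies;
		}
	}

So a leaked netdev refcount shows up as the "waiting for DEV to become free"
message, while rtnl_mutex itself stays available to other tasks most of the
time.
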
>>>>>>
>>>>>>
>>>>>>>> Any other explanations for massive hangs on rtnl lock for minutes?
>>>>>>>
>>>>>>> To exclude the situation when a task exits with rtnl_mutex held:
>>>>>>>
>>>>>>> would the pr_warn() messages from print_held_locks_bug() be included
>>>>>>> in the console output if they appear?
>>>>>>
>>>>>> Yes, everything containing "WARNING:" is detected as a bug.
>>>>>
>>>>> OK, then a dead task not releasing the lock is excluded.
>>>>>
>>>>> One more assumption: someone corrupted memory around rtnl_mutex, so it merely looks locked.
>>>>> (I read the lockdep "(rtnl_mutex){+.+.}" prints in the initial message as "nobody owns rtnl_mutex".)
>>>>> A crash dump of the VM may help here.
>>>>
>>>> I can't find any legend for these +'s and .'s, but {+.+.} shows up
>>>> all over the place in just about any task-hung report for different mutexes,
>>>> so I would not expect it to mean corruption.
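
(For what it's worth, there is a legend in Documentation/locking/lockdep-design.txt.
Quoting it roughly from memory, each character means:

	'.'  acquired while irqs disabled and not in irq context
	'-'  acquired in irq context
	'+'  acquired with irqs enabled
	'?'  acquired in irq context with irqs enabled

and the positions cover the hardirq/softirq write/read usage states. So
{+.+.} only describes how the lock class has ever been used, which is what
you expect for a normal process-context mutex, and says nothing about whether
the lock is currently held or its memory is corrupted.)
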
>>>>
>>>> There are dozens of known corruptions that syzkaller can trigger, but
>>>> usually they are reliably caught by KASAN. If any of them led to
>>>> silent memory corruption, we would get dozens of assorted crashes
>>>> throughout the kernel. We've seen that at some points, but not
>>>> recently. So I would assume that memory is not corrupted in all these
>>>> cases:
>>>> https://syzkaller.appspot.com/bug?id=2503c576cabb08d41812e732b390141f01a59545
>>>
>>> This BUG clarifies the {+.+.}:
>>>
>>> 4 locks held by kworker/0:145/381:
>>>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] work_static include/linux/workqueue.h:198 [inline]
>>>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] set_work_data kernel/workqueue.c:619 [inline]
>>>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] set_work_pool_and_clear_pending kernel/workqueue.c:646 [inline]
>>>  #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] process_one_work+0xb12/0x1bb0 kernel/workqueue.c:2084
>>>  #1:  ((work_completion)(&data->destroy_work)){+.+.}, at: [<00000000bbdd2115>] process_one_work+0xb89/0x1bb0 kernel/workqueue.c:2088
>>>  #2:  (rtnl_mutex){+.+.}, at: [<000000009c9d14f8>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
>>>  #3:  (rcu_sched_state.exp_mutex){+.+.}, at: [<000000001ba1a807>] exp_funnel_lock kernel/rcu/tree_exp.h:272 [inline]
>>>  #3:  (rcu_sched_state.exp_mutex){+.+.}, at: [<000000001ba1a807>] _synchronize_rcu_expedited.constprop.72+0x9fa/0xac0 kernel/rcu/tree_exp.h:596
>>>
>>> There we have rtnl_mutex locked, and the {+.+.} annotation is the same as above.
>>> It's definitely locked, since there is one more lock acquired after it.
>>>
>>> This BUG happens because there are many rtnl_mutex waiters while the owner
>>> is synchronizing RCU. That is rather clear to me, in comparison to this topic's hang.
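
One common way to end up in that situation is synchronize_net(), which, if I
remember the net/core/dev.c code correctly, is roughly:

	void synchronize_net(void)
	{
		might_sleep();
		if (rtnl_is_locked())
			/* Expedite the grace period, since we are blocking
			 * everyone else who wants rtnl_mutex. */
			synchronize_rcu_expedited();
		else
			synchronize_rcu();
	}

so any device teardown performed under rtnl_mutex lands in the expedited RCU
path seen as lock #3 in the dump above.
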
>>
>>
>> You mean that it's not hung, but rather needs more than 2 minutes to
>> complete, right?
>
> Yeah, I think this is possible; I've seen situations like that.
> Say synchronize_rcu_expedited() takes X seconds to execute. Then
> it takes just 120/X calls of "this function under rtnl_mutex" to trigger
> a hung-task report for someone else who wants the mutex too.
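
To put rough numbers on that (purely illustrative, the figures are made up):
the hung-task watchdog fires after a task has been stuck in uninterruptible
sleep for 120 seconds by default, so if each synchronize_rcu_expedited() under
rtnl_mutex takes X = 5 seconds, then 120/5 = 24 back-to-back work items of
this kind are enough to keep another rtnl_lock() waiter blocked past the
threshold and produce an "INFO: task hung" report, even though everything is
still making forward progress.
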
>
> Also, even though CFS is a fair scheduler, when the calls are made from a
> workqueue, every work item sleeps, so the pool wakes or spawns another
> worker and every work item ends up running in a separate worker task with
> its own time slice. This increases the probability that these tasks get
> CPU time before the task named in the header of the hang report does.


OK, let's stick with this theory for now. Looking at the crash frequencies here:
https://syzkaller.appspot.com/bug?id=2503c576cabb08d41812e732b390141f01a59545
I can actually believe that these are just flakes due to overly slow execution.

I've noted that we need to either reduce load and/or increase timeouts:
https://github.com/google/syzkaller/issues/516#issuecomment-395685629

Let's keep duping such new reports onto "INFO: task hung in
netdev_run_todo" so that they are all collected in a single place.

Thanks for the help.

