Netdev Archive on lore.kernel.org
help / color / mirror / Atom feed
From: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
To: Alan Stern <stern@rowland.harvard.edu>,
	Felipe Balbi <balbi@kernel.org>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: syzbot <syzbot+abd2e0dafb481b621869@syzkaller.appspotmail.com>,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	syzkaller-bugs@googlegroups.com,
	Pavel Skripkin <paskripkin@gmail.com>,
	Thierry Escande <thierry.escande@collabora.com>,
	Andrey Konovalov <andreyknvl@gmail.com>
Subject: Re: [syzbot] INFO: task hung in port100_probe
Date: Wed, 20 Oct 2021 22:56:42 +0200	[thread overview]
Message-ID: <b9695fc8-51b5-c61e-0a2f-fec9c2f0bae0@canonical.com> (raw)
In-Reply-To: <20210722144721.GA6592@rowland.harvard.edu>

On 22/07/2021 16:47, Alan Stern wrote:
> On Thu, Jul 22, 2021 at 04:20:10PM +0200, Krzysztof Kozlowski wrote:
>> On 22/06/2021 17:43, syzbot wrote:
>>> Hello,
>>>
>>> syzbot found the following issue on:
>>>
>>> HEAD commit:    fd0aa1a4 Merge tag 'for-linus' of git://git.kernel.org/pub..
>>> git tree:       upstream
>>> console output: https://syzkaller.appspot.com/x/log.txt?x=13e1500c300000
>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=7ca96a2d153c74b0
>>> dashboard link: https://syzkaller.appspot.com/bug?extid=abd2e0dafb481b621869
>>> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1792e284300000
>>> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=13ad9d48300000
>>>
>>> IMPORTANT: if you fix the issue, please add the following tag to the commit:
>>> Reported-by: syzbot+abd2e0dafb481b621869@syzkaller.appspotmail.com
>>>
>>> INFO: task kworker/0:1:7 blocked for more than 143 seconds.
>>>       Not tainted 5.13.0-rc6-syzkaller #0
>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> task:kworker/0:1     state:D stack:25584 pid:    7 ppid:     2 flags:0x00004000
>>> Workqueue: usb_hub_wq hub_event
>>> Call Trace:
>>>  context_switch kernel/sched/core.c:4339 [inline]
>>>  __schedule+0x916/0x23e0 kernel/sched/core.c:5147
>>>  schedule+0xcf/0x270 kernel/sched/core.c:5226
>>>  schedule_timeout+0x1db/0x250 kernel/time/timer.c:1868
>>>  do_wait_for_common kernel/sched/completion.c:85 [inline]
>>>  __wait_for_common kernel/sched/completion.c:106 [inline]
>>>  wait_for_common kernel/sched/completion.c:117 [inline]
>>>  wait_for_completion+0x168/0x270 kernel/sched/completion.c:138
>>>  port100_send_cmd_sync drivers/nfc/port100.c:923 [inline]
>>>  port100_get_command_type_mask drivers/nfc/port100.c:1008 [inline]
>>>  port100_probe+0x9e4/0x1340 drivers/nfc/port100.c:1554
>>>  usb_probe_interface+0x315/0x7f0 drivers/usb/core/driver.c:396
> ...
> 
>> Cc: Thierry, Alan, Andrey,
>>
>> The issue is reproducible immediately on QEMU instance with
>> USB_DUMMY_HCD and USB_RAW_GADGET. I don't know about real port100 NFC
>> device.
>>
>> I spent some time looking into this and have no clue, except that it
>> looks like an effect of a race condition.
>>
>> 1. When using syskaller reproducer against one USB device (In the C
>> reproducer change the loop in main() to use procid=0) - issue does not
>> happen.
>>
>> 2. With two threads or more talking to separate Dummy USB devices, the
>> issue appears. The more of them, the better...
>>
>> 3. The reported problem is in missing complete. The correct flow is like:
>> port100_probe()
>> port100_get_command_type_mask()
>> port100_send_cmd_sync()
>> port100_send_cmd_async()
>> port100_submit_urb_for_ack()
>> port100_send_complete()
>> [   63.363863] port100 2-1:0.0: NFC: Urb failure (status -71)
>> port100_recv_ack()
>> [   63.369942] port100 2-1:0.0: NFC: Urb failure (status -71)
>>
>> and schedule_work() which completes and unblocks port100_send_cmd_sync
>>
>> However in the failing case (hung task) the port100_recv_ack() is never
>> called. It looks like USB core / HCD / gadget does not send the Ack/URB
>> complete.
>>
>> I don't know why. The port100 NFC driver code looks OK, except it is not
>> prepared for missing ack/urb so it waits indefinitely. I could try to
>> convert it to wait_for_completion_timeout() but it won't be trivial and
>> more important - I am not sure if this is the problem. Somehow the ACK
>> with Urb failure is not sent back to the port100 device. Therefore I am
>> guessing that the race condition is somwhere in USB stack, not in
>> port100 driver.
>>
>> The lockdep and other testing tools did not find anything here.
>>
>> Anyone hints where the issue could be?
> 
> Here's what I wrote earlier: "It looks like the problem stems from the fact 
> that port100_send_frame_async() submits two URBs, but 
> port100_send_cmd_sync() only waits for one of them to complete.  The other 
> URB may then still be active when the driver tries to reuse it."
> 
> Of course, there may be more than one problem, so we may not be talking 
> about the same thing.

Hi Alan, Felipe, Greg and others,

This is an old issue reported by syzkaller for NFC port100 driver [1].
There is something similar for pn533 [2].

I was looking at it some time ago, took a break and now I am trying to
fix it again. Without success.

The issue is reproducible via USB gadget on QEMU, not on real HW. I
looked and debugged the code and I think previously mentioned
double-URB-submit is not the reason here. Or I miss how the USB works
(which is quite probable...).

1. The port100 driver calls port100_send_cmd_sync() which eventually
goes to port100_send_frame_async(). After it, it waits for "sync"
completion.

2. In port100_send_frame_async(), driver indeed first submits "out_urb"
which quite fast is being processed by dummy_hcd with "no ep configured"
and -EPROTO.

3. Then (or sometimes before -EPROTO response from (2) above) the
port100_send_frame_async() submits "in_urb" via
port100_submit_urb_for_ack() and waits for its completion. Completion of
"in_urb" (or the "ack") in port100_recv_ack() would schedule work to
complete the (1) above - the sync completion.

4. Usually, when reproducer works fine (does not trigger issue), the
dummy_timer() from gadget responds with the same "no ep configured for
urb" for this "in_urb" (3). This completes "in_urb", which eventually
completes (1) and probe finishes with error. Error is expected, because
it's random junk-gadget...

The syzkaller reproducer fails if >1 of threads are running these usb
gadgets.  When this happens, no "in_urb" completion happens. No this
"ack" port100_recv_ack().

I added some debugs and simply dummy_hcd dummy_timer() is woken up on
enqueuing in_urb and then is looping crazy on a previous URB (some older
URB, coming from before port100 driver probe started). The dummy_timer()
loop never reaches the second "in_urb" to process it, I think.

The pn533 NFC driver has similar design, but I have now really doubts it
is a NFC driver issue. Instead an issue in dummy gadget HCD is somehow
triggered by the reproducer.

Reproduction - just follow [1] or [2]. Eventually I slightly tweaked the
code and put here:
https://github.com/krzk/tools/tree/master/tests-var/nfc/port100_probe
$ make
$ sudo ./port100_probe


[1] https://syzkaller.appspot.com/bug?extid=abd2e0dafb481b621869
[2] https://syzkaller.appspot.com/bug?extid=1dc8b460d6d48d7ef9ca


Best regards,
Krzysztof

  parent reply	other threads:[~2021-10-20 20:56 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-22 15:43 syzbot
2021-06-22 16:07 ` Pavel Skripkin
2021-06-22 16:21   ` syzbot
2021-07-22 14:20 ` Krzysztof Kozlowski
2021-07-22 14:23   ` Krzysztof Kozlowski
2021-07-22 14:47   ` Alan Stern
2021-07-23  9:05     ` Krzysztof Kozlowski
2021-07-23 13:07       ` Alan Stern
2021-10-20 20:56     ` Krzysztof Kozlowski [this message]
2021-10-20 22:05       ` Alan Stern
2021-10-25 14:57         ` Krzysztof Kozlowski
2021-10-25 16:22           ` Alan Stern
2021-10-25 17:13             ` Krzysztof Kozlowski
2021-10-25 18:54               ` Alan Stern
2022-03-09 19:33 ` Pavel Skripkin
2022-03-09 19:56   ` syzbot
     [not found] <20220310084247.1148-1-hdanton@sina.com>
2022-03-10 14:22 ` syzbot
     [not found] ` <20220311053751.1226-1-hdanton@sina.com>
2022-03-11 19:17   ` Pavel Skripkin
2022-03-11 19:18     ` syzbot
2022-03-11 19:19       ` Pavel Skripkin
2022-03-11 19:32         ` syzbot
     [not found]   ` <20220312005624.1310-1-hdanton@sina.com>
2022-03-12 10:36     ` Pavel Skripkin
     [not found]     ` <20220312115854.1399-1-hdanton@sina.com>
2022-03-12 12:44       ` Pavel Skripkin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b9695fc8-51b5-c61e-0a2f-fec9c2f0bae0@canonical.com \
    --to=krzysztof.kozlowski@canonical.com \
    --cc=andreyknvl@gmail.com \
    --cc=balbi@kernel.org \
    --cc=gregkh@linuxfoundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=paskripkin@gmail.com \
    --cc=stern@rowland.harvard.edu \
    --cc=syzbot+abd2e0dafb481b621869@syzkaller.appspotmail.com \
    --cc=syzkaller-bugs@googlegroups.com \
    --cc=thierry.escande@collabora.com \
    --subject='Re: [syzbot] INFO: task hung in port100_probe' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).