Netdev Archive on lore.kernel.org
From: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
To: Alan Stern <stern@rowland.harvard.edu>
Cc: Felipe Balbi <balbi@kernel.org>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	syzbot <syzbot+abd2e0dafb481b621869@syzkaller.appspotmail.com>,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	syzkaller-bugs@googlegroups.com,
	Pavel Skripkin <paskripkin@gmail.com>,
	Thierry Escande <thierry.escande@collabora.com>,
	Andrey Konovalov <andreyknvl@gmail.com>
Subject: Re: [syzbot] INFO: task hung in port100_probe
Date: Mon, 25 Oct 2021 19:13:59 +0200
Message-ID: <1927ec9b-d1d0-9c70-992b-925ddfbba79a@canonical.com>
In-Reply-To: <20211025162200.GC1258186@rowland.harvard.edu>

On 25/10/2021 18:22, Alan Stern wrote:
> On Mon, Oct 25, 2021 at 04:57:23PM +0200, Krzysztof Kozlowski wrote:
>> On 21/10/2021 00:05, Alan Stern wrote:
>>>>
>>>> The syzkaller reproducer fails if >1 of threads are running these usb
>>>> gadgets.  When this happens, no "in_urb" completion happens. No this
>>>> "ack" port100_recv_ack().
>>>>
>>>> I added some debugs and simply dummy_hcd dummy_timer() is woken up on
>>>> enqueuing in_urb and then is looping crazy on a previous URB (some older
>>>> URB, coming from before port100 driver probe started). The dummy_timer()
>>>> loop never reaches the second "in_urb" to process it, I think.
>>>
>>> Is there any way you can track down what's happening in that crazy loop?  
>>> That is, what driver was responsible for the previous URB?
>>>
>>> We have seen this sort of thing before, where a driver submits an URB 
>>> for a gadget which has disconnected.  The URB fails with -EPROTO status 
>>> but the URB's completion handler does an automatic resubmit.  That can 
>>> lead to a very tight loop with dummy-hcd, and it could easily prevent 
>>> some other important processing from occurring.  The simple solution is 
>>> to prevent the driver from resubmitting when the completion status is 
>>> -EPROTO.
>>
>> Hi Alan,
>>
>> Thanks for the reply.
>>
>> The URB which causes crazy loop is the port100 driver second URB, the
>> one called ack or in_urb.
>>
>> The flow is:
>> 1. probe()
>> 2. port100_get_command_type_mask()
>> 3. port100_send_cmd_async()
>> 4. port100_send_frame_async()
>> 5. usb_submit_urb(dev->out_urb)
>>    The call succeeds, the dummy_hcd picks it up and immediately ends the
>> timer-loop with -EPROTO
> 
> So that URB completes immediately.
> 
>> The completion here does not resubmit another/same URB. I checked this
>> carefully and I hope I did not miss anything.
> 
> Yeah, I see the same thing.
> 
>> 6. port100_submit_urb_for_ack() which sends the in_urb:
>>    usb_submit_urb(dev->in_urb)
>> ... wait for completion
>> ... dummy_hcd loops on this URB around line 2000:
>> if (status == -EINPROGRESS)
>>   continue
> 
> Do I understand this correctly?  You're saying that dummy-hcd executes 
> the following jump at line 1975:
> 
> 		/* incomplete transfer? */
> 		if (status == -EINPROGRESS)
> 			continue;
> 
> which goes back up to the loop head on line 1831:
> 
> 	list_for_each_entry_safe(urbp, tmp, &dum_hcd->urbp_list, urbp_list) {
> 
> Is that right?

Yes, exactly. The loop continues, the iteration over the list finishes,
and so the loop and the dummy timer function exit. The timer is then
immediately rescheduled by something (I don't know by what yet).
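
To make the behaviour concrete, here is a small userspace sketch of it.
It is not dummy_hcd code: struct sim_urb, sim_timer_pass() and
sim_run_timer() are hypothetical names, and the sketch simply assumes
that the handler is re-armed whenever an URB is still pending (what
re-arms the real timer is exactly what I have not found yet). The
gadget_responds flag models whether the device side ever answers.

```c
#include <assert.h>
#include <stddef.h>

/* Values mirror the Linux errno numbers, negated as URB statuses are. */
#define SIM_EINPROGRESS (-115)
#define SIM_EPROTO      (-71)

/* Hypothetical stand-in for one queued URB. */
struct sim_urb {
	int status;           /* SIM_EINPROGRESS until the transfer ends */
	int completed;        /* set once the completion handler would run */
	struct sim_urb *next; /* singly linked list, NULL-terminated */
};

/*
 * One pass of the simulated timer handler over the URB list.
 * Returns how many URBs are still pending afterwards; non-zero means
 * the (assumed) re-arm would happen.
 */
static size_t sim_timer_pass(struct sim_urb *head, int gadget_responds)
{
	size_t pending = 0;

	for (struct sim_urb *u = head; u; u = u->next) {
		if (u->completed)
			continue;
		if (gadget_responds)
			u->status = SIM_EPROTO; /* transfer ends with an error */

		/* incomplete transfer: skip it and move to the next entry */
		if (u->status == SIM_EINPROGRESS) {
			pending++;
			continue;
		}
		u->completed = 1; /* the completion handler would run here */
	}
	return pending;
}

/* Re-run the handler until nothing is pending or max_passes is hit. */
static int sim_run_timer(struct sim_urb *head, int gadget_responds,
			 int max_passes)
{
	int passes = 0;

	assert(max_passes > 0);
	while (passes < max_passes) {
		passes++;
		if (sim_timer_pass(head, gadget_responds) == 0)
			break;
	}
	return passes;
}
```

With a one-element list and gadget_responds == 0, every pass walks the
list, hits the continue, and returns with one URB still pending, so
sim_run_timer() only stops at max_passes. That matches what I see: each
pass finishes quickly, but the URB never leaves -EINPROGRESS.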

As a reminder: the syzbot reproducer must run at least two threads at
the same time, each spawning a USB gadget and therefore creating a
separate dummy device. However, only one of the dummy HCD devices seems
to timer-loop endlessly... but this might not be important; maybe it's
just how the syzbot reproducer works.

>  I don't see why this should cause any problem.  It won't 
> loop back to the same URB; it will make its way through the list.  
> (Unless the list has somehow gotten corrupted...)  dum_hcd->urbp_list 
> should be short (perhaps 32 entries at most), so the loop should reach 
> the end of the list fairly quickly.

The list actually has only one element: the one URB coming from the
port100 device (the one I have been calling the second URB, ack, or
in_urb).

> Now, doing all this 1000 times per second could use up a significant 
> portion of the available time.  Do you think that's the reason for the 
> problem?  It seems pretty unlikely.

No, the timer-looping itself is not the problem. The problem is that
this URB never reaches a final state, e.g. -EPROTO.

In normal operation, i.e. when the reproducer does not hit the issue,
both URBs from port100 (the first out_urb and the second in_urb)
complete with -EPROTO. In the case leading to the hang ("task
kworker/0:0:5 blocked for more than 143 seconds"), the in_urb does not
complete, so the port100 driver keeps waiting.
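
The two outcomes can be modelled in a few lines of userspace C. This is
only a sketch under assumptions, not port100 code: struct sim_cmd,
sim_submit() and sim_probe_wait() are hypothetical names, the "done"
flag stands in for whatever the probe path sleeps on, and the bounded
poll replaces a real blocking wait so that the example terminates.

```c
#include <assert.h>
#include <stdbool.h>

/* Values mirror the Linux errno numbers, negated as URB statuses are. */
#define SIM_EINPROGRESS (-115)
#define SIM_EPROTO      (-71)

/* Hypothetical model of one command: out_urb, then in_urb (the ack). */
struct sim_cmd {
	int out_status;
	int in_status;
	bool done; /* set only when the in_urb completion runs */
};

/* Steps 5 and 6 of the flow: submit out_urb, then in_urb. */
static void sim_submit(struct sim_cmd *c, bool in_urb_finishes)
{
	c->done = false;
	c->out_status = SIM_EPROTO;     /* out_urb ends immediately */
	c->in_status = SIM_EINPROGRESS; /* in_urb is queued */

	if (in_urb_finishes) {
		c->in_status = SIM_EPROTO;
		c->done = true; /* in_urb completion releases the waiter */
	}
}

/* Bounded stand-in for the probe-side wait; true if it was released. */
static bool sim_probe_wait(const struct sim_cmd *c, unsigned int budget)
{
	assert(budget > 0);
	for (unsigned int i = 0; i < budget; i++) {
		if (c->done)
			return true;
	}
	return false;
}
```

In the good case sim_probe_wait() returns true on the first check. In
the bad case out_status is already -EPROTO but in_status stays at
-EINPROGRESS, nothing ever sets done, and the waiter is never released,
which is the hung-task report in miniature.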

Whether this intensive timer-loop is important (processing the same URB
and continuing), I don't know.

Best regards,
Krzysztof
