Netdev Archive on lore.kernel.org
From: Michael Chan <michael.chan@broadcom.com>
To: Baptiste Covolato <baptiste@arista.com>
Cc: David Miller <davem@davemloft.net>,
Netdev <netdev@vger.kernel.org>, Jakub Kicinski <kuba@kernel.org>,
David Christensen <drc@linux.vnet.ibm.com>
Subject: Re: [PATCH net] tg3: Fix soft lockup when tg3_reset_task() fails.
Date: Sat, 5 Sep 2020 02:02:09 -0700
Message-ID: <CACKFLimoBx18uoJmXbVQTML+7eQb94nZJv2To7Wd2drJMSSeNg@mail.gmail.com>
In-Reply-To: <CABb8VeHA8yEmi-iDs3O-eRfOucWqGM+9p6gj87NLdjeQHfJROA@mail.gmail.com>

On Fri, Sep 4, 2020 at 4:20 PM Baptiste Covolato <baptiste@arista.com> wrote:
> Thank you for proposing this patch. Unfortunately, it appears to make
> things worse on my test setup. The problem is a lot easier to
> reproduce, and not related to transmit timeout anymore.

This patch specifically addresses the issue reported by David
Christensen. When tg3_reset_task() is unsuccessful, it will bring the
device to a consistent IF_DOWN state to prevent soft lockup.

tg3_reset_task() is usually scheduled from TX timeout, or from a few
other error conditions. In David's case, it was triggered from TX
timeout.

So if the issue you're reporting has nothing to do with TX timeout or
the other error conditions that trigger tg3_reset_task(), this patch
should have no effect.

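
To illustrate the idea of falling back to a consistent down state when a
reset fails, here is a toy model in Python. It is only a sketch: the
ToyNic class, its state names, and reset_task() below are hypothetical
and do not mirror the actual tg3 driver code or the patch itself.

```python
# Toy model only: the class, states, and function names are hypothetical
# and do not mirror the tg3 driver's real code.
class ToyNic:
    def __init__(self, restart_succeeds):
        self.restart_succeeds = restart_succeeds
        self.state = "up"

    def halt(self):
        # The device is stopped before a restart is attempted.
        self.state = "halted"

    def restart(self):
        # Stand-in for hardware re-initialization, which may fail.
        if self.restart_succeeds:
            self.state = "up"
        return self.restart_succeeds

    def close(self):
        # Consistent "interface down" state: nothing is left half-initialized.
        self.state = "down"


def reset_task(nic):
    """Halt and restart the device; on failure, fall back to a clean down state."""
    nic.halt()
    if not nic.restart():
        nic.close()
    return nic.state


print(reset_task(ToyNic(restart_succeeds=True)))   # up
print(reset_task(ToyNic(restart_succeeds=False)))  # down
```

The point of the fallback is that the device never stays in the
intermediate "halted" state after a failed restart.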
>
> The manifestation of the problem with the new patch starts with a
> CmpltTO error on the PCI root port of the CPU:
> [11288.471126] tg3 0000:56:00.0: tg3_abort_hw timed out,
> TX_MODE_ENABLE will not clear MAC_TX_MODE=ffffffff
It is unclear how tg3_abort_hw() is called, but it is encountering an
error. The TX mode register cannot be cleared.
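
As a rough illustration of why the log message reports
MAC_TX_MODE=ffffffff, the sketch below models a "poll until the enable
bit clears" loop in Python. It is not the driver's code: the
TX_MODE_ENABLE bit value, the poll count, and wait_for_clear() are
assumptions made for the example. A register read that returns all ones
commonly means the device is no longer responding on the bus, and in
that case the enable bit can never be observed as cleared.

```python
TX_MODE_ENABLE = 0x00000002  # assumed bit position, for illustration only
MAX_POLLS = 1000             # hypothetical poll limit


def wait_for_clear(read_reg, mask, max_polls=MAX_POLLS):
    """Poll read_reg() until (value & mask) is zero.

    Returns (cleared, last_value_read).
    """
    value = read_reg()
    for _ in range(max_polls):
        if (value & mask) == 0:
            return True, value
        value = read_reg()
    return False, value


# Healthy device (simulated): the enable bit clears after a couple of reads.
reads = iter([0x2, 0x2, 0x0])
print(wait_for_clear(lambda: next(reads), TX_MODE_ENABLE))

# Unresponsive device (simulated): every read returns all ones.
cleared, value = wait_for_clear(lambda: 0xFFFFFFFF, TX_MODE_ENABLE, max_polls=10)
if not cleared:
    print(f"timed out, TX_MODE_ENABLE will not clear MAC_TX_MODE={value:08x}")
```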
> [11290.258733] tg3 0000:56:00.0 lc4: No firmware running
Most tg3 NICs have firmware running. This message about no firmware
running usually means something is wrong.
> [11302.336601] tg3 0000:56:00.0 lc4: Link is down
> [11302.336616] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal)
> error received: 0000:00:03.0
> [11302.336621] pcieport 0000:00:03.0: PCIe Bus Error:
> severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester
> ID)
> [11302.470089] pcieport 0000:00:03.0: device [8086:6f08] error
> status/mask=00004000/00000000
> [11302.570218] pcieport 0000:00:03.0: [14] CmpltTO (First)
> [11302.651611] pcieport 0000:00:03.0: broadcast error_detected message
> [11305.119349] br1: port 4(lc4) entered disabled state
> [11305.119443] br1: port 1(lc4.42) entered disabled state
> [11305.119696] device lc4 left promiscuous mode
> [11305.119697] br1: port 4(lc4) entered disabled state
> [11305.143622] device lc4.42 left promiscuous mode
> [11305.143626] br1: port 1(lc4.42) entered disabled state
> [11305.219623] iommu: Removing device 0000:56:00.0 from group 52
> [11305.219672] tg3 0000:61:00.0 lc5: PCI I/O error detected
> [11305.345904] tg3 0000:6c:00.0 lc6: PCI I/O error detected
Now we have AER errors detected on 2 other tg3 devices, not from the
one above with tg3_abort_hw() failure.
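
One way to check which device reported what is to group the tg3 lines
in the log by PCI address. The following Python sketch does that for a
few lines copied from the quoted excerpt (with wrapped lines rejoined).
The helper name devices_with() is made up for this example.

```python
import re

# Lines taken from the quoted dmesg excerpt, with email line wraps rejoined.
SAMPLE = """\
[11288.471126] tg3 0000:56:00.0: tg3_abort_hw timed out, TX_MODE_ENABLE will not clear MAC_TX_MODE=ffffffff
[11290.258733] tg3 0000:56:00.0 lc4: No firmware running
[11305.219672] tg3 0000:61:00.0 lc5: PCI I/O error detected
[11305.345904] tg3 0000:6c:00.0 lc6: PCI I/O error detected
"""

# Matches "tg3 <domain:bus:device.function>" and captures the PCI address.
TG3_LINE = re.compile(r"\btg3 ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\b")


def devices_with(log, needle):
    """Return the sorted PCI addresses of tg3 devices whose lines contain needle."""
    found = set()
    for line in log.splitlines():
        match = TG3_LINE.search(line)
        if match and needle in line:
            found.add(match.group(1))
    return sorted(found)


print(devices_with(SAMPLE, "tg3_abort_hw timed out"))
print(devices_with(SAMPLE, "PCI I/O error detected"))
```

On this sample, the abort timeout appears only for 0000:56:00.0, while
the PCI I/O errors appear for 0000:61:00.0 and 0000:6c:00.0.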

I think the issue you're reporting is not the same as David's, where
the TX timeout happened at about the same time as the AER error.

Please describe the issue in more detail, in particular how the
tg3_abort_hw() call seen above was initiated and how many tg3 devices
you have. Also, are you injecting these AER errors? Please also include
the complete dmesg. Thanks.

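
For reference when reading the AER lines in the quoted log, the
status/mask=00004000/00000000 value can be decoded bit by bit. The
Python sketch below uses a partial table of uncorrectable error status
bits with the short names the kernel prints. The function names are
made up for this example, and the table is not exhaustive.

```python
import re

# Partial table of AER Uncorrectable Error Status bits, using the short
# names that appear in kernel AER log output. Not exhaustive.
UNCOR_BITS = {
    4: "DLP",
    5: "SDES",
    12: "TLP",
    13: "FCP",
    14: "CmpltTO",
    15: "CmpltAbrt",
    16: "UnxCmplt",
    17: "RxOF",
    18: "MalfTLP",
    19: "ECRC",
    20: "UnsupReq",
}


def parse_status_mask(line):
    """Extract (status, mask) as integers from a 'status/mask=X/Y' log line."""
    match = re.search(r"status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})", line)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)


def decode(status):
    """Return (bit, name) for every bit set in status."""
    return [
        (bit, UNCOR_BITS.get(bit, "unknown"))
        for bit in range(32)
        if status & (1 << bit)
    ]


# Line from the quoted log, with the email line wrap rejoined.
line = "pcieport 0000:00:03.0: device [8086:6f08] error status/mask=00004000/00000000"
status, mask = parse_status_mask(line)
print(decode(status))  # [(14, 'CmpltTO')]
```

This matches the "[14] CmpltTO (First)" line in the log: bit 14 is the
Completion Timeout bit, and the mask of zero means no error bits are
masked.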
> [11305.472089] pcieport 0000:00:03.0: AER: Device recovery failed
> [11305.472092] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal)
> error received: 0000:00:03.0
> [11305.472096] pcieport 0000:00:03.0: PCIe Bus Error:
> severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester
> ID)
> [11305.472142] pcielw 0000:52:0d.0:pcie204: link down processing complete
> [11305.605566] pcieport 0000:00:03.0: device [8086:6f08] error
> status/mask=00004000/00000000
> [11305.605568] pcieport 0000:00:03.0: [14] CmpltTO (First)
> [11305.605578] pcieport 0000:00:03.0: broadcast error_detected message
> [11305.787386] tg3 0000:61:00.0 lc5: PCI I/O error detected
>
Thread overview: 7+ messages
2020-09-03 18:28 Michael Chan
2020-09-03 19:24 ` David Miller
2020-09-04 23:20 ` Baptiste Covolato
2020-09-05 9:02 ` Michael Chan [this message]
2020-09-10 23:00 ` Baptiste Covolato
2020-09-08 16:55 ` David Christensen
2020-10-07 16:54 Tomas Charvat