Netdev Archive on lore.kernel.org
* Load on RTL8168g/8111g stalls network for multiple seconds
@ 2021-06-07 13:11 Johannes Brandstätter
  2021-06-07 20:39 ` Heiner Kallweit
  0 siblings, 1 reply; 5+ messages in thread
From: Johannes Brandstätter @ 2021-06-07 13:11 UTC
  To: netdev

Hi,

just the other day I wanted to set up a bridge between an external 2.5G
RTL8156 USB Ethernet adapter (using r8152) and the built-in dual
RTL8168g/8111g Ethernet chip (using r8169).
I compiled kernel 5.13.0-rc4 because its r8152 driver supports the
RTL8156.
I used the Debian kernel config of 5.10.0-4 as a base and left the
remaining options at their defaults.

So this setup was working the way I wanted it to, but unfortunately
when running iperf3 against the machine it would rather quickly stall
all communications on the internal RTL8168g.
I was still able to communicate fine over the external RTL8156 link
with the machine.
Even without the generated network load, it would occasionally become
stalled.

The only information I could really gather was that the rx_missed
counter was going up, plus this kernel message some time after the
stall happened:

[81853.129107] r8169 0000:02:00.0 enp2s0: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).

This apparently has to do with the wait for an empty FIFO within the
r8169 driver.
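
To check whether rx_missed climbs during a stall, the counter could be
sampled before and after a test run. The following is only a minimal
sketch: stat_value and counter_delta are hypothetical helpers, the
interface name enp2s0 is taken from the log line above, and it assumes
the counter shows up in `ethtool -S` output under the name rx_missed.

```shell
#!/bin/sh
# Print the value of one counter from `ethtool -S`-style text on stdin.
# stat_value is a hypothetical helper, not part of ethtool.
stat_value() {
    awk -v name="$1" -F': *' '
        { key = $1; sub(/^[ \t]+/, "", key) }   # strip leading indentation
        key == name { print $2; exit }
    '
}

# Report how much a counter grew between two samples.
counter_delta() {
    echo $(( $2 - $1 ))
}

# Example on real hardware (interface name assumed):
#   before=$(ethtool -S enp2s0 | stat_value rx_missed)
#   sleep 10
#   after=$(ethtool -S enp2s0 | stat_value rx_missed)
#   counter_delta "$before" "$after"
```

A delta that grows only while the link is stalled would suggest the
NIC is dropping frames because its receive path is not being drained.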

Until then, the machine (an UP² board) using the RTL8168g had run
without any issues for multiple years in different configurations.
Only bridging showed the issue, and did so immediately when given
enough network load.

After many hours of trying out different things, none of which made
any difference whatsoever, I replaced the internal RTL8168g with an
additional external USB Ethernet adapter which I had lying around,
with an RTL8153 inside.

Once the RTL8168g was removed and the RTL8153 added to the bridge, I
was unable to reproduce the issue.
Of course I'd prefer to make use of the two internal Ethernet
ports if I somehow can.

So is there anything I could try to do?

Next, I'm eyeing a regression test on the kernel's r8169 driver,
though without knowing whether there ever was a working version.
As this is a rather large task and my time is limited, I wanted to
seek out some help before I go down that route.

Maybe you could point me in the right direction as to what to try
next.

Thanks and best regards,
Johannes



* Re: Load on RTL8168g/8111g stalls network for multiple seconds
  2021-06-07 13:11 Load on RTL8168g/8111g stalls network for multiple seconds Johannes Brandstätter
@ 2021-06-07 20:39 ` Heiner Kallweit
  2021-06-07 20:54   ` Heiner Kallweit
  2021-06-08  0:11   ` Andrew Lunn
  0 siblings, 2 replies; 5+ messages in thread
From: Heiner Kallweit @ 2021-06-07 20:39 UTC
  To: Johannes Brandstätter; +Cc: netdev

On 07.06.2021 15:11, Johannes Brandstätter wrote:
> So is there anything I could try to do?
> 
Do you have flow control enabled? As of 5.13-rc, r8169 supports adjusting
pause settings via ethtool. You could play with the settings to see
whether it makes a difference.
The next thing you could check is whether the issue persists when using
the r8168 vendor driver.
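
A minimal sketch of how the pause settings could be inspected and
changed with ethtool's -a (show) and -A (set) options. The wrapper
functions run, show_pause and set_pause are hypothetical names, the
interface name enp2s0 is assumed from the report, and DRY_RUN defaults
to 1 so the script only prints the commands unless told otherwise.

```shell
#!/bin/sh
# DRY_RUN defaults to 1: commands are printed, not executed.
# Run with DRY_RUN=0 (as root) to apply them for real.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Show the current pause (flow control) parameters of an interface.
show_pause() {
    run ethtool -a "$1"
}

# Set RX and TX pause; $2 and $3 are each "on" or "off".
set_pause() {
    run ethtool -A "$1" rx "$2" tx "$3"
}

# Example: try the affected port with flow control fully off.
show_pause enp2s0
set_pause enp2s0 off off
```

Repeating the iperf3 run for each rx/tx combination would show whether
any pause setting changes the stall behaviour.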

However, I'm not an expert in bridging and don't know what difference
it could make whether a NIC is operated standalone or as part of a bridge.

Heiner


* Re: Load on RTL8168g/8111g stalls network for multiple seconds
  2021-06-07 20:39 ` Heiner Kallweit
@ 2021-06-07 20:54   ` Heiner Kallweit
  2021-06-08  0:11   ` Andrew Lunn
  1 sibling, 0 replies; 5+ messages in thread
From: Heiner Kallweit @ 2021-06-07 20:54 UTC
  To: Johannes Brandstätter; +Cc: netdev

Something else you could test: I run my interfaces with the following
settings (as a replacement for traditional interrupt coalescing).

echo 20000 > /sys/class/net/enp2s0/gro_flush_timeout
echo 1 > /sys/class/net/enp2s0/napi_defer_hard_irqs
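
A minimal sketch that wraps the two writes above so the values can be
read back and reverted. The function names and the SYSFS_NET override
are hypothetical conveniences; gro_flush_timeout is given in
nanoseconds and napi_defer_hard_irqs is a count.

```shell
#!/bin/sh
# SYSFS_NET can be pointed at a scratch directory for a dry run.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Write both knobs for one interface (needs root on a real system).
# $1: interface, $2: gro_flush_timeout in ns, $3: napi_defer_hard_irqs.
set_napi_deferral() {
    echo "$2" > "$SYSFS_NET/$1/gro_flush_timeout" &&
    echo "$3" > "$SYSFS_NET/$1/napi_defer_hard_irqs"
}

# Print the current values as "timeout defer".
get_napi_deferral() {
    echo "$(cat "$SYSFS_NET/$1/gro_flush_timeout") $(cat "$SYSFS_NET/$1/napi_defer_hard_irqs")"
}

# Example (as root):
#   get_napi_deferral enp2s0          # note the old values first
#   set_napi_deferral enp2s0 20000 1
#   set_napi_deferral enp2s0 0 0      # turn the deferral off again
```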


* Re: Load on RTL8168g/8111g stalls network for multiple seconds
  2021-06-07 20:39 ` Heiner Kallweit
  2021-06-07 20:54   ` Heiner Kallweit
@ 2021-06-08  0:11   ` Andrew Lunn
  2021-07-06 14:43     ` Johannes Brandstätter
  1 sibling, 1 reply; 5+ messages in thread
From: Andrew Lunn @ 2021-06-08  0:11 UTC
  To: Heiner Kallweit; +Cc: Johannes Brandstätter, netdev

On Mon, Jun 07, 2021 at 10:39:06PM +0200, Heiner Kallweit wrote:
> On 07.06.2021 15:11, Johannes Brandstätter wrote:
> > So is there anything I could try to do?

Using a bridge means the interface is in promiscuous mode. That is not
a mode that is typically used. What could be interesting is to keep the
interface out of the bridge, but do:

ip link set [interface] promisc on

Then run some iperf tests etc. to see if you can reproduce the issue.
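
A minimal sketch of that test sequence, with a small check that the
flag actually took effect before the load test starts. has_promisc_flag
is a hypothetical helper, and the interface name enp2s0 and the iperf3
server are assumptions.

```shell
#!/bin/sh
# Succeeds if `ip link show` text on stdin lists the PROMISC flag.
# has_promisc_flag is a hypothetical helper for this sketch.
has_promisc_flag() {
    grep -q '<[^>]*PROMISC[^>]*>'
}

# Example on real hardware (interface and server names assumed):
#   ip link set dev enp2s0 promisc on
#   ip link show dev enp2s0 | has_promisc_flag && echo "promisc is on"
#   iperf3 -c <server>        # then watch for the stall
#   ip link set dev enp2s0 promisc off
```

If the stall reproduces with the interface outside the bridge, the
bridge itself is less likely to be the trigger than promiscuous mode.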

     Andrew


* Re: Load on RTL8168g/8111g stalls network for multiple seconds
  2021-06-08  0:11   ` Andrew Lunn
@ 2021-07-06 14:43     ` Johannes Brandstätter
  0 siblings, 0 replies; 5+ messages in thread
From: Johannes Brandstätter @ 2021-07-06 14:43 UTC
  To: Andrew Lunn, Heiner Kallweit; +Cc: netdev

Thanks a lot for your quick responses, and sorry it took me so long to
get back to you. I tried all of your suggestions, but none of them
changed anything about the stall.

The machine runs a few services which my network relies on, so
interrupting normal operation is not desirable. And as the current USB
adapter solution is working fine, there was no real incentive to change
anything yet.

I still want to try a few things, upgrade to the final 5.13, and set up
a better test network for promisc mode. If I gather anything useful
I'll let you know.

Johannes

