Netdev Archive on lore.kernel.org
* Fw: [Bug 213729] New: PMTUD failure with ECMP.
@ 2021-07-14 15:13 Stephen Hemminger
  2021-07-14 16:11 ` Vadim Fedorenko
  0 siblings, 1 reply; 7+ messages in thread
From: Stephen Hemminger @ 2021-07-14 15:13 UTC (permalink / raw)
  To: netdev



Begin forwarded message:

Date: Wed, 14 Jul 2021 13:43:51 +0000
From: bugzilla-daemon@bugzilla.kernel.org
To: stephen@networkplumber.org
Subject: [Bug 213729] New: PMTUD failure with ECMP.


https://bugzilla.kernel.org/show_bug.cgi?id=213729

            Bug ID: 213729
           Summary: PMTUD failure with ECMP.
           Product: Networking
           Version: 2.5
    Kernel Version: 5.13.0-rc5
          Hardware: x86-64
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: IPV4
          Assignee: stephen@networkplumber.org
          Reporter: skappen@mvista.com
        Regression: No

Created attachment 297849
  --> https://bugzilla.kernel.org/attachment.cgi?id=297849&action=edit  
Ecmp pmtud test setup

PMTUD failure with ECMP.

We have observed failures when PMTUD and ECMP are used together.
Ping fails through one of the two gateways (gateway1 or gateway2) when the
packet size exceeds 1500 bytes.
The issue has been tested and reproduced on CentOS 8 and mainline kernels.


Kernel versions: 
[root@localhost ~]# uname -a
Linux localhost.localdomain 4.18.0-305.3.1.el8.x86_64 #1 SMP Tue Jun 1 16:14:33
UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

[root@localhost skappen]# uname -a
Linux localhost.localdomain 5.13.0-rc5 #2 SMP Thu Jun 10 05:06:28 EDT 2021
x86_64 x86_64 x86_64 GNU/Linux


Static routes with ECMP are configured like this:

[root@localhost skappen]# ip route
default proto static 
        nexthop via 192.168.0.11 dev enp0s3 weight 1 
        nexthop via 192.168.0.12 dev enp0s3 weight 1 
192.168.0.0/24 dev enp0s3 proto kernel scope link src 192.168.0.4 metric 100

So the host would pick the first or the second nexthop depending on ECMP's
hashing algorithm.
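
To see which nexthop the kernel selects for a given destination, the "via"
field of the "ip route get" output can be extracted. A minimal sketch, assuming
output in the same format as shown in this report; route_nexthop is a
hypothetical helper name, not an existing tool:

```sh
# route_nexthop: read `ip route get` output on stdin and print the
# gateway address that follows the "via" keyword (prints nothing if
# the route has no gateway).
route_nexthop() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "via") { print $(i + 1); exit } }'
}

# Intended use on the client:
#   ip route get 10.0.3.17 | route_nexthop
```

Running this for each address of the target shows which of the two gateways
each destination maps to. Which fields feed the multipath hash is controlled by
the net.ipv4.fib_multipath_hash_policy sysctl.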

Pinging the destination with a packet larger than 1500 bytes works through the
first gateway:

[root@localhost skappen]# ping -s1700 10.0.3.17
PING 10.0.3.17 (10.0.3.17) 1700(1728) bytes of data.
From 192.168.0.11 icmp_seq=1 Frag needed and DF set (mtu = 1500)
1708 bytes from 10.0.3.17: icmp_seq=2 ttl=63 time=0.880 ms
1708 bytes from 10.0.3.17: icmp_seq=3 ttl=63 time=1.26 ms
^C
--- 10.0.3.17 ping statistics ---
3 packets transmitted, 2 received, +1 errors, 33.3333% packet loss, time 2003ms
rtt min/avg/max/mdev = 0.880/1.067/1.255/0.190 ms
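
The sizes in the ping output follow from the header overhead: the 1700-byte
payload plus the 8-byte ICMP header gives the 1708 bytes reported per reply,
and adding a 20-byte IPv4 header gives the 1728 bytes on the wire. A small
sketch of that arithmetic, assuming an IPv4 header without options; the
function names are made up for illustration:

```sh
# icmp_ip_len: total IPv4 packet length for an ICMP echo with the given
# payload size (8-byte ICMP header + 20-byte IPv4 header, no IP options).
icmp_ip_len() {
    echo $(( $1 + 8 + 20 ))
}

# max_ping_payload: largest `ping -s` value that still fits into a single
# packet of the given MTU under the same assumptions.
max_ping_payload() {
    echo $(( $1 - 20 - 8 ))
}
```

With these, icmp_ip_len 1700 gives 1728, which exceeds the 1500-byte external
MTU, and max_ping_payload 1500 gives 1472, the largest payload that would pass
without triggering PMTUD.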

The MTU also gets cached for this route as per rfc6754:

[root@localhost skappen]# ip route get 10.0.3.17
10.0.3.17 via 192.168.0.11 dev enp0s3 src 192.168.0.4 uid 0 
    cache expires 540sec mtu 1500 

[root@localhost skappen]# tracepath -n 10.0.3.17
 1?: [LOCALHOST]                      pmtu 1500
 1:  192.168.0.11                                          1.475ms 
 1:  192.168.0.11                                          0.995ms 
 2:  192.168.0.11                                          1.075ms !H
     Resume: pmtu 1500         

However, when the second nexthop is picked, PMTUD breaks. In this example I
ping a second address configured on the same destination from the same host,
using the same routes and gateways. Based on ECMP's hashing algorithm, this
host picks the second nexthop (.12):

[root@localhost skappen]# ping -s1700 10.0.3.18
PING 10.0.3.18 (10.0.3.18) 1700(1728) bytes of data.
From 192.168.0.12 icmp_seq=1 Frag needed and DF set (mtu = 1500)
From 192.168.0.12 icmp_seq=2 Frag needed and DF set (mtu = 1500)
From 192.168.0.12 icmp_seq=3 Frag needed and DF set (mtu = 1500)
^C
--- 10.0.3.18 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2062ms
[root@localhost skappen]# ip route get 10.0.3.18
10.0.3.18 via 192.168.0.12 dev enp0s3 src 192.168.0.4 uid 0 
    cache 

[root@localhost skappen]# tracepath -n 10.0.3.18
 1?: [LOCALHOST]                      pmtu 9000
 1:  192.168.0.12                                          3.147ms 
 1:  192.168.0.12                                          0.696ms 
 2:  192.168.0.12                                          0.648ms pmtu 1500
 2:  192.168.0.12                                          0.761ms !H
     Resume: pmtu 1500     

The ICMP frag needed reaches the host, but in this case it is ignored.
The MTU for this route does not get cached either.
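
The difference between the two cases can be checked mechanically by looking for
an "mtu" value on the "cache" line of the "ip route get" output. A minimal
sketch, assuming the output format shown above; cached_pmtu is a hypothetical
helper name:

```sh
# cached_pmtu: read `ip route get` output on stdin; print the cached path
# MTU if a "cache ... mtu N" line is present, otherwise print nothing.
cached_pmtu() {
    awk '$1 == "cache" {
        for (i = 2; i < NF; i++)
            if ($i == "mtu") { print $(i + 1); exit }
    }'
}

# Intended use on the client:
#   ip route get 10.0.3.18 | cached_pmtu
```

For the working destination (10.0.3.17) this prints 1500; for the failing one
(10.0.3.18) it prints nothing, because the "cache" line carries no mtu entry.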


It looks like the MTU value for the next hop is not properly updated for some
reason.


Test Case:
Create 2 networks: Internal, External
Create 4 virtual machines: Client, GW-1, GW-2, Target

Client
configure 1 NIC to internal with MTU 9000
configure static route with ECMP to GW-1 and GW-2 internal address

GW-1, GW-2
configure 2 NICs
- to internal with MTU 9000
- to external MTU 1500
- enable ip_forward
- enable packet forward

Target
configure 1 NIC to external with MTU 1500
configure multiple IP addresses (say IP1, IP2, IP3, IP4) on the same interface,
so ECMP's hashing algorithm would pick different routes

Test
ping from client to target with packets larger than 1500 bytes
ping the other addresses of the target so ECMP would use the other route too

Results observed:
Through GW-1 PMTUD works: after the first frag needed message, the MTU is
lowered on the client side for this target. Through GW-2 PMTUD does not work:
all responses to ping are ICMP frag needed messages, which the kernel does not
obey. In all failure cases the MTU is not cached in the "ip route get" output.
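
The test case above could be approximated with network namespaces instead of
virtual machines. The following is only a sketch under assumptions that are not
part of this report: the namespace and interface names, the extra "sw"
namespace holding two bridges, the gateways' external addresses
(10.0.3.11 and 10.0.3.12), and the target's return route are all made up for
illustration. Whether this setup triggers the same failure has not been
verified. By default the commands are only printed; executing them requires
root.

```sh
# Print each command unless DRY_RUN=0 is set, in which case execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# attach NS IFNAME BRIDGE MTU ADDR: connect namespace NS to BRIDGE (which
# lives in the assumed "sw" namespace) with a veth pair and assign ADDR.
attach() {
    a_ns=$1 a_if=$2 a_br=$3 a_mtu=$4 a_addr=$5
    run ip link add "$a_if" netns "$a_ns" type veth peer name "$a_ns-$a_br" netns sw
    run ip -n sw link set "$a_ns-$a_br" mtu "$a_mtu"
    run ip -n sw link set "$a_ns-$a_br" master "$a_br" up
    run ip -n "$a_ns" link set "$a_if" mtu "$a_mtu" up
    run ip -n "$a_ns" addr add "$a_addr" dev "$a_if"
}

repro_setup() {
    for n in sw client gw1 gw2 target; do
        run ip netns add "$n"
    done
    for b in br-int br-ext; do
        run ip -n sw link add "$b" type bridge
        run ip -n sw link set "$b" up
    done

    # Internal network, MTU 9000 (client and gateway addresses from the report)
    attach client eth0 br-int 9000 192.168.0.4/24
    attach gw1 eth0 br-int 9000 192.168.0.11/24
    attach gw2 eth0 br-int 9000 192.168.0.12/24

    # External network, MTU 1500 (gateway addresses here are assumptions)
    attach gw1 eth1 br-ext 1500 10.0.3.11/24
    attach gw2 eth1 br-ext 1500 10.0.3.12/24
    attach target eth0 br-ext 1500 10.0.3.17/24
    run ip -n target addr add 10.0.3.18/24 dev eth0

    for g in gw1 gw2; do
        run ip netns exec "$g" sysctl -qw net.ipv4.ip_forward=1
    done

    # ECMP default route on the client, as in the report
    run ip -n client route add default \
        nexthop via 192.168.0.11 dev eth0 weight 1 \
        nexthop via 192.168.0.12 dev eth0 weight 1
    # Return route on the target (assumption: also ECMP over both gateways)
    run ip -n target route add 192.168.0.0/24 \
        nexthop via 10.0.3.11 dev eth0 weight 1 \
        nexthop via 10.0.3.12 dev eth0 weight 1
}

repro_test() {
    for dst in 10.0.3.17 10.0.3.18; do
        run ip netns exec client ping -c 3 -s 1700 "$dst"
        run ip -n client route get "$dst"
    done
}

repro_cleanup() {
    for n in sw client gw1 gw2 target; do
        run ip netns del "$n"
    done
}
```

Calling repro_setup, repro_test and repro_cleanup with DRY_RUN=0 in the
environment would run the commands for real. Two target addresses may hash to
the same nexthop, so more addresses might be needed on the target before both
gateways are exercised.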

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are the assignee for the bug.


* Re: Fw: [Bug 213729] New: PMTUD failure with ECMP.
  2021-07-14 15:13 Fw: [Bug 213729] New: PMTUD failure with ECMP Stephen Hemminger
@ 2021-07-14 16:11 ` Vadim Fedorenko
  2021-07-14 16:30   ` Ido Schimmel
  0 siblings, 1 reply; 7+ messages in thread
From: Vadim Fedorenko @ 2021-07-14 16:11 UTC (permalink / raw)
  To: Stephen Hemminger, netdev

I'm already working on PMTU and am implementing several new test cases for the
pmtu.sh test, so I will take care of this one too.


* Re: Fw: [Bug 213729] New: PMTUD failure with ECMP.
  2021-07-14 16:11 ` Vadim Fedorenko
@ 2021-07-14 16:30   ` Ido Schimmel
  2021-07-14 17:51     ` Vadim Fedorenko
  0 siblings, 1 reply; 7+ messages in thread
From: Ido Schimmel @ 2021-07-14 16:30 UTC (permalink / raw)
  To: Vadim Fedorenko; +Cc: Stephen Hemminger, netdev

On Wed, Jul 14, 2021 at 05:11:45PM +0100, Vadim Fedorenko wrote:
> I'm already working on PMTU and am implementing several new test cases for the
> pmtu.sh test, so I will take care of this one too.

Thanks

There was a similar report from around a year ago that might give you
more info:

https://lore.kernel.org/netdev/CANXY5y+iuzMg+4UdkPJW_Efun30KAPL1+h2S7HeSPp4zOrVC7g@mail.gmail.com/


* Re: Fw: [Bug 213729] New: PMTUD failure with ECMP.
  2021-07-14 16:30   ` Ido Schimmel
@ 2021-07-14 17:51     ` Vadim Fedorenko
  2021-07-14 18:12       ` David Ahern
  0 siblings, 1 reply; 7+ messages in thread
From: Vadim Fedorenko @ 2021-07-14 17:51 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: Stephen Hemminger, netdev

On 14.07.2021 17:30, Ido Schimmel wrote:
> On Wed, Jul 14, 2021 at 05:11:45PM +0100, Vadim Fedorenko wrote:
>> I'm already working on PMTU and am implementing several new test cases for the
>> pmtu.sh test, so I will take care of this one too.
> 
> Thanks
> 
> There was a similar report from around a year ago that might give you
> more info:
> 
> https://lore.kernel.org/netdev/CANXY5y+iuzMg+4UdkPJW_Efun30KAPL1+h2S7HeSPp4zOrVC7g@mail.gmail.com/
> 

Thanks Ido, will definitely look at it!


* Re: Fw: [Bug 213729] New: PMTUD failure with ECMP.
  2021-07-14 17:51     ` Vadim Fedorenko
@ 2021-07-14 18:12       ` David Ahern
  2021-07-19 23:29         ` Sam Kappen
  0 siblings, 1 reply; 7+ messages in thread
From: David Ahern @ 2021-07-14 18:12 UTC (permalink / raw)
  To: Vadim Fedorenko, Ido Schimmel; +Cc: Stephen Hemminger, netdev

On 7/14/21 11:51 AM, Vadim Fedorenko wrote:
> On 14.07.2021 17:30, Ido Schimmel wrote:
>> On Wed, Jul 14, 2021 at 05:11:45PM +0100, Vadim Fedorenko wrote:
>>> Looks like I'm in context of PMTU and also I'm working on implementing
>>> several new test cases for the pmtu.sh test, so I will take care of this
>>> one too.
>>
>> Thanks
>>
>> There was a similar report from around a year ago that might give you
>> more info:
>>
>> https://lore.kernel.org/netdev/CANXY5y+iuzMg+4UdkPJW_Efun30KAPL1+h2S7HeSPp4zOrVC7g@mail.gmail.com/
>>
>>
> 
> Thanks Ido, will definitely look at it!

I believe that one is fixed by 2fbc6e89b2f1403189e624cabaf73e189c5e50c6

The root cause of this problem is the ICMP taking a path that the original
packet did not, i.e., the ICMP is received on device 1 and the exception
is created on that device, but Rx chooses device 2 (a different leg in
the ECMP).
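
To make the mismatch concrete, here is a toy model of the behaviour described
above. It is only a sketch: the class and function names are hypothetical, the
hash values are picked by hand, and nothing here is kernel code. It assumes
that PMTU exceptions are stored per nexthop and that ICMP handling and the data
flow can resolve to different ECMP legs.

```python
# Toy model only: names and hash inputs are illustrative, not kernel code.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Nexthop:
    gateway: str
    # Per-nexthop PMTU exceptions, keyed by destination address.
    exceptions: Dict[str, int] = field(default_factory=dict)


def select_leg(legs: List[Nexthop], hash_value: int) -> Nexthop:
    """Pick an ECMP leg from a flow hash (equal weights assumed)."""
    return legs[hash_value % len(legs)]


def record_pmtu(legs: List[Nexthop], icmp_hash: int, dst: str, mtu: int) -> None:
    """Store the learned MTU on the leg chosen while handling the ICMP error."""
    select_leg(legs, icmp_hash).exceptions[dst] = mtu


def lookup_pmtu(legs: List[Nexthop], flow_hash: int, dst: str) -> Optional[int]:
    """Return the cached MTU as seen by the data flow, or None if absent."""
    return select_leg(legs, flow_hash).exceptions.get(dst)


def demo() -> Dict[str, Optional[int]]:
    legs = [Nexthop("192.168.0.11"), Nexthop("192.168.0.12")]
    results: Dict[str, Optional[int]] = {}

    # Case 1: ICMP handling and the data flow agree on the leg (hash 0 both
    # times), so the flow finds the exception.
    record_pmtu(legs, 0, "10.0.3.17", 1500)
    results["10.0.3.17"] = lookup_pmtu(legs, 0, "10.0.3.17")

    # Case 2: they disagree (hash 0 vs. hash 1), so the exception lands on a
    # leg the data flow never consults.
    record_pmtu(legs, 0, "10.0.3.18", 1500)
    results["10.0.3.18"] = lookup_pmtu(legs, 1, "10.0.3.18")
    return results


if __name__ == "__main__":
    print(demo())
```

In this model the second destination never sees a cached MTU, which matches the
symptom in the report (a bare "cache" line in "ip route get"), though the model
says nothing about where the real lookup diverges.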

* Re: Fw: [Bug 213729] New: PMTUD failure with ECMP.
  2021-07-14 18:12       ` David Ahern
@ 2021-07-19 23:29         ` Sam Kappen
  2021-07-20  0:02           ` Vadim Fedorenko
  0 siblings, 1 reply; 7+ messages in thread
From: Sam Kappen @ 2021-07-19 23:29 UTC (permalink / raw)
  To: David Ahern; +Cc: Vadim Fedorenko, Ido Schimmel, Stephen Hemminger, netdev

On Wed, Jul 14, 2021 at 11:42 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 7/14/21 11:51 AM, Vadim Fedorenko wrote:
> > On 14.07.2021 17:30, Ido Schimmel wrote:
> >> On Wed, Jul 14, 2021 at 05:11:45PM +0100, Vadim Fedorenko wrote:
> >>> Looks like I'm in context of PMTU and also I'm working on implementing
> >>> several new test cases for the pmtu.sh test, so I will take care of this
> >>> one too.
> >>
> >> Thanks
> >>
> >> There was a similar report from around a year ago that might give you
> >> more info:
> >>
> >> https://lore.kernel.org/netdev/CANXY5y+iuzMg+4UdkPJW_Efun30KAPL1+h2S7HeSPp4zOrVC7g@mail.gmail.com/
> >>
> >>
> >
> > Thanks Ido, will definitely look at it!
>
> I believe that one is fixed by 2fbc6e89b2f1403189e624cabaf73e189c5e50c6
>
> The root cause of this problem is the ICMP taking a path that the original
> packet did not, i.e., the ICMP is received on device 1 and the exception
> is created on that device, but Rx chooses device 2 (a different leg in
> the ECMP).

The actual test was carried out on the 5.13.0-rc5 kernel, and the 5.14-rc1
kernel was tested as well. The issue still reproduces.
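
When repeating the check across kernels, the "ip route get" inspection from the
report can be scripted. The helper below is a minimal sketch: the function name
is hypothetical, and it only parses text in the shape shown in the report
(a "cache" line that may carry "mtu N"); it does not run any command itself.

```python
from typing import Optional


def cached_pmtu(route_get_output: str) -> Optional[int]:
    """Return the MTU from the 'cache' line of `ip route get` output, if any."""
    for line in route_get_output.splitlines():
        tokens = line.split()
        # Only the line starting with "cache" carries the exception details.
        if tokens and tokens[0] == "cache" and "mtu" in tokens:
            idx = tokens.index("mtu")
            if idx + 1 < len(tokens) and tokens[idx + 1].isdigit():
                return int(tokens[idx + 1])
    return None


# Outputs copied from the report: the first gateway caches the MTU,
# the second one does not.
WORKING = """10.0.3.17 via 192.168.0.11 dev enp0s3 src 192.168.0.4 uid 0
    cache expires 540sec mtu 1500"""
FAILING = """10.0.3.18 via 192.168.0.12 dev enp0s3 src 192.168.0.4 uid 0
    cache"""

if __name__ == "__main__":
    print("10.0.3.17:", cached_pmtu(WORKING))
    print("10.0.3.18:", cached_pmtu(FAILING))
```

A result of None after a large ping to the destination would correspond to the
failure case described in the report.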

* Re: Fw: [Bug 213729] New: PMTUD failure with ECMP.
  2021-07-19 23:29         ` Sam Kappen
@ 2021-07-20  0:02           ` Vadim Fedorenko
  0 siblings, 0 replies; 7+ messages in thread
From: Vadim Fedorenko @ 2021-07-20  0:02 UTC (permalink / raw)
  To: Sam Kappen, David Ahern; +Cc: Ido Schimmel, Stephen Hemminger, netdev

On 20.07.2021 00:29, Sam Kappen wrote:
> On Wed, Jul 14, 2021 at 11:42 PM David Ahern <dsahern@gmail.com> wrote:
>>
>> On 7/14/21 11:51 AM, Vadim Fedorenko wrote:
>>> On 14.07.2021 17:30, Ido Schimmel wrote:
>>>> On Wed, Jul 14, 2021 at 05:11:45PM +0100, Vadim Fedorenko wrote:
>>>>> Looks like I'm in context of PMTU and also I'm working on implementing
>>>>> several new test cases for the pmtu.sh test, so I will take care of this
>>>>> one too.
>>>>
>>>> Thanks
>>>>
>>>> There was a similar report from around a year ago that might give you
>>>> more info:
>>>>
>>>> https://lore.kernel.org/netdev/CANXY5y+iuzMg+4UdkPJW_Efun30KAPL1+h2S7HeSPp4zOrVC7g@mail.gmail.com/
>>>>
>>>>
>>>
>>> Thanks Ido, will definitely look at it!
>>
>> I believe that one is fixed by 2fbc6e89b2f1403189e624cabaf73e189c5e50c6
>>
>> The root cause of this problem is the ICMP taking a path that the original
>> packet did not, i.e., the ICMP is received on device 1 and the exception
>> is created on that device, but Rx chooses device 2 (a different leg in
>> the ECMP).
> 
> The actual test was carried out on the 5.13.0-rc5 kernel, and the 5.14-rc1
> kernel was tested as well. The issue still reproduces.
> 
Sorry for being late, will take care of it tomorrow.

Thread overview: 7+ messages
2021-07-14 15:13 Fw: [Bug 213729] New: PMTUD failure with ECMP Stephen Hemminger
2021-07-14 16:11 ` Vadim Fedorenko
2021-07-14 16:30   ` Ido Schimmel
2021-07-14 17:51     ` Vadim Fedorenko
2021-07-14 18:12       ` David Ahern
2021-07-19 23:29         ` Sam Kappen
2021-07-20  0:02           ` Vadim Fedorenko
