Netdev Archive on lore.kernel.org
* Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
@ 2021-08-05 20:53 Martin Zaharinov
2021-08-06 4:40 ` Greg KH
0 siblings, 1 reply; 23+ messages in thread
From: Martin Zaharinov @ 2021-08-05 20:53 UTC (permalink / raw)
To: netdev, gregkh, Eric Dumazet
Hi netdev team,
Please check this error.
I previously reported this problem: https://www.spinics.net/lists/netdev/msg707513.html
but no solution was found.
The server configuration is: bonding port-channel (LACP) > Accel-PPP server > Huawei switch.
The server works fine with 500+ users going down/up.
But at some point the server hits a spike that affects other VLANs on the same server,
and in accel-ppp I see many rows with this error.
Is there a way to find and fix this bug?
I discussed this problem with the accel-ppp team; they claim it is a kernel bug and that a solution needs to be found with the kernel dev team.
[2021-08-05 13:52:05.294] vlan912: 24b205903d09718e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:05.298] vlan912: 24b205903d097162: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:05.626] vlan641: 24b205903d09711b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:11.000] vlan912: 24b205903d097105: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:17.852] vlan912: 24b205903d0971ae: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:21.113] vlan641: 24b205903d09715b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:27.963] vlan912: 24b205903d09718d: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:30.249] vlan496: 24b205903d097184: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:30.992] vlan420: 24b205903d09718a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:33.937] vlan640: 24b205903d0971cd: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:40.032] vlan912: 24b205903d097182: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:40.420] vlan912: 24b205903d0971d5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:42.799] vlan912: 24b205903d09713a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:42.799] vlan614: 24b205903d0971e5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:43.102] vlan912: 24b205903d097190: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:43.850] vlan479: 24b205903d097153: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:43.850] vlan479: 24b205903d097141: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:43.852] vlan912: 24b205903d097198: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:43.977] vlan637: 24b205903d097148: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:44.528] vlan637: 24b205903d0971c3: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
Martin
* Re: Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
2021-08-05 20:53 Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected Martin Zaharinov
@ 2021-08-06 4:40 ` Greg KH
2021-08-06 5:40 ` Martin Zaharinov
2021-08-08 15:14 ` Martin Zaharinov
0 siblings, 2 replies; 23+ messages in thread
From: Greg KH @ 2021-08-06 4:40 UTC (permalink / raw)
To: Martin Zaharinov; +Cc: netdev, Eric Dumazet
On Thu, Aug 05, 2021 at 11:53:50PM +0300, Martin Zaharinov wrote:
> Hi Net dev team
>
>
> Please check this error :
> Last time I write for this problem : https://www.spinics.net/lists/netdev/msg707513.html
>
> But not find any solution.
>
> Config of server is : Bonding port channel (LACP) > Accel PPP server > Huawei switch.
>
> Server is work fine users is down/up 500+ users .
> But in one moment server make spike and affect other vlans in same server .
> And in accel I see many row with this error.
>
> Is there options to find and fix this bug.
>
> With accel team I discus this problem and they claim it is kernel bug and need to find solution with Kernel dev team.
>
>
> [2021-08-05 13:52:05.294] vlan912: 24b205903d09718e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-08-05 13:52:05.298] vlan912: 24b205903d097162: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-08-05 13:52:05.626] vlan641: 24b205903d09711b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-08-05 13:52:11.000] vlan912: 24b205903d097105: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-08-05 13:52:17.852] vlan912: 24b205903d0971ae: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-08-05 13:52:21.113] vlan641: 24b205903d09715b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-08-05 13:52:27.963] vlan912: 24b205903d09718d: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-08-05 13:52:30.249] vlan496: 24b205903d097184: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-08-05 13:52:30.992] vlan420: 24b205903d09718a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-08-05 13:52:33.937] vlan640: 24b205903d0971cd: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-08-05 13:52:40.032] vlan912: 24b205903d097182: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-08-05 13:52:40.420] vlan912: 24b205903d0971d5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-08-05 13:52:42.799] vlan912: 24b205903d09713a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-08-05 13:52:42.799] vlan614: 24b205903d0971e5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-08-05 13:52:43.102] vlan912: 24b205903d097190: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-08-05 13:52:43.850] vlan479: 24b205903d097153: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-08-05 13:52:43.850] vlan479: 24b205903d097141: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-08-05 13:52:43.852] vlan912: 24b205903d097198: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-08-05 13:52:43.977] vlan637: 24b205903d097148: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-08-05 13:52:44.528] vlan637: 24b205903d0971c3: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
These are userspace error messages, not kernel messages.
What kernel version are you using?
thanks,
greg k-h
* Re: Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
2021-08-06 4:40 ` Greg KH
@ 2021-08-06 5:40 ` Martin Zaharinov
2021-08-08 15:14 ` Martin Zaharinov
1 sibling, 0 replies; 23+ messages in thread
From: Martin Zaharinov @ 2021-08-06 5:40 UTC (permalink / raw)
To: Greg KH; +Cc: netdev, Eric Dumazet
Hi Greg,
The latest kernel, 5.13.8.
I tried older versions from 5.10 to 5.13 and it's the same error.
Martin
> On 6 Aug 2021, at 7:40, Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Thu, Aug 05, 2021 at 11:53:50PM +0300, Martin Zaharinov wrote:
>> Hi Net dev team
>>
>>
>> Please check this error :
>> Last time I write for this problem : https://www.spinics.net/lists/netdev/msg707513.html
>>
>> But not find any solution.
>>
>> Config of server is : Bonding port channel (LACP) > Accel PPP server > Huawei switch.
>>
>> Server is work fine users is down/up 500+ users .
>> But in one moment server make spike and affect other vlans in same server .
>> And in accel I see many row with this error.
>>
>> Is there options to find and fix this bug.
>>
>> With accel team I discus this problem and they claim it is kernel bug and need to find solution with Kernel dev team.
>>
>>
>> [2021-08-05 13:52:05.294] vlan912: 24b205903d09718e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:05.298] vlan912: 24b205903d097162: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:05.626] vlan641: 24b205903d09711b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:11.000] vlan912: 24b205903d097105: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:17.852] vlan912: 24b205903d0971ae: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:21.113] vlan641: 24b205903d09715b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:27.963] vlan912: 24b205903d09718d: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:30.249] vlan496: 24b205903d097184: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:30.992] vlan420: 24b205903d09718a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:33.937] vlan640: 24b205903d0971cd: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:40.032] vlan912: 24b205903d097182: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:40.420] vlan912: 24b205903d0971d5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:42.799] vlan912: 24b205903d09713a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:42.799] vlan614: 24b205903d0971e5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:43.102] vlan912: 24b205903d097190: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:43.850] vlan479: 24b205903d097153: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:43.850] vlan479: 24b205903d097141: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:43.852] vlan912: 24b205903d097198: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:43.977] vlan637: 24b205903d097148: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:44.528] vlan637: 24b205903d0971c3: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>
> These are userspace error messages, not kernel messages.
>
> What kernel version are you using?
>
> thanks,
>
> greg k-h
* Re: Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
2021-08-06 4:40 ` Greg KH
2021-08-06 5:40 ` Martin Zaharinov
@ 2021-08-08 15:14 ` Martin Zaharinov
2021-08-08 15:23 ` Pali Rohár
1 sibling, 1 reply; 23+ messages in thread
From: Martin Zaharinov @ 2021-08-08 15:14 UTC (permalink / raw)
To: Greg KH; +Cc: netdev, Eric Dumazet, pali
Adding Pali Rohár, in case he has any ideas.
Martin
> On 6 Aug 2021, at 7:40, Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Thu, Aug 05, 2021 at 11:53:50PM +0300, Martin Zaharinov wrote:
>> Hi Net dev team
>>
>>
>> Please check this error :
>> Last time I write for this problem : https://www.spinics.net/lists/netdev/msg707513.html
>>
>> But not find any solution.
>>
>> Config of server is : Bonding port channel (LACP) > Accel PPP server > Huawei switch.
>>
>> Server is work fine users is down/up 500+ users .
>> But in one moment server make spike and affect other vlans in same server .
>> And in accel I see many row with this error.
>>
>> Is there options to find and fix this bug.
>>
>> With accel team I discus this problem and they claim it is kernel bug and need to find solution with Kernel dev team.
>>
>>
>> [2021-08-05 13:52:05.294] vlan912: 24b205903d09718e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:05.298] vlan912: 24b205903d097162: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:05.626] vlan641: 24b205903d09711b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:11.000] vlan912: 24b205903d097105: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:17.852] vlan912: 24b205903d0971ae: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:21.113] vlan641: 24b205903d09715b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:27.963] vlan912: 24b205903d09718d: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:30.249] vlan496: 24b205903d097184: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:30.992] vlan420: 24b205903d09718a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:33.937] vlan640: 24b205903d0971cd: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:40.032] vlan912: 24b205903d097182: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:40.420] vlan912: 24b205903d0971d5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:42.799] vlan912: 24b205903d09713a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:42.799] vlan614: 24b205903d0971e5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:43.102] vlan912: 24b205903d097190: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:43.850] vlan479: 24b205903d097153: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:43.850] vlan479: 24b205903d097141: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:43.852] vlan912: 24b205903d097198: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:43.977] vlan637: 24b205903d097148: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-08-05 13:52:44.528] vlan637: 24b205903d0971c3: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>
> These are userspace error messages, not kernel messages.
>
> What kernel version are you using?
>
> thanks,
>
> greg k-h
* Re: Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
2021-08-08 15:14 ` Martin Zaharinov
@ 2021-08-08 15:23 ` Pali Rohár
2021-08-08 15:29 ` Martin Zaharinov
0 siblings, 1 reply; 23+ messages in thread
From: Pali Rohár @ 2021-08-08 15:23 UTC (permalink / raw)
To: Martin Zaharinov; +Cc: Greg KH, netdev, Eric Dumazet
Hello!
On Sunday 08 August 2021 18:14:09 Martin Zaharinov wrote:
> Add Pali Rohár,
>
> If have any idea .
>
> Martin
>
> > On 6 Aug 2021, at 7:40, Greg KH <gregkh@linuxfoundation.org> wrote:
> >
> > On Thu, Aug 05, 2021 at 11:53:50PM +0300, Martin Zaharinov wrote:
> >> Hi Net dev team
> >>
> >>
> >> Please check this error :
> >> Last time I write for this problem : https://www.spinics.net/lists/netdev/msg707513.html
> >>
> >> But not find any solution.
> >>
> >> Config of server is : Bonding port channel (LACP) > Accel PPP server > Huawei switch.
> >>
> >> Server is work fine users is down/up 500+ users .
> >> But in one moment server make spike and affect other vlans in same server .
When did this error start happening? After a kernel upgrade? After a pppd
upgrade? After a system upgrade? Or when more users started
connecting?
> >> And in accel I see many row with this error.
> >>
> >> Is there options to find and fix this bug.
> >>
> >> With accel team I discus this problem and they claim it is kernel bug and need to find solution with Kernel dev team.
> >>
> >>
> >> [2021-08-05 13:52:05.294] vlan912: 24b205903d09718e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >> [2021-08-05 13:52:05.298] vlan912: 24b205903d097162: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >> [2021-08-05 13:52:05.626] vlan641: 24b205903d09711b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >> [2021-08-05 13:52:11.000] vlan912: 24b205903d097105: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >> [2021-08-05 13:52:17.852] vlan912: 24b205903d0971ae: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >> [2021-08-05 13:52:21.113] vlan641: 24b205903d09715b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >> [2021-08-05 13:52:27.963] vlan912: 24b205903d09718d: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >> [2021-08-05 13:52:30.249] vlan496: 24b205903d097184: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >> [2021-08-05 13:52:30.992] vlan420: 24b205903d09718a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >> [2021-08-05 13:52:33.937] vlan640: 24b205903d0971cd: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >> [2021-08-05 13:52:40.032] vlan912: 24b205903d097182: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >> [2021-08-05 13:52:40.420] vlan912: 24b205903d0971d5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >> [2021-08-05 13:52:42.799] vlan912: 24b205903d09713a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >> [2021-08-05 13:52:42.799] vlan614: 24b205903d0971e5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >> [2021-08-05 13:52:43.102] vlan912: 24b205903d097190: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >> [2021-08-05 13:52:43.850] vlan479: 24b205903d097153: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >> [2021-08-05 13:52:43.850] vlan479: 24b205903d097141: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >> [2021-08-05 13:52:43.852] vlan912: 24b205903d097198: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >> [2021-08-05 13:52:43.977] vlan637: 24b205903d097148: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >> [2021-08-05 13:52:44.528] vlan637: 24b205903d0971c3: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >
> > These are userspace error messages, not kernel messages.
> >
> > What kernel version are you using?
Yes, we need to know what kernel version you are using.
> > thanks,
> >
> > greg k-h
>
Another question: what version of the pppd daemon are you using?
Also, are you able to dump the state of the ppp channels and ppp units? We
need to know which tty device, file descriptor (or socket
extension) each ppp channel is (or should be) bound to.
* Re: Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
2021-08-08 15:23 ` Pali Rohár
@ 2021-08-08 15:29 ` Martin Zaharinov
2021-08-09 15:15 ` Pali Rohár
0 siblings, 1 reply; 23+ messages in thread
From: Martin Zaharinov @ 2021-08-08 15:29 UTC (permalink / raw)
To: Pali Rohár; +Cc: Greg KH, netdev, Eric Dumazet
Hi Pali,
Kernel 5.13.8.
The problem has existed since kernel 5.8; I have tried every major release since: 5.9, 5.10, 5.11, 5.12.
I use the accel-pppd daemon (not pppd).
And yes, it happens after users start connecting.
When the system boots and users connect for the first time, everyone connects without any problem.
During normal operation users disconnect and reconnect (power cuts, fiber cuts, or other network problems), but during a spike (maybe a lock or some other problem) ~400-500 users disconnect and other users are affected. The process load goes over 100%, and in the statistics I see many connections finishing and many starting.
During this time the log fills with ioctl(PPPIOCCONNECT): Transport endpoint is not connected lines. Once it finishes (unlocks, or whatever it is), the error stops appearing, the system returns to normal, and all the disconnected users reconnect.
Martin
> On 8 Aug 2021, at 18:23, Pali Rohár <pali@kernel.org> wrote:
>
> Hello!
>
> On Sunday 08 August 2021 18:14:09 Martin Zaharinov wrote:
>> Add Pali Rohár,
>>
>> If have any idea .
>>
>> Martin
>>
>>> On 6 Aug 2021, at 7:40, Greg KH <gregkh@linuxfoundation.org> wrote:
>>>
>>> On Thu, Aug 05, 2021 at 11:53:50PM +0300, Martin Zaharinov wrote:
>>>> Hi Net dev team
>>>>
>>>>
>>>> Please check this error :
>>>> Last time I write for this problem : https://www.spinics.net/lists/netdev/msg707513.html
>>>>
>>>> But not find any solution.
>>>>
>>>> Config of server is : Bonding port channel (LACP) > Accel PPP server > Huawei switch.
>>>>
>>>> Server is work fine users is down/up 500+ users .
>>>> But in one moment server make spike and affect other vlans in same server .
>
> When this error started to happen? After kernel upgrade? After pppd
> upgrade? Or after system upgrade? Or when more users started to
> connecting?
>
>>>> And in accel I see many row with this error.
>>>>
>>>> Is there options to find and fix this bug.
>>>>
>>>> With accel team I discus this problem and they claim it is kernel bug and need to find solution with Kernel dev team.
>>>>
>>>>
>>>> [2021-08-05 13:52:05.294] vlan912: 24b205903d09718e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>> [2021-08-05 13:52:05.298] vlan912: 24b205903d097162: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>> [2021-08-05 13:52:05.626] vlan641: 24b205903d09711b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>> [2021-08-05 13:52:11.000] vlan912: 24b205903d097105: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>> [2021-08-05 13:52:17.852] vlan912: 24b205903d0971ae: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>> [2021-08-05 13:52:21.113] vlan641: 24b205903d09715b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>> [2021-08-05 13:52:27.963] vlan912: 24b205903d09718d: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>> [2021-08-05 13:52:30.249] vlan496: 24b205903d097184: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>> [2021-08-05 13:52:30.992] vlan420: 24b205903d09718a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>> [2021-08-05 13:52:33.937] vlan640: 24b205903d0971cd: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>> [2021-08-05 13:52:40.032] vlan912: 24b205903d097182: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>> [2021-08-05 13:52:40.420] vlan912: 24b205903d0971d5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>> [2021-08-05 13:52:42.799] vlan912: 24b205903d09713a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>> [2021-08-05 13:52:42.799] vlan614: 24b205903d0971e5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>> [2021-08-05 13:52:43.102] vlan912: 24b205903d097190: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>> [2021-08-05 13:52:43.850] vlan479: 24b205903d097153: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>> [2021-08-05 13:52:43.850] vlan479: 24b205903d097141: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>> [2021-08-05 13:52:43.852] vlan912: 24b205903d097198: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>> [2021-08-05 13:52:43.977] vlan637: 24b205903d097148: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>> [2021-08-05 13:52:44.528] vlan637: 24b205903d0971c3: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>
>>> These are userspace error messages, not kernel messages.
>>>
>>> What kernel version are you using?
>
> Yes, we need to know, what kernel version are you using.
>
>>> thanks,
>>>
>>> greg k-h
>>
>
> And also another question, what version of pppd daemon are you using?
>
> Also, are you able to dump state of ppp channels and ppp units? It is
> needed to know to which tty device, file descriptor (or socket
> extension) is (or should be) particular ppp channel bounded.
* Re: Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
2021-08-08 15:29 ` Martin Zaharinov
@ 2021-08-09 15:15 ` Pali Rohár
2021-08-10 18:27 ` Martin Zaharinov
2021-08-11 11:10 ` Martin Zaharinov
0 siblings, 2 replies; 23+ messages in thread
From: Pali Rohár @ 2021-08-09 15:15 UTC (permalink / raw)
To: Martin Zaharinov; +Cc: Greg KH, netdev, Eric Dumazet
On Sunday 08 August 2021 18:29:30 Martin Zaharinov wrote:
> Hi Pali
>
> Kernel 5.13.8
>
>
> The problem is from kernel 5.8 > I try all major update 5.9, 5.10, 5.11 ,5.12
>
> I use accel-pppd daemon (not pppd) .
I'm not using accel-pppd, so I cannot help here.
I would suggest running "git bisect" to find the kernel version that first
became problematic for accel-pppd.
Providing the state of the ppp channels and ppp units could help debug this
issue, but I'm not sure whether accel-pppd has that debug feature. IIRC only
a process holding the ppp file descriptors can retrieve and dump this
information.
> And yes after users started to connecting .
>
> When system boot and connect first time all user connect without any problem .
> In time of work user disconnect and connect (power cut , fiber cut or other problem in network) , but in time of spike (may be make lock or other problem ) disconnect ~ 400-500 users and affect other users. Process go to load over 100% and In statistic I see many finishing connection and many start connection.
> And in this time in log get many lines with ioctl(PPPIOCCONNECT): Transport endpoint is not connected. After finish (unlock or other) stop to see this error and system is back to normal. And connect all disconnected users.
>
> Martin
>
> > On 8 Aug 2021, at 18:23, Pali Rohár <pali@kernel.org> wrote:
> >
> > Hello!
> >
> > On Sunday 08 August 2021 18:14:09 Martin Zaharinov wrote:
> >> Add Pali Rohár,
> >>
> >> If have any idea .
> >>
> >> Martin
> >>
> >>> On 6 Aug 2021, at 7:40, Greg KH <gregkh@linuxfoundation.org> wrote:
> >>>
> >>> On Thu, Aug 05, 2021 at 11:53:50PM +0300, Martin Zaharinov wrote:
> >>>> Hi Net dev team
> >>>>
> >>>>
> >>>> Please check this error :
> >>>> Last time I write for this problem : https://www.spinics.net/lists/netdev/msg707513.html
> >>>>
> >>>> But not find any solution.
> >>>>
> >>>> Config of server is : Bonding port channel (LACP) > Accel PPP server > Huawei switch.
> >>>>
> >>>> Server is work fine users is down/up 500+ users .
> >>>> But in one moment server make spike and affect other vlans in same server .
> >
> > When this error started to happen? After kernel upgrade? After pppd
> > upgrade? Or after system upgrade? Or when more users started to
> > connecting?
> >
> >>>> And in accel I see many row with this error.
> >>>>
> >>>> Is there options to find and fix this bug.
> >>>>
> >>>> With accel team I discus this problem and they claim it is kernel bug and need to find solution with Kernel dev team.
> >>>>
> >>>>
> >>>> [2021-08-05 13:52:05.294] vlan912: 24b205903d09718e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>> [2021-08-05 13:52:05.298] vlan912: 24b205903d097162: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>> [2021-08-05 13:52:05.626] vlan641: 24b205903d09711b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>> [2021-08-05 13:52:11.000] vlan912: 24b205903d097105: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>> [2021-08-05 13:52:17.852] vlan912: 24b205903d0971ae: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>> [2021-08-05 13:52:21.113] vlan641: 24b205903d09715b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>> [2021-08-05 13:52:27.963] vlan912: 24b205903d09718d: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>> [2021-08-05 13:52:30.249] vlan496: 24b205903d097184: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>> [2021-08-05 13:52:30.992] vlan420: 24b205903d09718a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>> [2021-08-05 13:52:33.937] vlan640: 24b205903d0971cd: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>> [2021-08-05 13:52:40.032] vlan912: 24b205903d097182: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>> [2021-08-05 13:52:40.420] vlan912: 24b205903d0971d5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>> [2021-08-05 13:52:42.799] vlan912: 24b205903d09713a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>> [2021-08-05 13:52:42.799] vlan614: 24b205903d0971e5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>> [2021-08-05 13:52:43.102] vlan912: 24b205903d097190: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>> [2021-08-05 13:52:43.850] vlan479: 24b205903d097153: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>> [2021-08-05 13:52:43.850] vlan479: 24b205903d097141: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>> [2021-08-05 13:52:43.852] vlan912: 24b205903d097198: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>> [2021-08-05 13:52:43.977] vlan637: 24b205903d097148: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>> [2021-08-05 13:52:44.528] vlan637: 24b205903d0971c3: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>
> >>> These are userspace error messages, not kernel messages.
> >>>
> >>> What kernel version are you using?
> >
> > Yes, we need to know, what kernel version are you using.
> >
> >>> thanks,
> >>>
> >>> greg k-h
> >>
> >
> > And also another question, what version of pppd daemon are you using?
> >
> > Also, are you able to dump state of ppp channels and ppp units? It is
> > needed to know to which tty device, file descriptor (or socket
> > extension) is (or should be) particular ppp channel bounded.
>
* Re: Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
2021-08-09 15:15 ` Pali Rohár
@ 2021-08-10 18:27 ` Martin Zaharinov
2021-08-11 16:40 ` Guillaume Nault
2021-08-11 11:10 ` Martin Zaharinov
1 sibling, 1 reply; 23+ messages in thread
From: Martin Zaharinov @ 2021-08-10 18:27 UTC (permalink / raw)
To: Pali Rohár; +Cc: Greg KH, netdev, Eric Dumazet, Guillaume Nault
Adding Guillaume Nault.
> On 9 Aug 2021, at 18:15, Pali Rohár <pali@kernel.org> wrote:
>
> On Sunday 08 August 2021 18:29:30 Martin Zaharinov wrote:
>> Hi Pali
>>
>> Kernel 5.13.8
>>
>>
>> The problem is from kernel 5.8 > I try all major update 5.9, 5.10, 5.11 ,5.12
>>
>> I use accel-pppd daemon (not pppd) .
>
> I'm not using accel-pppd, so cannot help here.
>
> I would suggest to try "git bisect" kernel version which started to be
> problematic for accel-pppd.
>
> Providing state of ppp channels and ppp units could help to debug this
> issue, but I'm not sure if accel-pppd has this debug feature. IIRC only
> process which has ppp file descriptors can retrieve and dump this
> information.
>
>> And yes after users started to connecting .
>>
>> When system boot and connect first time all user connect without any problem .
>> In time of work user disconnect and connect (power cut , fiber cut or other problem in network) , but in time of spike (may be make lock or other problem ) disconnect ~ 400-500 users and affect other users. Process go to load over 100% and In statistic I see many finishing connection and many start connection.
>> And in this time in log get many lines with ioctl(PPPIOCCONNECT): Transport endpoint is not connected. After finish (unlock or other) stop to see this error and system is back to normal. And connect all disconnected users.
>>
>> Martin
>>
>>> On 8 Aug 2021, at 18:23, Pali Rohár <pali@kernel.org> wrote:
>>>
>>> Hello!
>>>
>>> On Sunday 08 August 2021 18:14:09 Martin Zaharinov wrote:
>>>> Add Pali Rohár,
>>>>
>>>> If have any idea .
>>>>
>>>> Martin
>>>>
>>>>> On 6 Aug 2021, at 7:40, Greg KH <gregkh@linuxfoundation.org> wrote:
>>>>>
>>>>> On Thu, Aug 05, 2021 at 11:53:50PM +0300, Martin Zaharinov wrote:
>>>>>> Hi Net dev team
>>>>>>
>>>>>>
>>>>>> Please check this error :
>>>>>> Last time I write for this problem : https://www.spinics.net/lists/netdev/msg707513.html
>>>>>>
>>>>>> But not find any solution.
>>>>>>
>>>>>> Config of server is : Bonding port channel (LACP) > Accel PPP server > Huawei switch.
>>>>>>
>>>>>> Server is work fine users is down/up 500+ users .
>>>>>> But in one moment server make spike and affect other vlans in same server .
>>>
>>> When this error started to happen? After kernel upgrade? After pppd
>>> upgrade? Or after system upgrade? Or when more users started to
>>> connecting?
>>>
>>>>>> And in accel I see many row with this error.
>>>>>>
>>>>>> Is there options to find and fix this bug.
>>>>>>
>>>>>> With accel team I discus this problem and they claim it is kernel bug and need to find solution with Kernel dev team.
>>>>>>
>>>>>>
>>>>>> [2021-08-05 13:52:05.294] vlan912: 24b205903d09718e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:05.298] vlan912: 24b205903d097162: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:05.626] vlan641: 24b205903d09711b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:11.000] vlan912: 24b205903d097105: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:17.852] vlan912: 24b205903d0971ae: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:21.113] vlan641: 24b205903d09715b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:27.963] vlan912: 24b205903d09718d: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:30.249] vlan496: 24b205903d097184: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:30.992] vlan420: 24b205903d09718a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:33.937] vlan640: 24b205903d0971cd: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:40.032] vlan912: 24b205903d097182: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:40.420] vlan912: 24b205903d0971d5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:42.799] vlan912: 24b205903d09713a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:42.799] vlan614: 24b205903d0971e5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:43.102] vlan912: 24b205903d097190: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:43.850] vlan479: 24b205903d097153: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:43.850] vlan479: 24b205903d097141: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:43.852] vlan912: 24b205903d097198: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:43.977] vlan637: 24b205903d097148: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:44.528] vlan637: 24b205903d0971c3: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>
>>>>> These are userspace error messages, not kernel messages.
>>>>>
>>>>> What kernel version are you using?
>>>
>>> Yes, we need to know, what kernel version are you using.
>>>
>>>>> thanks,
>>>>>
>>>>> greg k-h
>>>>
>>>
>>> And also another question, what version of pppd daemon are you using?
>>>
>>> Also, are you able to dump state of ppp channels and ppp units? It is
>>> needed to know to which tty device, file descriptor (or socket
>>> extension) is (or should be) particular ppp channel bounded.
>>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
2021-08-09 15:15 ` Pali Rohár
2021-08-10 18:27 ` Martin Zaharinov
@ 2021-08-11 11:10 ` Martin Zaharinov
2021-08-11 16:48 ` Guillaume Nault
1 sibling, 1 reply; 23+ messages in thread
From: Martin Zaharinov @ 2021-08-11 11:10 UTC (permalink / raw)
To: Pali Rohár, Guillaume Nault; +Cc: Greg KH, netdev, Eric Dumazet
One more thing I see:
The problem comes when accel starts finishing sessions.
Right now the server has 2k users; 3 OLTs with 400 users restarted on one of the vlans, and this affected the other vlans.
The problem starts when the kernel begins destroying the dead sessions from the vlan with the 3 OLTs, and this affects all the other vlans.
Maybe the kernel destroys old sessions slowly and drags other users down by locking their sessions.
Is there a way to speed up the closing of stopped/dead sessions?
Martin
> On 9 Aug 2021, at 18:15, Pali Rohár <pali@kernel.org> wrote:
>
> On Sunday 08 August 2021 18:29:30 Martin Zaharinov wrote:
>> Hi Pali
>>
>> Kernel 5.13.8
>>
>>
>> The problem is from kernel 5.8 > I try all major update 5.9, 5.10, 5.11 ,5.12
>>
>> I use accel-pppd daemon (not pppd) .
>
> I'm not using accel-pppd, so cannot help here.
>
> I would suggest to try "git bisect" kernel version which started to be
> problematic for accel-pppd.
>
> Providing state of ppp channels and ppp units could help to debug this
> issue, but I'm not sure if accel-pppd has this debug feature. IIRC only
> process which has ppp file descriptors can retrieve and dump this
> information.
>
>> And yes after users started to connecting .
>>
>> When system boot and connect first time all user connect without any problem .
>> In time of work user disconnect and connect (power cut , fiber cut or other problem in network) , but in time of spike (may be make lock or other problem ) disconnect ~ 400-500 users and affect other users. Process go to load over 100% and In statistic I see many finishing connection and many start connection.
>> And in this time in log get many lines with ioctl(PPPIOCCONNECT): Transport endpoint is not connected. After finish (unlock or other) stop to see this error and system is back to normal. And connect all disconnected users.
>>
>> Martin
>>
>>> On 8 Aug 2021, at 18:23, Pali Rohár <pali@kernel.org> wrote:
>>>
>>> Hello!
>>>
>>> On Sunday 08 August 2021 18:14:09 Martin Zaharinov wrote:
>>>> Add Pali Rohár,
>>>>
>>>> If have any idea .
>>>>
>>>> Martin
>>>>
>>>>> On 6 Aug 2021, at 7:40, Greg KH <gregkh@linuxfoundation.org> wrote:
>>>>>
>>>>> On Thu, Aug 05, 2021 at 11:53:50PM +0300, Martin Zaharinov wrote:
>>>>>> Hi Net dev team
>>>>>>
>>>>>>
>>>>>> Please check this error :
>>>>>> Last time I write for this problem : https://www.spinics.net/lists/netdev/msg707513.html
>>>>>>
>>>>>> But not find any solution.
>>>>>>
>>>>>> Config of server is : Bonding port channel (LACP) > Accel PPP server > Huawei switch.
>>>>>>
>>>>>> Server is work fine users is down/up 500+ users .
>>>>>> But in one moment server make spike and affect other vlans in same server .
>>>
>>> When this error started to happen? After kernel upgrade? After pppd
>>> upgrade? Or after system upgrade? Or when more users started to
>>> connecting?
>>>
>>>>>> And in accel I see many row with this error.
>>>>>>
>>>>>> Is there options to find and fix this bug.
>>>>>>
>>>>>> With accel team I discus this problem and they claim it is kernel bug and need to find solution with Kernel dev team.
>>>>>>
>>>>>>
>>>>>> [2021-08-05 13:52:05.294] vlan912: 24b205903d09718e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:05.298] vlan912: 24b205903d097162: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:05.626] vlan641: 24b205903d09711b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:11.000] vlan912: 24b205903d097105: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:17.852] vlan912: 24b205903d0971ae: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:21.113] vlan641: 24b205903d09715b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:27.963] vlan912: 24b205903d09718d: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:30.249] vlan496: 24b205903d097184: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:30.992] vlan420: 24b205903d09718a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:33.937] vlan640: 24b205903d0971cd: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:40.032] vlan912: 24b205903d097182: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:40.420] vlan912: 24b205903d0971d5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:42.799] vlan912: 24b205903d09713a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:42.799] vlan614: 24b205903d0971e5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:43.102] vlan912: 24b205903d097190: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:43.850] vlan479: 24b205903d097153: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:43.850] vlan479: 24b205903d097141: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:43.852] vlan912: 24b205903d097198: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:43.977] vlan637: 24b205903d097148: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:44.528] vlan637: 24b205903d0971c3: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>
>>>>> These are userspace error messages, not kernel messages.
>>>>>
>>>>> What kernel version are you using?
>>>
>>> Yes, we need to know, what kernel version are you using.
>>>
>>>>> thanks,
>>>>>
>>>>> greg k-h
>>>>
>>>
>>> And also another question, what version of pppd daemon are you using?
>>>
>>> Also, are you able to dump state of ppp channels and ppp units? It is
>>> needed to know to which tty device, file descriptor (or socket
>>> extension) is (or should be) particular ppp channel bounded.
>>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
2021-08-10 18:27 ` Martin Zaharinov
@ 2021-08-11 16:40 ` Guillaume Nault
0 siblings, 0 replies; 23+ messages in thread
From: Guillaume Nault @ 2021-08-11 16:40 UTC (permalink / raw)
To: Martin Zaharinov; +Cc: Pali Rohár, Greg KH, netdev, Eric Dumazet
On Tue, Aug 10, 2021 at 09:27:14PM +0300, Martin Zaharinov wrote:
> Add Guillaume Nault
>
> > On 9 Aug 2021, at 18:15, Pali Rohár <pali@kernel.org> wrote:
> >
> > On Sunday 08 August 2021 18:29:30 Martin Zaharinov wrote:
> >>>>>> [2021-08-05 13:52:05.294] vlan912: 24b205903d09718e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:05.298] vlan912: 24b205903d097162: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:05.626] vlan641: 24b205903d09711b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:11.000] vlan912: 24b205903d097105: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:17.852] vlan912: 24b205903d0971ae: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:21.113] vlan641: 24b205903d09715b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:27.963] vlan912: 24b205903d09718d: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:30.249] vlan496: 24b205903d097184: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:30.992] vlan420: 24b205903d09718a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:33.937] vlan640: 24b205903d0971cd: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:40.032] vlan912: 24b205903d097182: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:40.420] vlan912: 24b205903d0971d5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:42.799] vlan912: 24b205903d09713a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:42.799] vlan614: 24b205903d0971e5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:43.102] vlan912: 24b205903d097190: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:43.850] vlan479: 24b205903d097153: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:43.850] vlan479: 24b205903d097141: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:43.852] vlan912: 24b205903d097198: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:43.977] vlan637: 24b205903d097148: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:44.528] vlan637: 24b205903d0971c3: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
The PPPIOCCONNECT ioctl returns -ENOTCONN if the ppp channel has been
unregistered.
From a user space point of view, this means that accel-ppp establishes
PPPoE sessions, starts negotiating PPP connection parameters on top of
them (LCP and authentication) and finally the PPPoE sessions get
disconnected before accel-ppp connects them to ppp units (units are
roughly the "pppX" network devices).
Unregistration of PPPoE channels can happen for the following reasons:
* Changing some parameters of the network interface used by the
PPPoE connection: MAC address, MTU, bringing the device down.
* Reception of a PADT (PPPoE disconnection message sent from the peer).
* Closing the PPPoE socket.
* Re-connecting a PPPoE socket with a different session ID (this
unregisters the previous channel and creates a new one, so that
shouldn't be the problem you're facing here).
Given that this seems to affect all PPPoE connections, I guess
something happened to the underlying network interface (1st bullet
point).
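For reference, the "Transport endpoint is not connected" string in these logs is just the C library's rendering of ENOTCONN, the error described above; a quick sanity check from Python (a sketch, assuming a Linux system):

```python
import errno
import os

# PPPIOCCONNECT fails with ENOTCONN when the ppp channel was
# unregistered before the attach; accel-ppp prints strerror(errno).
assert errno.ENOTCONN == 107  # Linux value
print(errno.errorcode[errno.ENOTCONN])  # → ENOTCONN
print(os.strerror(errno.ENOTCONN))      # the text seen in the log
```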
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
2021-08-11 11:10 ` Martin Zaharinov
@ 2021-08-11 16:48 ` Guillaume Nault
2021-09-07 6:16 ` Martin Zaharinov
0 siblings, 1 reply; 23+ messages in thread
From: Guillaume Nault @ 2021-08-11 16:48 UTC (permalink / raw)
To: Martin Zaharinov; +Cc: Pali Rohár, Greg KH, netdev, Eric Dumazet
On Wed, Aug 11, 2021 at 02:10:32PM +0300, Martin Zaharinov wrote:
> And one more that see.
>
> Problem is come when accel start finishing sessions,
> Now in server have 2k users and restart on one of vlans 3 Olt with 400 users and affect other vlans ,
> And problem is start when start destroying dead sessions from vlan with 3 Olt and this affect all other vlans.
> May be kernel destroy old session slow and entrained other users by locking other sessions.
> is there a way to speed up the closing of stopped/dead sessions.
What are the CPU stats when that happens? Is it user space or kernel
space that keeps it busy?
One easy way to check is to run "mpstat 1" for a few seconds when the
problem occurs.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
2021-08-11 16:48 ` Guillaume Nault
@ 2021-09-07 6:16 ` Martin Zaharinov
2021-09-07 6:42 ` Martin Zaharinov
0 siblings, 1 reply; 23+ messages in thread
From: Martin Zaharinov @ 2021-09-07 6:16 UTC (permalink / raw)
To: Guillaume Nault; +Cc: Pali Rohár, Greg KH, netdev, Eric Dumazet
Hi,
Sorry for the delay, but it is not easy to catch the moment.
Here is the "mpstat 1" output:
Linux 5.14.1 (demobng) 09/07/21 _x86_64_ (12 CPU)
11:12:16 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
11:12:17 all 0.17 0.00 6.66 0.00 0.00 4.13 0.00 0.00 0.00 89.05
11:12:18 all 0.25 0.00 8.36 0.00 0.00 4.88 0.00 0.00 0.00 86.51
11:12:19 all 0.26 0.00 9.62 0.00 0.00 3.91 0.00 0.00 0.00 86.21
11:12:20 all 0.85 0.00 6.00 0.00 0.00 4.31 0.00 0.00 0.00 88.84
11:12:21 all 0.08 0.00 4.45 0.00 0.00 4.79 0.00 0.00 0.00 90.67
11:12:22 all 0.17 0.00 9.50 0.00 0.00 4.58 0.00 0.00 0.00 85.75
11:12:23 all 0.00 0.00 6.92 0.00 0.00 2.48 0.00 0.00 0.00 90.61
11:12:24 all 0.17 0.00 5.45 0.00 0.00 4.27 0.00 0.00 0.00 90.11
11:12:25 all 0.25 0.00 5.38 0.00 0.00 4.79 0.00 0.00 0.00 89.58
11:12:26 all 0.60 0.00 1.45 0.00 0.00 2.65 0.00 0.00 0.00 95.30
11:12:27 all 0.42 0.00 6.91 0.00 0.00 4.47 0.00 0.00 0.00 88.20
11:12:28 all 0.00 0.00 6.75 0.00 0.00 4.18 0.00 0.00 0.00 89.07
11:12:29 all 0.17 0.00 3.52 0.00 0.00 5.11 0.00 0.00 0.00 91.20
11:12:30 all 1.45 0.00 10.14 0.00 0.00 3.49 0.00 0.00 0.00 84.92
11:12:31 all 0.09 0.00 5.11 0.00 0.00 4.77 0.00 0.00 0.00 90.03
11:12:32 all 0.25 0.00 3.11 0.00 0.00 4.46 0.00 0.00 0.00 92.17
Average: all 0.32 0.00 6.21 0.00 0.00 4.21 0.00 0.00 0.00 89.26
I also attach a screenshot from perf top (the screenshot was sent in the previous mail).
And I see in lsmod:
pppoe 20480 8198
pppox 16384 1 pppoe
ppp_generic 45056 16364 pppox,pppoe
slhc 16384 1 ppp_generic
PPPoE session removal is too slow.
And from the log:
[2021-09-07 11:01:11.129] vlan3020: ebdd1c5d8b5900f6: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-09-07 11:01:53.621] vlan643: ebdd1c5d8b59014e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-09-07 11:02:00.359] vlan1616: ebdd1c5d8b590195: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-09-07 11:02:05.859] vlan3020: ebdd1c5d8b5900d8: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-09-07 11:02:08.258] vlan3005: ebdd1c5d8b590190: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-09-07 11:02:13.820] vlan643: ebdd1c5d8b590152: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-09-07 11:02:15.839] vlan727: ebdd1c5d8b590144: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-09-07 11:02:20.139] vlan1693: ebdd1c5d8b59019f: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
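Bursts like the one above are easier to reason about when tallied per vlan; a small helper (hypothetical, written for this accel-ppp log format) does that:

```python
import collections
import re

# Count PPPIOCCONNECT failures per vlan in an accel-ppp log excerpt.
# Matches lines like:
#   [2021-09-07 11:01:11.129] vlan3020: <sid>: ioctl(PPPIOCCONNECT): ...
LINE = re.compile(r"\[(?P<ts>[^\]]+)\] (?P<vlan>vlan\d+): .*PPPIOCCONNECT")

def errors_per_vlan(log_lines):
    counts = collections.Counter()
    for line in log_lines:
        m = LINE.search(line)
        if m:
            counts[m.group("vlan")] += 1
    return counts

sample = [
    "[2021-09-07 11:01:11.129] vlan3020: ebdd1c5d8b5900f6: ioctl(PPPIOCCONNECT): Transport endpoint is not connected",
    "[2021-09-07 11:02:05.859] vlan3020: ebdd1c5d8b5900d8: ioctl(PPPIOCCONNECT): Transport endpoint is not connected",
    "[2021-09-07 11:01:53.621] vlan643: ebdd1c5d8b59014e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected",
]
print(errors_per_vlan(sample)["vlan3020"])  # → 2
```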
> On 11 Aug 2021, at 19:48, Guillaume Nault <gnault@redhat.com> wrote:
>
> On Wed, Aug 11, 2021 at 02:10:32PM +0300, Martin Zaharinov wrote:
>> And one more that see.
>>
>> Problem is come when accel start finishing sessions,
>> Now in server have 2k users and restart on one of vlans 3 Olt with 400 users and affect other vlans ,
>> And problem is start when start destroying dead sessions from vlan with 3 Olt and this affect all other vlans.
>> May be kernel destroy old session slow and entrained other users by locking other sessions.
>> is there a way to speed up the closing of stopped/dead sessions.
>
> What are the CPU stats when that happen? Is it users space or kernel
> space that keeps it busy?
>
> One easy way to check is to run "mpstat 1" for a few seconds when the
> problem occurs.
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
2021-09-07 6:16 ` Martin Zaharinov
@ 2021-09-07 6:42 ` Martin Zaharinov
2021-09-11 6:26 ` Martin Zaharinov
0 siblings, 1 reply; 23+ messages in thread
From: Martin Zaharinov @ 2021-09-07 6:42 UTC (permalink / raw)
To: Guillaume Nault; +Cc: Pali Rohár, Greg KH, netdev, Eric Dumazet
perf top output as text:
PerfTop: 28391 irqs/sec kernel:98.0% exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles], (all, 12 CPUs)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
17.01% [nf_conntrack] [k] nf_ct_iterate_cleanup
9.73% [kernel] [k] mutex_spin_on_owner
9.07% [pppoe] [k] pppoe_rcv
2.77% [nf_nat] [k] device_cmp
1.66% [kernel] [k] osq_lock
1.65% [kernel] [k] _raw_spin_lock
1.61% [kernel] [k] __local_bh_enable_ip
1.35% [nf_nat] [k] inet_cmp
1.30% [kernel] [k] __netif_receive_skb_core.constprop.0
1.16% [kernel] [k] menu_select
0.99% [kernel] [k] cpuidle_enter_state
0.96% [ixgbe] [k] ixgbe_clean_rx_irq
0.86% [kernel] [k] __dev_queue_xmit
0.70% [kernel] [k] __cond_resched
0.69% [sch_cake] [k] cake_dequeue
0.67% [nf_tables] [k] nft_do_chain
0.63% [kernel] [k] rcu_all_qs
0.61% [kernel] [k] fib_table_lookup
0.57% [kernel] [k] __schedule
0.57% [kernel] [k] skb_release_data
0.54% [kernel] [k] sched_clock
0.54% [kernel] [k] __copy_skb_header
0.53% [kernel] [k] dev_queue_xmit_nit
0.53% [kernel] [k] _raw_spin_lock_irqsave
0.50% [kernel] [k] kmem_cache_free
0.48% libfrr.so.0.0.0 [.] 0x00000000000ce970
0.47% [ixgbe] [k] ixgbe_clean_tx_irq
0.45% [kernel] [k] timerqueue_add
0.45% [kernel] [k] lapic_next_deadline
0.45% [kernel] [k] csum_partial_copy_generic
0.44% [nf_flow_table] [k] nf_flow_offload_ip_hook
0.44% [kernel] [k] kmem_cache_alloc
0.44% [nf_conntrack] [k] nf_conntrack_lock
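As a rough tally (percentages copied from the listing above), conntrack teardown plus mutex/lock spinning dominate the profile, consistent with slow destruction of dead sessions blocking everything else:

```python
# Top perf-top samples from the listing above (values copied verbatim).
samples = {
    "nf_ct_iterate_cleanup": 17.01,  # conntrack teardown
    "mutex_spin_on_owner": 9.73,     # waiting on a held mutex
    "pppoe_rcv": 9.07,
    "osq_lock": 1.66,                # optimistic spin queue (mutex path)
    "_raw_spin_lock": 1.65,
}
teardown_and_locks = sum(
    samples[k]
    for k in ("nf_ct_iterate_cleanup", "mutex_spin_on_owner",
              "osq_lock", "_raw_spin_lock")
)
print(round(teardown_and_locks, 2))  # → 30.05
```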
> On 7 Sep 2021, at 9:16, Martin Zaharinov <micron10@gmail.com> wrote:
>
> Hi
> Sorry for delay but not easy to catch moment .
>
>
> See this is mpstatl 1 :
>
> Linux 5.14.1 (demobng) 09/07/21 _x86_64_ (12 CPU)
>
> 11:12:16 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> 11:12:17 all 0.17 0.00 6.66 0.00 0.00 4.13 0.00 0.00 0.00 89.05
> 11:12:18 all 0.25 0.00 8.36 0.00 0.00 4.88 0.00 0.00 0.00 86.51
> 11:12:19 all 0.26 0.00 9.62 0.00 0.00 3.91 0.00 0.00 0.00 86.21
> 11:12:20 all 0.85 0.00 6.00 0.00 0.00 4.31 0.00 0.00 0.00 88.84
> 11:12:21 all 0.08 0.00 4.45 0.00 0.00 4.79 0.00 0.00 0.00 90.67
> 11:12:22 all 0.17 0.00 9.50 0.00 0.00 4.58 0.00 0.00 0.00 85.75
> 11:12:23 all 0.00 0.00 6.92 0.00 0.00 2.48 0.00 0.00 0.00 90.61
> 11:12:24 all 0.17 0.00 5.45 0.00 0.00 4.27 0.00 0.00 0.00 90.11
> 11:12:25 all 0.25 0.00 5.38 0.00 0.00 4.79 0.00 0.00 0.00 89.58
> 11:12:26 all 0.60 0.00 1.45 0.00 0.00 2.65 0.00 0.00 0.00 95.30
> 11:12:27 all 0.42 0.00 6.91 0.00 0.00 4.47 0.00 0.00 0.00 88.20
> 11:12:28 all 0.00 0.00 6.75 0.00 0.00 4.18 0.00 0.00 0.00 89.07
> 11:12:29 all 0.17 0.00 3.52 0.00 0.00 5.11 0.00 0.00 0.00 91.20
> 11:12:30 all 1.45 0.00 10.14 0.00 0.00 3.49 0.00 0.00 0.00 84.92
> 11:12:31 all 0.09 0.00 5.11 0.00 0.00 4.77 0.00 0.00 0.00 90.03
> 11:12:32 all 0.25 0.00 3.11 0.00 0.00 4.46 0.00 0.00 0.00 92.17
> Average: all 0.32 0.00 6.21 0.00 0.00 4.21 0.00 0.00 0.00 89.26
>
>
> I attache and one screenshot from perf top (Screenshot is send on preview mail)
>
> And I see in lsmod
>
> pppoe 20480 8198
> pppox 16384 1 pppoe
> ppp_generic 45056 16364 pppox,pppoe
> slhc 16384 1 ppp_generic
>
> To slow remove pppoe session .
>
> And from log :
>
> [2021-09-07 11:01:11.129] vlan3020: ebdd1c5d8b5900f6: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-09-07 11:01:53.621] vlan643: ebdd1c5d8b59014e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-09-07 11:02:00.359] vlan1616: ebdd1c5d8b590195: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-09-07 11:02:05.859] vlan3020: ebdd1c5d8b5900d8: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-09-07 11:02:08.258] vlan3005: ebdd1c5d8b590190: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-09-07 11:02:13.820] vlan643: ebdd1c5d8b590152: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-09-07 11:02:15.839] vlan727: ebdd1c5d8b590144: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-09-07 11:02:20.139] vlan1693: ebdd1c5d8b59019f: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>
>> On 11 Aug 2021, at 19:48, Guillaume Nault <gnault@redhat.com> wrote:
>>
>> On Wed, Aug 11, 2021 at 02:10:32PM +0300, Martin Zaharinov wrote:
>>> And one more that see.
>>>
>>> Problem is come when accel start finishing sessions,
>>> Now in server have 2k users and restart on one of vlans 3 Olt with 400 users and affect other vlans ,
>>> And problem is start when start destroying dead sessions from vlan with 3 Olt and this affect all other vlans.
>>> May be kernel destroy old session slow and entrained other users by locking other sessions.
>>> is there a way to speed up the closing of stopped/dead sessions.
>>
>> What are the CPU stats when that happen? Is it users space or kernel
>> space that keeps it busy?
>>
>> One easy way to check is to run "mpstat 1" for a few seconds when the
>> problem occurs.
>>
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
2021-09-07 6:42 ` Martin Zaharinov
@ 2021-09-11 6:26 ` Martin Zaharinov
2021-09-14 6:16 ` Martin Zaharinov
0 siblings, 1 reply; 23+ messages in thread
From: Martin Zaharinov @ 2021-09-11 6:26 UTC (permalink / raw)
To: Guillaume Nault; +Cc: Pali Rohár, Greg KH, netdev, Eric Dumazet
Hi Guillaume
The main problem is service overload caused by many PPP (customer) sessions finishing: over the last two days, between 40-50 and 100-200 users go down at a time, and when that happens, typing "ip a" takes 10-20 seconds before it starts listing interfaces.
But how can I find where the problem is - some lock or something else?
And is there an option to make the kernel remove ppp interfaces faster, to reduce this load?
Martin
> On 7 Sep 2021, at 9:42, Martin Zaharinov <micron10@gmail.com> wrote:
>
> Perf top from text
>
>
> PerfTop: 28391 irqs/sec kernel:98.0% exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles], (all, 12 CPUs)
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> 17.01% [nf_conntrack] [k] nf_ct_iterate_cleanup
> 9.73% [kernel] [k] mutex_spin_on_owner
> 9.07% [pppoe] [k] pppoe_rcv
> 2.77% [nf_nat] [k] device_cmp
> 1.66% [kernel] [k] osq_lock
> 1.65% [kernel] [k] _raw_spin_lock
> 1.61% [kernel] [k] __local_bh_enable_ip
> 1.35% [nf_nat] [k] inet_cmp
> 1.30% [kernel] [k] __netif_receive_skb_core.constprop.0
> 1.16% [kernel] [k] menu_select
> 0.99% [kernel] [k] cpuidle_enter_state
> 0.96% [ixgbe] [k] ixgbe_clean_rx_irq
> 0.86% [kernel] [k] __dev_queue_xmit
> 0.70% [kernel] [k] __cond_resched
> 0.69% [sch_cake] [k] cake_dequeue
> 0.67% [nf_tables] [k] nft_do_chain
> 0.63% [kernel] [k] rcu_all_qs
> 0.61% [kernel] [k] fib_table_lookup
> 0.57% [kernel] [k] __schedule
> 0.57% [kernel] [k] skb_release_data
> 0.54% [kernel] [k] sched_clock
> 0.54% [kernel] [k] __copy_skb_header
> 0.53% [kernel] [k] dev_queue_xmit_nit
> 0.53% [kernel] [k] _raw_spin_lock_irqsave
> 0.50% [kernel] [k] kmem_cache_free
> 0.48% libfrr.so.0.0.0 [.] 0x00000000000ce970
> 0.47% [ixgbe] [k] ixgbe_clean_tx_irq
> 0.45% [kernel] [k] timerqueue_add
> 0.45% [kernel] [k] lapic_next_deadline
> 0.45% [kernel] [k] csum_partial_copy_generic
> 0.44% [nf_flow_table] [k] nf_flow_offload_ip_hook
> 0.44% [kernel] [k] kmem_cache_alloc
> 0.44% [nf_conntrack] [k] nf_conntrack_lock
>
>> On 7 Sep 2021, at 9:16, Martin Zaharinov <micron10@gmail.com> wrote:
>>
>> Hi
>> Sorry for delay but not easy to catch moment .
>>
>>
>> See this is mpstatl 1 :
>>
>> Linux 5.14.1 (demobng) 09/07/21 _x86_64_ (12 CPU)
>>
>> 11:12:16 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
>> 11:12:17 all 0.17 0.00 6.66 0.00 0.00 4.13 0.00 0.00 0.00 89.05
>> 11:12:18 all 0.25 0.00 8.36 0.00 0.00 4.88 0.00 0.00 0.00 86.51
>> 11:12:19 all 0.26 0.00 9.62 0.00 0.00 3.91 0.00 0.00 0.00 86.21
>> 11:12:20 all 0.85 0.00 6.00 0.00 0.00 4.31 0.00 0.00 0.00 88.84
>> 11:12:21 all 0.08 0.00 4.45 0.00 0.00 4.79 0.00 0.00 0.00 90.67
>> 11:12:22 all 0.17 0.00 9.50 0.00 0.00 4.58 0.00 0.00 0.00 85.75
>> 11:12:23 all 0.00 0.00 6.92 0.00 0.00 2.48 0.00 0.00 0.00 90.61
>> 11:12:24 all 0.17 0.00 5.45 0.00 0.00 4.27 0.00 0.00 0.00 90.11
>> 11:12:25 all 0.25 0.00 5.38 0.00 0.00 4.79 0.00 0.00 0.00 89.58
>> 11:12:26 all 0.60 0.00 1.45 0.00 0.00 2.65 0.00 0.00 0.00 95.30
>> 11:12:27 all 0.42 0.00 6.91 0.00 0.00 4.47 0.00 0.00 0.00 88.20
>> 11:12:28 all 0.00 0.00 6.75 0.00 0.00 4.18 0.00 0.00 0.00 89.07
>> 11:12:29 all 0.17 0.00 3.52 0.00 0.00 5.11 0.00 0.00 0.00 91.20
>> 11:12:30 all 1.45 0.00 10.14 0.00 0.00 3.49 0.00 0.00 0.00 84.92
>> 11:12:31 all 0.09 0.00 5.11 0.00 0.00 4.77 0.00 0.00 0.00 90.03
>> 11:12:32 all 0.25 0.00 3.11 0.00 0.00 4.46 0.00 0.00 0.00 92.17
>> Average: all 0.32 0.00 6.21 0.00 0.00 4.21 0.00 0.00 0.00 89.26
>>
>>
>> I attache and one screenshot from perf top (Screenshot is send on preview mail)
>>
>> And I see in lsmod
>>
>> pppoe 20480 8198
>> pppox 16384 1 pppoe
>> ppp_generic 45056 16364 pppox,pppoe
>> slhc 16384 1 ppp_generic
>>
>> To slow remove pppoe session .
>>
>> And from log :
>>
>> [2021-09-07 11:01:11.129] vlan3020: ebdd1c5d8b5900f6: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-09-07 11:01:53.621] vlan643: ebdd1c5d8b59014e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-09-07 11:02:00.359] vlan1616: ebdd1c5d8b590195: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-09-07 11:02:05.859] vlan3020: ebdd1c5d8b5900d8: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-09-07 11:02:08.258] vlan3005: ebdd1c5d8b590190: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-09-07 11:02:13.820] vlan643: ebdd1c5d8b590152: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-09-07 11:02:15.839] vlan727: ebdd1c5d8b590144: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-09-07 11:02:20.139] vlan1693: ebdd1c5d8b59019f: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>
>>> On 11 Aug 2021, at 19:48, Guillaume Nault <gnault@redhat.com> wrote:
>>>
>>> On Wed, Aug 11, 2021 at 02:10:32PM +0300, Martin Zaharinov wrote:
>>>> And one more that see.
>>>>
>>>> Problem is come when accel start finishing sessions,
>>>> Now in server have 2k users and restart on one of vlans 3 Olt with 400 users and affect other vlans ,
>>>> And problem is start when start destroying dead sessions from vlan with 3 Olt and this affect all other vlans.
>>>> May be kernel destroy old session slow and entrained other users by locking other sessions.
>>>> is there a way to speed up the closing of stopped/dead sessions.
>>>
>>> What are the CPU stats when that happen? Is it users space or kernel
>>> space that keeps it busy?
>>>
>>> One easy way to check is to run "mpstat 1" for a few seconds when the
>>> problem occurs.
>>>
>>
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
2021-09-11 6:26 ` Martin Zaharinov
@ 2021-09-14 6:16 ` Martin Zaharinov
2021-09-14 8:02 ` Guillaume Nault
0 siblings, 1 reply; 23+ messages in thread
From: Martin Zaharinov @ 2021-09-14 6:16 UTC (permalink / raw)
To: Guillaume Nault; +Cc: Pali Rohár, Greg KH, netdev, Eric Dumazet
Hi Guillaume,
See these stats:
Linux 5.14.2 (testb) 09/14/21 _x86_64_ (12 CPU)
11:33:44 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
11:33:45 all 1.75 0.00 18.85 0.00 0.00 5.00 0.00 0.00 0.00 74.40
11:33:46 all 1.74 0.00 17.88 0.00 0.00 4.72 0.00 0.00 0.00 75.66
11:33:47 all 2.23 0.00 17.62 0.00 0.00 5.05 0.00 0.00 0.00 75.10
11:33:48 all 1.82 0.00 13.64 0.00 0.00 5.70 0.00 0.00 0.00 78.84
11:33:49 all 1.50 0.00 13.46 0.00 0.00 5.15 0.00 0.00 0.00 79.90
11:33:50 all 3.06 0.00 13.96 0.00 0.00 4.79 0.00 0.00 0.00 78.20
11:33:51 all 1.40 0.00 16.53 0.00 0.00 5.21 0.00 0.00 0.00 76.86
11:33:52 all 4.43 0.00 19.44 0.00 0.00 6.56 0.00 0.00 0.00 69.57
11:33:53 all 1.51 0.00 16.40 0.00 0.00 4.77 0.00 0.00 0.00 77.32
11:33:54 all 1.51 0.00 16.55 0.00 0.00 4.71 0.00 0.00 0.00 77.23
11:33:55 all 1.00 0.00 13.21 0.00 0.00 5.90 0.00 0.00 0.00 79.90
Average: all 2.00 0.00 16.14 0.00 0.00 5.23 0.00 0.00 0.00 76.63
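The earlier user-space vs kernel-space question can be answered from the Average row above; a quick computation (values copied from that row) shows the load is almost entirely kernel-side:

```python
# Average mpstat row from the table above: %usr 2.00, %sys 16.14, %soft 5.23.
usr, sys_pct, soft = 2.00, 16.14, 5.23

kernel_side = sys_pct + soft   # %sys and %soft both run in the kernel
print(round(kernel_side, 2))   # → 21.37
print(kernel_side > 10 * usr)  # → True: kernel time dwarfs user time
```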
PerfTop: 28046 irqs/sec kernel:96.3% exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles], (all, 12 CPUs)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
23.37% [nf_conntrack] [k] nf_ct_iterate_cleanup
17.76% [kernel] [k] mutex_spin_on_owner
9.47% [pppoe] [k] pppoe_rcv
7.71% [kernel] [k] osq_lock
2.77% [nf_nat] [k] inet_cmp
2.59% [nf_nat] [k] device_cmp
2.55% [kernel] [k] __local_bh_enable_ip
2.04% [kernel] [k] _raw_spin_lock
1.23% [kernel] [k] __cond_resched
1.16% [kernel] [k] rcu_all_qs
1.13% libfrr.so.0.0.0 [.] 0x00000000000ce970
0.79% [nf_conntrack] [k] nf_conntrack_lock
0.75% libfrr.so.0.0.0 [.] 0x00000000000ce94e
0.53% [kernel] [k] __netif_receive_skb_core.constprop.0
0.46% [kernel] [k] fib_table_lookup
0.46% [ip_tables] [k] ipt_do_table
0.45% [ixgbe] [k] ixgbe_clean_rx_irq
0.37% [kernel] [k] __dev_queue_xmit
0.34% [nf_conntrack] [k] __nf_conntrack_find_get.isra.0
0.33% [ixgbe] [k] ixgbe_clean_tx_irq
0.30% [kernel] [k] menu_select
0.25% [kernel] [k] vlan_do_receive
0.21% [kernel] [k] ip_finish_output2
0.21% [ixgbe] [k] ixgbe_poll
0.20% [kernel] [k] _raw_spin_lock_irqsave
0.19% [kernel] [k] get_rps_cpu
0.19% libc.so.6 [.] 0x0000000000186afa
0.19% [kernel] [k] queued_read_lock_slowpath
0.19% [kernel] [k] do_poll.constprop.0
0.19% [kernel] [k] cpuidle_enter_state
0.18% [kernel] [k] dev_hard_start_xmit
0.18% [kernel] [k] ___slab_alloc.constprop.0
0.17% zebra [.] 0x00000000000b9271
0.16% [kernel] [k] csum_partial_copy_generic
0.16% zebra [.] 0x00000000000b91f1
0.16% [kernel] [k] page_frag_free
0.16% [kernel] [k] kmem_cache_alloc
0.15% [kernel] [k] __skb_flow_dissect
0.15% [kernel] [k] sched_clock
0.15% libc.so.6 [.] 0x00000000000965a2
0.15% [kernel] [k] kmem_cache_free_bulk.part.0
0.15% [pppoe] [k] pppoe_flush_dev
0.15% [ixgbe] [k] ixgbe_tx_map
0.14% [kernel] [k] _raw_spin_lock_bh
0.14% [kernel] [k] fib_table_flush
0.14% [kernel] [k] native_irq_return_iret
0.14% [kernel] [k] __dev_xmit_skb
0.13% [kernel] [k] nf_hook_slow
0.13% [kernel] [k] fib_lookup_good_nhc
0.12% [kernel] [k] __fget_files
0.12% [kernel] [k] process_backlog
0.12% [xt_dtvqos] [k] 0x00000000000008d1
0.12% [kernel] [k] __list_del_entry_valid
0.12% [kernel] [k] skb_release_data
0.12% [kernel] [k] ip_route_input_slow
0.11% [kernel] [k] netif_skb_features
0.11% [kernel] [k] sock_poll
0.11% [kernel] [k] __schedule
0.11% [kernel] [k] __softirqentry_text_start
While the problem is happening, running "ip a" to list interfaces takes 15-20 seconds. I can finally reproduce it, but users get angry when their internet goes down.
We need to find out why the system gets overloaded when PPP interfaces are deconfigured.
Best regards,
Martin
> On 11 Sep 2021, at 9:26, Martin Zaharinov <micron10@gmail.com> wrote:
>
> Hi Guillaume
>
> The main problem is service overload when many PPP sessions (customers) terminate at once; over the last two days drops went from 40-50 to 100-200 users at a time, and while that happens typing "ip a" takes 10-20 seconds before the interface list appears.
> How can I find where the problem is, some lock contention or something else?
> And is there a way to remove PPP interfaces from the kernel faster to reduce this load?
>
>
> Martin
>
>> On 7 Sep 2021, at 9:42, Martin Zaharinov <micron10@gmail.com> wrote:
>>
>> Perf top from text
>>
>>
>> PerfTop: 28391 irqs/sec kernel:98.0% exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles], (all, 12 CPUs)
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> 17.01% [nf_conntrack] [k] nf_ct_iterate_cleanup
>> 9.73% [kernel] [k] mutex_spin_on_owner
>> 9.07% [pppoe] [k] pppoe_rcv
>> 2.77% [nf_nat] [k] device_cmp
>> 1.66% [kernel] [k] osq_lock
>> 1.65% [kernel] [k] _raw_spin_lock
>> 1.61% [kernel] [k] __local_bh_enable_ip
>> 1.35% [nf_nat] [k] inet_cmp
>> 1.30% [kernel] [k] __netif_receive_skb_core.constprop.0
>> 1.16% [kernel] [k] menu_select
>> 0.99% [kernel] [k] cpuidle_enter_state
>> 0.96% [ixgbe] [k] ixgbe_clean_rx_irq
>> 0.86% [kernel] [k] __dev_queue_xmit
>> 0.70% [kernel] [k] __cond_resched
>> 0.69% [sch_cake] [k] cake_dequeue
>> 0.67% [nf_tables] [k] nft_do_chain
>> 0.63% [kernel] [k] rcu_all_qs
>> 0.61% [kernel] [k] fib_table_lookup
>> 0.57% [kernel] [k] __schedule
>> 0.57% [kernel] [k] skb_release_data
>> 0.54% [kernel] [k] sched_clock
>> 0.54% [kernel] [k] __copy_skb_header
>> 0.53% [kernel] [k] dev_queue_xmit_nit
>> 0.53% [kernel] [k] _raw_spin_lock_irqsave
>> 0.50% [kernel] [k] kmem_cache_free
>> 0.48% libfrr.so.0.0.0 [.] 0x00000000000ce970
>> 0.47% [ixgbe] [k] ixgbe_clean_tx_irq
>> 0.45% [kernel] [k] timerqueue_add
>> 0.45% [kernel] [k] lapic_next_deadline
>> 0.45% [kernel] [k] csum_partial_copy_generic
>> 0.44% [nf_flow_table] [k] nf_flow_offload_ip_hook
>> 0.44% [kernel] [k] kmem_cache_alloc
>> 0.44% [nf_conntrack] [k] nf_conntrack_lock
>>
>>> On 7 Sep 2021, at 9:16, Martin Zaharinov <micron10@gmail.com> wrote:
>>>
>>> Hi
>>> Sorry for the delay, it is not easy to catch the right moment.
>>>
>>>
>>> Here is "mpstat 1":
>>>
>>> Linux 5.14.1 (demobng) 09/07/21 _x86_64_ (12 CPU)
>>>
>>> 11:12:16 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
>>> 11:12:17 all 0.17 0.00 6.66 0.00 0.00 4.13 0.00 0.00 0.00 89.05
>>> 11:12:18 all 0.25 0.00 8.36 0.00 0.00 4.88 0.00 0.00 0.00 86.51
>>> 11:12:19 all 0.26 0.00 9.62 0.00 0.00 3.91 0.00 0.00 0.00 86.21
>>> 11:12:20 all 0.85 0.00 6.00 0.00 0.00 4.31 0.00 0.00 0.00 88.84
>>> 11:12:21 all 0.08 0.00 4.45 0.00 0.00 4.79 0.00 0.00 0.00 90.67
>>> 11:12:22 all 0.17 0.00 9.50 0.00 0.00 4.58 0.00 0.00 0.00 85.75
>>> 11:12:23 all 0.00 0.00 6.92 0.00 0.00 2.48 0.00 0.00 0.00 90.61
>>> 11:12:24 all 0.17 0.00 5.45 0.00 0.00 4.27 0.00 0.00 0.00 90.11
>>> 11:12:25 all 0.25 0.00 5.38 0.00 0.00 4.79 0.00 0.00 0.00 89.58
>>> 11:12:26 all 0.60 0.00 1.45 0.00 0.00 2.65 0.00 0.00 0.00 95.30
>>> 11:12:27 all 0.42 0.00 6.91 0.00 0.00 4.47 0.00 0.00 0.00 88.20
>>> 11:12:28 all 0.00 0.00 6.75 0.00 0.00 4.18 0.00 0.00 0.00 89.07
>>> 11:12:29 all 0.17 0.00 3.52 0.00 0.00 5.11 0.00 0.00 0.00 91.20
>>> 11:12:30 all 1.45 0.00 10.14 0.00 0.00 3.49 0.00 0.00 0.00 84.92
>>> 11:12:31 all 0.09 0.00 5.11 0.00 0.00 4.77 0.00 0.00 0.00 90.03
>>> 11:12:32 all 0.25 0.00 3.11 0.00 0.00 4.46 0.00 0.00 0.00 92.17
>>> Average: all 0.32 0.00 6.21 0.00 0.00 4.21 0.00 0.00 0.00 89.26
>>>
>>>
>>> I also attached a screenshot from perf top (the screenshot was sent in a previous mail).
>>>
>>> And I see in lsmod
>>>
>>> pppoe 20480 8198
>>> pppox 16384 1 pppoe
>>> ppp_generic 45056 16364 pppox,pppoe
>>> slhc 16384 1 ppp_generic
>>>
>>> pointing to slow removal of pppoe sessions.
>>>
>>> And from log :
>>>
>>> [2021-09-07 11:01:11.129] vlan3020: ebdd1c5d8b5900f6: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>> [2021-09-07 11:01:53.621] vlan643: ebdd1c5d8b59014e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>> [2021-09-07 11:02:00.359] vlan1616: ebdd1c5d8b590195: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>> [2021-09-07 11:02:05.859] vlan3020: ebdd1c5d8b5900d8: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>> [2021-09-07 11:02:08.258] vlan3005: ebdd1c5d8b590190: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>> [2021-09-07 11:02:13.820] vlan643: ebdd1c5d8b590152: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>> [2021-09-07 11:02:15.839] vlan727: ebdd1c5d8b590144: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>> [2021-09-07 11:02:20.139] vlan1693: ebdd1c5d8b59019f: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>
>>>> On 11 Aug 2021, at 19:48, Guillaume Nault <gnault@redhat.com> wrote:
>>>>
>>>> On Wed, Aug 11, 2021 at 02:10:32PM +0300, Martin Zaharinov wrote:
>>>>> And one more thing I see.
>>>>>
>>>>> The problem comes when accel starts terminating sessions.
>>>>> Right now the server has 2k users; restarting 3 OLTs with 400 users on one of the vlans affects the other vlans.
>>>>> The problem starts when the kernel begins destroying the dead sessions from the vlan with the 3 OLTs, and this affects all the other vlans.
>>>>> Maybe the kernel destroys old sessions slowly and stalls other users by locking their sessions.
>>>>> Is there a way to speed up the closing of stopped/dead sessions?
>>>>
>>>> What are the CPU stats when that happens? Is it user space or kernel
>>>> space that keeps it busy?
>>>>
>>>> One easy way to check is to run "mpstat 1" for a few seconds when the
>>>> problem occurs.
>>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
2021-09-14 6:16 ` Martin Zaharinov
@ 2021-09-14 8:02 ` Guillaume Nault
2021-09-14 9:50 ` Florian Westphal
0 siblings, 1 reply; 23+ messages in thread
From: Guillaume Nault @ 2021-09-14 8:02 UTC (permalink / raw)
To: Martin Zaharinov; +Cc: Pali Rohár, Greg KH, netdev, Eric Dumazet
On Tue, Sep 14, 2021 at 09:16:55AM +0300, Martin Zaharinov wrote:
> Hi Guillaume,
>
> Please see these stats:
>
> Linux 5.14.2 (testb) 09/14/21 _x86_64_ (12 CPU)
>
> 11:33:44 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> 11:33:45 all 1.75 0.00 18.85 0.00 0.00 5.00 0.00 0.00 0.00 74.40
> 11:33:46 all 1.74 0.00 17.88 0.00 0.00 4.72 0.00 0.00 0.00 75.66
> 11:33:47 all 2.23 0.00 17.62 0.00 0.00 5.05 0.00 0.00 0.00 75.10
> 11:33:48 all 1.82 0.00 13.64 0.00 0.00 5.70 0.00 0.00 0.00 78.84
> 11:33:49 all 1.50 0.00 13.46 0.00 0.00 5.15 0.00 0.00 0.00 79.90
> 11:33:50 all 3.06 0.00 13.96 0.00 0.00 4.79 0.00 0.00 0.00 78.20
> 11:33:51 all 1.40 0.00 16.53 0.00 0.00 5.21 0.00 0.00 0.00 76.86
> 11:33:52 all 4.43 0.00 19.44 0.00 0.00 6.56 0.00 0.00 0.00 69.57
> 11:33:53 all 1.51 0.00 16.40 0.00 0.00 4.77 0.00 0.00 0.00 77.32
> 11:33:54 all 1.51 0.00 16.55 0.00 0.00 4.71 0.00 0.00 0.00 77.23
> 11:33:55 all 1.00 0.00 13.21 0.00 0.00 5.90 0.00 0.00 0.00 79.90
> Average: all 2.00 0.00 16.14 0.00 0.00 5.23 0.00 0.00 0.00 76.63
>
>
> PerfTop: 28046 irqs/sec kernel:96.3% exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles], (all, 12 CPUs)
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> 23.37% [nf_conntrack] [k] nf_ct_iterate_cleanup
> 17.76% [kernel] [k] mutex_spin_on_owner
> 9.47% [pppoe] [k] pppoe_rcv
> 7.71% [kernel] [k] osq_lock
> 2.77% [nf_nat] [k] inet_cmp
> 2.59% [nf_nat] [k] device_cmp
> 2.55% [kernel] [k] __local_bh_enable_ip
> 2.04% [kernel] [k] _raw_spin_lock
> 1.23% [kernel] [k] __cond_resched
> 1.16% [kernel] [k] rcu_all_qs
> 1.13% libfrr.so.0.0.0 [.] 0x00000000000ce970
> 0.79% [nf_conntrack] [k] nf_conntrack_lock
> 0.75% libfrr.so.0.0.0 [.] 0x00000000000ce94e
> 0.53% [kernel] [k] __netif_receive_skb_core.constprop.0
> 0.46% [kernel] [k] fib_table_lookup
> 0.46% [ip_tables] [k] ipt_do_table
> 0.45% [ixgbe] [k] ixgbe_clean_rx_irq
> 0.37% [kernel] [k] __dev_queue_xmit
> 0.34% [nf_conntrack] [k] __nf_conntrack_find_get.isra.0
> 0.33% [ixgbe] [k] ixgbe_clean_tx_irq
> 0.30% [kernel] [k] menu_select
> 0.25% [kernel] [k] vlan_do_receive
> 0.21% [kernel] [k] ip_finish_output2
> 0.21% [ixgbe] [k] ixgbe_poll
> 0.20% [kernel] [k] _raw_spin_lock_irqsave
> 0.19% [kernel] [k] get_rps_cpu
> 0.19% libc.so.6 [.] 0x0000000000186afa
> 0.19% [kernel] [k] queued_read_lock_slowpath
> 0.19% [kernel] [k] do_poll.constprop.0
> 0.19% [kernel] [k] cpuidle_enter_state
> 0.18% [kernel] [k] dev_hard_start_xmit
> 0.18% [kernel] [k] ___slab_alloc.constprop.0
> 0.17% zebra [.] 0x00000000000b9271
> 0.16% [kernel] [k] csum_partial_copy_generic
> 0.16% zebra [.] 0x00000000000b91f1
> 0.16% [kernel] [k] page_frag_free
> 0.16% [kernel] [k] kmem_cache_alloc
> 0.15% [kernel] [k] __skb_flow_dissect
> 0.15% [kernel] [k] sched_clock
> 0.15% libc.so.6 [.] 0x00000000000965a2
> 0.15% [kernel] [k] kmem_cache_free_bulk.part.0
> 0.15% [pppoe] [k] pppoe_flush_dev
> 0.15% [ixgbe] [k] ixgbe_tx_map
> 0.14% [kernel] [k] _raw_spin_lock_bh
> 0.14% [kernel] [k] fib_table_flush
> 0.14% [kernel] [k] native_irq_return_iret
> 0.14% [kernel] [k] __dev_xmit_skb
> 0.13% [kernel] [k] nf_hook_slow
> 0.13% [kernel] [k] fib_lookup_good_nhc
> 0.12% [kernel] [k] __fget_files
> 0.12% [kernel] [k] process_backlog
> 0.12% [xt_dtvqos] [k] 0x00000000000008d1
> 0.12% [kernel] [k] __list_del_entry_valid
> 0.12% [kernel] [k] skb_release_data
> 0.12% [kernel] [k] ip_route_input_slow
> 0.11% [kernel] [k] netif_skb_features
> 0.11% [kernel] [k] sock_poll
> 0.11% [kernel] [k] __schedule
> 0.11% [kernel] [k] __softirqentry_text_start
>
>
> While the problem is happening, running "ip a" to list interfaces takes 15-20 seconds. I can finally reproduce it, but users get angry when their internet goes down.
Probably some contention on the rtnl lock.
> We need to find out why the system gets overloaded when PPP interfaces are deconfigured.
Does it help if you disable conntrack?
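For anyone wanting to try this suggestion quickly: conntrack can be bypassed per-flow with a notrack rule instead of unloading the module, so only the PPP subscriber flows stop creating entries. A rough sketch of such a test ruleset follows; the "ppp*" interface glob is an assumption about this setup, not something stated in the thread:

```
# Hypothetical test ruleset: stop tracking traffic arriving on PPP
# interfaces, so interface teardown has fewer conntrack entries to walk.
table ip raw {
	chain prerouting {
		type filter hook prerouting priority raw; policy accept;
		iifname "ppp*" notrack
	}
}
```

Loading it with "nft -f" would be enough for a quick experiment; as the follow-ups show, conntrack may still be required for accounting, so this only helps confirm the diagnosis.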
>
> Best regards,
> Martin
>
>
>
>
> > On 11 Sep 2021, at 9:26, Martin Zaharinov <micron10@gmail.com> wrote:
> >
> > Hi Guillaume
> >
> > The main problem is service overload when many PPP sessions (customers) terminate at once; over the last two days drops went from 40-50 to 100-200 users at a time, and while that happens typing "ip a" takes 10-20 seconds before the interface list appears.
> > How can I find where the problem is, some lock contention or something else?
> > And is there a way to remove PPP interfaces from the kernel faster to reduce this load?
> >
> >
> > Martin
> >
> >> On 7 Sep 2021, at 9:42, Martin Zaharinov <micron10@gmail.com> wrote:
> >>
> >> Perf top from text
> >>
> >>
> >> PerfTop: 28391 irqs/sec kernel:98.0% exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles], (all, 12 CPUs)
> >> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> >>
> >> 17.01% [nf_conntrack] [k] nf_ct_iterate_cleanup
> >> 9.73% [kernel] [k] mutex_spin_on_owner
> >> 9.07% [pppoe] [k] pppoe_rcv
> >> 2.77% [nf_nat] [k] device_cmp
> >> 1.66% [kernel] [k] osq_lock
> >> 1.65% [kernel] [k] _raw_spin_lock
> >> 1.61% [kernel] [k] __local_bh_enable_ip
> >> 1.35% [nf_nat] [k] inet_cmp
> >> 1.30% [kernel] [k] __netif_receive_skb_core.constprop.0
> >> 1.16% [kernel] [k] menu_select
> >> 0.99% [kernel] [k] cpuidle_enter_state
> >> 0.96% [ixgbe] [k] ixgbe_clean_rx_irq
> >> 0.86% [kernel] [k] __dev_queue_xmit
> >> 0.70% [kernel] [k] __cond_resched
> >> 0.69% [sch_cake] [k] cake_dequeue
> >> 0.67% [nf_tables] [k] nft_do_chain
> >> 0.63% [kernel] [k] rcu_all_qs
> >> 0.61% [kernel] [k] fib_table_lookup
> >> 0.57% [kernel] [k] __schedule
> >> 0.57% [kernel] [k] skb_release_data
> >> 0.54% [kernel] [k] sched_clock
> >> 0.54% [kernel] [k] __copy_skb_header
> >> 0.53% [kernel] [k] dev_queue_xmit_nit
> >> 0.53% [kernel] [k] _raw_spin_lock_irqsave
> >> 0.50% [kernel] [k] kmem_cache_free
> >> 0.48% libfrr.so.0.0.0 [.] 0x00000000000ce970
> >> 0.47% [ixgbe] [k] ixgbe_clean_tx_irq
> >> 0.45% [kernel] [k] timerqueue_add
> >> 0.45% [kernel] [k] lapic_next_deadline
> >> 0.45% [kernel] [k] csum_partial_copy_generic
> >> 0.44% [nf_flow_table] [k] nf_flow_offload_ip_hook
> >> 0.44% [kernel] [k] kmem_cache_alloc
> >> 0.44% [nf_conntrack] [k] nf_conntrack_lock
> >>
> >>> On 7 Sep 2021, at 9:16, Martin Zaharinov <micron10@gmail.com> wrote:
> >>>
> >>> Hi
> >>> Sorry for the delay, it is not easy to catch the right moment.
> >>>
> >>>
> >>> Here is "mpstat 1":
> >>>
> >>> Linux 5.14.1 (demobng) 09/07/21 _x86_64_ (12 CPU)
> >>>
> >>> 11:12:16 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> >>> 11:12:17 all 0.17 0.00 6.66 0.00 0.00 4.13 0.00 0.00 0.00 89.05
> >>> 11:12:18 all 0.25 0.00 8.36 0.00 0.00 4.88 0.00 0.00 0.00 86.51
> >>> 11:12:19 all 0.26 0.00 9.62 0.00 0.00 3.91 0.00 0.00 0.00 86.21
> >>> 11:12:20 all 0.85 0.00 6.00 0.00 0.00 4.31 0.00 0.00 0.00 88.84
> >>> 11:12:21 all 0.08 0.00 4.45 0.00 0.00 4.79 0.00 0.00 0.00 90.67
> >>> 11:12:22 all 0.17 0.00 9.50 0.00 0.00 4.58 0.00 0.00 0.00 85.75
> >>> 11:12:23 all 0.00 0.00 6.92 0.00 0.00 2.48 0.00 0.00 0.00 90.61
> >>> 11:12:24 all 0.17 0.00 5.45 0.00 0.00 4.27 0.00 0.00 0.00 90.11
> >>> 11:12:25 all 0.25 0.00 5.38 0.00 0.00 4.79 0.00 0.00 0.00 89.58
> >>> 11:12:26 all 0.60 0.00 1.45 0.00 0.00 2.65 0.00 0.00 0.00 95.30
> >>> 11:12:27 all 0.42 0.00 6.91 0.00 0.00 4.47 0.00 0.00 0.00 88.20
> >>> 11:12:28 all 0.00 0.00 6.75 0.00 0.00 4.18 0.00 0.00 0.00 89.07
> >>> 11:12:29 all 0.17 0.00 3.52 0.00 0.00 5.11 0.00 0.00 0.00 91.20
> >>> 11:12:30 all 1.45 0.00 10.14 0.00 0.00 3.49 0.00 0.00 0.00 84.92
> >>> 11:12:31 all 0.09 0.00 5.11 0.00 0.00 4.77 0.00 0.00 0.00 90.03
> >>> 11:12:32 all 0.25 0.00 3.11 0.00 0.00 4.46 0.00 0.00 0.00 92.17
> >>> Average: all 0.32 0.00 6.21 0.00 0.00 4.21 0.00 0.00 0.00 89.26
> >>>
> >>>
> >>> I also attached a screenshot from perf top (the screenshot was sent in a previous mail).
> >>>
> >>> And I see in lsmod
> >>>
> >>> pppoe 20480 8198
> >>> pppox 16384 1 pppoe
> >>> ppp_generic 45056 16364 pppox,pppoe
> >>> slhc 16384 1 ppp_generic
> >>>
> >>> pointing to slow removal of pppoe sessions.
> >>>
> >>> And from log :
> >>>
> >>> [2021-09-07 11:01:11.129] vlan3020: ebdd1c5d8b5900f6: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>> [2021-09-07 11:01:53.621] vlan643: ebdd1c5d8b59014e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>> [2021-09-07 11:02:00.359] vlan1616: ebdd1c5d8b590195: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>> [2021-09-07 11:02:05.859] vlan3020: ebdd1c5d8b5900d8: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>> [2021-09-07 11:02:08.258] vlan3005: ebdd1c5d8b590190: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>> [2021-09-07 11:02:13.820] vlan643: ebdd1c5d8b590152: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>> [2021-09-07 11:02:15.839] vlan727: ebdd1c5d8b590144: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>> [2021-09-07 11:02:20.139] vlan1693: ebdd1c5d8b59019f: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>
> >>>> On 11 Aug 2021, at 19:48, Guillaume Nault <gnault@redhat.com> wrote:
> >>>>
> >>>> On Wed, Aug 11, 2021 at 02:10:32PM +0300, Martin Zaharinov wrote:
> >>>>> And one more thing I see.
> >>>>>
> >>>>> The problem comes when accel starts terminating sessions.
> >>>>> Right now the server has 2k users; restarting 3 OLTs with 400 users on one of the vlans affects the other vlans.
> >>>>> The problem starts when the kernel begins destroying the dead sessions from the vlan with the 3 OLTs, and this affects all the other vlans.
> >>>>> Maybe the kernel destroys old sessions slowly and stalls other users by locking their sessions.
> >>>>> Is there a way to speed up the closing of stopped/dead sessions?
> >>>>
> >>>> What are the CPU stats when that happens? Is it user space or kernel
> >>>> space that keeps it busy?
> >>>>
> >>>> One easy way to check is to run "mpstat 1" for a few seconds when the
> >>>> problem occurs.
> >>>>
> >>>
> >>
> >
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
2021-09-14 8:02 ` Guillaume Nault
@ 2021-09-14 9:50 ` Florian Westphal
2021-09-14 10:01 ` Martin Zaharinov
2021-09-14 10:53 ` Martin Zaharinov
0 siblings, 2 replies; 23+ messages in thread
From: Florian Westphal @ 2021-09-14 9:50 UTC (permalink / raw)
To: Guillaume Nault
Cc: Martin Zaharinov, Pali Rohár, Greg KH, netdev, Eric Dumazet
Guillaume Nault <gnault@redhat.com> wrote:
> > While the problem is happening, running "ip a" to list interfaces takes 15-20 seconds. I can finally reproduce it, but users get angry when their internet goes down.
>
> Probably some contention on the rtnl lock.
Yes, I'll create a patch.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
2021-09-14 9:50 ` Florian Westphal
@ 2021-09-14 10:01 ` Martin Zaharinov
2021-09-14 11:00 ` Florian Westphal
2021-09-14 10:53 ` Martin Zaharinov
1 sibling, 1 reply; 23+ messages in thread
From: Martin Zaharinov @ 2021-09-14 10:01 UTC (permalink / raw)
To: Florian Westphal
Cc: Guillaume Nault, Pali Rohár, Greg KH, netdev, Eric Dumazet
Hi Guillaume and Florian,
Guillaume:
No, I have not tested that; we need conntrack to log user traffic.
Florian:
If you make a patch, please send it to me to test.
Martin
> On 14 Sep 2021, at 12:50, Florian Westphal <fw@strlen.de> wrote:
>
> Guillaume Nault <gnault@redhat.com> wrote:
>>> While the problem is happening, running "ip a" to list interfaces takes 15-20 seconds. I can finally reproduce it, but users get angry when their internet goes down.
>>
>> Probably some contention on the rtnl lock.
>
> Yes, I'll create a patch.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
2021-09-14 9:50 ` Florian Westphal
2021-09-14 10:01 ` Martin Zaharinov
@ 2021-09-14 10:53 ` Martin Zaharinov
1 sibling, 0 replies; 23+ messages in thread
From: Martin Zaharinov @ 2021-09-14 10:53 UTC (permalink / raw)
To: Florian Westphal
Cc: Guillaume Nault, Pali Rohár, Greg KH, netdev, Eric Dumazet
Hi Florian,
One more thing: I tried to remove nf_nat and xt_MASQUERADE, and while the problem is happening the removal takes 50-80 seconds and overloads the system.
See perf from that moment:
PerfTop: 1738 irqs/sec kernel:85.0% exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles], (all, 12 CPUs)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
40.63% [nf_conntrack] [k] nf_ct_iterate_cleanup
21.23% [kernel] [k] __local_bh_enable_ip
10.93% [kernel] [k] __cond_resched
9.20% [kernel] [k] _raw_spin_lock
8.91% [kernel] [k] rcu_all_qs
5.83% [nf_conntrack] [k] nf_conntrack_lock
0.10% [kernel] [k] mutex_spin_on_owner
0.08% telegraf [.] 0x0000000000021bf0
0.06% [kernel] [k] osq_lock
0.06% [kernel] [k] kallsyms_expand_symbol.constprop.0
0.05% [kernel] [k] format_decode
0.04% [kernel] [k] rtnl_fill_ifinfo.constprop.0.isra.0
0.04% perf [.] 0x00000000000bc7b3
0.04% [kernel] [k] memcpy_erms
0.03% [kernel] [k] string
0.03% [kernel] [k] menu_select
0.03% [kernel] [k] nla_put
0.03% [kernel] [k] vsnprintf
Martin
> On 14 Sep 2021, at 12:50, Florian Westphal <fw@strlen.de> wrote:
>
> Guillaume Nault <gnault@redhat.com> wrote:
>>> While the problem is happening, running "ip a" to list interfaces takes 15-20 seconds. I can finally reproduce it, but users get angry when their internet goes down.
>>
>> Probably some contention on the rtnl lock.
>
> Yes, I'll create a patch.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
2021-09-14 10:01 ` Martin Zaharinov
@ 2021-09-14 11:00 ` Florian Westphal
2021-09-15 14:25 ` Martin Zaharinov
2021-09-16 20:00 ` Martin Zaharinov
0 siblings, 2 replies; 23+ messages in thread
From: Florian Westphal @ 2021-09-14 11:00 UTC (permalink / raw)
To: Martin Zaharinov; +Cc: Florian Westphal, Guillaume Nault, netdev
[-- Attachment #1: Type: text/plain, Size: 238 bytes --]
Martin Zaharinov <micron10@gmail.com> wrote:
[ Trimming CC list ]
> Florian:
>
> If you make patch send to test please.
Attached. No idea if it helps, but 'ip' should stay responsive
even when masquerade processes netdevice events.
[-- Attachment #2: defer_masq_work.diff --]
[-- Type: text/x-diff, Size: 6674 bytes --]
diff --git a/net/netfilter/nf_nat_masquerade.c b/net/netfilter/nf_nat_masquerade.c
index 8e8a65d46345..50c6d6992ed6 100644
--- a/net/netfilter/nf_nat_masquerade.c
+++ b/net/netfilter/nf_nat_masquerade.c
@@ -9,8 +9,19 @@
#include <net/netfilter/nf_nat_masquerade.h>
+struct masq_dev_work {
+ struct work_struct work;
+ struct net *net;
+ union nf_inet_addr addr;
+ int ifindex;
+ int (*iter)(struct nf_conn *i, void *data);
+};
+
+#define MAX_MASQ_WORKER_COUNT 16
+
static DEFINE_MUTEX(masq_mutex);
static unsigned int masq_refcnt __read_mostly;
+static atomic_t masq_worker_count __read_mostly;
unsigned int
nf_nat_masquerade_ipv4(struct sk_buff *skb, unsigned int hooknum,
@@ -63,13 +74,68 @@ nf_nat_masquerade_ipv4(struct sk_buff *skb, unsigned int hooknum,
}
EXPORT_SYMBOL_GPL(nf_nat_masquerade_ipv4);
-static int device_cmp(struct nf_conn *i, void *ifindex)
+static void iterate_cleanup_work(struct work_struct *work)
+{
+ struct masq_dev_work *w;
+
+ w = container_of(work, struct masq_dev_work, work);
+
+ nf_ct_iterate_cleanup_net(w->net, w->iter, (void *)w, 0, 0);
+
+ put_net(w->net);
+ kfree(w);
+ atomic_dec(&masq_worker_count);
+ module_put(THIS_MODULE);
+}
+
+/* Iterate conntrack table in the background and remove conntrack entries
+ * that use the device/address being removed.
+ *
+ * In case too many work items have been queued already or memory allocation
+ * fails iteration is skipped, conntrack entries will time out eventually.
+ */
+static void nf_nat_masq_schedule(struct net *net, union nf_inet_addr *addr,
+ int ifindex,
+ int (*iter)(struct nf_conn *i, void *data),
+ gfp_t gfp_flags)
+{
+ struct masq_dev_work *w;
+
+ net = maybe_get_net(net);
+ if (!net)
+ return;
+
+ if (!try_module_get(THIS_MODULE))
+ goto err_module;
+
+ w = kzalloc(sizeof(*w), gfp_flags);
+ if (w) {
+ /* We can overshoot MAX_MASQ_WORKER_COUNT, no big deal */
+ atomic_inc(&masq_worker_count);
+
+ INIT_WORK(&w->work, iterate_cleanup_work);
+ w->ifindex = ifindex;
+ w->net = net;
+ w->iter = iter;
+ if (addr)
+ w->addr = *addr;
+ schedule_work(&w->work);
+ return;
+ }
+
+ module_put(THIS_MODULE);
+ err_module:
+ put_net(net);
+}
+
+static int device_cmp(struct nf_conn *i, void *arg)
{
const struct nf_conn_nat *nat = nfct_nat(i);
+ const struct masq_dev_work *w = arg;
if (!nat)
return 0;
- return nat->masq_index == (int)(long)ifindex;
+ return nat->masq_index == w->ifindex;
}
static int masq_device_event(struct notifier_block *this,
@@ -85,8 +151,8 @@ static int masq_device_event(struct notifier_block *this,
* and forget them.
*/
- nf_ct_iterate_cleanup_net(net, device_cmp,
- (void *)(long)dev->ifindex, 0, 0);
+ nf_nat_masq_schedule(net, NULL, dev->ifindex,
+ device_cmp, GFP_KERNEL);
}
return NOTIFY_DONE;
@@ -94,35 +160,45 @@ static int masq_device_event(struct notifier_block *this,
static int inet_cmp(struct nf_conn *ct, void *ptr)
{
- struct in_ifaddr *ifa = (struct in_ifaddr *)ptr;
- struct net_device *dev = ifa->ifa_dev->dev;
struct nf_conntrack_tuple *tuple;
+ struct masq_dev_work *w = ptr;
- if (!device_cmp(ct, (void *)(long)dev->ifindex))
+ if (!device_cmp(ct, ptr))
return 0;
tuple = &ct->tuplehash[IP_CT_DIR_REPLY].tuple;
- return ifa->ifa_address == tuple->dst.u3.ip;
+ return nf_inet_addr_cmp(&w->addr, &tuple->dst.u3);
}
static int masq_inet_event(struct notifier_block *this,
unsigned long event,
void *ptr)
{
- struct in_device *idev = ((struct in_ifaddr *)ptr)->ifa_dev;
- struct net *net = dev_net(idev->dev);
+ const struct in_ifaddr *ifa = ptr;
+ const struct in_device *idev;
+ const struct net_device *dev;
+ union nf_inet_addr addr;
+
+ if (event != NETDEV_DOWN)
+ return NOTIFY_DONE;
/* The masq_dev_notifier will catch the case of the device going
* down. So if the inetdev is dead and being destroyed we have
* no work to do. Otherwise this is an individual address removal
* and we have to perform the flush.
*/
+ idev = ifa->ifa_dev;
if (idev->dead)
return NOTIFY_DONE;
- if (event == NETDEV_DOWN)
- nf_ct_iterate_cleanup_net(net, inet_cmp, ptr, 0, 0);
+ memset(&addr, 0, sizeof(addr));
+
+ addr.ip = ifa->ifa_address;
+
+ dev = idev->dev;
+ nf_nat_masq_schedule(dev_net(idev->dev), &addr, dev->ifindex,
+ inet_cmp, GFP_KERNEL);
return NOTIFY_DONE;
}
@@ -136,8 +212,6 @@ static struct notifier_block masq_inet_notifier = {
};
#if IS_ENABLED(CONFIG_IPV6)
-static atomic_t v6_worker_count __read_mostly;
-
static int
nat_ipv6_dev_get_saddr(struct net *net, const struct net_device *dev,
const struct in6_addr *daddr, unsigned int srcprefs,
@@ -187,40 +261,6 @@ nf_nat_masquerade_ipv6(struct sk_buff *skb, const struct nf_nat_range2 *range,
}
EXPORT_SYMBOL_GPL(nf_nat_masquerade_ipv6);
-struct masq_dev_work {
- struct work_struct work;
- struct net *net;
- struct in6_addr addr;
- int ifindex;
-};
-
-static int inet6_cmp(struct nf_conn *ct, void *work)
-{
- struct masq_dev_work *w = (struct masq_dev_work *)work;
- struct nf_conntrack_tuple *tuple;
-
- if (!device_cmp(ct, (void *)(long)w->ifindex))
- return 0;
-
- tuple = &ct->tuplehash[IP_CT_DIR_REPLY].tuple;
-
- return ipv6_addr_equal(&w->addr, &tuple->dst.u3.in6);
-}
-
-static void iterate_cleanup_work(struct work_struct *work)
-{
- struct masq_dev_work *w;
-
- w = container_of(work, struct masq_dev_work, work);
-
- nf_ct_iterate_cleanup_net(w->net, inet6_cmp, (void *)w, 0, 0);
-
- put_net(w->net);
- kfree(w);
- atomic_dec(&v6_worker_count);
- module_put(THIS_MODULE);
-}
-
/* atomic notifier; can't call nf_ct_iterate_cleanup_net (it can sleep).
*
* Defer it to the system workqueue.
@@ -233,36 +273,19 @@ static int masq_inet6_event(struct notifier_block *this,
{
struct inet6_ifaddr *ifa = ptr;
const struct net_device *dev;
- struct masq_dev_work *w;
- struct net *net;
+ union nf_inet_addr addr;
- if (event != NETDEV_DOWN || atomic_read(&v6_worker_count) >= 16)
+ if (event != NETDEV_DOWN)
return NOTIFY_DONE;
dev = ifa->idev->dev;
- net = maybe_get_net(dev_net(dev));
- if (!net)
- return NOTIFY_DONE;
-
- if (!try_module_get(THIS_MODULE))
- goto err_module;
- w = kmalloc(sizeof(*w), GFP_ATOMIC);
- if (w) {
- atomic_inc(&v6_worker_count);
+ memset(&addr, 0, sizeof(addr));
- INIT_WORK(&w->work, iterate_cleanup_work);
- w->ifindex = dev->ifindex;
- w->net = net;
- w->addr = ifa->addr;
- schedule_work(&w->work);
+ addr.in6 = ifa->addr;
- return NOTIFY_DONE;
- }
-
- module_put(THIS_MODULE);
- err_module:
- put_net(net);
+ nf_nat_masq_schedule(dev_net(dev), &addr, dev->ifindex, inet_cmp,
+ GFP_ATOMIC);
return NOTIFY_DONE;
}
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
2021-09-14 11:00 ` Florian Westphal
@ 2021-09-15 14:25 ` Martin Zaharinov
2021-09-15 14:37 ` Martin Zaharinov
2021-09-16 20:00 ` Martin Zaharinov
1 sibling, 1 reply; 23+ messages in thread
From: Martin Zaharinov @ 2021-09-15 14:25 UTC (permalink / raw)
To: Florian Westphal; +Cc: Guillaume Nault, netdev
Hey Florian,
I tested it in the lab and it looks much better than before.
See this perf:
PerfTop: 6551 irqs/sec kernel:77.8% exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles], (all, 12 CPUs)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
15.70% [ixgbe] [k] ixgbe_read_reg
13.33% [kernel] [k] mutex_spin_on_owner
7.65% [kernel] [k] osq_lock
2.85% libfrr.so.0.0.0 [.] 0x00000000000ce970
1.94% libfrr.so.0.0.0 [.] 0x00000000000ce94e
1.19% libc.so.6 [.] 0x0000000000186afa
1.15% [kernel] [k] do_poll.constprop.0
0.99% [kernel] [k] inet_dump_ifaddr
0.94% libteam.so.5.6.1 [.] 0x0000000000006470
0.79% libc.so.6 [.] 0x0000000000186e57
0.71% [ixgbe] [k] ixgbe_update_mc_addr_list_generic
0.65% [kernel] [k] __fget_files
0.61% [kernel] [k] sock_poll
0.57% libteam.so.5.6.1 [.] 0x0000000000009e7d
0.51% perf [.] 0x00000000000bc7b3
0.51% libteam.so.5.6.1 [.] 0x0000000000006501
0.48% [kernel] [k] next_uptodate_page
0.46% [kernel] [k] _raw_read_lock_bh
0.43% libc.so.6 [.] 0x0000000000186eac
0.42% bgpd [.] 0x0000000000070a46
0.41% [pppoe] [k] pppoe_flush_dev
0.39% [kernel] [k] zap_pte_range
This was captured while removing and re-adding interfaces as users dropped and reconnected.
Now the "ip a" command works fine!
Martin
> On 14 Sep 2021, at 14:00, Florian Westphal <fw@strlen.de> wrote:
>
> Martin Zaharinov <micron10@gmail.com> wrote:
>
> [ Trimming CC list ]
>
>> Florian:
>>
>> If you make patch send to test please.
>
> Attached. No idea if it helps, but 'ip' should stay responsive
> even when masquerade processes netdevice events.
> <defer_masq_work.diff>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
2021-09-15 14:25 ` Martin Zaharinov
@ 2021-09-15 14:37 ` Martin Zaharinov
0 siblings, 0 replies; 23+ messages in thread
From: Martin Zaharinov @ 2021-09-15 14:37 UTC (permalink / raw)
To: Florian Westphal; +Cc: Guillaume Nault, netdev
And this one:
PerfTop: 26378 irqs/sec kernel:61.4% exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles], (all, 12 CPUs)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
5.65% libfrr.so.0.0.0 [.] 0x00000000000ce970
5.56% [kernel] [k] osq_lock
5.22% [kernel] [k] mutex_spin_on_owner
3.66% [pppoe] [k] pppoe_flush_dev
3.01% libfrr.so.0.0.0 [.] 0x00000000000ce94e
1.98% libc.so.6 [.] 0x00000000000965a2
1.84% libc.so.6 [.] 0x0000000000186afa
1.55% libc.so.6 [.] 0x0000000000186e57
1.54% zebra [.] 0x00000000000b9271
1.46% zebra [.] 0x00000000000b91f1
1.46% libteam.so.5.6.1 [.] 0x0000000000006470
1.44% libc.so.6 [.] 0x00000000000965a0
1.30% libteam.so.5.6.1 [.] 0x0000000000009e7d
1.08% [kernel] [k] fib_table_flush
1.02% libc.so.6 [.] 0x0000000000186eac
0.93% [kernel] [k] do_poll.constprop.0
0.85% libc.so.6 [.] 0x0000000000186afe
0.80% dtvbras [.] 0x0000000000014be8
0.78% [kernel] [k] queued_read_lock_slowpath
0.72% [kernel] [k] next_uptodate_page
0.64% [kernel] [k] zap_pte_range
0.64% bgpd [.] 0x0000000000070a46
0.61% [kernel] [k] fib_table_insert
> On 15 Sep 2021, at 17:25, Martin Zaharinov <micron10@gmail.com> wrote:
>
> Hey Florian
>
> Tested in the lab and it looks much better than before.
>
> See this perf output:
>
> PerfTop: 6551 irqs/sec kernel:77.8% exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles], (all, 12 CPUs)
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> 15.70% [ixgbe] [k] ixgbe_read_reg
> 13.33% [kernel] [k] mutex_spin_on_owner
> 7.65% [kernel] [k] osq_lock
> 2.85% libfrr.so.0.0.0 [.] 0x00000000000ce970
> 1.94% libfrr.so.0.0.0 [.] 0x00000000000ce94e
> 1.19% libc.so.6 [.] 0x0000000000186afa
> 1.15% [kernel] [k] do_poll.constprop.0
> 0.99% [kernel] [k] inet_dump_ifaddr
> 0.94% libteam.so.5.6.1 [.] 0x0000000000006470
> 0.79% libc.so.6 [.] 0x0000000000186e57
> 0.71% [ixgbe] [k] ixgbe_update_mc_addr_list_generic
> 0.65% [kernel] [k] __fget_files
> 0.61% [kernel] [k] sock_poll
> 0.57% libteam.so.5.6.1 [.] 0x0000000000009e7d
> 0.51% perf [.] 0x00000000000bc7b3
> 0.51% libteam.so.5.6.1 [.] 0x0000000000006501
> 0.48% [kernel] [k] next_uptodate_page
> 0.46% [kernel] [k] _raw_read_lock_bh
> 0.43% libc.so.6 [.] 0x0000000000186eac
> 0.42% bgpd [.] 0x0000000000070a46
> 0.41% [pppoe] [k] pppoe_flush_dev
> 0.39% [kernel] [k] zap_pte_range
>
>
> This happened when removing and adding a new interface while users were dropping and reconnecting.
>
>
> Now the 'ip a' command works fine!
>
>
> Martin
>
>
>> On 14 Sep 2021, at 14:00, Florian Westphal <fw@strlen.de> wrote:
>>
>> Martin Zaharinov <micron10@gmail.com> wrote:
>>
>> [ Trimming CC list ]
>>
>>> Florian:
>>>
>>> If you make a patch, please send it for testing.
>>
>> Attached. No idea if it helps, but 'ip' should stay responsive
>> even when masquerade processes netdevice events.
>> <defer_masq_work.diff>
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
2021-09-14 11:00 ` Florian Westphal
2021-09-15 14:25 ` Martin Zaharinov
@ 2021-09-16 20:00 ` Martin Zaharinov
1 sibling, 0 replies; 23+ messages in thread
From: Martin Zaharinov @ 2021-09-16 20:00 UTC (permalink / raw)
To: Florian Westphal; +Cc: Guillaume Nault, netdev
Small update:
After switching BGP from frr to bird, the load from frr is reduced,
but pppoe_flush_dev is still slow when 5k+ users disconnect.
PerfTop: 15606 irqs/sec kernel:77.7% exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles], (all, 12 CPUs)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
8.24% [kernel] [k] osq_lock
7.55% [kernel] [k] mutex_spin_on_owner
7.04% [pppoe] [k] pppoe_flush_dev
2.77% libteam.so.5.6.1 [.] 0x0000000000009e7d
2.67% libteam.so.5.6.1 [.] 0x0000000000006470
1.90% [kernel] [k] fib_table_flush
1.73% [kernel] [k] queued_read_lock_slowpath
1.68% [kernel] [k] next_uptodate_page
1.36% ip [.] 0x0000000000011b74
1.23% ip [.] 0x00000000000121b0
1.09% [kernel] [k] zap_pte_range
0.99% libteam.so.5.6.1 [.] 0x0000000000006501
0.88% dtvbras [.] 0x0000000000014be8
0.87% [kernel] [k] inet_dump_ifaddr
0.74% [kernel] [k] filemap_map_pages
0.72% [kernel] [k] neigh_flush_dev.isra.0
0.66% [kernel] [k] snmp_get_cpu_field
0.65% [kernel] [k] fib_table_insert
0.63% [kernel] [k] native_irq_return_iret
0.63% libteam.so.5.6.1 [.] 0x0000000000005c78
0.60% [kernel] [k] copy_page
0.52% libteam.so.5.6.1 [.] 0x000000000000647f
0.50% [kernel] [k] _raw_spin_lock
0.48% libc.so.6 [.] 0x00000000000965a2
0.45% [kernel] [k] _raw_read_lock_bh
0.44% [kernel] [k] release_pages
0.42% [kernel] [k] clear_page_erms
0.42% [kernel] [k] page_remove_rmap
0.41% [kernel] [k] queued_spin_lock_slowpath
0.38% [kernel] [k] kmem_cache_alloc
0.36% [kernel] [k] vma_interval_tree_insert
0.36% libteam.so.5.6.1 [.] 0x0000000000009e6f
0.36% [kernel] [k] do_set_pte
sessions:
starting: 296
active: 3868
finishing: 6748
> On 14 Sep 2021, at 14:00, Florian Westphal <fw@strlen.de> wrote:
>
> Martin Zaharinov <micron10@gmail.com> wrote:
>
> [ Trimming CC list ]
>
>> Florian:
>>
>> If you make patch send to test please.
>
> Attached. No idea if it helps, but 'ip' should stay responsive
> even when masquerade processes netdevice events.
> <defer_masq_work.diff>
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2021-09-16 20:00 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-05 20:53 Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected Martin Zaharinov
2021-08-06 4:40 ` Greg KH
2021-08-06 5:40 ` Martin Zaharinov
2021-08-08 15:14 ` Martin Zaharinov
2021-08-08 15:23 ` Pali Rohár
2021-08-08 15:29 ` Martin Zaharinov
2021-08-09 15:15 ` Pali Rohár
2021-08-10 18:27 ` Martin Zaharinov
2021-08-11 16:40 ` Guillaume Nault
2021-08-11 11:10 ` Martin Zaharinov
2021-08-11 16:48 ` Guillaume Nault
2021-09-07 6:16 ` Martin Zaharinov
2021-09-07 6:42 ` Martin Zaharinov
2021-09-11 6:26 ` Martin Zaharinov
2021-09-14 6:16 ` Martin Zaharinov
2021-09-14 8:02 ` Guillaume Nault
2021-09-14 9:50 ` Florian Westphal
2021-09-14 10:01 ` Martin Zaharinov
2021-09-14 11:00 ` Florian Westphal
2021-09-15 14:25 ` Martin Zaharinov
2021-09-15 14:37 ` Martin Zaharinov
2021-09-16 20:00 ` Martin Zaharinov
2021-09-14 10:53 ` Martin Zaharinov
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).