Netdev Archive on lore.kernel.org
help / color / mirror / Atom feed
* Packet gets stuck in NOLOCK pfifo_fast qdisc
@ 2019-10-09  6:46 Jonas Bonn
  2019-10-09 19:14 ` Paolo Abeni
  0 siblings, 1 reply; 47+ messages in thread
From: Jonas Bonn @ 2019-10-09  6:46 UTC (permalink / raw)
  To: netdev, LKML, David S . Miller, pabeni

Hi,

The lockless pfifo_fast qdisc has an issue with packets getting stuck in 
the queue.  What appears to happen is:

i)  Thread 1 holds the 'seqlock' on the qdisc and dequeues packets.
ii)  Thread 1 dequeues the last packet in the queue.
iii)  Thread 1 iterates through the qdisc->dequeue function again and 
determines that the queue is empty.

iv)  Thread 2 queues up a packet.  Since 'seqlock' is busy, it just 
assumes the packet will be dequeued by whoever is holding the lock.

v)  Thread 1 releases 'seqlock'.

After v), nobody will check if there are packets in the queue until a 
new packet is enqueued.  Thereby, the packet enqueued by Thread 2 may be 
delayed indefinitely.

What, I think, should probably happen is that Thread 1 should check that 
the queue is empty again after releasing 'seqlock'.  I poked at this, 
but it wasn't obvious to me how to go about this given the way the 
layering works here.  Roughly:

qdisc_run_end() {
...
	spin_unlock(seqlock);
	if (!qdisc_is_empty(qdisc))
		qdisc_run();
...
}

Calling qdisc_run() from qdisc_run_end() doesn't feel right!

There's a qdisc->empty property (and qdisc_is_empty() relies on it) but 
it's not particularly useful in this case since there's a race in 
setting this property which makes it not quite reliable.

Hope someone can shine some light on how to proceed here.

/Jonas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2019-10-09  6:46 Packet gets stuck in NOLOCK pfifo_fast qdisc Jonas Bonn
@ 2019-10-09 19:14 ` Paolo Abeni
  2019-10-10  6:27   ` Jonas Bonn
  2019-10-11  0:39   ` Jonas Bonn
  0 siblings, 2 replies; 47+ messages in thread
From: Paolo Abeni @ 2019-10-09 19:14 UTC (permalink / raw)
  To: Jonas Bonn, netdev, LKML, David S . Miller, John Fastabend

On Wed, 2019-10-09 at 08:46 +0200, Jonas Bonn wrote:
> Hi,
> 
> The lockless pfifo_fast qdisc has an issue with packets getting stuck in 
> the queue.  What appears to happen is:
> 
> i)  Thread 1 holds the 'seqlock' on the qdisc and dequeues packets.
> ii)  Thread 1 dequeues the last packet in the queue.
> iii)  Thread 1 iterates through the qdisc->dequeue function again and 
> determines that the queue is empty.
> 
> iv)  Thread 2 queues up a packet.  Since 'seqlock' is busy, it just 
> assumes the packet will be dequeued by whoever is holding the lock.
> 
> v)  Thread 1 releases 'seqlock'.
> 
> After v), nobody will check if there are packets in the queue until a 
> new packet is enqueued.  Thereby, the packet enqueued by Thread 2 may be 
> delayed indefinitely.

I think you are right.

It looks like this possible race is present since the initial lockless
implementation - commit 6b3ba9146fe6 ("net: sched: allow qdiscs to
handle locking")

Anyhow the racing windows looks quite tiny - I never observed that
issue in my tests. Do you have a working reproducer?

Something alike the following code - completely untested - can possibly
address the issue, but it's a bit rough and I would prefer not adding
additonal complexity to the lockless qdiscs, can you please have a spin
a it?

Thanks,

Paolo
---
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 6a70845bd9ab..65a1c03330d6 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -113,18 +113,23 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 		     struct net_device *dev, struct netdev_queue *txq,
 		     spinlock_t *root_lock, bool validate);
 
-void __qdisc_run(struct Qdisc *q);
+int __qdisc_run(struct Qdisc *q);
 
 static inline void qdisc_run(struct Qdisc *q)
 {
+	int quota = 0;
+
 	if (qdisc_run_begin(q)) {
 		/* NOLOCK qdisc must check 'state' under the qdisc seqlock
 		 * to avoid racing with dev_qdisc_reset()
 		 */
 		if (!(q->flags & TCQ_F_NOLOCK) ||
 		    likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
-			__qdisc_run(q);
+			quota = __qdisc_run(q);
 		qdisc_run_end(q);
+
+		if (quota > 0 && q->flags & TCQ_F_NOLOCK && q->ops->peek(q))
+			__netif_schedule(q);
 	}
 }
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 17bd8f539bc7..013480f6a794 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -376,7 +376,7 @@ static inline bool qdisc_restart(struct Qdisc *q, int *packets)
 	return sch_direct_xmit(skb, q, dev, txq, root_lock, validate);
 }
 
-void __qdisc_run(struct Qdisc *q)
+int __qdisc_run(struct Qdisc *q)
 {
 	int quota = dev_tx_weight;
 	int packets;
@@ -390,9 +390,10 @@ void __qdisc_run(struct Qdisc *q)
 		quota -= packets;
 		if (quota <= 0 || need_resched()) {
 			__netif_schedule(q);
-			break;
+			return 0;
 		}
 	}
+	return quota;
 }
 
 unsigned long dev_trans_start(struct net_device *dev)


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2019-10-09 19:14 ` Paolo Abeni
@ 2019-10-10  6:27   ` Jonas Bonn
  2019-10-11  0:39   ` Jonas Bonn
  1 sibling, 0 replies; 47+ messages in thread
From: Jonas Bonn @ 2019-10-10  6:27 UTC (permalink / raw)
  To: Paolo Abeni, netdev, LKML, David S . Miller, John Fastabend

Hi Paolo,

On 09/10/2019 21:14, Paolo Abeni wrote:
> On Wed, 2019-10-09 at 08:46 +0200, Jonas Bonn wrote:
>> Hi,
>>
>> The lockless pfifo_fast qdisc has an issue with packets getting stuck in
>> the queue.  What appears to happen is:
>>
>> i)  Thread 1 holds the 'seqlock' on the qdisc and dequeues packets.
>> ii)  Thread 1 dequeues the last packet in the queue.
>> iii)  Thread 1 iterates through the qdisc->dequeue function again and
>> determines that the queue is empty.
>>
>> iv)  Thread 2 queues up a packet.  Since 'seqlock' is busy, it just
>> assumes the packet will be dequeued by whoever is holding the lock.
>>
>> v)  Thread 1 releases 'seqlock'.
>>
>> After v), nobody will check if there are packets in the queue until a
>> new packet is enqueued.  Thereby, the packet enqueued by Thread 2 may be
>> delayed indefinitely.
> 
> I think you are right.
> 
> It looks like this possible race is present since the initial lockless
> implementation - commit 6b3ba9146fe6 ("net: sched: allow qdiscs to
> handle locking")
> 
> Anyhow the racing windows looks quite tiny - I never observed that
> issue in my tests. Do you have a working reproducer?

Yes, it's reliably reproducible.  We do network latency measurements and 
latency spikes for these packets that get stuck in the queue.

> 
> Something alike the following code - completely untested - can possibly
> address the issue, but it's a bit rough and I would prefer not adding
> additonal complexity to the lockless qdiscs, can you please have a spin
> a it?

Your change looks reasonable.  I'll give it a try.


> 
> Thanks,
> 
> Paolo
> ---
> diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
> index 6a70845bd9ab..65a1c03330d6 100644
> --- a/include/net/pkt_sched.h
> +++ b/include/net/pkt_sched.h
> @@ -113,18 +113,23 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
>   		     struct net_device *dev, struct netdev_queue *txq,
>   		     spinlock_t *root_lock, bool validate);
>   
> -void __qdisc_run(struct Qdisc *q);
> +int __qdisc_run(struct Qdisc *q);
>   
>   static inline void qdisc_run(struct Qdisc *q)
>   {
> +	int quota = 0;
> +
>   	if (qdisc_run_begin(q)) {
>   		/* NOLOCK qdisc must check 'state' under the qdisc seqlock
>   		 * to avoid racing with dev_qdisc_reset()
>   		 */
>   		if (!(q->flags & TCQ_F_NOLOCK) ||
>   		    likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
> -			__qdisc_run(q);
> +			quota = __qdisc_run(q);
>   		qdisc_run_end(q);
> +
> +		if (quota > 0 && q->flags & TCQ_F_NOLOCK && q->ops->peek(q))
> +			__netif_schedule(q);

Not sure this is relevant, but there's a subtle difference in the way 
that the underlying ptr_ring peeks at the queue head and checks whether 
the queue is empty.

For peek it's:

READ_ONCE(r->queue[r->consumer_head]);

For is_empty it's:

!r->queue[READ_ONCE(r->consumer_head)];

The placement of the READ_ONCE changes here.  I can't get my head around 
whether this difference is significant or not.  If it is, then perhaps 
an is_empty() method is needed on the qdisc_ops...???

/Jonas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2019-10-09 19:14 ` Paolo Abeni
  2019-10-10  6:27   ` Jonas Bonn
@ 2019-10-11  0:39   ` Jonas Bonn
  2020-06-23 13:42     ` Michael Zhivich
  1 sibling, 1 reply; 47+ messages in thread
From: Jonas Bonn @ 2019-10-11  0:39 UTC (permalink / raw)
  To: Paolo Abeni, netdev, LKML, David S . Miller, John Fastabend

Hi Paolo,

On 09/10/2019 21:14, Paolo Abeni wrote:
> Something alike the following code - completely untested - can possibly
> address the issue, but it's a bit rough and I would prefer not adding
> additonal complexity to the lockless qdiscs, can you please have a spin
> a it?

We've tested a couple of variants of this patch today, but unfortunately 
it doesn't fix the problem of packets getting stuck in the queue.

A couple of comments:

i) On 5.4, there is the BYPASS path that also needs the same treatment 
as it's essentially replicating the behavour of qdisc_run, just without 
the queue/dequeue steps

ii)  We are working a lot with the 4.19 kernel so I backported to the 
patch to this version and tested there.  Here the solution would seem to 
be more robust as the BYPASS path does not exist.

Unfortunately, in both cases we continue to see the issue of the "last 
packet" getting stuck in the queue.

/Jonas


> 
> Thanks,
> 
> Paolo
> ---
> diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
> index 6a70845bd9ab..65a1c03330d6 100644
> --- a/include/net/pkt_sched.h
> +++ b/include/net/pkt_sched.h
> @@ -113,18 +113,23 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
>   		     struct net_device *dev, struct netdev_queue *txq,
>   		     spinlock_t *root_lock, bool validate);
>   
> -void __qdisc_run(struct Qdisc *q);
> +int __qdisc_run(struct Qdisc *q);
>   
>   static inline void qdisc_run(struct Qdisc *q)
>   {
> +	int quota = 0;
> +
>   	if (qdisc_run_begin(q)) {
>   		/* NOLOCK qdisc must check 'state' under the qdisc seqlock
>   		 * to avoid racing with dev_qdisc_reset()
>   		 */
>   		if (!(q->flags & TCQ_F_NOLOCK) ||
>   		    likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
> -			__qdisc_run(q);
> +			quota = __qdisc_run(q);
>   		qdisc_run_end(q);
> +
> +		if (quota > 0 && q->flags & TCQ_F_NOLOCK && q->ops->peek(q))
> +			__netif_schedule(q);
>   	}
>   }
>   
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 17bd8f539bc7..013480f6a794 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -376,7 +376,7 @@ static inline bool qdisc_restart(struct Qdisc *q, int *packets)
>   	return sch_direct_xmit(skb, q, dev, txq, root_lock, validate);
>   }
>   
> -void __qdisc_run(struct Qdisc *q)
> +int __qdisc_run(struct Qdisc *q)
>   {
>   	int quota = dev_tx_weight;
>   	int packets;
> @@ -390,9 +390,10 @@ void __qdisc_run(struct Qdisc *q)
>   		quota -= packets;
>   		if (quota <= 0 || need_resched()) {
>   			__netif_schedule(q);
> -			break;
> +			return 0;
>   		}
>   	}
> +	return quota;
>   }
>   
>   unsigned long dev_trans_start(struct net_device *dev)
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2019-10-11  0:39   ` Jonas Bonn
@ 2020-06-23 13:42     ` Michael Zhivich
  2020-06-30 19:14       ` Josh Hunt
  0 siblings, 1 reply; 47+ messages in thread
From: Michael Zhivich @ 2020-06-23 13:42 UTC (permalink / raw)
  To: jonas.bonn
  Cc: davem, john.fastabend, linux-kernel, netdev, pabeni, johunt, mzhivich

> From: Jonas Bonn <jonas.bonn@netrounds.com>
> To: Paolo Abeni <pabeni@redhat.com>,
> 	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
> 	LKML <linux-kernel@vger.kernel.org>,
> 	"David S . Miller" <davem@davemloft.net>,
> 	John Fastabend <john.fastabend@gmail.com>
> Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
> Date: Fri, 11 Oct 2019 02:39:48 +0200
> Message-ID: <465a540e-5296-32e7-f6a6-79942dfe2618@netrounds.com> (raw)
> In-Reply-To: <95c5a697932e19ebd6577b5dac4d7052fe8c4255.camel@redhat.com>
> 
> Hi Paolo,
> 
> On 09/10/2019 21:14, Paolo Abeni wrote:
> > Something alike the following code - completely untested - can possibly
> > address the issue, but it's a bit rough and I would prefer not adding
> > additonal complexity to the lockless qdiscs, can you please have a spin
> > a it?
> 
> We've tested a couple of variants of this patch today, but unfortunately 
> it doesn't fix the problem of packets getting stuck in the queue.
> 
> A couple of comments:
> 
> i) On 5.4, there is the BYPASS path that also needs the same treatment 
> as it's essentially replicating the behavour of qdisc_run, just without 
> the queue/dequeue steps
> 
> ii)  We are working a lot with the 4.19 kernel so I backported to the 
> patch to this version and tested there.  Here the solution would seem to 
> be more robust as the BYPASS path does not exist.
> 
> Unfortunately, in both cases we continue to see the issue of the "last 
> packet" getting stuck in the queue.
> 
> /Jonas

Hello Jonas, Paolo,

We have observed the same problem with pfifo_fast qdisc when sending periodic small
packets on a TCP flow with multiple simultaneous connections on a 4.19.75
kernel.  We've been able to catch it in action using perf probes (see trace
below).  For qdisc = 0xffff900d7c247c00, skb = 0xffff900b72c334f0,
it takes 200270us to traverse the networking stack on a system that's not otherwise busy.
qdisc only resumes processing when another enqueued packet comes in,
so the packet could have been stuck indefinitely.

   proc-19902 19902 [032] 580644.045480: probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) qdisc=0xffff900d7c247c00 skb=0xffff900bfc294af0 band=2 atomic_qlen=0
   proc-19902 19902 [032] 580644.045480:     probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00 skb=0xffffffff9b69d8c0 band=2
   proc-19927 19927 [014] 580644.045480:      probe:tcp_transmit_skb2: (ffffffff9b6dc4e5) skb=0xffff900b72c334f0 sk=0xffff900d62958040 source=0x4b4e dest=0x9abe
   proc-19902 19902 [032] 580644.045480: probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) qdisc=0xffff900d7c247c00 skb=0x0 band=3 atomic_qlen=0
   proc-19927 19927 [014] 580644.045481:      probe:ip_finish_output2: (ffffffff9b6bc650) net=0xffffffff9c107c80 sk=0xffff900d62958040 skb=0xffff900b72c334f0 __func__=0x0
   proc-19902 19902 [032] 580644.045481:        probe:sch_direct_xmit: (ffffffff9b69e570) skb=0xffff900bfc294af0 q=0xffff900d7c247c00 dev=0xffff900d6a140000 txq=0xffff900d6a181180 root_lock=0x0 validate=1 ret=-1 again=155
   proc-19927 19927 [014] 580644.045481:            net:net_dev_queue: dev=eth0 skbaddr=0xffff900b72c334f0 len=115
   proc-19902 19902 [032] 580644.045482:     probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00 skb=0xffffffff9b69d8c0 band=1
   proc-19927 19927 [014] 580644.045483:     probe:pfifo_fast_enqueue: (ffffffff9b69d9f0) skb=0xffff900b72c334f0 qdisc=0xffff900d7c247c00 to_free=18446622925407304000
   proc-19902 19902 [032] 580644.045483: probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) qdisc=0xffff900d7c247c00 skb=0x0 band=3 atomic_qlen=0
   proc-19927 19927 [014] 580644.045483: probe:pfifo_fast_enqueue_end: (ffffffff9b69da9f) skb=0xffff900b72c334f0 qdisc=0xffff900d7c247c00 to_free=0xffff91d0f67ab940 atomic_qlen=1
   proc-19902 19902 [032] 580644.045484:          probe:__qdisc_run_2: (ffffffff9b69ea5a) q=0xffff900d7c247c00 packets=1
   proc-19927 19927 [014] 580644.245745:     probe:pfifo_fast_enqueue: (ffffffff9b69d9f0) skb=0xffff900d98fdf6f0 qdisc=0xffff900d7c247c00 to_free=18446622925407304000
   proc-19927 19927 [014] 580644.245745: probe:pfifo_fast_enqueue_end: (ffffffff9b69da9f) skb=0xffff900d98fdf6f0 qdisc=0xffff900d7c247c00 to_free=0xffff91d0f67ab940 atomic_qlen=2
   proc-19927 19927 [014] 580644.245746:     probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00 skb=0xffffffff9b69d8c0 band=0
   proc-19927 19927 [014] 580644.245746: probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) qdisc=0xffff900d7c247c00 skb=0xffff900b72c334f0 band=2 atomic_qlen=1
   proc-19927 19927 [014] 580644.245747:     probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00 skb=0xffffffff9b69d8c0 band=2
   proc-19927 19927 [014] 580644.245747: probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) qdisc=0xffff900d7c247c00 skb=0xffff900d98fdf6f0 band=2 atomic_qlen=0
   proc-19927 19927 [014] 580644.245748:     probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00 skb=0xffffffff9b69d8c0 band=2
   proc-19927 19927 [014] 580644.245748: probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) qdisc=0xffff900d7c247c00 skb=0x0 band=3 atomic_qlen=0
   proc-19927 19927 [014] 580644.245749:          qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x0 parent=0xF txq_state=0x0 packets=2 skbaddr=0xffff900b72c334f0
   proc-19927 19927 [014] 580644.245749:        probe:sch_direct_xmit: (ffffffff9b69e570) skb=0xffff900b72c334f0 q=0xffff900d7c247c00 dev=0xffff900d6a140000 txq=0xffff900d6a181180 root_lock=0x0 validate=1 ret=-1 again=155
   proc-19927 19927 [014] 580644.245750:       net:net_dev_start_xmit: dev=eth0 queue_mapping=14 skbaddr=0xffff900b72c334f0 vlan_tagged=0 vlan_proto=0x0000 vlan_tci=0x0000 protocol=0x0800 ip_summed=3 len=115 data_len=0 network_offset=14 transport_offset_valid=1 transport_offset=34 tx_flags=0 gso_size=0 gso_segs=1 gso_type=0x1

I was wondering if you had any more luck in finding a solution or workaround for this problem
(that is, aside from switching to a different qdisc)?

Thanks,
~ Michael

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2020-06-23 13:42     ` Michael Zhivich
@ 2020-06-30 19:14       ` Josh Hunt
  2020-07-01  7:53         ` Jonas Bonn
  2020-07-01 16:05         ` Cong Wang
  0 siblings, 2 replies; 47+ messages in thread
From: Josh Hunt @ 2020-06-30 19:14 UTC (permalink / raw)
  To: jonas.bonn, pabeni
  Cc: Michael Zhivich, davem, john.fastabend, linux-kernel, netdev

On 6/23/20 6:42 AM, Michael Zhivich wrote:
>> From: Jonas Bonn <jonas.bonn@netrounds.com>
>> To: Paolo Abeni <pabeni@redhat.com>,
>> 	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
>> 	LKML <linux-kernel@vger.kernel.org>,
>> 	"David S . Miller" <davem@davemloft.net>,
>> 	John Fastabend <john.fastabend@gmail.com>
>> Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
>> Date: Fri, 11 Oct 2019 02:39:48 +0200
>> Message-ID: <465a540e-5296-32e7-f6a6-79942dfe2618@netrounds.com> (raw)
>> In-Reply-To: <95c5a697932e19ebd6577b5dac4d7052fe8c4255.camel@redhat.com>
>>
>> Hi Paolo,
>>
>> On 09/10/2019 21:14, Paolo Abeni wrote:
>>> Something alike the following code - completely untested - can possibly
>>> address the issue, but it's a bit rough and I would prefer not adding
>>> additonal complexity to the lockless qdiscs, can you please have a spin
>>> a it?
>>
>> We've tested a couple of variants of this patch today, but unfortunately
>> it doesn't fix the problem of packets getting stuck in the queue.
>>
>> A couple of comments:
>>
>> i) On 5.4, there is the BYPASS path that also needs the same treatment
>> as it's essentially replicating the behavour of qdisc_run, just without
>> the queue/dequeue steps
>>
>> ii)  We are working a lot with the 4.19 kernel so I backported to the
>> patch to this version and tested there.  Here the solution would seem to
>> be more robust as the BYPASS path does not exist.
>>
>> Unfortunately, in both cases we continue to see the issue of the "last
>> packet" getting stuck in the queue.
>>
>> /Jonas
> 
> Hello Jonas, Paolo,
> 
> We have observed the same problem with pfifo_fast qdisc when sending periodic small
> packets on a TCP flow with multiple simultaneous connections on a 4.19.75
> kernel.  We've been able to catch it in action using perf probes (see trace
> below).  For qdisc = 0xffff900d7c247c00, skb = 0xffff900b72c334f0,
> it takes 200270us to traverse the networking stack on a system that's not otherwise busy.
> qdisc only resumes processing when another enqueued packet comes in,
> so the packet could have been stuck indefinitely.
> 
>     proc-19902 19902 [032] 580644.045480: probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) qdisc=0xffff900d7c247c00 skb=0xffff900bfc294af0 band=2 atomic_qlen=0
>     proc-19902 19902 [032] 580644.045480:     probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00 skb=0xffffffff9b69d8c0 band=2
>     proc-19927 19927 [014] 580644.045480:      probe:tcp_transmit_skb2: (ffffffff9b6dc4e5) skb=0xffff900b72c334f0 sk=0xffff900d62958040 source=0x4b4e dest=0x9abe
>     proc-19902 19902 [032] 580644.045480: probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) qdisc=0xffff900d7c247c00 skb=0x0 band=3 atomic_qlen=0
>     proc-19927 19927 [014] 580644.045481:      probe:ip_finish_output2: (ffffffff9b6bc650) net=0xffffffff9c107c80 sk=0xffff900d62958040 skb=0xffff900b72c334f0 __func__=0x0
>     proc-19902 19902 [032] 580644.045481:        probe:sch_direct_xmit: (ffffffff9b69e570) skb=0xffff900bfc294af0 q=0xffff900d7c247c00 dev=0xffff900d6a140000 txq=0xffff900d6a181180 root_lock=0x0 validate=1 ret=-1 again=155
>     proc-19927 19927 [014] 580644.045481:            net:net_dev_queue: dev=eth0 skbaddr=0xffff900b72c334f0 len=115
>     proc-19902 19902 [032] 580644.045482:     probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00 skb=0xffffffff9b69d8c0 band=1
>     proc-19927 19927 [014] 580644.045483:     probe:pfifo_fast_enqueue: (ffffffff9b69d9f0) skb=0xffff900b72c334f0 qdisc=0xffff900d7c247c00 to_free=18446622925407304000
>     proc-19902 19902 [032] 580644.045483: probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) qdisc=0xffff900d7c247c00 skb=0x0 band=3 atomic_qlen=0
>     proc-19927 19927 [014] 580644.045483: probe:pfifo_fast_enqueue_end: (ffffffff9b69da9f) skb=0xffff900b72c334f0 qdisc=0xffff900d7c247c00 to_free=0xffff91d0f67ab940 atomic_qlen=1
>     proc-19902 19902 [032] 580644.045484:          probe:__qdisc_run_2: (ffffffff9b69ea5a) q=0xffff900d7c247c00 packets=1
>     proc-19927 19927 [014] 580644.245745:     probe:pfifo_fast_enqueue: (ffffffff9b69d9f0) skb=0xffff900d98fdf6f0 qdisc=0xffff900d7c247c00 to_free=18446622925407304000
>     proc-19927 19927 [014] 580644.245745: probe:pfifo_fast_enqueue_end: (ffffffff9b69da9f) skb=0xffff900d98fdf6f0 qdisc=0xffff900d7c247c00 to_free=0xffff91d0f67ab940 atomic_qlen=2
>     proc-19927 19927 [014] 580644.245746:     probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00 skb=0xffffffff9b69d8c0 band=0
>     proc-19927 19927 [014] 580644.245746: probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) qdisc=0xffff900d7c247c00 skb=0xffff900b72c334f0 band=2 atomic_qlen=1
>     proc-19927 19927 [014] 580644.245747:     probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00 skb=0xffffffff9b69d8c0 band=2
>     proc-19927 19927 [014] 580644.245747: probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) qdisc=0xffff900d7c247c00 skb=0xffff900d98fdf6f0 band=2 atomic_qlen=0
>     proc-19927 19927 [014] 580644.245748:     probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00 skb=0xffffffff9b69d8c0 band=2
>     proc-19927 19927 [014] 580644.245748: probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) qdisc=0xffff900d7c247c00 skb=0x0 band=3 atomic_qlen=0
>     proc-19927 19927 [014] 580644.245749:          qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x0 parent=0xF txq_state=0x0 packets=2 skbaddr=0xffff900b72c334f0
>     proc-19927 19927 [014] 580644.245749:        probe:sch_direct_xmit: (ffffffff9b69e570) skb=0xffff900b72c334f0 q=0xffff900d7c247c00 dev=0xffff900d6a140000 txq=0xffff900d6a181180 root_lock=0x0 validate=1 ret=-1 again=155
>     proc-19927 19927 [014] 580644.245750:       net:net_dev_start_xmit: dev=eth0 queue_mapping=14 skbaddr=0xffff900b72c334f0 vlan_tagged=0 vlan_proto=0x0000 vlan_tci=0x0000 protocol=0x0800 ip_summed=3 len=115 data_len=0 network_offset=14 transport_offset_valid=1 transport_offset=34 tx_flags=0 gso_size=0 gso_segs=1 gso_type=0x1
> 
> I was wondering if you had any more luck in finding a solution or workaround for this problem
> (that is, aside from switching to a different qdisc)?
> 
> Thanks,
> ~ Michael
> 

Jonas/Paolo

Do either of you know if there's been any development on a fix for this 
issue? If not we can propose something.

Thanks
Josh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2020-06-30 19:14       ` Josh Hunt
@ 2020-07-01  7:53         ` Jonas Bonn
  2020-07-01 16:05         ` Cong Wang
  1 sibling, 0 replies; 47+ messages in thread
From: Jonas Bonn @ 2020-07-01  7:53 UTC (permalink / raw)
  To: Josh Hunt, pabeni
  Cc: Michael Zhivich, davem, john.fastabend, linux-kernel, netdev



On 30/06/2020 21:14, Josh Hunt wrote:
> On 6/23/20 6:42 AM, Michael Zhivich wrote:
>>> From: Jonas Bonn <jonas.bonn@netrounds.com>
>>> To: Paolo Abeni <pabeni@redhat.com>,
>>>     "netdev@vger.kernel.org" <netdev@vger.kernel.org>,
>>>     LKML <linux-kernel@vger.kernel.org>,
>>>     "David S . Miller" <davem@davemloft.net>,
>>>     John Fastabend <john.fastabend@gmail.com>
>>> Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
>>> Date: Fri, 11 Oct 2019 02:39:48 +0200
>>> Message-ID: <465a540e-5296-32e7-f6a6-79942dfe2618@netrounds.com> (raw)
>>> In-Reply-To: <95c5a697932e19ebd6577b5dac4d7052fe8c4255.camel@redhat.com>
>>>
>>> Hi Paolo,
>>>
>>> On 09/10/2019 21:14, Paolo Abeni wrote:
>>>> Something alike the following code - completely untested - can possibly
>>>> address the issue, but it's a bit rough and I would prefer not adding
>>>> additonal complexity to the lockless qdiscs, can you please have a spin
>>>> a it?
>>>
>>> We've tested a couple of variants of this patch today, but unfortunately
>>> it doesn't fix the problem of packets getting stuck in the queue.
>>>
>>> A couple of comments:
>>>
>>> i) On 5.4, there is the BYPASS path that also needs the same treatment
>>> as it's essentially replicating the behavour of qdisc_run, just without
>>> the queue/dequeue steps
>>>
>>> ii)  We are working a lot with the 4.19 kernel so I backported to the
>>> patch to this version and tested there.  Here the solution would seem to
>>> be more robust as the BYPASS path does not exist.
>>>
>>> Unfortunately, in both cases we continue to see the issue of the "last
>>> packet" getting stuck in the queue.
>>>
>>> /Jonas
>>
>> Hello Jonas, Paolo,
>>
>> We have observed the same problem with pfifo_fast qdisc when sending 
>> periodic small
>> packets on a TCP flow with multiple simultaneous connections on a 4.19.75
>> kernel.  We've been able to catch it in action using perf probes (see 
>> trace
>> below).  For qdisc = 0xffff900d7c247c00, skb = 0xffff900b72c334f0,
>> it takes 200270us to traverse the networking stack on a system that's 
>> not otherwise busy.
>> qdisc only resumes processing when another enqueued packet comes in,
>> so the packet could have been stuck indefinitely.
>>
>>     proc-19902 19902 [032] 580644.045480: 
>> probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) 
>> qdisc=0xffff900d7c247c00 skb=0xffff900bfc294af0 band=2 atomic_qlen=0
>>     proc-19902 19902 [032] 580644.045480:     
>> probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00 
>> skb=0xffffffff9b69d8c0 band=2
>>     proc-19927 19927 [014] 580644.045480:      
>> probe:tcp_transmit_skb2: (ffffffff9b6dc4e5) skb=0xffff900b72c334f0 
>> sk=0xffff900d62958040 source=0x4b4e dest=0x9abe
>>     proc-19902 19902 [032] 580644.045480: 
>> probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) 
>> qdisc=0xffff900d7c247c00 skb=0x0 band=3 atomic_qlen=0
>>     proc-19927 19927 [014] 580644.045481:      
>> probe:ip_finish_output2: (ffffffff9b6bc650) net=0xffffffff9c107c80 
>> sk=0xffff900d62958040 skb=0xffff900b72c334f0 __func__=0x0
>>     proc-19902 19902 [032] 580644.045481:        
>> probe:sch_direct_xmit: (ffffffff9b69e570) skb=0xffff900bfc294af0 
>> q=0xffff900d7c247c00 dev=0xffff900d6a140000 txq=0xffff900d6a181180 
>> root_lock=0x0 validate=1 ret=-1 again=155
>>     proc-19927 19927 [014] 580644.045481:            
>> net:net_dev_queue: dev=eth0 skbaddr=0xffff900b72c334f0 len=115
>>     proc-19902 19902 [032] 580644.045482:     
>> probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00 
>> skb=0xffffffff9b69d8c0 band=1
>>     proc-19927 19927 [014] 580644.045483:     
>> probe:pfifo_fast_enqueue: (ffffffff9b69d9f0) skb=0xffff900b72c334f0 
>> qdisc=0xffff900d7c247c00 to_free=18446622925407304000
>>     proc-19902 19902 [032] 580644.045483: 
>> probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) 
>> qdisc=0xffff900d7c247c00 skb=0x0 band=3 atomic_qlen=0
>>     proc-19927 19927 [014] 580644.045483: 
>> probe:pfifo_fast_enqueue_end: (ffffffff9b69da9f) 
>> skb=0xffff900b72c334f0 qdisc=0xffff900d7c247c00 
>> to_free=0xffff91d0f67ab940 atomic_qlen=1
>>     proc-19902 19902 [032] 580644.045484:          
>> probe:__qdisc_run_2: (ffffffff9b69ea5a) q=0xffff900d7c247c00 packets=1
>>     proc-19927 19927 [014] 580644.245745:     
>> probe:pfifo_fast_enqueue: (ffffffff9b69d9f0) skb=0xffff900d98fdf6f0 
>> qdisc=0xffff900d7c247c00 to_free=18446622925407304000
>>     proc-19927 19927 [014] 580644.245745: 
>> probe:pfifo_fast_enqueue_end: (ffffffff9b69da9f) 
>> skb=0xffff900d98fdf6f0 qdisc=0xffff900d7c247c00 
>> to_free=0xffff91d0f67ab940 atomic_qlen=2
>>     proc-19927 19927 [014] 580644.245746:     
>> probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00 
>> skb=0xffffffff9b69d8c0 band=0
>>     proc-19927 19927 [014] 580644.245746: 
>> probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) 
>> qdisc=0xffff900d7c247c00 skb=0xffff900b72c334f0 band=2 atomic_qlen=1
>>     proc-19927 19927 [014] 580644.245747:     
>> probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00 
>> skb=0xffffffff9b69d8c0 band=2
>>     proc-19927 19927 [014] 580644.245747: 
>> probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) 
>> qdisc=0xffff900d7c247c00 skb=0xffff900d98fdf6f0 band=2 atomic_qlen=0
>>     proc-19927 19927 [014] 580644.245748:     
>> probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00 
>> skb=0xffffffff9b69d8c0 band=2
>>     proc-19927 19927 [014] 580644.245748: 
>> probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) 
>> qdisc=0xffff900d7c247c00 skb=0x0 band=3 atomic_qlen=0
>>     proc-19927 19927 [014] 580644.245749:          
>> qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x0 parent=0xF 
>> txq_state=0x0 packets=2 skbaddr=0xffff900b72c334f0
>>     proc-19927 19927 [014] 580644.245749:        
>> probe:sch_direct_xmit: (ffffffff9b69e570) skb=0xffff900b72c334f0 
>> q=0xffff900d7c247c00 dev=0xffff900d6a140000 txq=0xffff900d6a181180 
>> root_lock=0x0 validate=1 ret=-1 again=155
>>     proc-19927 19927 [014] 580644.245750:       
>> net:net_dev_start_xmit: dev=eth0 queue_mapping=14 
>> skbaddr=0xffff900b72c334f0 vlan_tagged=0 vlan_proto=0x0000 
>> vlan_tci=0x0000 protocol=0x0800 ip_summed=3 len=115 data_len=0 
>> network_offset=14 transport_offset_valid=1 transport_offset=34 
>> tx_flags=0 gso_size=0 gso_segs=1 gso_type=0x1
>>
>> I was wondering if you had any more luck in finding a solution or 
>> workaround for this problem
>> (that is, aside from switching to a different qdisc)?
>>
>> Thanks,
>> ~ Michael
>>
> 
> Jonas/Paolo
> 
> Do either of you know if there's been any development on a fix for this 
> issue? If not we can propose something.

Hi Josh,

No, I haven't been able to do any more work on this and the affected 
user switched qdisc (to avoid this problem) so I lost the reliable 
reproducer that I had...

/Jonas

> 
> Thanks
> Josh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2020-06-30 19:14       ` Josh Hunt
  2020-07-01  7:53         ` Jonas Bonn
@ 2020-07-01 16:05         ` Cong Wang
  2020-07-01 19:58           ` Cong Wang
  1 sibling, 1 reply; 47+ messages in thread
From: Cong Wang @ 2020-07-01 16:05 UTC (permalink / raw)
  To: Josh Hunt
  Cc: jonas.bonn, Paolo Abeni, Michael Zhivich, David Miller,
	John Fastabend, LKML, Linux Kernel Network Developers

On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt <johunt@akamai.com> wrote:
> Do either of you know if there's been any development on a fix for this
> issue? If not we can propose something.

If you have a reproducer, I can look into this.

Thanks.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2020-07-01 16:05         ` Cong Wang
@ 2020-07-01 19:58           ` Cong Wang
  2020-07-01 22:02             ` Josh Hunt
  2020-07-02  6:14             ` Jonas Bonn
  0 siblings, 2 replies; 47+ messages in thread
From: Cong Wang @ 2020-07-01 19:58 UTC (permalink / raw)
  To: Josh Hunt
  Cc: jonas.bonn, Paolo Abeni, Michael Zhivich, David Miller,
	John Fastabend, LKML, Linux Kernel Network Developers

[-- Attachment #1: Type: text/plain, Size: 379 bytes --]

On Wed, Jul 1, 2020 at 9:05 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt <johunt@akamai.com> wrote:
> > Do either of you know if there's been any development on a fix for this
> > issue? If not we can propose something.
>
> If you have a reproducer, I can look into this.

Does the attached patch fix this bug completely?

Thanks.

[-- Attachment #2: qdisc_run.diff --]
[-- Type: text/x-patch, Size: 1202 bytes --]

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 9092e697059e..5a03cded3054 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -123,7 +123,7 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 
 void __qdisc_run(struct Qdisc *q);
 
-static inline void qdisc_run(struct Qdisc *q)
+static inline bool qdisc_run(struct Qdisc *q)
 {
 	if (qdisc_run_begin(q)) {
 		/* NOLOCK qdisc must check 'state' under the qdisc seqlock
@@ -133,7 +133,9 @@ static inline void qdisc_run(struct Qdisc *q)
 		    likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
 			__qdisc_run(q);
 		qdisc_run_end(q);
+		return true;
 	}
+	return false;
 }
 
 static inline __be16 tc_skb_protocol(const struct sk_buff *skb)
diff --git a/net/core/dev.c b/net/core/dev.c
index 90b59fc50dc9..c7e48356132a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3744,7 +3744,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 
 	if (q->flags & TCQ_F_NOLOCK) {
 		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
-		qdisc_run(q);
+		if (!qdisc_run(q) && rc == NET_XMIT_SUCCESS)
+			__netif_schedule(q);
 
 		if (unlikely(to_free))
 			kfree_skb_list(to_free);

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2020-07-01 19:58           ` Cong Wang
@ 2020-07-01 22:02             ` Josh Hunt
  2020-07-02  6:14             ` Jonas Bonn
  1 sibling, 0 replies; 47+ messages in thread
From: Josh Hunt @ 2020-07-01 22:02 UTC (permalink / raw)
  To: Cong Wang
  Cc: jonas.bonn, Paolo Abeni, Michael Zhivich, David Miller,
	John Fastabend, LKML, Linux Kernel Network Developers


On 7/1/20 12:58 PM, Cong Wang wrote:
> On Wed, Jul 1, 2020 at 9:05 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>>
>> On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt <johunt@akamai.com> wrote:
>>> Do either of you know if there's been any development on a fix for this
>>> issue? If not we can propose something.
>>
>> If you have a reproducer, I can look into this.
> 
> Does the attached patch fix this bug completely?
> 
> Thanks.
> 

Hey Cong

Unfortunately we don't have a reproducer that would be easy to share, 
however your patch makes sense to me at a high-level. We will test and 
let you know if this fixes our problem.

Thanks!
Josh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2020-07-01 19:58           ` Cong Wang
  2020-07-01 22:02             ` Josh Hunt
@ 2020-07-02  6:14             ` Jonas Bonn
  2020-07-02  9:45               ` Paolo Abeni
  1 sibling, 1 reply; 47+ messages in thread
From: Jonas Bonn @ 2020-07-02  6:14 UTC (permalink / raw)
  To: Cong Wang, Josh Hunt
  Cc: Paolo Abeni, Michael Zhivich, David Miller, John Fastabend, LKML,
	Linux Kernel Network Developers

Hi Cong,

On 01/07/2020 21:58, Cong Wang wrote:
> On Wed, Jul 1, 2020 at 9:05 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>>
>> On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt <johunt@akamai.com> wrote:
>>> Do either of you know if there's been any development on a fix for this
>>> issue? If not we can propose something.
>>
>> If you have a reproducer, I can look into this.
> 
> Does the attached patch fix this bug completely?

It's easier to comment if you inline the patch, but after taking a quick 
look it seems too simplistic.

i)  Are you sure you haven't got the return values on qdisc_run reversed?
ii) There's a "bypass" path that skips the enqueue/dequeue operation if 
the queue is empty; that needs a similar treatment:  after releasing 
seqlock it needs to ensure that another packet hasn't been enqueued 
since it last checked.

/Jonas

> 
> Thanks.
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2020-07-02  6:14             ` Jonas Bonn
@ 2020-07-02  9:45               ` Paolo Abeni
  2020-07-02 18:08                 ` Josh Hunt
  0 siblings, 1 reply; 47+ messages in thread
From: Paolo Abeni @ 2020-07-02  9:45 UTC (permalink / raw)
  To: Jonas Bonn, Cong Wang, Josh Hunt
  Cc: Michael Zhivich, David Miller, John Fastabend, LKML,
	Linux Kernel Network Developers

Hi all,

On Thu, 2020-07-02 at 08:14 +0200, Jonas Bonn wrote:
> Hi Cong,
> 
> On 01/07/2020 21:58, Cong Wang wrote:
> > On Wed, Jul 1, 2020 at 9:05 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt <johunt@akamai.com> wrote:
> > > > Do either of you know if there's been any development on a fix for this
> > > > issue? If not we can propose something.
> > > 
> > > If you have a reproducer, I can look into this.
> > 
> > Does the attached patch fix this bug completely?
> 
> It's easier to comment if you inline the patch, but after taking a quick 
> look it seems too simplistic.
> 
> i)  Are you sure you haven't got the return values on qdisc_run reversed?

qdisc_run() returns true if it was able to acquire the seq lock. We
need to take special action in the opposite case, so Cong's patch LGTM
from a functional PoV.

> ii) There's a "bypass" path that skips the enqueue/dequeue operation if 
> the queue is empty; that needs a similar treatment:  after releasing 
> seqlock it needs to ensure that another packet hasn't been enqueued 
> since it last checked.

That has been reverted with
commit 379349e9bc3b42b8b2f8f7a03f64a97623fff323

---
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 90b59fc50dc9..c7e48356132a 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3744,7 +3744,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> 
>  	if (q->flags & TCQ_F_NOLOCK) {
>  		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> -		qdisc_run(q);
> +		if (!qdisc_run(q) && rc == NET_XMIT_SUCCESS)
> +			__netif_schedule(q);

I fear the __netif_schedule() call may cause performance regression to
the point of making a revert of TCQ_F_NOLOCK preferable. I'll try to
collect some data.

Thanks!

Paolo


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2020-07-02  9:45               ` Paolo Abeni
@ 2020-07-02 18:08                 ` Josh Hunt
  2020-07-07 14:18                   ` Paolo Abeni
                                     ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Josh Hunt @ 2020-07-02 18:08 UTC (permalink / raw)
  To: Paolo Abeni, Jonas Bonn, Cong Wang
  Cc: Michael Zhivich, David Miller, John Fastabend, LKML,
	Linux Kernel Network Developers

On 7/2/20 2:45 AM, Paolo Abeni wrote:
> Hi all,
> 
> On Thu, 2020-07-02 at 08:14 +0200, Jonas Bonn wrote:
>> Hi Cong,
>>
>> On 01/07/2020 21:58, Cong Wang wrote:
>>> On Wed, Jul 1, 2020 at 9:05 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>>>> On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt <johunt@akamai.com> wrote:
>>>>> Do either of you know if there's been any development on a fix for this
>>>>> issue? If not we can propose something.
>>>>
>>>> If you have a reproducer, I can look into this.
>>>
>>> Does the attached patch fix this bug completely?
>>
>> It's easier to comment if you inline the patch, but after taking a quick
>> look it seems too simplistic.
>>
>> i)  Are you sure you haven't got the return values on qdisc_run reversed?
> 
> qdisc_run() returns true if it was able to acquire the seq lock. We
> need to take special action in the opposite case, so Cong's patch LGTM
> from a functional PoV.
> 
>> ii) There's a "bypass" path that skips the enqueue/dequeue operation if
>> the queue is empty; that needs a similar treatment:  after releasing
>> seqlock it needs to ensure that another packet hasn't been enqueued
>> since it last checked.
> 
> That has been reverted with
> commit 379349e9bc3b42b8b2f8f7a03f64a97623fff323
> 
> ---
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 90b59fc50dc9..c7e48356132a 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -3744,7 +3744,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>>
>>   	if (q->flags & TCQ_F_NOLOCK) {
>>   		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
>> -		qdisc_run(q);
>> +		if (!qdisc_run(q) && rc == NET_XMIT_SUCCESS)
>> +			__netif_schedule(q);
> 
> I fear the __netif_schedule() call may cause performance regression to
> the point of making a revert of TCQ_F_NOLOCK preferable. I'll try to
> collect some data.

Initial results with Cong's patch look promising, so far no stalls. We 
will let it run over the long weekend and report back on Tuesday.

Paolo - I have concerns about possible performance regression with the 
change as well. If you can gather some data that would be great. If 
things look good with our low throughput test over the weekend we can 
also try assessing performance next week.

Josh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2020-07-02 18:08                 ` Josh Hunt
@ 2020-07-07 14:18                   ` Paolo Abeni
  2020-07-08 20:16                     ` Cong Wang
  2020-07-08 20:33                   ` Zhivich, Michael
  2020-08-20  7:43                   ` Jike Song
  2 siblings, 1 reply; 47+ messages in thread
From: Paolo Abeni @ 2020-07-07 14:18 UTC (permalink / raw)
  To: Josh Hunt, Jonas Bonn, Cong Wang
  Cc: Michael Zhivich, David Miller, John Fastabend, LKML,
	Linux Kernel Network Developers

On Thu, 2020-07-02 at 11:08 -0700, Josh Hunt wrote:
> On 7/2/20 2:45 AM, Paolo Abeni wrote:
> > Hi all,
> > 
> > On Thu, 2020-07-02 at 08:14 +0200, Jonas Bonn wrote:
> > > Hi Cong,
> > > 
> > > On 01/07/2020 21:58, Cong Wang wrote:
> > > > On Wed, Jul 1, 2020 at 9:05 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > > > On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt <johunt@akamai.com> wrote:
> > > > > > Do either of you know if there's been any development on a fix for this
> > > > > > issue? If not we can propose something.
> > > > > 
> > > > > If you have a reproducer, I can look into this.
> > > > 
> > > > Does the attached patch fix this bug completely?
> > > 
> > > It's easier to comment if you inline the patch, but after taking a quick
> > > look it seems too simplistic.
> > > 
> > > i)  Are you sure you haven't got the return values on qdisc_run reversed?
> > 
> > qdisc_run() returns true if it was able to acquire the seq lock. We
> > need to take special action in the opposite case, so Cong's patch LGTM
> > from a functional PoV.
> > 
> > > ii) There's a "bypass" path that skips the enqueue/dequeue operation if
> > > the queue is empty; that needs a similar treatment:  after releasing
> > > seqlock it needs to ensure that another packet hasn't been enqueued
> > > since it last checked.
> > 
> > That has been reverted with
> > commit 379349e9bc3b42b8b2f8f7a03f64a97623fff323
> > 
> > ---
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index 90b59fc50dc9..c7e48356132a 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -3744,7 +3744,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> > > 
> > >   	if (q->flags & TCQ_F_NOLOCK) {
> > >   		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> > > -		qdisc_run(q);
> > > +		if (!qdisc_run(q) && rc == NET_XMIT_SUCCESS)
> > > +			__netif_schedule(q);
> > 
> > I fear the __netif_schedule() call may cause performance regression to
> > the point of making a revert of TCQ_F_NOLOCK preferable. I'll try to
> > collect some data.
> 
> Initial results with Cong's patch look promising, so far no stalls. We 
> will let it run over the long weekend and report back on Tuesday.
> 
> Paolo - I have concerns about possible performance regression with the 
> change as well. If you can gather some data that would be great. 

I finally had the time to run some performance tests vs the above with
mixed results.

Using several netperf threadsover a single pfifo_fast queue with small
UDP packets, perf differences vs vanilla are just above noise range (1-
1,5%)

Using pktgen in 'queue_xmit' mode on a dummy device (this should
maximise the pkt-rate and thus the contention) I see:

pktgen threads	vanilla		patched		delta
nr		kpps		kpps		%

1		3240		3240		0
2		3910		2710		-30.5
4		5140		4920		-4

A relevant source of the measured overhead is due to the contention on
q->state in __netif_schedule, so the following helps a bit:

---
diff --git a/net/core/dev.c b/net/core/dev.c
index b8e8286a0a34..3cad6e086fac 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3750,7 +3750,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 
        if (q->flags & TCQ_F_NOLOCK) {
                rc = q->enqueue(skb, q, NULL, &to_free) & NET_XMIT_MASK;
-               if (!qdisc_run(q) && rc == NET_XMIT_SUCCESS)
+               if (!qdisc_run(q) && rc == NET_XMIT_SUCCESS &&
+                   !test_bit(__QDISC_STATE_SCHED, &q->state))
                        __netif_schedule(q);
 
                if (unlikely(to_free))
---

With the above incremental patch applied I see:
pktgen threads	vanilla		patched[II]	delta
nr		kpps		kpps		%
1		3240		3240		0
2		3910		2830		-27%
4		5140		5140		0

So the regression with 2 pktgen threads is still relevant. 'perf' shows
relevant time spent into net_tx_action() and __netif_schedule().

Cheers,

Paolo.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2020-07-07 14:18                   ` Paolo Abeni
@ 2020-07-08 20:16                     ` Cong Wang
  2020-07-09  9:20                       ` Paolo Abeni
  0 siblings, 1 reply; 47+ messages in thread
From: Cong Wang @ 2020-07-08 20:16 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Josh Hunt, Jonas Bonn, Michael Zhivich, David Miller,
	John Fastabend, LKML, Linux Kernel Network Developers

On Tue, Jul 7, 2020 at 7:18 AM Paolo Abeni <pabeni@redhat.com> wrote:
> So the regression with 2 pktgen threads is still relevant. 'perf' shows
> relevant time spent into net_tx_action() and __netif_schedule().

So, touching the __QDISC_STATE_SCHED bit in __dev_xmit_skb() is
not a good idea.

Let me see if there is any other way to fix this.

Thanks.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2020-07-02 18:08                 ` Josh Hunt
  2020-07-07 14:18                   ` Paolo Abeni
@ 2020-07-08 20:33                   ` Zhivich, Michael
  2020-08-20  7:43                   ` Jike Song
  2 siblings, 0 replies; 47+ messages in thread
From: Zhivich, Michael @ 2020-07-08 20:33 UTC (permalink / raw)
  To: Hunt, Joshua, Paolo Abeni, Jonas Bonn, Cong Wang
  Cc: David Miller, John Fastabend, LKML, Linux Kernel Network Developers

[-- Attachment #1: Type: text/plain, Size: 3042 bytes --]

On 7/2/20, 2:08 PM, "Josh Hunt" <johunt@akamai.com> wrote:
>
> On 7/2/20 2:45 AM, Paolo Abeni wrote:
> > Hi all,
> > 
> > On Thu, 2020-07-02 at 08:14 +0200, Jonas Bonn wrote:
> >> Hi Cong,
> >>
> >> On 01/07/2020 21:58, Cong Wang wrote:
> >>> On Wed, Jul 1, 2020 at 9:05 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >>>> On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt <johunt@akamai.com> wrote:
> >>>>> Do either of you know if there's been any development on a fix for this
> >>>>> issue? If not we can propose something.
> >>>>
> >>>> If you have a reproducer, I can look into this.
> >>>
> >>> Does the attached patch fix this bug completely?
> >>
> >> It's easier to comment if you inline the patch, but after taking a quick
> >> look it seems too simplistic.
> >>
> >> i)  Are you sure you haven't got the return values on qdisc_run reversed?
> > 
> > qdisc_run() returns true if it was able to acquire the seq lock. We
> > need to take special action in the opposite case, so Cong's patch LGTM
> > from a functional PoV.
> > 
> >> ii) There's a "bypass" path that skips the enqueue/dequeue operation if
> >> the queue is empty; that needs a similar treatment:  after releasing
> >> seqlock it needs to ensure that another packet hasn't been enqueued
> >> since it last checked.
> > 
> > That has been reverted with
> > commit 379349e9bc3b42b8b2f8f7a03f64a97623fff323
> > 
> > ---
> >> diff --git a/net/core/dev.c b/net/core/dev.c
> >> index 90b59fc50dc9..c7e48356132a 100644
> >> --- a/net/core/dev.c
> >> +++ b/net/core/dev.c
> >> @@ -3744,7 +3744,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> >>
> >>   	if (q->flags & TCQ_F_NOLOCK) {
> >>   		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> >> -		qdisc_run(q);
> >> +		if (!qdisc_run(q) && rc == NET_XMIT_SUCCESS)
> >> +			__netif_schedule(q);
> > 
> > I fear the __netif_schedule() call may cause performance regression to
> > the point of making a revert of TCQ_F_NOLOCK preferable. I'll try to
> > collect some data.
>
> Initial results with Cong's patch look promising, so far no stalls. We 
> will let it run over the long weekend and report back on Tuesday.
>
> Paolo - I have concerns about possible performance regression with the 
> change as well. If you can gather some data that would be great. If 
> things look good with our low throughput test over the weekend we can 
> also try assessing performance next week.
>
> Josh

After running our reproducer over the long weekend, we've observed several more packets getting stuck.
The behavior is order of magnitude better *with* the patch (that is, only a few packets get stuck),
but the patch does not completely resolve the issue.

I have a nagging suspicion that the same race that we observed between consumer/producer threads can occur with
softirq processing in net_tx_action() as well (as triggered by __netif_schedule()), since both rely on the same semantics of qdisc_run().  
Unfortunately, in such a case, we cannot just punt to __netif_schedule() again.

Regards,
~ Michael

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 3491 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2020-07-08 20:16                     ` Cong Wang
@ 2020-07-09  9:20                       ` Paolo Abeni
  0 siblings, 0 replies; 47+ messages in thread
From: Paolo Abeni @ 2020-07-09  9:20 UTC (permalink / raw)
  To: Cong Wang
  Cc: Josh Hunt, Jonas Bonn, Michael Zhivich, David Miller,
	John Fastabend, LKML, Linux Kernel Network Developers

On Wed, 2020-07-08 at 13:16 -0700, Cong Wang wrote:
> On Tue, Jul 7, 2020 at 7:18 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > So the regression with 2 pktgen threads is still relevant. 'perf' shows
> > relevant time spent into net_tx_action() and __netif_schedule().
> 
> So, touching the __QDISC_STATE_SCHED bit in __dev_xmit_skb() is
> not a good idea.
> 
> Let me see if there is any other way to fix this.

Thank you very much for the effort! I'm personally out of ideas for a
real fix that would avoid regressions. 

To be more exaustive this are the sources of overhead, as far as I can
observe them with perf:

- contention on q->state, in __netif_schedule()
- execution of net_tx_action() when there are no packet to be served

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2020-07-02 18:08                 ` Josh Hunt
  2020-07-07 14:18                   ` Paolo Abeni
  2020-07-08 20:33                   ` Zhivich, Michael
@ 2020-08-20  7:43                   ` Jike Song
  2020-08-20 18:13                     ` Josh Hunt
       [not found]                     ` <20200822032800.16296-1-hdanton@sina.com>
  2 siblings, 2 replies; 47+ messages in thread
From: Jike Song @ 2020-08-20  7:43 UTC (permalink / raw)
  To: Josh Hunt
  Cc: Paolo Abeni, Jonas Bonn, Cong Wang, Michael Zhivich,
	David Miller, John Fastabend, LKML,
	Linux Kernel Network Developers, kehuan.feng

Hi Josh,

On Fri, Jul 3, 2020 at 2:14 AM Josh Hunt <johunt@akamai.com> wrote:
{snip}
> Initial results with Cong's patch look promising, so far no stalls. We
> will let it run over the long weekend and report back on Tuesday.
>
> Paolo - I have concerns about possible performance regression with the
> change as well. If you can gather some data that would be great. If
> things look good with our low throughput test over the weekend we can
> also try assessing performance next week.
>

We met possibly the same problem when testing nvidia/mellanox's
GPUDirect RDMA product, we found that changing NET_SCH_DEFAULT to
DEFAULT_FQ_CODEL mitigated the problem, having no idea why. Maybe you
can also have a try?

Besides, our testing is pretty complex, do you have a quick test to
reproduce it?

-- 
Thanks,
Jike

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2020-08-20  7:43                   ` Jike Song
@ 2020-08-20 18:13                     ` Josh Hunt
       [not found]                     ` <20200822032800.16296-1-hdanton@sina.com>
  1 sibling, 0 replies; 47+ messages in thread
From: Josh Hunt @ 2020-08-20 18:13 UTC (permalink / raw)
  To: Jike Song
  Cc: Paolo Abeni, Jonas Bonn, Cong Wang, Michael Zhivich,
	David Miller, John Fastabend, LKML,
	Linux Kernel Network Developers, kehuan.feng

Hi Jike

On 8/20/20 12:43 AM, Jike Song wrote:
> Hi Josh,
> 
> 
> We met possibly the same problem when testing nvidia/mellanox's
> GPUDirect RDMA product, we found that changing NET_SCH_DEFAULT to
> DEFAULT_FQ_CODEL mitigated the problem, having no idea why. Maybe you
> can also have a try?

We also did something similar where we've switched over to using the fq 
scheduler everywhere for now. We believe the bug is in the nolock code 
which only pfifo_fast uses atm, but we've been unable to come up with a 
satisfactory solution.

> 
> Besides, our testing is pretty complex, do you have a quick test to
> reproduce it?
> 

Unfortunately we don't have a simple test case either. Our current 
reproducer is complex as well, although it would seem like we should be 
able to come up with something where you have maybe 2 threads trying to 
send on the same tx queue running pfifo_fast every few hundred 
milliseconds and not much else/no other tx traffic on that queue. IIRC 
we believe the scenario is when one thread is in the process of 
dequeuing a packet while another is enqueuing, the enqueue-er (word? :)) 
sees the dequeue is in progress and so does not xmit the packet assuming 
the dequeue operation will take care of it. However b/c the dequeue is 
in the process of completing it doesn't and the newly enqueued packet 
stays in the qdisc until another packet is enqueued pushing both out.

Given that we have a workaround with using fq or any other qdisc not 
named pfifo_fast this has gotten bumped down in priority for us. I would 
like to work on a reproducer at some point, but won't likely be for a 
few weeks :(

Josh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
       [not found]                     ` <20200822032800.16296-1-hdanton@sina.com>
@ 2020-08-25  2:18                       ` Fengkehuan Feng
       [not found]                         ` <20200825032312.11776-1-hdanton@sina.com>
  0 siblings, 1 reply; 47+ messages in thread
From: Fengkehuan Feng @ 2020-08-25  2:18 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Jike Song, Josh Hunt, Paolo Abeni, Jonas Bonn, Cong Wang,
	Michael Zhivich, David Miller, John Fastabend, LKML, Netdev

[-- Attachment #1: Type: text/plain, Size: 5412 bytes --]

Hillf,

With the latest version (attached what I have changed on my tree), the
system failed to start up with cpu stalled.


Hillf Danton <hdanton@sina.com> 于2020年8月22日周六 上午11:30写道:
>
>
> On Thu, 20 Aug 2020 20:43:17 +0800 Hillf Danton wrote:
> > Hi Jike,
> >
> > On Thu, 20 Aug 2020 15:43:17 +0800 Jike Song wrote:
> > > Hi Josh,
> > >
> > > On Fri, Jul 3, 2020 at 2:14 AM Josh Hunt <johunt@akamai.com> wrote:
> > > {snip}
> > > > Initial results with Cong's patch look promising, so far no stalls. We
> > > > will let it run over the long weekend and report back on Tuesday.
> > > >
> > > > Paolo - I have concerns about possible performance regression with the
> > > > change as well. If you can gather some data that would be great. If
> > > > things look good with our low throughput test over the weekend we can
> > > > also try assessing performance next week.
> > > >
> > >
> > > We met possibly the same problem when testing nvidia/mellanox's
> >
> > Below is what was sent in reply to this thread early last month with
> > minor tuning, based on the seqlock. Feel free to drop an echo if it
> > makes ant-antenna-size sense in your tests.
> >
> > > GPUDirect RDMA product, we found that changing NET_SCH_DEFAULT to
> > > DEFAULT_FQ_CODEL mitigated the problem, having no idea why. Maybe you
> > > can also have a try?
> > >
> > > Besides, our testing is pretty complex, do you have a quick test to
> > > reproduce it?
> > >
> > > --
> > > Thanks,
> > > Jike
> >
> >
> > --- a/include/net/sch_generic.h
> > +++ b/include/net/sch_generic.h
> > @@ -79,6 +79,7 @@ struct Qdisc {
> >  #define TCQ_F_INVISIBLE              0x80 /* invisible by default in dump */
> >  #define TCQ_F_NOLOCK         0x100 /* qdisc does not require locking */
> >  #define TCQ_F_OFFLOADED              0x200 /* qdisc is offloaded to HW */
> > +     int                     pkt_seq;
> >       u32                     limit;
> >       const struct Qdisc_ops  *ops;
> >       struct qdisc_size_table __rcu *stab;
> > @@ -156,6 +157,7 @@ static inline bool qdisc_is_empty(const
> >  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
> >  {
> >       if (qdisc->flags & TCQ_F_NOLOCK) {
> > +             qdisc->pkt_seq++;
> >               if (!spin_trylock(&qdisc->seqlock))
> >                       return false;
> >               WRITE_ONCE(qdisc->empty, false);
> > --- a/include/net/pkt_sched.h
> > +++ b/include/net/pkt_sched.h
> > @@ -117,7 +117,9 @@ void __qdisc_run(struct Qdisc *q);
> >
> >  static inline void qdisc_run(struct Qdisc *q)
> >  {
> > -     if (qdisc_run_begin(q)) {
> > +     while (qdisc_run_begin(q)) {
> > +             int seq = q->pkt_seq;
> > +
> >               /* NOLOCK qdisc must check 'state' under the qdisc seqlock
> >                * to avoid racing with dev_qdisc_reset()
> >                */
> > @@ -125,6 +127,9 @@ static inline void qdisc_run(struct Qdis
> >                   likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
> >                       __qdisc_run(q);
> >               qdisc_run_end(q);
> > +
> > +             if (!(q->flags & TCQ_F_NOLOCK) || seq == q->pkt_seq)
> > +                     return;
> >       }
> >  }
>
> The echo from Feng indicates that it's hard to conclude that TCQ_F_NOLOCK
> is the culprit, lets try again with it ignored for now.
>
> Every pkt enqueued on pfifo_fast is tracked in the below diff, and those
> pkts enqueued while we're running qdisc are detected and handled to cut
> the chance for the stuck pkts reported.
>
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -79,6 +79,7 @@ struct Qdisc {
>  #define TCQ_F_INVISIBLE                0x80 /* invisible by default in dump */
>  #define TCQ_F_NOLOCK           0x100 /* qdisc does not require locking */
>  #define TCQ_F_OFFLOADED                0x200 /* qdisc is offloaded to HW */
> +       int                     pkt_seq;
>         u32                     limit;
>         const struct Qdisc_ops  *ops;
>         struct qdisc_size_table __rcu *stab;
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -631,6 +631,7 @@ static int pfifo_fast_enqueue(struct sk_
>                         return qdisc_drop(skb, qdisc, to_free);
>         }
>
> +       qdisc->pkt_seq++;
>         qdisc_update_stats_at_enqueue(qdisc, pkt_len);
>         return NET_XMIT_SUCCESS;
>  }
> --- a/include/net/pkt_sched.h
> +++ b/include/net/pkt_sched.h
> @@ -117,7 +117,8 @@ void __qdisc_run(struct Qdisc *q);
>
>  static inline void qdisc_run(struct Qdisc *q)
>  {
> -       if (qdisc_run_begin(q)) {
> +       while (qdisc_run_begin(q)) {
> +               int seq = q->pkt_seq;
>                 /* NOLOCK qdisc must check 'state' under the qdisc seqlock
>                  * to avoid racing with dev_qdisc_reset()
>                  */
> @@ -125,6 +126,12 @@ static inline void qdisc_run(struct Qdis
>                     likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
>                         __qdisc_run(q);
>                 qdisc_run_end(q);
> +
> +               /* go another round if there are pkts enqueued after
> +                * taking seq_lock
> +                */
> +               if (seq != q->pkt_seq)
> +                       continue;
>         }
>  }
>
>

[-- Attachment #2: fix_nolock_from_hillf.patch --]
[-- Type: application/octet-stream, Size: 1260 bytes --]

--- include/net/sch_generic.h.orig	2020-08-21 15:13:51.787952710 +0800
+++ include/net/sch_generic.h	2020-08-24 22:01:46.718709912 +0800
@@ -79,6 +79,7 @@
 #define TCQ_F_INVISIBLE		0x80 /* invisible by default in dump */
 #define TCQ_F_NOLOCK		0x100 /* qdisc does not require locking */
 #define TCQ_F_OFFLOADED		0x200 /* qdisc is offloaded to HW */
+	int                     pkt_seq;
 	u32			limit;
 	const struct Qdisc_ops	*ops;
 	struct qdisc_size_table	__rcu *stab;
--- net/sched/sch_generic.c.orig	2020-08-24 22:02:04.589830751 +0800
+++ net/sched/sch_generic.c	2020-08-24 22:03:48.010728381 +0800
@@ -638,6 +638,8 @@
 	 * so we better not use qdisc_qstats_cpu_backlog_inc()
 	 */
 	this_cpu_add(qdisc->cpu_qstats->backlog, pkt_len);
+
+	qdisc->pkt_seq++;
 	return NET_XMIT_SUCCESS;
 }
 
--- include/net/pkt_sched.h.orig	2020-08-21 15:13:51.787952710 +0800
+++ include/net/pkt_sched.h	2020-08-24 22:06:58.856005213 +0800
@@ -116,9 +116,16 @@
 
 static inline void qdisc_run(struct Qdisc *q)
 {
-	if (qdisc_run_begin(q)) {
+	while (qdisc_run_begin(q)) {
+		int seq = q->pkt_seq;
 		__qdisc_run(q);
 		qdisc_run_end(q);
+
+		/* go another round if there are pkts enqueued after
+ 		* taking seq_lock
+ 		*/
+ 		if (seq != q->pkt_seq)
+			continue;
 	}
 }
 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
       [not found]                         ` <20200825032312.11776-1-hdanton@sina.com>
@ 2020-08-25  7:14                           ` Fengkehuan Feng
       [not found]                             ` <20200825162329.11292-1-hdanton@sina.com>
  0 siblings, 1 reply; 47+ messages in thread
From: Fengkehuan Feng @ 2020-08-25  7:14 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Jike Song, Josh Hunt, Paolo Abeni, Jonas Bonn, Cong Wang,
	Michael Zhivich, David Miller, John Fastabend, LKML, Netdev

Hi Hillf,

I just tried the updated version and the system can boot up now.
It does mitigate the issue a lot but still couldn't get rid of it
thoroughly. It seems to me like the effect of Cong's patch.


Hillf Danton <hdanton@sina.com> 于2020年8月25日周二 上午11:23写道:
>
>
> Hi Feng,
>
> On Tue, 25 Aug 2020 10:18:05 +0800 Fengkehuan Feng wrote:
> > Hillf,
> >
> > With the latest version (attached what I have changed on my tree), the
> > system failed to start up with cpu stalled.
>
> My fault.
>
> There is a missing break while running qdisc and it's fixed
> in the diff below for Linux-5.x.
>
> If it is Linux-4.x in your testing, running qdisc looks a bit
> different based on your diff(better if it's in the message body):
>
>  static inline void qdisc_run(struct Qdisc *q)
>  {
> -       if (qdisc_run_begin(q)) {
> +       while (qdisc_run_begin(q)) {
> +               int seq = q->pkt_seq;
>                 __qdisc_run(q);
>                 qdisc_run_end(q);
> +
> +               /* go another round if there are pkts enqueued after
> +                * taking seq_lock
> +                */
> +               if (seq != q->pkt_seq)
> +                       continue;
> +               else
> +                       return;
>         }
>  }
>
>
> Hillf
>
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -79,6 +79,7 @@ struct Qdisc {
>  #define TCQ_F_INVISIBLE                0x80 /* invisible by default in dump */
>  #define TCQ_F_NOLOCK           0x100 /* qdisc does not require locking */
>  #define TCQ_F_OFFLOADED                0x200 /* qdisc is offloaded to HW */
> +       int                     pkt_seq;
>         u32                     limit;
>         const struct Qdisc_ops  *ops;
>         struct qdisc_size_table __rcu *stab;
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -631,6 +631,7 @@ static int pfifo_fast_enqueue(struct sk_
>                         return qdisc_drop(skb, qdisc, to_free);
>         }
>
> +       qdisc->pkt_seq++;
>         qdisc_update_stats_at_enqueue(qdisc, pkt_len);
>         return NET_XMIT_SUCCESS;
>  }
> --- a/include/net/pkt_sched.h
> +++ b/include/net/pkt_sched.h
> @@ -117,14 +117,27 @@ void __qdisc_run(struct Qdisc *q);
>
>  static inline void qdisc_run(struct Qdisc *q)
>  {
> -       if (qdisc_run_begin(q)) {
> +       while (qdisc_run_begin(q)) {
> +               int seq = q->pkt_seq;
> +               bool check_seq = false;
> +
>                 /* NOLOCK qdisc must check 'state' under the qdisc seqlock
>                  * to avoid racing with dev_qdisc_reset()
>                  */
>                 if (!(q->flags & TCQ_F_NOLOCK) ||
> -                   likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
> +                   likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
>                         __qdisc_run(q);
> +                       check_seq = true;
> +               }
>                 qdisc_run_end(q);
> +
> +               /* go another round if there are pkts enqueued after
> +                * taking seq_lock
> +                */
> +               if (check_seq && seq != q->pkt_seq)
> +                       continue;
> +               else
> +                       return;
>         }
>  }
>
> --
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
       [not found]                             ` <20200825162329.11292-1-hdanton@sina.com>
@ 2020-08-26  2:38                               ` Kehuan Feng
       [not found]                                 ` <CACS=qqKptAQQGiMoCs1Zgs9S4ZppHhasy1AK4df2NxnCDR+vCw@mail.gmail.com>
  0 siblings, 1 reply; 47+ messages in thread
From: Kehuan Feng @ 2020-08-26  2:38 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Jike Song, Josh Hunt, Paolo Abeni, Jonas Bonn, Cong Wang,
	Michael Zhivich, David Miller, John Fastabend, LKML, Netdev

Hi Hillf,

Thanks for the patch.
I just tried it and it looks better than previous one. The issue
appeared only once over ~30 mins stressing (without the patch , it
shows up within 1 mins in usual, so I feel like we are getting close
to the final fix)
(pasted the modifications on my tree in case of any missing)

--- ./include/net/sch_generic.h.orig 2020-08-21 15:13:51.787952710 +0800
+++ ./include/net/sch_generic.h 2020-08-26 09:41:04.647173869 +0800
@@ -79,6 +79,7 @@
 #define TCQ_F_INVISIBLE 0x80 /* invisible by default in dump */
 #define TCQ_F_NOLOCK 0x100 /* qdisc does not require locking */
 #define TCQ_F_OFFLOADED 0x200 /* qdisc is offloaded to HW */
+ int                     pkt_seq;
  u32 limit;
  const struct Qdisc_ops *ops;
  struct qdisc_size_table __rcu *stab;
--- ./include/net/pkt_sched.h.orig 2020-08-21 15:13:51.787952710 +0800
+++ ./include/net/pkt_sched.h 2020-08-26 09:42:14.491377514 +0800
@@ -117,8 +117,15 @@
 static inline void qdisc_run(struct Qdisc *q)
 {
  if (qdisc_run_begin(q)) {
+ q->pkt_seq = 0;
+
  __qdisc_run(q);
  qdisc_run_end(q);
+
+ /* reschedule qdisc if there are packets enqueued */
+ if (q->pkt_seq != 0)
+ __netif_schedule(q);
+
  }
 }

--- ./net/core/dev.c.orig 2020-03-19 16:31:27.000000000 +0800
+++ ./net/core/dev.c 2020-08-26 09:47:57.783165885 +0800
@@ -2721,6 +2721,7 @@

  local_irq_save(flags);
  sd = this_cpu_ptr(&softnet_data);
+ q->pkt_seq = 0;
  q->next_sched = NULL;
  *sd->output_queue_tailp = q;
  sd->output_queue_tailp = &q->next_sched;
--- ./net/sched/sch_generic.c.orig 2020-08-24 22:02:04.589830751 +0800
+++ ./net/sched/sch_generic.c 2020-08-26 09:43:40.987852551 +0800
@@ -403,6 +403,9 @@
  */
  quota -= packets;
  if (quota <= 0 || need_resched()) {
+ /* info caller to reschedule qdisc outside q->seqlock */
+ q->pkt_seq = 1;
+
  __netif_schedule(q);
  break;
  }


Hillf Danton <hdanton@sina.com> 于2020年8月26日周三 上午12:26写道:
>
>
> Hi Feng,
>
> On Tue, 25 Aug 2020 15:14:12 +0800 Fengkehuan Feng wrote:
> > Hi Hillf,
> >
> > I just tried the updated version and the system can boot up now.
>
> Thanks again for your testing.
>
> > It does mitigate the issue a lot but still couldn't get rid of it
> > thoroughly. It seems to me like the effect of Cong's patch.
>
> Your echoes show we're still march in the dark and let's try another
> direction in which qdisc is rescheduled outside seqlock to make sure
> tx softirq is raised when there're more packets on the pfifo_fast to
> be transmitted.
>
>         CPU0                            CPU1
>         ----                            ----
>         seqlock
>         test __QDISC_STATE_SCHED
>                 raise tx softirq
>                                         clear __QDISC_STATE_SCHED
>                                         try seqlock
>                                         __qdisc_run(q);
>                                         sequnlock
>         sequnlock
>
>
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -79,6 +79,7 @@ struct Qdisc {
>  #define TCQ_F_INVISIBLE                0x80 /* invisible by default in dump */
>  #define TCQ_F_NOLOCK           0x100 /* qdisc does not require locking */
>  #define TCQ_F_OFFLOADED                0x200 /* qdisc is offloaded to HW */
> +       int                     pkt_seq;
>         u32                     limit;
>         const struct Qdisc_ops  *ops;
>         struct qdisc_size_table __rcu *stab;
> --- a/include/net/pkt_sched.h
> +++ b/include/net/pkt_sched.h
> @@ -118,6 +118,8 @@ void __qdisc_run(struct Qdisc *q);
>  static inline void qdisc_run(struct Qdisc *q)
>  {
>         if (qdisc_run_begin(q)) {
> +               q->pkt_seq = 0;
> +
>                 /* NOLOCK qdisc must check 'state' under the qdisc seqlock
>                  * to avoid racing with dev_qdisc_reset()
>                  */
> @@ -125,6 +127,10 @@ static inline void qdisc_run(struct Qdis
>                     likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
>                         __qdisc_run(q);
>                 qdisc_run_end(q);
> +
> +               /* reschedule qdisc if there are packets enqueued */
> +               if (q->pkt_seq != 0)
> +                       __netif_schedule(q);
>         }
>  }
>
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -384,6 +384,8 @@ void __qdisc_run(struct Qdisc *q)
>         while (qdisc_restart(q, &packets)) {
>                 quota -= packets;
>                 if (quota <= 0) {
> +                       /* info caller to reschedule qdisc outside q->seqlock */
> +                       q->pkt_seq = 1;
>                         __netif_schedule(q);
>                         break;
>                 }
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3031,6 +3031,7 @@ static void __netif_reschedule(struct Qd
>
>         local_irq_save(flags);
>         sd = this_cpu_ptr(&softnet_data);
> +       q->pkt_seq = 0;
>         q->next_sched = NULL;
>         *sd->output_queue_tailp = q;
>         sd->output_queue_tailp = &q->next_sched;
> --
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
       [not found]                                   ` <5f46032e.1c69fb81.9880c.7a6cSMTPIN_ADDED_MISSING@mx.google.com>
@ 2020-08-27  6:56                                     ` Kehuan Feng
       [not found]                                       ` <20200827125747.5816-1-hdanton@sina.com>
  0 siblings, 1 reply; 47+ messages in thread
From: Kehuan Feng @ 2020-08-27  6:56 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Jike Song, Josh Hunt, Paolo Abeni, Jonas Bonn, Cong Wang,
	Michael Zhivich, David Miller, John Fastabend, LKML, Netdev

Hi Hillf,

> Let’s see if TCQ_F_NOLOC is making fq_codel different in your testing.

I assume you meant disabling NOLOCK for pfifo_fast.

Here is the modification,

--- ./net/sched/sch_generic.c.orig      2020-08-24 22:02:04.589830751 +0800
+++ ./net/sched/sch_generic.c   2020-08-27 10:17:10.148977195 +0800
@@ -792,7 +792,7 @@
        .dump           =       pfifo_fast_dump,
        .change_tx_queue_len =  pfifo_fast_change_tx_queue_len,
        .owner          =       THIS_MODULE,
-       .static_flags   =       TCQ_F_NOLOCK | TCQ_F_CPUSTATS,
+       .static_flags   =       TCQ_F_CPUSTATS,

The issue never happen again with it for over 3 hours stressing. And I
restarted the test for two times. No any surprising. Quite stable...


Hillf Danton <hdanton@sina.com> 于2020年8月26日周三 下午2:37写道:
>
> Hi Feng,
>
>
>
> On Wed, 26 Aug 2020 11:12:38 +0800 Fengkehuan Feng wrote:
>
> >Hi Hillf,
> >
> >I just gave more tries on the patch, and it seems not that good as what I told in last email.
>
> >I could see more packets getting stuck now...
>
> We have more to learn here:P
>
> >
> >Let me explain what I am facing in detail in case we are not aligning to fix the same problem.
> >
> >Our application is in deep learning scenario and it's based on NVIDIA NCCL to do
>
> >collective communication intra-node or inter-node (to be more specific, it's data
>
> > all-reduce on two servers witch 8 GPU nodes each).
> >NCCL can support data transmission through TCP/RDMA/GDR. In normal, it takes
>
> > about 1000 us for TCP or less for RDMA/GDR to transmit 512KB packet, but
>
> > sometimes it tooks hundreds of millisecond or several seconds to get completed.
> >
>
> >When we change the default qdisc from pfifo_fast to fq_codel, the issue never
>
> > happen, so we suspect it's something wrong within the networking stack (but
>
> > it's a bit strange that RDMA or GDR has the same problem)
>
> Let’s see if TCQ_F_NOLOC is making fq_codel different in your testing.
>
>
>
> --- a/net/sched/sch_generic.c
>
> +++ b/net/sched/sch_generic.c
>
> @@ -791,7 +791,7 @@ struct Qdisc_ops pfifo_fast_ops __read_m
>
>       .dump           =    pfifo_fast_dump,
>
>       .change_tx_queue_len =  pfifo_fast_change_tx_queue_len,
>
>       .owner          =    THIS_MODULE,
>
> -     .static_flags   =    TCQ_F_NOLOCK | TCQ_F_CPUSTATS,
>
> +    .static_flags   =    TCQ_F_CPUSTATS,
>
> };
>
> EXPORT_SYMBOL(pfifo_fast_ops);
>
> --
>
>
>
> >
> >Here is the log print from our test application,
> >
> >size: 512KB, use_time: 1118us, speed: 0.436745GB/s
> >size: 512KB, use_time: 912us, speed: 0.535396GB/s
> >size: 512KB, use_time: 1023us, speed: 0.477303GB/s
> >size: 512KB, use_time: 919us, speed: 0.531318GB/s
> >size: 512KB, use_time: 1129us, speed: 0.432490GB/s
> >size: 512KB, use_time: 2098748us, speed: 0.000233GB/s
> >size: 512KB, use_time: 1018us, speed: 0.479648GB/s
> >size: 512KB, use_time: 1120us, speed: 0.435965GB/s
> >size: 512KB, use_time: 1071us, speed: 0.455912GB/
>
>
>
>
>
> JFYI I failed to find this message at lore.kernel.org perhaps
>
> because of pure text mail.
>
>
>
> Thanks
>
> Hillf

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
       [not found]                                       ` <20200827125747.5816-1-hdanton@sina.com>
@ 2020-08-28  1:45                                         ` Kehuan Feng
  2020-09-03  5:01                                           ` Cong Wang
  0 siblings, 1 reply; 47+ messages in thread
From: Kehuan Feng @ 2020-08-28  1:45 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Jike Song, Josh Hunt, Paolo Abeni, Jonas Bonn, Cong Wang,
	Michael Zhivich, David Miller, John Fastabend, LKML, Netdev

Hi Hillf,

Unfortunately, above mem barriers don't help. The issue shows up
within 1 minute ...

Hillf Danton <hdanton@sina.com> 于2020年8月27日周四 下午8:58写道:

>
>
> On Thu, 27 Aug 2020 14:56:31 +0800 Kehuan Feng wrote:
> >
> > > Lets see if TCQ_F_NOLOC is making fq_codel different in your testing.
> >
> > I assume you meant disabling NOLOCK for pfifo_fast.
> >
> > Here is the modification,
> >
> > --- ./net/sched/sch_generic.c.orig      2020-08-24 22:02:04.589830751 +0800
> > +++ ./net/sched/sch_generic.c   2020-08-27 10:17:10.148977195 +0800
> > @@ -792,7 +792,7 @@
> >         .dump           =3D       pfifo_fast_dump,
> >         .change_tx_queue_len =3D  pfifo_fast_change_tx_queue_len,
> >         .owner          =3D       THIS_MODULE,
> > -       .static_flags   =3D       TCQ_F_NOLOCK | TCQ_F_CPUSTATS,
> > +       .static_flags   =3D       TCQ_F_CPUSTATS,
> >
> > The issue never happen again with it for over 3 hours stressing. And I
> > restarted the test for two times. No any surprising. Quite stable...
>
> Jaw off. That is great news and I'm failing again to explain the test
> result wrt the difference TCQ_F_NOLOCK can make in running qdisc.
>
> Nothing comes into mind other than two mem barriers though only one is
> needed...
>
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3040,6 +3040,7 @@ static void __netif_reschedule(struct Qd
>
>  void __netif_schedule(struct Qdisc *q)
>  {
> +       smp_mb__before_atomic();
>         if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state))
>                 __netif_reschedule(q);
>  }
> @@ -4899,6 +4900,7 @@ static __latent_entropy void net_tx_acti
>                          */
>                         smp_mb__before_atomic();
>                         clear_bit(__QDISC_STATE_SCHED, &q->state);
> +                       smp_mb__after_atomic();
>                         qdisc_run(q);
>                         if (root_lock)
>                                 spin_unlock(root_lock);
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2020-08-28  1:45                                         ` Kehuan Feng
@ 2020-09-03  5:01                                           ` Cong Wang
  2020-09-03  8:39                                             ` Paolo Abeni
  0 siblings, 1 reply; 47+ messages in thread
From: Cong Wang @ 2020-09-03  5:01 UTC (permalink / raw)
  To: Kehuan Feng
  Cc: Hillf Danton, Jike Song, Josh Hunt, Paolo Abeni, Jonas Bonn,
	Michael Zhivich, David Miller, John Fastabend, LKML, Netdev

[-- Attachment #1: Type: text/plain, Size: 135 bytes --]

Hello, Kehuan

Can you test the attached one-line fix? I think we are overthinking,
probably all
we need here is a busy wait.

Thanks.

[-- Attachment #2: qdisc-seqlock.diff --]
[-- Type: text/x-patch, Size: 530 bytes --]

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index d60e7c39d60c..fc1bacdb102b 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -156,8 +156,7 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
 static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 {
 	if (qdisc->flags & TCQ_F_NOLOCK) {
-		if (!spin_trylock(&qdisc->seqlock))
-			return false;
+		spin_lock(&qdisc->seqlock);
 		WRITE_ONCE(qdisc->empty, false);
 	} else if (qdisc_is_running(qdisc)) {
 		return false;

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2020-09-03  5:01                                           ` Cong Wang
@ 2020-09-03  8:39                                             ` Paolo Abeni
  2020-09-03 17:43                                               ` Cong Wang
       [not found]                                               ` <20200903101957.428-1-hdanton@sina.com>
  0 siblings, 2 replies; 47+ messages in thread
From: Paolo Abeni @ 2020-09-03  8:39 UTC (permalink / raw)
  To: Cong Wang, Kehuan Feng
  Cc: Hillf Danton, Jike Song, Josh Hunt, Jonas Bonn, Michael Zhivich,
	David Miller, John Fastabend, LKML, Netdev

On Wed, 2020-09-02 at 22:01 -0700, Cong Wang wrote:
> Can you test the attached one-line fix? I think we are overthinking,
> probably all
> we need here is a busy wait.

I think that will solve, but I also think that will kill NOLOCK
performances due to really increased contention.

At this point I fear we could consider reverting the NOLOCK stuff.
I personally would hate doing so, but it looks like NOLOCK benefits are
outweighed by its issues.

Any other opinion more than welcome!

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2020-09-03  8:39                                             ` Paolo Abeni
@ 2020-09-03 17:43                                               ` Cong Wang
  2020-09-04  5:07                                                 ` John Fastabend
       [not found]                                               ` <20200903101957.428-1-hdanton@sina.com>
  1 sibling, 1 reply; 47+ messages in thread
From: Cong Wang @ 2020-09-03 17:43 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Kehuan Feng, Hillf Danton, Jike Song, Josh Hunt, Jonas Bonn,
	Michael Zhivich, David Miller, John Fastabend, LKML, Netdev

On Thu, Sep 3, 2020 at 1:40 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Wed, 2020-09-02 at 22:01 -0700, Cong Wang wrote:
> > Can you test the attached one-line fix? I think we are overthinking,
> > probably all
> > we need here is a busy wait.
>
> I think that will solve, but I also think that will kill NOLOCK
> performances due to really increased contention.

Yeah, we somehow end up with more locks (seqlock, skb array lock)
for lockless qdisc. What an irony... ;)

>
> At this point I fear we could consider reverting the NOLOCK stuff.
> I personally would hate doing so, but it looks like NOLOCK benefits are
> outweighed by its issues.

I agree, NOLOCK brings more pains than gains. There are many race
conditions hidden in generic qdisc layer, another one is enqueue vs.
reset which is being discussed in another thread.

Thanks.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
       [not found]                                               ` <20200903101957.428-1-hdanton@sina.com>
@ 2020-09-04  3:20                                                 ` Kehuan Feng
  2020-09-10 20:19                                                   ` Cong Wang
  0 siblings, 1 reply; 47+ messages in thread
From: Kehuan Feng @ 2020-09-04  3:20 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Paolo Abeni, Cong Wang, Jike Song, Josh Hunt, Jonas Bonn,
	Michael Zhivich, David Miller, John Fastabend, LKML, Netdev

Hi Hillf, Cong, Paolo,

Sorry for the late reply due to other urgent task.

I tried Hillf's patch (shown below on my tree) and it doesn't help and
the jitter shows up very quickly.

--- ./include/net/sch_generic.h.orig 2020-08-21 15:13:51.787952710 +0800
+++ ./include/net/sch_generic.h 2020-09-04 10:48:32.081217156 +0800
@@ -108,6 +108,7 @@

  spinlock_t busylock ____cacheline_aligned_in_smp;
  spinlock_t seqlock;
+ int run, seq;
 };

 static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
@@ -127,8 +128,11 @@
 static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 {
  if (qdisc->flags & TCQ_F_NOLOCK) {
+ qdisc->run++;
+ smp_wmb();
  if (!spin_trylock(&qdisc->seqlock))
  return false;
+ qdisc->seq = qdisc->run;
  } else if (qdisc_is_running(qdisc)) {
  return false;
  }
@@ -143,8 +147,15 @@
 static inline void qdisc_run_end(struct Qdisc *qdisc)
 {
  write_seqcount_end(&qdisc->running);
- if (qdisc->flags & TCQ_F_NOLOCK)
+ if (qdisc->flags & TCQ_F_NOLOCK) {
+ int seq = qdisc->seq;
+
  spin_unlock(&qdisc->seqlock);
+ smp_rmb();
+ if (seq != qdisc->run)
+ __netif_schedule(qdisc);
+
+ }
 }


I also tried Cong's patch (shown below on my tree) and it could avoid
the issue (stressing for 30 minutus for three times and not jitter
observed).

--- ./include/net/sch_generic.h.orig 2020-08-21 15:13:51.787952710 +0800
+++ ./include/net/sch_generic.h 2020-09-03 21:36:11.468383738 +0800
@@ -127,8 +127,7 @@
 static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 {
  if (qdisc->flags & TCQ_F_NOLOCK) {
- if (!spin_trylock(&qdisc->seqlock))
- return false;
+ spin_lock(&qdisc->seqlock);
  } else if (qdisc_is_running(qdisc)) {
  return false;
  }

I am not actually know what you are discussing above. It seems to me
that Cong's patch is similar as disabling lockless feature.

Anyway, we are going to use fq_codel instead, since CentOS 8/kernel
4.18 also uses fq_codel as the default qdisc, not sure whehter they
found some thing related to this.

Thanks,
Kehuan

Hillf Danton <hdanton@sina.com> 于2020年9月3日周四 下午6:20写道:
>
>
> On Thu, 03 Sep 2020 10:39:54 +0200 Paolo Abeni wrote:
> > On Wed, 2020-09-02 at 22:01 -0700, Cong Wang wrote:
> > > Can you test the attached one-line fix? I think we are overthinking,
> > > probably all
> > > we need here is a busy wait.
> >
> > I think that will solve, but I also think that will kill NOLOCK
> > performances due to really increased contention.
> >
> > At this point I fear we could consider reverting the NOLOCK stuff.
> > I personally would hate doing so, but it looks like NOLOCK benefits are
> > outweighed by its issues.
> >
> > Any other opinion more than welcome!
>
> Hi Paolo,
>
> I suspect it's too late to fix the -27% below.
> Surgery to cut NOLOCK seems too early before the fix.
>
> Hillf
>
> >pktgen threads vanilla         patched[II]     delta
> >nr             kpps            kpps            %
> >1              3240            3240            0
> >2              3910            2830            -27%
> >4              5140            5140            0
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2020-09-03 17:43                                               ` Cong Wang
@ 2020-09-04  5:07                                                 ` John Fastabend
  2020-09-10 20:15                                                   ` Cong Wang
  2021-04-02 19:25                                                   ` Jiri Kosina
  0 siblings, 2 replies; 47+ messages in thread
From: John Fastabend @ 2020-09-04  5:07 UTC (permalink / raw)
  To: Cong Wang, Paolo Abeni
  Cc: Kehuan Feng, Hillf Danton, Jike Song, Josh Hunt, Jonas Bonn,
	Michael Zhivich, David Miller, John Fastabend, LKML, Netdev

Cong Wang wrote:
> On Thu, Sep 3, 2020 at 1:40 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On Wed, 2020-09-02 at 22:01 -0700, Cong Wang wrote:
> > > Can you test the attached one-line fix? I think we are overthinking,
> > > probably all
> > > we need here is a busy wait.
> >
> > I think that will solve, but I also think that will kill NOLOCK
> > performances due to really increased contention.
> 
> Yeah, we somehow end up with more locks (seqlock, skb array lock)
> for lockless qdisc. What an irony... ;)

I went back to the original nolock implementation code to try and figure
out how this was working in the first place.

After initial patch series we have this in __dev_xmit_skb()

	if (q->flags & TCQ_F_NOLOCK) {
		if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
			__qdisc_drop(skb, &to_free);
			rc = NET_XMIT_DROP;
		} else {
			rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
			__qdisc_run(q);
		}

		if (unlikely(to_free))
			kfree_skb_list(to_free);
		return rc;
	}

One important piece here is we used __qdisc_run(q) instead of
what we have there now qdisc_run(q). Here is the latest code,


	if (q->flags & TCQ_F_NOLOCK) { 
		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
		qdisc_run(q);
		...

__qdisc_run is going to always go into a qdisc_restart loop and
dequeue packets. There is no check here to see if another CPU
is running or not. Compare that to qdisc_run()

	static inline void qdisc_run(struct Qdisc *q)
	{
		if (qdisc_run_begin(q)) {
			__qdisc_run(q);
			qdisc_run_end(q);
		}
	}

Here we have all the racing around qdisc_is_running() that seems
unsolvable.

Seems we flipped __qdisc_run to qdisc_run here 32f7b44d0f566
("sched: manipulate __QDISC_STATE_RUNNING in qdisc_run_* helpers"). 
Its not clear to me from thatpatch though why it was even done
there?

Maybe this would unlock us,

diff --git a/net/core/dev.c b/net/core/dev.c
index 7df6c9617321..9b09429103f1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3749,7 +3749,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 
 	if (q->flags & TCQ_F_NOLOCK) {
 		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
-		qdisc_run(q);
+		__qdisc_run(q);
 
 		if (unlikely(to_free))
 			kfree_skb_list(to_free);


Per other thread we also need the state deactivated check added
back.

> 
> >
> > At this point I fear we could consider reverting the NOLOCK stuff.
> > I personally would hate doing so, but it looks like NOLOCK benefits are
> > outweighed by its issues.
> 
> I agree, NOLOCK brings more pains than gains. There are many race
> conditions hidden in generic qdisc layer, another one is enqueue vs.
> reset which is being discussed in another thread.

Sure. Seems they crept in over time. I had some plans to write a
lockless HTB implementation. But with fq+EDT with BPF it seems that
it is no longer needed, we have a more generic/better solution.  So
I dropped it. Also most folks should really be using fq, fq_codel,
etc. by default anyways. Using pfifo_fast alone is not ideal IMO.

Thanks,
John

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2020-09-04  5:07                                                 ` John Fastabend
@ 2020-09-10 20:15                                                   ` Cong Wang
  2020-09-10 21:07                                                     ` John Fastabend
  2021-04-02 19:25                                                   ` Jiri Kosina
  1 sibling, 1 reply; 47+ messages in thread
From: Cong Wang @ 2020-09-10 20:15 UTC (permalink / raw)
  To: John Fastabend
  Cc: Paolo Abeni, Kehuan Feng, Hillf Danton, Jike Song, Josh Hunt,
	Jonas Bonn, Michael Zhivich, David Miller, LKML, Netdev

On Thu, Sep 3, 2020 at 10:08 PM John Fastabend <john.fastabend@gmail.com> wrote:
> Maybe this would unlock us,
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 7df6c9617321..9b09429103f1 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3749,7 +3749,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>
>         if (q->flags & TCQ_F_NOLOCK) {
>                 rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> -               qdisc_run(q);
> +               __qdisc_run(q);
>
>                 if (unlikely(to_free))
>                         kfree_skb_list(to_free);
>
>
> Per other thread we also need the state deactivated check added
> back.

I guess no, because pfifo_dequeue() seems to require q->seqlock,
according to comments in qdisc_run(), so we can not just get rid of
qdisc_run_begin()/qdisc_run_end() here.

Thanks.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2020-09-04  3:20                                                 ` Kehuan Feng
@ 2020-09-10 20:19                                                   ` Cong Wang
  2020-09-14  2:10                                                     ` Yunsheng Lin
  0 siblings, 1 reply; 47+ messages in thread
From: Cong Wang @ 2020-09-10 20:19 UTC (permalink / raw)
  To: Kehuan Feng
  Cc: Hillf Danton, Paolo Abeni, Jike Song, Josh Hunt, Jonas Bonn,
	Michael Zhivich, David Miller, John Fastabend, LKML, Netdev

On Thu, Sep 3, 2020 at 8:21 PM Kehuan Feng <kehuan.feng@gmail.com> wrote:
> I also tried Cong's patch (shown below on my tree) and it could avoid
> the issue (stressing for 30 minutus for three times and not jitter
> observed).

Thanks for verifying it!

>
> --- ./include/net/sch_generic.h.orig 2020-08-21 15:13:51.787952710 +0800
> +++ ./include/net/sch_generic.h 2020-09-03 21:36:11.468383738 +0800
> @@ -127,8 +127,7 @@
>  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>  {
>   if (qdisc->flags & TCQ_F_NOLOCK) {
> - if (!spin_trylock(&qdisc->seqlock))
> - return false;
> + spin_lock(&qdisc->seqlock);
>   } else if (qdisc_is_running(qdisc)) {
>   return false;
>   }
>
> I am not actually know what you are discussing above. It seems to me
> that Cong's patch is similar as disabling lockless feature.

From performance's perspective, yeah. Did you see any performance
downgrade with my patch applied? It would be great if you can compare
it with removing NOLOCK. And if the performance is as bad as no
NOLOCK, then we can remove the NOLOCK bit for pfifo_fast, at least
for now.

Thanks.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2020-09-10 20:15                                                   ` Cong Wang
@ 2020-09-10 21:07                                                     ` John Fastabend
  2020-09-10 21:40                                                       ` Paolo Abeni
  0 siblings, 1 reply; 47+ messages in thread
From: John Fastabend @ 2020-09-10 21:07 UTC (permalink / raw)
  To: Cong Wang, John Fastabend
  Cc: Paolo Abeni, Kehuan Feng, Hillf Danton, Jike Song, Josh Hunt,
	Jonas Bonn, Michael Zhivich, David Miller, LKML, Netdev

Cong Wang wrote:
> On Thu, Sep 3, 2020 at 10:08 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > Maybe this would unlock us,
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 7df6c9617321..9b09429103f1 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -3749,7 +3749,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> >
> >         if (q->flags & TCQ_F_NOLOCK) {
> >                 rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> > -               qdisc_run(q);
> > +               __qdisc_run(q);
> >
> >                 if (unlikely(to_free))
> >                         kfree_skb_list(to_free);
> >
> >
> > Per other thread we also need the state deactivated check added
> > back.
> 
> I guess no, because pfifo_dequeue() seems to require q->seqlock,
> according to comments in qdisc_run(), so we can not just get rid of
> qdisc_run_begin()/qdisc_run_end() here.
> 
> Thanks.

Seems we would have to revert this as well then,

 commit 021a17ed796b62383f7623f4fea73787abddad77
 Author: Paolo Abeni <pabeni@redhat.com>
 Date:   Tue May 15 16:24:37 2018 +0200

    pfifo_fast: drop unneeded additional lock on dequeue
    
    After the previous patch, for NOLOCK qdiscs, q->seqlock is
    always held when the dequeue() is invoked, we can drop
    any additional locking to protect such operation.

Then I think it should be safe. Back when I was working on the ptr
ring implementation I opted not to do a case without the spinlock
because the performance benefit was minimal in the benchmarks I
was looking at. I assumed at some point it would be worth going
back to it, but just changing those to the __ptr_ring* cases is
not safe without a lock. I remember having a discussion with Tsirkin
about the details, but would have to go through the mail servers
to find it.

FWIW the initial perf looked like this, (https://lwn.net/Articles/698135/)

nolock pfifo_fast
1:  1417597 1407479 1418913 1439601 
2:  1882009 1867799 1864374 1855950
4:  1806736 1804261 1803697 1806994
8:  1354318 1358686 1353145 1356645
12: 1331928 1333079 1333476 1335544

locked pfifo_fast
1:  1471479 1469142 1458825 1456788 
2:  1746231 1749490 1753176 1753780
4:  1119626 1120515 1121478 1119220
8:  1001471  999308 1000318 1000776
12:  989269  992122  991590  986581

As you can see measurable improvement on many cores. But, actually
worse if you have enough nic queues to map 1:1 with cores.

nolock mq
1:    1417768  1438712  1449092  1426775
2:    2644099  2634961  2628939  2712867
4:    4866133  4862802  4863396  4867423
8:    9422061  9464986  9457825  9467619
12:  13854470 13213735 13664498 13213292  

locked mq
1:   1448374  1444208  1437459  1437088 
2:   2687963  2679221  2651059  2691630
4:   5153884  4684153  5091728  4635261
8:   9292395  9625869  9681835  9711651
12: 13553918 13682410 14084055 13946138

So only better if you have more cores than hardware queues
which was the case on some of the devices we had at the time.

Thanks,
John

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2020-09-10 21:07                                                     ` John Fastabend
@ 2020-09-10 21:40                                                       ` Paolo Abeni
  0 siblings, 0 replies; 47+ messages in thread
From: Paolo Abeni @ 2020-09-10 21:40 UTC (permalink / raw)
  To: John Fastabend, Cong Wang
  Cc: Kehuan Feng, Hillf Danton, Jike Song, Josh Hunt, Jonas Bonn,
	Michael Zhivich, David Miller, LKML, Netdev

On Thu, 2020-09-10 at 14:07 -0700, John Fastabend wrote:
> Cong Wang wrote:
> > On Thu, Sep 3, 2020 at 10:08 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > > Maybe this would unlock us,
> > > 
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index 7df6c9617321..9b09429103f1 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -3749,7 +3749,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> > > 
> > >         if (q->flags & TCQ_F_NOLOCK) {
> > >                 rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> > > -               qdisc_run(q);
> > > +               __qdisc_run(q);
> > > 
> > >                 if (unlikely(to_free))
> > >                         kfree_skb_list(to_free);
> > > 
> > > 
> > > Per other thread we also need the state deactivated check added
> > > back.
> > 
> > I guess no, because pfifo_dequeue() seems to require q->seqlock,
> > according to comments in qdisc_run(), so we can not just get rid of
> > qdisc_run_begin()/qdisc_run_end() here.
> > 
> > Thanks.
> 
> Seems we would have to revert this as well then,
> 
>  commit 021a17ed796b62383f7623f4fea73787abddad77
>  Author: Paolo Abeni <pabeni@redhat.com>
>  Date:   Tue May 15 16:24:37 2018 +0200
> 
>     pfifo_fast: drop unneeded additional lock on dequeue
>     
>     After the previous patch, for NOLOCK qdiscs, q->seqlock is
>     always held when the dequeue() is invoked, we can drop
>     any additional locking to protect such operation.
> 
> Then I think it should be safe. Back when I was working on the ptr
> ring implementation I opted not to do a case without the spinlock
> because the performance benefit was minimal in the benchmarks I
> was looking at.

The main point behind all that changes was try to close the gap vs the
locked implementation in the uncontended scenario. In our benchmark,
after commit eb82a994479245a79647d302f9b4eb8e7c9d7ca6, was more near to
10%.

Anyway I agree reverting back to the bitlock should be safe.

Cheers,

Paolo 



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2020-09-10 20:19                                                   ` Cong Wang
@ 2020-09-14  2:10                                                     ` Yunsheng Lin
  2020-09-17 19:52                                                       ` Cong Wang
  0 siblings, 1 reply; 47+ messages in thread
From: Yunsheng Lin @ 2020-09-14  2:10 UTC (permalink / raw)
  To: Cong Wang, Kehuan Feng
  Cc: Hillf Danton, Paolo Abeni, Jike Song, Josh Hunt, Jonas Bonn,
	Michael Zhivich, David Miller, John Fastabend, LKML, Netdev,
	linuxarm

On 2020/9/11 4:19, Cong Wang wrote:
> On Thu, Sep 3, 2020 at 8:21 PM Kehuan Feng <kehuan.feng@gmail.com> wrote:
>> I also tried Cong's patch (shown below on my tree) and it could avoid
>> the issue (stressing for 30 minutus for three times and not jitter
>> observed).
> 
> Thanks for verifying it!
> 
>>
>> --- ./include/net/sch_generic.h.orig 2020-08-21 15:13:51.787952710 +0800
>> +++ ./include/net/sch_generic.h 2020-09-03 21:36:11.468383738 +0800
>> @@ -127,8 +127,7 @@
>>  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>>  {
>>   if (qdisc->flags & TCQ_F_NOLOCK) {
>> - if (!spin_trylock(&qdisc->seqlock))
>> - return false;
>> + spin_lock(&qdisc->seqlock);
>>   } else if (qdisc_is_running(qdisc)) {
>>   return false;
>>   }
>>
>> I am not actually know what you are discussing above. It seems to me
>> that Cong's patch is similar as disabling lockless feature.
> 
>>From performance's perspective, yeah. Did you see any performance
> downgrade with my patch applied? It would be great if you can compare
> it with removing NOLOCK. And if the performance is as bad as no
> NOLOCK, then we can remove the NOLOCK bit for pfifo_fast, at least
> for now.

It seems the lockless qdisc may have below concurrent problem:
  cpu0:                                                           cpu1:
q->enqueue							    .
qdisc_run_begin(q)  						    .
__qdisc_run(q) ->qdisc_restart() -> dequeue_skb()		    .
		                 -> sch_direct_xmit()		    .
 								    .
                                                                q->enqueue
				                             qdisc_run_begin(q)			
qdisc_run_end(q)


cpu1 enqueue a skb without calling __qdisc_run(), and cpu0 did not see the
enqueued skb when calling __qdisc_run(q) because cpu1 may enqueue the skb
after cpu0 called __qdisc_run(q) and before cpu0 called qdisc_run_end(q).


Kehuan, do you care to try the below patch if it is the same problem?

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index d60e7c3..c97c1ed 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -36,6 +36,7 @@ struct qdisc_rate_table {
 enum qdisc_state_t {
 	__QDISC_STATE_SCHED,
 	__QDISC_STATE_DEACTIVATED,
+	__QDISC_STATE_ENQUEUED,
 };

 struct qdisc_size_table {
diff --git a/net/core/dev.c b/net/core/dev.c
index 0362419..5985648 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3748,6 +3748,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	qdisc_calculate_pkt_len(skb, q);

 	if (q->flags & TCQ_F_NOLOCK) {
+		set_bit(__QDISC_STATE_ENQUEUED, &q->state);
+		smp_mb__after_atomic();
 		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
 		qdisc_run(q);

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 265a61d..c389641 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -381,6 +381,8 @@ void __qdisc_run(struct Qdisc *q)
 	int quota = dev_tx_weight;
 	int packets;

+	clear_bit(__QDISC_STATE_ENQUEUED, &q->state);
+	smp_mb__after_atomic();
 	while (qdisc_restart(q, &packets)) {
 		quota -= packets;
 		if (quota <= 0) {
@@ -388,6 +390,9 @@ void __qdisc_run(struct Qdisc *q)
 			break;
 		}
 	}
+
+	if (test_bit(__QDISC_STATE_ENQUEUED, &q->state))
+		__netif_schedule(q);
 }

 unsigned long dev_trans_start(struct net_device *dev)


> 
> Thanks.
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2020-09-14  2:10                                                     ` Yunsheng Lin
@ 2020-09-17 19:52                                                       ` Cong Wang
  2020-09-18  2:06                                                         ` Kehuan Feng
  0 siblings, 1 reply; 47+ messages in thread
From: Cong Wang @ 2020-09-17 19:52 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: Kehuan Feng, Hillf Danton, Paolo Abeni, Jike Song, Josh Hunt,
	Jonas Bonn, Michael Zhivich, David Miller, John Fastabend, LKML,
	Netdev, linuxarm

On Sun, Sep 13, 2020 at 7:10 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2020/9/11 4:19, Cong Wang wrote:
> > On Thu, Sep 3, 2020 at 8:21 PM Kehuan Feng <kehuan.feng@gmail.com> wrote:
> >> I also tried Cong's patch (shown below on my tree) and it could avoid
> >> the issue (stressing for 30 minutus for three times and not jitter
> >> observed).
> >
> > Thanks for verifying it!
> >
> >>
> >> --- ./include/net/sch_generic.h.orig 2020-08-21 15:13:51.787952710 +0800
> >> +++ ./include/net/sch_generic.h 2020-09-03 21:36:11.468383738 +0800
> >> @@ -127,8 +127,7 @@
> >>  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
> >>  {
> >>   if (qdisc->flags & TCQ_F_NOLOCK) {
> >> - if (!spin_trylock(&qdisc->seqlock))
> >> - return false;
> >> + spin_lock(&qdisc->seqlock);
> >>   } else if (qdisc_is_running(qdisc)) {
> >>   return false;
> >>   }
> >>
> >> I am not actually know what you are discussing above. It seems to me
> >> that Cong's patch is similar as disabling lockless feature.
> >
> >>From performance's perspective, yeah. Did you see any performance
> > downgrade with my patch applied? It would be great if you can compare
> > it with removing NOLOCK. And if the performance is as bad as no
> > NOLOCK, then we can remove the NOLOCK bit for pfifo_fast, at least
> > for now.
>
> It seems the lockless qdisc may have below concurrent problem:
>   cpu0:                                                           cpu1:
> q->enqueue                                                          .
> qdisc_run_begin(q)                                                  .
> __qdisc_run(q) ->qdisc_restart() -> dequeue_skb()                   .
>                                  -> sch_direct_xmit()               .
>                                                                     .
>                                                                 q->enqueue
>                                                              qdisc_run_begin(q)
> qdisc_run_end(q)
>
>
> cpu1 enqueue a skb without calling __qdisc_run(), and cpu0 did not see the
> enqueued skb when calling __qdisc_run(q) because cpu1 may enqueue the skb
> after cpu0 called __qdisc_run(q) and before cpu0 called qdisc_run_end(q).

This is the same problem that my patch fixes, I do not know
why you are suggesting another patch despite quoting mine.
Please read the whole thread if you want to participate.

Thanks.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2020-09-17 19:52                                                       ` Cong Wang
@ 2020-09-18  2:06                                                         ` Kehuan Feng
  0 siblings, 0 replies; 47+ messages in thread
From: Kehuan Feng @ 2020-09-18  2:06 UTC (permalink / raw)
  To: Cong Wang
  Cc: Yunsheng Lin, Hillf Danton, Paolo Abeni, Jike Song, Josh Hunt,
	Jonas Bonn, Michael Zhivich, David Miller, John Fastabend, LKML,
	Netdev, linuxarm

Sorry, guys, the experiment environment is no longer existing now. We
finally use fq_codel for online product.

Cong Wang <xiyou.wangcong@gmail.com> 于2020年9月18日周五 上午3:52写道:
>
> On Sun, Sep 13, 2020 at 7:10 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >
> > On 2020/9/11 4:19, Cong Wang wrote:
> > > On Thu, Sep 3, 2020 at 8:21 PM Kehuan Feng <kehuan.feng@gmail.com> wrote:
> > >> I also tried Cong's patch (shown below on my tree) and it could avoid
> > >> the issue (stressing for 30 minutus for three times and not jitter
> > >> observed).
> > >
> > > Thanks for verifying it!
> > >
> > >>
> > >> --- ./include/net/sch_generic.h.orig 2020-08-21 15:13:51.787952710 +0800
> > >> +++ ./include/net/sch_generic.h 2020-09-03 21:36:11.468383738 +0800
> > >> @@ -127,8 +127,7 @@
> > >>  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
> > >>  {
> > >>   if (qdisc->flags & TCQ_F_NOLOCK) {
> > >> - if (!spin_trylock(&qdisc->seqlock))
> > >> - return false;
> > >> + spin_lock(&qdisc->seqlock);
> > >>   } else if (qdisc_is_running(qdisc)) {
> > >>   return false;
> > >>   }
> > >>
> > >> I am not actually know what you are discussing above. It seems to me
> > >> that Cong's patch is similar as disabling lockless feature.
> > >
> > >>From performance's perspective, yeah. Did you see any performance
> > > downgrade with my patch applied? It would be great if you can compare
> > > it with removing NOLOCK. And if the performance is as bad as no
> > > NOLOCK, then we can remove the NOLOCK bit for pfifo_fast, at least
> > > for now.
> >
> > It seems the lockless qdisc may have below concurrent problem:
> >   cpu0:                                                           cpu1:
> > q->enqueue                                                          .
> > qdisc_run_begin(q)                                                  .
> > __qdisc_run(q) ->qdisc_restart() -> dequeue_skb()                   .
> >                                  -> sch_direct_xmit()               .
> >                                                                     .
> >                                                                 q->enqueue
> >                                                              qdisc_run_begin(q)
> > qdisc_run_end(q)
> >
> >
> > cpu1 enqueue a skb without calling __qdisc_run(), and cpu0 did not see the
> > enqueued skb when calling __qdisc_run(q) because cpu1 may enqueue the skb
> > after cpu0 called __qdisc_run(q) and before cpu0 called qdisc_run_end(q).
>
> This is the same problem that my patch fixes, I do not know
> why you are suggesting another patch despite quoting mine.
> Please read the whole thread if you want to participate.
>
> Thanks.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2020-09-04  5:07                                                 ` John Fastabend
  2020-09-10 20:15                                                   ` Cong Wang
@ 2021-04-02 19:25                                                   ` Jiri Kosina
  2021-04-02 19:33                                                     ` Josh Hunt
       [not found]                                                     ` <20210403003537.2032-1-hdanton@sina.com>
  1 sibling, 2 replies; 47+ messages in thread
From: Jiri Kosina @ 2021-04-02 19:25 UTC (permalink / raw)
  To: John Fastabend
  Cc: Cong Wang, Paolo Abeni, Kehuan Feng, Hillf Danton, Jike Song,
	Josh Hunt, Jonas Bonn, Michael Zhivich, David Miller, LKML,
	Michal Kubecek, Netdev

On Thu, 3 Sep 2020, John Fastabend wrote:

> > > At this point I fear we could consider reverting the NOLOCK stuff.
> > > I personally would hate doing so, but it looks like NOLOCK benefits are
> > > outweighed by its issues.
> > 
> > I agree, NOLOCK brings more pains than gains. There are many race
> > conditions hidden in generic qdisc layer, another one is enqueue vs.
> > reset which is being discussed in another thread.
> 
> Sure. Seems they crept in over time. I had some plans to write a
> lockless HTB implementation. But with fq+EDT with BPF it seems that
> it is no longer needed, we have a more generic/better solution.  So
> I dropped it. Also most folks should really be using fq, fq_codel,
> etc. by default anyways. Using pfifo_fast alone is not ideal IMO.

Half a year later, we still have the NOLOCK implementation 
present, and pfifo_fast still does set the TCQ_F_NOLOCK flag on itself. 

And we've just been bitten by this very same race which appears to be 
still unfixed, with single packet being stuck in pfifo_fast qdisc 
basically indefinitely due to this very race that this whole thread began 
with back in 2019.

Unless there are

	(a) any nice ideas how to solve this in an elegant way without 
	    (re-)introducing extra spinlock (Cong's fix) or

	(b) any objections to revert as per the argumentation above

I'll be happy to send a revert of the whole NOLOCK implementation next 
week.

Thanks,

-- 
Jiri Kosina
SUSE Labs


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2021-04-02 19:25                                                   ` Jiri Kosina
@ 2021-04-02 19:33                                                     ` Josh Hunt
       [not found]                                                     ` <20210403003537.2032-1-hdanton@sina.com>
  1 sibling, 0 replies; 47+ messages in thread
From: Josh Hunt @ 2021-04-02 19:33 UTC (permalink / raw)
  To: Jiri Kosina, John Fastabend
  Cc: Cong Wang, Paolo Abeni, Kehuan Feng, Hillf Danton, Jike Song,
	Jonas Bonn, Michael Zhivich, David Miller, LKML, Michal Kubecek,
	Netdev

On 4/2/21 12:25 PM, Jiri Kosina wrote:
> On Thu, 3 Sep 2020, John Fastabend wrote:
> 
>>>> At this point I fear we could consider reverting the NOLOCK stuff.
>>>> I personally would hate doing so, but it looks like NOLOCK benefits are
>>>> outweighed by its issues.
>>>
>>> I agree, NOLOCK brings more pains than gains. There are many race
>>> conditions hidden in generic qdisc layer, another one is enqueue vs.
>>> reset which is being discussed in another thread.
>>
>> Sure. Seems they crept in over time. I had some plans to write a
>> lockless HTB implementation. But with fq+EDT with BPF it seems that
>> it is no longer needed, we have a more generic/better solution.  So
>> I dropped it. Also most folks should really be using fq, fq_codel,
>> etc. by default anyways. Using pfifo_fast alone is not ideal IMO.
> 
> Half a year later, we still have the NOLOCK implementation
> present, and pfifo_fast still does set the TCQ_F_NOLOCK flag on itself.
> 
> And we've just been bitten by this very same race which appears to be
> still unfixed, with single packet being stuck in pfifo_fast qdisc
> basically indefinitely due to this very race that this whole thread began
> with back in 2019.
> 
> Unless there are
> 
> 	(a) any nice ideas how to solve this in an elegant way without
> 	    (re-)introducing extra spinlock (Cong's fix) or
> 
> 	(b) any objections to revert as per the argumentation above
> 
> I'll be happy to send a revert of the whole NOLOCK implementation next
> week.
> 

Jiri

If you have a reproducer can you try 
https://lkml.org/lkml/2021/3/24/1485 ? If that doesn't work I think your 
suggestion of reverting nolock makes sense to me. We've moved to using 
fq as our default now b/c of this bug.

Josh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
       [not found]                                                     ` <20210403003537.2032-1-hdanton@sina.com>
@ 2021-04-03 12:23                                                       ` Jiri Kosina
  2021-04-06  0:55                                                         ` Yunsheng Lin
  2021-04-06  1:49                                                         ` Cong Wang
  0 siblings, 2 replies; 47+ messages in thread
From: Jiri Kosina @ 2021-04-03 12:23 UTC (permalink / raw)
  To: Hillf Danton
  Cc: John Fastabend, Cong Wang, Paolo Abeni, Kehuan Feng, Jike Song,
	Jonas Bonn, Michael Zhivich, David Miller, LKML, Michal Kubecek,
	Netdev, Josh Hunt, Yunsheng Lin

On Sat, 3 Apr 2021, Hillf Danton wrote:

> >>> Sure. Seems they crept in over time. I had some plans to write a
> >>> lockless HTB implementation. But with fq+EDT with BPF it seems that
> >>> it is no longer needed, we have a more generic/better solution.  So
> >>> I dropped it. Also most folks should really be using fq, fq_codel,
> >>> etc. by default anyways. Using pfifo_fast alone is not ideal IMO.
> >> 
> >> Half a year later, we still have the NOLOCK implementation
> >> present, and pfifo_fast still does set the TCQ_F_NOLOCK flag on itself.
> >> 
> >> And we've just been bitten by this very same race which appears to be
> >> still unfixed, with single packet being stuck in pfifo_fast qdisc
> >> basically indefinitely due to this very race that this whole thread began
> >> with back in 2019.
> >> 
> >> Unless there are
> >> 
> >> 	(a) any nice ideas how to solve this in an elegant way without
> >> 	    (re-)introducing extra spinlock (Cong's fix) or
> >> 
> >> 	(b) any objections to revert as per the argumentation above
> >> 
> >> I'll be happy to send a revert of the whole NOLOCK implementation next
> >> week.
> >> 
> >Jiri
> >
> 
> Feel free to revert it as the scorch wont end without a deluge.

I am still planning to have Yunsheng Lin's (CCing) fix [1] tested in the 
coming days. If it works, then we can consider proceeding with it, 
otherwise I am all for reverting the whole NOLOCK stuff.

[1] https://lore.kernel.org/linux-can/1616641991-14847-1-git-send-email-linyunsheng@huawei.com/T/#u

-- 
Jiri Kosina
SUSE Labs


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2021-04-03 12:23                                                       ` Jiri Kosina
@ 2021-04-06  0:55                                                         ` Yunsheng Lin
  2021-04-06  7:06                                                           ` Michal Kubecek
  2021-04-06  1:49                                                         ` Cong Wang
  1 sibling, 1 reply; 47+ messages in thread
From: Yunsheng Lin @ 2021-04-06  0:55 UTC (permalink / raw)
  To: Jiri Kosina, Hillf Danton
  Cc: John Fastabend, Cong Wang, Paolo Abeni, Kehuan Feng, Jike Song,
	Jonas Bonn, Michael Zhivich, David Miller, LKML, Michal Kubecek,
	Netdev, Josh Hunt

On 2021/4/3 20:23, Jiri Kosina wrote:
> On Sat, 3 Apr 2021, Hillf Danton wrote:
> 
>>>>> Sure. Seems they crept in over time. I had some plans to write a
>>>>> lockless HTB implementation. But with fq+EDT with BPF it seems that
>>>>> it is no longer needed, we have a more generic/better solution.  So
>>>>> I dropped it. Also most folks should really be using fq, fq_codel,
>>>>> etc. by default anyways. Using pfifo_fast alone is not ideal IMO.
>>>>
>>>> Half a year later, we still have the NOLOCK implementation
>>>> present, and pfifo_fast still does set the TCQ_F_NOLOCK flag on itself.
>>>>
>>>> And we've just been bitten by this very same race which appears to be
>>>> still unfixed, with single packet being stuck in pfifo_fast qdisc
>>>> basically indefinitely due to this very race that this whole thread began
>>>> with back in 2019.
>>>>
>>>> Unless there are
>>>>
>>>> 	(a) any nice ideas how to solve this in an elegant way without
>>>> 	    (re-)introducing extra spinlock (Cong's fix) or
>>>>
>>>> 	(b) any objections to revert as per the argumentation above
>>>>
>>>> I'll be happy to send a revert of the whole NOLOCK implementation next
>>>> week.
>>>>
>>> Jiri
>>>
>>
>> Feel free to revert it as the scorch wont end without a deluge.
> 
> I am still planning to have Yunsheng Lin's (CCing) fix [1] tested in the 
> coming days. If it works, then we can consider proceeding with it, 
> otherwise I am all for reverting the whole NOLOCK stuff.

Hi, Jiri
Do you have a reproducer that can be shared here?
With reproducer, I can debug and test it myself too.

Thanks.

> 
> [1] https://lore.kernel.org/linux-can/1616641991-14847-1-git-send-email-linyunsheng@huawei.com/T/#u
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2021-04-03 12:23                                                       ` Jiri Kosina
  2021-04-06  0:55                                                         ` Yunsheng Lin
@ 2021-04-06  1:49                                                         ` Cong Wang
  2021-04-06  2:46                                                           ` Yunsheng Lin
  1 sibling, 1 reply; 47+ messages in thread
From: Cong Wang @ 2021-04-06  1:49 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Hillf Danton, John Fastabend, Paolo Abeni, Kehuan Feng,
	Jike Song, Jonas Bonn, Michael Zhivich, David Miller, LKML,
	Michal Kubecek, Netdev, Josh Hunt, Yunsheng Lin

On Sat, Apr 3, 2021 at 5:23 AM Jiri Kosina <jikos@kernel.org> wrote:
>
> I am still planning to have Yunsheng Lin's (CCing) fix [1] tested in the
> coming days. If it works, then we can consider proceeding with it,
> otherwise I am all for reverting the whole NOLOCK stuff.
>
> [1] https://lore.kernel.org/linux-can/1616641991-14847-1-git-send-email-linyunsheng@huawei.com/T/#u

I personally prefer to just revert that bit, as it brings more troubles
than gains. Even with Yunsheng's patch, there are still some issues.
Essentially, I think the core qdisc scheduling code is not ready for
lockless, just look at those NOLOCK checks in sch_generic.c. :-/

Thanks.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2021-04-06  1:49                                                         ` Cong Wang
@ 2021-04-06  2:46                                                           ` Yunsheng Lin
  2021-04-06  7:31                                                             ` Michal Kubecek
  0 siblings, 1 reply; 47+ messages in thread
From: Yunsheng Lin @ 2021-04-06  2:46 UTC (permalink / raw)
  To: Cong Wang, Jiri Kosina
  Cc: Hillf Danton, John Fastabend, Paolo Abeni, Kehuan Feng,
	Jike Song, Jonas Bonn, Michael Zhivich, David Miller, LKML,
	Michal Kubecek, Netdev, Josh Hunt, Jason A. Donenfeld,
	Toke Høiland-Jørgensen

On 2021/4/6 9:49, Cong Wang wrote:
> On Sat, Apr 3, 2021 at 5:23 AM Jiri Kosina <jikos@kernel.org> wrote:
>>
>> I am still planning to have Yunsheng Lin's (CCing) fix [1] tested in the
>> coming days. If it works, then we can consider proceeding with it,
>> otherwise I am all for reverting the whole NOLOCK stuff.
>>
>> [1] https://lore.kernel.org/linux-can/1616641991-14847-1-git-send-email-linyunsheng@huawei.com/T/#u
> 
> I personally prefer to just revert that bit, as it brings more troubles
> than gains. Even with Yunsheng's patch, there are still some issues.
> Essentially, I think the core qdisc scheduling code is not ready for
> lockless, just look at those NOLOCK checks in sch_generic.c. :-/

I am also awared of the NOLOCK checks too:), and I am willing to
take care of it if that is possible.

As the number of cores in a system is increasing, it is the trend
to become lockless, right? Even there is only one cpu involved, the
spinlock taking and releasing takes about 30ns on our arm64 system
when CONFIG_PREEMPT_VOLUNTARY is enable(ip forwarding testing).

Currently I has three ideas to optimize the lockless qdisc:
1. implement the qdisc bypass for lockless qdisc too, see [1].

2. implement lockless enqueuing for lockless qdisc using the idea
   from Jason and Toke. And it has a noticable proformance increase with
   1-4 threads running using the below prototype based on ptr_ring.

static inline int __ptr_ring_multi_produce(struct ptr_ring *r, void *ptr)
{

        int producer, next_producer;


        do {
                producer = READ_ONCE(r->producer);
                if (unlikely(!r->size) || r->queue[producer])
                        return -ENOSPC;
                next_producer = producer + 1;
                if (unlikely(next_producer >= r->size))
                        next_producer = 0;
        } while(cmpxchg_relaxed(&r->producer, producer, next_producer) != producer);

        /* Make sure the pointer we are storing points to a valid data. */
        /* Pairs with the dependency ordering in __ptr_ring_consume. */
        smp_wmb();

        WRITE_ONCE(r->queue[producer], ptr);
        return 0;
}

3. Maybe it is possible to remove the netif_tx_lock for lockless qdisc
   too, because dev_hard_start_xmit is also in the protection of
   qdisc_run_begin()/qdisc_run_end()(if there is only one qdisc using
   a netdev queue, which is true for pfifo_fast, I believe).


[1]. https://patchwork.kernel.org/project/netdevbpf/patch/1616404156-11772-1-git-send-email-linyunsheng@huawei.com/

> 
> Thanks.
> 
> .
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2021-04-06  0:55                                                         ` Yunsheng Lin
@ 2021-04-06  7:06                                                           ` Michal Kubecek
  2021-04-06 10:13                                                             ` Juergen Gross
  0 siblings, 1 reply; 47+ messages in thread
From: Michal Kubecek @ 2021-04-06  7:06 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: Jiri Kosina, Hillf Danton, John Fastabend, Cong Wang,
	Paolo Abeni, Kehuan Feng, Jike Song, Jonas Bonn, Michael Zhivich,
	David Miller, LKML, Netdev, Josh Hunt

On Tue, Apr 06, 2021 at 08:55:41AM +0800, Yunsheng Lin wrote:
> 
> Hi, Jiri
> Do you have a reproducer that can be shared here?
> With reproducer, I can debug and test it myself too.

I'm afraid we are not aware of a simple reproducer. As mentioned in the
original discussion, the race window is extremely small and the other
thread has to do quite a lot in the meantime which is probably why, as
far as I know, this was never observed on real hardware, only in
virtualization environments. NFS may also be important as, IIUC, it can
often issue an RPC request from a different CPU right after a data
transfer. Perhaps you could cheat a bit and insert a random delay
between the empty queue check and releasing q->seqlock to make it more
likely to happen.

Other than that, it's rather just "run this complex software in a xen VM
and wait".

Michal

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2021-04-06  2:46                                                           ` Yunsheng Lin
@ 2021-04-06  7:31                                                             ` Michal Kubecek
  2021-04-06 12:24                                                               ` Yunsheng Lin
  0 siblings, 1 reply; 47+ messages in thread
From: Michal Kubecek @ 2021-04-06  7:31 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: Cong Wang, Jiri Kosina, Hillf Danton, John Fastabend,
	Paolo Abeni, Kehuan Feng, Jike Song, Michael Zhivich,
	David Miller, LKML, Netdev, Josh Hunt, Jason A. Donenfeld,
	Toke Høiland-Jørgensen

On Tue, Apr 06, 2021 at 10:46:29AM +0800, Yunsheng Lin wrote:
> On 2021/4/6 9:49, Cong Wang wrote:
> > On Sat, Apr 3, 2021 at 5:23 AM Jiri Kosina <jikos@kernel.org> wrote:
> >>
> >> I am still planning to have Yunsheng Lin's (CCing) fix [1] tested in the
> >> coming days. If it works, then we can consider proceeding with it,
> >> otherwise I am all for reverting the whole NOLOCK stuff.
> >>
> >> [1] https://lore.kernel.org/linux-can/1616641991-14847-1-git-send-email-linyunsheng@huawei.com/T/#u
> > 
> > I personally prefer to just revert that bit, as it brings more troubles
> > than gains. Even with Yunsheng's patch, there are still some issues.
> > Essentially, I think the core qdisc scheduling code is not ready for
> > lockless, just look at those NOLOCK checks in sch_generic.c. :-/
> 
> I am also awared of the NOLOCK checks too:), and I am willing to
> take care of it if that is possible.
> 
> As the number of cores in a system is increasing, it is the trend
> to become lockless, right? Even there is only one cpu involved, the
> spinlock taking and releasing takes about 30ns on our arm64 system
> when CONFIG_PREEMPT_VOLUNTARY is enable(ip forwarding testing).

I agree with the benefits but currently the situation is that we have
a race condition affecting the default qdisc which is being hit in
production and can cause serious trouble which is made worse by commit
1f3279ae0c13 ("tcp: avoid retransmits of TCP packets hanging in host
queues") preventing the retransmits of the stuck packet being sent.

Perhaps rather than patching over current implementation which requires
more and more complicated hacks to work around the fact that we cannot
make the "queue is empty" check and leaving the critical section atomic,
it would make sense to reimplement it in a way which would allow us
making it atomic.

Michal


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2021-04-06  7:06                                                           ` Michal Kubecek
@ 2021-04-06 10:13                                                             ` Juergen Gross
  2021-04-06 12:17                                                               ` Yunsheng Lin
  0 siblings, 1 reply; 47+ messages in thread
From: Juergen Gross @ 2021-04-06 10:13 UTC (permalink / raw)
  To: Michal Kubecek, Yunsheng Lin
  Cc: Jiri Kosina, Hillf Danton, John Fastabend, Cong Wang,
	Paolo Abeni, Kehuan Feng, Jike Song, Jonas Bonn, Michael Zhivich,
	David Miller, LKML, Netdev, Josh Hunt


[-- Attachment #1.1.1: Type: text/plain, Size: 2732 bytes --]

On 06.04.21 09:06, Michal Kubecek wrote:
> On Tue, Apr 06, 2021 at 08:55:41AM +0800, Yunsheng Lin wrote:
>>
>> Hi, Jiri
>> Do you have a reproducer that can be shared here?
>> With reproducer, I can debug and test it myself too.
> 
> I'm afraid we are not aware of a simple reproducer. As mentioned in the
> original discussion, the race window is extremely small and the other
> thread has to do quite a lot in the meantime which is probably why, as
> far as I know, this was never observed on real hardware, only in
> virtualization environments. NFS may also be important as, IIUC, it can
> often issue an RPC request from a different CPU right after a data
> transfer. Perhaps you could cheat a bit and insert a random delay
> between the empty queue check and releasing q->seqlock to make it more
> likely to happen.
> 
> Other than that, it's rather just "run this complex software in a xen VM
> and wait".

Being the one who has managed to reproduce the issue I can share my
setup, maybe you can setup something similar (we have seen the issue
with this kind of setup on two different machines).

I'm using a physical machine with 72 cpus and 48 GB of memory. It is
running Xen as virtualization platform.

Xen dom0 is limited to 40 vcpus and 32 GB of memory, the dom0 vcpus are
limited to run on the first 40 physical cpus (no idea whether that
matters, though).

In a guest with 16 vcpu and 8GB of memory I'm running 8 parallel
sysbench instances in a loop, those instances are prepared via

sysbench --file-test-mode=rndrd --test=fileio prepare

and then started in a do while loop via:

sysbench --test=fileio --file-test-mode=rndrw --rand-seed=0 
--max-time=300 --max-requests=0 run

Each instance is using a dedicated NFS mount to run on. The NFS
server for the 8 mounts is running in dom0 of the same server, the
data of the NFS shares is located in a RAM disk (size is a little bit
above 16GB). The shares are mounted in the guest with:

mount -t nfs -o 
rw,proto=tcp,nolock,nfsvers=3,rsize=65536,wsize=65536,nosharetransport 
dom0:/ramdisk/share[1-8] /mnt[1-8]

The guests vcpus are limited to run on physical cpus 40-55, on the same
physical cpus I have 16 small guests running eating up cpu time, each of
those guests is pinned to one of the physical cpus 40-55.

That's basically it. All you need to do is to watch out for sysbench
reporting maximum latencies above one second or so (in my setup there
are latencies of several minutes at least once each hour of testing).

In case you'd like to have some more details about the setup don't
hesitate to contact me directly. I can provide you with some scripts
and config runes if you want.


Juergen

[-- Attachment #1.1.2: OpenPGP_0xB0DE9DD628BF132F.asc --]
[-- Type: application/pgp-keys, Size: 3135 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2021-04-06 10:13                                                             ` Juergen Gross
@ 2021-04-06 12:17                                                               ` Yunsheng Lin
  0 siblings, 0 replies; 47+ messages in thread
From: Yunsheng Lin @ 2021-04-06 12:17 UTC (permalink / raw)
  To: Juergen Gross, Michal Kubecek
  Cc: Jiri Kosina, Hillf Danton, John Fastabend, Cong Wang,
	Paolo Abeni, Kehuan Feng, Jike Song, Jonas Bonn, Michael Zhivich,
	David Miller, LKML, Netdev, Josh Hunt

On 2021/4/6 18:13, Juergen Gross wrote:
> On 06.04.21 09:06, Michal Kubecek wrote:
>> On Tue, Apr 06, 2021 at 08:55:41AM +0800, Yunsheng Lin wrote:
>>>
>>> Hi, Jiri
>>> Do you have a reproducer that can be shared here?
>>> With reproducer, I can debug and test it myself too.
>>
>> I'm afraid we are not aware of a simple reproducer. As mentioned in the
>> original discussion, the race window is extremely small and the other
>> thread has to do quite a lot in the meantime which is probably why, as
>> far as I know, this was never observed on real hardware, only in
>> virtualization environments. NFS may also be important as, IIUC, it can
>> often issue an RPC request from a different CPU right after a data
>> transfer. Perhaps you could cheat a bit and insert a random delay
>> between the empty queue check and releasing q->seqlock to make it more
>> likely to happen.
>>
>> Other than that, it's rather just "run this complex software in a xen VM
>> and wait".
> 
> Being the one who has managed to reproduce the issue I can share my
> setup, maybe you can setup something similar (we have seen the issue
> with this kind of setup on two different machines).
> 
> I'm using a physical machine with 72 cpus and 48 GB of memory. It is
> running Xen as virtualization platform.
> 
> Xen dom0 is limited to 40 vcpus and 32 GB of memory, the dom0 vcpus are
> limited to run on the first 40 physical cpus (no idea whether that
> matters, though).
> 
> In a guest with 16 vcpu and 8GB of memory I'm running 8 parallel
> sysbench instances in a loop, those instances are prepared via
> 
> sysbench --file-test-mode=rndrd --test=fileio prepare
> 
> and then started in a do while loop via:
> 
> sysbench --test=fileio --file-test-mode=rndrw --rand-seed=0 --max-time=300 --max-requests=0 run
> 
> Each instance is using a dedicated NFS mount to run on. The NFS
> server for the 8 mounts is running in dom0 of the same server, the
> data of the NFS shares is located in a RAM disk (size is a little bit
> above 16GB). The shares are mounted in the guest with:
> 
> mount -t nfs -o rw,proto=tcp,nolock,nfsvers=3,rsize=65536,wsize=65536,nosharetransport dom0:/ramdisk/share[1-8] /mnt[1-8]
> 
> The guests vcpus are limited to run on physical cpus 40-55, on the same
> physical cpus I have 16 small guests running eating up cpu time, each of
> those guests is pinned to one of the physical cpus 40-55.
> 
> That's basically it. All you need to do is to watch out for sysbench
> reporting maximum latencies above one second or so (in my setup there
> are latencies of several minutes at least once each hour of testing).
> 
> In case you'd like to have some more details about the setup don't
> hesitate to contact me directly. I can provide you with some scripts
> and config runes if you want.

The setup is rather complex, I just tried Michal' suggestion using
the below patch:

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 9fb0ad4..b691eda 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -207,6 +207,11 @@ static inline void qdisc_run_end(struct Qdisc *qdisc)
 {
        write_seqcount_end(&qdisc->running);
        if (qdisc->flags & TCQ_F_NOLOCK) {
+               udelay(10000);
+               udelay(10000);
+               udelay(10000);
+               udelay(10000);
+               udelay(10000);
                spin_unlock(&qdisc->seqlock);

                if (unlikely(test_bit(__QDISC_STATE_MISSED,
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 6d7f954..a83c520 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -630,6 +630,8 @@ static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
                        return qdisc_drop_cpu(skb, qdisc, to_free);
                else
                        return qdisc_drop(skb, qdisc, to_free);
+       } else {
+               skb->enqueue_jiffies = jiffies;
        }

        qdisc_update_stats_at_enqueue(qdisc, pkt_len);
@@ -653,6 +655,13 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
                skb = __skb_array_consume(q);
        }
        if (likely(skb)) {
+               unsigned int delay_ms;
+
+               delay_ms = jiffies_to_msecs(jiffies - skb->enqueue_jiffies);
linyunsheng@plinth:~/ci/kernel$ vi qdisc_reproducer.patch
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -920,7 +920,7 @@ struct sk_buff {
                                *data;
        unsigned int            truesize;
        refcount_t              users;
-
+       unsigned long           enqueue_jiffies;
 #ifdef CONFIG_SKB_EXTENSIONS
        /* only useable after checking ->active_extensions != 0 */
        struct skb_ext          *extensions;
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 639e465..ba39b86 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -176,8 +176,14 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 static inline void qdisc_run_end(struct Qdisc *qdisc)
 {
        write_seqcount_end(&qdisc->running);
-       if (qdisc->flags & TCQ_F_NOLOCK)
+       if (qdisc->flags & TCQ_F_NOLOCK) {
+               udelay(10000);
+               udelay(10000);
+               udelay(10000);
+               udelay(10000);
+               udelay(10000);
                spin_unlock(&qdisc->seqlock);
+       }
 }

 static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 49eae93..fcfae43 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -630,6 +630,8 @@ static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
                        return qdisc_drop_cpu(skb, qdisc, to_free);
                else
                        return qdisc_drop(skb, qdisc, to_free);
+       } else {
+               skb->enqueue_jiffies = jiffies;
        }

        qdisc_update_stats_at_enqueue(qdisc, pkt_len);
@@ -651,6 +653,13 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
                skb = __skb_array_consume(q);
        }
        if (likely(skb)) {
+               unsigned int delay_ms;
+
+               delay_ms = jiffies_to_msecs(jiffies - skb->enqueue_jiffies);
+
+               if (delay_ms > 100)
+                       netdev_err(qdisc_dev(qdisc), "delay: %u ms\n", delay_ms);
+
                qdisc_update_stats_at_dequeue(qdisc, skb);
        } else {
                WRITE_ONCE(qdisc->empty, true);


Using the below shell:

while((1))
do
taskset -c 0 mz dummy1 -d 10000 -c 100&
taskset -c 1 mz dummy1 -d 10000 -c 100&
taskset -c 2 mz dummy1 -d 10000 -c 100&
sleep 3
done

And got the below log:
[   80.881716] hns3 0000:bd:00.0 eth2: delay: 176 ms
[   82.036564] hns3 0000:bd:00.0 eth2: delay: 296 ms
[   87.065820] hns3 0000:bd:00.0 eth2: delay: 320 ms
[   94.134174] dummy1: delay: 1588 ms
[   94.137570] dummy1: delay: 1580 ms
[   94.140963] dummy1: delay: 1572 ms
[   94.144354] dummy1: delay: 1568 ms
[   94.147745] dummy1: delay: 1560 ms
[   99.065800] hns3 0000:bd:00.0 eth2: delay: 264 ms
[  100.106174] dummy1: delay: 1448 ms
[  102.169799] hns3 0000:bd:00.0 eth2: delay: 436 ms
[  103.166221] dummy1: delay: 1604 ms
[  103.169617] dummy1: delay: 1604 ms
[  104.985806] dummy1: delay: 316 ms
[  105.113797] hns3 0000:bd:00.0 eth2: delay: 308 ms
[  107.289805] hns3 0000:bd:00.0 eth2: delay: 556 ms
[  108.912922] hns3 0000:bd:00.0 eth2: delay: 188 ms
[  137.241801] dummy1: delay: 30624 ms
[  137.245283] dummy1: delay: 30620 ms
[  137.248760] dummy1: delay: 30616 ms
[  137.252237] dummy1: delay: 30616 ms


It seems the problem can be easily reproduced using Michal'
suggestion.

Will test and debug it using the above reproducer first.

> 
> 
> Juergen


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
  2021-04-06  7:31                                                             ` Michal Kubecek
@ 2021-04-06 12:24                                                               ` Yunsheng Lin
  0 siblings, 0 replies; 47+ messages in thread
From: Yunsheng Lin @ 2021-04-06 12:24 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: Cong Wang, Jiri Kosina, Hillf Danton, John Fastabend,
	Paolo Abeni, Kehuan Feng, Jike Song, Michael Zhivich,
	David Miller, LKML, Netdev, Josh Hunt, Jason A. Donenfeld,
	Toke Høiland-Jørgensen

On 2021/4/6 15:31, Michal Kubecek wrote:
> On Tue, Apr 06, 2021 at 10:46:29AM +0800, Yunsheng Lin wrote:
>> On 2021/4/6 9:49, Cong Wang wrote:
>>> On Sat, Apr 3, 2021 at 5:23 AM Jiri Kosina <jikos@kernel.org> wrote:
>>>>
>>>> I am still planning to have Yunsheng Lin's (CCing) fix [1] tested in the
>>>> coming days. If it works, then we can consider proceeding with it,
>>>> otherwise I am all for reverting the whole NOLOCK stuff.
>>>>
>>>> [1] https://lore.kernel.org/linux-can/1616641991-14847-1-git-send-email-linyunsheng@huawei.com/T/#u
>>>
>>> I personally prefer to just revert that bit, as it brings more troubles
>>> than gains. Even with Yunsheng's patch, there are still some issues.
>>> Essentially, I think the core qdisc scheduling code is not ready for
>>> lockless, just look at those NOLOCK checks in sch_generic.c. :-/
>>
>> I am also awared of the NOLOCK checks too:), and I am willing to
>> take care of it if that is possible.
>>
>> As the number of cores in a system is increasing, it is the trend
>> to become lockless, right? Even there is only one cpu involved, the
>> spinlock taking and releasing takes about 30ns on our arm64 system
>> when CONFIG_PREEMPT_VOLUNTARY is enable(ip forwarding testing).
> 
> I agree with the benefits but currently the situation is that we have
> a race condition affecting the default qdisc which is being hit in
> production and can cause serious trouble which is made worse by commit
> 1f3279ae0c13 ("tcp: avoid retransmits of TCP packets hanging in host
> queues") preventing the retransmits of the stuck packet being sent.
> 
> Perhaps rather than patching over current implementation which requires
> more and more complicated hacks to work around the fact that we cannot
> make the "queue is empty" check and leaving the critical section atomic,
> it would make sense to reimplement it in a way which would allow us
> making it atomic.

Yes, reimplementing that is also an option.
But what if reimplemention also has the same problem if we do not find
the root cause of this problem? I think it better to find the root cause
of it first?

> 
> Michal
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2021-04-06 12:24 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-09  6:46 Packet gets stuck in NOLOCK pfifo_fast qdisc Jonas Bonn
2019-10-09 19:14 ` Paolo Abeni
2019-10-10  6:27   ` Jonas Bonn
2019-10-11  0:39   ` Jonas Bonn
2020-06-23 13:42     ` Michael Zhivich
2020-06-30 19:14       ` Josh Hunt
2020-07-01  7:53         ` Jonas Bonn
2020-07-01 16:05         ` Cong Wang
2020-07-01 19:58           ` Cong Wang
2020-07-01 22:02             ` Josh Hunt
2020-07-02  6:14             ` Jonas Bonn
2020-07-02  9:45               ` Paolo Abeni
2020-07-02 18:08                 ` Josh Hunt
2020-07-07 14:18                   ` Paolo Abeni
2020-07-08 20:16                     ` Cong Wang
2020-07-09  9:20                       ` Paolo Abeni
2020-07-08 20:33                   ` Zhivich, Michael
2020-08-20  7:43                   ` Jike Song
2020-08-20 18:13                     ` Josh Hunt
     [not found]                     ` <20200822032800.16296-1-hdanton@sina.com>
2020-08-25  2:18                       ` Fengkehuan Feng
     [not found]                         ` <20200825032312.11776-1-hdanton@sina.com>
2020-08-25  7:14                           ` Fengkehuan Feng
     [not found]                             ` <20200825162329.11292-1-hdanton@sina.com>
2020-08-26  2:38                               ` Kehuan Feng
     [not found]                                 ` <CACS=qqKptAQQGiMoCs1Zgs9S4ZppHhasy1AK4df2NxnCDR+vCw@mail.gmail.com>
     [not found]                                   ` <5f46032e.1c69fb81.9880c.7a6cSMTPIN_ADDED_MISSING@mx.google.com>
2020-08-27  6:56                                     ` Kehuan Feng
     [not found]                                       ` <20200827125747.5816-1-hdanton@sina.com>
2020-08-28  1:45                                         ` Kehuan Feng
2020-09-03  5:01                                           ` Cong Wang
2020-09-03  8:39                                             ` Paolo Abeni
2020-09-03 17:43                                               ` Cong Wang
2020-09-04  5:07                                                 ` John Fastabend
2020-09-10 20:15                                                   ` Cong Wang
2020-09-10 21:07                                                     ` John Fastabend
2020-09-10 21:40                                                       ` Paolo Abeni
2021-04-02 19:25                                                   ` Jiri Kosina
2021-04-02 19:33                                                     ` Josh Hunt
     [not found]                                                     ` <20210403003537.2032-1-hdanton@sina.com>
2021-04-03 12:23                                                       ` Jiri Kosina
2021-04-06  0:55                                                         ` Yunsheng Lin
2021-04-06  7:06                                                           ` Michal Kubecek
2021-04-06 10:13                                                             ` Juergen Gross
2021-04-06 12:17                                                               ` Yunsheng Lin
2021-04-06  1:49                                                         ` Cong Wang
2021-04-06  2:46                                                           ` Yunsheng Lin
2021-04-06  7:31                                                             ` Michal Kubecek
2021-04-06 12:24                                                               ` Yunsheng Lin
     [not found]                                               ` <20200903101957.428-1-hdanton@sina.com>
2020-09-04  3:20                                                 ` Kehuan Feng
2020-09-10 20:19                                                   ` Cong Wang
2020-09-14  2:10                                                     ` Yunsheng Lin
2020-09-17 19:52                                                       ` Cong Wang
2020-09-18  2:06                                                         ` Kehuan Feng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).