LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* Re: [NETPOLL] netconsole: fix soft lockup when removing module
@ 2007-07-01 17:35 Oleg Nesterov
  2007-07-02  6:34 ` Jarek Poplawski
  2007-07-02  7:52 ` [PATCH] " Jarek Poplawski
  0 siblings, 2 replies; 11+ messages in thread
From: Oleg Nesterov @ 2007-07-01 17:35 UTC (permalink / raw)
  To: Jarek Poplawski, Linus Torvalds
  Cc: Andrew Morton, David S. Miller, linux-kernel

Jarek Poplawski wrote:
>
>    #1
>    Until kernel ver. 2.6.21 (including) cancel_rearming_delayed_work()
>    required a work function should always (unconditionally) rearm with
>    delay > 0 - otherwise it would endlessly loop. This patch replaces
>    this function with cancel_delayed_work(). Later kernel versions don't
>    require this, so here it's only for uniformity.

But 2.6.22 doesn't need this change, why it was merged?

In fact, I suspect this change adds a race,

> --- a/net/core/netpoll.c
> +++ b/net/core/netpoll.c
> @@ -72,7 +72,8 @@ static void queue_process(struct work_struct *work)
>  			netif_tx_unlock(dev);
>  			local_irq_restore(flags);
>  
> -			schedule_delayed_work(&npinfo->tx_work, HZ/10);
> +			if (atomic_read(&npinfo->refcnt))
> +				schedule_delayed_work(&npinfo->tx_work, HZ/10);
>  			return;
>  		}
>  		netif_tx_unlock(dev);
> @@ -785,9 +786,15 @@ void netpoll_cleanup(struct netpoll *np)
>  			if (atomic_dec_and_test(&npinfo->refcnt)) {
>  				skb_queue_purge(&npinfo->arp_tx);
>  				skb_queue_purge(&npinfo->txq);
> -				cancel_rearming_delayed_work(&npinfo->tx_work);
> +				cancel_delayed_work(&npinfo->tx_work);
>  				flush_scheduled_work();

Suppose that ->refcnt == 1, and queue_process() was preempted just after
atomic_read(&npinfo->refcnt).

netpoll_cleanup() comes, cancel_delayed_work() does nothing, flush_scheduled_work()
sleeps.

queue_process() gets CPU, re-schedules ->tx_work, and returns.

flush_scheduled_work() completes, netpoll_cleanup() frees npinfo and returns
while ->tx_work is pending.

No?

Oleg.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [NETPOLL] netconsole: fix soft lockup when removing module
  2007-07-01 17:35 [NETPOLL] netconsole: fix soft lockup when removing module Oleg Nesterov
@ 2007-07-02  6:34 ` Jarek Poplawski
  2007-07-02  9:24   ` Oleg Nesterov
  2007-07-02  7:52 ` [PATCH] " Jarek Poplawski
  1 sibling, 1 reply; 11+ messages in thread
From: Jarek Poplawski @ 2007-07-02  6:34 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, David S. Miller, linux-kernel, netdev

On Sun, Jul 01, 2007 at 09:35:58PM +0400, Oleg Nesterov wrote:
> Jarek Poplawski wrote:
> >
> >    #1
> >    Until kernel ver. 2.6.21 (including) cancel_rearming_delayed_work()
> >    required a work function should always (unconditionally) rearm with
> >    delay > 0 - otherwise it would endlessly loop. This patch replaces
> >    this function with cancel_delayed_work(). Later kernel versions don't
> >    require this, so here it's only for uniformity.
> 
> But 2.6.22 doesn't need this change, why it was merged?

One bad reason is given above. Should I look for another one?

> 
> In fact, I suspect this change adds a race,

You are right!

> 
> > --- a/net/core/netpoll.c
> > +++ b/net/core/netpoll.c
> > @@ -72,7 +72,8 @@ static void queue_process(struct work_struct *work)
> >  			netif_tx_unlock(dev);
> >  			local_irq_restore(flags);
> >  
> > -			schedule_delayed_work(&npinfo->tx_work, HZ/10);
> > +			if (atomic_read(&npinfo->refcnt))
> > +				schedule_delayed_work(&npinfo->tx_work, HZ/10);
> >  			return;
> >  		}
> >  		netif_tx_unlock(dev);
> > @@ -785,9 +786,15 @@ void netpoll_cleanup(struct netpoll *np)
> >  			if (atomic_dec_and_test(&npinfo->refcnt)) {
> >  				skb_queue_purge(&npinfo->arp_tx);
> >  				skb_queue_purge(&npinfo->txq);
> > -				cancel_rearming_delayed_work(&npinfo->tx_work);
> > +				cancel_delayed_work(&npinfo->tx_work);
> >  				flush_scheduled_work();
> 
> Suppose that ->refcnt == 1, and queue_process() was preempted just after
> atomic_read(&npinfo->refcnt).
> 
> netpoll_cleanup() comes, cancel_delayed_work() does nothing, flush_scheduled_work()
> sleeps.
> 
> queue_process() gets CPU, re-schedules ->tx_work, and returns.
> 
> flush_scheduled_work() completes, netpoll_cleanup() frees npinfo and returns
> while ->tx_work is pending.
> 
> No?

No no. (Yes!)

I had some doubts about this, and you found very good reason
for this.

I'll soon send a patch to restore cancel_rearming_delayed_work
in 2.6.22.

So, 2.6.21 needs something better (maybe you've found it btw.?),
but they weren't too interested, anyway.

Thanks very much & regards,
Jarek P.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH] Re: [NETPOLL] netconsole: fix soft lockup when removing module
  2007-07-01 17:35 [NETPOLL] netconsole: fix soft lockup when removing module Oleg Nesterov
  2007-07-02  6:34 ` Jarek Poplawski
@ 2007-07-02  7:52 ` Jarek Poplawski
  2007-07-02  8:59   ` Oleg Nesterov
  2007-07-04  6:41   ` [PATCH] Re: [NETPOLL] netconsole: fix soft lockup when removing module Jarek Poplawski
  1 sibling, 2 replies; 11+ messages in thread
From: Jarek Poplawski @ 2007-07-02  7:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, David S. Miller, linux-kernel, netdev


>From my recent patch:

> >    #1
> >    Until kernel ver. 2.6.21 (including) cancel_rearming_delayed_work()
> >    required a work function should always (unconditionally) rearm with
> >    delay > 0 - otherwise it would endlessly loop. This patch replaces
> >    this function with cancel_delayed_work(). Later kernel versions don't
> >    require this, so here it's only for uniformity.

But Oleg Nesterov <oleg@tv-sign.ru> found:

> But 2.6.22 doesn't need this change, why it was merged?
> 
> In fact, I suspect this change adds a race,
...

His description was right (thanks), so this patch reverts #1.

Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>

---

diff -Nurp 2.6.22-rc7-/net/core/netpoll.c 2.6.22-rc7/net/core/netpoll.c
--- 2.6.22-rc7-/net/core/netpoll.c	2007-07-02 09:03:27.000000000 +0200
+++ 2.6.22-rc7/net/core/netpoll.c	2007-07-02 09:32:34.000000000 +0200
@@ -72,8 +72,7 @@ static void queue_process(struct work_st
 			netif_tx_unlock(dev);
 			local_irq_restore(flags);
 
-			if (atomic_read(&npinfo->refcnt))
-				schedule_delayed_work(&npinfo->tx_work, HZ/10);
+			schedule_delayed_work(&npinfo->tx_work, HZ/10);
 			return;
 		}
 		netif_tx_unlock(dev);
@@ -786,7 +785,7 @@ void netpoll_cleanup(struct netpoll *np)
 			if (atomic_dec_and_test(&npinfo->refcnt)) {
 				skb_queue_purge(&npinfo->arp_tx);
 				skb_queue_purge(&npinfo->txq);
-				cancel_delayed_work(&npinfo->tx_work);
+				cancel_rearming_delayed_work(&npinfo->tx_work);
 				flush_scheduled_work();
 
 				/* clean after last, unfinished work */

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Re: [NETPOLL] netconsole: fix soft lockup when removing module
  2007-07-02  7:52 ` [PATCH] " Jarek Poplawski
@ 2007-07-02  8:59   ` Oleg Nesterov
  2007-07-02 10:12     ` [PATCH 2/2][NETPOLL] netconsole: delete flush_scheduled_work Jarek Poplawski
  2007-07-04  6:41   ` [PATCH] Re: [NETPOLL] netconsole: fix soft lockup when removing module Jarek Poplawski
  1 sibling, 1 reply; 11+ messages in thread
From: Oleg Nesterov @ 2007-07-02  8:59 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Linus Torvalds, Andrew Morton, David S. Miller, linux-kernel, netdev

On 07/02, Jarek Poplawski wrote:
> 
> diff -Nurp 2.6.22-rc7-/net/core/netpoll.c 2.6.22-rc7/net/core/netpoll.c
> --- 2.6.22-rc7-/net/core/netpoll.c	2007-07-02 09:03:27.000000000 +0200
> +++ 2.6.22-rc7/net/core/netpoll.c	2007-07-02 09:32:34.000000000 +0200
> @@ -72,8 +72,7 @@ static void queue_process(struct work_st
>  			netif_tx_unlock(dev);
>  			local_irq_restore(flags);
>  
> -			if (atomic_read(&npinfo->refcnt))
> -				schedule_delayed_work(&npinfo->tx_work, HZ/10);
> +			schedule_delayed_work(&npinfo->tx_work, HZ/10);
>  			return;
>  		}
>  		netif_tx_unlock(dev);
> @@ -786,7 +785,7 @@ void netpoll_cleanup(struct netpoll *np)
>  			if (atomic_dec_and_test(&npinfo->refcnt)) {
>  				skb_queue_purge(&npinfo->arp_tx);
>  				skb_queue_purge(&npinfo->txq);
> -				cancel_delayed_work(&npinfo->tx_work);
> +				cancel_rearming_delayed_work(&npinfo->tx_work);
>  				flush_scheduled_work();

While you are here, could you also delete this flush_scheduled_work() ?
It is not needed any longer.

Oleg.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [NETPOLL] netconsole: fix soft lockup when removing module
  2007-07-02  6:34 ` Jarek Poplawski
@ 2007-07-02  9:24   ` Oleg Nesterov
  2007-07-02 11:08     ` Jarek Poplawski
  0 siblings, 1 reply; 11+ messages in thread
From: Oleg Nesterov @ 2007-07-02  9:24 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Linus Torvalds, Andrew Morton, David S. Miller, linux-kernel, netdev

On 07/02, Jarek Poplawski wrote:
>
> > > --- a/net/core/netpoll.c
> > > +++ b/net/core/netpoll.c
> > > @@ -72,7 +72,8 @@ static void queue_process(struct work_struct *work)
> > >  			netif_tx_unlock(dev);
> > >  			local_irq_restore(flags);
> > >  
> > > -			schedule_delayed_work(&npinfo->tx_work, HZ/10);
> > > +			if (atomic_read(&npinfo->refcnt))
> > > +				schedule_delayed_work(&npinfo->tx_work, HZ/10);
> > >  			return;
> > >  		}
>
> [...snip...]
> 
> So, 2.6.21 needs something better (maybe you've found it btw.?),
> but they weren't too interested, anyway.

We can do a double flush trick. If queue_process() checks ->refcnt before
schedule_delayed_work() like above, netpoll_cleanup() can do

	flush_scheduled_work();

	// the next invocation of queue_process()
	// must see ->refcnt == 0
	if (!cancel_delayed_work(&npinfo->tx_work)) {
		/* may be queued, wait for completion */
		flush_scheduled_work();
	}

Jarek, I don't understand net/, a silly question. Why do we need the #2 chunk?
Isn't it better to move skb_queue_purge(&npinfo->txq) after cancel_..._work()
instead?

Oleg.	


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 2/2][NETPOLL] netconsole: delete flush_scheduled_work
  2007-07-02  8:59   ` Oleg Nesterov
@ 2007-07-02 10:12     ` Jarek Poplawski
  0 siblings, 0 replies; 11+ messages in thread
From: Jarek Poplawski @ 2007-07-02 10:12 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, David S. Miller, linux-kernel, netdev

On Mon, Jul 02, 2007 at 12:59:49PM +0400, Oleg Nesterov wrote:
...
> While you are here, could you also delete this flush_scheduled_work() ?
> It is not needed any longer.

Yes. I've thought about this, and even planned to mention, but then
forgotten... Of course, you are right, but since it stayed so long
and doesn't seem to be dangerous, and there is -rc7 I wasn't so brave.
But now I have an explanation...

Jarek P.

---------->

Subject: [PATCH][NETPOLL] netconsole: delete flush_scheduled_work

flush_scheduled_work() isn't needed after cancel_rearming_delayed_work(),
so here it's removed from netpoll_cleanup().

PS: This patch was prepared on 2.6.22-rc7 with my other today's patch:
netconsole: fix soft lockup ...

Noticed-by: Oleg Nesterov <oleg@tv-sign.ru>

Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>

---

diff -Nurp 2.6.22-rc7-plus-revert1-/net/core/netpoll.c 2.6.22-rc7-plus-revert1/net/core/netpoll.c
--- 2.6.22-rc7-plus-revert1-/net/core/netpoll.c	2007-07-02 09:32:34.000000000 +0200
+++ 2.6.22-rc7-plus-revert1/net/core/netpoll.c	2007-07-02 11:43:29.000000000 +0200
@@ -786,7 +786,6 @@ void netpoll_cleanup(struct netpoll *np)
 				skb_queue_purge(&npinfo->arp_tx);
 				skb_queue_purge(&npinfo->txq);
 				cancel_rearming_delayed_work(&npinfo->tx_work);
-				flush_scheduled_work();
 
 				/* clean after last, unfinished work */
 				if (!skb_queue_empty(&npinfo->txq)) {

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [NETPOLL] netconsole: fix soft lockup when removing module
  2007-07-02  9:24   ` Oleg Nesterov
@ 2007-07-02 11:08     ` Jarek Poplawski
  0 siblings, 0 replies; 11+ messages in thread
From: Jarek Poplawski @ 2007-07-02 11:08 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, David S. Miller, linux-kernel, netdev

On Mon, Jul 02, 2007 at 01:24:08PM +0400, Oleg Nesterov wrote:
> On 07/02, Jarek Poplawski wrote:
> >
> > > > --- a/net/core/netpoll.c
> > > > +++ b/net/core/netpoll.c
> > > > @@ -72,7 +72,8 @@ static void queue_process(struct work_struct *work)
> > > >  			netif_tx_unlock(dev);
> > > >  			local_irq_restore(flags);
> > > >  
> > > > -			schedule_delayed_work(&npinfo->tx_work, HZ/10);
> > > > +			if (atomic_read(&npinfo->refcnt))
> > > > +				schedule_delayed_work(&npinfo->tx_work, HZ/10);
> > > >  			return;
> > > >  		}
> >
> > [...snip...]
> > 
> > So, 2.6.21 needs something better (maybe you've found it btw.?),
> > but they weren't too interested, anyway.
> 
> We can do a double flush trick. If queue_process() checks ->refcnt before
> schedule_delayed_work() like above, netpoll_cleanup() can do
> 
> 	flush_scheduled_work();
> 
> 	// the next invocation of queue_process()
> 	// must see ->refcnt == 0
> 	if (!cancel_delayed_work(&npinfo->tx_work)) {
> 		/* may be queued, wait for completion */
> 		flush_scheduled_work();
> 	}

I'll try to think about it later, but I don't plan to do next patch,
so feel free to send this. I didn't plan to fix netpoll at all
(I never didn't use nor studied this before...). But couldn't stand
this stupid lockup stays in 2.6.21. Now, I see it probably doesn't
annoy more than 2 or 3 people around...

> 
> Jarek, I don't understand net/, a silly question. Why do we need the #2 chunk?
> Isn't it better to move skb_queue_purge(&npinfo->txq) after cancel_..._work()
> instead?

I've thought about this too, but because I don't know netpoll/netconsole
enough I didn't want to change more than minimum needed. 

skb_queue_purge() uses heavy locking (irqsave) and I don't remember now
if I've found the reason or simply believed somebody had to have a reason
to do this there, anyway, if moved after cancel_ it could be done without
this locking, and something like while () instead of my if () should be
enough.

But there was not much interest about this patch, and I'm not currently
interested to be the main netconsole expert too, so maybe you would
like to try...

Cheers,
Jarek P.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Re: [NETPOLL] netconsole: fix soft lockup when removing module
  2007-07-02  7:52 ` [PATCH] " Jarek Poplawski
  2007-07-02  8:59   ` Oleg Nesterov
@ 2007-07-04  6:41   ` Jarek Poplawski
  2007-07-04  6:47     ` David Miller
  2007-07-04  7:21     ` Jarek Poplawski
  1 sibling, 2 replies; 11+ messages in thread
From: Jarek Poplawski @ 2007-07-04  6:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, David S. Miller, linux-kernel, netdev

On Mon, Jul 02, 2007 at 09:52:26AM +0200, Jarek Poplawski wrote:
> 
> From my recent patch:
> 
> > >    #1
> > >    Until kernel ver. 2.6.21 (including) cancel_rearming_delayed_work()
> > >    required a work function should always (unconditionally) rearm with
> > >    delay > 0 - otherwise it would endlessly loop. This patch replaces
> > >    this function with cancel_delayed_work(). Later kernel versions don't
> > >    require this, so here it's only for uniformity.
> 
> But Oleg Nesterov <oleg@tv-sign.ru> found:
> 
> > But 2.6.22 doesn't need this change, why it was merged?
> > 
> > In fact, I suspect this change adds a race,
> ...
> 
> His description was right (thanks), so this patch reverts #1.
> 
> Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>

Oleg,

I think maybe you could ack these 2 netconsole patches...
They were done on your request but it looks like Andrew
is waiting on something...

Thanks,
Jarek P.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Re: [NETPOLL] netconsole: fix soft lockup when removing module
  2007-07-04  6:41   ` [PATCH] Re: [NETPOLL] netconsole: fix soft lockup when removing module Jarek Poplawski
@ 2007-07-04  6:47     ` David Miller
  2007-07-04  7:08       ` Jarek Poplawski
  2007-07-04  7:21     ` Jarek Poplawski
  1 sibling, 1 reply; 11+ messages in thread
From: David Miller @ 2007-07-04  6:47 UTC (permalink / raw)
  To: jarkao2; +Cc: oleg, torvalds, akpm, linux-kernel, netdev

From: Jarek Poplawski <jarkao2@o2.pl>
Date: Wed, 4 Jul 2007 08:41:59 +0200

> On Mon, Jul 02, 2007 at 09:52:26AM +0200, Jarek Poplawski wrote:
> > 
> > From my recent patch:
> > 
> > > >    #1
> > > >    Until kernel ver. 2.6.21 (including) cancel_rearming_delayed_work()
> > > >    required a work function should always (unconditionally) rearm with
> > > >    delay > 0 - otherwise it would endlessly loop. This patch replaces
> > > >    this function with cancel_delayed_work(). Later kernel versions don't
> > > >    require this, so here it's only for uniformity.
> > 
> > But Oleg Nesterov <oleg@tv-sign.ru> found:
> > 
> > > But 2.6.22 doesn't need this change, why it was merged?
> > > 
> > > In fact, I suspect this change adds a race,
> > ...
> > 
> > His description was right (thanks), so this patch reverts #1.
> > 
> > Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
> 
> Oleg,
> 
> I think maybe you could ack these 2 netconsole patches...
> They were done on your request but it looks like Andrew
> is waiting on something...

I plan to apply this patch, don't worry about it :)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Re: [NETPOLL] netconsole: fix soft lockup when removing module
  2007-07-04  6:47     ` David Miller
@ 2007-07-04  7:08       ` Jarek Poplawski
  0 siblings, 0 replies; 11+ messages in thread
From: Jarek Poplawski @ 2007-07-04  7:08 UTC (permalink / raw)
  To: David Miller; +Cc: oleg, torvalds, akpm, linux-kernel, netdev

On Tue, Jul 03, 2007 at 11:47:18PM -0700, David Miller wrote:
...
> I plan to apply this patch, don't worry about it :)
 
Now I'm really worried! Don't you evere sleep?

Good night,
Jarek P.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Re: [NETPOLL] netconsole: fix soft lockup when removing module
  2007-07-04  6:41   ` [PATCH] Re: [NETPOLL] netconsole: fix soft lockup when removing module Jarek Poplawski
  2007-07-04  6:47     ` David Miller
@ 2007-07-04  7:21     ` Jarek Poplawski
  1 sibling, 0 replies; 11+ messages in thread
From: Jarek Poplawski @ 2007-07-04  7:21 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, David S. Miller, linux-kernel, netdev

On Wed, Jul 04, 2007 at 08:41:59AM +0200, Jarek Poplawski wrote:
...
> They were done on your request but it looks like Andrew
> is waiting on something...

Andrew,

This time I'm not sorry for my English because I've just
found I could speak "Chiefly Midland and Southern U.S.".

Jarek P.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2007-07-04  7:13 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-07-01 17:35 [NETPOLL] netconsole: fix soft lockup when removing module Oleg Nesterov
2007-07-02  6:34 ` Jarek Poplawski
2007-07-02  9:24   ` Oleg Nesterov
2007-07-02 11:08     ` Jarek Poplawski
2007-07-02  7:52 ` [PATCH] " Jarek Poplawski
2007-07-02  8:59   ` Oleg Nesterov
2007-07-02 10:12     ` [PATCH 2/2][NETPOLL] netconsole: delete flush_scheduled_work Jarek Poplawski
2007-07-04  6:41   ` [PATCH] Re: [NETPOLL] netconsole: fix soft lockup when removing module Jarek Poplawski
2007-07-04  6:47     ` David Miller
2007-07-04  7:08       ` Jarek Poplawski
2007-07-04  7:21     ` Jarek Poplawski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).