LKML Archive on lore.kernel.org
* Re: Fw: [PATCH -mm] workqueue: debug possible endless loop in cancel_rearming_delayed_work
[not found] ` <20070419102122.GA93@tv-sign.ru>
@ 2007-04-20 9:22 ` Jarek Poplawski
2007-04-20 17:08 ` Oleg Nesterov
0 siblings, 1 reply; 19+ messages in thread
From: Jarek Poplawski @ 2007-04-20 9:22 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Andrew Morton, Ingo Molnar, linux-kernel
On Thu, Apr 19, 2007 at 02:21:22PM +0400, Oleg Nesterov wrote:
> On 04/19, Andrew Morton wrote:
> >
> > Begin forwarded message:
> >
> > Date: Thu, 19 Apr 2007 08:54:04 +0200
> > From: Jarek Poplawski <jarkao2@o2.pl>
> > To: linux-kernel@vger.kernel.org
> > Cc: Ingo Molnar <mingo@elte.hu>
> > Subject: [PATCH -mm] workqueue: debug possible endless loop in cancel_rearming_delayed_work
At first I didn't spot that your thread is "independent", and
from your second message it seems you didn't receive everything
you expected. I wonder whether there is a need, and a way, to add
somewhere (I would prefer the http frontend of Linus' git tree)
the addresses of developers (but not maintainers) who are
interested in being CC'ed on patches to particular files.
...
> Yes. It would be better to use cancel_work_sync() instead of flush_workqueue()
> to make this less possible (because cancel_work_sync() doesn't need to wait for
> the whole ->worklist), but we can't.
>
> > Maybe this patch could check, if I'm not dreaming...
>
> Also: cancel_rearming_delayed_work() will hang if it (or cancel_delayed_work())
> was already called.
>
> I had some ideas how to make this interface reliable, but I can't see how to do
> this without uglification of the current code.
For some time I thought about using a flag (isn't there
one available after NOAUTOREL?), e.g. WORK_STRUCT_CANCEL,
as a sign:
- for a workqueue code: that the work shouldn't be queued,
nor executed, if possible, at first possible check.
- for a work function: to stop execution as soon as possible,
even without completing the usual job, at first possible check.
There would be a question of whether the flag should be changed
under a lock for exact synchronisation.
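Very roughly, and purely hypothetically (neither the flag nor these
helpers exist, and the bit number is made up), I imagine something like:

/* hypothetical sketch only - WORK_STRUCT_CANCEL does not exist */
#define WORK_STRUCT_CANCEL	2	/* assumed free bit */

/* owner side: mark the work as "being cancelled" */
static inline void work_mark_cancel(struct work_struct *work)
{
	set_bit(WORK_STRUCT_CANCEL, work_data_bits(work));
}

/* workqueue code and/or the work function: check it as early as possible */
static inline int work_cancel_requested(struct work_struct *work)
{
	return test_bit(WORK_STRUCT_CANCEL, work_data_bits(work));
}

static void my_work_func(struct work_struct *work)
{
	if (work_cancel_requested(work))
		return;		/* stop early, skip the usual job and the rearming */
	/* ... usual job, then rearm ... */
}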
Cheers,
Jarek P.
PS: I hope there is no reason against going with this to lkml
* Re: Fw: [PATCH -mm] workqueue: debug possible endless loop in cancel_rearming_delayed_work
2007-04-20 9:22 ` Fw: [PATCH -mm] workqueue: debug possible endless loop in cancel_rearming_delayed_work Jarek Poplawski
@ 2007-04-20 17:08 ` Oleg Nesterov
2007-04-23 9:00 ` Jarek Poplawski
0 siblings, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2007-04-20 17:08 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: Andrew Morton, Ingo Molnar, linux-kernel
On 04/20, Jarek Poplawski wrote:
>
> On Thu, Apr 19, 2007 at 02:21:22PM +0400, Oleg Nesterov wrote:
> ...
> > Yes. It would be better to use cancel_work_sync() instead of flush_workqueue()
> > to make this less possible (because cancel_work_sync() doesn't need to wait for
> > the whole ->worklist), but we can't.
> >
> > > Maybe this patch could check, if I'm not dreaming...
> >
> > Also: cancel_rearming_delayed_work() will hang if it (or cancel_delayed_work())
> > was already called.
> >
> > I had some ideas how to make this interface reliable, but I can't see how to do
> > this without uglification of the current code.
>
> For some time I thought about using a flag (isn't there
> one available after NOAUTOREL?), e.g. WORK_STRUCT_CANCEL,
> as a sign:
>
> - for a workqueue code: that the work shouldn't be queued,
> nor executed, if possible, at first possible check.
Well, yes and no, afaics. (note also that NOAUTOREL has already gone).
First, this flag should be cleared after return from cancel_rearming_delayed_work().
Also, we should add a lot of nasty checks to workqueue.c
I _think_ we can re-use WORK_STRUCT_PENDING to improve this interface.
Note that if we set WORK_STRUCT_PENDING, the work can't be queued, and
dwork->timer can't be started. The only problem is that it is not so
trivial to avoid races.
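To see why that is enough, recall that the queueing paths already gate
on that bit - roughly (very much simplified, error paths and the actual
list/timer manipulation omitted):

/* simplified sketch of the existing queueing paths */
int queue_work(struct workqueue_struct *wq, struct work_struct *work)
{
	int ret = 0;

	/* whoever owns WORK_STRUCT_PENDING blocks any further queueing */
	if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
		/* ... put the work on cwq->worklist ... */
		ret = 1;
	}
	return ret;
}

int queue_delayed_work(struct workqueue_struct *wq,
		       struct delayed_work *dwork, unsigned long delay)
{
	int ret = 0;

	if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(&dwork->work))) {
		/* ... set up dwork->timer and add_timer() ... */
		ret = 1;
	}
	return ret;
}

So if we manage to set the bit ourselves, nobody else can queue the work
or start the timer until we clear it.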
I'll try to do something on Sunday.
> - for a work function: to stop execution as soon as possible,
> even without completing the usual job, at first possible check.
I doubt we need this "in general". It is easy to add some flag to the
work_struct's container and check it in work->func() when needed.
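For example, something along these lines (hypothetical driver-side code,
all names made up):

struct my_device {
	struct delayed_work	work;
	int			stop;	/* set by the owner before cancelling */
	/* ... */
};

static void my_work_func(struct work_struct *work)
{
	struct my_device *dev = container_of(work, struct my_device, work.work);

	if (dev->stop)
		return;			/* bail out early, don't rearm */

	/* ... do the usual job ... */
	schedule_delayed_work(&dev->work, HZ);	/* rearm */
}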
Oleg.
* Re: Fw: [PATCH -mm] workqueue: debug possible endless loop in cancel_rearming_delayed_work
2007-04-20 17:08 ` Oleg Nesterov
@ 2007-04-23 9:00 ` Jarek Poplawski
2007-04-23 16:33 ` Oleg Nesterov
0 siblings, 1 reply; 19+ messages in thread
From: Jarek Poplawski @ 2007-04-23 9:00 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Andrew Morton, Ingo Molnar, linux-kernel
On Fri, Apr 20, 2007 at 09:08:36PM +0400, Oleg Nesterov wrote:
> On 04/20, Jarek Poplawski wrote:
> >
> > On Thu, Apr 19, 2007 at 02:21:22PM +0400, Oleg Nesterov wrote:
> > ...
> > > Yes. It would be better to use cancel_work_sync() instead of flush_workqueue()
> > > to make this less possible (because cancel_work_sync() doesn't need to wait for
> > > the whole ->worklist), but we can't.
> > >
> > > > Maybe this patch could check, if I'm not dreaming...
> > >
> > > Also: cancel_rearming_delayed_work() will hang if it (or cancel_delayed_work())
> > > was already called.
> > >
> > > I had some ideas how to make this interface reliable, but I can't see how to do
> > > this without uglification of the current code.
> >
> > For some time I thought about using a flag (isn't there
> > one available after NOAUTOREL?), e.g. WORK_STRUCT_CANCEL,
> > as a sign:
> >
> > - for a workqueue code: that the work shouldn't be queued,
> > nor executed, if possible, at first possible check.
>
> Well, yes and no, afaics. (note also that NOAUTOREL has already gone).
I thought I wrote the same (sorry for my English)...
>
> First, this flag should be cleared after return from cancel_rearming_delayed_work().
I think this flag, if at all, probably should be cleared only
consciously by the owner of a work, maybe as a schedule_xxx_work
parameter, (but shouldn't be used from work handlers for rearming).
Mostly it should mean: we are closing (and have no time to chase
our work)...
> Also, we should add a lot of nasty checks to workqueue.c
Checking a flag isn't nasty - it's clear. IMHO the current way of checking
whether a cancel succeeded is nasty.
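For reference, this is the current way I mean - the existing
implementation keeps checking whether cancel_delayed_work() succeeded
and otherwise flushes the whole workqueue and retries:

void cancel_rearming_delayed_work(struct delayed_work *dwork)
{
	struct cpu_workqueue_struct *cwq = get_wq_data(&dwork->work);

	/* Was it ever queued ? */
	if (cwq != NULL) {
		struct workqueue_struct *wq = cwq->wq;

		while (!cancel_delayed_work(dwork))
			flush_workqueue(wq);
	}
}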
>
> I _think_ we can re-use WORK_STRUCT_PENDING to improve this interface.
> Note that if we set WORK_STRUCT_PENDING, the work can't be queued, and
> dwork->timer can't be started. The only problem is that it is not so
> trivial to avoid races.
If there is no room for a new flag, that would still be better than the
current way. But then WORK_STRUCT_PENDING couldn't be used for some error
checking, as it is now.
>
> I'll try to do something on Sunday.
>
> > - for a work function: to stop execution as soon as possible,
> > even without completing the usual job, at first possible check.
>
> I doubt we need this "in general". It is easy to add some flag to the
> work_struct's container and check it in work->func() when needed.
Yes, but currently you cannot behave like this e.g. with
"rearming" work. And maybe a common api could save some work.
But of course, if you have a better way to assure this, it's OK
with me, and congratulations!
Regards,
Jarek P.
* Re: Fw: [PATCH -mm] workqueue: debug possible endless loop in cancel_rearming_delayed_work
2007-04-23 9:00 ` Jarek Poplawski
@ 2007-04-23 16:33 ` Oleg Nesterov
2007-04-24 11:53 ` Jarek Poplawski
0 siblings, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2007-04-23 16:33 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: Andrew Morton, Ingo Molnar, linux-kernel
On 04/23, Jarek Poplawski wrote:
>
> On Fri, Apr 20, 2007 at 09:08:36PM +0400, Oleg Nesterov wrote:
> >
> > First, this flag should be cleared after return from cancel_rearming_delayed_work().
>
> I think this flag, if at all, probably should be cleared only
> consciously by the owner of a work, maybe as a schedule_xxx_work
> parameter, (but shouldn't be used from work handlers for rearming).
> Mostly it should mean: we are closing (and have no time to chase
> our work)...
This will change the API. Currently it is possible to do:
cancel_delayed_work(dwork);
schedule_delayed_work(dwork, delay);
and we have such a code. With the change you propose this can't work.
> > Also, we should add a lot of nasty checks to workqueue.c
>
> Checking a flag isn't nasty - it's clear. IMHO the current way of checking
> whether a cancel succeeded is nasty.
>
> >
> > I _think_ we can re-use WORK_STRUCT_PENDING to improve this interface.
> > Note that if we set WORK_STRUCT_PENDING, the work can't be queued, and
> > dwork->timer can't be started. The only problem is that it is not so
> > trivial to avoid races.
>
> If there is no room for a new flag, that would still be better than the
> current way. But then WORK_STRUCT_PENDING couldn't be used for some error
> checking, as it is now.
Look,
void cancel_rearming_delayed_work(struct delayed_work *dwork)
{
	struct work_struct *work = &dwork->work;
	struct cpu_workqueue_struct *cwq = get_wq_data(work);
	struct workqueue_struct *wq;
	const cpumask_t *cpu_map;
	int retry;
	int cpu;

	if (!cwq)
		return;
retry:
	spin_lock_irq(&cwq->lock);
	list_del_init(&work->entry);
	__set_bit(WORK_STRUCT_PENDING, work_data_bits(work));
	retry = try_to_del_timer_sync(&dwork->timer) < 0;
	spin_unlock_irq(&cwq->lock);

	if (unlikely(retry))
		goto retry;

	// the work can't be re-queued and the timer can't
	// be re-started due to WORK_STRUCT_PENDING

	wq = cwq->wq;
	cpu_map = wq_cpu_map(wq);

	for_each_cpu_mask(cpu, *cpu_map)
		wait_on_work(per_cpu_ptr(wq->cpu_wq, cpu), work);

	work_clear_pending(work);
}
I think this almost works, except:
- we should change run_workqueue() to call work_clear_pending()
under cwq->lock. I'd like to avoid this.
- this is racy wrt cpu-hotplug. We should re-check get_wq_data()
when we take the lock. This is easy.
- we should factor out the common code with cancel_work_sync().
I may be wrong, still had no time to concentrate on this with a "clear head".
May be tomorrow.
> > > - for a work function: to stop execution as soon as possible,
> > > even without completing the usual job, at first possible check.
> >
> > I doubt we need this "in general". It is easy to add some flag to the
> > work_struct's container and check it in work->func() when needed.
>
> > Yes, but currently you cannot behave like this e.g. with
> "rearming" work.
Why?
> And maybe a common api could save some work.
May be you are right, but still I don't think we should introduce
the new flag to implement this imho not-so-useful feature.
Oleg.
* Re: Fw: [PATCH -mm] workqueue: debug possible endless loop in cancel_rearming_delayed_work
2007-04-23 16:33 ` Oleg Nesterov
@ 2007-04-24 11:53 ` Jarek Poplawski
2007-04-24 18:55 ` Oleg Nesterov
0 siblings, 1 reply; 19+ messages in thread
From: Jarek Poplawski @ 2007-04-24 11:53 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Andrew Morton, Ingo Molnar, linux-kernel
On Mon, Apr 23, 2007 at 08:33:12PM +0400, Oleg Nesterov wrote:
> On 04/23, Jarek Poplawski wrote:
> >
> > On Fri, Apr 20, 2007 at 09:08:36PM +0400, Oleg Nesterov wrote:
> > >
> > > First, this flag should be cleared after return from cancel_rearming_delayed_work().
> >
> > I think this flag, if at all, probably should be cleared only
> > consciously by the owner of a work, maybe as a schedule_xxx_work
> > parameter, (but shouldn't be used from work handlers for rearming).
> > Mostly it should mean: we are closing (and have no time to chase
> > our work)...
>
> This will change the API. Currently it is possible to do:
>
> cancel_delayed_work(dwork);
> schedule_delayed_work(dwork, delay);
>
> and we have such a code. With the change you propose this can't work.
Not necessarily: this was all only a concept, and schedule_xxx_work
could also be a new function, if needed.
>
> > > Also, we should add a lot of nasty checks to workqueue.c
> >
> > Checking a flag isn't nasty - it's clear. IMHO the current way of checking
> > whether a cancel succeeded is nasty.
> >
> > >
> > > I _think_ we can re-use WORK_STRUCT_PENDING to improve this interface.
> > > Note that if we set WORK_STRUCT_PENDING, the work can't be queued, and
> > > dwork->timer can't be started. The only problem is that it is not so
> > > trivial to avoid races.
> >
> > If there is no room for a new flag, that would still be better than the
> > current way. But then WORK_STRUCT_PENDING couldn't be used for some error
> > checking, as it is now.
>
> Look,
>
> void cancel_rearming_delayed_work(struct delayed_work *dwork)
> {
> struct work_struct *work = &dwork->work;
> struct cpu_workqueue_struct *cwq = get_wq_data(work);
> struct workqueue_struct *wq;
> const cpumask_t *cpu_map;
> int retry;
> int cpu;
>
> if (!cwq)
> return;
>
> retry:
> spin_lock_irq(&cwq->lock);
> list_del_init(&work->entry);
> __set_bit(WORK_STRUCT_PENDING, work_data_bits(work));
> retry = try_to_del_timer_sync(&dwork->timer) < 0;
> spin_unlock_irq(&cwq->lock);
>
> if (unlikely(retry))
> goto retry;
>
> // the work can't be re-queued and the timer can't
> // be re-started due to WORK_STRUCT_PENDING
>
> wq = cwq->wq;
> cpu_map = wq_cpu_map(wq);
>
> for_each_cpu_mask(cpu, *cpu_map)
> wait_on_work(per_cpu_ptr(wq->cpu_wq, cpu), work);
>
> work_clear_pending(work);
> }
>
> I think this almost works, except:
>
> - we should change run_workqueue() to call work_clear_pending()
> under cwq->lock. I'd like to avoid this.
>
> - this is racy wrt cpu-hotplug. We should re-check get_wq_data()
> when we take the lock. This is easy.
>
> - we should factor out the common code with cancel_work_sync().
>
> I may be wrong, still had no time to concentrate on this with a "clear head".
> May be tomorrow.
This looks fine. Of course, it requires to remove some debugging
currently done with _PENDING flag and it's hard to estimate this
all before you do more, but it should be more foreseeable than
current way. But the races with _PENDING could be really "funny"
without locking it everywhere. BTW - are a few locks more a real
problem, while serving a "sleeping" path? And I don't think there
is any reason to hurry...
>
> > > > - for a work function: to stop execution as soon as possible,
> > > > even without completing the usual job, at first possible check.
> > >
> > > I doubt we need this "in general". It is easy to add some flag to the
> > > work_struct's container and check it in work->func() when needed.
> >
> > Yes, but currently you cannot behave like this e.g. with
> > "rearming" work.
>
> Why?
OK, it's not impossible, but needs some bothering: if I simply
set some flag and my work function exits before rearming -
cancel_rearming_delayed_work can loop.
>
> > And maybe a common api could save some work.
>
> May be you are right, but still I don't think we should introduce
> the new flag to implement this imho not-so-useful feature.
Maybe you are right. Probably some code should be analysed,
to check how often such activities are needed.
Cheers,
Jarek P.
* Re: Fw: [PATCH -mm] workqueue: debug possible endless loop in cancel_rearming_delayed_work
2007-04-24 11:53 ` Jarek Poplawski
@ 2007-04-24 18:55 ` Oleg Nesterov
2007-04-25 6:12 ` Jarek Poplawski
2007-04-25 12:20 ` Jarek Poplawski
0 siblings, 2 replies; 19+ messages in thread
From: Oleg Nesterov @ 2007-04-24 18:55 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: Andrew Morton, Ingo Molnar, linux-kernel, David Howells
On 04/24, Jarek Poplawski wrote:
>
> This looks fine. Of course, it requires to remove some debugging
> currently done with _PENDING flag
For example?
> and it's hard to estimate this
> all before you do more, but it should be more foreseeable than
> current way. But the races with _PENDING could be really "funny"
> without locking it everywhere.
Please see the patch below. Do you see any problems? I'll send it
when I have time to re-read the code and write some tests. I still
hope we can find a way to avoid the change in run_workqueue()...
Note that cancel_rearming_delayed_work() now can handle the works
which re-arm itself via queue_work(), not only queue_delayed_work().
Note also we can change cancel_work_sync(), so it can deal with the
self rearming work_structs.
> BTW - are a few locks more a real
> problem, while serving a "sleeping" path? And I don't think there
> is any reason to hurry...
Sorry, could you clarify what you mean?
> > > Yes, but currently you cannot behave like this e.g. with
> > > "rearming" work.
> >
> > Why?
>
> OK, it's not impossible, but needs some bothering: if I simply
> set some flag and my work function exits before rearming -
> cancel_rearming_delayed_work can loop.
Yes sure. I meant "after we fix the problems you pointed out".
Oleg.
--- OLD/kernel/workqueue.c~1_CRDW 2007-04-13 17:43:23.000000000 +0400
+++ OLD/kernel/workqueue.c 2007-04-24 22:41:15.000000000 +0400
@@ -242,11 +242,11 @@ static void run_workqueue(struct cpu_wor
work_func_t f = work->func;
cwq->current_work = work;
- list_del_init(cwq->worklist.next);
+ list_del_init(&work->entry);
+ work_clear_pending(work);
spin_unlock_irq(&cwq->lock);
BUG_ON(get_wq_data(work) != cwq);
- work_clear_pending(work);
f(work);
if (unlikely(in_atomic() || lockdep_depth(current) > 0)) {
@@ -398,6 +398,16 @@ static void wait_on_work(struct cpu_work
wait_for_completion(&barr.done);
}
+static void needs_a_good_name(struct workqueue_struct *wq,
+ struct work_struct *work)
+{
+ const cpumask_t *cpu_map = wq_cpu_map(wq);
+ int cpu;
+
+ for_each_cpu_mask(cpu, *cpu_map)
+ wait_on_work(per_cpu_ptr(wq->cpu_wq, cpu), work);
+}
+
/**
* cancel_work_sync - block until a work_struct's callback has terminated
* @work: the work which is to be flushed
@@ -414,9 +424,6 @@ static void wait_on_work(struct cpu_work
void cancel_work_sync(struct work_struct *work)
{
struct cpu_workqueue_struct *cwq;
- struct workqueue_struct *wq;
- const cpumask_t *cpu_map;
- int cpu;
might_sleep();
@@ -434,15 +441,10 @@ void cancel_work_sync(struct work_struct
work_clear_pending(work);
spin_unlock_irq(&cwq->lock);
- wq = cwq->wq;
- cpu_map = wq_cpu_map(wq);
-
- for_each_cpu_mask(cpu, *cpu_map)
- wait_on_work(per_cpu_ptr(wq->cpu_wq, cpu), work);
+ needs_a_good_name(cwq->wq, work);
}
EXPORT_SYMBOL_GPL(cancel_work_sync);
-
static struct workqueue_struct *keventd_wq;
/**
@@ -532,22 +534,34 @@ EXPORT_SYMBOL(flush_scheduled_work);
/**
* cancel_rearming_delayed_work - kill off a delayed work whose handler rearms the delayed work.
* @dwork: the delayed work struct
- *
- * Note that the work callback function may still be running on return from
- * cancel_delayed_work(). Run flush_workqueue() or cancel_work_sync() to wait
- * on it.
*/
void cancel_rearming_delayed_work(struct delayed_work *dwork)
{
- struct cpu_workqueue_struct *cwq = get_wq_data(&dwork->work);
-
- /* Was it ever queued ? */
- if (cwq != NULL) {
- struct workqueue_struct *wq = cwq->wq;
-
- while (!cancel_delayed_work(dwork))
- flush_workqueue(wq);
- }
+ struct work_struct *work = &dwork->work;
+ struct cpu_workqueue_struct *cwq = get_wq_data(work);
+ int retry;
+
+ if (!cwq)
+ return;
+
+ do {
+ retry = 1;
+ spin_lock_irq(&cwq->lock);
+ /* CPU_DEAD in progress may change cwq */
+ if (likely(cwq == get_wq_data(work))) {
+ list_del_init(&work->entry);
+ __set_bit(WORK_STRUCT_PENDING, work_data_bits(work));
+ retry = try_to_del_timer_sync(&dwork->timer) < 0;
+ }
+ spin_unlock_irq(&cwq->lock);
+ } while (unlikely(retry));
+
+ /*
+ * Nobody can clear WORK_STRUCT_PENDING. This means that the
+ * work can't be re-queued and the timer can't be re-started.
+ */
+ needs_a_good_name(cwq->wq, work);
+ work_clear_pending(work);
}
EXPORT_SYMBOL(cancel_rearming_delayed_work);
* Re: Fw: [PATCH -mm] workqueue: debug possible endless loop in cancel_rearming_delayed_work
2007-04-24 18:55 ` Oleg Nesterov
@ 2007-04-25 6:12 ` Jarek Poplawski
2007-04-25 12:20 ` Jarek Poplawski
1 sibling, 0 replies; 19+ messages in thread
From: Jarek Poplawski @ 2007-04-25 6:12 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Andrew Morton, Ingo Molnar, linux-kernel, David Howells
On Tue, Apr 24, 2007 at 10:55:37PM +0400, Oleg Nesterov wrote:
> On 04/24, Jarek Poplawski wrote:
> >
> > This looks fine. Of course, it requires to remove some debugging
> > currently done with _PENDING flag
>
> For example?
Sorry!!! I don't know where I've seen those flags - maybe it's
something with my coffee...
>
> > and it's hard to estimate this
> > all before you do more, but it should be more foreseeable than
> > current way. But the races with _PENDING could be really "funny"
> > without locking it everywhere.
>
> Please see the patch below. Do you see any problems? I'll send it
> when I have time to re-read the code and write some tests. I still
> hope we can find a way to avoid the change in run_workqueue()...
>
> Note that cancel_rearming_delayed_work() now can handle the works
> which re-arm itself via queue_work(), not only queue_delayed_work().
>
> Note also we can change cancel_work_sync(), so it can deal with the
> self rearming work_structs.
>
> > BTW - are a few locks more a real
> > problem, while serving a "sleeping" path? And I don't think there
> > is any reason to hurry...
>
> Sorry, could you clarify what you mean?
I don't understand your unwillingness, e.g. about this run_workqueue
lock. If it's about performance, do you think the clients of
workqueue would care very much?
>
> > > > Yes, but currently you cannot behave like this e.g. with
> > > > "rearming" work.
> > >
> > > Why?
> >
> > OK, it's not impossible, but needs some bothering: if I simply
> > set some flag and my work function exits before rearming -
> > cancel_rearming_delayed_work can loop.
>
> Yes sure. I meant "after we fix the problems you pointed out".
>
> Oleg.
>
> --- OLD/kernel/workqueue.c~1_CRDW 2007-04-13 17:43:23.000000000 +0400
> +++ OLD/kernel/workqueue.c 2007-04-24 22:41:15.000000000 +0400
> @@ -242,11 +242,11 @@ static void run_workqueue(struct cpu_wor
> work_func_t f = work->func;
>
> cwq->current_work = work;
> - list_del_init(cwq->worklist.next);
> + list_del_init(&work->entry);
> + work_clear_pending(work);
> spin_unlock_irq(&cwq->lock);
>
> BUG_ON(get_wq_data(work) != cwq);
> - work_clear_pending(work);
> f(work);
>
> if (unlikely(in_atomic() || lockdep_depth(current) > 0)) {
> @@ -398,6 +398,16 @@ static void wait_on_work(struct cpu_work
> wait_for_completion(&barr.done);
> }
>
> +static void needs_a_good_name(struct workqueue_struct *wq,
> + struct work_struct *work)
> +{
> + const cpumask_t *cpu_map = wq_cpu_map(wq);
> + int cpu;
> +
> + for_each_cpu_mask(cpu, *cpu_map)
> + wait_on_work(per_cpu_ptr(wq->cpu_wq, cpu), work);
> +}
> +
> /**
> * cancel_work_sync - block until a work_struct's callback has terminated
> * @work: the work which is to be flushed
> @@ -414,9 +424,6 @@ static void wait_on_work(struct cpu_work
> void cancel_work_sync(struct work_struct *work)
> {
> struct cpu_workqueue_struct *cwq;
> - struct workqueue_struct *wq;
> - const cpumask_t *cpu_map;
> - int cpu;
>
> might_sleep();
>
> @@ -434,15 +441,10 @@ void cancel_work_sync(struct work_struct
> work_clear_pending(work);
> spin_unlock_irq(&cwq->lock);
>
> - wq = cwq->wq;
> - cpu_map = wq_cpu_map(wq);
> -
> - for_each_cpu_mask(cpu, *cpu_map)
> - wait_on_work(per_cpu_ptr(wq->cpu_wq, cpu), work);
> + needs_a_good_name(cwq->wq, work);
> }
> EXPORT_SYMBOL_GPL(cancel_work_sync);
>
> -
> static struct workqueue_struct *keventd_wq;
>
> /**
> @@ -532,22 +534,34 @@ EXPORT_SYMBOL(flush_scheduled_work);
> /**
> * cancel_rearming_delayed_work - kill off a delayed work whose handler rearms the delayed work.
> * @dwork: the delayed work struct
> - *
> - * Note that the work callback function may still be running on return from
> - * cancel_delayed_work(). Run flush_workqueue() or cancel_work_sync() to wait
> - * on it.
> */
> void cancel_rearming_delayed_work(struct delayed_work *dwork)
> {
> - struct cpu_workqueue_struct *cwq = get_wq_data(&dwork->work);
> -
> - /* Was it ever queued ? */
> - if (cwq != NULL) {
> - struct workqueue_struct *wq = cwq->wq;
> -
> - while (!cancel_delayed_work(dwork))
> - flush_workqueue(wq);
> - }
> + struct work_struct *work = &dwork->work;
> + struct cpu_workqueue_struct *cwq = get_wq_data(work);
> + int retry;
> +
> + if (!cwq)
> + return;
> +
> + do {
> + retry = 1;
> + spin_lock_irq(&cwq->lock);
> + /* CPU_DEAD in progress may change cwq */
> + if (likely(cwq == get_wq_data(work))) {
> + list_del_init(&work->entry);
> + __set_bit(WORK_STRUCT_PENDING, work_data_bits(work));
> + retry = try_to_del_timer_sync(&dwork->timer) < 0;
> + }
> + spin_unlock_irq(&cwq->lock);
> + } while (unlikely(retry));
> +
> + /*
> + * Nobody can clear WORK_STRUCT_PENDING. This means that the
> + * work can't be re-queued and the timer can't be re-started.
> + */
I still have some doubts. Probably there are two weak places:
1. If delayed_work_timer_fn of this work is fired and is waiting
on the above spin_lock then, after above spin_unlock, the work
will be queued. Probably this is also possible without timer i.e.
with queue_work.
2. If this function is fired after setting _PENDING flag in
queue_delayed_work_on, but before add_timer, this
try_to_del_timer_sync loop would miss this, too.
I found this while analysing your first proposal, so I may be missing
something new, but at first glance this looks similar.
> + needs_a_good_name(cwq->wq, work);
> + work_clear_pending(work);
> }
> EXPORT_SYMBOL(cancel_rearming_delayed_work);
So, if you could clear up my doubts, plus give me some more time
for the new things, I'll be happy with this tomorrow,
I hope!
Regards,
Jarek P.
* Re: Fw: [PATCH -mm] workqueue: debug possible endless loop in cancel_rearming_delayed_work
2007-04-24 18:55 ` Oleg Nesterov
2007-04-25 6:12 ` Jarek Poplawski
@ 2007-04-25 12:20 ` Jarek Poplawski
2007-04-25 12:28 ` Jarek Poplawski
1 sibling, 1 reply; 19+ messages in thread
From: Jarek Poplawski @ 2007-04-25 12:20 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Andrew Morton, Ingo Molnar, linux-kernel, David Howells
2 cents more...
On Tue, Apr 24, 2007 at 10:55:37PM +0400, Oleg Nesterov wrote:
...
> --- OLD/kernel/workqueue.c~1_CRDW 2007-04-13 17:43:23.000000000 +0400
> +++ OLD/kernel/workqueue.c 2007-04-24 22:41:15.000000000 +0400
> @@ -242,11 +242,11 @@ static void run_workqueue(struct cpu_wor
...
>
> +static void needs_a_good_name(struct workqueue_struct *wq,
If you don't prefer something original, I think this
could be something like:
wait_on_work_[on_each | each_cpu | per_cpu] etc.
...
> /**
> @@ -532,22 +534,34 @@ EXPORT_SYMBOL(flush_scheduled_work);
> /**
> * cancel_rearming_delayed_work - kill off a delayed work whose handler rearms the delayed work.
> * @dwork: the delayed work struct
> - *
> - * Note that the work callback function may still be running on return from
> - * cancel_delayed_work(). Run flush_workqueue() or cancel_work_sync() to wait
> - * on it.
Probably a few words about the changes should be added here.
> */
> void cancel_rearming_delayed_work(struct delayed_work *dwork)
> {
> - struct cpu_workqueue_struct *cwq = get_wq_data(&dwork->work);
> -
> - /* Was it ever queued ? */
> - if (cwq != NULL) {
> - struct workqueue_struct *wq = cwq->wq;
> -
> - while (!cancel_delayed_work(dwork))
> - flush_workqueue(wq);
> - }
> + struct work_struct *work = &dwork->work;
> + struct cpu_workqueue_struct *cwq = get_wq_data(work);
> + int retry;
> +
> + if (!cwq)
> + return;
> +
> + do {
> + retry = 1;
> + spin_lock_irq(&cwq->lock);
> + /* CPU_DEAD in progress may change cwq */
> + if (likely(cwq == get_wq_data(work))) {
> + list_del_init(&work->entry);
> + __set_bit(WORK_STRUCT_PENDING, work_data_bits(work));
> + retry = try_to_del_timer_sync(&dwork->timer) < 0;
> + }
else
retry = 0;
> + spin_unlock_irq(&cwq->lock);
> + } while (unlikely(retry));
...
Cheers,
Jarek P.
* Re: Fw: [PATCH -mm] workqueue: debug possible endless loop in cancel_rearming_delayed_work
2007-04-25 12:20 ` Jarek Poplawski
@ 2007-04-25 12:28 ` Jarek Poplawski
2007-04-25 12:47 ` Oleg Nesterov
0 siblings, 1 reply; 19+ messages in thread
From: Jarek Poplawski @ 2007-04-25 12:28 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Andrew Morton, Ingo Molnar, linux-kernel, David Howells
On Wed, Apr 25, 2007 at 02:20:38PM +0200, Jarek Poplawski wrote:
> 2 cents more...
...
> On Tue, Apr 24, 2007 at 10:55:37PM +0400, Oleg Nesterov wrote:
> > + do {
> > + retry = 1;
Of course this'll be shorter:
retry = 0;
> > + spin_lock_irq(&cwq->lock);
> > + /* CPU_DEAD in progress may change cwq */
> > + if (likely(cwq == get_wq_data(work))) {
> > + list_del_init(&work->entry);
> > + __set_bit(WORK_STRUCT_PENDING, work_data_bits(work));
> > + retry = try_to_del_timer_sync(&dwork->timer) < 0;
> > + }
> > + spin_unlock_irq(&cwq->lock);
> > + } while (unlikely(retry));
...
Jarek P.
* Re: Fw: [PATCH -mm] workqueue: debug possible endless loop in cancel_rearming_delayed_work
2007-04-25 12:28 ` Jarek Poplawski
@ 2007-04-25 12:47 ` Oleg Nesterov
2007-04-25 14:47 ` Oleg Nesterov
2007-04-26 13:13 ` Jarek Poplawski
0 siblings, 2 replies; 19+ messages in thread
From: Oleg Nesterov @ 2007-04-25 12:47 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: Andrew Morton, Ingo Molnar, linux-kernel, David Howells
On 04/25, Jarek Poplawski wrote:
>
> On Wed, Apr 25, 2007 at 02:20:38PM +0200, Jarek Poplawski wrote:
> > 2 cents more...
> ...
> > On Tue, Apr 24, 2007 at 10:55:37PM +0400, Oleg Nesterov wrote:
> > > + do {
> > > + retry = 1;
>
> Of course this'll be shorter:
>
> retry = 0;
No, this would be wrong. Note the comment about CPU-hotplug below,
we should retry if cwq was changed.
> > > + spin_lock_irq(&cwq->lock);
> > > + /* CPU_DEAD in progress may change cwq */
> > > + if (likely(cwq == get_wq_data(work))) {
> > > + list_del_init(&work->entry);
> > > + __set_bit(WORK_STRUCT_PENDING, work_data_bits(work));
> > > + retry = try_to_del_timer_sync(&dwork->timer) < 0;
> > > + }
> > > + spin_unlock_irq(&cwq->lock);
> > > + } while (unlikely(retry));
> 1. If delayed_work_timer_fn of this work is fired and is waiting
> on the above spin_lock then, after above spin_unlock, the work
> will be queued.
No, in that case try_to_del_timer_sync() returns -1.
> Probably this is also possible without timer i.e.
> with queue_work.
Yes, thanks. While adding cpu-hotplug check I forgot to add ->current_work
check, which is needed to actually implement this
> > Note that cancel_rearming_delayed_work() now can handle the works
> > which re-arm itself via queue_work(), not only queue_delayed_work().
part. I'll resend after fix.
> 2. If this function is fired after setting _PENDING flag in
> queue_delayed_work_on, but before add_timer, this
> try_to_del_timer_sync loop would miss this, too.
same as above, thanks.
Oleg.
* Re: Fw: [PATCH -mm] workqueue: debug possible endless loop in cancel_rearming_delayed_work
2007-04-25 12:47 ` Oleg Nesterov
@ 2007-04-25 14:47 ` Oleg Nesterov
2007-04-26 12:59 ` Jarek Poplawski
2007-04-26 13:13 ` Jarek Poplawski
1 sibling, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2007-04-25 14:47 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: Andrew Morton, Ingo Molnar, linux-kernel, David Howells
On 04/25, Oleg Nesterov wrote:
>
> On 04/25, Jarek Poplawski wrote:
> >
> > Probably this is also possible without timer i.e.
> > with queue_work.
>
> Yes, thanks. While adding cpu-hotplug check I forgot to add ->current_work
> check, which is needed to actually implement this
>
> > > Note that cancel_rearming_delayed_work() now can handle the works
> > > which re-arm itself via queue_work(), not only queue_delayed_work().
>
> part. I'll resend after fix.
Hm. But can't we do better? Looks like we don't need to check ->current_work,
void cancel_rearming_delayed_work(struct delayed_work *dwork)
{
	struct work_struct *work = &dwork->work;
	struct cpu_workqueue_struct *cwq = get_wq_data(work);
	int done;

	do {
		done = 1;
		spin_lock_irq(&cwq->lock);

		if (!list_empty(&work->entry))
			list_del_init(&work->entry);
		else if (test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work)))
			done = del_timer(&dwork->timer);

		spin_unlock_irq(&cwq->lock);
	} while (!done);

	/*
	 * Nobody can clear WORK_STRUCT_PENDING. This means that the
	 * work can't be re-queued and the timer can't be re-started.
	 */
	needs_a_good_name(cwq->wq, work);
	work_clear_pending(work);
}
Jarek, I didn't think much about this, just a new idea. I am posting this code
in a hope you can review it while I sleep on this... CPU-hotplug is ignored for
now. Note that this version doesn't need the change in run_workqueue().
Oleg.
* Re: Fw: [PATCH -mm] workqueue: debug possible endless loop in cancel_rearming_delayed_work
2007-04-25 14:47 ` Oleg Nesterov
@ 2007-04-26 12:59 ` Jarek Poplawski
2007-04-26 16:34 ` Oleg Nesterov
0 siblings, 1 reply; 19+ messages in thread
From: Jarek Poplawski @ 2007-04-26 12:59 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Andrew Morton, Ingo Molnar, linux-kernel, David Howells
On Wed, Apr 25, 2007 at 06:47:59PM +0400, Oleg Nesterov wrote:
> On 04/25, Oleg Nesterov wrote:
> >
> > On 04/25, Jarek Poplawski wrote:
> > >
> > > Probably this is also possible without timer i.e.
> > > with queue_work.
> >
> > Yes, thanks. While adding cpu-hotplug check I forgot to add ->current_work
> > check, which is needed to actually implement this
> >
> > > > Note that cancel_rearming_delayed_work() now can handle the works
> > > > which re-arm itself via queue_work(), not only queue_delayed_work().
> >
> > part. I'll resend after fix.
>
> Hm. But can't we do better? Looks like we don't need to check ->current_work,
>
> void cancel_rearming_delayed_work(struct delayed_work *dwork)
> {
> struct work_struct *work = &dwork->work;
> struct cpu_workqueue_struct *cwq = get_wq_data(work);
> int done;
I don't understand, why you think cwq cannot be NULL here.
>
> do {
> done = 1;
> spin_lock_irq(&cwq->lock);
>
> if (!list_empty(&work->entry))
> list_del_init(&work->entry);
BTW, isn't needs_a_good_name() needless after this, and after a positive del_timer()?
> else if (test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work)))
> done = del_timer(&dwork->timer);
If this runs while a work function is fired in run_workqueue,
it sets _PENDING bit, but if the work skips rearming, we have probably
endless loop, again.
>
> spin_unlock_irq(&cwq->lock);
> } while (!done);
>
> /*
> * Nobody can clear WORK_STRUCT_PENDING. This means that the
> * work can't be re-queued and the timer can't be re-started.
> */
> needs_a_good_name(cwq->wq, work);
> work_clear_pending(work);
> }
>
> Jarek, I didn't think much about this, just a new idea. I am posting this code
> in a hope you can review it while I sleep on this... CPU-hotplug is ignored for
> now. Note that this version doesn't need the change in run_workqueue().
It's a very interesting proposal, but also hard to analyse - the locks
are taken and released, and it is hard to foresee when and where
the loop regains the lock again. It is something akin to the current
way, with some added measures: you try to shoot a work on the run,
while queued or timer_pending, plus the _PENDING flag set, so it seems
there is some risk of longer than planned looping.
I have to look at this more at home and, if there is something new, I'll write
tomorrow. So the good news is you should have enough sleep this time!
Cheers,
Jarek P.
* Re: Fw: [PATCH -mm] workqueue: debug possible endless loop in cancel_rearming_delayed_work
2007-04-25 12:47 ` Oleg Nesterov
2007-04-25 14:47 ` Oleg Nesterov
@ 2007-04-26 13:13 ` Jarek Poplawski
2007-04-26 16:44 ` Oleg Nesterov
1 sibling, 1 reply; 19+ messages in thread
From: Jarek Poplawski @ 2007-04-26 13:13 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Andrew Morton, Ingo Molnar, linux-kernel, David Howells
On Wed, Apr 25, 2007 at 04:47:14PM +0400, Oleg Nesterov wrote:
> On 04/25, Jarek Poplawski wrote:
...
> > > > + spin_lock_irq(&cwq->lock);
> > > > + /* CPU_DEAD in progress may change cwq */
> > > > + if (likely(cwq == get_wq_data(work))) {
> > > > + list_del_init(&work->entry);
> > > > + __set_bit(WORK_STRUCT_PENDING, work_data_bits(work));
> > > > + retry = try_to_del_timer_sync(&dwork->timer) < 0;
> > > > + }
> > > > + spin_unlock_irq(&cwq->lock);
> > > > + } while (unlikely(retry));
>
> > 1. If delayed_work_timer_fn of this work is fired and is waiting
> > on the above spin_lock then, after above spin_unlock, the work
> > will be queued.
>
> No, in that case try_to_del_timer_sync() returns -1.
Yes. But I think it's safe only after moving work_clear_pending
in run_workqueue under a lock; probably otherwise there is a
possibility this flag could be cleared, after above unlock.
Jarek P.
* Re: Fw: [PATCH -mm] workqueue: debug possible endless loop in cancel_rearming_delayed_work
2007-04-26 12:59 ` Jarek Poplawski
@ 2007-04-26 16:34 ` Oleg Nesterov
2007-04-27 5:26 ` Jarek Poplawski
0 siblings, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2007-04-26 16:34 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: Andrew Morton, Ingo Molnar, linux-kernel, David Howells
On 04/26, Jarek Poplawski wrote:
>
> > void cancel_rearming_delayed_work(struct delayed_work *dwork)
> > {
> > struct work_struct *work = &dwork->work;
> > struct cpu_workqueue_struct *cwq = get_wq_data(work);
> > int done;
>
> I don't understand, why you think cwq cannot be NULL here.
sure it can, this is just a template.
> >
> > do {
> > done = 1;
> > spin_lock_irq(&cwq->lock);
> >
> > if (!list_empty(&work->entry))
> > list_del_init(&work->entry);
>
> BTW, isn't needs_a_good_name() needless after this, and after a positive del_timer()?
no, we still need it. work->func() may be running on another CPU as well.
>
> > else if (test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work)))
> > > done = del_timer(&dwork->timer);
>
> If this runs while a work function is fired in run_workqueue,
> it sets _PENDING bit, but if the work skips rearming, we have probably
> endless loop, again.
No, if the work skips rearming (or didn't yet), we set WORK_STRUCT_PENDING
successfully.
> It is something akin to the current
> way, with some added measures: you try to shoot a work on the run,
> while queued or timer_pending, plus the _PENDING flag set, so it seems
> there is some risk of longer than planned looping.
Sorry, can't understand. done == 0 means that the queueing in progress,
this work should be placed on cwq->worklist very soon, most probably
right after we drop cwq->lock.
> I have to look at this more at home and, if there is something new, I'll write
> tomorrow. So the good news is you should have enough sleep this time!
Thanks for review!
Oleg.
* Re: Fw: [PATCH -mm] workqueue: debug possible endless loop in cancel_rearming_delayed_work
2007-04-26 13:13 ` Jarek Poplawski
@ 2007-04-26 16:44 ` Oleg Nesterov
2007-04-27 5:52 ` Jarek Poplawski
0 siblings, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2007-04-26 16:44 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: Andrew Morton, Ingo Molnar, linux-kernel, David Howells
On 04/26, Jarek Poplawski wrote:
>
> On Wed, Apr 25, 2007 at 04:47:14PM +0400, Oleg Nesterov wrote:
> ...
> > > > > + spin_lock_irq(&cwq->lock);
> > > > > + /* CPU_DEAD in progress may change cwq */
> > > > > + if (likely(cwq == get_wq_data(work))) {
> > > > > + list_del_init(&work->entry);
> > > > > + __set_bit(WORK_STRUCT_PENDING, work_data_bits(work));
> > > > > + retry = try_to_del_timer_sync(&dwork->timer) < 0;
> > > > > + }
> > > > > + spin_unlock_irq(&cwq->lock);
> > > > > + } while (unlikely(retry));
> >
> > > 1. If delayed_work_timer_fn of this work is fired and is waiting
> > > on the above spin_lock then, after above spin_unlock, the work
> > > will be queued.
> >
> > No, in that case try_to_del_timer_sync() returns -1.
>
> Yes. But I think it's safe only after moving work_clear_pending
> in run_workqueue under a lock; probably otherwise there is a
> possibility this flag could be cleared, after above unlock.
It doesn't matter in this particular case because we are going to retry
anyway. But yes, this patch moves work_clear_pending() under lock, because
otherwise it could be cleared by run_workqueue() if this work is about
to be executed, but was already deleted from list.
Oleg.
* Re: Fw: [PATCH -mm] workqueue: debug possible endless loop in cancel_rearming_delayed_work
2007-04-26 16:34 ` Oleg Nesterov
@ 2007-04-27 5:26 ` Jarek Poplawski
2007-04-27 7:52 ` Oleg Nesterov
0 siblings, 1 reply; 19+ messages in thread
From: Jarek Poplawski @ 2007-04-27 5:26 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Andrew Morton, Ingo Molnar, linux-kernel, David Howells
On Thu, Apr 26, 2007 at 08:34:06PM +0400, Oleg Nesterov wrote:
> On 04/26, Jarek Poplawski wrote:
> >
> > > void cancel_rearming_delayed_work(struct delayed_work *dwork)
> > > {
> > > struct work_struct *work = &dwork->work;
> > > struct cpu_workqueue_struct *cwq = get_wq_data(work);
> > > int done;
> >
> > I don't understand, why you think cwq cannot be NULL here.
>
> sure it can, this is just a template.
>
> > >
> > > do {
> > > done = 1;
> > > spin_lock_irq(&cwq->lock);
> > >
> > > if (!list_empty(&work->entry))
> > > list_del_init(&work->entry);
> >
> > BTW, isn't needs_a_good_name() needless after this, and after a positive del_timer()?
>
> no, we still need it. work->func() may be running on another CPU as well.
>
> >
> > > else if (test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work)))
> > > done = del_timer(&dwork->timer)
> >
> > If this runs while a work function is fired in run_workqueue,
> > it sets _PENDING bit, but if the work skips rearming, we have probably
> > endless loop, again.
>
> No, if the work skips rearming (or didn't yet), we set WORK_STRUCT_PENDING
> successfully.
Sorry! Should be:
"If this runs while a work function is fired in run_workqueue,
it sets _PENDING bit, but if the work skips rearming, I have probably
endless loop, again."
>
> > It is something akin to the current
> > way, with some added measures: you try to shoot a work on the run,
> > while queued or timer_pending, plus the _PENDING flag set, so it seems
> > there is some risk of longer than planned looping.
>
> Sorry, can't understand. done == 0 means that the queueing in progress,
> this work should be placed on cwq->worklist very soon, most probably
> right after we drop cwq->lock.
I think, theoretically, probably, maybe, there is possible some strange
case, this function gets spin_lock only when: list_empty(&work->entry) == 1
&& _PENDING == 1 && del_timer(&dwork->timer) == 0.
>
> > I have to look at this more at home and, if there is something new, I'll write
> > tomorrow. So the good news is you should have enough sleep this time!
>
> Thanks for review!
OK. Here is the review:
It looks great!!! I cannot believe, it could be so "easy"!
Regards,
Jarek P.
PS: probably unusable, but for my own satisfaction:
Acked-by: Jarek Poplawski <jarkao2@o2.pl>
* Re: Fw: [PATCH -mm] workqueue: debug possible endless loop in cancel_rearming_delayed_work
2007-04-26 16:44 ` Oleg Nesterov
@ 2007-04-27 5:52 ` Jarek Poplawski
0 siblings, 0 replies; 19+ messages in thread
From: Jarek Poplawski @ 2007-04-27 5:52 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Andrew Morton, Ingo Molnar, linux-kernel, David Howells
On Thu, Apr 26, 2007 at 08:44:36PM +0400, Oleg Nesterov wrote:
> On 04/26, Jarek Poplawski wrote:
> >
> > On Wed, Apr 25, 2007 at 04:47:14PM +0400, Oleg Nesterov wrote:
> > ...
> > > > > > + spin_lock_irq(&cwq->lock);
> > > > > > + /* CPU_DEAD in progress may change cwq */
> > > > > > + if (likely(cwq == get_wq_data(work))) {
> > > > > > + list_del_init(&work->entry);
> > > > > > + __set_bit(WORK_STRUCT_PENDING, work_data_bits(work));
> > > > > > + retry = try_to_del_timer_sync(&dwork->timer) < 0;
> > > > > > + }
> > > > > > + spin_unlock_irq(&cwq->lock);
> > > > > > + } while (unlikely(retry));
> > >
> > > > 1. If delayed_work_timer_fn of this work is fired and is waiting
> > > > on the above spin_lock then, after above spin_unlock, the work
> > > > will be queued.
> > >
> > > No, in that case try_to_del_timer_sync() returns -1.
> >
> > Yes. But I think it's safe only after moving work_clear_pending
> > in run_workqueue under a lock; probably otherwise there is a
> > possibility this flag could be cleared, after above unlock.
>
> It doesn't matter in this particular case because we are going to retry
> anyway. But yes, this patch moves work_clear_pending() under lock, because
> otherwise it could be cleared by run_workqueue() if this work is about
> to be executed, but was already deleted from list.
...and it seems to be the same as what I meant...
I only wanted to get agreement (now it's only for historical reasons)
that the lock on _PENDING could matter in run_workqueue.
Jarek P.
* Re: Fw: [PATCH -mm] workqueue: debug possible endless loop in cancel_rearming_delayed_work
2007-04-27 5:26 ` Jarek Poplawski
@ 2007-04-27 7:52 ` Oleg Nesterov
2007-04-27 9:03 ` Jarek Poplawski
0 siblings, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2007-04-27 7:52 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: Andrew Morton, Ingo Molnar, linux-kernel, David Howells
On 04/27, Jarek Poplawski wrote:
>
> On Thu, Apr 26, 2007 at 08:34:06PM +0400, Oleg Nesterov wrote:
> > >
> > > > else if (test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work)))
> > > > done = del_timer(&dwork->timer);
> > >
> > [...snip...]
> > > It is something akin to the current
> > > way, with some added measures: you try to shoot a work on the run,
> > > while queued or timer_pending, plus the _PENDING flag set, so it seems
> > > there is some risk of longer than planned looping.
> >
> > Sorry, can't understand. done == 0 means that the queueing in progress,
> > this work should be placed on cwq->worklist very soon, most probably
> > right after we drop cwq->lock.
>
> I think, theoretically, probably, maybe, there is possible some strange
> case, this function gets spin_lock only when: list_empty(&work->entry) == 1
> && _PENDING == 1 && del_timer(&dwork->timer) == 0.
Yes, but this is not so strange, this means the queueing in progress. Most
probably the "owner" of WORK_STRUCT_PENDING bit spins waiting for cwq->lock.
We will retry in this case. Of course, if we have a workqueue with the single
work which just re-arms itself via queue_work() (without delay) and does nothing
more, we may need a lot of looping.
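(I.e. the hypothetical worst case would be something as silly as:

static void my_work_func(struct work_struct *work)
{
	/* no delay: WORK_STRUCT_PENDING is set again almost immediately */
	schedule_work(work);
}
static DECLARE_DELAYED_WORK(my_dwork, my_work_func);

- then cancel_rearming_delayed_work(&my_dwork) keeps finding the queueing
"in progress" and has to loop.)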
> PS: probably unusable, but for my own satisfaction:
>
> Acked-by: Jarek Poplawski <jarkao2@o2.pl>
It is useable, at least for me. I hope you will re-ack when I actually send
the patch. Note that the "else" branch above doesn't need cwq->lock, and we
should start with del_timer(), because the pending timer is the most common
case.
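Roughly, the loop body would then become something like this (an untested
sketch only, cpu-hotplug and the !cwq check still ignored, names as in the
template above):

	for (;;) {
		/* most common case: the re-arming timer is still pending */
		if (del_timer(&dwork->timer))
			break;		/* timer killed, we own PENDING now */

		/* not pending at all? then nobody can queue it after this */
		if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work)))
			break;

		/* PENDING set, no timer: either queued or queueing in flight */
		spin_lock_irq(&cwq->lock);
		if (!list_empty(&work->entry)) {
			list_del_init(&work->entry);
			spin_unlock_irq(&cwq->lock);
			break;
		}
		spin_unlock_irq(&cwq->lock);
		/* queueing in progress, retry */
	}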
Oleg.
* Re: Fw: [PATCH -mm] workqueue: debug possible endless loop in cancel_rearming_delayed_work
2007-04-27 7:52 ` Oleg Nesterov
@ 2007-04-27 9:03 ` Jarek Poplawski
0 siblings, 0 replies; 19+ messages in thread
From: Jarek Poplawski @ 2007-04-27 9:03 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Andrew Morton, Ingo Molnar, linux-kernel, David Howells
On Fri, Apr 27, 2007 at 11:52:47AM +0400, Oleg Nesterov wrote:
> On 04/27, Jarek Poplawski wrote:
...
> > > Sorry, can't understand. done == 0 means that the queueing in progress,
> > > this work should be placed on cwq->worklist very soon, most probably
> > > right after we drop cwq->lock.
> >
> > I think, theoretically, probably, maybe, there is possible some strange
> > case, this function gets spin_lock only when: list_empty(&work->entry) == 1
> > && _PENDING == 1 && del_timer(&dwork->timer) == 0.
>
> Yes, but this is not so strange, this means the queueing in progress. Most
> probably the "owner" of WORK_STRUCT_PENDING bit spins waiting for cwq->lock.
> We will retry in this case. Of course, if we have a workqueue with the single
> work which just re-arms itself via queue_work() (without delay) and does nothing
> more, we may need a lot of looping.
I've forgotten most of the math already, but there is (probably)
some Parkinson's Law about it. So, by this strange case I
mean a really large amount of looping (something around infinity - quite
precisely).
>
> > PS: probably unusable, but for my own satisfaction:
> >
> > Acked-by: Jarek Poplawski <jarkao2@o2.pl>
>
> It is useable, at least for me. I hope you will re-ack when I actually send
This is even more strange...
BTW, I take a week of vacation (people here deserve to rest
from me), so let's say it's both acked and re-acked by me.
> the patch. Note that the "else" branch above doesn't need cwq->lock, and we
> should start with del_timer(), because the pending timer is the most common
> case.
I see - you've probably thought about it more than you said,
so I trust you 100% here (but will check later, anyway...).
Cheers,
Jarek P.