LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH/RFC] workqueue: allow rescuer thread to do more work.
@ 2014-10-29  6:26 NeilBrown
  2014-10-29 14:32 ` Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: NeilBrown @ 2014-10-29  6:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2500 bytes --]


When there is serious memory pressure, all workers in a pool could be blocked,
and a new thread cannot be created because it requires memory allocation.

In this situation a WQ_MEM_RECLAIM workqueue will wake up the
rescuer thread to do some work.

The rescuer will only handle requests that are already on ->worklist.
If max_requests is 1, that means it will handle a single request.

The rescuer will be woken again in 100ms to handle another max_requests
requests.

I've seen a machine (running a 3.0 based "enterprise" kernel) with thousands
of requests queued for xfslogd, which has a max_requests of 1, and is needed
for retiring all 'xfs' write requests.  When one of the worker pools gets
into this state, it progresses extremely slowly and possibly never recovers
(only waited an hour or two).

So if, after handling everything on worklist, there is again something on
worklist (counted in nr_active), and if the queue is still congested, keep
processing instead of waiting for the next wake-up.

Signed-off-by: NeilBrown <neilb@suse.de>
---

Hi Tejun,
  I haven't tested this patch yet so this really is an 'RFC'.
In general ->nr_active should only be accessed under the pool->lock,
but a miss-read here will at most cause a very occasional 100ms delay so
shouldn't be a big problem.  The only thread likely to change ->nr_active is
this thread, so such a delay would be extremely unlikely.

Thanks,
NeilBrown


diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 09b685daee3d..d0a8b101c5d9 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2244,16 +2244,18 @@ repeat:
 		spin_lock_irq(&pool->lock);
 		rescuer->pool = pool;
 
-		/*
-		 * Slurp in all works issued via this workqueue and
-		 * process'em.
-		 */
-		WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
-		list_for_each_entry_safe(work, n, &pool->worklist, entry)
-			if (get_work_pwq(work) == pwq)
-				move_linked_works(work, scheduled, &n);
+		do {
+			/*
+			 * Slurp in all works issued via this workqueue and
+			 * process'em.
+			 */
+			WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
+			list_for_each_entry_safe(work, n, &pool->worklist, entry)
+				if (get_work_pwq(work) == pwq)
+					move_linked_works(work, scheduled, &n);
 
-		process_scheduled_works(rescuer);
+			process_scheduled_works(rescuer);
+		} while (need_more_worker(pool) && pwq->nr_active);
 
 		/*
 		 * Put the reference grabbed by send_mayday().  @pool won't

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH/RFC] workqueue: allow rescuer thread to do more work.
  2014-10-29  6:26 [PATCH/RFC] workqueue: allow rescuer thread to do more work NeilBrown
@ 2014-10-29 14:32 ` Tejun Heo
  2014-10-29 23:19   ` NeilBrown
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2014-10-29 14:32 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-kernel

Hello, Neil.

On Wed, Oct 29, 2014 at 05:26:08PM +1100, NeilBrown wrote:
> Hi Tejun,
>   I haven't tested this patch yet so this really is an 'RFC'.
> In general ->nr_active should only be accessed under the pool->lock,
> but a miss-read here will at most cause a very occasional 100ms delay so
> shouldn't be a big problem.  The only thread likely to change ->nr_active is
> this thread, so such a delay would be extremely unlikely.
> 
> Thanks,
> NeilBrown
> 
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 09b685daee3d..d0a8b101c5d9 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -2244,16 +2244,18 @@ repeat:
>  		spin_lock_irq(&pool->lock);
>  		rescuer->pool = pool;
>  
> -		/*
> -		 * Slurp in all works issued via this workqueue and
> -		 * process'em.
> -		 */
> -		WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
> -		list_for_each_entry_safe(work, n, &pool->worklist, entry)
> -			if (get_work_pwq(work) == pwq)
> -				move_linked_works(work, scheduled, &n);
> +		do {
> +			/*
> +			 * Slurp in all works issued via this workqueue and
> +			 * process'em.
> +			 */
> +			WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
> +			list_for_each_entry_safe(work, n, &pool->worklist, entry)
> +				if (get_work_pwq(work) == pwq)
> +					move_linked_works(work, scheduled, &n);
>  
> -		process_scheduled_works(rescuer);
> +			process_scheduled_works(rescuer);
> +		} while (need_more_worker(pool) && pwq->nr_active);

need_more_worker(pool) is always true for unbound pools as long as
there are work items queued, so the above condition may stay true
longer than it needs to.  Given that workder depletion is pool-wide
event, maybe it'd make sense to trigger rescuers immediately while
workers are in short supply?  e.g. while there's a manager stuck in
maybe_create_worker() with the mayday timer already triggered?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH/RFC] workqueue: allow rescuer thread to do more work.
  2014-10-29 14:32 ` Tejun Heo
@ 2014-10-29 23:19   ` NeilBrown
  2014-11-04 14:22     ` Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: NeilBrown @ 2014-10-29 23:19 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2446 bytes --]

On Wed, 29 Oct 2014 10:32:10 -0400 Tejun Heo <tj@kernel.org> wrote:

> Hello, Neil.
> 
> On Wed, Oct 29, 2014 at 05:26:08PM +1100, NeilBrown wrote:
> > Hi Tejun,
> >   I haven't tested this patch yet so this really is an 'RFC'.
> > In general ->nr_active should only be accessed under the pool->lock,
> > but a miss-read here will at most cause a very occasional 100ms delay so
> > shouldn't be a big problem.  The only thread likely to change ->nr_active is
> > this thread, so such a delay would be extremely unlikely.
> > 
> > Thanks,
> > NeilBrown
> > 
> > 
> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index 09b685daee3d..d0a8b101c5d9 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -2244,16 +2244,18 @@ repeat:
> >  		spin_lock_irq(&pool->lock);
> >  		rescuer->pool = pool;
> >  
> > -		/*
> > -		 * Slurp in all works issued via this workqueue and
> > -		 * process'em.
> > -		 */
> > -		WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
> > -		list_for_each_entry_safe(work, n, &pool->worklist, entry)
> > -			if (get_work_pwq(work) == pwq)
> > -				move_linked_works(work, scheduled, &n);
> > +		do {
> > +			/*
> > +			 * Slurp in all works issued via this workqueue and
> > +			 * process'em.
> > +			 */
> > +			WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
> > +			list_for_each_entry_safe(work, n, &pool->worklist, entry)
> > +				if (get_work_pwq(work) == pwq)
> > +					move_linked_works(work, scheduled, &n);
> >  
> > -		process_scheduled_works(rescuer);
> > +			process_scheduled_works(rescuer);
> > +		} while (need_more_worker(pool) && pwq->nr_active);
> 
> need_more_worker(pool) is always true for unbound pools as long as
> there are work items queued, so the above condition may stay true
> longer than it needs to.

Because ->nr_running isn't maintained for WORKER_UNBOUND (which is part of
WORKER_NOT_RUNNING) - got it.

>                           Given that workder depletion is pool-wide
> event, maybe it'd make sense to trigger rescuers immediately while
> workers are in short supply?  e.g. while there's a manager stuck in
> maybe_create_worker() with the mayday timer already triggered?

So what if I change "need_more_worker" to "need_to_create_worker" ?
Then it will stop as soon as there in an idle worker thread.
That is the condition that keeps maybe_create_worker() looping.
??

Thanks,
NeilBrown


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH/RFC] workqueue: allow rescuer thread to do more work.
  2014-10-29 23:19   ` NeilBrown
@ 2014-11-04 14:22     ` Tejun Heo
  2014-11-06 16:58       ` Dongsu Park
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2014-11-04 14:22 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-kernel

Hello, Neil.

Sorry about the delay.

On Thu, Oct 30, 2014 at 10:19:32AM +1100, NeilBrown wrote:
> >                           Given that workder depletion is pool-wide
> > event, maybe it'd make sense to trigger rescuers immediately while
> > workers are in short supply?  e.g. while there's a manager stuck in
> > maybe_create_worker() with the mayday timer already triggered?
> 
> So what if I change "need_more_worker" to "need_to_create_worker" ?
> Then it will stop as soon as there in an idle worker thread.
> That is the condition that keeps maybe_create_worker() looping.
> ??

Yeah, that'd be a better condition and can work out.  Can you please
write up a patch to do that and do some synthetic tests excercising
the code path?  Also please cc Lai Jiangshan <laijs@cn.fujitsu.com>
when posting the patch.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH/RFC] workqueue: allow rescuer thread to do more work.
  2014-11-04 14:22     ` Tejun Heo
@ 2014-11-06 16:58       ` Dongsu Park
  2014-11-07  3:03         ` Lai Jiangshan
  0 siblings, 1 reply; 26+ messages in thread
From: Dongsu Park @ 2014-11-06 16:58 UTC (permalink / raw)
  To: Tejun Heo, NeilBrown; +Cc: Lai Jiangshan, linux-kernel

Hi Tejun & Neil,

On 04.11.2014 09:22, Tejun Heo wrote:
> On Thu, Oct 30, 2014 at 10:19:32AM +1100, NeilBrown wrote:
> > >                           Given that workder depletion is pool-wide
> > > event, maybe it'd make sense to trigger rescuers immediately while
> > > workers are in short supply?  e.g. while there's a manager stuck in
> > > maybe_create_worker() with the mayday timer already triggered?
> > 
> > So what if I change "need_more_worker" to "need_to_create_worker" ?
> > Then it will stop as soon as there in an idle worker thread.
> > That is the condition that keeps maybe_create_worker() looping.
> > ??
> 
> Yeah, that'd be a better condition and can work out.  Can you please
> write up a patch to do that and do some synthetic tests excercising
> the code path?  Also please cc Lai Jiangshan <laijs@cn.fujitsu.com>
> when posting the patch.

This issue looks exactly like what I've encountered occasionally in our test
setup. (with a custom kernel based on 3.12, MD/raid1, dm-multipath, etc.)
When a system suffers from high memory pressure, and at the same time
underlying devices of RAID arrays are repeatedly removed and re-added,
then sometimes the whole system gets locked up on a worker pool's lock.
So I had to fix our custom MD code to allocate a separate ordered workqueue
with WQ_MEM_RECLAIM, apart from md_wq or md_misc_wq.
Then the lockup seemed to have disappeared.

Now that I read the Neil's patch, which looks like an ultimate solution
to the problem I have seen. I'm really looking forward to seeing this
change in mainline.

How about the attached patch? Based on the Neil's patch, I replaced
need_more_worker() with need_to_create_worker() as Tejun suggested.

Test is running with this patch, which seems to be working for now.
But I'm going to observe the test result carefully for a few more days.

Regards,
Dongsu

----
>From de9aadd6fb742ea8acce4245a27946d3f233ab7f Mon Sep 17 00:00:00 2001
From: Dongsu Park <dongsu.park@profitbricks.com>
Date: Wed, 5 Nov 2014 17:28:07 +0100
Subject: [RFC PATCH] workqueue: allow rescuer thread to do more work

Original commit message from NeilBrown <neilb@suse.de>:
====
When there is serious memory pressure, all workers in a pool could be
blocked, and a new thread cannot be created because it requires memory
allocation.

In this situation a WQ_MEM_RECLAIM workqueue will wake up the rescuer
thread to do some work.

The rescuer will only handle requests that are already on ->worklist.
If max_requests is 1, that means it will handle a single request.

The rescuer will be woken again in 100ms to handle another max_requests
requests.

I've seen a machine (running a 3.0 based "enterprise" kernel) with
thousands of requests queued for xfslogd, which has a max_requests of 1,
and is needed for retiring all 'xfs' write requests.  When one of the
worker pools gets into this state, it progresses extremely slowly and
possibly never recovers (only waited an hour or two).

So if, after handling everything on worklist, there is again something
on worklist (counted in nr_active), and if the queue is still congested,
keep processing instead of waiting for the next wake-up.
====

Dongsu Park: replaced need_more_worker() with need_to_create_worker(),
as suggested by Tejun.

Signed-off-by: Dongsu Park <dongsu.park@profitbricks.com>
Link: https://lkml.org/lkml/2014/10/29/55
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Original-by: NeilBrown <neilb@suse.de>
Signed-off-by: NeilBrown <neilb@suse.de>
---
 kernel/workqueue.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 09b685d..4d20225 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2244,16 +2244,19 @@ repeat:
 		spin_lock_irq(&pool->lock);
 		rescuer->pool = pool;
 
-		/*
-		 * Slurp in all works issued via this workqueue and
-		 * process'em.
-		 */
-		WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
-		list_for_each_entry_safe(work, n, &pool->worklist, entry)
-			if (get_work_pwq(work) == pwq)
-				move_linked_works(work, scheduled, &n);
-
-		process_scheduled_works(rescuer);
+		do {
+			/*
+			 * Slurp in all works issued via this workqueue and
+			 * process'em.
+			 */
+			WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
+			list_for_each_entry_safe(work, n, &pool->worklist,
+					entry)
+				if (get_work_pwq(work) == pwq)
+					move_linked_works(work, scheduled, &n);
+
+			process_scheduled_works(rescuer);
+		} while (need_to_create_worker(pool) && pwq->nr_active);
 
 		/*
 		 * Put the reference grabbed by send_mayday().  @pool won't
-- 
1.9.3


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH/RFC] workqueue: allow rescuer thread to do more work.
  2014-11-06 16:58       ` Dongsu Park
@ 2014-11-07  3:03         ` Lai Jiangshan
  2014-11-10  5:28           ` NeilBrown
  0 siblings, 1 reply; 26+ messages in thread
From: Lai Jiangshan @ 2014-11-07  3:03 UTC (permalink / raw)
  To: Dongsu Park; +Cc: Tejun Heo, NeilBrown, linux-kernel



On 11/07/2014 12:58 AM, Dongsu Park wrote:
> Hi Tejun & Neil,
> 
> On 04.11.2014 09:22, Tejun Heo wrote:
>> On Thu, Oct 30, 2014 at 10:19:32AM +1100, NeilBrown wrote:
>>>>                           Given that workder depletion is pool-wide
>>>> event, maybe it'd make sense to trigger rescuers immediately while
>>>> workers are in short supply?  e.g. while there's a manager stuck in
>>>> maybe_create_worker() with the mayday timer already triggered?
>>>
>>> So what if I change "need_more_worker" to "need_to_create_worker" ?
>>> Then it will stop as soon as there in an idle worker thread.
>>> That is the condition that keeps maybe_create_worker() looping.
>>> ??
>>
>> Yeah, that'd be a better condition and can work out.  Can you please
>> write up a patch to do that and do some synthetic tests excercising
>> the code path?  Also please cc Lai Jiangshan <laijs@cn.fujitsu.com>
>> when posting the patch.
> 
> This issue looks exactly like what I've encountered occasionally in our test
> setup. (with a custom kernel based on 3.12, MD/raid1, dm-multipath, etc.)
> When a system suffers from high memory pressure, and at the same time
> underlying devices of RAID arrays are repeatedly removed and re-added,
> then sometimes the whole system gets locked up on a worker pool's lock.
> So I had to fix our custom MD code to allocate a separate ordered workqueue
> with WQ_MEM_RECLAIM, apart from md_wq or md_misc_wq.
> Then the lockup seemed to have disappeared.
> 
> Now that I read the Neil's patch, which looks like an ultimate solution
> to the problem I have seen. I'm really looking forward to seeing this
> change in mainline.
> 
> How about the attached patch? Based on the Neil's patch, I replaced
> need_more_worker() with need_to_create_worker() as Tejun suggested.
> 
> Test is running with this patch, which seems to be working for now.
> But I'm going to observe the test result carefully for a few more days.
> 
> Regards,
> Dongsu
> 
> ----
>>From de9aadd6fb742ea8acce4245a27946d3f233ab7f Mon Sep 17 00:00:00 2001
> From: Dongsu Park <dongsu.park@profitbricks.com>
> Date: Wed, 5 Nov 2014 17:28:07 +0100
> Subject: [RFC PATCH] workqueue: allow rescuer thread to do more work
> 
> Original commit message from NeilBrown <neilb@suse.de>:
> ====
> When there is serious memory pressure, all workers in a pool could be
> blocked, and a new thread cannot be created because it requires memory
> allocation.
> 
> In this situation a WQ_MEM_RECLAIM workqueue will wake up the rescuer
> thread to do some work.
> 
> The rescuer will only handle requests that are already on ->worklist.
> If max_requests is 1, that means it will handle a single request.
> 
> The rescuer will be woken again in 100ms to handle another max_requests
> requests.


I also observed this problem by review when I was developing
the per-pwq-worklist patchset which has a side-affect that it also naturally
fix the problem.

However, it is nothing about correctness and I made promise to Frederic Weisbecker
for working on unbound pool for power-saving, then the per-pwq-worklist patchset
is put off. So I have to ack it.

> 
> I've seen a machine (running a 3.0 based "enterprise" kernel) with
> thousands of requests queued for xfslogd, which has a max_requests of 1,
> and is needed for retiring all 'xfs' write requests.  When one of the
> worker pools gets into this state, it progresses extremely slowly and
> possibly never recovers (only waited an hour or two).
> 
> So if, after handling everything on worklist, there is again something
> on worklist (counted in nr_active), and if the queue is still congested,
> keep processing instead of waiting for the next wake-up.
> ====
> 
> Dongsu Park: replaced need_more_worker() with need_to_create_worker(),
> as suggested by Tejun.
> 
> Signed-off-by: Dongsu Park <dongsu.park@profitbricks.com>
> Link: https://lkml.org/lkml/2014/10/29/55
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
> Original-by: NeilBrown <neilb@suse.de>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  kernel/workqueue.c | 23 +++++++++++++----------
>  1 file changed, 13 insertions(+), 10 deletions(-)
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 09b685d..4d20225 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -2244,16 +2244,19 @@ repeat:
>  		spin_lock_irq(&pool->lock);
>  		rescuer->pool = pool;
>  
> -		/*
> -		 * Slurp in all works issued via this workqueue and
> -		 * process'em.
> -		 */
> -		WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
> -		list_for_each_entry_safe(work, n, &pool->worklist, entry)
> -			if (get_work_pwq(work) == pwq)
> -				move_linked_works(work, scheduled, &n);
> -
> -		process_scheduled_works(rescuer);
> +		do {
> +			/*
> +			 * Slurp in all works issued via this workqueue and
> +			 * process'em.
> +			 */
> +			WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
> +			list_for_each_entry_safe(work, n, &pool->worklist,
> +					entry)
> +				if (get_work_pwq(work) == pwq)
> +					move_linked_works(work, scheduled, &n);
> +
> +			process_scheduled_works(rescuer);
> +		} while (need_to_create_worker(pool) && pwq->nr_active);
>  
>  		/*
>  		 * Put the reference grabbed by send_mayday().  @pool won't


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH/RFC] workqueue: allow rescuer thread to do more work.
  2014-11-07  3:03         ` Lai Jiangshan
@ 2014-11-10  5:28           ` NeilBrown
  2014-11-10  8:52             ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: NeilBrown @ 2014-11-10  5:28 UTC (permalink / raw)
  To: Lai Jiangshan, Tejun Heo; +Cc: Dongsu Park, linux-kernel, Jan Kara

[-- Attachment #1: Type: text/plain, Size: 7931 bytes --]

On Fri, 7 Nov 2014 11:03:40 +0800 Lai Jiangshan <laijs@cn.fujitsu.com> wrote:

> 
> 
> On 11/07/2014 12:58 AM, Dongsu Park wrote:
> > Hi Tejun & Neil,
> > 
> > On 04.11.2014 09:22, Tejun Heo wrote:
> >> On Thu, Oct 30, 2014 at 10:19:32AM +1100, NeilBrown wrote:
> >>>>                           Given that workder depletion is pool-wide
> >>>> event, maybe it'd make sense to trigger rescuers immediately while
> >>>> workers are in short supply?  e.g. while there's a manager stuck in
> >>>> maybe_create_worker() with the mayday timer already triggered?
> >>>
> >>> So what if I change "need_more_worker" to "need_to_create_worker" ?
> >>> Then it will stop as soon as there in an idle worker thread.
> >>> That is the condition that keeps maybe_create_worker() looping.
> >>> ??
> >>
> >> Yeah, that'd be a better condition and can work out.  Can you please
> >> write up a patch to do that and do some synthetic tests excercising
> >> the code path?  Also please cc Lai Jiangshan <laijs@cn.fujitsu.com>
> >> when posting the patch.
> > 
> > This issue looks exactly like what I've encountered occasionally in our test
> > setup. (with a custom kernel based on 3.12, MD/raid1, dm-multipath, etc.)
> > When a system suffers from high memory pressure, and at the same time
> > underlying devices of RAID arrays are repeatedly removed and re-added,
> > then sometimes the whole system gets locked up on a worker pool's lock.
> > So I had to fix our custom MD code to allocate a separate ordered workqueue
> > with WQ_MEM_RECLAIM, apart from md_wq or md_misc_wq.
> > Then the lockup seemed to have disappeared.
> > 
> > Now that I read the Neil's patch, which looks like an ultimate solution
> > to the problem I have seen. I'm really looking forward to seeing this
> > change in mainline.
> > 
> > How about the attached patch? Based on the Neil's patch, I replaced
> > need_more_worker() with need_to_create_worker() as Tejun suggested.
> > 
> > Test is running with this patch, which seems to be working for now.
> > But I'm going to observe the test result carefully for a few more days.
> > 
> > Regards,
> > Dongsu
> > 
> > ----
> >>From de9aadd6fb742ea8acce4245a27946d3f233ab7f Mon Sep 17 00:00:00 2001
> > From: Dongsu Park <dongsu.park@profitbricks.com>
> > Date: Wed, 5 Nov 2014 17:28:07 +0100
> > Subject: [RFC PATCH] workqueue: allow rescuer thread to do more work
> > 
> > Original commit message from NeilBrown <neilb@suse.de>:
> > ====
> > When there is serious memory pressure, all workers in a pool could be
> > blocked, and a new thread cannot be created because it requires memory
> > allocation.
> > 
> > In this situation a WQ_MEM_RECLAIM workqueue will wake up the rescuer
> > thread to do some work.
> > 
> > The rescuer will only handle requests that are already on ->worklist.
> > If max_requests is 1, that means it will handle a single request.
> > 
> > The rescuer will be woken again in 100ms to handle another max_requests
> > requests.
> 
> 
> I also observed this problem by review when I was developing
> the per-pwq-worklist patchset which has a side-affect that it also naturally
> fix the problem.
> 
> However, it is nothing about correctness and I made promise to Frederic Weisbecker
> for working on unbound pool for power-saving, then the per-pwq-worklist patchset
> is put off. So I have to ack it.

Thanks!
However testing showed that the patch isn't quite right.
The test on ->nr_active is not correct.  I was meaning to test "are there
any requests that have been activated but not yet serviced", but this test
only covers the first half.

If a queue allows a number of active requests (max_active > 1), and several
are blocked waiting for something (e.g. more memory), then max_active will be
positive even though there is no useful work for the rescuer thread to do -
so it will spin.

Jan Kara and I came up with a different patch which testing has shown is
quite successful.  However it makes changes to when mayday_clear_cpu() is
set, and that isn't relevant in the current kernel.

I've ported the patch to -mainline, but haven't really tested it properly
(just compile tested so far).
That version is below.

thanks,
NeilBrown


> 
> > 
> > I've seen a machine (running a 3.0 based "enterprise" kernel) with
> > thousands of requests queued for xfslogd, which has a max_requests of 1,
> > and is needed for retiring all 'xfs' write requests.  When one of the
> > worker pools gets into this state, it progresses extremely slowly and
> > possibly never recovers (only waited an hour or two).
> > 
> > So if, after handling everything on worklist, there is again something
> > on worklist (counted in nr_active), and if the queue is still congested,
> > keep processing instead of waiting for the next wake-up.
> > ====
> > 
> > Dongsu Park: replaced need_more_worker() with need_to_create_worker(),
> > as suggested by Tejun.
> > 
> > Signed-off-by: Dongsu Park <dongsu.park@profitbricks.com>
> > Link: https://lkml.org/lkml/2014/10/29/55
> > Cc: Tejun Heo <tj@kernel.org>
> > Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
> > Original-by: NeilBrown <neilb@suse.de>
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  kernel/workqueue.c | 23 +++++++++++++----------
> >  1 file changed, 13 insertions(+), 10 deletions(-)
> > 
> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index 09b685d..4d20225 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -2244,16 +2244,19 @@ repeat:
> >  		spin_lock_irq(&pool->lock);
> >  		rescuer->pool = pool;
> >  
> > -		/*
> > -		 * Slurp in all works issued via this workqueue and
> > -		 * process'em.
> > -		 */
> > -		WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
> > -		list_for_each_entry_safe(work, n, &pool->worklist, entry)
> > -			if (get_work_pwq(work) == pwq)
> > -				move_linked_works(work, scheduled, &n);
> > -
> > -		process_scheduled_works(rescuer);
> > +		do {
> > +			/*
> > +			 * Slurp in all works issued via this workqueue and
> > +			 * process'em.
> > +			 */
> > +			WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
> > +			list_for_each_entry_safe(work, n, &pool->worklist,
> > +					entry)
> > +				if (get_work_pwq(work) == pwq)
> > +					move_linked_works(work, scheduled, &n);
> > +
> > +			process_scheduled_works(rescuer);
> > +		} while (need_to_create_worker(pool) && pwq->nr_active);
> >  
> >  		/*
> >  		 * Put the reference grabbed by send_mayday().  @pool won't


From: NeilBrown <neilb@suse.de>
Subject: workqueue: Make rescuer thread process more works

Currently workqueue rescuer thread processes at most max_active works from a
workqueue before it goes back to sleep for 100 ms. Especially for workqueues
with low max_active this leads to rescuer being very slow and when queued
work is blocking reclaim it leads to machine taking very long time (minutes
or more) to recover from a situation when new workers cannot be created.

Fix the problem by going through worklist until either new worker is created
or all no new works can be found.

We remove and re-add the pool_workqueue to the mayday list so that each pool_workqueue
so that no one pool_workqueue can starve the others.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: NeilBrown <neilb@suse.de>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 09b685daee3d..19ecee70e3e9 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2253,6 +2253,10 @@ repeat:
 			if (get_work_pwq(work) == pwq)
 				move_linked_works(work, scheduled, &n);
 
+		if (!list_empty(scheduled) && need_to_create_worker(pool))
+			/* Try again, in case more requests get added */
+			if (list_empty(&pwq->mayday_node))
+				list_add_tail(&pwq->mayday_node, &wq->maydays);
 		process_scheduled_works(rescuer);
 
 		/*

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH/RFC] workqueue: allow rescuer thread to do more work.
  2014-11-10  5:28           ` NeilBrown
@ 2014-11-10  8:52             ` Jan Kara
  2014-11-10 22:04               ` NeilBrown
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2014-11-10  8:52 UTC (permalink / raw)
  To: NeilBrown; +Cc: Lai Jiangshan, Tejun Heo, Dongsu Park, linux-kernel, Jan Kara

On Mon 10-11-14 16:28:48, NeilBrown wrote:
> On Fri, 7 Nov 2014 11:03:40 +0800 Lai Jiangshan <laijs@cn.fujitsu.com> wrote:
> > On 11/07/2014 12:58 AM, Dongsu Park wrote:
> > > Hi Tejun & Neil,
> > > 
> > > On 04.11.2014 09:22, Tejun Heo wrote:
> > >> On Thu, Oct 30, 2014 at 10:19:32AM +1100, NeilBrown wrote:
> > >>>>                           Given that workder depletion is pool-wide
> > >>>> event, maybe it'd make sense to trigger rescuers immediately while
> > >>>> workers are in short supply?  e.g. while there's a manager stuck in
> > >>>> maybe_create_worker() with the mayday timer already triggered?
> > >>>
> > >>> So what if I change "need_more_worker" to "need_to_create_worker" ?
> > >>> Then it will stop as soon as there in an idle worker thread.
> > >>> That is the condition that keeps maybe_create_worker() looping.
> > >>> ??
> > >>
> > >> Yeah, that'd be a better condition and can work out.  Can you please
> > >> write up a patch to do that and do some synthetic tests excercising
> > >> the code path?  Also please cc Lai Jiangshan <laijs@cn.fujitsu.com>
> > >> when posting the patch.
> > > 
> > > This issue looks exactly like what I've encountered occasionally in our test
> > > setup. (with a custom kernel based on 3.12, MD/raid1, dm-multipath, etc.)
> > > When a system suffers from high memory pressure, and at the same time
> > > underlying devices of RAID arrays are repeatedly removed and re-added,
> > > then sometimes the whole system gets locked up on a worker pool's lock.
> > > So I had to fix our custom MD code to allocate a separate ordered workqueue
> > > with WQ_MEM_RECLAIM, apart from md_wq or md_misc_wq.
> > > Then the lockup seemed to have disappeared.
> > > 
> > > Now that I read the Neil's patch, which looks like an ultimate solution
> > > to the problem I have seen. I'm really looking forward to seeing this
> > > change in mainline.
> > > 
> > > How about the attached patch? Based on the Neil's patch, I replaced
> > > need_more_worker() with need_to_create_worker() as Tejun suggested.
> > > 
> > > Test is running with this patch, which seems to be working for now.
> > > But I'm going to observe the test result carefully for a few more days.
> > > 
> > > Regards,
> > > Dongsu
> > > 
> > > ----
> > >>From de9aadd6fb742ea8acce4245a27946d3f233ab7f Mon Sep 17 00:00:00 2001
> > > From: Dongsu Park <dongsu.park@profitbricks.com>
> > > Date: Wed, 5 Nov 2014 17:28:07 +0100
> > > Subject: [RFC PATCH] workqueue: allow rescuer thread to do more work
> > > 
> > > Original commit message from NeilBrown <neilb@suse.de>:
> > > ====
> > > When there is serious memory pressure, all workers in a pool could be
> > > blocked, and a new thread cannot be created because it requires memory
> > > allocation.
> > > 
> > > In this situation a WQ_MEM_RECLAIM workqueue will wake up the rescuer
> > > thread to do some work.
> > > 
> > > The rescuer will only handle requests that are already on ->worklist.
> > > If max_requests is 1, that means it will handle a single request.
> > > 
> > > The rescuer will be woken again in 100ms to handle another max_requests
> > > requests.
> > 
> > 
> > I also observed this problem by review when I was developing
> > the per-pwq-worklist patchset which has a side-affect that it also naturally
> > fix the problem.
> > 
> > However, it is nothing about correctness and I made promise to Frederic Weisbecker
> > for working on unbound pool for power-saving, then the per-pwq-worklist patchset
> > is put off. So I have to ack it.
> 
> Thanks!
> However testing showed that the patch isn't quite right.
> The test on ->nr_active is not correct.  I was meaning to test "are there
> any requests that have been activated but not yet serviced", but this test
> only covers the first half.
> 
> If a queue allows a number of active requests (max_active > 1), and several
> are blocked waiting for something (e.g. more memory), then max_active will be
> positive even though there is no useful work for the rescuer thread to do -
> so it will spin.
> 
> Jan Kara and I came up with a different patch which testing has shown is
> quite successful.  However it makes changes to when mayday_clear_cpu() is
> set, and that isn't relevant in the current kernel.
> 
> I've ported the patch to -mainline, but haven't really tested it properly
> (just compile tested so far).
> That version is below.
...
> 
> From: NeilBrown <neilb@suse.de>
> Subject: workqueue: Make rescuer thread process more works
> 
> Currently workqueue rescuer thread processes at most max_active works from a
> workqueue before it goes back to sleep for 100 ms. Especially for workqueues
> with low max_active this leads to rescuer being very slow and when queued
> work is blocking reclaim it leads to machine taking very long time (minutes
> or more) to recover from a situation when new workers cannot be created.
> 
> Fix the problem by going through worklist until either new worker is created
> or all no new works can be found.
> 
> We remove and re-add the pool_workqueue to the mayday list so that each pool_workqueue
> so that no one pool_workqueue can starve the others.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> Signed-off-by: NeilBrown <neilb@suse.de>
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 09b685daee3d..19ecee70e3e9 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -2253,6 +2253,10 @@ repeat:
>  			if (get_work_pwq(work) == pwq)
>  				move_linked_works(work, scheduled, &n);
>  
> +		if (!list_empty(scheduled) && need_to_create_worker(pool))
> +			/* Try again, in case more requests get added */
> +			if (list_empty(&pwq->mayday_node))
> +				list_add_tail(&pwq->mayday_node, &wq->maydays);
>  		process_scheduled_works(rescuer);
  This is certainly missing locking - we need to hold wq_mayday_lock when
changing wq->maydays list. Otherwise the patch looks good to me.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH/RFC] workqueue: allow rescuer thread to do more work.
  2014-11-10  8:52             ` Jan Kara
@ 2014-11-10 22:04               ` NeilBrown
  2014-11-14 17:21                 ` Tejun Heo
  2014-11-18  4:27                 ` [PATCH - v3?] " NeilBrown
  0 siblings, 2 replies; 26+ messages in thread
From: NeilBrown @ 2014-11-10 22:04 UTC (permalink / raw)
  To: Jan Kara; +Cc: Lai Jiangshan, Tejun Heo, Dongsu Park, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 7795 bytes --]

On Mon, 10 Nov 2014 09:52:50 +0100 Jan Kara <jack@suse.cz> wrote:

> On Mon 10-11-14 16:28:48, NeilBrown wrote:
> > On Fri, 7 Nov 2014 11:03:40 +0800 Lai Jiangshan <laijs@cn.fujitsu.com> wrote:
> > > On 11/07/2014 12:58 AM, Dongsu Park wrote:
> > > > Hi Tejun & Neil,
> > > > 
> > > > On 04.11.2014 09:22, Tejun Heo wrote:
> > > >> On Thu, Oct 30, 2014 at 10:19:32AM +1100, NeilBrown wrote:
> > > >>>>                           Given that workder depletion is pool-wide
> > > >>>> event, maybe it'd make sense to trigger rescuers immediately while
> > > >>>> workers are in short supply?  e.g. while there's a manager stuck in
> > > >>>> maybe_create_worker() with the mayday timer already triggered?
> > > >>>
> > > >>> So what if I change "need_more_worker" to "need_to_create_worker" ?
> > > >>> Then it will stop as soon as there in an idle worker thread.
> > > >>> That is the condition that keeps maybe_create_worker() looping.
> > > >>> ??
> > > >>
> > > >> Yeah, that'd be a better condition and can work out.  Can you please
> > > >> write up a patch to do that and do some synthetic tests excercising
> > > >> the code path?  Also please cc Lai Jiangshan <laijs@cn.fujitsu.com>
> > > >> when posting the patch.
> > > > 
> > > > This issue looks exactly like what I've encountered occasionally in our test
> > > > setup. (with a custom kernel based on 3.12, MD/raid1, dm-multipath, etc.)
> > > > When a system suffers from high memory pressure, and at the same time
> > > > underlying devices of RAID arrays are repeatedly removed and re-added,
> > > > then sometimes the whole system gets locked up on a worker pool's lock.
> > > > So I had to fix our custom MD code to allocate a separate ordered workqueue
> > > > with WQ_MEM_RECLAIM, apart from md_wq or md_misc_wq.
> > > > Then the lockup seemed to have disappeared.
> > > > 
> > > > Now that I read the Neil's patch, which looks like an ultimate solution
> > > > to the problem I have seen. I'm really looking forward to seeing this
> > > > change in mainline.
> > > > 
> > > > How about the attached patch? Based on the Neil's patch, I replaced
> > > > need_more_worker() with need_to_create_worker() as Tejun suggested.
> > > > 
> > > > Test is running with this patch, which seems to be working for now.
> > > > But I'm going to observe the test result carefully for a few more days.
> > > > 
> > > > Regards,
> > > > Dongsu
> > > > 
> > > > ----
> > > >>From de9aadd6fb742ea8acce4245a27946d3f233ab7f Mon Sep 17 00:00:00 2001
> > > > From: Dongsu Park <dongsu.park@profitbricks.com>
> > > > Date: Wed, 5 Nov 2014 17:28:07 +0100
> > > > Subject: [RFC PATCH] workqueue: allow rescuer thread to do more work
> > > > 
> > > > Original commit message from NeilBrown <neilb@suse.de>:
> > > > ====
> > > > When there is serious memory pressure, all workers in a pool could be
> > > > blocked, and a new thread cannot be created because it requires memory
> > > > allocation.
> > > > 
> > > > In this situation a WQ_MEM_RECLAIM workqueue will wake up the rescuer
> > > > thread to do some work.
> > > > 
> > > > The rescuer will only handle requests that are already on ->worklist.
> > > > If max_requests is 1, that means it will handle a single request.
> > > > 
> > > > The rescuer will be woken again in 100ms to handle another max_requests
> > > > requests.
> > > 
> > > 
> > > I also observed this problem by review when I was developing
> > > the per-pwq-worklist patchset which has a side-affect that it also naturally
> > > fix the problem.
> > > 
> > > However, it is nothing about correctness and I made promise to Frederic Weisbecker
> > > for working on unbound pool for power-saving, then the per-pwq-worklist patchset
> > > is put off. So I have to ack it.
> > 
> > Thanks!
> > However testing showed that the patch isn't quite right.
> > The test on ->nr_active is not correct.  I was meaning to test "are there
> > any requests that have been activated but not yet serviced", but this test
> > only covers the first half.
> > 
> > If a queue allows a number of active requests (max_active > 1), and several
> > are blocked waiting for something (e.g. more memory), then max_active will be
> > positive even though there is no useful work for the rescuer thread to do -
> > so it will spin.
> > 
> > Jan Kara and I came up with a different patch which testing has shown is
> > quite successful.  However it makes changes to when mayday_clear_cpu() is
> > set, and that isn't relevant in the current kernel.
> > 
> > I've ported the patch to -mainline, but haven't really tested it properly
> > (just compile tested so far).
> > That version is below.
> ...
> > 
> > From: NeilBrown <neilb@suse.de>
> > Subject: workqueue: Make rescuer thread process more works
> > 
> > Currently workqueue rescuer thread processes at most max_active works from a
> > workqueue before it goes back to sleep for 100 ms. Especially for workqueues
> > with low max_active this leads to rescuer being very slow and when queued
> > work is blocking reclaim it leads to machine taking very long time (minutes
> > or more) to recover from a situation when new workers cannot be created.
> > 
> > Fix the problem by going through worklist until either new worker is created
> > or all no new works can be found.
> > 
> > We remove and re-add the pool_workqueue to the mayday list so that each pool_workqueue
> > so that no one pool_workqueue can starve the others.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > 
> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index 09b685daee3d..19ecee70e3e9 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -2253,6 +2253,10 @@ repeat:
> >  			if (get_work_pwq(work) == pwq)
> >  				move_linked_works(work, scheduled, &n);
> >  
> > +		if (!list_empty(scheduled) && need_to_create_worker(pool))
> > +			/* Try again, in case more requests get added */
> > +			if (list_empty(&pwq->mayday_node))
> > +				list_add_tail(&pwq->mayday_node, &wq->maydays);
> >  		process_scheduled_works(rescuer);
>   This is certainly missing locking - we need to hold wq_mayday_lock when
> changing wq->maydays list. Otherwise the patch looks good to me.
> 
> 								Honza


Thanks... I can't just take wq_mayday_lock to cover that code as we already
have pool->lock and they nest the other way.

What do people think of this approach?

We hold onto wq_mayday_lock a bit longer, until we know if there is  really
any work to do.
The bit I'm least sure of is moving worker_attach_to_pool() after
"rescuer->pool = pool".  Might that be a problem?

Thanks,
NeilBrown



diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 09b685daee3d..f2db6073c498 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2235,11 +2235,6 @@ repeat:
 		struct work_struct *work, *n;
 
 		__set_current_state(TASK_RUNNING);
-		list_del_init(&pwq->mayday_node);
-
-		spin_unlock_irq(&wq_mayday_lock);
-
-		worker_attach_to_pool(rescuer, pool);
 
 		spin_lock_irq(&pool->lock);
 		rescuer->pool = pool;
@@ -2253,8 +2248,16 @@ repeat:
 			if (get_work_pwq(work) == pwq)
 				move_linked_works(work, scheduled, &n);
 
-		process_scheduled_works(rescuer);
+		if (list_empty(scheduled) || !need_to_create_worker(pool))
+			/* We can let go of this one  now */
+			list_del_init(&pwq->mayday_node);
+		spin_unlock_irq(&wq_mayday_lock);
+
+		if (!list_empty(scheduled)) {
+			worker_attach_to_pool(rescuer, pool);
 
+			process_scheduled_works(rescuer);
+		}
 		/*
 		 * Put the reference grabbed by send_mayday().  @pool won't
 		 * go away while we're still attached to it.

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH/RFC] workqueue: allow rescuer thread to do more work.
  2014-11-10 22:04               ` NeilBrown
@ 2014-11-14 17:21                 ` Tejun Heo
  2014-11-18  4:27                 ` [PATCH - v3?] " NeilBrown
  1 sibling, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2014-11-14 17:21 UTC (permalink / raw)
  To: NeilBrown; +Cc: Jan Kara, Lai Jiangshan, Dongsu Park, linux-kernel

Hello,

On Tue, Nov 11, 2014 at 09:04:02AM +1100, NeilBrown wrote:
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 09b685daee3d..f2db6073c498 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -2235,11 +2235,6 @@ repeat:
>  		struct work_struct *work, *n;
>  
>  		__set_current_state(TASK_RUNNING);
> -		list_del_init(&pwq->mayday_node);
> -
> -		spin_unlock_irq(&wq_mayday_lock);
> -
> -		worker_attach_to_pool(rescuer, pool);
>  
>  		spin_lock_irq(&pool->lock);
>  		rescuer->pool = pool;
> @@ -2253,8 +2248,16 @@ repeat:
>  			if (get_work_pwq(work) == pwq)
>  				move_linked_works(work, scheduled, &n);
>  
> -		process_scheduled_works(rescuer);
> +		if (list_empty(scheduled) || !need_to_create_worker(pool))
> +			/* We can let go of this one  now */
> +			list_del_init(&pwq->mayday_node);

We prolly want to rotate the processed one to the back of the list too
so that the rescuer round robins across multiple mayday requests?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH - v3?] workqueue: allow rescuer thread to do more work.
  2014-11-10 22:04               ` NeilBrown
  2014-11-14 17:21                 ` Tejun Heo
@ 2014-11-18  4:27                 ` NeilBrown
  2014-11-18  6:01                   ` Lai Jiangshan
  2014-12-02 20:43                   ` Tejun Heo
  1 sibling, 2 replies; 26+ messages in thread
From: NeilBrown @ 2014-11-18  4:27 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jan Kara, Lai Jiangshan, Dongsu Park, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 4035 bytes --]


When there is serious memory pressure, all workers in a pool could be
blocked, and a new thread cannot be created because it requires memory
allocation.

In this situation a WQ_MEM_RECLAIM workqueue will wake up the
rescuer thread to do some work.

The rescuer will only handle requests that are already on ->worklist.
If max_requests is 1, that means it will handle a single request.

The rescuer will be woken again in 100ms to handle another max_requests
requests.

I've seen a machine (running a 3.0 based "enterprise" kernel) with
thousands of requests queued for xfslogd, which has a max_requests of
1, and is needed for retiring all 'xfs' write requests.  When one of
the worker pools gets into this state, it progresses extremely slowly
and possibly never recovers (only waited an hour or two).

With this patch we leave a pool_workqueue on mayday list
until it is clearly no longer in need of assistance.  This allows
all requests to be handled in a timely fashion.

The code is a bit awkward due to the need to hold both wq_mayday_lock
and pool->lock at the same time, and due to the lock ordering imposed
on them.  In particular we move work items from the ->worklist to the
rescuer list while holding both locks because we need pool->lock
to do the move, and need wq_mayday_lock to manipulate the mayday list
after we have found out if there was anything to do.

'need_to_create_worker()' is called *before* moving work items off
pool->worklist as an empty worklist will make it return false, but
after the move_linked_works() calls and before the
process_scheduled_works() call, an empty worklist does not indicate
that there is no work to do.

We keep each pool_workqueue on the mayday list until
need_to_create_worker() is false, and no work for this workqueue is
found in the pool.

As the main rescuer loop now iterates an arbitrary number of time,
cond_resched() was inserted to avoid imposing excessive latencies.

I have tested this in combination with a (hackish) patch which forces
all work items to be handled by the rescuer thread.  In that context
it significantly improves performance.  A similar patch for a 3.0
kernel significantly improved performance on a heavy work load.

Thanks to Jan Kara for some design ideas, and to Dongsu Park for
some comments and testing.

Cc: Dongsu Park <dongsu.park@profitbricks.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: NeilBrown <neilb@suse.de>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index caedde34ee7f..4baa7b8b7e0f 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2253,26 +2253,36 @@ repeat:
 					struct pool_workqueue, mayday_node);
 		struct worker_pool *pool = pwq->pool;
 		struct work_struct *work, *n;
+		int still_needed;
 
 		__set_current_state(TASK_RUNNING);
-		list_del_init(&pwq->mayday_node);
-
-		spin_unlock_irq(&wq_mayday_lock);
-
-		worker_attach_to_pool(rescuer, pool);
-
-		spin_lock_irq(&pool->lock);
-		rescuer->pool = pool;
-
+		spin_lock(&pool->lock);
 		/*
 		 * Slurp in all works issued via this workqueue and
 		 * process'em.
 		 */
 		WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
+		still_needed = need_to_create_worker(pool);
 		list_for_each_entry_safe(work, n, &pool->worklist, entry)
 			if (get_work_pwq(work) == pwq)
 				move_linked_works(work, scheduled, &n);
 
+		if (!list_empty(scheduled))
+			still_needed = 1;
+		if (still_needed) {
+			list_move_tail(&pwq->mayday_node, &wq->maydays);
+			get_pwq(pwq);
+		} else
+			/* We can let go of this one now */
+			list_del_init(&pwq->mayday_node);
+
+		spin_unlock(&pool->lock);
+		spin_unlock_irq(&wq_mayday_lock);
+
+		worker_attach_to_pool(rescuer, pool);
+
+		spin_lock_irq(&pool->lock);
+		rescuer->pool = pool;
 		process_scheduled_works(rescuer);
 
 		/*
@@ -2293,7 +2303,7 @@ repeat:
 		spin_unlock_irq(&pool->lock);
 
 		worker_detach_from_pool(rescuer, pool);
-
+		cond_resched();
 		spin_lock_irq(&wq_mayday_lock);
 	}
 

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH - v3?] workqueue: allow rescuer thread to do more work.
  2014-11-18  4:27                 ` [PATCH - v3?] " NeilBrown
@ 2014-11-18  6:01                   ` Lai Jiangshan
  2014-11-18  6:11                     ` NeilBrown
  2014-12-02 20:43                   ` Tejun Heo
  1 sibling, 1 reply; 26+ messages in thread
From: Lai Jiangshan @ 2014-11-18  6:01 UTC (permalink / raw)
  To: NeilBrown; +Cc: Tejun Heo, Jan Kara, Dongsu Park, linux-kernel

On 11/18/2014 12:27 PM, NeilBrown wrote:
> 
> When there is serious memory pressure, all workers in a pool could be
> blocked, and a new thread cannot be created because it requires memory
> allocation.
> 
> In this situation a WQ_MEM_RECLAIM workqueue will wake up the
> rescuer thread to do some work.
> 
> The rescuer will only handle requests that are already on ->worklist.
> If max_requests is 1, that means it will handle a single request.
> 
> The rescuer will be woken again in 100ms to handle another max_requests
> requests.
> 
> I've seen a machine (running a 3.0 based "enterprise" kernel) with
> thousands of requests queued for xfslogd, which has a max_requests of
> 1, and is needed for retiring all 'xfs' write requests.  When one of
> the worker pools gets into this state, it progresses extremely slowly
> and possibly never recovers (only waited an hour or two).
> 
> With this patch we leave a pool_workqueue on mayday list
> until it is clearly no longer in need of assistance.  This allows
> all requests to be handled in a timely fashion.
> 
> The code is a bit awkward due to the need to hold both wq_mayday_lock
> and pool->lock at the same time, and due to the lock ordering imposed
> on them.  In particular we move work items from the ->worklist to the
> rescuer list while holding both locks because we need pool->lock
> to do the move, and need wq_mayday_lock to manipulate the mayday list
> after we have found out if there was anything to do.
> 
> 'need_to_create_worker()' is called *before* moving work items off
> pool->worklist as an empty worklist will make it return false, but
> after the move_linked_works() calls and before the
> process_scheduled_works() call, an empty worklist does not indicate
> that there is no work to do.
> 
> We keep each pool_workqueue on the mayday list until
> need_to_create_worker() is false, and no work for this workqueue is
> found in the pool.
> 
> As the main rescuer loop now iterates an arbitrary number of time,
> cond_resched() was inserted to avoid imposing excessive latencies.
> 
> I have tested this in combination with a (hackish) patch which forces
> all work items to be handled by the rescuer thread.  In that context
> it significantly improves performance.  A similar patch for a 3.0
> kernel significantly improved performance on a heavy work load.
> 
> Thanks to Jan Kara for some design ideas, and to Dongsu Park for
> some comments and testing.
> 
> Cc: Dongsu Park <dongsu.park@profitbricks.com>
> Cc: Jan Kara <jack@suse.cz>
> Signed-off-by: NeilBrown <neilb@suse.de>
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index caedde34ee7f..4baa7b8b7e0f 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -2253,26 +2253,36 @@ repeat:
>  					struct pool_workqueue, mayday_node);
>  		struct worker_pool *pool = pwq->pool;
>  		struct work_struct *work, *n;
> +		int still_needed;
>  
>  		__set_current_state(TASK_RUNNING);
> -		list_del_init(&pwq->mayday_node);
> -
> -		spin_unlock_irq(&wq_mayday_lock);
> -
> -		worker_attach_to_pool(rescuer, pool);
> -
> -		spin_lock_irq(&pool->lock);
> -		rescuer->pool = pool;
> -
> +		spin_lock(&pool->lock);
>  		/*
>  		 * Slurp in all works issued via this workqueue and
>  		 * process'em.
>  		 */
>  		WARN_ON_ONCE(!list_empty(&rescuer->scheduled));

> +		still_needed = need_to_create_worker(pool);

This line of code will cause the rescuer busy-loop even no work-item pending
on the workqueue.

>  		list_for_each_entry_safe(work, n, &pool->worklist, entry)
>  			if (get_work_pwq(work) == pwq)
>  				move_linked_works(work, scheduled, &n);
>  
> +		if (!list_empty(scheduled))
> +			still_needed = 1;
> +		if (still_needed) {
> +			list_move_tail(&pwq->mayday_node, &wq->maydays);
> +			get_pwq(pwq);
> +		} else
> +			/* We can let go of this one now */
> +			list_del_init(&pwq->mayday_node);
> +
> +		spin_unlock(&pool->lock);
> +		spin_unlock_irq(&wq_mayday_lock);
> +
> +		worker_attach_to_pool(rescuer, pool);
> +
> +		spin_lock_irq(&pool->lock);
> +		rescuer->pool = pool;
>  		process_scheduled_works(rescuer);
>  
>  		/*
> @@ -2293,7 +2303,7 @@ repeat:
>  		spin_unlock_irq(&pool->lock);
>  
>  		worker_detach_from_pool(rescuer, pool);
> -
> +		cond_resched();
>  		spin_lock_irq(&wq_mayday_lock);
>  	}
>  


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH - v3?] workqueue: allow rescuer thread to do more work.
  2014-11-18  6:01                   ` Lai Jiangshan
@ 2014-11-18  6:11                     ` NeilBrown
  0 siblings, 0 replies; 26+ messages in thread
From: NeilBrown @ 2014-11-18  6:11 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: Tejun Heo, Jan Kara, Dongsu Park, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 5094 bytes --]

On Tue, 18 Nov 2014 14:01:32 +0800 Lai Jiangshan <laijs@cn.fujitsu.com> wrote:

> On 11/18/2014 12:27 PM, NeilBrown wrote:
> > 
> > When there is serious memory pressure, all workers in a pool could be
> > blocked, and a new thread cannot be created because it requires memory
> > allocation.
> > 
> > In this situation a WQ_MEM_RECLAIM workqueue will wake up the
> > rescuer thread to do some work.
> > 
> > The rescuer will only handle requests that are already on ->worklist.
> > If max_requests is 1, that means it will handle a single request.
> > 
> > The rescuer will be woken again in 100ms to handle another max_requests
> > requests.
> > 
> > I've seen a machine (running a 3.0 based "enterprise" kernel) with
> > thousands of requests queued for xfslogd, which has a max_requests of
> > 1, and is needed for retiring all 'xfs' write requests.  When one of
> > the worker pools gets into this state, it progresses extremely slowly
> > and possibly never recovers (only waited an hour or two).
> > 
> > With this patch we leave a pool_workqueue on mayday list
> > until it is clearly no longer in need of assistance.  This allows
> > all requests to be handled in a timely fashion.
> > 
> > The code is a bit awkward due to the need to hold both wq_mayday_lock
> > and pool->lock at the same time, and due to the lock ordering imposed
> > on them.  In particular we move work items from the ->worklist to the
> > rescuer list while holding both locks because we need pool->lock
> > to do the move, and need wq_mayday_lock to manipulate the mayday list
> > after we have found out if there was anything to do.
> > 
> > 'need_to_create_worker()' is called *before* moving work items off
> > pool->worklist as an empty worklist will make it return false, but
> > after the move_linked_works() calls and before the
> > process_scheduled_works() call, an empty worklist does not indicate
> > that there is no work to do.
> > 
> > We keep each pool_workqueue on the mayday list until
> > need_to_create_worker() is false, and no work for this workqueue is
> > found in the pool.
> > 
> > As the main rescuer loop now iterates an arbitrary number of time,
> > cond_resched() was inserted to avoid imposing excessive latencies.
> > 
> > I have tested this in combination with a (hackish) patch which forces
> > all work items to be handled by the rescuer thread.  In that context
> > it significantly improves performance.  A similar patch for a 3.0
> > kernel significantly improved performance on a heavy work load.
> > 
> > Thanks to Jan Kara for some design ideas, and to Dongsu Park for
> > some comments and testing.
> > 
> > Cc: Dongsu Park <dongsu.park@profitbricks.com>
> > Cc: Jan Kara <jack@suse.cz>
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > 
> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index caedde34ee7f..4baa7b8b7e0f 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -2253,26 +2253,36 @@ repeat:
> >  					struct pool_workqueue, mayday_node);
> >  		struct worker_pool *pool = pwq->pool;
> >  		struct work_struct *work, *n;
> > +		int still_needed;
> >  
> >  		__set_current_state(TASK_RUNNING);
> > -		list_del_init(&pwq->mayday_node);
> > -
> > -		spin_unlock_irq(&wq_mayday_lock);
> > -
> > -		worker_attach_to_pool(rescuer, pool);
> > -
> > -		spin_lock_irq(&pool->lock);
> > -		rescuer->pool = pool;
> > -
> > +		spin_lock(&pool->lock);
> >  		/*
> >  		 * Slurp in all works issued via this workqueue and
> >  		 * process'em.
> >  		 */
> >  		WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
> 
> > +		still_needed = need_to_create_worker(pool);
> 
> This line of code will cause the rescuer busy-loop even no work-item pending
> on the workqueue.

Thanks for the review...

> 
> >  		list_for_each_entry_safe(work, n, &pool->worklist, entry)
> >  			if (get_work_pwq(work) == pwq)
> >  				move_linked_works(work, scheduled, &n);
> >  
> > +		if (!list_empty(scheduled))
> > +			still_needed = 1;

This should have been

   if (list_empty(scheduled))
        still_needed = 0;

which would address your concern.  My original code effectively did that, but
when I restructured it a little to make it more readable, I broke it :-(

I'll repost tomorrow if there are no further comments.

Thanks.

NeilBrown



> > +		if (still_needed) {
> > +			list_move_tail(&pwq->mayday_node, &wq->maydays);
> > +			get_pwq(pwq);
> > +		} else
> > +			/* We can let go of this one now */
> > +			list_del_init(&pwq->mayday_node);
> > +
> > +		spin_unlock(&pool->lock);
> > +		spin_unlock_irq(&wq_mayday_lock);
> > +
> > +		worker_attach_to_pool(rescuer, pool);
> > +
> > +		spin_lock_irq(&pool->lock);
> > +		rescuer->pool = pool;
> >  		process_scheduled_works(rescuer);
> >  
> >  		/*
> > @@ -2293,7 +2303,7 @@ repeat:
> >  		spin_unlock_irq(&pool->lock);
> >  
> >  		worker_detach_from_pool(rescuer, pool);
> > -
> > +		cond_resched();
> >  		spin_lock_irq(&wq_mayday_lock);
> >  	}
> >  


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH - v3?] workqueue: allow rescuer thread to do more work.
  2014-11-18  4:27                 ` [PATCH - v3?] " NeilBrown
  2014-11-18  6:01                   ` Lai Jiangshan
@ 2014-12-02 20:43                   ` Tejun Heo
  2014-12-03  0:40                     ` NeilBrown
  1 sibling, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2014-12-02 20:43 UTC (permalink / raw)
  To: NeilBrown; +Cc: Jan Kara, Lai Jiangshan, Dongsu Park, linux-kernel

Hello,

On Tue, Nov 18, 2014 at 03:27:54PM +1100, NeilBrown wrote:
> @@ -2253,26 +2253,36 @@ repeat:
>  					struct pool_workqueue, mayday_node);
>  		struct worker_pool *pool = pwq->pool;
>  		struct work_struct *work, *n;
> +		int still_needed;
>  
>  		__set_current_state(TASK_RUNNING);
> -		list_del_init(&pwq->mayday_node);
> -
> -		spin_unlock_irq(&wq_mayday_lock);
> -
> -		worker_attach_to_pool(rescuer, pool);
> -
> -		spin_lock_irq(&pool->lock);
> -		rescuer->pool = pool;
> -
> +		spin_lock(&pool->lock);
>  		/*
>  		 * Slurp in all works issued via this workqueue and
>  		 * process'em.
>  		 */
>  		WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
> +		still_needed = need_to_create_worker(pool);
>  		list_for_each_entry_safe(work, n, &pool->worklist, entry)
>  			if (get_work_pwq(work) == pwq)
>  				move_linked_works(work, scheduled, &n);
>  
> +		if (!list_empty(scheduled))
> +			still_needed = 1;
> +		if (still_needed) {
> +			list_move_tail(&pwq->mayday_node, &wq->maydays);
> +			get_pwq(pwq);
> +		} else
> +			/* We can let go of this one now */
> +			list_del_init(&pwq->mayday_node);

This seems rather convoluted.  Why are we testing this before
executing the work item?  Can't we do this after?  Isn't that -
whether the wq still needs rescuing after rescuer went through it once
- what we wanna know anyway?  e.g. something like the following.

	for_each_pwq_on_mayday_list {
		try to fetch work items from pwq->pool;
		if (none was fetched)
			goto remove_pwq;

		execute the fetched work items;

		if (need_to_create_worker()) {
			move the pwq to the tail;
			continue;
		}

	remove_pwq:
		remove the pwq;
	}

> +
> +		spin_unlock(&pool->lock);
> +		spin_unlock_irq(&wq_mayday_lock);
> +
> +		worker_attach_to_pool(rescuer, pool);
> +
> +		spin_lock_irq(&pool->lock);
> +		rescuer->pool = pool;
>  		process_scheduled_works(rescuer);
>  
>  		/*
> @@ -2293,7 +2303,7 @@ repeat:
>  		spin_unlock_irq(&pool->lock);
>  
>  		worker_detach_from_pool(rescuer, pool);
> -
> +		cond_resched();

Also, why this addition?  process_one_work() already has
cond_resched_rcu_qs().

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH - v3?] workqueue: allow rescuer thread to do more work.
  2014-12-02 20:43                   ` Tejun Heo
@ 2014-12-03  0:40                     ` NeilBrown
  2014-12-03 17:20                       ` Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: NeilBrown @ 2014-12-03  0:40 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jan Kara, Lai Jiangshan, Dongsu Park, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 7673 bytes --]

On Tue, 2 Dec 2014 15:43:04 -0500 Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> On Tue, Nov 18, 2014 at 03:27:54PM +1100, NeilBrown wrote:
> > @@ -2253,26 +2253,36 @@ repeat:
> >  					struct pool_workqueue, mayday_node);
> >  		struct worker_pool *pool = pwq->pool;
> >  		struct work_struct *work, *n;
> > +		int still_needed;
> >  
> >  		__set_current_state(TASK_RUNNING);
> > -		list_del_init(&pwq->mayday_node);
> > -
> > -		spin_unlock_irq(&wq_mayday_lock);
> > -
> > -		worker_attach_to_pool(rescuer, pool);
> > -
> > -		spin_lock_irq(&pool->lock);
> > -		rescuer->pool = pool;
> > -
> > +		spin_lock(&pool->lock);
> >  		/*
> >  		 * Slurp in all works issued via this workqueue and
> >  		 * process'em.
> >  		 */
> >  		WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
> > +		still_needed = need_to_create_worker(pool);
> >  		list_for_each_entry_safe(work, n, &pool->worklist, entry)
> >  			if (get_work_pwq(work) == pwq)
> >  				move_linked_works(work, scheduled, &n);
> >  
> > +		if (!list_empty(scheduled))
> > +			still_needed = 1;
> > +		if (still_needed) {
> > +			list_move_tail(&pwq->mayday_node, &wq->maydays);
> > +			get_pwq(pwq);
> > +		} else
> > +			/* We can let go of this one now */
> > +			list_del_init(&pwq->mayday_node);
> 
> This seems rather convoluted.

Agreed.

>     Why are we testing this before
> executing the work item?  Can't we do this after? 

To execute the work item, we need to drop the locks.
To perform the "list_move_tail" and the "get_pwq" we need to hold both locks.

It seems easier to do the test while holding the required locks.

>    Isn't that -
> whether the wq still needs rescuing after rescuer went through it once
> - what we wanna know anyway?  e.g. something like the following.
> 
> 	for_each_pwq_on_mayday_list {
> 		try to fetch work items from pwq->pool;
> 		if (none was fetched)
> 			goto remove_pwq;
> 
> 		execute the fetched work items;
> 
> 		if (need_to_create_worker()) {
> 			move the pwq to the tail;
> 			continue;
> 		}
> 
> 	remove_pwq:
> 		remove the pwq;
> 	}

That looks nice.  I have a little difficulty matching the code to that
outline.
In particular I need to hold the pool->lock to call put_pwq() and after
calling put_pwq() it isn't clear that I have a safe reference to pool so that
it is safe to de-reference it... unless I always
attach_to_pool/detach_from_pool.
But to do that I have to drop the locks at awkward times.

Please correct me if I'm wrong, but it looks like I have to call
worker_attach_to_pool() or I cannot safely call put_pwq().
I then have to call worker_detach_from_pool() as the last step.

I need to hold wq_mayday_lock to either move the pwq to the end of the mayday
list or to remove it, and I have to drop that lock to call
process_scheduled_works and after that I cannot retake wq_mayday_lock without
first dropping pool->lock.

The net result is that if I try to map the code on to your outline, it
becomes even more convoluted due to the various locking/refcounting issues.

The best I can do is below (much the same as before).  If someone else can do
better I'd be very happy to see it.


> 
> > +
> > +		spin_unlock(&pool->lock);
> > +		spin_unlock_irq(&wq_mayday_lock);
> > +
> > +		worker_attach_to_pool(rescuer, pool);
> > +
> > +		spin_lock_irq(&pool->lock);
> > +		rescuer->pool = pool;
> >  		process_scheduled_works(rescuer);
> >  
> >  		/*
> > @@ -2293,7 +2303,7 @@ repeat:
> >  		spin_unlock_irq(&pool->lock);
> >  
> >  		worker_detach_from_pool(rescuer, pool);
> > -
> > +		cond_resched();
> 
> Also, why this addition?  process_one_work() already has
> cond_resched_rcu_qs().

Just being over-cautious I suppose.  I agree that isn't needed.

> 
> Thanks.
> 

Thanks,
NeilBrown


From: NeilBrown <neilb@suse.de>
Date: Tue, 18 Nov 2014 15:23:07 +1100
Subject: [PATCH] workqueue: allow rescuer thread to do more work.

When there is serious memory pressure, all workers in a pool could be
blocked, and a new thread cannot be created because it requires memory
allocation.

In this situation a WQ_MEM_RECLAIM workqueue will wake up the
rescuer thread to do some work.

The rescuer will only handle requests that are already on ->worklist.
If max_requests is 1, that means it will handle a single request.

The rescuer will be woken again in 100ms to handle another max_requests
requests.

I've seen a machine (running a 3.0 based "enterprise" kernel) with
thousands of requests queued for xfslogd, which has a max_requests of
1, and is needed for retiring all 'xfs' write requests.  When one of
the worker pools gets into this state, it progresses extremely slowly
and possibly never recovers (only waited an hour or two).

With this patch we leave a pool_workqueue on mayday list
until it is clearly no longer in need of assistance.  This allows
all requests to be handled in a timely fashion.

The code is a bit awkward due to the need to hold both wq_mayday_lock
and pool->lock at the same time, and due to the lock ordering imposed
on them.  In particular we move work items from the ->worklist to the
rescuer list while holding both locks because we need pool->lock
to do the move, and need wq_mayday_lock to manipulate the mayday list
after we have found out if there was anything to do.

'need_to_create_worker()' is called *before* moving work items off
pool->worklist as an empty worklist will make it return false, but
after the move_linked_works() calls and before the
process_scheduled_works() call, an empty worklist does not indicate
that there is no work to do.

We keep each pool_workqueue on the mayday list until
need_to_create_worker() is false, and no work for this workqueue is
found in the pool.

As the main rescuer loop now iterates an arbitrary number of time,
cond_resched() was inserted to avoid imposing excessive latencies.

I have tested this in combination with a (hackish) patch which forces
all work items to be handled by the rescuer thread.  In that context
it significantly improves performance.  A similar patch for a 3.0
kernel significantly improved performance on a heavy work load.

Signed-off-by: NeilBrown <neilb@suse.de>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index caedde34ee7f..fc9a2771fc0d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2253,26 +2253,36 @@ repeat:
 					struct pool_workqueue, mayday_node);
 		struct worker_pool *pool = pwq->pool;
 		struct work_struct *work, *n;
+		int still_needed;
 
 		__set_current_state(TASK_RUNNING);
-		list_del_init(&pwq->mayday_node);
-
-		spin_unlock_irq(&wq_mayday_lock);
-
-		worker_attach_to_pool(rescuer, pool);
-
-		spin_lock_irq(&pool->lock);
-		rescuer->pool = pool;
-
+		spin_lock(&pool->lock);
 		/*
 		 * Slurp in all works issued via this workqueue and
 		 * process'em.
 		 */
 		WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
+		still_needed = need_to_create_worker(pool);
 		list_for_each_entry_safe(work, n, &pool->worklist, entry)
 			if (get_work_pwq(work) == pwq)
 				move_linked_works(work, scheduled, &n);
 
+		if (list_empty(scheduled))
+			still_needed = 0;
+		if (still_needed) {
+			list_move_tail(&pwq->mayday_node, &wq->maydays);
+			get_pwq(pwq);
+		} else
+			/* We can let go of this one now */
+			list_del_init(&pwq->mayday_node);
+
+		spin_unlock(&pool->lock);
+		spin_unlock_irq(&wq_mayday_lock);
+
+		worker_attach_to_pool(rescuer, pool);
+
+		spin_lock_irq(&pool->lock);
+		rescuer->pool = pool;
 		process_scheduled_works(rescuer);
 
 		/*

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH - v3?] workqueue: allow rescuer thread to do more work.
  2014-12-03  0:40                     ` NeilBrown
@ 2014-12-03 17:20                       ` Tejun Heo
  2014-12-03 18:02                         ` Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2014-12-03 17:20 UTC (permalink / raw)
  To: NeilBrown; +Cc: Jan Kara, Lai Jiangshan, Dongsu Park, linux-kernel

Hello,

On Wed, Dec 03, 2014 at 11:40:11AM +1100, NeilBrown wrote:
> To execute the work item, we need to drop the locks.
> To perform the "list_move_tail" and the "get_pwq" we need to hold both locks.
> 
> It seems easier to do the test while holding the required locks.

Hmmm... is it difficult to regrab the locks tho?  If the problem is
wq_mayday_lock nesting outside pool locks, I don't think switching the
order would be too difficult.  The only place the two are nested is
pool_mayday_timeout() anyway, right?

> >    Isn't that -
> > whether the wq still needs rescuing after rescuer went through it once
> > - what we wanna know anyway?  e.g. something like the following.
> > 
> > 	for_each_pwq_on_mayday_list {
> > 		try to fetch work items from pwq->pool;
> > 		if (none was fetched)
> > 			goto remove_pwq;
> > 
> > 		execute the fetched work items;
> > 
> > 		if (need_to_create_worker()) {
> > 			move the pwq to the tail;
> > 			continue;
> > 		}
> > 
> > 	remove_pwq:
> > 		remove the pwq;
> > 	}
> 
> That looks nice.  I have a little difficulty matching the code to that
> outline.
> In particular I need to hold the pool->lock to call put_pwq() and after
> calling put_pwq() it isn't clear that I have a safe reference to pool so that
> it is safe to de-reference it... unless I always
> attach_to_pool/detach_from_pool.
> But to do that I have to drop the locks at awkward times.

Would inverting pool locks and wq_mayday_lock make it less awkward?

> Please correct me if I'm wrong, but it looks like I have to call
> worker_attach_to_pool() or I cannot safely call put_pwq().
> I then have to call worker_detach_from_pool() as the last step.

If I'm not mistaken, the two aren't related.  detach is necessary iff
attach has been called.  put_pwq() can be called independently.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH - v3?] workqueue: allow rescuer thread to do more work.
  2014-12-03 17:20                       ` Tejun Heo
@ 2014-12-03 18:02                         ` Tejun Heo
  2014-12-03 22:31                           ` Dongsu Park
                                             ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Tejun Heo @ 2014-12-03 18:02 UTC (permalink / raw)
  To: NeilBrown; +Cc: Jan Kara, Lai Jiangshan, Dongsu Park, linux-kernel

So, something like the following.  Only compile tested.  I'll test it
and post proper patches w/ due credits.

Thanks.

Index: work/kernel/workqueue.c
===================================================================
--- work.orig/kernel/workqueue.c
+++ work/kernel/workqueue.c
@@ -1804,8 +1804,8 @@ static void pool_mayday_timeout(unsigned
 	struct worker_pool *pool = (void *)__pool;
 	struct work_struct *work;
 
-	spin_lock_irq(&wq_mayday_lock);		/* for wq->maydays */
-	spin_lock(&pool->lock);
+	spin_lock_irq(&pool->lock);
+	spin_lock(&wq_mayday_lock);		/* for wq->maydays */
 
 	if (need_to_create_worker(pool)) {
 		/*
@@ -1818,8 +1818,8 @@ static void pool_mayday_timeout(unsigned
 			send_mayday(work);
 	}
 
-	spin_unlock(&pool->lock);
-	spin_unlock_irq(&wq_mayday_lock);
+	spin_unlock(&wq_mayday_lock);
+	spin_unlock_irq(&pool->lock);
 
 	mod_timer(&pool->mayday_timer, jiffies + MAYDAY_INTERVAL);
 }
@@ -2248,12 +2248,29 @@ repeat:
 		 * Slurp in all works issued via this workqueue and
 		 * process'em.
 		 */
-		WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
+		WARN_ON_ONCE(!list_empty(scheduled));
 		list_for_each_entry_safe(work, n, &pool->worklist, entry)
 			if (get_work_pwq(work) == pwq)
 				move_linked_works(work, scheduled, &n);
 
-		process_scheduled_works(rescuer);
+		if (!list_empty(scheduled)) {
+			process_scheduled_works(rescuer);
+
+			/*
+			 * The above execution of rescued work items could
+			 * have created more to rescue through
+			 * pwq_activate_first_delayed() or chained
+			 * queueing.  Let's put @pwq back on mayday list so
+			 * that such back-to-back work items, which may be
+			 * being used to relieve memory pressure, don't
+			 * incur MAYDAY_INTERVAL delay inbetween.
+			 */
+			if (need_to_create_worker(pool)) {
+				spin_lock(&wq_mayday_lock);
+				list_move_tail(&pwq->mayday_node, &wq->maydays);
+				spin_unlock(&wq_mayday_lock);
+			}
+		}
 
 		/*
 		 * Put the reference grabbed by send_mayday().  @pool won't

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH - v3?] workqueue: allow rescuer thread to do more work.
  2014-12-03 18:02                         ` Tejun Heo
@ 2014-12-03 22:31                           ` Dongsu Park
  2014-12-04  1:19                             ` NeilBrown
  2014-12-04  1:01                           ` Lai Jiangshan
  2014-12-04 15:11                           ` [PATCH workqueue/for-3.18-fixes 1/2] workqueue: invert the order between pool->lock and wq_mayday_lock Tejun Heo
  2 siblings, 1 reply; 26+ messages in thread
From: Dongsu Park @ 2014-12-03 22:31 UTC (permalink / raw)
  To: Tejun Heo; +Cc: NeilBrown, Jan Kara, Lai Jiangshan, linux-kernel

Hi Tejun,

On 03.12.2014 13:02, Tejun Heo wrote:
> So, something like the following.  Only compile tested.  I'll test it
> and post proper patches w/ due credits.

I have been already satisfied with Neil's patch,
but your patch looks indeed a lot cleaner, I like it.
I just compiled and tested it shortly, which seems to work.
Though there's one nitpick. (see below)

> Thanks.
> 
> Index: work/kernel/workqueue.c
> ===================================================================
> --- work.orig/kernel/workqueue.c
> +++ work/kernel/workqueue.c
> @@ -1804,8 +1804,8 @@ static void pool_mayday_timeout(unsigned
>  	struct worker_pool *pool = (void *)__pool;
>  	struct work_struct *work;
>  
> -	spin_lock_irq(&wq_mayday_lock);		/* for wq->maydays */
> -	spin_lock(&pool->lock);
> +	spin_lock_irq(&pool->lock);
> +	spin_lock(&wq_mayday_lock);		/* for wq->maydays */
>  
>  	if (need_to_create_worker(pool)) {
>  		/*
> @@ -1818,8 +1818,8 @@ static void pool_mayday_timeout(unsigned
>  			send_mayday(work);
>  	}
>  
> -	spin_unlock(&pool->lock);
> -	spin_unlock_irq(&wq_mayday_lock);
> +	spin_unlock(&wq_mayday_lock);
> +	spin_unlock_irq(&pool->lock);
>  
>  	mod_timer(&pool->mayday_timer, jiffies + MAYDAY_INTERVAL);
>  }
> @@ -2248,12 +2248,29 @@ repeat:
>  		 * Slurp in all works issued via this workqueue and
>  		 * process'em.
>  		 */
> -		WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
> +		WARN_ON_ONCE(!list_empty(scheduled));
>  		list_for_each_entry_safe(work, n, &pool->worklist, entry)
>  			if (get_work_pwq(work) == pwq)
>  				move_linked_works(work, scheduled, &n);
>  
> -		process_scheduled_works(rescuer);
> +		if (!list_empty(scheduled)) {
> +			process_scheduled_works(rescuer);
> +
> +			/*
> +			 * The above execution of rescued work items could
> +			 * have created more to rescue through
> +			 * pwq_activate_first_delayed() or chained
> +			 * queueing.  Let's put @pwq back on mayday list so
> +			 * that such back-to-back work items, which may be
> +			 * being used to relieve memory pressure, don't
> +			 * incur MAYDAY_INTERVAL delay inbetween.
> +			 */
> +			if (need_to_create_worker(pool)) {
> +				spin_lock(&wq_mayday_lock);

Does it need to call get_pwq(pwq), doesn't it?

Thanks,
Dongsu

> +				list_move_tail(&pwq->mayday_node, &wq->maydays);
> +				spin_unlock(&wq_mayday_lock);
> +			}
> +		}
>  
>  		/*
>  		 * Put the reference grabbed by send_mayday().  @pool won't
> 
> -- 
> tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH - v3?] workqueue: allow rescuer thread to do more work.
  2014-12-03 18:02                         ` Tejun Heo
  2014-12-03 22:31                           ` Dongsu Park
@ 2014-12-04  1:01                           ` Lai Jiangshan
  2014-12-04 14:57                             ` Tejun Heo
  2014-12-04 15:11                           ` [PATCH workqueue/for-3.18-fixes 1/2] workqueue: invert the order between pool->lock and wq_mayday_lock Tejun Heo
  2 siblings, 1 reply; 26+ messages in thread
From: Lai Jiangshan @ 2014-12-04  1:01 UTC (permalink / raw)
  To: Tejun Heo; +Cc: NeilBrown, Jan Kara, Dongsu Park, linux-kernel

On 12/04/2014 02:02 AM, Tejun Heo wrote:
> So, something like the following.  Only compile tested.  I'll test it
> and post proper patches w/ due credits.
> 
> Thanks.
> 
> Index: work/kernel/workqueue.c
> ===================================================================
> --- work.orig/kernel/workqueue.c
> +++ work/kernel/workqueue.c
> @@ -1804,8 +1804,8 @@ static void pool_mayday_timeout(unsigned
>  	struct worker_pool *pool = (void *)__pool;
>  	struct work_struct *work;
>  
> -	spin_lock_irq(&wq_mayday_lock);		/* for wq->maydays */
> -	spin_lock(&pool->lock);
> +	spin_lock_irq(&pool->lock);
> +	spin_lock(&wq_mayday_lock);		/* for wq->maydays */
>  
>  	if (need_to_create_worker(pool)) {
>  		/*
> @@ -1818,8 +1818,8 @@ static void pool_mayday_timeout(unsigned
>  			send_mayday(work);
>  	}
>  
> -	spin_unlock(&pool->lock);
> -	spin_unlock_irq(&wq_mayday_lock);
> +	spin_unlock(&wq_mayday_lock);
> +	spin_unlock_irq(&pool->lock);
>  
>  	mod_timer(&pool->mayday_timer, jiffies + MAYDAY_INTERVAL);
>  }
> @@ -2248,12 +2248,29 @@ repeat:
>  		 * Slurp in all works issued via this workqueue and
>  		 * process'em.
>  		 */
> -		WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
> +		WARN_ON_ONCE(!list_empty(scheduled));
>  		list_for_each_entry_safe(work, n, &pool->worklist, entry)
>  			if (get_work_pwq(work) == pwq)
>  				move_linked_works(work, scheduled, &n);
>  
> -		process_scheduled_works(rescuer);
> +		if (!list_empty(scheduled)) {
> +			process_scheduled_works(rescuer);
> +
> +			/*
> +			 * The above execution of rescued work items could
> +			 * have created more to rescue through
> +			 * pwq_activate_first_delayed() or chained
> +			 * queueing.  Let's put @pwq back on mayday list so
> +			 * that such back-to-back work items, which may be
> +			 * being used to relieve memory pressure, don't
> +			 * incur MAYDAY_INTERVAL delay inbetween.
> +			 */
> +			if (need_to_create_worker(pool)) {
> +				spin_lock(&wq_mayday_lock);
> +				list_move_tail(&pwq->mayday_node, &wq->maydays);

how about the pwq is empty now? and ...

> +				spin_unlock(&wq_mayday_lock);
> +			}
> +		}
>  
>  		/*
>  		 * Put the reference grabbed by send_mayday().  @pool won't

and the pwq is scheduled free here> 
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH - v3?] workqueue: allow rescuer thread to do more work.
  2014-12-03 22:31                           ` Dongsu Park
@ 2014-12-04  1:19                             ` NeilBrown
  0 siblings, 0 replies; 26+ messages in thread
From: NeilBrown @ 2014-12-04  1:19 UTC (permalink / raw)
  To: Dongsu Park; +Cc: Tejun Heo, Jan Kara, Lai Jiangshan, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2895 bytes --]

On Wed, 3 Dec 2014 23:31:16 +0100 Dongsu Park <dongsu.park@profitbricks.com>
wrote:

> Hi Tejun,
> 
> On 03.12.2014 13:02, Tejun Heo wrote:
> > So, something like the following.  Only compile tested.  I'll test it
> > and post proper patches w/ due credits.
> 
> I have been already satisfied with Neil's patch,
> but your patch looks indeed a lot cleaner, I like it.
> I just compiled and tested it shortly, which seems to work.
> Though there's one nitpick. (see below)
> 
> > Thanks.
> > 
> > Index: work/kernel/workqueue.c
> > ===================================================================
> > --- work.orig/kernel/workqueue.c
> > +++ work/kernel/workqueue.c
> > @@ -1804,8 +1804,8 @@ static void pool_mayday_timeout(unsigned
> >  	struct worker_pool *pool = (void *)__pool;
> >  	struct work_struct *work;
> >  
> > -	spin_lock_irq(&wq_mayday_lock);		/* for wq->maydays */
> > -	spin_lock(&pool->lock);
> > +	spin_lock_irq(&pool->lock);
> > +	spin_lock(&wq_mayday_lock);		/* for wq->maydays */
> >  
> >  	if (need_to_create_worker(pool)) {
> >  		/*
> > @@ -1818,8 +1818,8 @@ static void pool_mayday_timeout(unsigned
> >  			send_mayday(work);
> >  	}
> >  
> > -	spin_unlock(&pool->lock);
> > -	spin_unlock_irq(&wq_mayday_lock);
> > +	spin_unlock(&wq_mayday_lock);
> > +	spin_unlock_irq(&pool->lock);
> >  
> >  	mod_timer(&pool->mayday_timer, jiffies + MAYDAY_INTERVAL);
> >  }
> > @@ -2248,12 +2248,29 @@ repeat:
> >  		 * Slurp in all works issued via this workqueue and
> >  		 * process'em.
> >  		 */
> > -		WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
> > +		WARN_ON_ONCE(!list_empty(scheduled));
> >  		list_for_each_entry_safe(work, n, &pool->worklist, entry)
> >  			if (get_work_pwq(work) == pwq)
> >  				move_linked_works(work, scheduled, &n);
> >  
> > -		process_scheduled_works(rescuer);
> > +		if (!list_empty(scheduled)) {
> > +			process_scheduled_works(rescuer);
> > +
> > +			/*
> > +			 * The above execution of rescued work items could
> > +			 * have created more to rescue through
> > +			 * pwq_activate_first_delayed() or chained
> > +			 * queueing.  Let's put @pwq back on mayday list so
> > +			 * that such back-to-back work items, which may be
> > +			 * being used to relieve memory pressure, don't
> > +			 * incur MAYDAY_INTERVAL delay inbetween.
> > +			 */
> > +			if (need_to_create_worker(pool)) {
> > +				spin_lock(&wq_mayday_lock);
> 
> Does it need to call get_pwq(pwq), doesn't it?

Yes, I think it does.

Swapping the order of the locks make it so much nicer, doesn't it!!

Thanks,
NeilBrown


> 
> Thanks,
> Dongsu
> 
> > +				list_move_tail(&pwq->mayday_node, &wq->maydays);
> > +				spin_unlock(&wq_mayday_lock);
> > +			}
> > +		}
> >  
> >  		/*
> >  		 * Put the reference grabbed by send_mayday().  @pool won't
> > 
> > -- 
> > tejun


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH - v3?] workqueue: allow rescuer thread to do more work.
  2014-12-04  1:01                           ` Lai Jiangshan
@ 2014-12-04 14:57                             ` Tejun Heo
  0 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2014-12-04 14:57 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: NeilBrown, Jan Kara, Dongsu Park, linux-kernel

On Thu, Dec 04, 2014 at 09:01:50AM +0800, Lai Jiangshan wrote:
> > +			if (need_to_create_worker(pool)) {
> > +				spin_lock(&wq_mayday_lock);
> > +				list_move_tail(&pwq->mayday_node, &wq->maydays);
> 
> how about the pwq is empty now? and ...

It will be taken off list the next round.  It's silly to scan the list
again here.  The only thing that'd do is adding more code and adding
an extra scan when the pwq actually needs to be requeued.

> > +				spin_unlock(&wq_mayday_lock);
> > +			}
> > +		}
> >  
> >  		/*
> >  		 * Put the reference grabbed by send_mayday().  @pool won't
> 
> and the pwq is scheduled free here> 

Added get_pwq() before list_move_tail() as Dongsu suggested.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH workqueue/for-3.18-fixes 1/2] workqueue: invert the order between pool->lock and wq_mayday_lock
  2014-12-03 18:02                         ` Tejun Heo
  2014-12-03 22:31                           ` Dongsu Park
  2014-12-04  1:01                           ` Lai Jiangshan
@ 2014-12-04 15:11                           ` Tejun Heo
  2014-12-04 15:12                             ` [PATCH workqueue/for-3.18-fixes 2/2] workqueue: allow rescuer thread to do more work Tejun Heo
  2014-12-05  2:09                             ` [PATCH workqueue/for-3.18-fixes 1/2] workqueue: invert the order between pool->lock and wq_mayday_lock Lai Jiangshan
  2 siblings, 2 replies; 26+ messages in thread
From: Tejun Heo @ 2014-12-04 15:11 UTC (permalink / raw)
  To: NeilBrown; +Cc: Jan Kara, Lai Jiangshan, Dongsu Park, linux-kernel

Currently, pool->lock nests inside pool->lock.  There's no inherent
reason for this order.  The only place where the two locks are held
together is pool_mayday_timeout() and it just got decided that way.

This nesting order turns out to complicate things with the planned
rescuer_thread() update.  Let's invert them.  This doesn't cause any
behavior differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: NeilBrown <neilb@suse.de>
Cc: Dongsu Park <dongsu.park@profitbricks.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
---
 kernel/workqueue.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1804,8 +1804,8 @@ static void pool_mayday_timeout(unsigned
 	struct worker_pool *pool = (void *)__pool;
 	struct work_struct *work;
 
-	spin_lock_irq(&wq_mayday_lock);		/* for wq->maydays */
-	spin_lock(&pool->lock);
+	spin_lock_irq(&pool->lock);
+	spin_lock(&wq_mayday_lock);		/* for wq->maydays */
 
 	if (need_to_create_worker(pool)) {
 		/*
@@ -1818,8 +1818,8 @@ static void pool_mayday_timeout(unsigned
 			send_mayday(work);
 	}
 
-	spin_unlock(&pool->lock);
-	spin_unlock_irq(&wq_mayday_lock);
+	spin_unlock(&wq_mayday_lock);
+	spin_unlock_irq(&pool->lock);
 
 	mod_timer(&pool->mayday_timer, jiffies + MAYDAY_INTERVAL);
 }

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH workqueue/for-3.18-fixes 2/2] workqueue: allow rescuer thread to do more work
  2014-12-04 15:11                           ` [PATCH workqueue/for-3.18-fixes 1/2] workqueue: invert the order between pool->lock and wq_mayday_lock Tejun Heo
@ 2014-12-04 15:12                             ` Tejun Heo
  2014-12-08 17:40                               ` Tejun Heo
  2014-12-05  2:09                             ` [PATCH workqueue/for-3.18-fixes 1/2] workqueue: invert the order between pool->lock and wq_mayday_lock Lai Jiangshan
  1 sibling, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2014-12-04 15:12 UTC (permalink / raw)
  To: NeilBrown; +Cc: Jan Kara, Lai Jiangshan, Dongsu Park, linux-kernel

From: NeilBrown <neilb@suse.de>

When there is serious memory pressure, all workers in a pool could be
blocked, and a new thread cannot be created because it requires memory
allocation.

In this situation a WQ_MEM_RECLAIM workqueue will wake up the
rescuer thread to do some work.

The rescuer will only handle requests that are already on ->worklist.
If max_requests is 1, that means it will handle a single request.

The rescuer will be woken again in 100ms to handle another max_requests
requests.

I've seen a machine (running a 3.0 based "enterprise" kernel) with
thousands of requests queued for xfslogd, which has a max_requests of
1, and is needed for retiring all 'xfs' write requests.  When one of
the worker pools gets into this state, it progresses extremely slowly
and possibly never recovers (only waited an hour or two).

With this patch we leave a pool_workqueue on mayday list
until it is clearly no longer in need of assistance.  This allows
all requests to be handled in a timely fashion.

We keep each pool_workqueue on the mayday list until
need_to_create_worker() is false, and no work for this workqueue is
found in the pool.

I have tested this in combination with a (hackish) patch which forces
all work items to be handled by the rescuer thread.  In that context
it significantly improves performance.  A similar patch for a 3.0
kernel significantly improved performance on a heavy work load.

Thanks to Jan Kara for some design ideas, and to Dongsu Park for
some comments and testing.

tj: Inverted the lock order between wq_mayday_lock and pool->lock with
    a preceding patch and simplified this patch.  Added comment and
    updated changelog accordingly.  Dongsu spotted missing get_pwq()
    in the simplified code.

Cc: Dongsu Park <dongsu.park@profitbricks.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
Hello,

I'll apply these two patches to workqueue/for-3.18-fixes.  It'd be
great if other ppl can also confirm this actually works.

Thanks.

 kernel/workqueue.c |   20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2251,7 +2251,25 @@ repeat:
 			if (get_work_pwq(work) == pwq)
 				move_linked_works(work, scheduled, &n);
 
-		process_scheduled_works(rescuer);
+		if (!list_empty(scheduled)) {
+			process_scheduled_works(rescuer);
+
+			/*
+			 * The above execution of rescued work items could
+			 * have created more to rescue through
+			 * pwq_activate_first_delayed() or chained
+			 * queueing.  Let's put @pwq back on mayday list so
+			 * that such back-to-back work items, which may be
+			 * being used to relieve memory pressure, don't
+			 * incur MAYDAY_INTERVAL delay inbetween.
+			 */
+			if (need_to_create_worker(pool)) {
+				spin_lock(&wq_mayday_lock);
+				get_pwq(pwq);
+				list_move_tail(&pwq->mayday_node, &wq->maydays);
+				spin_unlock(&wq_mayday_lock);
+			}
+		}
 
 		/*
 		 * Put the reference grabbed by send_mayday().  @pool won't

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH workqueue/for-3.18-fixes 1/2] workqueue: invert the order between pool->lock and wq_mayday_lock
  2014-12-04 15:11                           ` [PATCH workqueue/for-3.18-fixes 1/2] workqueue: invert the order between pool->lock and wq_mayday_lock Tejun Heo
  2014-12-04 15:12                             ` [PATCH workqueue/for-3.18-fixes 2/2] workqueue: allow rescuer thread to do more work Tejun Heo
@ 2014-12-05  2:09                             ` Lai Jiangshan
  1 sibling, 0 replies; 26+ messages in thread
From: Lai Jiangshan @ 2014-12-05  2:09 UTC (permalink / raw)
  To: Tejun Heo; +Cc: NeilBrown, Jan Kara, Dongsu Park, linux-kernel

On 12/04/2014 11:11 PM, Tejun Heo wrote:
> Currently, pool->lock nests inside pool->lock.  There's no inherent
> reason for this order.  The only place where the two locks are held
> together is pool_mayday_timeout() and it just got decided that way.
> 
> This nesting order turns out to complicate things with the planned
> rescuer_thread() update.  Let's invert them.  This doesn't cause any
> behavior differences.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: NeilBrown <neilb@suse.de>
> Cc: Dongsu Park <dongsu.park@profitbricks.com>
> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>

Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>

> ---
>  kernel/workqueue.c |    8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1804,8 +1804,8 @@ static void pool_mayday_timeout(unsigned
>  	struct worker_pool *pool = (void *)__pool;
>  	struct work_struct *work;
>  
> -	spin_lock_irq(&wq_mayday_lock);		/* for wq->maydays */
> -	spin_lock(&pool->lock);
> +	spin_lock_irq(&pool->lock);
> +	spin_lock(&wq_mayday_lock);		/* for wq->maydays */
>  
>  	if (need_to_create_worker(pool)) {
>  		/*
> @@ -1818,8 +1818,8 @@ static void pool_mayday_timeout(unsigned
>  			send_mayday(work);
>  	}
>  
> -	spin_unlock(&pool->lock);
> -	spin_unlock_irq(&wq_mayday_lock);
> +	spin_unlock(&wq_mayday_lock);
> +	spin_unlock_irq(&pool->lock);
>  
>  	mod_timer(&pool->mayday_timer, jiffies + MAYDAY_INTERVAL);
>  }
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH workqueue/for-3.18-fixes 2/2] workqueue: allow rescuer thread to do more work
  2014-12-04 15:12                             ` [PATCH workqueue/for-3.18-fixes 2/2] workqueue: allow rescuer thread to do more work Tejun Heo
@ 2014-12-08 17:40                               ` Tejun Heo
  2014-12-08 22:47                                 ` NeilBrown
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2014-12-08 17:40 UTC (permalink / raw)
  To: NeilBrown; +Cc: Jan Kara, Lai Jiangshan, Dongsu Park, linux-kernel

On Thu, Dec 04, 2014 at 10:12:23AM -0500, Tejun Heo wrote:
> From: NeilBrown <neilb@suse.de>
> 
> When there is serious memory pressure, all workers in a pool could be
> blocked, and a new thread cannot be created because it requires memory
> allocation.
> 
> In this situation a WQ_MEM_RECLAIM workqueue will wake up the
> rescuer thread to do some work.
> 
> The rescuer will only handle requests that are already on ->worklist.
> If max_requests is 1, that means it will handle a single request.
> 
> The rescuer will be woken again in 100ms to handle another max_requests
> requests.
> 
> I've seen a machine (running a 3.0 based "enterprise" kernel) with
> thousands of requests queued for xfslogd, which has a max_requests of
> 1, and is needed for retiring all 'xfs' write requests.  When one of
> the worker pools gets into this state, it progresses extremely slowly
> and possibly never recovers (only waited an hour or two).
> 
> With this patch we leave a pool_workqueue on mayday list
> until it is clearly no longer in need of assistance.  This allows
> all requests to be handled in a timely fashion.
> 
> We keep each pool_workqueue on the mayday list until
> need_to_create_worker() is false, and no work for this workqueue is
> found in the pool.
> 
> I have tested this in combination with a (hackish) patch which forces
> all work items to be handled by the rescuer thread.  In that context
> it significantly improves performance.  A similar patch for a 3.0
> kernel significantly improved performance on a heavy work load.
> 
> Thanks to Jan Kara for some design ideas, and to Dongsu Park for
> some comments and testing.
> 
> tj: Inverted the lock order between wq_mayday_lock and pool->lock with
>     a preceding patch and simplified this patch.  Added comment and
>     updated changelog accordingly.  Dongsu spotted missing get_pwq()
>     in the simplified code.
> 
> Cc: Dongsu Park <dongsu.park@profitbricks.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
> Signed-off-by: NeilBrown <neilb@suse.de>
> Signed-off-by: Tejun Heo <tj@kernel.org>

Too late for for-3.18-fixes.  Applied the two patches to wq/for-3.19.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH workqueue/for-3.18-fixes 2/2] workqueue: allow rescuer thread to do more work
  2014-12-08 17:40                               ` Tejun Heo
@ 2014-12-08 22:47                                 ` NeilBrown
  0 siblings, 0 replies; 26+ messages in thread
From: NeilBrown @ 2014-12-08 22:47 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jan Kara, Lai Jiangshan, Dongsu Park, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2605 bytes --]

On Mon, 8 Dec 2014 12:40:52 -0500 Tejun Heo <tj@kernel.org> wrote:

> On Thu, Dec 04, 2014 at 10:12:23AM -0500, Tejun Heo wrote:
> > From: NeilBrown <neilb@suse.de>
> > 
> > When there is serious memory pressure, all workers in a pool could be
> > blocked, and a new thread cannot be created because it requires memory
> > allocation.
> > 
> > In this situation a WQ_MEM_RECLAIM workqueue will wake up the
> > rescuer thread to do some work.
> > 
> > The rescuer will only handle requests that are already on ->worklist.
> > If max_requests is 1, that means it will handle a single request.
> > 
> > The rescuer will be woken again in 100ms to handle another max_requests
> > requests.
> > 
> > I've seen a machine (running a 3.0 based "enterprise" kernel) with
> > thousands of requests queued for xfslogd, which has a max_requests of
> > 1, and is needed for retiring all 'xfs' write requests.  When one of
> > the worker pools gets into this state, it progresses extremely slowly
> > and possibly never recovers (only waited an hour or two).
> > 
> > With this patch we leave a pool_workqueue on mayday list
> > until it is clearly no longer in need of assistance.  This allows
> > all requests to be handled in a timely fashion.
> > 
> > We keep each pool_workqueue on the mayday list until
> > need_to_create_worker() is false, and no work for this workqueue is
> > found in the pool.
> > 
> > I have tested this in combination with a (hackish) patch which forces
> > all work items to be handled by the rescuer thread.  In that context
> > it significantly improves performance.  A similar patch for a 3.0
> > kernel significantly improved performance on a heavy work load.
> > 
> > Thanks to Jan Kara for some design ideas, and to Dongsu Park for
> > some comments and testing.
> > 
> > tj: Inverted the lock order between wq_mayday_lock and pool->lock with
> >     a preceding patch and simplified this patch.  Added comment and
> >     updated changelog accordingly.  Dongsu spotted missing get_pwq()
> >     in the simplified code.
> > 
> > Cc: Dongsu Park <dongsu.park@profitbricks.com>
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > Signed-off-by: Tejun Heo <tj@kernel.org>
> 
> Too late for for-3.18-fixes.  Applied the two patches to wq/for-3.19.

I've just run my synthetic test which forces most works to be handled by the
rescuer thread.  With these patches it works more smoothly than without.
And importantly: it still works :-)

Thanks,
NeilBrown

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2014-12-08 22:47 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-29  6:26 [PATCH/RFC] workqueue: allow rescuer thread to do more work NeilBrown
2014-10-29 14:32 ` Tejun Heo
2014-10-29 23:19   ` NeilBrown
2014-11-04 14:22     ` Tejun Heo
2014-11-06 16:58       ` Dongsu Park
2014-11-07  3:03         ` Lai Jiangshan
2014-11-10  5:28           ` NeilBrown
2014-11-10  8:52             ` Jan Kara
2014-11-10 22:04               ` NeilBrown
2014-11-14 17:21                 ` Tejun Heo
2014-11-18  4:27                 ` [PATCH - v3?] " NeilBrown
2014-11-18  6:01                   ` Lai Jiangshan
2014-11-18  6:11                     ` NeilBrown
2014-12-02 20:43                   ` Tejun Heo
2014-12-03  0:40                     ` NeilBrown
2014-12-03 17:20                       ` Tejun Heo
2014-12-03 18:02                         ` Tejun Heo
2014-12-03 22:31                           ` Dongsu Park
2014-12-04  1:19                             ` NeilBrown
2014-12-04  1:01                           ` Lai Jiangshan
2014-12-04 14:57                             ` Tejun Heo
2014-12-04 15:11                           ` [PATCH workqueue/for-3.18-fixes 1/2] workqueue: invert the order between pool->lock and wq_mayday_lock Tejun Heo
2014-12-04 15:12                             ` [PATCH workqueue/for-3.18-fixes 2/2] workqueue: allow rescuer thread to do more work Tejun Heo
2014-12-08 17:40                               ` Tejun Heo
2014-12-08 22:47                                 ` NeilBrown
2014-12-05  2:09                             ` [PATCH workqueue/for-3.18-fixes 1/2] workqueue: invert the order between pool->lock and wq_mayday_lock Lai Jiangshan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).