LKML Archive on lore.kernel.org
* [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
@ 2015-01-14 11:36 Vinayak Menon
  2015-01-14 16:50 ` Michal Hocko
  2015-01-16  1:17 ` Andrew Morton
  0 siblings, 2 replies; 30+ messages in thread
From: Vinayak Menon @ 2015-01-14 11:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, hannes, vdavydov, mhocko, mgorman, minchan, Vinayak Menon

It is observed that sometimes multiple tasks get blocked for a long
time in the congestion_wait loop below, in shrink_inactive_list. This
is because the vm_stat values have not been synced.

(__schedule) from [<c0a03328>]
(schedule_timeout) from [<c0a04940>]
(io_schedule_timeout) from [<c01d585c>]
(congestion_wait) from [<c01cc9d8>]
(shrink_inactive_list) from [<c01cd034>]
(shrink_zone) from [<c01cdd08>]
(try_to_free_pages) from [<c01c442c>]
(__alloc_pages_nodemask) from [<c01f1884>]
(new_slab) from [<c09fcf60>]
(__slab_alloc) from [<c01f1a6c>]

In one such instance, zone_page_state(zone, NR_ISOLATED_FILE)
returned 14, zone_page_state(zone, NR_INACTIVE_FILE)
returned 92, and GFP_IOFS was set; this resulted
in too_many_isolated returning true. But one CPU's
pageset vm_stat_diff had NR_ISOLATED_FILE as "-14", so the
actual isolated count was zero. As there weren't any more
updates to NR_ISOLATED_FILE and the vmstat_update deferred work
had not been scheduled yet, 7 tasks were spinning in the
congestion wait loop for around 4 seconds, in the direct
reclaim path.

This patch uses zone_page_state_snapshot instead, but restricts
its usage to avoid performance penalty.

Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
---
 mm/vmscan.c | 56 +++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 37 insertions(+), 19 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8772b..266551f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1392,6 +1392,32 @@ int isolate_lru_page(struct page *page)
 	return ret;
 }
 
+static int __too_many_isolated(struct zone *zone, int file,
+	struct scan_control *sc, int safe)
+{
+	unsigned long inactive, isolated;
+
+	if (safe) {
+		inactive = zone_page_state_snapshot(zone,
+				NR_INACTIVE_ANON + 2 * file);
+		isolated = zone_page_state_snapshot(zone,
+				NR_ISOLATED_ANON + file);
+	} else {
+		inactive = zone_page_state(zone, NR_INACTIVE_ANON + 2 * file);
+		isolated = zone_page_state(zone, NR_ISOLATED_ANON + file);
+	}
+
+	/*
+	 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they
+	 * won't get blocked by normal direct-reclaimers, forming a circular
+	 * deadlock.
+	 */
+	if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
+		inactive >>= 3;
+
+	return isolated > inactive;
+}
+
 /*
  * A direct reclaimer may isolate SWAP_CLUSTER_MAX pages from the LRU list and
  * then get resheduled. When there are massive number of tasks doing page
@@ -1400,33 +1426,22 @@ int isolate_lru_page(struct page *page)
  * unnecessary swapping, thrashing and OOM.
  */
 static int too_many_isolated(struct zone *zone, int file,
-		struct scan_control *sc)
+		struct scan_control *sc, int safe)
 {
-	unsigned long inactive, isolated;
-
 	if (current_is_kswapd())
 		return 0;
 
 	if (!global_reclaim(sc))
 		return 0;
 
-	if (file) {
-		inactive = zone_page_state(zone, NR_INACTIVE_FILE);
-		isolated = zone_page_state(zone, NR_ISOLATED_FILE);
-	} else {
-		inactive = zone_page_state(zone, NR_INACTIVE_ANON);
-		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
+	if (unlikely(__too_many_isolated(zone, file, sc, 0))) {
+		if (safe)
+			return __too_many_isolated(zone, file, sc, safe);
+		else
+			return 1;
 	}
 
-	/*
-	 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they
-	 * won't get blocked by normal direct-reclaimers, forming a circular
-	 * deadlock.
-	 */
-	if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
-		inactive >>= 3;
-
-	return isolated > inactive;
+	return 0;
 }
 
 static noinline_for_stack void
@@ -1516,15 +1531,18 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	unsigned long nr_immediate = 0;
 	isolate_mode_t isolate_mode = 0;
 	int file = is_file_lru(lru);
+	int safe = 0;
 	struct zone *zone = lruvec_zone(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
-	while (unlikely(too_many_isolated(zone, file, sc))) {
+	while (unlikely(too_many_isolated(zone, file, sc, safe))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/* We are about to die and free our memory. Return now. */
 		if (fatal_signal_pending(current))
 			return SWAP_CLUSTER_MAX;
+
+		safe = 1;
 	}
 
 	lru_add_drain();
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-14 11:36 [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated Vinayak Menon
@ 2015-01-14 16:50 ` Michal Hocko
  2015-01-15 17:24   ` Vinayak Menon
  2015-01-16  1:17 ` Andrew Morton
  1 sibling, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2015-01-14 16:50 UTC (permalink / raw)
  To: Vinayak Menon
  Cc: linux-mm, linux-kernel, akpm, hannes, vdavydov, mgorman, minchan

On Wed 14-01-15 17:06:59, Vinayak Menon wrote:
[...]
> In one such instance, zone_page_state(zone, NR_ISOLATED_FILE)
> had returned 14, zone_page_state(zone, NR_INACTIVE_FILE)
> returned 92, and GFP_IOFS was set, and this resulted
> in too_many_isolated returning true. But one of the CPU's
> pageset vm_stat_diff had NR_ISOLATED_FILE as "-14". So the
> actual isolated count was zero. As there weren't any more
> updates to NR_ISOLATED_FILE and vmstat_update deferred work
> had not been scheduled yet, 7 tasks were spinning in the
> congestion wait loop for around 4 seconds, in the direct
> reclaim path.

Not syncing for such a long time doesn't sound right. I am not familiar
with the vmstat syncing but sysctl_stat_interval is HZ so it should
happen much more often than every 4 seconds.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-14 16:50 ` Michal Hocko
@ 2015-01-15 17:24   ` Vinayak Menon
  2015-01-16 15:49     ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Vinayak Menon @ 2015-01-15 17:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, akpm, hannes, vdavydov, mgorman, minchan

On 01/14/2015 10:20 PM, Michal Hocko wrote:
> On Wed 14-01-15 17:06:59, Vinayak Menon wrote:
> [...]
>> In one such instance, zone_page_state(zone, NR_ISOLATED_FILE)
>> had returned 14, zone_page_state(zone, NR_INACTIVE_FILE)
>> returned 92, and GFP_IOFS was set, and this resulted
>> in too_many_isolated returning true. But one of the CPU's
>> pageset vm_stat_diff had NR_ISOLATED_FILE as "-14". So the
>> actual isolated count was zero. As there weren't any more
>> updates to NR_ISOLATED_FILE and vmstat_update deferred work
>> had not been scheduled yet, 7 tasks were spinning in the
>> congestion wait loop for around 4 seconds, in the direct
>> reclaim path.
>
> Not syncing for such a long time doesn't sound right. I am not familiar
> with the vmstat syncing but sysctl_stat_interval is HZ so it should
> happen much more often than every 4 seconds.
>

Though the interval is HZ, since the vmstat_work is declared as 
deferrable work, IIUC the timer trigger can be deferred to the
next non-deferrable timer expiry on a CPU which is idle. This 
results in the vmstat syncing on an idle CPU being delayed by seconds. 
Maybe in most cases this behavior is fine, except in cases like this. 
Even in the usual cases where the timer triggers in 1-2 secs, is it 
fine to let the tasks in the reclaim path wait that long unnecessarily 
when there isn't any real congestion?

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation


* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-14 11:36 [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated Vinayak Menon
  2015-01-14 16:50 ` Michal Hocko
@ 2015-01-16  1:17 ` Andrew Morton
  2015-01-16  5:10   ` Vinayak Menon
  2015-01-17 16:29   ` Vinayak Menon
  1 sibling, 2 replies; 30+ messages in thread
From: Andrew Morton @ 2015-01-16  1:17 UTC (permalink / raw)
  To: Vinayak Menon
  Cc: linux-mm, linux-kernel, hannes, vdavydov, mhocko, mgorman, minchan

On Wed, 14 Jan 2015 17:06:59 +0530 Vinayak Menon <vinmenon@codeaurora.org> wrote:

> It is observed that sometimes multiple tasks get blocked for long
> in the congestion_wait loop below, in shrink_inactive_list. This
> is because of vm_stat values not being synced.
> 
> (__schedule) from [<c0a03328>]
> (schedule_timeout) from [<c0a04940>]
> (io_schedule_timeout) from [<c01d585c>]
> (congestion_wait) from [<c01cc9d8>]
> (shrink_inactive_list) from [<c01cd034>]
> (shrink_zone) from [<c01cdd08>]
> (try_to_free_pages) from [<c01c442c>]
> (__alloc_pages_nodemask) from [<c01f1884>]
> (new_slab) from [<c09fcf60>]
> (__slab_alloc) from [<c01f1a6c>]
> 
> In one such instance, zone_page_state(zone, NR_ISOLATED_FILE)
> had returned 14, zone_page_state(zone, NR_INACTIVE_FILE)
> returned 92, and GFP_IOFS was set, and this resulted
> in too_many_isolated returning true. But one of the CPU's
> pageset vm_stat_diff had NR_ISOLATED_FILE as "-14". So the
> actual isolated count was zero. As there weren't any more
> updates to NR_ISOLATED_FILE and vmstat_update deferred work
> had not been scheduled yet, 7 tasks were spinning in the
> congestion wait loop for around 4 seconds, in the direct
> reclaim path.
> 
> This patch uses zone_page_state_snapshot instead, but restricts
> its usage to avoid performance penalty.

Seems reasonable.

>
> ...
>
> @@ -1516,15 +1531,18 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	unsigned long nr_immediate = 0;
>  	isolate_mode_t isolate_mode = 0;
>  	int file = is_file_lru(lru);
> +	int safe = 0;
>  	struct zone *zone = lruvec_zone(lruvec);
>  	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>  
> -	while (unlikely(too_many_isolated(zone, file, sc))) {
> +	while (unlikely(too_many_isolated(zone, file, sc, safe))) {
>  		congestion_wait(BLK_RW_ASYNC, HZ/10);
>  
>  		/* We are about to die and free our memory. Return now. */
>  		if (fatal_signal_pending(current))
>  			return SWAP_CLUSTER_MAX;
> +
> +		safe = 1;
>  	}

But here and under the circumstances you describe, we'll call
congestion_wait() a single time.  That shouldn't have occurred.

So how about we put the fallback logic into too_many_isolated() itself?



From: Andrew Morton <akpm@linux-foundation.org>
Subject: mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated-fix

Move the zone_page_state_snapshot() fallback logic into
too_many_isolated(), so shrink_inactive_list() doesn't incorrectly call
congestion_wait().

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |   23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

diff -puN mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated-fix mm/vmscan.c
--- a/mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated-fix
+++ a/mm/vmscan.c
@@ -1402,7 +1402,7 @@ int isolate_lru_page(struct page *page)
 }
 
 static int __too_many_isolated(struct zone *zone, int file,
-	struct scan_control *sc, int safe)
+			       struct scan_control *sc, int safe)
 {
 	unsigned long inactive, isolated;
 
@@ -1435,7 +1435,7 @@ static int __too_many_isolated(struct zo
  * unnecessary swapping, thrashing and OOM.
  */
 static int too_many_isolated(struct zone *zone, int file,
-		struct scan_control *sc, int safe)
+			     struct scan_control *sc)
 {
 	if (current_is_kswapd())
 		return 0;
@@ -1443,12 +1443,14 @@ static int too_many_isolated(struct zone
 	if (!global_reclaim(sc))
 		return 0;
 
-	if (unlikely(__too_many_isolated(zone, file, sc, 0))) {
-		if (safe)
-			return __too_many_isolated(zone, file, sc, safe);
-		else
-			return 1;
-	}
+	/*
+	 * __too_many_isolated(safe=0) is fast but inaccurate, because it
+	 * doesn't account for the vm_stat_diff[] counters.  So if it looks
+	 * like too_many_isolated() is about to return true, fall back to the
+	 * slower, more accurate zone_page_state_snapshot().
+	 */
+	if (unlikely(__too_many_isolated(zone, file, sc, 0)))
+		return __too_many_isolated(zone, file, sc, safe);
 
 	return 0;
 }
@@ -1540,18 +1542,15 @@ shrink_inactive_list(unsigned long nr_to
 	unsigned long nr_immediate = 0;
 	isolate_mode_t isolate_mode = 0;
 	int file = is_file_lru(lru);
-	int safe = 0;
 	struct zone *zone = lruvec_zone(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
-	while (unlikely(too_many_isolated(zone, file, sc, safe))) {
+	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/* We are about to die and free our memory. Return now. */
 		if (fatal_signal_pending(current))
 			return SWAP_CLUSTER_MAX;
-
-		safe = 1;
 	}
 
 	lru_add_drain();
_



* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-16  1:17 ` Andrew Morton
@ 2015-01-16  5:10   ` Vinayak Menon
  2015-01-17 16:29   ` Vinayak Menon
  1 sibling, 0 replies; 30+ messages in thread
From: Vinayak Menon @ 2015-01-16  5:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, hannes, vdavydov, mhocko, mgorman, minchan

On 01/16/2015 06:47 AM, Andrew Morton wrote:
> On Wed, 14 Jan 2015 17:06:59 +0530 Vinayak Menon <vinmenon@codeaurora.org> wrote:
>
>> It is observed that sometimes multiple tasks get blocked for long
>> in the congestion_wait loop below, in shrink_inactive_list. This
>> is because of vm_stat values not being synced.
>>
>> (__schedule) from [<c0a03328>]
>> (schedule_timeout) from [<c0a04940>]
>> (io_schedule_timeout) from [<c01d585c>]
>> (congestion_wait) from [<c01cc9d8>]
>> (shrink_inactive_list) from [<c01cd034>]
>> (shrink_zone) from [<c01cdd08>]
>> (try_to_free_pages) from [<c01c442c>]
>> (__alloc_pages_nodemask) from [<c01f1884>]
>> (new_slab) from [<c09fcf60>]
>> (__slab_alloc) from [<c01f1a6c>]
>>
>> In one such instance, zone_page_state(zone, NR_ISOLATED_FILE)
>> had returned 14, zone_page_state(zone, NR_INACTIVE_FILE)
>> returned 92, and GFP_IOFS was set, and this resulted
>> in too_many_isolated returning true. But one of the CPU's
>> pageset vm_stat_diff had NR_ISOLATED_FILE as "-14". So the
>> actual isolated count was zero. As there weren't any more
>> updates to NR_ISOLATED_FILE and vmstat_update deferred work
>> had not been scheduled yet, 7 tasks were spinning in the
>> congestion wait loop for around 4 seconds, in the direct
>> reclaim path.
>>
>> This patch uses zone_page_state_snapshot instead, but restricts
>> its usage to avoid performance penalty.
>
> Seems reasonable.
>
>>
>> ...
>>
>> @@ -1516,15 +1531,18 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>>   	unsigned long nr_immediate = 0;
>>   	isolate_mode_t isolate_mode = 0;
>>   	int file = is_file_lru(lru);
>> +	int safe = 0;
>>   	struct zone *zone = lruvec_zone(lruvec);
>>   	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>>
>> -	while (unlikely(too_many_isolated(zone, file, sc))) {
>> +	while (unlikely(too_many_isolated(zone, file, sc, safe))) {
>>   		congestion_wait(BLK_RW_ASYNC, HZ/10);
>>
>>   		/* We are about to die and free our memory. Return now. */
>>   		if (fatal_signal_pending(current))
>>   			return SWAP_CLUSTER_MAX;
>> +
>> +		safe = 1;
>>   	}
>
> But here and under the circumstances you describe, we'll call
> congestion_wait() a single time.  That shouldn't have occurred.
>
> So how about we put the fallback logic into too_many_isolated() itself?
>
>

congestion_wait was allowed to run once as an optimization, considering 
that __too_many_isolated (the unsafe, faster version) is correct in 
returning true most of the time; that way we avoid calling the safe 
version in most cases. But I agree that we should not call 
congestion_wait unnecessarily even in those rare cases. So this looks 
correct to me.


>
> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated-fix
>
> Move the zone_page_state_snapshot() fallback logic into
> too_many_isolated(), so shrink_inactive_list() doesn't incorrectly call
> congestion_wait().
>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Vinayak Menon <vinmenon@codeaurora.org>
> Cc: Vladimir Davydov <vdavydov@parallels.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>
>   mm/vmscan.c |   23 +++++++++++------------
>   1 file changed, 11 insertions(+), 12 deletions(-)
>
> diff -puN mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated-fix mm/vmscan.c
> --- a/mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated-fix
> +++ a/mm/vmscan.c
> @@ -1402,7 +1402,7 @@ int isolate_lru_page(struct page *page)
>   }
>
>   static int __too_many_isolated(struct zone *zone, int file,
> -	struct scan_control *sc, int safe)
> +			       struct scan_control *sc, int safe)
>   {
>   	unsigned long inactive, isolated;
>
> @@ -1435,7 +1435,7 @@ static int __too_many_isolated(struct zo
>    * unnecessary swapping, thrashing and OOM.
>    */
>   static int too_many_isolated(struct zone *zone, int file,
> -		struct scan_control *sc, int safe)
> +			     struct scan_control *sc)
>   {
>   	if (current_is_kswapd())
>   		return 0;
> @@ -1443,12 +1443,14 @@ static int too_many_isolated(struct zone
>   	if (!global_reclaim(sc))
>   		return 0;
>
> -	if (unlikely(__too_many_isolated(zone, file, sc, 0))) {
> -		if (safe)
> -			return __too_many_isolated(zone, file, sc, safe);
> -		else
> -			return 1;
> -	}
> +	/*
> +	 * __too_many_isolated(safe=0) is fast but inaccurate, because it
> +	 * doesn't account for the vm_stat_diff[] counters.  So if it looks
> +	 * like too_many_isolated() is about to return true, fall back to the
> +	 * slower, more accurate zone_page_state_snapshot().
> +	 */
> +	if (unlikely(__too_many_isolated(zone, file, sc, 0)))
> +		return __too_many_isolated(zone, file, sc, safe);
>
>   	return 0;
>   }
> @@ -1540,18 +1542,15 @@ shrink_inactive_list(unsigned long nr_to
>   	unsigned long nr_immediate = 0;
>   	isolate_mode_t isolate_mode = 0;
>   	int file = is_file_lru(lru);
> -	int safe = 0;
>   	struct zone *zone = lruvec_zone(lruvec);
>   	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>
> -	while (unlikely(too_many_isolated(zone, file, sc, safe))) {
> +	while (unlikely(too_many_isolated(zone, file, sc))) {
>   		congestion_wait(BLK_RW_ASYNC, HZ/10);
>
>   		/* We are about to die and free our memory. Return now. */
>   		if (fatal_signal_pending(current))
>   			return SWAP_CLUSTER_MAX;
> -
> -		safe = 1;
>   	}
>
>   	lru_add_drain();
> _
>


-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation


* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-15 17:24   ` Vinayak Menon
@ 2015-01-16 15:49     ` Michal Hocko
  2015-01-16 17:57       ` Michal Hocko
                         ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Michal Hocko @ 2015-01-16 15:49 UTC (permalink / raw)
  To: Vinayak Menon
  Cc: linux-mm, linux-kernel, akpm, hannes, vdavydov, mgorman, minchan,
	Christoph Lameter

On Thu 15-01-15 22:54:20, Vinayak Menon wrote:
> On 01/14/2015 10:20 PM, Michal Hocko wrote:
> >On Wed 14-01-15 17:06:59, Vinayak Menon wrote:
> >[...]
> >>In one such instance, zone_page_state(zone, NR_ISOLATED_FILE)
> >>had returned 14, zone_page_state(zone, NR_INACTIVE_FILE)
> >>returned 92, and GFP_IOFS was set, and this resulted
> >>in too_many_isolated returning true. But one of the CPU's
> >>pageset vm_stat_diff had NR_ISOLATED_FILE as "-14". So the
> >>actual isolated count was zero. As there weren't any more
> >>updates to NR_ISOLATED_FILE and vmstat_update deferred work
> >>had not been scheduled yet, 7 tasks were spinning in the
> >>congestion wait loop for around 4 seconds, in the direct
> >>reclaim path.
> >
> >Not syncing for such a long time doesn't sound right. I am not familiar
> >with the vmstat syncing but sysctl_stat_interval is HZ so it should
> >happen much more often than every 4 seconds.
> >
> 
> Though the interval is HZ, since the vmstat_work is declared as a
> deferrable work, IIUC the timer trigger can be deferred to the next
> non-deferrable timer expiry on the CPU which is in idle. This results
> in the vmstat syncing on an idle CPU delayed by seconds. Maybe in
> most cases this behavior is fine, except in cases like this.

I am not sure I understand the above because CPU being idle doesn't
seem important AFAICS. Anyway I have checked the current code which has
changed quite recently by 7cc36bbddde5 (vmstat: on-demand vmstat workers
V8). Let's CC Christoph (the thread starts here:
http://thread.gmane.org/gmane.linux.kernel.mm/127229).

__round_jiffies_relative can easily make timeout 2HZ from 1HZ. Now we
have vmstat_shepherd which waits to be queued and then waits to run. When
it finally runs it only queues per-cpu vmstat_work which can also end
up being 2HZ for some CPUs. So we can indeed have 4 seconds spent just
for queuing. Not even mentioning work item latencies. Especially when
workers are overloaded e.g. by fs work items and additional workers
cannot be created e.g. due to memory pressure, so they are processed only
by the workqueue rescuer. And latencies would grow a lot.

We have seen an issue where the rescuer had to process thousands of work
items because all workers were blocked on memory allocation - see
http://thread.gmane.org/gmane.linux.kernel/1816452 - which is in mainline
already as 008847f66c38 (workqueue: allow rescuer thread to do more work.).
The patch reduces the latency considerably but it doesn't remove it
completely.

If the time between two syncs is large then the per-cpu drift might be
really large as well. Isn't this going to hurt other places where we
rely on stats as well?

In this particular case the reclaimers are throttled because they see
too many isolated pages which was just a result of the per-cpu drift and
small LRU list. The system seems to be under serious memory pressure
already because there is basically no file cache. If the reclaimers
wouldn't be throttled they would fail the reclaim probably and trigger
OOM killer which might help to resolve the situation. Reclaimers are
throttled instead.

Why cannot we simply update the global counters from vmstat_shepherd
directly? Sure we would access remote CPU counters but that wouldn't
happen often. That still wouldn't be enough because there is the
workqueue latency. So why cannot we do that whole shepherd thing from a
kernel thread instead? If the NUMA zone->pageset->pcp thing has to be
done locally then it could have a per-node kthread rather than being part
of the unrelated vmstat code.
I might be missing obvious things because I still haven't digested the
code completely but this whole thing seems overly complicated and I
do not see a good reason for that. If the primary motivation was cpu
isolation then kthreads would sound like a better idea because you can
play with the explicit affinity from userspace.

> Even in usual cases where the timer triggers in 1-2 secs, is it fine to
> let the tasks in reclaim path wait that long unnecessarily when there
> isn't any real congestion?

But how do we find that out?

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-16 15:49     ` Michal Hocko
@ 2015-01-16 17:57       ` Michal Hocko
  2015-01-16 19:17         ` Christoph Lameter
  2015-01-17 15:18       ` Vinayak Menon
  2015-01-29 17:32       ` Christoph Lameter
  2 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2015-01-16 17:57 UTC (permalink / raw)
  To: Vinayak Menon
  Cc: linux-mm, linux-kernel, akpm, hannes, vdavydov, mgorman, minchan,
	Christoph Lameter

On Fri 16-01-15 16:49:22, Michal Hocko wrote:
[...]
> Why cannot we simply update the global counters from vmstat_shepherd
> directly?

OK, I should have checked the updating paths... This would be racy, so
update from remote is not an option without additional trickery (like
retries etc.) :/
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-16 17:57       ` Michal Hocko
@ 2015-01-16 19:17         ` Christoph Lameter
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Lameter @ 2015-01-16 19:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vinayak Menon, linux-mm, linux-kernel, akpm, hannes, vdavydov,
	mgorman, minchan

On Fri, 16 Jan 2015, Michal Hocko wrote:

> On Fri 16-01-15 16:49:22, Michal Hocko wrote:
> [...]
> > Why cannot we simply update the global counters from vmstat_shepherd
> > directly?
>
> OK, I should have checked the updating paths... This would be racy, so
> update from remote is not an option without additional trickery (like
> retries etc.) :/

You can do that if you have a way to ensure that the other cpu does not
access the counter. F.e. if the other cpu is staying in user space all the
time or it is guaranteed to be idle.




* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-16 15:49     ` Michal Hocko
  2015-01-16 17:57       ` Michal Hocko
@ 2015-01-17 15:18       ` Vinayak Menon
  2015-01-17 19:48         ` Christoph Lameter
  2015-01-29 17:32       ` Christoph Lameter
  2 siblings, 1 reply; 30+ messages in thread
From: Vinayak Menon @ 2015-01-17 15:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, akpm, hannes, vdavydov, mgorman, minchan,
	Christoph Lameter

On 01/16/2015 09:19 PM, Michal Hocko wrote:
> On Thu 15-01-15 22:54:20, Vinayak Menon wrote:
>> On 01/14/2015 10:20 PM, Michal Hocko wrote:
>>> On Wed 14-01-15 17:06:59, Vinayak Menon wrote:
>>> [...]
>>>> In one such instance, zone_page_state(zone, NR_ISOLATED_FILE)
>>>> had returned 14, zone_page_state(zone, NR_INACTIVE_FILE)
>>>> returned 92, and GFP_IOFS was set, and this resulted
>>>> in too_many_isolated returning true. But one of the CPU's
>>>> pageset vm_stat_diff had NR_ISOLATED_FILE as "-14". So the
>>>> actual isolated count was zero. As there weren't any more
>>>> updates to NR_ISOLATED_FILE and vmstat_update deferred work
>>>> had not been scheduled yet, 7 tasks were spinning in the
>>>> congestion wait loop for around 4 seconds, in the direct
>>>> reclaim path.
>>>
>>> Not syncing for such a long time doesn't sound right. I am not familiar
>>> with the vmstat syncing but sysctl_stat_interval is HZ so it should
>>> happen much more often than every 4 seconds.
>>>
>>
>> Though the interval is HZ, since the vmstat_work is declared as a
>> deferrable work, IIUC the timer trigger can be deferred to the next
>> non-deferrable timer expiry on the CPU which is in idle. This results
>> in the vmstat syncing on an idle CPU delayed by seconds. Maybe in
>> most cases this behavior is fine, except in cases like this.
>
> I am not sure I understand the above because CPU being idle doesn't
> seem important AFAICS. Anyway I have checked the current code which has
> changed quite recently by 7cc36bbddde5 (vmstat: on-demand vmstat workers
> V8). Let's CC Christoph (the thread starts here:
> http://thread.gmane.org/gmane.linux.kernel.mm/127229).
>

I will try to explain the exact observations. All the cases which I had 
encountered had similar symptoms. In one of the cases, it was CPU3 
alone which had not updated the vmstat_diff. This CPU was idle for 
around 30 secs. When I looked at the tvec base for this CPU, the timer 
associated with vmstat_update had its expiry time less than the current 
jiffies. This timer had its deferrable flag set, and was tied to the 
next non-deferrable timer in the list. Since deferrable timers can't 
wake up the CPU, the vmstat sync for this CPU was deferred for a long 
time, i.e. till the expiry of the next non-deferrable timer. The issue 
was caught because one of the tasks which was in the reclaim path, in 
the congestion_wait loop, had an associated watchdog, which resulted in 
a panic after 4 secs. So 4 secs is actually the watchdog expiry, and 
the time we can get blocked in the congestion loop can be even more.



-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation


* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-16  1:17 ` Andrew Morton
  2015-01-16  5:10   ` Vinayak Menon
@ 2015-01-17 16:29   ` Vinayak Menon
  2015-02-11 22:14     ` Andrew Morton
  1 sibling, 1 reply; 30+ messages in thread
From: Vinayak Menon @ 2015-01-17 16:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, hannes, vdavydov, mhocko, mgorman, minchan

On 01/16/2015 06:47 AM, Andrew Morton wrote:

> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated-fix
>
> Move the zone_page_state_snapshot() fallback logic into
> too_many_isolated(), so shrink_inactive_list() doesn't incorrectly call
> congestion_wait().
>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Vinayak Menon <vinmenon@codeaurora.org>
> Cc: Vladimir Davydov <vdavydov@parallels.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>
>   mm/vmscan.c |   23 +++++++++++------------
>   1 file changed, 11 insertions(+), 12 deletions(-)
>
> diff -puN mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated-fix mm/vmscan.c
> --- a/mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated-fix
> +++ a/mm/vmscan.c
> @@ -1402,7 +1402,7 @@ int isolate_lru_page(struct page *page)
>   }
>
>   static int __too_many_isolated(struct zone *zone, int file,
> -	struct scan_control *sc, int safe)
> +			       struct scan_control *sc, int safe)
>   {
>   	unsigned long inactive, isolated;
>
> @@ -1435,7 +1435,7 @@ static int __too_many_isolated(struct zo
>    * unnecessary swapping, thrashing and OOM.
>    */
>   static int too_many_isolated(struct zone *zone, int file,
> -		struct scan_control *sc, int safe)
> +			     struct scan_control *sc)
>   {
>   	if (current_is_kswapd())
>   		return 0;
> @@ -1443,12 +1443,14 @@ static int too_many_isolated(struct zone
>   	if (!global_reclaim(sc))
>   		return 0;
>
> -	if (unlikely(__too_many_isolated(zone, file, sc, 0))) {
> -		if (safe)
> -			return __too_many_isolated(zone, file, sc, safe);
> -		else
> -			return 1;
> -	}
> +	/*
> +	 * __too_many_isolated(safe=0) is fast but inaccurate, because it
> +	 * doesn't account for the vm_stat_diff[] counters.  So if it looks
> +	 * like too_many_isolated() is about to return true, fall back to the
> +	 * slower, more accurate zone_page_state_snapshot().
> +	 */
> +	if (unlikely(__too_many_isolated(zone, file, sc, 0)))
> +		return __too_many_isolated(zone, file, sc, safe);

Just noticed that in the above statement it should be "1" instead of
"safe"; "safe" is not declared in this scope.



-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-17 15:18       ` Vinayak Menon
@ 2015-01-17 19:48         ` Christoph Lameter
  2015-01-19  4:27           ` Vinayak Menon
  2015-01-26 17:28           ` Michal Hocko
  0 siblings, 2 replies; 30+ messages in thread
From: Christoph Lameter @ 2015-01-17 19:48 UTC (permalink / raw)
  To: Vinayak Menon
  Cc: Michal Hocko, linux-mm, linux-kernel, akpm, hannes, vdavydov,
	mgorman, minchan

On Sat, 17 Jan 2015, Vinayak Menon wrote:

> which had not updated the vmstat_diff. This CPU was in idle for around 30
> secs. When I looked at the tvec base for this CPU, the timer associated with
> vmstat_update had its expiry time less than current jiffies. This timer had
> its deferrable flag set, and was tied to the next non-deferrable timer in the

We can remove the deferrable flag now since the vmstat threads are only
activated as necessary with the recent changes. Looks like this could fix
your issue?


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-17 19:48         ` Christoph Lameter
@ 2015-01-19  4:27           ` Vinayak Menon
  2015-01-21 14:39             ` Michal Hocko
  2015-01-26 17:28           ` Michal Hocko
  1 sibling, 1 reply; 30+ messages in thread
From: Vinayak Menon @ 2015-01-19  4:27 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Michal Hocko, linux-mm, linux-kernel, akpm, hannes, vdavydov,
	mgorman, minchan

On 01/18/2015 01:18 AM, Christoph Lameter wrote:
> On Sat, 17 Jan 2015, Vinayak Menon wrote:
>
>> which had not updated the vmstat_diff. This CPU was in idle for around 30
>> secs. When I looked at the tvec base for this CPU, the timer associated with
>> vmstat_update had its expiry time less than current jiffies. This timer had
>> its deferrable flag set, and was tied to the next non-deferrable timer in the
>
> We can remove the deferrable flag now since the vmstat threads are only
> activated as necessary with the recent changes. Looks like this could fix
> your issue?
>

Yes, this should fix my issue.
But I think we may need the fix in too_many_isolated, since there can 
still be a delay of a few seconds (HZ by default, and even more because 
of the reasons pointed out by Michal) which will result in reclaimers 
unnecessarily entering congestion_wait. No?


-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-19  4:27           ` Vinayak Menon
@ 2015-01-21 14:39             ` Michal Hocko
  2015-01-22 15:16               ` Vlastimil Babka
  2015-01-22 16:11               ` Christoph Lameter
  0 siblings, 2 replies; 30+ messages in thread
From: Michal Hocko @ 2015-01-21 14:39 UTC (permalink / raw)
  To: Vinayak Menon
  Cc: Christoph Lameter, linux-mm, linux-kernel, akpm, hannes,
	vdavydov, mgorman, minchan

On Mon 19-01-15 09:57:08, Vinayak Menon wrote:
> On 01/18/2015 01:18 AM, Christoph Lameter wrote:
> >On Sat, 17 Jan 2015, Vinayak Menon wrote:
> >
> >>which had not updated the vmstat_diff. This CPU was in idle for around 30
> >>secs. When I looked at the tvec base for this CPU, the timer associated with
> >>vmstat_update had its expiry time less than current jiffies. This timer had
> >>its deferrable flag set, and was tied to the next non-deferrable timer in the
> >
> >We can remove the deferrable flag now since the vmstat threads are only
> >activated as necessary with the recent changes. Looks like this could fix
> >your issue?
> >
> 
> Yes, this should fix my issue.

Does it? Because I would much prefer not getting into an un-synced state
in the first place over working around one specific place which shows
the problems right now.

> But I think we may need the fix in too_many_isolated, since there can still
> be a delay of few seconds (HZ by default and even more because of reasons
> pointed out by Michal) which will result in reclaimers unnecessarily
> entering congestion_wait. No ?

I think we can solve this as well. We can stick vmstat_shepherd into a
kernel thread running a loop with the configured timeout, then build a
mask of CPUs which need the update and run vmstat_update from
IPI context (smp_call_function_many).
We would have to drop cond_resched from refresh_cpu_vm_stats of
course. The nr_zones x NR_VM_ZONE_STAT_ITEMS in the IPI context
shouldn't be excessive but I haven't measured that so I might be easily
wrong.

Anyway, that should work more reliably than the current scheme and
should help to reduce pointless wakeups which the original patchset was
addressing.  Or am I missing something?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-21 14:39             ` Michal Hocko
@ 2015-01-22 15:16               ` Vlastimil Babka
  2015-01-22 16:11               ` Christoph Lameter
  1 sibling, 0 replies; 30+ messages in thread
From: Vlastimil Babka @ 2015-01-22 15:16 UTC (permalink / raw)
  To: Michal Hocko, Vinayak Menon
  Cc: Christoph Lameter, linux-mm, linux-kernel, akpm, hannes,
	vdavydov, mgorman, minchan

On 01/21/2015 03:39 PM, Michal Hocko wrote:
> On Mon 19-01-15 09:57:08, Vinayak Menon wrote:
>> On 01/18/2015 01:18 AM, Christoph Lameter wrote:
>>> On Sat, 17 Jan 2015, Vinayak Menon wrote:
>>>
>>>> which had not updated the vmstat_diff. This CPU was in idle for around 30
>>>> secs. When I looked at the tvec base for this CPU, the timer associated with
>>>> vmstat_update had its expiry time less than current jiffies. This timer had
>>>> its deferrable flag set, and was tied to the next non-deferrable timer in the
>>>
>>> We can remove the deferrable flag now since the vmstat threads are only
>>> activated as necessary with the recent changes. Looks like this could fix
>>> your issue?
>>>
>>
>> Yes, this should fix my issue.
>
> Does it? Because I would prefer not getting into un-synced state much
> more than playing around one specific place which shows the problems
> right now.
>
>> But I think we may need the fix in too_many_isolated, since there can still
>> be a delay of few seconds (HZ by default and even more because of reasons
>> pointed out by Michal) which will result in reclaimers unnecessarily
>> entering congestion_wait. No ?
>
> I think we can solve this as well. We can stick vmstat_shepherd into a
> kernel thread with a loop with the configured timeout and then create a
> mask of CPUs which need the update and run vmstat_update from
> IPI context (smp_call_function_many).
> We would have to drop cond_resched from refresh_cpu_vm_stats of
> course. The nr_zones x NR_VM_ZONE_STAT_ITEMS in the IPI context
> shouldn't be excessive but I haven't measured that so I might be easily
> wrong.
>
> Anyway, that should work more reliably than the current scheme and
> should help to reduce pointless wakeups which the original patchset was
> addressing.  Or am I missing something?

Maybe to further reduce wakeups, a CPU could check and update its 
counters before going idle? (unless that already happens)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-21 14:39             ` Michal Hocko
  2015-01-22 15:16               ` Vlastimil Babka
@ 2015-01-22 16:11               ` Christoph Lameter
  2015-01-26 17:46                 ` Michal Hocko
  1 sibling, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2015-01-22 16:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vinayak Menon, linux-mm, linux-kernel, akpm, hannes, vdavydov,
	mgorman, minchan

On Wed, 21 Jan 2015, Michal Hocko wrote:

> I think we can solve this as well. We can stick vmstat_shepherd into a
> kernel thread with a loop with the configured timeout and then create a
> mask of CPUs which need the update and run vmstat_update from
> IPI context (smp_call_function_many).

Please do not run the vmstat_updates concurrently. They update shared
cachelines and therefore can cause bouncing cachelines if run concurrently
on multiple cpus.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-17 19:48         ` Christoph Lameter
  2015-01-19  4:27           ` Vinayak Menon
@ 2015-01-26 17:28           ` Michal Hocko
  2015-01-26 18:35             ` Christoph Lameter
                               ` (2 more replies)
  1 sibling, 3 replies; 30+ messages in thread
From: Michal Hocko @ 2015-01-26 17:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Vinayak Menon, linux-mm, linux-kernel, akpm, hannes, vdavydov,
	mgorman, minchan

On Sat 17-01-15 13:48:34, Christoph Lameter wrote:
> On Sat, 17 Jan 2015, Vinayak Menon wrote:
> 
> > which had not updated the vmstat_diff. This CPU was in idle for around 30
> > secs. When I looked at the tvec base for this CPU, the timer associated with
> > vmstat_update had its expiry time less than current jiffies. This timer had
> > its deferrable flag set, and was tied to the next non-deferrable timer in the
> 
> We can remove the deferrable flag now since the vmstat threads are only
> activated as necessary with the recent changes. Looks like this could fix
> your issue?

OK, I have checked the history and the deferrable behavior has been
introduced by 39bf6270f524 (VM statistics: Make timer deferrable) which
hasn't offered any numbers which would justify the change. So I think it
would be a good idea to revert this one as it can clearly cause issues.

Could you retest with this change? It still wouldn't help with the
highly overloaded workqueues, but that sounds like a bigger change and
this one seems quite safe to me, so it is a good start.
---
From 12d00a8066e336d3e1311600b50fa9b588798448 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Mon, 26 Jan 2015 18:07:51 +0100
Subject: [PATCH] vmstat: Do not use deferrable delayed work for vmstat_update

Vinayak Menon has reported that an excessive number of tasks were
throttled in the direct reclaim inside too_many_isolated because
NR_ISOLATED_FILE was relatively high compared to NR_INACTIVE_FILE.
However it turned out that the real number of NR_ISOLATED_FILE was 0 and
the per-cpu vm_stat_diff wasn't transferred into the global counter.

vmstat_work, which is responsible for the sync, is defined as deferrable
delayed work, which means that the defined timeout doesn't wake up an
idle CPU. A CPU might stay idle for a long time, and the general effort
is to keep it in this state as long as possible, which might lead to all
sorts of trouble for vmstat consumers, as can be seen with the excessive
direct reclaim throttling.

This patch basically reverts 39bf6270f524 (VM statistics: Make timer
deferrable) but it shouldn't cause any problems for idle CPUs because
only CPUs with an active per-cpu drift are woken up since 7cc36bbddde5
(vmstat: on-demand vmstat workers v8) and CPUs which are idle for a
longer time shouldn't have per-cpu drift.

Fixes: 39bf6270f524 (VM statistics: Make timer deferrable)
Reported-and-debugged-by: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/vmstat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index c95d6b39ac91..b9b9deec1d54 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1453,7 +1453,7 @@ static void __init start_shepherd_timer(void)
 	int cpu;
 
 	for_each_possible_cpu(cpu)
-		INIT_DEFERRABLE_WORK(per_cpu_ptr(&vmstat_work, cpu),
+		INIT_DELAYED_WORK(per_cpu_ptr(&vmstat_work, cpu),
 			vmstat_update);
 
 	if (!alloc_cpumask_var(&cpu_stat_off, GFP_KERNEL))
-- 
2.1.4
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-22 16:11               ` Christoph Lameter
@ 2015-01-26 17:46                 ` Michal Hocko
  2015-01-26 18:35                   ` Christoph Lameter
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2015-01-26 17:46 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Vinayak Menon, linux-mm, linux-kernel, akpm, hannes, vdavydov,
	mgorman, minchan

On Thu 22-01-15 10:11:31, Christoph Lameter wrote:
> On Wed, 21 Jan 2015, Michal Hocko wrote:
> 
> > I think we can solve this as well. We can stick vmstat_shepherd into a
> > kernel thread with a loop with the configured timeout and then create a
> > mask of CPUs which need the update and run vmstat_update from
> > IPI context (smp_call_function_many).
> 
> Please do not run the vmstat_updates concurrently. They update shared
> cachelines and therefore can cause bouncing cachelines if run concurrently
> on multiple cpus.

Would you prefer to call smp_call_function_single on each CPU
which needs an update? That would make vmstat_shepherd slower but that
is not a big deal, is it?

Anyway I am wondering whether the cache line bouncing between
vmstat_update instances is a big deal in real life. Updating shared
counters would bounce with many CPUs but this is an operation which is
not done often. Also, all the CPUs would have to update the same counters
all the time and I am not sure this happens that often. Do you have a
load where this would be measurable?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-26 17:46                 ` Michal Hocko
@ 2015-01-26 18:35                   ` Christoph Lameter
  2015-01-27 10:52                     ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2015-01-26 18:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vinayak Menon, linux-mm, linux-kernel, akpm, hannes, vdavydov,
	mgorman, minchan

On Mon, 26 Jan 2015, Michal Hocko wrote:

> > Please do not run the vmstat_updates concurrently. They update shared
> > cachelines and therefore can cause bouncing cachelines if run concurrently
> > on multiple cpus.
>
> Would you prefer to call smp_call_function_single on each CPU
> which needs an update? That would make vmstat_shepherd slower but that
> is not a big deal, is it?

Run it from the timer interrupt as usual from a work request? Those are
staggered.

> Anyway I am wondering whether the cache line bouncing between
> vmstat_update instances is a big deal in the real life. Updating shared
> counters would bounce with many CPUs but this is an operation which is
> not done often. Also all the CPUs would have to update the same counters
> all the time and I am not sure this happens that often. Do you have a
> load where this would be measurable?

Concurrent page faults update lots of counters concurrently. But will
those trigger the smp_call_function?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-26 17:28           ` Michal Hocko
@ 2015-01-26 18:35             ` Christoph Lameter
  2015-01-26 22:11             ` Andrew Morton
  2015-01-27 10:33             ` Vinayak Menon
  2 siblings, 0 replies; 30+ messages in thread
From: Christoph Lameter @ 2015-01-26 18:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vinayak Menon, linux-mm, linux-kernel, akpm, hannes, vdavydov,
	mgorman, minchan

On Mon, 26 Jan 2015, Michal Hocko wrote:
> From 12d00a8066e336d3e1311600b50fa9b588798448 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Mon, 26 Jan 2015 18:07:51 +0100
> Subject: [PATCH] vmstat: Do not use deferrable delayed work for vmstat_update

Acked-by: Christoph Lameter <cl@linux.com>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-26 17:28           ` Michal Hocko
  2015-01-26 18:35             ` Christoph Lameter
@ 2015-01-26 22:11             ` Andrew Morton
  2015-01-27 10:41               ` Michal Hocko
  2015-01-27 10:33             ` Vinayak Menon
  2 siblings, 1 reply; 30+ messages in thread
From: Andrew Morton @ 2015-01-26 22:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Vinayak Menon, linux-mm, linux-kernel, hannes,
	vdavydov, mgorman, minchan

On Mon, 26 Jan 2015 18:28:32 +0100 Michal Hocko <mhocko@suse.cz> wrote:

> Subject: [PATCH] vmstat: Do not use deferrable delayed work for vmstat_update
> 
> Vinayak Menon has reported that excessive number of tasks was throttled
> in the direct reclaim inside too_many_isolated because NR_ISOLATED_FILE
> was relatively high compared to NR_INACTIVE_FILE. However it turned out
> that the real number of NR_ISOLATED_FILE was 0 and the per-cpu
> vm_stat_diff wasn't transferred into the global counter.
> 
> vmstat_work which is responsible for the sync is defined as deferrable
> delayed work which means that the defined timeout doesn't wake up an
> idle CPU. A CPU might stay in an idle state for a long time and general
> effort is to keep such a CPU in this state as long as possible which
> might lead to all sorts of troubles for vmstat consumers as can be seen
> with the excessive direct reclaim throttling.
> 
> This patch basically reverts 39bf6270f524 (VM statistics: Make timer
> deferrable) but it shouldn't cause any problems for idle CPUs because
> only CPUs with an active per-cpu drift are woken up since 7cc36bbddde5
> (vmstat: on-demand vmstat workers v8) and CPUs which are idle for a
> longer time shouldn't have per-cpu drift.

So do we drop
mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated.patch?


From: Vinayak Menon <vinmenon@codeaurora.org>
Subject: mm: vmscan: fix the page state calculation in too_many_isolated

It is observed that sometimes multiple tasks get blocked for long in the
congestion_wait loop below, in shrink_inactive_list.  This is because of
vm_stat values not being synced.

(__schedule) from [<c0a03328>]
(schedule_timeout) from [<c0a04940>]
(io_schedule_timeout) from [<c01d585c>]
(congestion_wait) from [<c01cc9d8>]
(shrink_inactive_list) from [<c01cd034>]
(shrink_zone) from [<c01cdd08>]
(try_to_free_pages) from [<c01c442c>]
(__alloc_pages_nodemask) from [<c01f1884>]
(new_slab) from [<c09fcf60>]
(__slab_alloc) from [<c01f1a6c>]

In one such instance, zone_page_state(zone, NR_ISOLATED_FILE) had returned
14, zone_page_state(zone, NR_INACTIVE_FILE) returned 92, and GFP_IOFS was
set, and this resulted in too_many_isolated returning true.  But one of
the CPU's pageset vm_stat_diff had NR_ISOLATED_FILE as "-14".  So the
actual isolated count was zero.  As there weren't any more updates to
NR_ISOLATED_FILE and vmstat_update deferred work had not been scheduled
yet, 7 tasks were spinning in the congestion wait loop for around 4
seconds, in the direct reclaim path.

This patch uses zone_page_state_snapshot instead, but restricts its usage
to avoid performance penalty.


The vmstat sync interval is HZ (sysctl_stat_interval), but since the
vmstat_work is declared as deferrable work, the timer trigger can be
deferred to the next non-deferrable timer expiry on a CPU which is
idle.  This results in the vmstat syncing on an idle CPU being delayed
by seconds.  Maybe in most cases this behavior is fine, except in cases
like this.

Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |   56 +++++++++++++++++++++++++++++++++-----------------
 1 file changed, 37 insertions(+), 19 deletions(-)

diff -puN mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated mm/vmscan.c
--- a/mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated
+++ a/mm/vmscan.c
@@ -1401,6 +1401,32 @@ int isolate_lru_page(struct page *page)
 	return ret;
 }
 
+static int __too_many_isolated(struct zone *zone, int file,
+	struct scan_control *sc, int safe)
+{
+	unsigned long inactive, isolated;
+
+	if (safe) {
+		inactive = zone_page_state_snapshot(zone,
+				NR_INACTIVE_ANON + 2 * file);
+		isolated = zone_page_state_snapshot(zone,
+				NR_ISOLATED_ANON + file);
+	} else {
+		inactive = zone_page_state(zone, NR_INACTIVE_ANON + 2 * file);
+		isolated = zone_page_state(zone, NR_ISOLATED_ANON + file);
+	}
+
+	/*
+	 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they
+	 * won't get blocked by normal direct-reclaimers, forming a circular
+	 * deadlock.
+	 */
+	if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
+		inactive >>= 3;
+
+	return isolated > inactive;
+}
+
 /*
  * A direct reclaimer may isolate SWAP_CLUSTER_MAX pages from the LRU list and
  * then get resheduled. When there are massive number of tasks doing page
@@ -1409,33 +1435,22 @@ int isolate_lru_page(struct page *page)
  * unnecessary swapping, thrashing and OOM.
  */
 static int too_many_isolated(struct zone *zone, int file,
-		struct scan_control *sc)
+		struct scan_control *sc, int safe)
 {
-	unsigned long inactive, isolated;
-
 	if (current_is_kswapd())
 		return 0;
 
 	if (!global_reclaim(sc))
 		return 0;
 
-	if (file) {
-		inactive = zone_page_state(zone, NR_INACTIVE_FILE);
-		isolated = zone_page_state(zone, NR_ISOLATED_FILE);
-	} else {
-		inactive = zone_page_state(zone, NR_INACTIVE_ANON);
-		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
+	if (unlikely(__too_many_isolated(zone, file, sc, 0))) {
+		if (safe)
+			return __too_many_isolated(zone, file, sc, safe);
+		else
+			return 1;
 	}
 
-	/*
-	 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they
-	 * won't get blocked by normal direct-reclaimers, forming a circular
-	 * deadlock.
-	 */
-	if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
-		inactive >>= 3;
-
-	return isolated > inactive;
+	return 0;
 }
 
 static noinline_for_stack void
@@ -1525,15 +1540,18 @@ shrink_inactive_list(unsigned long nr_to
 	unsigned long nr_immediate = 0;
 	isolate_mode_t isolate_mode = 0;
 	int file = is_file_lru(lru);
+	int safe = 0;
 	struct zone *zone = lruvec_zone(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
-	while (unlikely(too_many_isolated(zone, file, sc))) {
+	while (unlikely(too_many_isolated(zone, file, sc, safe))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/* We are about to die and free our memory. Return now. */
 		if (fatal_signal_pending(current))
 			return SWAP_CLUSTER_MAX;
+
+		safe = 1;
 	}
 
 	lru_add_drain();
_


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-26 17:28           ` Michal Hocko
  2015-01-26 18:35             ` Christoph Lameter
  2015-01-26 22:11             ` Andrew Morton
@ 2015-01-27 10:33             ` Vinayak Menon
  2015-01-27 10:45               ` Michal Hocko
  2 siblings, 1 reply; 30+ messages in thread
From: Vinayak Menon @ 2015-01-27 10:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, linux-mm, linux-kernel, akpm, hannes,
	vdavydov, mgorman, minchan

On 01/26/2015 10:58 PM, Michal Hocko wrote:
> On Sat 17-01-15 13:48:34, Christoph Lameter wrote:
>> On Sat, 17 Jan 2015, Vinayak Menon wrote:
>>
>>> which had not updated the vmstat_diff. This CPU was in idle for around 30
>>> secs. When I looked at the tvec base for this CPU, the timer associated with
>>> vmstat_update had its expiry time less than current jiffies. This timer had
>>> its deferrable flag set, and was tied to the next non-deferrable timer in the
>>
>> We can remove the deferrable flag now since the vmstat threads are only
>> activated as necessary with the recent changes. Looks like this could fix
>> your issue?
>
> OK, I have checked the history and the deferrable behavior has been
> introduced by 39bf6270f524 (VM statistics: Make timer deferrable) which
> hasn't offered any numbers which would justify the change. So I think it
> would be a good idea to revert this one as it can clearly cause issues.
>
> Could you retest with this change? It still wouldn't help with the
> highly overloaded workqueues but that sounds like a bigger change and
> this one sounds like quite safe to me so it is a good start.

Sure, I can retest.
Even without highly overloaded workqueues, there can be a delay of HZ in 
updating the counters. This means the reclaim path can be blocked for a 
second or more when there aren't really any isolated pages. So we need 
the fix in too_many_isolated as well, right?


-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-26 22:11             ` Andrew Morton
@ 2015-01-27 10:41               ` Michal Hocko
  0 siblings, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2015-01-27 10:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Vinayak Menon, linux-mm, linux-kernel, hannes,
	vdavydov, mgorman, minchan

On Mon 26-01-15 14:11:59, Andrew Morton wrote:
[...]
> So do we drop
> mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated.patch?

I would like to see a confirmation from Vinayak that this really helped
in his test case first and only then drop the above patch. I really
think that we shouldn't spread hacks around the code just to work around
inefficiency in the vmstat code. We already have two places which need
special treatment and who knows how many more will show up later.

Even with this patch applied we have other issues related to the
overloaded workqueues as mentioned earlier, but those should be
addressed by separate fix(es).
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-27 10:33             ` Vinayak Menon
@ 2015-01-27 10:45               ` Michal Hocko
  0 siblings, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2015-01-27 10:45 UTC (permalink / raw)
  To: Vinayak Menon
  Cc: Christoph Lameter, linux-mm, linux-kernel, akpm, hannes,
	vdavydov, mgorman, minchan

On Tue 27-01-15 16:03:57, Vinayak Menon wrote:
[...]
> Sure, I can retest.

Thanks!

> Even without highly overloaded workqueues, there can be a delay of HZ in
> updating the counters. This means reclaim path can be blocked for a second
> or more, when there aren't really any isolated pages. So we need the fix in
> too_many_isolated also right ?

Is this a big deal though? What you are hitting is certainly a corner
case. I assume your system is thrashing heavily already with so few pages
on the file LRU list.

Anyway, as mentioned in the other email, I would rather see the vmstat
data made more reliable than spread hacks to the code where we see
immediate issues.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-26 18:35                   ` Christoph Lameter
@ 2015-01-27 10:52                     ` Michal Hocko
  2015-01-27 16:59                       ` Christoph Lameter
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2015-01-27 10:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Vinayak Menon, linux-mm, linux-kernel, akpm, hannes, vdavydov,
	mgorman, minchan

On Mon 26-01-15 12:35:00, Christoph Lameter wrote:
> On Mon, 26 Jan 2015, Michal Hocko wrote:
> 
> > > Please do not run the vmstat_updates concurrently. They update shared
> > > cachelines and therefore can cause bouncing cachelines if run concurrently
> > > on multiple cpus.
> >
> > Would you preffer to call smp_call_function_single on each CPU
> > which needs an update? That would make vmstat_shepherd slower but that
> > is not a big deal, is it?
> 
> Run it from the timer interrupt as usual from a work request? Those are
> staggered.

I am not following. The idea was to run vmstat_shepherd in a kernel
thread that wakes up per the defined timeout, then checks need_update
for each CPU and calls smp_call_function_single to run the refresh,
rather than building a mask and calling smp_call_function_many, to
reduce parallel contention on the shared counters.

> > Anyway I am wondering whether the cache line bouncing between
> > vmstat_update instances is a big deal in the real life. Updating shared
> > counters whould bounce with many CPUs but this is an operation which is
> > not done often. Also all the CPUs would have update the same counters
> > all the time and I am not sure this happens that often. Do you have a
> > load where this would be measurable?
> 
> Concurrent page faults update lots of counters concurrently.

True

> But will those trigger the smp_call_function?

The smp_call_function was meant to be called only from the
vmstat_shepherd context which does happen "rarely". Or am I missing your
point here?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-27 10:52                     ` Michal Hocko
@ 2015-01-27 16:59                       ` Christoph Lameter
  2015-01-30 15:28                         ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2015-01-27 16:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vinayak Menon, linux-mm, linux-kernel, akpm, hannes, vdavydov,
	mgorman, minchan

On Tue, 27 Jan 2015, Michal Hocko wrote:

> I am not following. The idea was to run vmstat_shepherd in a kernel
> thread and waking up as per defined timeout and then check need_update
> for each CPU and call smp_call_function_single to refresh the timer
> rather than building a mask and then calling smp_call_function_many to
> reduce parallel contention on the shared counters.

Thats ok.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-16 15:49     ` Michal Hocko
  2015-01-16 17:57       ` Michal Hocko
  2015-01-17 15:18       ` Vinayak Menon
@ 2015-01-29 17:32       ` Christoph Lameter
  2015-01-30 15:27         ` Michal Hocko
  2 siblings, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2015-01-29 17:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vinayak Menon, linux-mm, linux-kernel, akpm, hannes, vdavydov,
	mgorman, minchan

On Fri, 16 Jan 2015, Michal Hocko wrote:

> __round_jiffies_relative can easily make timeout 2HZ from 1HZ. Now we
> have vmstat_shepherd which waits to be queued and then waits to run. When
> it runs finally it only queues per-cpu vmstat_work which can also end
> up being 2HZ for some CPUs. So we can indeed have 4 seconds spent just
> for queuing. Not even mentioning work item latencies. Especially when
> workers are overloaded e.g. by fs work items and additional workers
> cannot be created e.g. due to memory pressure so they are processed only
> by the workqueue rescuer. And latencies would grow a lot.

Here is a small fix to ensure that the 4 seconds interval does not happen:




Subject: vmstat: Reduce time interval to stat update on idle cpu

It was noted that the vm stat shepherd runs every 2 seconds and
that the vmstat update is then scheduled 2 seconds in the future.

This yields an effective interval of double the intended one, which is
not desired.

Change the shepherd so that it does not delay the vmstat update
on the other cpu. We still have to use schedule_delayed_work since
we are using a delayed_work_struct, but we can set the delay to 0.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c
+++ linux/mm/vmstat.c
@@ -1435,8 +1435,8 @@ static void vmstat_shepherd(struct work_
 		if (need_update(cpu) &&
 			cpumask_test_and_clear_cpu(cpu, cpu_stat_off))

-			schedule_delayed_work_on(cpu, &per_cpu(vmstat_work, cpu),
-				__round_jiffies_relative(sysctl_stat_interval, cpu));
+			schedule_delayed_work_on(cpu,
+				&per_cpu(vmstat_work, cpu), 0);

 	put_online_cpus();


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-29 17:32       ` Christoph Lameter
@ 2015-01-30 15:27         ` Michal Hocko
  0 siblings, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2015-01-30 15:27 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Vinayak Menon, linux-mm, linux-kernel, akpm, hannes, vdavydov,
	mgorman, minchan

On Thu 29-01-15 11:32:43, Christoph Lameter wrote:
[...]
> Subject: vmstat: Reduce time interval to stat update on idle cpu
> 
> It was noted that the vm stat shepherd runs every 2 seconds and
> that the vmstat update is then scheduled 2 seconds in the future.
> 
> This yields an effective interval of double the intended one, which is
> not desired.
> 
> Change the shepherd so that it does not delay the vmstat update
> on the other cpu. We still have to use schedule_delayed_work since
> we are using a delayed_work_struct, but we can set the delay to 0.
>
> 
> Signed-off-by: Christoph Lameter <cl@linux.com>

Acked-by: Michal Hocko <mhocko@suse.cz>

> 
> Index: linux/mm/vmstat.c
> ===================================================================
> --- linux.orig/mm/vmstat.c
> +++ linux/mm/vmstat.c
> @@ -1435,8 +1435,8 @@ static void vmstat_shepherd(struct work_
>  		if (need_update(cpu) &&
>  			cpumask_test_and_clear_cpu(cpu, cpu_stat_off))
> 
> -			schedule_delayed_work_on(cpu, &per_cpu(vmstat_work, cpu),
> -				__round_jiffies_relative(sysctl_stat_interval, cpu));
> +			schedule_delayed_work_on(cpu,
> +				&per_cpu(vmstat_work, cpu), 0);
> 
>  	put_online_cpus();
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-27 16:59                       ` Christoph Lameter
@ 2015-01-30 15:28                         ` Michal Hocko
  0 siblings, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2015-01-30 15:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Vinayak Menon, linux-mm, linux-kernel, akpm, hannes, vdavydov,
	mgorman, minchan

On Tue 27-01-15 10:59:59, Christoph Lameter wrote:
> On Tue, 27 Jan 2015, Michal Hocko wrote:
> 
> > I am not following. The idea was to run vmstat_shepherd in a kernel
> > thread, waking up at a defined timeout, and then check need_update
> > for each CPU and call smp_call_function_single to refresh the timer
> > rather than building a mask and then calling smp_call_function_many, to
> > reduce parallel contention on the shared counters.
> 
> That's ok.

OK, I will put that on my todo list and try to find some time to
implement it.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-17 16:29   ` Vinayak Menon
@ 2015-02-11 22:14     ` Andrew Morton
  2015-02-12 16:19       ` Vlastimil Babka
  0 siblings, 1 reply; 30+ messages in thread
From: Andrew Morton @ 2015-02-11 22:14 UTC (permalink / raw)
  To: Vinayak Menon
  Cc: linux-mm, linux-kernel, hannes, vdavydov, mhocko, mgorman, minchan


Did we end up deciding to merge this, or is
http://ozlabs.org/~akpm/mmots/broken-out/vmstat-do-not-use-deferrable-delayed-work-for-vmstat_update.patch
a sufficient fix?

From: Vinayak Menon <vinmenon@codeaurora.org>
Subject: mm: vmscan: fix the page state calculation in too_many_isolated

It is observed that sometimes multiple tasks get blocked for a long time
in the congestion_wait loop below, in shrink_inactive_list.  This is
because the vm_stat values are not synced.

(__schedule) from [<c0a03328>]
(schedule_timeout) from [<c0a04940>]
(io_schedule_timeout) from [<c01d585c>]
(congestion_wait) from [<c01cc9d8>]
(shrink_inactive_list) from [<c01cd034>]
(shrink_zone) from [<c01cdd08>]
(try_to_free_pages) from [<c01c442c>]
(__alloc_pages_nodemask) from [<c01f1884>]
(new_slab) from [<c09fcf60>]
(__slab_alloc) from [<c01f1a6c>]

In one such instance, zone_page_state(zone, NR_ISOLATED_FILE) had returned
14, zone_page_state(zone, NR_INACTIVE_FILE) returned 92, and GFP_IOFS was
set, and this resulted in too_many_isolated returning true.  But one of
the CPU's pageset vm_stat_diff had NR_ISOLATED_FILE as "-14".  So the
actual isolated count was zero.  As there weren't any more updates to
NR_ISOLATED_FILE and vmstat_update deferred work had not been scheduled
yet, 7 tasks were spinning in the congestion wait loop for around 4
seconds, in the direct reclaim path.

This patch uses zone_page_state_snapshot instead, but restricts its usage
to avoid performance penalty.


The vmstat sync interval is HZ (sysctl_stat_interval), but since the
vmstat_work is declared as a deferrable work, the timer trigger can be
deferred to the next non-deferrable timer expiry on a CPU which is
idle.  This results in the vmstat syncing on an idle CPU being delayed
by seconds.  In most cases this behavior may be fine, but not in cases
like this.

[akpm@linux-foundation.org: move zone_page_state_snapshot() fallback logic into too_many_isolated()]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |   51 +++++++++++++++++++++++++++++++++-----------------
 1 file changed, 34 insertions(+), 17 deletions(-)

diff -puN mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated mm/vmscan.c
--- a/mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated
+++ a/mm/vmscan.c
@@ -1363,6 +1363,32 @@ int isolate_lru_page(struct page *page)
 	return ret;
 }
 
+static int __too_many_isolated(struct zone *zone, int file,
+			       struct scan_control *sc, int safe)
+{
+	unsigned long inactive, isolated;
+
+	if (safe) {
+		inactive = zone_page_state_snapshot(zone,
+				NR_INACTIVE_ANON + 2 * file);
+		isolated = zone_page_state_snapshot(zone,
+				NR_ISOLATED_ANON + file);
+	} else {
+		inactive = zone_page_state(zone, NR_INACTIVE_ANON + 2 * file);
+		isolated = zone_page_state(zone, NR_ISOLATED_ANON + file);
+	}
+
+	/*
+	 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they
+	 * won't get blocked by normal direct-reclaimers, forming a circular
+	 * deadlock.
+	 */
+	if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
+		inactive >>= 3;
+
+	return isolated > inactive;
+}
+
 /*
  * A direct reclaimer may isolate SWAP_CLUSTER_MAX pages from the LRU list and
  * then get resheduled. When there are massive number of tasks doing page
@@ -1371,33 +1397,24 @@ int isolate_lru_page(struct page *page)
  * unnecessary swapping, thrashing and OOM.
  */
 static int too_many_isolated(struct zone *zone, int file,
-		struct scan_control *sc)
+			     struct scan_control *sc)
 {
-	unsigned long inactive, isolated;
-
 	if (current_is_kswapd())
 		return 0;
 
 	if (!global_reclaim(sc))
 		return 0;
 
-	if (file) {
-		inactive = zone_page_state(zone, NR_INACTIVE_FILE);
-		isolated = zone_page_state(zone, NR_ISOLATED_FILE);
-	} else {
-		inactive = zone_page_state(zone, NR_INACTIVE_ANON);
-		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
-	}
-
 	/*
-	 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they
-	 * won't get blocked by normal direct-reclaimers, forming a circular
-	 * deadlock.
+	 * __too_many_isolated(safe=0) is fast but inaccurate, because it
+	 * doesn't account for the vm_stat_diff[] counters.  So if it looks
+	 * like too_many_isolated() is about to return true, fall back to the
+	 * slower, more accurate zone_page_state_snapshot().
 	 */
-	if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
-		inactive >>= 3;
+	if (unlikely(__too_many_isolated(zone, file, sc, 0)))
+		return __too_many_isolated(zone, file, sc, 1);
 
-	return isolated > inactive;
+	return 0;
 }
 
 static noinline_for_stack void
_


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-02-11 22:14     ` Andrew Morton
@ 2015-02-12 16:19       ` Vlastimil Babka
  0 siblings, 0 replies; 30+ messages in thread
From: Vlastimil Babka @ 2015-02-12 16:19 UTC (permalink / raw)
  To: Andrew Morton, Vinayak Menon
  Cc: linux-mm, linux-kernel, hannes, vdavydov, mhocko, mgorman, minchan

On 02/11/2015 11:14 PM, Andrew Morton wrote:
> 
> Did we end up deciding to merge this, or is
> http://ozlabs.org/~akpm/mmots/broken-out/vmstat-do-not-use-deferrable-delayed-work-for-vmstat_update.patch
> a sufficient fix?

I think Michal wanted the general vmstat worker fix from elsewhere in
the thread tested, to see if it solves the problem by itself, without
this patch.

> From: Vinayak Menon <vinmenon@codeaurora.org>
> Subject: mm: vmscan: fix the page state calculation in too_many_isolated
> 
> It is observed that sometimes multiple tasks get blocked for a long time
> in the congestion_wait loop below, in shrink_inactive_list.  This is
> because the vm_stat values are not synced.
> 
> (__schedule) from [<c0a03328>]
> (schedule_timeout) from [<c0a04940>]
> (io_schedule_timeout) from [<c01d585c>]
> (congestion_wait) from [<c01cc9d8>]
> (shrink_inactive_list) from [<c01cd034>]
> (shrink_zone) from [<c01cdd08>]
> (try_to_free_pages) from [<c01c442c>]
> (__alloc_pages_nodemask) from [<c01f1884>]
> (new_slab) from [<c09fcf60>]
> (__slab_alloc) from [<c01f1a6c>]
> 
> In one such instance, zone_page_state(zone, NR_ISOLATED_FILE) had returned
> 14, zone_page_state(zone, NR_INACTIVE_FILE) returned 92, and GFP_IOFS was
> set, and this resulted in too_many_isolated returning true.  But one of
> the CPU's pageset vm_stat_diff had NR_ISOLATED_FILE as "-14".  So the
> actual isolated count was zero.  As there weren't any more updates to
> NR_ISOLATED_FILE and vmstat_update deferred work had not been scheduled
> yet, 7 tasks were spinning in the congestion wait loop for around 4
> seconds, in the direct reclaim path.
> 
> This patch uses zone_page_state_snapshot instead, but restricts its usage
> to avoid performance penalty.
> 
> 
> The vmstat sync interval is HZ (sysctl_stat_interval), but since the
> vmstat_work is declared as a deferrable work, the timer trigger can be
> deferred to the next non-deferrable timer expiry on a CPU which is
> idle.  This results in the vmstat syncing on an idle CPU being delayed
> by seconds.  In most cases this behavior may be fine, but not in cases
> like this.
> 
> [akpm@linux-foundation.org: move zone_page_state_snapshot() fallback logic into too_many_isolated()]
> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Vladimir Davydov <vdavydov@parallels.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  mm/vmscan.c |   51 +++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 34 insertions(+), 17 deletions(-)
> 
> diff -puN mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated mm/vmscan.c
> --- a/mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated
> +++ a/mm/vmscan.c
> @@ -1363,6 +1363,32 @@ int isolate_lru_page(struct page *page)
>  	return ret;
>  }
>  
> +static int __too_many_isolated(struct zone *zone, int file,
> +			       struct scan_control *sc, int safe)
> +{
> +	unsigned long inactive, isolated;
> +
> +	if (safe) {
> +		inactive = zone_page_state_snapshot(zone,
> +				NR_INACTIVE_ANON + 2 * file);
> +		isolated = zone_page_state_snapshot(zone,
> +				NR_ISOLATED_ANON + file);
> +	} else {
> +		inactive = zone_page_state(zone, NR_INACTIVE_ANON + 2 * file);
> +		isolated = zone_page_state(zone, NR_ISOLATED_ANON + file);
> +	}
> +
> +	/*
> +	 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they
> +	 * won't get blocked by normal direct-reclaimers, forming a circular
> +	 * deadlock.
> +	 */
> +	if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
> +		inactive >>= 3;
> +
> +	return isolated > inactive;
> +}
> +
>  /*
>   * A direct reclaimer may isolate SWAP_CLUSTER_MAX pages from the LRU list and
>   * then get resheduled. When there are massive number of tasks doing page
> @@ -1371,33 +1397,24 @@ int isolate_lru_page(struct page *page)
>   * unnecessary swapping, thrashing and OOM.
>   */
>  static int too_many_isolated(struct zone *zone, int file,
> -		struct scan_control *sc)
> +			     struct scan_control *sc)
>  {
> -	unsigned long inactive, isolated;
> -
>  	if (current_is_kswapd())
>  		return 0;
>  
>  	if (!global_reclaim(sc))
>  		return 0;
>  
> -	if (file) {
> -		inactive = zone_page_state(zone, NR_INACTIVE_FILE);
> -		isolated = zone_page_state(zone, NR_ISOLATED_FILE);
> -	} else {
> -		inactive = zone_page_state(zone, NR_INACTIVE_ANON);
> -		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
> -	}
> -
>  	/*
> -	 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they
> -	 * won't get blocked by normal direct-reclaimers, forming a circular
> -	 * deadlock.
> +	 * __too_many_isolated(safe=0) is fast but inaccurate, because it
> +	 * doesn't account for the vm_stat_diff[] counters.  So if it looks
> +	 * like too_many_isolated() is about to return true, fall back to the
> +	 * slower, more accurate zone_page_state_snapshot().
>  	 */
> -	if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
> -		inactive >>= 3;
> +	if (unlikely(__too_many_isolated(zone, file, sc, 0)))
> +		return __too_many_isolated(zone, file, sc, 1);
>  
> -	return isolated > inactive;
> +	return 0;
>  }
>  
>  static noinline_for_stack void
> _
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2015-02-12 16:19 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-14 11:36 [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated Vinayak Menon
2015-01-14 16:50 ` Michal Hocko
2015-01-15 17:24   ` Vinayak Menon
2015-01-16 15:49     ` Michal Hocko
2015-01-16 17:57       ` Michal Hocko
2015-01-16 19:17         ` Christoph Lameter
2015-01-17 15:18       ` Vinayak Menon
2015-01-17 19:48         ` Christoph Lameter
2015-01-19  4:27           ` Vinayak Menon
2015-01-21 14:39             ` Michal Hocko
2015-01-22 15:16               ` Vlastimil Babka
2015-01-22 16:11               ` Christoph Lameter
2015-01-26 17:46                 ` Michal Hocko
2015-01-26 18:35                   ` Christoph Lameter
2015-01-27 10:52                     ` Michal Hocko
2015-01-27 16:59                       ` Christoph Lameter
2015-01-30 15:28                         ` Michal Hocko
2015-01-26 17:28           ` Michal Hocko
2015-01-26 18:35             ` Christoph Lameter
2015-01-26 22:11             ` Andrew Morton
2015-01-27 10:41               ` Michal Hocko
2015-01-27 10:33             ` Vinayak Menon
2015-01-27 10:45               ` Michal Hocko
2015-01-29 17:32       ` Christoph Lameter
2015-01-30 15:27         ` Michal Hocko
2015-01-16  1:17 ` Andrew Morton
2015-01-16  5:10   ` Vinayak Menon
2015-01-17 16:29   ` Vinayak Menon
2015-02-11 22:14     ` Andrew Morton
2015-02-12 16:19       ` Vlastimil Babka
