LKML Archive on lore.kernel.org
* [RFC PATCH 0/5] Reduce filesystem writeback from page reclaim (again)
@ 2011-07-13 14:31 Mel Gorman
  2011-07-13 14:31 ` [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim Mel Gorman
                   ` (6 more replies)
  0 siblings, 7 replies; 38+ messages in thread
From: Mel Gorman @ 2011-07-13 14:31 UTC (permalink / raw)
  To: Linux-MM
  Cc: LKML, XFS, Dave Chinner, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim, Mel Gorman

(Revisiting this from a year ago and following on from the thread
"Re: [PATCH 03/27] xfs: use write_cache_pages for writeback
clustering". Posting a prototype to see if anything obvious is
being missed)

Testing from the XFS folk revealed that there is still too much
I/O from the end of the LRU in kswapd. Previously it was considered
acceptable by VM people for a small number of pages to be written
back from reclaim with testing generally showing about 0.3% of pages
reclaimed were written back (higher if memory was really low). The
claim that writing back a small number of pages is acceptable has been
heavily disputed for quite some time, and Dave Chinner explained it well;

	It doesn't have to be a very high number to be a problem. IO
	is orders of magnitude slower than the CPU time it takes to
	flush a page, so the cost of making a bad flush decision is
	very high. And single page writeback from the LRU is almost
	always a bad flush decision.

To complicate matters, filesystems respond very differently to requests
from reclaim according to Christoph Hellwig

	xfs tries to write it back if the requester is kswapd
	ext4 ignores the request if it's a delayed allocation
	btrfs ignores the request entirely

I think ext3 just writes back the page but I didn't double check.
Either way, each filesystem will have different performance
characteristics when under memory pressure and there are a lot of
dirty pages.

The objective of this series is for memory reclaim to play nicely
with writeback that is already in progress and to throttle reclaimers
appropriately when dirty pages are encountered. The assumption is that
the flushers will always write pages faster than if reclaim issues
the IO. The problem is that reclaim has very little control over how
long it will be before a page in a particular zone or container is
cleaned. This is a serious problem, but as the behaviour of ->writepage
is filesystem-dependent, we are already faced with a situation where
reclaim has poor control over page cleaning.

A secondary goal is to avoid the problem whereby direct reclaim
splices two potentially deep call stacks together.

Patch 1 disables writeback of filesystem pages from direct reclaim
	entirely. Anonymous pages are still written.

Patch 2 disables writeback of filesystem pages from kswapd unless
	the priority is raised to the point where kswapd is considered
	to be in trouble.

Patch 3 throttles reclaimers if too many dirty pages are being
	encountered and the zones or backing devices are congested.

Patch 4 invalidates dirty pages found at the end of the LRU so they
	are reclaimed quickly after being written back rather than
	waiting for a reclaimer to find them.

Patch 5 tries to prioritise inodes backing dirty pages found at the end
	of the LRU.

This is a prototype only and it's probable that I forgot or omitted
some issue brought up over the last year and a bit. I have not thought
about how this affects memcg and I have some concerns about patches
4 and 5. Patch 4 may reclaim too many pages as a reclaimer will skip
the dirty page, reclaim a clean page and later the dirty page gets
reclaimed anyway when writeback completes. I don't think it matters
but it's worth thinking about. Patch 5 is potentially a problem
because move_expired_inodes() is now walking the full delayed_queue
list. Is that a problem? I also have not double checked that it's safe
to add I_DIRTY_RECLAIM or that the locking is correct. Basically,
patch 5 is a quick hack to see if it's worthwhile and may be rendered
unnecessary by Wu Fengguang or Jan Kara.

I consider this series to be orthogonal to the writeback work going
on at the moment so shout if that assumption is in error.

I tested this on ext3, ext4, btrfs and xfs using fs_mark and a micro
benchmark that does a streaming write to a large mapping (exercises
use-once LRU logic). The command line for fs_mark looked something like

./fs_mark  -d  /tmp/fsmark-2676  -D  100  -N  150  -n  150  -L  25  -t  1  -S0  -s  10485760

The machine was booted with "nr_cpus=1 mem=512M" as according to Dave
this triggers the worst behaviour.

6 kernels are tested.

vanilla	3.0-rc6
nodirectwb-v1r3		patch 1
lesskswapdwb-v1r3p	patches 1-2
throttle-v1r10		patches 1-3
immediate-v1r10		patches 1-4
prioinode-v1r10		patches 1-5

During testing, a number of monitors were running to gather information
from ftrace in particular. This disrupts the results of course because
recording the information generates IO in itself but I'm ignoring
that for the moment so the effect of the patches can be seen.

I've posted the raw reports for each filesystem at

http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-ext3/sandy/comparison.html
http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-ext4/sandy/comparison.html
http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-btrfs/sandy/comparison.html
http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-xfs/sandy/comparison.html

As it was Dave and Christoph that brought this back up, here is the
XFS report in a bit more detail;

FS-Mark
                        fsmark-3.0.0         3.0.0-rc6               3.0.0-rc6         3.0.0-rc6               3.0.0-rc6         3.0.0-rc6
                         rc6-vanilla   nodirectwb-v1r3       lesskswapdwb-v1r3    throttle-v1r10         immediate-v1r10   prioinode-v1r10
Files/s  min           5.30 ( 0.00%)        5.10 (-3.92%)        5.40 ( 1.85%)        5.70 ( 7.02%)        5.80 ( 8.62%)        5.70 ( 7.02%)
Files/s  mean          6.93 ( 0.00%)        6.96 ( 0.40%)        7.11 ( 2.53%)        7.52 ( 7.82%)        7.44 ( 6.83%)        7.48 ( 7.38%)
Files/s  stddev        0.89 ( 0.00%)        0.99 (10.62%)        0.85 (-4.18%)        1.02 (13.23%)        1.08 (18.06%)        1.00 (10.72%)
Files/s  max           8.10 ( 0.00%)        8.60 ( 5.81%)        8.20 ( 1.22%)        9.50 (14.74%)        9.00 (10.00%)        9.10 (10.99%)
Overhead min        6623.00 ( 0.00%)     6417.00 ( 3.21%)     6035.00 ( 9.74%)     6354.00 ( 4.23%)     6213.00 ( 6.60%)     6491.00 ( 2.03%)
Overhead mean      29678.24 ( 0.00%)    40053.96 (-25.90%)    18278.56 (62.37%)    16365.20 (81.35%)    11987.40 (147.58%)    15606.36 (90.17%)
Overhead stddev    68727.49 ( 0.00%)   116258.18 (-40.88%)    34121.42 (101.42%)    28963.27 (137.29%)    17221.33 (299.08%)    26231.50 (162.00%)
Overhead max      339993.00 ( 0.00%)   588147.00 (-42.19%)   148281.00 (129.29%)   140568.00 (141.87%)    77836.00 (336.81%)   124728.00 (172.59%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)         34.97     35.31     31.16     30.47     29.85     29.66
Total Elapsed Time (seconds)                567.08    566.84    551.75    525.81    534.91    526.32

Average files per second is increased by a nice percentage, albeit
just within the standard deviation. Considering the type of test this
is, variability was inevitable, but I will double check without
monitoring running.

The overhead (time spent in non-filesystem-related activities) is
reduced a *lot* and is a lot less variable. Time to completion is
improved across the board, which is always good because it implies
that IO was consistently higher; this is sort of visible 4 minutes
into the test at

http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-xfs/sandy/blockio-comparison-sandy.png
http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-xfs/sandy/blockio-comparison-smooth-sandy.png

kswapd CPU usage is also interesting

http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-xfs/sandy/kswapdcpu-comparison-smooth-sandy.png

Note how preventing kswapd from reclaiming dirty pages pushes up its
CPU usage as it scans more pages, but the throttle brings it back down
and it is reduced further by patches 4 and 5.

MMTests Statistics: vmstat
Page Ins                                    189840    196608    189864    128120    126148    151888
Page Outs                                 38439897  38420872  38422937  38395008  38367766  38396612
Swap Ins                                     19468     20555     20024      4933      3799      4588
Swap Outs                                    10019     10388     10353      4737      3617      4084
Direct pages scanned                       4865170   4903030   1359813    408460    101716    199483
Kswapd pages scanned                       8202014   8146467  16980235  19428420  14269907  14103872
Kswapd pages reclaimed                     4700400   4665093   8205753   9143997   9449722   9358347
Direct pages reclaimed                     4864514   4901411   1359368    407711    100520    198323
Kswapd efficiency                              57%       57%       48%       47%       66%       66%
Kswapd velocity                          14463.592 14371.722 30775.233 36949.506 26677.211 26797.142
Direct efficiency                              99%       99%       99%       99%       98%       99%
Direct velocity                           8579.336  8649.760  2464.546   776.821   190.155   379.015
Percentage direct scans                        37%       37%        7%        2%        0%        1%
Page writes by reclaim                       14511     14721     10387      4819      3617      4084
Page writes skipped                              0        30   2300502   2774735         0         0
Page reclaim invalidate                          0         0         0         0      5155      3509
Page reclaim throttled                           0         0         0     65112       190       190
Slabs scanned                                16512     17920     18048     17536     16640     17408
Direct inode steals                              0         0         0         0         0         0
Kswapd inode steals                           5180      5318      5177      5178      5179      5193
Kswapd skipped wait                            131         0         4        44         0         0
Compaction stalls                                2         2         0         0         5         1
Compaction success                               2         2         0         0         2         1
Compaction failures                              0         0         0         0         3         0
Compaction pages moved                           0         0         0         0      1049         0
Compaction move failure                          0         0         0         0        96         0

These stats are based on information from /proc/vmstat

"Kswapd efficiency" is the percentage of pages reclaimed to pages
scanned. The higher the percentage is the better because a low
percentage implies that kswapd is scanning uselessly. As the workload
dirties memory heavily and the machine is small, the efficiency starts
low at 57% but increases to 66% with all the patches applied.

"Kswapd velocity" is the average number of pages scanned per
second. The patches increase this as kswapd is no longer getting
blocked on page writes, so the increase is expected.

Direct reclaim work is significantly reduced going from 37% of all
pages scanned to 1% with all patches applied. This implies that
processes are getting stalled less.

Page writes by reclaim is what is motivating this series. It goes
from 14511 pages to 4084 which is a big improvement. We'll see later
if these were anonymous or file-backed pages.

"Page writes skipped" are dirty pages encountered at the end of the
LRU, and the counter only exists for patches 2, 3 and 4. It shows that kswapd is
encountering very large numbers of dirty pages (debugging showed they
weren't under writeback). The number of pages that get invalidated and
freed later is a more reasonable number and "page reclaim throttled"
shows that throttling is not a major problem.

FTrace Reclaim Statistics: vmscan
		       	           fsmark-3.0.0         3.0.0-rc6         3.0.0-rc6         3.0.0-rc6         3.0.0-rc6         3.0.0-rc6
               			    rc6-vanilla   nodirectwb-v1r3 lesskswapdwb-v1r3    throttle-v1r10   immediate-v1r10   prioinode-v1r10
Direct reclaims                              89145      89785      24921       7546       1954       3747 
Direct reclaim pages scanned               4865170    4903030    1359813     408460     101716     199483 
Direct reclaim pages reclaimed             4864514    4901411    1359368     407711     100520     198323 
Direct reclaim write file async I/O              0          0          0          0          0          0 
Direct reclaim write anon async I/O              0          0          0          3          1          0 
Direct reclaim write file sync I/O               0          0          0          0          0          0 
Direct reclaim write anon sync I/O               0          0          0          0          0          0 
Wake kswapd requests                         11152      11021      21223      24029      26797      26672 
Kswapd wakeups                                 421        397        761        778        776        742 
Kswapd pages scanned                       8202014    8146467   16980235   19428420   14269907   14103872 
Kswapd pages reclaimed                     4700400    4665093    8205753    9143997    9449722    9358347 
Kswapd reclaim write file async I/O           4483       4286          0          1          0          0 
Kswapd reclaim write anon async I/O          10027      10435      10387       4815       3616       4084 
Kswapd reclaim write file sync I/O               0          0          0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0          0          0 
Time stalled direct reclaim (seconds)         0.26       0.25       0.08       0.05       0.04       0.08 
Time kswapd awake (seconds)                 493.26     494.05     430.09     420.52     428.55     428.81 

Total pages scanned                       13067184  13049497  18340048  19836880  14371623  14303355
Total pages reclaimed                      9564914   9566504   9565121   9551708   9550242   9556670
%age total pages scanned/reclaimed          73.20%    73.31%    52.15%    48.15%    66.45%    66.81%
%age total pages scanned/written             0.11%     0.11%     0.06%     0.02%     0.03%     0.03%
%age  file pages scanned/written             0.03%     0.03%     0.00%     0.00%     0.00%     0.00%
Percentage Time Spent Direct Reclaim         0.74%     0.70%     0.26%     0.16%     0.13%     0.27%
Percentage Time kswapd Awake                86.98%    87.16%    77.95%    79.98%    80.12%    81.47%

This is based on information from the vmscan tracepoints introduced
the last time this issue came up.

Direct reclaim writes were never a problem according to this.

kswapd writes of file-backed pages on the other hand went from 4483 to
0, which is nice and part of the objective after all. The page writes of
4084 recorded from /proc/vmstat with all patches applied was clearly
due to writing anonymous pages as there is a direct correlation there.

Time spent in direct reclaim is reduced quite a bit, as is the time
kswapd spent awake.

FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited                 0          0          0          0          0          0 
Direct time   congest     waited               0ms        0ms        0ms        0ms        0ms        0ms 
Direct full   congest     waited                 0          0          0          0          0          0 
Direct number conditional waited                 0          1          0         56          8          0 
Direct time   conditional waited               0ms        0ms        0ms        0ms        0ms        0ms 
Direct full   conditional waited                 0          0          0          0          0          0 
KSwapd number congest     waited                 4          0          1          0          6          0 
KSwapd time   congest     waited             400ms        0ms      100ms        0ms      501ms        0ms 
KSwapd full   congest     waited                 4          0          1          0          5          0 
KSwapd number conditional waited                 0          0          0      65056        189        190 
KSwapd time   conditional waited               0ms        0ms        0ms        1ms        0ms        0ms 
KSwapd full   conditional waited                 0          0          0          0          0          0 

This is based on some of the writeback tracepoints. It's interesting
to note that while kswapd got throttled 190 times with all patches
applied, it spent negligible time asleep so probably just called
cond_resched(). This implies that neither the zone nor the backing
device was congested. As there is only one source of IO, this is
expected. With multiple processes, this picture might change.


MICRO
                   micro-3.0.0         3.0.0-rc6         3.0.0-rc6         3.0.0-rc6         3.0.0-rc6         3.0.0-rc6
                   rc6-vanilla   nodirectwb-v1r3 lesskswapdwb-v1r3    throttle-v1r10   immediate-v1r10   prioinode-v1r10
MMTests Statistics: duration
User/Sys Time Running Test (seconds)          6.95       7.2      6.84      6.33      5.97      6.13
Total Elapsed Time (seconds)                 56.34     65.04     66.53     63.24     52.48     63.00

This is a test that just writes a mapping. Unfortunately, the time to
completion is increased by the series. Again I'll have to run without
any monitoring to confirm it's a problem.

MMTests Statistics: vmstat
Page Ins                                     46928     50660     48504     42888     42648     43036
Page Outs                                  4990816   4994987   4987572   4999242   4981324   4990627
Swap Ins                                      2573      3234      2470      1396      1352      1297
Swap Outs                                     2316      2578      2360       937       912       873
Direct pages scanned                       1834430   2016994   1623675   1843754   1922668   1941916
Kswapd pages scanned                       1399007   1272637   1842874   1810867   1425366   1426536
Kswapd pages reclaimed                      637708    657418    860512    884531    906608    927206
Direct pages reclaimed                      536567    517876    314115    289472    272265    252361
Kswapd efficiency                              45%       51%       46%       48%       63%       64%
Kswapd velocity                          24831.505 19566.990 27699.895 28634.836 27160.175 22643.429
Direct efficiency                              29%       25%       19%       15%       14%       12%
Direct velocity                          32559.993 31011.593 24405.156 29154.870 36636.204 30824.063
Percentage direct scans                        56%       61%       46%       50%       57%       57%
Page writes by reclaim                        2706      2910      2416       969       912       873
Page writes skipped                              0     12640    148339    166844         0         0
Page reclaim invalidate                          0         0         0         0        12        58
Page reclaim throttled                           0         0         0      4788         7         9
Slabs scanned                                 4096      5248      5120      6656      4480     16768
Direct inode steals                            531      1189       348      1166       700      3783
Kswapd inode steals                            164         0       349         0         0         9
Kswapd skipped wait                             78        35        74        51        14        10
Compaction stalls                                0         0         1         0         0         0
Compaction success                               0         0         1         0         0         0
Compaction failures                              0         0         0         0         0         0
Compaction pages moved                           0         0         0         0         0         0
Compaction move failure                          0         0         0         0         0         0

Kswapd efficiency is up, but kswapd was doing less work according to
kswapd velocity.

Direct reclaim efficiency is worse as well.

It's writing fewer pages at least.

FTrace Reclaim Statistics: vmscan
                   micro-3.0.0         3.0.0-rc6         3.0.0-rc6         3.0.0-rc6         3.0.0-rc6         3.0.0-rc6
                   rc6-vanilla   nodirectwb-v1r3 lesskswapdwb-v1r3    throttle-v1r10   immediate-v1r10   prioinode-v1r10
Direct reclaims                               9823       9477       5737       5347       5078       4720 
Direct reclaim pages scanned               1834430    2016994    1623675    1843754    1922668    1941916 
Direct reclaim pages reclaimed              536567     517876     314115     289472     272265     252361 
Direct reclaim write file async I/O              0          0          0          0          0          0 
Direct reclaim write anon async I/O              0          0          0          0         16          0 
Direct reclaim write file sync I/O               0          0          0          0          0          0 
Direct reclaim write anon sync I/O               0          0          0          0          0          0 
Wake kswapd requests                          1636       1692       2177       2403       2707       2757 
Kswapd wakeups                                  28         29         30         34         15         23 
Kswapd pages scanned                       1399007    1272637    1842874    1810867    1425366    1426536 
Kswapd pages reclaimed                      637708     657418     860512     884531     906608     927206 
Kswapd reclaim write file async I/O            380        332         56         32          0          0 
Kswapd reclaim write anon async I/O           2326       2578       2360        937        896        873 
Kswapd reclaim write file sync I/O               0          0          0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0          0          0 
Time stalled direct reclaim (seconds)         2.06       2.10       1.62       2.65       2.25       1.86 
Time kswapd awake (seconds)                  49.44      56.39      54.31      55.45      47.00      56.74 

Total pages scanned                        3233437   3289631   3466549   3654621   3348034   3368452
Total pages reclaimed                      1174275   1175294   1174627   1174003   1178873   1179567
%age total pages scanned/reclaimed          36.32%    35.73%    33.88%    32.12%    35.21%    35.02%
%age total pages scanned/written             0.08%     0.09%     0.07%     0.03%     0.03%     0.03%
%age  file pages scanned/written             0.01%     0.01%     0.00%     0.00%     0.00%     0.00%
Percentage Time Spent Direct Reclaim        22.86%    22.58%    19.15%    29.51%    27.37%    23.28%
Percentage Time kswapd Awake                87.75%    86.70%    81.63%    87.68%    89.56%    90.06%

Again, writes of file pages are reduced but kswapd is clearly awake
for longer.

What is interesting is that the number of pages written without the
patches was already quite low. This means there is relatively little room
for improvement in this benchmark.

FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited                 0          0          0          0          0          0 
Direct time   congest     waited               0ms        0ms        0ms        0ms        0ms        0ms 
Direct full   congest     waited                 0          0          0          0          0          0 
Direct number conditional waited               768        793        704       1359        608        674 
Direct time   conditional waited               0ms        0ms        0ms        0ms        0ms        0ms 
Direct full   conditional waited                 0          0          0          0          0          0 
KSwapd number congest     waited                41         22         58         43         78         92 
KSwapd time   congest     waited            2937ms     2200ms     4543ms     4300ms     7800ms     9200ms 
KSwapd full   congest     waited                29         22         45         43         78         92 
KSwapd number conditional waited                 0          0          0       4284          4          9 
KSwapd time   conditional waited               0ms        0ms        0ms        0ms        0ms        0ms 
KSwapd full   conditional waited                 0          0          0          0          0          0 

Some throttling occurred, but little time was spent asleep.

The objective of the series - reducing writes from reclaim - is
met with filesystem writes from reclaim reduced to 0 with reclaim
in general doing less work. ext3, ext4 and xfs all showed marked
improvements for fs_mark in this configuration. btrfs looked worse
but it's within the noise and I'd expect the patches to have little
or no impact there due to it ignoring ->writepage from reclaim.

I'm rerunning the tests without monitors at the moment to verify the
performance improvements which will take about 6 hours to complete
but so far it looks promising.

Comments?

 fs/fs-writeback.c         |   56 ++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/fs.h        |    5 ++-
 include/linux/mmzone.h    |    2 +
 include/linux/writeback.h |    1 +
 mm/vmscan.c               |   55 +++++++++++++++++++++++++++++++++++++++++--
 mm/vmstat.c               |    2 +
 6 files changed, 115 insertions(+), 6 deletions(-)

-- 
1.7.3.4



* [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
  2011-07-13 14:31 [RFC PATCH 0/5] Reduce filesystem writeback from page reclaim (again) Mel Gorman
@ 2011-07-13 14:31 ` Mel Gorman
  2011-07-13 23:34   ` Dave Chinner
  2011-07-14  1:38   ` KAMEZAWA Hiroyuki
  2011-07-13 14:31 ` [PATCH 2/5] mm: vmscan: Do not writeback filesystem pages in kswapd except in high priority Mel Gorman
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 38+ messages in thread
From: Mel Gorman @ 2011-07-13 14:31 UTC (permalink / raw)
  To: Linux-MM
  Cc: LKML, XFS, Dave Chinner, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim, Mel Gorman

From: Mel Gorman <mel@csn.ul.ie>

When kswapd is failing to keep zones above the min watermark, a process
will enter direct reclaim in the same manner kswapd does. If a dirty
page is encountered during the scan, this page is written to backing
storage using mapping->writepage.

This causes two problems. First, it can result in very deep call
stacks, particularly if the target storage or filesystem are complex.
Some filesystems ignore write requests from direct reclaim as a result.
The second is that a single-page flush is inefficient in terms of IO.
While there is an expectation that the elevator will merge requests,
this does not always happen. Quoting Christoph Hellwig;

	The elevator has a relatively small window it can operate on,
	and can never fix up a bad large scale writeback pattern.

This patch prevents direct reclaim writing back filesystem pages by
checking if current is kswapd. Anonymous pages are still written to
swap as there is no equivalent of a flusher thread for anonymous
pages. If the dirty pages cannot be written back, they are placed
back on the LRU lists.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |    1 +
 mm/vmscan.c            |    9 +++++++++
 mm/vmstat.c            |    1 +
 3 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9f7c3eb..b70a0c0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -100,6 +100,7 @@ enum zone_stat_item {
 	NR_UNSTABLE_NFS,	/* NFS unstable pages */
 	NR_BOUNCE,
 	NR_VMSCAN_WRITE,
+	NR_VMSCAN_WRITE_SKIP,
 	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
 	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4f49535..2d3e5b6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -825,6 +825,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		if (PageDirty(page)) {
 			nr_dirty++;
 
+			/*
+			 * Only kswapd can writeback filesystem pages to
+			 * avoid risk of stack overflow
+			 */
+			if (page_is_file_cache(page) && !current_is_kswapd()) {
+				inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
+				goto keep_locked;
+			}
+
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;
 			if (!may_enter_fs)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 20c18b7..fd109f3 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -702,6 +702,7 @@ const char * const vmstat_text[] = {
 	"nr_unstable",
 	"nr_bounce",
 	"nr_vmscan_write",
+	"nr_vmscan_write_skip",
 	"nr_writeback_temp",
 	"nr_isolated_anon",
 	"nr_isolated_file",
-- 
1.7.3.4



* [PATCH 2/5] mm: vmscan: Do not writeback filesystem pages in kswapd except in high priority
  2011-07-13 14:31 [RFC PATCH 0/5] Reduce filesystem writeback from page reclaim (again) Mel Gorman
  2011-07-13 14:31 ` [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim Mel Gorman
@ 2011-07-13 14:31 ` Mel Gorman
  2011-07-13 23:37   ` Dave Chinner
  2011-07-13 14:31 ` [PATCH 3/5] mm: vmscan: Throttle reclaim if encountering too many dirty pages under writeback Mel Gorman
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2011-07-13 14:31 UTC (permalink / raw)
  To: Linux-MM
  Cc: LKML, XFS, Dave Chinner, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim, Mel Gorman

It is preferable that no dirty pages are dispatched for cleaning from
the page reclaim path. At normal priorities, this patch prevents kswapd
writing pages.

However, page reclaim does have a requirement that pages be freed
in a particular zone. If it is failing to make sufficient progress
(reclaiming < SWAP_CLUSTER_MAX pages at any given priority), the
priority is raised to scan more pages. A priority of DEF_PRIORITY - 3
is considered to be the point where kswapd is getting into trouble
reclaiming pages. If this priority is reached, kswapd will dispatch
pages for writing.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   13 ++++++++-----
 1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2d3e5b6..e272951 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -719,7 +719,8 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
 				      struct zone *zone,
-				      struct scan_control *sc)
+				      struct scan_control *sc,
+				      int priority)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
@@ -827,9 +828,11 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 
 			/*
 			 * Only kswapd can writeback filesystem pages to
-			 * avoid risk of stack overflow
+			 * avoid risk of stack overflow but do not writeback
+			 * unless under significant pressure.
 			 */
-			if (page_is_file_cache(page) && !current_is_kswapd()) {
+			if (page_is_file_cache(page) &&
+					(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
 				inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
 				goto keep_locked;
 			}
@@ -1465,12 +1468,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	spin_unlock_irq(&zone->lru_lock);
 
-	nr_reclaimed = shrink_page_list(&page_list, zone, sc);
+	nr_reclaimed = shrink_page_list(&page_list, zone, sc, priority);
 
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
 		set_reclaim_mode(priority, sc, true);
-		nr_reclaimed += shrink_page_list(&page_list, zone, sc);
+		nr_reclaimed += shrink_page_list(&page_list, zone, sc, priority);
 	}
 
 	local_irq_disable();
-- 
1.7.3.4


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 3/5] mm: vmscan: Throttle reclaim if encountering too many dirty pages under writeback
  2011-07-13 14:31 [RFC PATCH 0/5] Reduce filesystem writeback from page reclaim (again) Mel Gorman
  2011-07-13 14:31 ` [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim Mel Gorman
  2011-07-13 14:31 ` [PATCH 2/5] mm: vmscan: Do not writeback filesystem pages in kswapd except in high priority Mel Gorman
@ 2011-07-13 14:31 ` Mel Gorman
  2011-07-13 23:41   ` Dave Chinner
  2011-07-13 14:31 ` [PATCH 4/5] mm: vmscan: Immediately reclaim end-of-LRU dirty pages when writeback completes Mel Gorman
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2011-07-13 14:31 UTC (permalink / raw)
  To: Linux-MM
  Cc: LKML, XFS, Dave Chinner, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim, Mel Gorman

Workloads that are allocating frequently and writing files place a
large number of dirty pages on the LRU. With use-once logic, it is
possible for them to reach the end of the LRU quickly requiring the
reclaimer to scan more to find clean pages. Ordinarily, processes that
are dirtying memory will get throttled by dirty balancing but this
is a global heuristic and does not take into account that LRUs are
maintained on a per-zone basis. This can lead to a situation whereby
reclaim is scanning heavily, skipping over a large number of pages
under writeback and recycling them around the LRU consuming CPU.

This patch checks how many of the pages isolated from the LRU were
dirty. If a percentage of them are dirty, the process will be
throttled if the backing device is congested or the zone being scanned
is marked congested. The percentage that must be dirty depends on
the priority. At default priority, all of them must be dirty. At
DEF_PRIORITY-1, 50% of them must be dirty, at DEF_PRIORITY-2, 25%,
etc., i.e. as pressure increases, the more likely the process is to be
throttled, allowing the flusher threads to make some progress.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |    1 +
 mm/vmscan.c            |   23 ++++++++++++++++++++---
 mm/vmstat.c            |    1 +
 3 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b70a0c0..c4508a2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -101,6 +101,7 @@ enum zone_stat_item {
 	NR_BOUNCE,
 	NR_VMSCAN_WRITE,
 	NR_VMSCAN_WRITE_SKIP,
+	NR_VMSCAN_THROTTLED,
 	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
 	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e272951..9826086 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -720,7 +720,8 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
 static unsigned long shrink_page_list(struct list_head *page_list,
 				      struct zone *zone,
 				      struct scan_control *sc,
-				      int priority)
+				      int priority,
+				      unsigned long *ret_nr_dirty)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
@@ -971,6 +972,7 @@ keep_lumpy:
 
 	list_splice(&ret_pages, page_list);
 	count_vm_events(PGACTIVATE, pgactivate);
+	*ret_nr_dirty += nr_dirty;
 	return nr_reclaimed;
 }
 
@@ -1420,6 +1422,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	unsigned long nr_taken;
 	unsigned long nr_anon;
 	unsigned long nr_file;
+	unsigned long nr_dirty = 0;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1468,12 +1471,14 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	spin_unlock_irq(&zone->lru_lock);
 
-	nr_reclaimed = shrink_page_list(&page_list, zone, sc, priority);
+	nr_reclaimed = shrink_page_list(&page_list, zone, sc,
+							priority, &nr_dirty);
 
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
 		set_reclaim_mode(priority, sc, true);
-		nr_reclaimed += shrink_page_list(&page_list, zone, sc, priority);
+		nr_reclaimed += shrink_page_list(&page_list, zone, sc,
+							priority, &nr_dirty);
 	}
 
 	local_irq_disable();
@@ -1483,6 +1488,18 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
 
+	/*
+	 * If we have encountered a high number of dirty pages then they
+	 * are reaching the end of the LRU too quickly and global limits are
+	 * not enough to throttle processes due to the page distribution
+	 * throughout zones. Scale the number of pages that must be
+	 * dirty before throttling with the scan priority.
+	 */
+	if (nr_dirty && nr_dirty >= (nr_taken >> (DEF_PRIORITY-priority))) {
+		inc_zone_state(zone, NR_VMSCAN_THROTTLED);
+		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
+	}
+
 	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
 		zone_idx(zone),
 		nr_scanned, nr_reclaimed,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index fd109f3..59ee17c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -703,6 +703,7 @@ const char * const vmstat_text[] = {
 	"nr_bounce",
 	"nr_vmscan_write",
 	"nr_vmscan_write_skip",
+	"nr_vmscan_throttled",
 	"nr_writeback_temp",
 	"nr_isolated_anon",
 	"nr_isolated_file",
-- 
1.7.3.4


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 4/5] mm: vmscan: Immediately reclaim end-of-LRU dirty pages when writeback completes
  2011-07-13 14:31 [RFC PATCH 0/5] Reduce filesystem writeback from page reclaim (again) Mel Gorman
                   ` (2 preceding siblings ...)
  2011-07-13 14:31 ` [PATCH 3/5] mm: vmscan: Throttle reclaim if encountering too many dirty pages under writeback Mel Gorman
@ 2011-07-13 14:31 ` Mel Gorman
  2011-07-13 16:40   ` Johannes Weiner
  2011-07-13 14:31 ` [PATCH 5/5] mm: writeback: Prioritise dirty inodes encountered by direct reclaim for background flushing Mel Gorman
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2011-07-13 14:31 UTC (permalink / raw)
  To: Linux-MM
  Cc: LKML, XFS, Dave Chinner, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim, Mel Gorman

When direct reclaim encounters a dirty page, it gets recycled around
the LRU for another cycle. This patch marks the page PageReclaim using
deactivate_page() so that the page gets reclaimed almost immediately
after the page gets cleaned. This is to avoid reclaiming clean pages
that are younger than a dirty page encountered at the end of the LRU
that might have been something like a use-once page.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |    2 +-
 mm/vmscan.c            |   10 ++++++++--
 mm/vmstat.c            |    2 +-
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c4508a2..bea7858 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -100,7 +100,7 @@ enum zone_stat_item {
 	NR_UNSTABLE_NFS,	/* NFS unstable pages */
 	NR_BOUNCE,
 	NR_VMSCAN_WRITE,
-	NR_VMSCAN_WRITE_SKIP,
+	NR_VMSCAN_INVALIDATE,
 	NR_VMSCAN_THROTTLED,
 	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
 	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9826086..8e00aee 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -834,8 +834,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			 */
 			if (page_is_file_cache(page) &&
 					(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
-				inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
-				goto keep_locked;
+				inc_zone_page_state(page, NR_VMSCAN_INVALIDATE);
+
+				/* Immediately reclaim when written back */
+				unlock_page(page);
+				deactivate_page(page);
+
+				goto keep_dirty;
 			}
 
 			if (references == PAGEREF_RECLAIM_CLEAN)
@@ -956,6 +961,7 @@ keep:
 		reset_reclaim_mode(sc);
 keep_lumpy:
 		list_add(&page->lru, &ret_pages);
+keep_dirty:
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 59ee17c..2c82ae5 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -702,7 +702,7 @@ const char * const vmstat_text[] = {
 	"nr_unstable",
 	"nr_bounce",
 	"nr_vmscan_write",
-	"nr_vmscan_write_skip",
+	"nr_vmscan_invalidate",
 	"nr_vmscan_throttled",
 	"nr_writeback_temp",
 	"nr_isolated_anon",
-- 
1.7.3.4


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 5/5] mm: writeback: Prioritise dirty inodes encountered by direct reclaim for background flushing
  2011-07-13 14:31 [RFC PATCH 0/5] Reduce filesystem writeback from page reclaim (again) Mel Gorman
                   ` (3 preceding siblings ...)
  2011-07-13 14:31 ` [PATCH 4/5] mm: vmscan: Immediately reclaim end-of-LRU dirty pages when writeback completes Mel Gorman
@ 2011-07-13 14:31 ` Mel Gorman
  2011-07-13 21:39   ` Jan Kara
                     ` (2 more replies)
  2011-07-13 15:31 ` [RFC PATCH 0/5] Reduce filesystem writeback from page reclaim (again) Mel Gorman
  2011-07-14  0:33 ` Dave Chinner
  6 siblings, 3 replies; 38+ messages in thread
From: Mel Gorman @ 2011-07-13 14:31 UTC (permalink / raw)
  To: Linux-MM
  Cc: LKML, XFS, Dave Chinner, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim, Mel Gorman

It is preferable that no dirty pages are dispatched from the page
reclaim path. If reclaim is encountering dirty pages, it implies that
either reclaim is getting ahead of writeback or use-once logic has
prioritised pages for reclaiming that are young relative to when the
inode was dirtied.

When dirty pages are encountered on the LRU, this patch marks the inodes
I_DIRTY_RECLAIM and wakes the background flusher. When the background
flusher runs, it moves such inodes immediately to the dispatch queue
regardless of inode age. There is no guarantee that the pages reclaim
cares about will be cleaned first but the expectation is that the
flusher threads will clean the pages more quickly than if reclaim tried
to clean a single page.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 fs/fs-writeback.c         |   56 ++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/fs.h        |    5 ++-
 include/linux/writeback.h |    1 +
 mm/vmscan.c               |   16 ++++++++++++-
 4 files changed, 74 insertions(+), 4 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 0f015a0..1201052 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -257,9 +257,23 @@ static void move_expired_inodes(struct list_head *delaying_queue,
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
-	struct inode *inode;
+	struct inode *inode, *tinode;
 	int do_sb_sort = 0;
 
+	/* Move inodes reclaim found at end of LRU to dispatch queue */
+	list_for_each_entry_safe(inode, tinode, delaying_queue, i_wb_list) {
+		/* Move any inode found at end of LRU to dispatch queue */
+		if (inode->i_state & I_DIRTY_RECLAIM) {
+			inode->i_state &= ~I_DIRTY_RECLAIM;
+			list_move(&inode->i_wb_list, &tmp);
+
+			if (sb && sb != inode->i_sb)
+				do_sb_sort = 1;
+			sb = inode->i_sb;
+		}
+	}
+
+	sb = NULL;
 	while (!list_empty(delaying_queue)) {
 		inode = wb_inode(delaying_queue->prev);
 		if (older_than_this &&
@@ -968,6 +982,46 @@ void wakeup_flusher_threads(long nr_pages)
 	rcu_read_unlock();
 }
 
+/*
+ * Similar to wakeup_flusher_threads except prioritise inodes contained
+ * in the page_list regardless of age
+ */
+void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list)
+{
+	struct page *page;
+	struct address_space *mapping;
+	struct inode *inode;
+
+	list_for_each_entry(page, page_list, lru) {
+		if (!PageDirty(page))
+			continue;
+
+		if (PageSwapBacked(page))
+			continue;
+
+		lock_page(page);
+		mapping = page_mapping(page);
+		if (!mapping)
+			goto unlock;
+
+		/*
+		 * Test outside the lock to see if it is already set. The
+		 * inode should be pinned by the page lock.
+		 */
+		inode = page->mapping->host;
+		if (inode->i_state & I_DIRTY_RECLAIM)
+			goto unlock;
+
+		spin_lock(&inode->i_lock);
+		inode->i_state |= I_DIRTY_RECLAIM;
+		spin_unlock(&inode->i_lock);
+unlock:
+		unlock_page(page);
+	}
+
+	wakeup_flusher_threads(nr_pages);
+}
+
 static noinline void block_dump___mark_inode_dirty(struct inode *inode)
 {
 	if (inode->i_ino || strcmp(inode->i_sb->s_id, "bdev")) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b5b9792..bb0f4c2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1650,8 +1650,8 @@ struct super_operations {
 /*
  * Inode state bits.  Protected by inode->i_lock
  *
- * Three bits determine the dirty state of the inode, I_DIRTY_SYNC,
- * I_DIRTY_DATASYNC and I_DIRTY_PAGES.
+ * Four bits determine the dirty state of the inode, I_DIRTY_SYNC,
+ * I_DIRTY_DATASYNC, I_DIRTY_PAGES and I_DIRTY_RECLAIM.
  *
  * Four bits define the lifetime of an inode.  Initially, inodes are I_NEW,
  * until that flag is cleared.  I_WILL_FREE, I_FREEING and I_CLEAR are set at
@@ -1706,6 +1706,7 @@ struct super_operations {
 #define __I_SYNC		7
 #define I_SYNC			(1 << __I_SYNC)
 #define I_REFERENCED		(1 << 8)
+#define I_DIRTY_RECLAIM		(1 << 9)
 
 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
 
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 17e7ccc..1e77793 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -66,6 +66,7 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
 		struct writeback_control *wbc);
 long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
 void wakeup_flusher_threads(long nr_pages);
+void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list);
 
 /* writeback.h requires fs.h; it, too, is not included from here. */
 static inline void wait_on_inode(struct inode *inode)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8e00aee..db62af1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -725,8 +725,11 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
+	LIST_HEAD(dirty_pages);
+
 	int pgactivate = 0;
 	unsigned long nr_dirty = 0;
+	unsigned long nr_unqueued_dirty = 0;
 	unsigned long nr_congested = 0;
 	unsigned long nr_reclaimed = 0;
 
@@ -830,7 +833,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			/*
 			 * Only kswapd can writeback filesystem pages to
 			 * avoid risk of stack overflow but do not writeback
-			 * unless under significant pressure.
+			 * unless under significant pressure. For dirty pages
+			 * not under writeback, create a list and pass the
+			 * inodes to the flusher threads later
 			 */
 			if (page_is_file_cache(page) &&
 					(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
@@ -840,6 +845,10 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				unlock_page(page);
 				deactivate_page(page);
 
+				/* Prioritise the backing inodes later */
+				nr_unqueued_dirty++;
+				list_add(&page->lru, &dirty_pages);
+
 				goto keep_dirty;
 			}
 
@@ -976,6 +985,11 @@ keep_dirty:
 
 	free_page_list(&free_pages);
 
+	if (!list_empty(&dirty_pages)) {
+		wakeup_flusher_threads_pages(nr_unqueued_dirty, &dirty_pages);
+		list_splice(&dirty_pages, &ret_pages);
+	}
+
 	list_splice(&ret_pages, page_list);
 	count_vm_events(PGACTIVATE, pgactivate);
 	*ret_nr_dirty += nr_dirty;
-- 
1.7.3.4


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/5] Reduce filesystem writeback from page reclaim (again)
  2011-07-13 14:31 [RFC PATCH 0/5] Reduce filesystem writeback from page reclaim (again) Mel Gorman
                   ` (4 preceding siblings ...)
  2011-07-13 14:31 ` [PATCH 5/5] mm: writeback: Prioritise dirty inodes encountered by direct reclaim for background flushing Mel Gorman
@ 2011-07-13 15:31 ` Mel Gorman
  2011-07-14  0:33 ` Dave Chinner
  6 siblings, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2011-07-13 15:31 UTC (permalink / raw)
  To: Linux-MM
  Cc: LKML, XFS, Dave Chinner, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Wed, Jul 13, 2011 at 03:31:22PM +0100, Mel Gorman wrote:
> <SNIP>
> The objective of the series - reducing writes from reclaim - is
> met with filesystem writes from reclaim reduced to 0 with reclaim
> in general doing less work. ext3, ext4 and xfs all showed marked
> improvements for fs_mark in this configuration. btrfs looked worse
> but it's within the noise and I'd expect the patches to have little
> or no impact there due it ignoring ->writepage from reclaim.
> 

My bad, I accidentally looked at an old report for btrfs based on
older patches. In the report posted with all patches applied, the
performance of btrfs does look better but as the patches should make
no difference, it's still in the noise.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/5] mm: vmscan: Immediately reclaim end-of-LRU dirty pages when writeback completes
  2011-07-13 14:31 ` [PATCH 4/5] mm: vmscan: Immediately reclaim end-of-LRU dirty pages when writeback completes Mel Gorman
@ 2011-07-13 16:40   ` Johannes Weiner
  2011-07-13 17:15     ` Mel Gorman
  0 siblings, 1 reply; 38+ messages in thread
From: Johannes Weiner @ 2011-07-13 16:40 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Wed, Jul 13, 2011 at 03:31:26PM +0100, Mel Gorman wrote:
> When direct reclaim encounters a dirty page, it gets recycled around
> the LRU for another cycle. This patch marks the page PageReclaim using
> deactivate_page() so that the page gets reclaimed almost immediately
> after the page gets cleaned. This is to avoid reclaiming clean pages
> that are younger than a dirty page encountered at the end of the LRU
> that might have been something like a use-once page.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  include/linux/mmzone.h |    2 +-
>  mm/vmscan.c            |   10 ++++++++--
>  mm/vmstat.c            |    2 +-
>  3 files changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index c4508a2..bea7858 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -100,7 +100,7 @@ enum zone_stat_item {
>  	NR_UNSTABLE_NFS,	/* NFS unstable pages */
>  	NR_BOUNCE,
>  	NR_VMSCAN_WRITE,
> -	NR_VMSCAN_WRITE_SKIP,
> +	NR_VMSCAN_INVALIDATE,
>  	NR_VMSCAN_THROTTLED,
>  	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
>  	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9826086..8e00aee 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -834,8 +834,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			 */
>  			if (page_is_file_cache(page) &&
>  					(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
> -				inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> -				goto keep_locked;
> +				inc_zone_page_state(page, NR_VMSCAN_INVALIDATE);
> +
> +				/* Immediately reclaim when written back */
> +				unlock_page(page);
> +				deactivate_page(page);
> +
> +				goto keep_dirty;
>  			}
>  
>  			if (references == PAGEREF_RECLAIM_CLEAN)
> @@ -956,6 +961,7 @@ keep:
>  		reset_reclaim_mode(sc);
>  keep_lumpy:
>  		list_add(&page->lru, &ret_pages);
> +keep_dirty:
>  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
>  	}

I really like the idea behind this patch, but I think all those pages
are lost as PageLRU is cleared on isolation and lru_deactivate_fn
bails on them in turn.

If I'm not mistaken, the reference from the isolation is also leaked.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/5] mm: vmscan: Immediately reclaim end-of-LRU dirty pages when writeback completes
  2011-07-13 16:40   ` Johannes Weiner
@ 2011-07-13 17:15     ` Mel Gorman
  0 siblings, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2011-07-13 17:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Wed, Jul 13, 2011 at 06:40:40PM +0200, Johannes Weiner wrote:
> On Wed, Jul 13, 2011 at 03:31:26PM +0100, Mel Gorman wrote:
> > When direct reclaim encounters a dirty page, it gets recycled around
> > the LRU for another cycle. This patch marks the page PageReclaim using
> > deactivate_page() so that the page gets reclaimed almost immediately
> > after the page gets cleaned. This is to avoid reclaiming clean pages
> > that are younger than a dirty page encountered at the end of the LRU
> > that might have been something like a use-once page.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  include/linux/mmzone.h |    2 +-
> >  mm/vmscan.c            |   10 ++++++++--
> >  mm/vmstat.c            |    2 +-
> >  3 files changed, 10 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index c4508a2..bea7858 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -100,7 +100,7 @@ enum zone_stat_item {
> >  	NR_UNSTABLE_NFS,	/* NFS unstable pages */
> >  	NR_BOUNCE,
> >  	NR_VMSCAN_WRITE,
> > -	NR_VMSCAN_WRITE_SKIP,
> > +	NR_VMSCAN_INVALIDATE,
> >  	NR_VMSCAN_THROTTLED,
> >  	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
> >  	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 9826086..8e00aee 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -834,8 +834,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  			 */
> >  			if (page_is_file_cache(page) &&
> >  					(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
> > -				inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > -				goto keep_locked;
> > +				inc_zone_page_state(page, NR_VMSCAN_INVALIDATE);
> > +
> > +				/* Immediately reclaim when written back */
> > +				unlock_page(page);
> > +				deactivate_page(page);
> > +
> > +				goto keep_dirty;
> >  			}
> >  
> >  			if (references == PAGEREF_RECLAIM_CLEAN)
> > @@ -956,6 +961,7 @@ keep:
> >  		reset_reclaim_mode(sc);
> >  keep_lumpy:
> >  		list_add(&page->lru, &ret_pages);
> > +keep_dirty:
> >  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> >  	}
> 
> I really like the idea behind this patch, but I think all those pages
> are lost as PageLRU is cleared on isolation and lru_deactivate_fn
> bails on them in turn.
> 
> If I'm not mistaken, the reference from the isolation is also leaked.

I think you're right. This patch was rushed and not thought through
properly. The surprise is that it appeared to work at all. Will rework
it. Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/5] mm: writeback: Prioritise dirty inodes encountered by direct reclaim for background flushing
  2011-07-13 14:31 ` [PATCH 5/5] mm: writeback: Prioritise dirty inodes encountered by direct reclaim for background flushing Mel Gorman
@ 2011-07-13 21:39   ` Jan Kara
  2011-07-14  0:09     ` Dave Chinner
  2011-07-14  7:03     ` Mel Gorman
  2011-07-13 23:56   ` Dave Chinner
  2011-07-14 15:09   ` Christoph Hellwig
  2 siblings, 2 replies; 38+ messages in thread
From: Jan Kara @ 2011-07-13 21:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
	Minchan Kim

On Wed 13-07-11 15:31:27, Mel Gorman wrote:
> It is preferable that no dirty pages are dispatched from the page
> reclaim path. If reclaim is encountering dirty pages, it implies that
> either reclaim is getting ahead of writeback or use-once logic has
> prioritised pages for reclaiming that are young relative to when the
> inode was dirtied.
> 
> When dirty pages are encountered on the LRU, this patch marks the inodes
> I_DIRTY_RECLAIM and wakes the background flusher. When the background
> flusher runs, it moves such inodes immediately to the dispatch queue
> regardless of inode age. There is no guarantee that the pages reclaim
> cares about will be cleaned first but the expectation is that the
> flusher threads will clean the pages more quickly than if reclaim
> tried to clean a single page.
  Hmm, I was looking through your numbers but I didn't see any significant
difference this patch would make. Do you?

I was thinking about the problem and actually doing IO from kswapd would be
less of a problem if we submitted more than just a single page. Just to give
you an idea - time to write a single page on a plain SATA drive might be like
4 ms. Time to write a sequential 4 MB of data is like 80 ms (I just made up
these numbers but the orders should be right). So to write 1000 times more
data you just need like 20 times longer. That's a factor of 50 in IO
efficiency. So when reclaim/kswapd submits a single page IO once every
couple of milliseconds, your IO throughput just went close to zero...
BTW: I just checked your numbers in fsmark test with vanilla kernel.  You
wrote like 14500 pages from reclaim in 567 seconds. That is about one page
per 39 ms. That is going to have a noticeable impact on IO throughput (not
with XFS because it plays tricks with writing more than asked but with ext2
or ext3 you would see it I guess).

So when kswapd sees high percentage of dirty pages at the end of LRU, it
could call something like fdatawrite_range() for the range of 4 MB
(provided the file is large enough) containing that page and IO throughput
would not be hit that much and you will get a reasonably bounded time when
the page gets cleaned... If you wanted to be clever, you could possibly be
more sophisticated in picking the file and range to write so that you get
rid of the most pages at the end of LRU but I'm not sure it's worth the CPU
cycles. Does this sound reasonable to you?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
  2011-07-13 14:31 ` [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim Mel Gorman
@ 2011-07-13 23:34   ` Dave Chinner
  2011-07-14  6:17     ` Mel Gorman
  2011-07-14  1:38   ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2011-07-13 23:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Wed, Jul 13, 2011 at 03:31:23PM +0100, Mel Gorman wrote:
> From: Mel Gorman <mel@csn.ul.ie>
> 
> When kswapd is failing to keep zones above the min watermark, a process
> will enter direct reclaim in the same manner kswapd does. If a dirty
> page is encountered during the scan, this page is written to backing
> storage using mapping->writepage.
> 
> This causes two problems. First, it can result in very deep call
> stacks, particularly if the target storage or filesystem are complex.
> Some filesystems ignore write requests from direct reclaim as a result.
> The second is that a single-page flush is inefficient in terms of IO.
> While there is an expectation that the elevator will merge requests,
> this does not always happen. Quoting Christoph Hellwig;
> 
> 	The elevator has a relatively small window it can operate on,
> 	and can never fix up a bad large scale writeback pattern.
> 
> This patch prevents direct reclaim writing back filesystem pages by
> checking if current is kswapd. Anonymous pages are still written to
> swap as there is no equivalent of a flusher thread for anonymous
> pages. If the dirty pages cannot be written back, they are placed
> back on the LRU lists.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Ok, so that makes the .writepage checks in ext4, xfs and btrfs for this
condition redundant. In effect the patch should be a no-op for those
filesystems. Can you also remove the checks in the filesystems?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/5] mm: vmscan: Do not writeback filesystem pages in kswapd except in high priority
  2011-07-13 14:31 ` [PATCH 2/5] mm: vmscan: Do not writeback filesystem pages in kswapd except in high priority Mel Gorman
@ 2011-07-13 23:37   ` Dave Chinner
  2011-07-14  6:29     ` Mel Gorman
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2011-07-13 23:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Wed, Jul 13, 2011 at 03:31:24PM +0100, Mel Gorman wrote:
> It is preferable that no dirty pages are dispatched for cleaning from
> the page reclaim path. At normal priorities, this patch prevents kswapd
> writing pages.
> 
> However, page reclaim does have a requirement that pages be freed
> in a particular zone. If it is failing to make sufficient progress
> (reclaiming < SWAP_CLUSTER_MAX at any priority), the priority
> is raised to scan more pages. A priority of DEF_PRIORITY - 3 is
> considered to be the point where kswapd is getting into trouble
> reclaiming pages. If this priority is reached, kswapd will dispatch
> pages for writing.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Seems reasonable, but btrfs still will ignore this writeback from
kswapd, and it doesn't fall over. Given that data point, I'd like to
see the results when you stop kswapd from doing writeback altogether
as well.

Can you try removing it altogether and seeing what that does to your
test results? i.e

			if (page_is_file_cache(page)) {
				inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
				goto keep_locked;
			}

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/5] mm: vmscan: Throttle reclaim if encountering too many dirty pages under writeback
  2011-07-13 14:31 ` [PATCH 3/5] mm: vmscan: Throttle reclaim if encountering too many dirty pages under writeback Mel Gorman
@ 2011-07-13 23:41   ` Dave Chinner
  2011-07-14  6:33     ` Mel Gorman
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2011-07-13 23:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Wed, Jul 13, 2011 at 03:31:25PM +0100, Mel Gorman wrote:
> Workloads that are allocating frequently and writing files place a
> large number of dirty pages on the LRU. With use-once logic, it is
> possible for them to reach the end of the LRU quickly requiring the
> reclaimer to scan more to find clean pages. Ordinarily, processes that
> are dirtying memory will get throttled by dirty balancing but this
> is a global heuristic and does not take into account that LRUs are
> maintained on a per-zone basis. This can lead to a situation whereby
> reclaim is scanning heavily, skipping over a large number of pages
> under writeback and recycling them around the LRU consuming CPU.
> 
> This patch checks how many of the pages isolated from the LRU
> were dirty. If a percentage of them are dirty, the process will be
> throttled if a blocking device is congested or the zone being scanned
> is marked congested. The percentage that must be dirty depends on
> the priority. At default priority, all of them must be dirty. At
> DEF_PRIORITY-1, 50% of them must be dirty, at DEF_PRIORITY-2, 25%,
> etc., i.e. as pressure increases, so does the likelihood that the
> process will get throttled to allow the flusher threads to make some progress.
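As a rough illustration of the heuristic quoted above, the dirty threshold halves with each priority step below DEF_PRIORITY. This is a minimal sketch with hypothetical names, not the patch's actual code, and it omits the congestion checks the patch also requires before throttling:

```c
#include <assert.h>

#define DEF_PRIORITY 12

/*
 * Illustrative sketch of the priority-scaled throttle check: at
 * DEF_PRIORITY all isolated pages must be dirty to trigger throttling,
 * at DEF_PRIORITY-1 half of them, at DEF_PRIORITY-2 a quarter, etc.
 * Function and parameter names here are hypothetical.
 */
static int reclaim_should_throttle(unsigned long nr_dirty,
                                   unsigned long nr_taken, int priority)
{
        /* Threshold halves with every priority drop below DEF_PRIORITY */
        return nr_dirty >= (nr_taken >> (DEF_PRIORITY - priority));
}
```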

It still doesn't take into account how many pages under writeback
were skipped. If there are lots of pages that are under writeback, I
think we still want to throttle to give IO a chance to complete and
clean those pages before scanning again....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 5/5] mm: writeback: Prioritise dirty inodes encountered by direct reclaim for background flushing
  2011-07-13 14:31 ` [PATCH 5/5] mm: writeback: Prioritise dirty inodes encountered by direct reclaim for background flushing Mel Gorman
  2011-07-13 21:39   ` Jan Kara
@ 2011-07-13 23:56   ` Dave Chinner
  2011-07-14  7:30     ` Mel Gorman
  2011-07-14 15:09   ` Christoph Hellwig
  2 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2011-07-13 23:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Wed, Jul 13, 2011 at 03:31:27PM +0100, Mel Gorman wrote:
> It is preferable that no dirty pages are dispatched from the page
> reclaim path. If reclaim is encountering dirty pages, it implies that
> either reclaim is getting ahead of writeback or use-once logic has
> prioritised pages for reclaiming that are young relative to when the
> inode was dirtied.
> 
> When dirty pages are encountered on the LRU, this patch marks the inodes
> I_DIRTY_RECLAIM and wakes the background flusher. When the background
> flusher runs, it moves such inodes immediately to the dispatch queue
> regardless of inode age. There is no guarantee that pages reclaim
> cares about will be cleaned first but the expectation is that the
> flusher threads will clean the page quicker than if reclaim tried to
> clean a single page.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  fs/fs-writeback.c         |   56 ++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/fs.h        |    5 ++-
>  include/linux/writeback.h |    1 +
>  mm/vmscan.c               |   16 ++++++++++++-
>  4 files changed, 74 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 0f015a0..1201052 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -257,9 +257,23 @@ static void move_expired_inodes(struct list_head *delaying_queue,
>  	LIST_HEAD(tmp);
>  	struct list_head *pos, *node;
>  	struct super_block *sb = NULL;
> -	struct inode *inode;
> +	struct inode *inode, *tinode;
>  	int do_sb_sort = 0;
>  
> +	/* Move inodes reclaim found at end of LRU to dispatch queue */
> +	list_for_each_entry_safe(inode, tinode, delaying_queue, i_wb_list) {
> +		/* Move any inode found at end of LRU to dispatch queue */
> +		if (inode->i_state & I_DIRTY_RECLAIM) {
> +			inode->i_state &= ~I_DIRTY_RECLAIM;
> +			list_move(&inode->i_wb_list, &tmp);
> +
> +			if (sb && sb != inode->i_sb)
> +				do_sb_sort = 1;
> +			sb = inode->i_sb;
> +		}
> +	}

This is not a good idea. move_expired_inodes() already sucks a large
amount of CPU when there are lots of dirty inodes on the list (think
hundreds of thousands), and that is when the traversal terminates at
*older_than_this. It's not uncommon in my testing to see this
one function consume 30-35% of the bdi-flusher thread CPU usage
in such conditions.

By adding an entire list traversal in addition to the aging
traversal, this is going to significantly increase the CPU overhead
of the function and hence could increase bdi->wb_list_lock
contention and decrease writeback throughput.

> +
> +	sb = NULL;
>  	while (!list_empty(delaying_queue)) {
>  		inode = wb_inode(delaying_queue->prev);
>  		if (older_than_this &&
> @@ -968,6 +982,46 @@ void wakeup_flusher_threads(long nr_pages)
>  	rcu_read_unlock();
>  }
>  
> +/*
> + * Similar to wakeup_flusher_threads except prioritise inodes contained
> + * in the page_list regardless of age
> + */
> +void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list)
> +{
> +	struct page *page;
> +	struct address_space *mapping;
> +	struct inode *inode;
> +
> +	list_for_each_entry(page, page_list, lru) {
> +		if (!PageDirty(page))
> +			continue;
> +
> +		if (PageSwapBacked(page))
> +			continue;
> +
> +		lock_page(page);
> +		mapping = page_mapping(page);
> +		if (!mapping)
> +			goto unlock;
> +
> +		/*
> +		 * Test outside the lock to see if it is already set. The
> +		 * inode should be pinned by the lock_page
> +		 */
> +		inode = page->mapping->host;
> +		if (inode->i_state & I_DIRTY_RECLAIM)
> +			goto unlock;
> +
> +		spin_lock(&inode->i_lock);
> +		inode->i_state |= I_DIRTY_RECLAIM;
> +		spin_unlock(&inode->i_lock);

Micro optimisations like this are unnecessary - the inode->i_lock is
not contended.

As it is, this code won't really work as you think it might.
There's no guarantee a dirty inode is on the dirty list - it might
already been expired, and it might even currently be under
writeback.  In that case, if it is still dirty it goes to the
b_more_io list and writeback bandwidth is shared between all the
other dirty inodes and completely ignores this flag...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 5/5] mm: writeback: Prioritise dirty inodes encountered by direct reclaim for background flushing
  2011-07-13 21:39   ` Jan Kara
@ 2011-07-14  0:09     ` Dave Chinner
  2011-07-14  7:03     ` Mel Gorman
  1 sibling, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2011-07-14  0:09 UTC (permalink / raw)
  To: Jan Kara
  Cc: Mel Gorman, Linux-MM, LKML, XFS, Christoph Hellwig,
	Johannes Weiner, Wu Fengguang, Rik van Riel, Minchan Kim

On Wed, Jul 13, 2011 at 11:39:47PM +0200, Jan Kara wrote:
> On Wed 13-07-11 15:31:27, Mel Gorman wrote:
> > It is preferable that no dirty pages are dispatched from the page
> > reclaim path. If reclaim is encountering dirty pages, it implies that
> > either reclaim is getting ahead of writeback or use-once logic has
> > prioritised pages for reclaiming that are young relative to when the
> > inode was dirtied.
> > 
> > When dirty pages are encountered on the LRU, this patch marks the inodes
> > I_DIRTY_RECLAIM and wakes the background flusher. When the background
> > flusher runs, it moves such inodes immediately to the dispatch queue
> > regardless of inode age. There is no guarantee that pages reclaim
> > cares about will be cleaned first but the expectation is that the
> > flusher threads will clean the page quicker than if reclaim tried to
> > clean a single page.
>   Hmm, I was looking through your numbers but I didn't see any significant
> difference this patch would make. Do you?
> 
> I was thinking about the problem and actually doing IO from kswapd would be
> a small problem if we submitted more than just a single page. Just to give
> you idea - time to write a single page on plain SATA drive might be like 4
> ms. Time to write sequential 4 MB of data is like 80 ms (I just made up
> these numbers but the orders should be right).

I'm not so concerned about single drives - the numbers look far worse
when you have a high throughput filesystem. For arguments sake, lets
call that 1GB/s (even though I know of plenty of 10+GB/s XFS
filesystems out there). That gives you 4ms for a 4k IO, and 4MB of
data in 4ms seek + 4ms data transfer time, for 8ms total IO time.

> So to write 1000 times more
> data you just need like 20 times longer. That's a factor of 50 in IO
> efficiency.

In the case I tend to care about, it's more like factor of 1000 in
IO efficiency - 3 orders of magnitude or greater difference in
performance.
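The back-of-the-envelope figures being traded here can be sanity-checked with a trivial model: one seek plus sequential transfer at the drive's rate. All numbers are illustrative assumptions (a SATA-class drive), not measurements, and the function name is hypothetical:

```c
/*
 * Rough IO-time model: a fixed seek cost plus sequential transfer at
 * the drive's sustained rate. Figures chosen to match the discussion,
 * not measured values.
 */
static double io_time_ms(double bytes, double seek_ms, double mb_per_sec)
{
        /* transfer time in ms = bytes / (bytes-per-second) * 1000 */
        return seek_ms + bytes / (mb_per_sec * 1024.0 * 1024.0) * 1000.0;
}
```

With a 4ms seek and 100MB/s, a single 4k page costs about 4ms while a 4MB chunk costs about 44ms: roughly 1000x the data for about 10x the time, which is the shape of Jan's factor-of-50 estimate, and the gap only widens on the high-throughput arrays Dave describes.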

> So when reclaim/kswapd submits a single page IO once every
> couple of miliseconds, your IO throughput just went close to zero...
> BTW: I just checked your numbers in fsmark test with vanilla kernel.  You
> wrote like 14500 pages from reclaim in 567 seconds. That is about one page
> per 39 ms. That is going to have noticeable impact on IO throughput (not
> with XFS because it plays tricks with writing more than asked but with ext2
> or ext3 you would see it I guess).
> 
> So when kswapd sees high percentage of dirty pages at the end of LRU, it
> could call something like fdatawrite_range() for the range of 4 MB
> (provided the file is large enough) containing that page and IO thoughput
> would not be hit that much and you will get reasonably bounded time when
> the page gets cleaned... If you wanted to be clever, you could possibly be
> more sophisticated in picking the file and range to write so that you get
> rid of the most pages at the end of LRU but I'm not sure it's worth the CPU
> cycles. Does this sound reasonable to you?

That's what Wu's patch did - it pushed it off to the bdi-flusher
because you can't call iput() in memory reclaim context and you need
a reference to the inode before calling fdatawrite_range().

As I mentioned for that patch, writing 4MB instead of a single page
will cause different problems - after just 25 dirty pages, we've
queued 100MB of IO and on a typical desktop system that will take at
least a second to complete. Now we get the opposite problem of IO
latency to clean a specific page and the potential to stall normal
background expired inode writeback forever if we keep hitting dirty
pages during page reclaim.
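The queue-depth arithmetic above is easy to check: if every dirty page reclaim hits triggers a 4MB flush, the backlog grows much faster than a typical desktop disk can drain it. A minimal sketch with illustrative figures (names and numbers are assumptions, not from the patches):

```c
/*
 * Seconds needed to drain the writeback queued if each dirty page
 * encountered by reclaim triggers a chunk_mb-sized flush.
 * Illustrative model only.
 */
static double drain_seconds(unsigned long dirty_pages, double chunk_mb,
                            double disk_mb_per_sec)
{
        return dirty_pages * chunk_mb / disk_mb_per_sec;
}
```

25 dirty pages at 4MB each is 100MB queued, i.e. about a second of IO at ~100MB/s, matching the "at least a second" estimate above.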

It's just yet another reason I'd really like to see numbers showing
that not doing IO from memory reclaim causes problems in the cases
where it is said to be needed (like reclaiming memory from a
specific node) and that issuing IO is the -only- solution. If numbers
can't be produced showing that we *need* to do IO from memory
reclaim, then why jump through hoops like we currently are trying to
fix all the nasty corner cases?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC PATCH 0/5] Reduce filesystem writeback from page reclaim (again)
  2011-07-13 14:31 [RFC PATCH 0/5] Reduce filesystem writeback from page reclaim (again) Mel Gorman
                   ` (5 preceding siblings ...)
  2011-07-13 15:31 ` [RFC PATCH 0/5] Reduce filesystem writeback from page reclaim (again) Mel Gorman
@ 2011-07-14  0:33 ` Dave Chinner
  2011-07-14  4:51   ` Christoph Hellwig
  2011-07-14  7:37   ` Mel Gorman
  6 siblings, 2 replies; 38+ messages in thread
From: Dave Chinner @ 2011-07-14  0:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Wed, Jul 13, 2011 at 03:31:22PM +0100, Mel Gorman wrote:
> (Revisting this from a year ago and following on from the thread
> "Re: [PATCH 03/27] xfs: use write_cache_pages for writeback
> clustering". Posting a prototype to see if anything obvious is
> being missed)

Hi Mel,

Thanks for picking this up again. The results are definitely
promising, but I'd like to see a comparison against simply not doing
IO from memory reclaim at all combined with the enhancements in this
patchset. After all, that's what I keep asking for (so we can get
rid of .writepage altogether), and if the numbers don't add up, then
I'll shut up about it. ;)

.....

> use-once LRU logic). The command line for fs_mark looked something like
> 
> ./fs_mark  -d  /tmp/fsmark-2676  -D  100  -N  150  -n  150  -L  25  -t  1  -S0  -s  10485760
> 
> The machine was booted with "nr_cpus=1 mem=512M" as according to Dave
> this triggers the worst behaviour.
....
> During testing, a number of monitors were running to gather information
> from ftrace in particular. This disrupts the results of course because
> recording the information generates IO in itself but I'm ignoring
> that for the moment so the effect of the patches can be seen.
> 
> I've posted the raw reports for each filesystem at
> 
> http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-ext3/sandy/comparison.html
> http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-ext4/sandy/comparison.html
> http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-btrfs/sandy/comparison.html
> http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-xfs/sandy/comparison.html
.....
> Average files per second is increased by a nice percentage, albeit
> just within the standard deviation. Considering the type of test this
> is, variability was inevitable, but I will double check without monitoring.
> 
> The overhead (time spent in non-filesystem-related activities) is
> reduced a *lot* and is a lot less variable.

Given that userspace is doing the same amount of work in all test
runs, that implies that the userspace process is retaining its
working set hot in the cache over syscalls with this patchset.

> Direct reclaim work is significantly reduced going from 37% of all
> pages scanned to 1% with all patches applied. This implies that
> processes are getting stalled less.

And that directly implicates page scanning during direct reclaim as
the prime contributor to turfing the application's working set out
of the CPU cache....

> Page writes by reclaim is what is motivating this series. It goes
> from 14511 pages to 4084 which is a big improvement. We'll see later
> if these were anonymous or file-backed pages.

Which were anon pages, so this is a major improvement. However,
given that there were no dirty pages written directly by memory
reclaim, perhaps we don't need to do IO at all from here and
throttling is all that is needed?  ;)

> Direct reclaim writes were never a problem according to this.

That's true, but we disable direct reclaim for other reasons, namely
that writeback from direct reclaim blows the stack.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
  2011-07-13 14:31 ` [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim Mel Gorman
  2011-07-13 23:34   ` Dave Chinner
@ 2011-07-14  1:38   ` KAMEZAWA Hiroyuki
  2011-07-14  4:46     ` Christoph Hellwig
  2011-07-14  6:19     ` Mel Gorman
  1 sibling, 2 replies; 38+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-14  1:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
	Minchan Kim

On Wed, 13 Jul 2011 15:31:23 +0100
Mel Gorman <mgorman@suse.de> wrote:

> From: Mel Gorman <mel@csn.ul.ie>
> 
> When kswapd is failing to keep zones above the min watermark, a process
> will enter direct reclaim in the same manner kswapd does. If a dirty
> page is encountered during the scan, this page is written to backing
> storage using mapping->writepage.
> 
> This causes two problems. First, it can result in very deep call
> stacks, particularly if the target storage or filesystem are complex.
> Some filesystems ignore write requests from direct reclaim as a result.
> The second is that a single-page flush is inefficient in terms of IO.
> While there is an expectation that the elevator will merge requests,
> this does not always happen. Quoting Christoph Hellwig;
> 
> 	The elevator has a relatively small window it can operate on,
> 	and can never fix up a bad large scale writeback pattern.
> 
> This patch prevents direct reclaim writing back filesystem pages by
> checking if current is kswapd. Anonymous pages are still written to
> swap as there is no equivalent of a flusher thread for anonymous
> pages. If the dirty pages cannot be written back, they are placed
> back on the LRU lists.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Hm.


> ---
>  include/linux/mmzone.h |    1 +
>  mm/vmscan.c            |    9 +++++++++
>  mm/vmstat.c            |    1 +
>  3 files changed, 11 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 9f7c3eb..b70a0c0 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -100,6 +100,7 @@ enum zone_stat_item {
>  	NR_UNSTABLE_NFS,	/* NFS unstable pages */
>  	NR_BOUNCE,
>  	NR_VMSCAN_WRITE,
> +	NR_VMSCAN_WRITE_SKIP,
>  	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
>  	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
>  	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4f49535..2d3e5b6 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -825,6 +825,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		if (PageDirty(page)) {
>  			nr_dirty++;
>  
> +			/*
> +			 * Only kswapd can writeback filesystem pages to
> +			 * avoid risk of stack overflow
> +			 */
> +			if (page_is_file_cache(page) && !current_is_kswapd()) {
> +				inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> +				goto keep_locked;
> +			}
> +


This will cause tons of memcg OOM kill because we have no help of kswapd (now).

Could you make this

	if (scanning_global_lru(sc) && page_is_file_cache(page) && !current_is_kswapd())
...


Then...sorry, please keep file system hook for a while. I'll do memcg dirty_ratio work
by myself if Greg will not post new version until the next month. After that, we can
remove scanning_global_lru(sc), I think.

Thanks,
-Kame



* Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
  2011-07-14  4:46     ` Christoph Hellwig
@ 2011-07-14  4:46       ` KAMEZAWA Hiroyuki
  2011-07-14 15:07         ` Christoph Hellwig
  2011-07-15  2:22         ` Dave Chinner
  0 siblings, 2 replies; 38+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-14  4:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mel Gorman, Linux-MM, LKML, XFS, Dave Chinner, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Thu, 14 Jul 2011 00:46:43 -0400
Christoph Hellwig <hch@infradead.org> wrote:

> On Thu, Jul 14, 2011 at 10:38:01AM +0900, KAMEZAWA Hiroyuki wrote:
> > > +			/*
> > > +			 * Only kswapd can writeback filesystem pages to
> > > +			 * avoid risk of stack overflow
> > > +			 */
> > > +			if (page_is_file_cache(page) && !current_is_kswapd()) {
> > > +				inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > > +				goto keep_locked;
> > > +			}
> > > +
> > 
> > 
> > This will cause tons of memcg OOM kill because we have no help of kswapd (now).
> 
> XFS and btrfs already disable writeback from memcg context, as does ext4
> for the typical non-overwrite workloads, and none has fallen apart.
> 
> In fact there's no way we can enable them as the memcg calling contexts
> tend to have massive stack usage.
> 

Hmm, do XFS/btrfs add pages to the radix-tree with a deep stack?

Thanks,
-Kame



* Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
  2011-07-14  1:38   ` KAMEZAWA Hiroyuki
@ 2011-07-14  4:46     ` Christoph Hellwig
  2011-07-14  4:46       ` KAMEZAWA Hiroyuki
  2011-07-14  6:19     ` Mel Gorman
  1 sibling, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2011-07-14  4:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
	Minchan Kim

On Thu, Jul 14, 2011 at 10:38:01AM +0900, KAMEZAWA Hiroyuki wrote:
> > +			/*
> > +			 * Only kswapd can writeback filesystem pages to
> > +			 * avoid risk of stack overflow
> > +			 */
> > +			if (page_is_file_cache(page) && !current_is_kswapd()) {
> > +				inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > +				goto keep_locked;
> > +			}
> > +
> 
> 
> This will cause tons of memcg OOM kill because we have no help of kswapd (now).

XFS and btrfs already disable writeback from memcg context, as does ext4
for the typical non-overwrite workloads, and none has fallen apart.

In fact there's no way we can enable them as the memcg calling contexts
tend to have massive stack usage.



* Re: [RFC PATCH 0/5] Reduce filesystem writeback from page reclaim (again)
  2011-07-14  0:33 ` Dave Chinner
@ 2011-07-14  4:51   ` Christoph Hellwig
  2011-07-14  7:37   ` Mel Gorman
  1 sibling, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2011-07-14  4:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Mel Gorman, Linux-MM, LKML, XFS, Christoph Hellwig,
	Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
	Minchan Kim

On Thu, Jul 14, 2011 at 10:33:40AM +1000, Dave Chinner wrote:
> patchset. After all, that's what I keep asking for (so we can get
> rid of .writepage altogether), and if the numbers don't add up, then
> I'll shut up about it. ;)

Unfortunately there are a few more users of ->writepage in addition to
memory reclaim.  The most visible one is page migration, but there's also
a write_one_page helper used by a few filesystems that would either
need to get a writepage-like callback or a bigger rewrite.

I agree that killing off ->writepage would be a worthwhile goal, though.



* Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
  2011-07-14  6:19     ` Mel Gorman
@ 2011-07-14  6:17       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 38+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-14  6:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
	Minchan Kim

On Thu, 14 Jul 2011 07:19:15 +0100
Mel Gorman <mgorman@suse.de> wrote:

> On Thu, Jul 14, 2011 at 10:38:01AM +0900, KAMEZAWA Hiroyuki wrote:
> > > @@ -825,6 +825,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > >  		if (PageDirty(page)) {
> > >  			nr_dirty++;
> > >  
> > > +			/*
> > > +			 * Only kswapd can writeback filesystem pages to
> > > +			 * avoid risk of stack overflow
> > > +			 */
> > > +			if (page_is_file_cache(page) && !current_is_kswapd()) {
> > > +				inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > > +				goto keep_locked;
> > > +			}
> > > +
> > 
> > 
> > This will cause tons of memcg OOM kill because we have no help of kswapd (now).
> > 
> > Could you make this
> > 
> > 	if (scanning_global_lru(sc) && page_is_file_cache(page) && !current_is_kswapd())
> > ...
> > 
> 
> I can, but as Christoph points out, the request is already being
> ignored so how will it help?
> 

Hmm, ok, please go as you do now. I'll hurry up and implement dirty_ratio myself
without waiting for the original patch writer. I think his latest version was really
close to being merged... I'll start rebasing it onto mmotm at the end of this month.

Thanks,
-Kame



* Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
  2011-07-13 23:34   ` Dave Chinner
@ 2011-07-14  6:17     ` Mel Gorman
  0 siblings, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2011-07-14  6:17 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Thu, Jul 14, 2011 at 09:34:49AM +1000, Dave Chinner wrote:
> On Wed, Jul 13, 2011 at 03:31:23PM +0100, Mel Gorman wrote:
> > From: Mel Gorman <mel@csn.ul.ie>
> > 
> > When kswapd is failing to keep zones above the min watermark, a process
> > will enter direct reclaim in the same manner kswapd does. If a dirty
> > page is encountered during the scan, this page is written to backing
> > storage using mapping->writepage.
> > 
> > This causes two problems. First, it can result in very deep call
> > stacks, particularly if the target storage or filesystem are complex.
> > Some filesystems ignore write requests from direct reclaim as a result.
> > The second is that a single-page flush is inefficient in terms of IO.
> > While there is an expectation that the elevator will merge requests,
> > this does not always happen. Quoting Christoph Hellwig;
> > 
> > 	The elevator has a relatively small window it can operate on,
> > 	and can never fix up a bad large scale writeback pattern.
> > 
> > This patch prevents direct reclaim writing back filesystem pages by
> > checking if current is kswapd. Anonymous pages are still written to
> > swap as there is no equivalent of a flusher thread for anonymous
> > pages. If the dirty pages cannot be written back, they are placed
> > back on the LRU lists.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> 
> Ok, so that makes the .writepage checks in ext4, xfs and btrfs for this
> condition redundant. In effect the patch should be a no-op for those
> filesystems. Can you also remove the checks in the filesystems?
> 

I'll convert them to warnings just in case it regresses due to an
oversight.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
  2011-07-14  1:38   ` KAMEZAWA Hiroyuki
  2011-07-14  4:46     ` Christoph Hellwig
@ 2011-07-14  6:19     ` Mel Gorman
  2011-07-14  6:17       ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2011-07-14  6:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
	Minchan Kim

On Thu, Jul 14, 2011 at 10:38:01AM +0900, KAMEZAWA Hiroyuki wrote:
> > @@ -825,6 +825,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  		if (PageDirty(page)) {
> >  			nr_dirty++;
> >  
> > +			/*
> > +			 * Only kswapd can writeback filesystem pages to
> > +			 * avoid risk of stack overflow
> > +			 */
> > +			if (page_is_file_cache(page) && !current_is_kswapd()) {
> > +				inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > +				goto keep_locked;
> > +			}
> > +
> 
> 
> This will cause tons of memcg OOM kill because we have no help of kswapd (now).
> 
> Could you make this
> 
> 	if (scanning_global_lru(sc) && page_is_file_cache(page) && !current_is_kswapd())
> ...
> 

I can, but as Christoph points out, the request is already being
ignored so how will it help?

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 2/5] mm: vmscan: Do not writeback filesystem pages in kswapd except in high priority
  2011-07-13 23:37   ` Dave Chinner
@ 2011-07-14  6:29     ` Mel Gorman
  2011-07-14 11:52       ` Dave Chinner
  0 siblings, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2011-07-14  6:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Thu, Jul 14, 2011 at 09:37:43AM +1000, Dave Chinner wrote:
> On Wed, Jul 13, 2011 at 03:31:24PM +0100, Mel Gorman wrote:
> > It is preferable that no dirty pages are dispatched for cleaning from
> > the page reclaim path. At normal priorities, this patch prevents kswapd
> > writing pages.
> > 
> > However, page reclaim does have a requirement that pages be freed
> > in a particular zone. If it is failing to make sufficient progress
> > (reclaiming < SWAP_CLUSTER_MAX at any priority), the priority
> > is raised to scan more pages. A priority of DEF_PRIORITY - 3 is
> > considered to be the point where kswapd is getting into trouble
> > reclaiming pages. If this priority is reached, kswapd will dispatch
> > pages for writing.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> 
> Seems reasonable, but btrfs still will ignore this writeback from
> kswapd, and it doesn't fall over.

At least there are no reports of it falling over :)

> Given that data point, I'd like to
> see the results when you stop kswapd from doing writeback altogether
> as well.
> 

The results for this test will be identical because the ftrace results
show that kswapd is already writing 0 filesystem pages.

Where it makes a difference is when the system is under enough
pressure that it is failing to reclaim any memory and is in danger
of prematurely triggering the OOM killer. Andrea outlined some of
the concerns before at http://lkml.org/lkml/2010/6/15/246

> Can you try removing it altogether and seeing what that does to your
> test results? i.e
> 
> 			if (page_is_file_cache(page)) {
> 				inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> 				goto keep_locked;
> 			}

It won't do anything, it'll still be writing 0 filesystem-backed pages.

Because of the possibility for the OOM killer triggering prematurely due
to the inability of kswapd to write pages, I'd prefer to separate such a
change by at least one release so that if there is an increase in OOM
reports, it'll be obvious what the culprit was.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 3/5] mm: vmscan: Throttle reclaim if encountering too many dirty pages under writeback
  2011-07-13 23:41   ` Dave Chinner
@ 2011-07-14  6:33     ` Mel Gorman
  0 siblings, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2011-07-14  6:33 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Thu, Jul 14, 2011 at 09:41:50AM +1000, Dave Chinner wrote:
> On Wed, Jul 13, 2011 at 03:31:25PM +0100, Mel Gorman wrote:
> > Workloads that are allocating frequently and writing files place a
> > large number of dirty pages on the LRU. With use-once logic, it is
> > possible for them to reach the end of the LRU quickly requiring the
> > reclaimer to scan more to find clean pages. Ordinarily, processes that
> > are dirtying memory will get throttled by dirty balancing but this
> > is a global heuristic and does not take into account that LRUs are
> > maintained on a per-zone basis. This can lead to a situation whereby
> > reclaim is scanning heavily, skipping over a large number of pages
> > under writeback and recycling them around the LRU consuming CPU.
> > 
> > This patch checks how many of the pages isolated from the LRU
> > were dirty. If a percentage of them are dirty, the process will be
> > throttled if a blocking device is congested or the zone being scanned
> > is marked congested. The percentage that must be dirty depends on
> > the priority. At default priority, all of them must be dirty. At
> > DEF_PRIORITY-1, 50% of them must be dirty, at DEF_PRIORITY-2, 25%,
> > etc., i.e. as pressure increases, so does the likelihood that the
> > process will get throttled to allow the flusher threads to make some progress.
> 
> It still doesn't take into account how many pages under writeback
> were skipped. If there are lots of pages that are under writeback, I
> think we still want to throttle to give IO a chance to complete and
> clean those pages before scanning again....
> 

An earlier revision did take them into account but in these tests at
least, 0 pages at the end of the LRU were PageWriteback. I expect this
to change when multiple processes and CPUs are in use but am ignoring
it for the moment.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 5/5] mm: writeback: Prioritise dirty inodes encountered by direct reclaim for background flushing
  2011-07-13 21:39   ` Jan Kara
  2011-07-14  0:09     ` Dave Chinner
@ 2011-07-14  7:03     ` Mel Gorman
  1 sibling, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2011-07-14  7:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Johannes Weiner, Wu Fengguang, Rik van Riel, Minchan Kim

On Wed, Jul 13, 2011 at 11:39:47PM +0200, Jan Kara wrote:
> On Wed 13-07-11 15:31:27, Mel Gorman wrote:
> > It is preferable that no dirty pages are dispatched from the page
> > reclaim path. If reclaim is encountering dirty pages, it implies that
> > either reclaim is getting ahead of writeback or use-once logic has
> > prioritised pages for reclaiming that are young relative to when the
> > inode was dirtied.
> > 
> > When dirty pages are encountered on the LRU, this patch marks the inodes
> > I_DIRTY_RECLAIM and wakes the background flusher. When the background
> > flusher runs, it moves such inodes immediately to the dispatch queue
> > regardless of inode age. There is no guarantee that pages reclaim
> > cares about will be cleaned first but the expectation is that the
> > flusher threads will clean the page quicker than if reclaim tried to
> > clean a single page.
>   Hmm, I was looking through your numbers but I didn't see any significant
> difference this patch would make. Do you?
> 

Marginal and well within noise. I'm very skeptical about the patch
but the VM needs some way of prioritising what pages are getting
written back so that pages in a particular zone can be cleaned.

> I was thinking about the problem and actually doing IO from kswapd would be
> a small problem if we submitted more than just a single page. Just to give
> you idea - time to write a single page on plain SATA drive might be like 4
> ms. Time to write sequential 4 MB of data is like 80 ms (I just made up
> these numbers but the orders should be right).

It's as good a number as any for argument's sake. It's not the
first time such a patch has done the rounds. The last one I did along
similar lines was http://lkml.org/lkml/2010/6/8/85 although I mucked
it up with respect to racing with iput.

Wu posted a patch that deferred the writing of ranges to a
flusher thread http://www.spinics.net/lists/xfs/msg05659.html
which Dave has already commented on at
http://www.spinics.net/lists/xfs/msg05665.html. The clustering size
could be easily fixed but the scalability problem he pointed out is
a far greater problem.

> So to write 1000 times more
> data you just need like 20 times longer. That's a factor of 50 in IO
> efficiency. So when reclaim/kswapd submits a single page IO once every
> couple of miliseconds, your IO throughput just went close to zero...
> BTW: I just checked your numbers in fsmark test with vanilla kernel.  You
> wrote like 14500 pages from reclaim in 567 seconds. That is about one page
> per 39 ms. That is going to have noticeable impact on IO throughput (not
> with XFS because it plays tricks with writing more than asked but with ext2
> or ext3 you would see it I guess).
> 
> So when kswapd sees high percentage of dirty pages at the end of LRU, it
> could call something like fdatawrite_range() for the range of 4 MB
> (provided the file is large enough) containing that page and IO throughput
> would not be hit that much and you will get reasonably bounded time when
> the page gets cleaned... If you wanted to be clever, you could possibly be
> more sophisticated in picking the file and range to write so that you get
> rid of the most pages at the end of LRU but I'm not sure it's worth the CPU
> cycles. Does this sound reasonable to you?
> 

Semi-reasonable and it's along the same lines as what
http://lkml.org/lkml/2010/6/8/85 tried to achieve but maybe the effort
of fixing it up with respect to racing with iput() just isn't worth it.

I think I'll leave it as kswapd will call writepage if the priority is
high enough until a good solution for how the VM can tell the flusher to
prioritise a particular page is devised.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/5] mm: writeback: Prioritise dirty inodes encountered by direct reclaim for background flushing
  2011-07-13 23:56   ` Dave Chinner
@ 2011-07-14  7:30     ` Mel Gorman
  0 siblings, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2011-07-14  7:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Thu, Jul 14, 2011 at 09:56:06AM +1000, Dave Chinner wrote:
> On Wed, Jul 13, 2011 at 03:31:27PM +0100, Mel Gorman wrote:
> > It is preferable that no dirty pages are dispatched from the page
> > reclaim path. If reclaim is encountering dirty pages, it implies that
> > either reclaim is getting ahead of writeback or use-once logic has
> > prioritised pages for reclaiming that are young relative to when the
> > inode was dirtied.
> > 
> > When dirty pages are encountered on the LRU, this patch marks the inodes
> > I_DIRTY_RECLAIM and wakes the background flusher. When the background
> > flusher runs, it moves such inodes immediately to the dispatch queue
> > regardless of inode age. There is no guarantee that pages reclaim
> > cares about will be cleaned first but the expectation is that the
> > flusher threads will clean the page quicker than if reclaim tried to
> > clean a single page.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  fs/fs-writeback.c         |   56 ++++++++++++++++++++++++++++++++++++++++++++-
> >  include/linux/fs.h        |    5 ++-
> >  include/linux/writeback.h |    1 +
> >  mm/vmscan.c               |   16 ++++++++++++-
> >  4 files changed, 74 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 0f015a0..1201052 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -257,9 +257,23 @@ static void move_expired_inodes(struct list_head *delaying_queue,
> >  	LIST_HEAD(tmp);
> >  	struct list_head *pos, *node;
> >  	struct super_block *sb = NULL;
> > -	struct inode *inode;
> > +	struct inode *inode, *tinode;
> >  	int do_sb_sort = 0;
> >  
> > +	/* Move inodes reclaim found at end of LRU to dispatch queue */
> > +	list_for_each_entry_safe(inode, tinode, delaying_queue, i_wb_list) {
> > +		/* Move any inode found at end of LRU to dispatch queue */
> > +		if (inode->i_state & I_DIRTY_RECLAIM) {
> > +			inode->i_state &= ~I_DIRTY_RECLAIM;
> > +			list_move(&inode->i_wb_list, &tmp);
> > +
> > +			if (sb && sb != inode->i_sb)
> > +				do_sb_sort = 1;
> > +			sb = inode->i_sb;
> > +		}
> > +	}
> 
> This is not a good idea. move_expired_inodes() already sucks a large
> amount of CPU when there are lots of dirty inodes on the list (think
> hundreds of thousands), and that is when the traversal terminates at
> *older_than_this. It's not uncommon in my testing to see this
> one function consume 30-35% of the bdi-flusher thread CPU usage
> in such conditions.
> 

I thought this might be the case. I wasn't sure how bad it could be but
I mentioned in the leader it might be a problem. I'll consider other
ways that pages found at the end of the LRU could be prioritised for
writeback.

> > <SNIP>
> > +
> > +	sb = NULL;
> >  	while (!list_empty(delaying_queue)) {
> >  		inode = wb_inode(delaying_queue->prev);
> >  		if (older_than_this &&
> > @@ -968,6 +982,46 @@ void wakeup_flusher_threads(long nr_pages)
> >  	rcu_read_unlock();
> >  }
> >  
> > +/*
> > + * Similar to wakeup_flusher_threads except prioritise inodes contained
> > + * in the page_list regardless of age
> > + */
> > +void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list)
> > +{
> > +	struct page *page;
> > +	struct address_space *mapping;
> > +	struct inode *inode;
> > +
> > +	list_for_each_entry(page, page_list, lru) {
> > +		if (!PageDirty(page))
> > +			continue;
> > +
> > +		if (PageSwapBacked(page))
> > +			continue;
> > +
> > +		lock_page(page);
> > +		mapping = page_mapping(page);
> > +		if (!mapping)
> > +			goto unlock;
> > +
> > +		/*
> > +		 * Test outside the lock to see as if it is already set. Inode
> > +		 * should be pinned by the lock_page
> > +		 */
> > +		inode = page->mapping->host;
> > +		if (inode->i_state & I_DIRTY_RECLAIM)
> > +			goto unlock;
> > +
> > +		spin_lock(&inode->i_lock);
> > +		inode->i_state |= I_DIRTY_RECLAIM;
> > +		spin_unlock(&inode->i_lock);
> 
> Micro optimisations like this are unnecessary - the inode->i_lock is
> not contended.
> 

This patch was brought forward from a time when it would have been
taking the global inode_lock. I wasn't sure how badly inode->i_lock
was being contended and hadn't set up lock stats. Thanks for the
clarification.

> As it is, this code won't really work as you think it might.
> There's no guarantee a dirty inode is on the dirty list - it might have
> already been expired, and it might even currently be under
> writeback.  In that case, if it is still dirty it goes to the
> b_more_io list and writeback bandwidth is shared between all the
> other dirty inodes and completely ignores this flag...
> 

Ok, it's a total bust. If I revisit this at all, it'll either be in
the context of Wu's approach or calling fdatawrite_range, but it
might be pointless and overall it might just be better for now to
leave kswapd calling ->writepage if reclaim is failing and priority
is raised.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/5] Reduce filesystem writeback from page reclaim (again)
  2011-07-14  0:33 ` Dave Chinner
  2011-07-14  4:51   ` Christoph Hellwig
@ 2011-07-14  7:37   ` Mel Gorman
  1 sibling, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2011-07-14  7:37 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Thu, Jul 14, 2011 at 10:33:40AM +1000, Dave Chinner wrote:
> On Wed, Jul 13, 2011 at 03:31:22PM +0100, Mel Gorman wrote:
> > (Revisting this from a year ago and following on from the thread
> > "Re: [PATCH 03/27] xfs: use write_cache_pages for writeback
> > clustering". Posting a prototype to see if anything obvious is
> > being missed)
> 
> Hi Mel,
> 
> Thanks for picking this up again. The results are definitely
> promising, but I'd like to see a comparison against simply not doing
> IO from memory reclaim at all combined with the enhancements in this
> patchset.

Covered elsewhere. In these tests we are already writing 0 pages so it
won't make a difference and I'm wary of eliminating writes entirely
unless kswapd has a way of prioritising pages the flusher writes back
because of the risk of premature OOM kill.

> After all, that's what I keep asking for (so we can get
> rid of .writepage altogether), and if the numbers don't add up, then
> I'll shut up about it. ;)
> 

Christoph covered this.

> .....
> 
> > use-once LRU logic). The command line for fs_mark looked something like
> > 
> > ./fs_mark  -d  /tmp/fsmark-2676  -D  100  -N  150  -n  150  -L  25  -t  1  -S0  -s  10485760
> > 
> > The machine was booted with "nr_cpus=1 mem=512M" as according to Dave
> > this triggers the worst behaviour.
> ....
> > During testing, a number of monitors were running to gather information
> > from ftrace in particular. This disrupts the results of course because
> > recording the information generates IO in itself but I'm ignoring
> > that for the moment so the effect of the patches can be seen.
> > 
> > I've posted the raw reports for each filesystem at
> > 
> > http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-ext3/sandy/comparison.html
> > http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-ext4/sandy/comparison.html
> > http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-btrfs/sandy/comparison.html
> > http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-xfs/sandy/comparison.html
> .....
> > Average files per second is increased by a nice percentage, albeit
> > just within the standard deviation. Considering the type of test this
> > is, variability was inevitable but I will double check without monitoring.
> > 
> > The overhead (time spent in non-filesystem-related activities) is
> > reduced a *lot* and is a lot less variable.
> 
> Given that userspace is doing the same amount of work in all test
> runs, that implies that the userspace process is retaining it's
> working set hot in the cache over syscalls with this patchset.
> 

It's one possibility. The more likely one is that fs_mark's anonymous
pages are getting swapped out, leading to variability. If IO is less
seeky as a result of the change, the swap in/outs would be faster.

> > Direct reclaim work is significantly reduced going from 37% of all
> > pages scanned to 1% with all patches applied. This implies that
> > processes are getting stalled less.
> 
> And that directly implicates page scanning during direct reclaim as
> the prime contributor to turfing the application's working set out
> of the CPU cache....
> 

It's a possibility.

> > Page writes by reclaim is what is motivating this series. It goes
> > from 14511 pages to 4084 which is a big improvement. We'll see later
> > if these were anonymous or file-backed pages.
> 
> Which were anon pages, so this is a major improvement. However,
> given that there were no dirty pages written directly by memory
> reclaim, perhaps we don't need to do IO at all from here and
> throttling is all that is needed?  ;)
> 

I wouldn't bet my life on it due to the potential premature OOM kill
problem if we cannot reclaim pages at all :)

> > Direct reclaim writes were never a problem according to this.
> 
> That's true. But we disable direct reclaim for other reasons, namely
> that writeback from direct reclaim blows the stack.
> 

Correct. I should have been clearer and said direct reclaim wasn't
a problem in terms of queueing pages for IO.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/5] mm: vmscan: Do not writeback filesystem pages in kswapd except in high priority
  2011-07-14  6:29     ` Mel Gorman
@ 2011-07-14 11:52       ` Dave Chinner
  2011-07-14 13:17         ` Mel Gorman
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2011-07-14 11:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Thu, Jul 14, 2011 at 07:29:47AM +0100, Mel Gorman wrote:
> On Thu, Jul 14, 2011 at 09:37:43AM +1000, Dave Chinner wrote:
> > On Wed, Jul 13, 2011 at 03:31:24PM +0100, Mel Gorman wrote:
> > > It is preferable that no dirty pages are dispatched for cleaning from
> > > the page reclaim path. At normal priorities, this patch prevents kswapd
> > > writing pages.
> > > 
> > > However, page reclaim does have a requirement that pages be freed
> > > in a particular zone. If it is failing to make sufficient progress
> > > (reclaiming < SWAP_CLUSTER_MAX at any priority), the priority
> > > is raised to scan more pages. A priority of DEF_PRIORITY - 3 is
> > > considered to be the point where kswapd is getting into trouble
> > > reclaiming pages. If this priority is reached, kswapd will dispatch
> > > pages for writing.
> > > 
> > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > 
> > Seems reasonable, but btrfs still will ignore this writeback from
> > kswapd, and it doesn't fall over.
> 
> At least there are no reports of it falling over :)

However you want to spin it.

> > Given that data point, I'd like to
> > see the results when you stop kswapd from doing writeback altogether
> > as well.
> > 
> 
> The results for this test will be identical because the ftrace results
> show that kswapd is already writing 0 filesystem pages.

You mean these numbers:

Kswapd reclaim write file async I/O           4483       4286 0          1          0          0

Which shows that kswapd, under this workload has been improved to
the point that it doesn't need to do IO. Yes, you've addressed the
one problematic workload, but the numbers do not provide the answers
to the fundamental question that have been raised during
discussions. i.e. do we even need IO at all from reclaim?

> Where it makes a difference is when the system is under enough
> pressure that it is failing to reclaim any memory and is in danger
> of prematurely triggering the OOM killer. Andrea outlined some of
> the concerns before at http://lkml.org/lkml/2010/6/15/246

So put the system under more pressure such that with this patch
series memory reclaim still writes from kswapd. Can you even get it
to that stage, and if you can, does the system OOM more or less if
you don't do file IO from reclaim?

> > Can you try removing it altogether and seeing what that does to your
> > test results? i.e
> > 
> > 			if (page_is_file_cache(page)) {
> > 				inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > 				goto keep_locked;
> > 			}
> 
> It won't do anything, it'll still be writing 0 filesystem-backed pages.
> 
> Because of the possibility for the OOM killer triggering prematurely due
> to the inability of kswapd to write pages, I'd prefer to separate such a
> change by at least one release so that if there is an increase in OOM
> reports, it'll be obvious what was the culprit.

I'm not asking for release quality patches or even when such fixes
would roll out.

What you've shown here is that memory reclaim can be more efficient
without issuing IO itself under medium memory pressure. Now the
question is whether it can do so under heavy, sustained, near OOM
memory pressure?

IOWs, what I want to see is whether the fundamental principle of
IO-less reclaim can be validated as workable or struck down.  This
patchset demonstrates that IO-less reclaim is superior for a
workload that produces medium levels of sustained IO-based memory
pressure, which leads to the conclusion that the approach has merit
and needs further investigation.

It's that next step that I'm asking you to test now. What form
potential changes take or when they are released is irrelevant to me
at this point, because we still haven't determined if the
fundamental concept is completely sound or not. If the concept is
sound I'm quite happy to wait until the implementation is fully
baked before it gets rolled out....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/5] mm: vmscan: Do not writeback filesystem pages in kswapd except in high priority
  2011-07-14 11:52       ` Dave Chinner
@ 2011-07-14 13:17         ` Mel Gorman
  2011-07-15  3:12           ` Dave Chinner
  0 siblings, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2011-07-14 13:17 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Thu, Jul 14, 2011 at 09:52:21PM +1000, Dave Chinner wrote:
> On Thu, Jul 14, 2011 at 07:29:47AM +0100, Mel Gorman wrote:
> > On Thu, Jul 14, 2011 at 09:37:43AM +1000, Dave Chinner wrote:
> > > On Wed, Jul 13, 2011 at 03:31:24PM +0100, Mel Gorman wrote:
> > > > It is preferable that no dirty pages are dispatched for cleaning from
> > > > the page reclaim path. At normal priorities, this patch prevents kswapd
> > > > writing pages.
> > > > 
> > > > However, page reclaim does have a requirement that pages be freed
> > > > in a particular zone. If it is failing to make sufficient progress
> > > > (reclaiming < SWAP_CLUSTER_MAX at any priority), the priority
> > > > is raised to scan more pages. A priority of DEF_PRIORITY - 3 is
> > > > considered to be the point where kswapd is getting into trouble
> > > > reclaiming pages. If this priority is reached, kswapd will dispatch
> > > > pages for writing.
> > > > 
> > > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > 
> > > Seems reasonable, but btrfs still will ignore this writeback from
> > > kswapd, and it doesn't fall over.
> > 
> > At least there are no reports of it falling over :)
> 
> However you want to spin it.
> 

I regret that it is coming across as spin. My primary concern is
that if we get OOM-related bugs due to this series later, it'll
be difficult to pinpoint whether the whole series is at fault or whether
preventing kswapd from writing any pages was at fault.

> > > Given that data point, I'd like to
> > > see the results when you stop kswapd from doing writeback altogether
> > > as well.
> > > 
> > 
> > The results for this test will be identical because the ftrace results
> > show that kswapd is already writing 0 filesystem pages.
> 
> You mean these numbers:
> 
> Kswapd reclaim write file async I/O           4483       4286 0          1          0          0
> 
> Which shows that kswapd, under this workload, has been improved to
> the point that it doesn't need to do IO. Yes, you've addressed the
> one problematic workload, but the numbers do not provide the answers
> to the fundamental questions that have been raised during
> discussions, i.e. do we even need IO at all from reclaim?
> 

I don't know, and at best I will only be able to test with a single
disk, which is why I wanted to separate this series from completely
preventing kswapd from writing pages. I may be able to get access to
a machine with more disks but it'll take time.

> > Where it makes a difference is when the system is under enough
> > pressure that it is failing to reclaim any memory and is in danger
> > of prematurely triggering the OOM killer. Andrea outlined some of
> > the concerns before at http://lkml.org/lkml/2010/6/15/246
> 
> So put the system under more pressure such that with this patch
> series memory reclaim still writes from kswapd. Can you even get it
> to that stage, and if you can, does the system OOM more or less if
> you don't do file IO from reclaim?
> 

I can set up such a test; it'll be at least next week before I
configure it and get it queued. It'll probably take a few
days to run then because more iterations will be required to pinpoint
where the OOM threshold is. I know from the past that pushing a
system near OOM causes a non-deterministic number of triggers that
depend heavily on what was killed, so the only real choice is to start
light and increase the load until boom, which is time consuming.

Even then, the test will be inconclusive because it'll be just one
or two machines that I'll have to test on. There will be important
corner cases that I won't be able to test for.  For example;

  o small lowest zone that is critical for operation of some reason and
    the pages must be cleaned from there even though there is a large
    amount of memory overall

  o small highest zone causing high kswapd usage as it continually
    fails to balance due to pages being dirtied constantly, with the
    window between when the flushers clean a page and when kswapd
    reclaims it being too big. I might be able to simulate this one
    but bugs of this nature tend to be workload specific and affect
    some machines worse than others

  o Machines with many nodes and dirty pages spread semi-randomly
    on all nodes. If the flusher thread is not cleaning pages from
    a particular node that is under memory pressure due to affinity,
    processes will stall for long periods of time until the relevant
    inodes expire and get cleaned. This will be particularly
    problematic if zone_reclaim is enabled

Questions about scenarios like this are going to cause problems in
review because it's reasonable to ask if any of them can occur and
we can't give an iron-clad answer.

> > > Can you try removing it altogether and seeing what that does to your
> > > test results? i.e
> > > 
> > > 			if (page_is_file_cache(page)) {
> > > 				inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > > 				goto keep_locked;
> > > 			}
> > 
> > It won't do anything, it'll still be writing 0 filesystem-backed pages.
> > 
> > Because of the possibility for the OOM killer triggering prematurely due
> > to the inability of kswapd to write pages, I'd prefer to separate such a
> > change by at least one release so that if there is an increase in OOM
> > reports, it'll be obvious what was the culprit.
> 
> I'm not asking for release quality patches or even when such fixes
> would roll out.
> 

Very well. I was hoping to start with just this series and handle the
complete disabling of writing later but it can wait a few weeks too. It
was always a stretch that the next merge window was going to be hit.

> What you've shown here is that memory reclaim can be more efficient
> without issuing IO itself under medium memory pressure. Now the
> question is whether it can do so under heavy, sustained, near OOM
> memory pressure?
> 
> IOWs, what I want to see is whether the fundamental principle of
> IO-less reclaim can be validated as workable or struck down.  This
> patchset demonstrates that IO-less reclaim is superior for a
> workload that produces medium levels of sustained IO-based memory
> pressure, which leads to the conclusion that the approach has merit
> and needs further investigation.
> 
> It's that next step that I'm asking you to test now. What form
> potential changes take or when they are released is irrelevant to me
> at this point, because we still haven't determined if the
> fundamental concept is completely sound or not. If the concept is
> sound I'm quite happy to wait until the implementation is fully
> baked before it gets rolled out....
> 

I'll setup a suitable test next week then.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
  2011-07-14  4:46       ` KAMEZAWA Hiroyuki
@ 2011-07-14 15:07         ` Christoph Hellwig
  2011-07-14 23:55           ` KAMEZAWA Hiroyuki
  2011-07-15  2:22         ` Dave Chinner
  1 sibling, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2011-07-14 15:07 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Christoph Hellwig, Mel Gorman, Linux-MM, LKML, XFS, Dave Chinner,
	Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
	Minchan Kim

On Thu, Jul 14, 2011 at 01:46:34PM +0900, KAMEZAWA Hiroyuki wrote:
> > XFS and btrfs already disable writeback from memcg context, as does ext4
> > for the typical non-overwrite workloads, and none has fallen apart.
> > 
> > In fact there's no way we can enable them as the memcg calling contexts
> > tend to have massive stack usage.
> > 
> 
> Hmm, XFS/btrfs add pages to the radix-tree from a deep stack?

We're using a fairly deep stack in normal buffered read/write,
which is almost 100% common code.  It's not just the long callchain
(see below), but also that we put the unneeded kiocb and an array
of I/O vectors on the stack:

vfs_writev
do_readv_writev
do_sync_write
generic_file_aio_write
__generic_file_aio_write
generic_file_buffered_write
generic_perform_write
block_write_begin
grab_cache_page_write_begin
add_to_page_cache_lru
add_to_page_cache
add_to_page_cache_locked
mem_cgroup_cache_charge

this might additionally come from in-kernel callers like nfsd,
which uses even more stack space.  And at this point we only
enter the memcg/reclaim code, which, the last time I had a stack
trace, ate up about another 3k of stack space.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/5] mm: writeback: Prioritise dirty inodes encountered by direct reclaim for background flushing
  2011-07-13 14:31 ` [PATCH 5/5] mm: writeback: Prioritise dirty inodes encountered by direct reclaim for background flushing Mel Gorman
  2011-07-13 21:39   ` Jan Kara
  2011-07-13 23:56   ` Dave Chinner
@ 2011-07-14 15:09   ` Christoph Hellwig
  2011-07-14 15:49     ` Mel Gorman
  2 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2011-07-14 15:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
	Minchan Kim

On Wed, Jul 13, 2011 at 03:31:27PM +0100, Mel Gorman wrote:
> It is preferable that no dirty pages are dispatched from the page
> reclaim path. If reclaim is encountering dirty pages, it implies that
> either reclaim is getting ahead of writeback or use-once logic has
> prioritised pages for reclaiming that are young relative to when the
> inode was dirtied.

What does this buy us?  If anything, we should prioritize by zone,
e.g. tell write_cache_pages only to bother with writing things out
if the dirty page is in a given zone.   We'd probably still cluster
around it to make sure we get good I/O patterns, but would only start
I/O if it has a page we actually care about.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/5] mm: writeback: Prioritise dirty inodes encountered by direct reclaim for background flushing
  2011-07-14 15:09   ` Christoph Hellwig
@ 2011-07-14 15:49     ` Mel Gorman
  0 siblings, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2011-07-14 15:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Johannes Weiner, Wu Fengguang,
	Jan Kara, Rik van Riel, Minchan Kim

On Thu, Jul 14, 2011 at 11:09:59AM -0400, Christoph Hellwig wrote:
> On Wed, Jul 13, 2011 at 03:31:27PM +0100, Mel Gorman wrote:
> > It is preferable that no dirty pages are dispatched from the page
> > reclaim path. If reclaim is encountering dirty pages, it implies that
> > either reclaim is getting ahead of writeback or use-once logic has
> > prioritised pages for reclaiming that are young relative to when the
> > inode was dirtied.
> 
> What does this buy us?

Very little. The vague intention was to avoid a situation where
kswapd's priority was raised such that it had to write pages to clean
a particular zone.

> If anything, we should prioritize by zone,
> e.g. tell write_cache_pages only to bother with writing things out
> if the dirty page is in a given zone.   We'd probably still cluster
> around it to make sure we get good I/O patterns, but would only start
> I/O if it has a page we actually care about.
> 

That would make more sense. I've dropped this patch entirely.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
  2011-07-14 15:07         ` Christoph Hellwig
@ 2011-07-14 23:55           ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 38+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-14 23:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mel Gorman, Linux-MM, LKML, XFS, Dave Chinner, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Thu, 14 Jul 2011 11:07:00 -0400
Christoph Hellwig <hch@infradead.org> wrote:

> On Thu, Jul 14, 2011 at 01:46:34PM +0900, KAMEZAWA Hiroyuki wrote:
> > > XFS and btrfs already disable writeback from memcg context, as does ext4
> > > for the typical non-overwrite workloads, and none has fallen apart.
> > > 
> > > In fact there's no way we can enable them as the memcg calling contexts
> > > tend to have massive stack usage.
> > > 
> > 
> > Hmm, XFS/btrfs add pages to the radix-tree from a deep stack?
> 
> We're using a fairly deep stack in normal buffered read/write,
> which is almost 100% common code.  It's not just the long callchain
> (see below), but also that we put the unneeded kiocb and an array
> of I/O vectors on the stack:
> 
> vfs_writev
> do_readv_writev
> do_sync_write
> generic_file_aio_write
> __generic_file_aio_write
> generic_file_buffered_write
> generic_perform_write
> block_write_begin
> grab_cache_page_write_begin
> add_to_page_cache_lru
> add_to_page_cache
> add_to_page_cache_locked
> mem_cgroup_cache_charge
> 
> this might additionally come from in-kernel callers like nfsd,
> which uses even more stack space.  And at this point we only
> enter the memcg/reclaim code, which, the last time I had a stack
> trace, ate up about another 3k of stack space.
> 

Hmm. I'll prepare 2 functions for memcg:
  1. asynchronous memory reclaim, as kswapd does.
  2. dirty_ratio

Please remove ->writepage first. It may break memcg, but that happens
sometimes; we'll fix it.

Thanks,
-Kame


* Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
  2011-07-14  4:46       ` KAMEZAWA Hiroyuki
  2011-07-14 15:07         ` Christoph Hellwig
@ 2011-07-15  2:22         ` Dave Chinner
  2011-07-18  2:22           ` Dave Chinner
  1 sibling, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2011-07-15  2:22 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Christoph Hellwig, Mel Gorman, Linux-MM, LKML, XFS,
	Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
	Minchan Kim

On Thu, Jul 14, 2011 at 01:46:34PM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 14 Jul 2011 00:46:43 -0400
> Christoph Hellwig <hch@infradead.org> wrote:
> 
> > On Thu, Jul 14, 2011 at 10:38:01AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > +			/*
> > > > +			 * Only kswapd can writeback filesystem pages to
> > > > +			 * avoid risk of stack overflow
> > > > +			 */
> > > > +			if (page_is_file_cache(page) && !current_is_kswapd()) {
> > > > +				inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > > > +				goto keep_locked;
> > > > +			}
> > > > +
> > > 
> > > 
> > > This will cause tons of memcg OOM kill because we have no help of kswapd (now).
> > 
> > XFS and btrfs already disable writeback from memcg context, as does ext4
> > for the typical non-overwrite workloads, and none has fallen apart.
> > 
> > In fact there's no way we can enable them as the memcg calling contexts
> > tend to have massive stack usage.
> > 
> 
> Hmm, XFS/btrfs adds pages to radix-tree in deep stack ?

Here's an example writeback stack trace. Notice how deep it is from
the __writepage() call?

$ cat /sys/kernel/debug/tracing/stack_trace
        Depth    Size   Location    (50 entries)
        -----    ----   --------
  0)     5000      80   enqueue_task_fair+0x63/0x4f0
  1)     4920      48   enqueue_task+0x6a/0x80
  2)     4872      32   activate_task+0x2d/0x40
  3)     4840      32   ttwu_activate+0x21/0x50
  4)     4808      32   T.2130+0x3c/0x60
  5)     4776     112   try_to_wake_up+0x25e/0x2d0
  6)     4664      16   wake_up_process+0x15/0x20
  7)     4648      16   wake_up_worker+0x24/0x30
  8)     4632      16   insert_work+0x6f/0x80
  9)     4616      96   __queue_work+0xf9/0x3f0
 10)     4520      16   queue_work_on+0x25/0x40
 11)     4504      16   queue_work+0x1f/0x30
 12)     4488      16   queue_delayed_work+0x2d/0x40
 13)     4472      32   blk_run_queue_async+0x41/0x60
 14)     4440      64   queue_unplugged+0x8e/0xc0
 15)     4376     112   blk_flush_plug_list+0x1f5/0x240
 16)     4264     176   schedule+0x4c3/0x8b0
 17)     4088     128   schedule_timeout+0x1a5/0x280
 18)     3960     160   wait_for_common+0xdb/0x180
 19)     3800      16   wait_for_completion+0x1d/0x20
 20)     3784      48   xfs_buf_iowait+0x30/0xc0
 21)     3736      32   _xfs_buf_read+0x60/0x70
 22)     3704      48   xfs_buf_read+0xa2/0x100
 23)     3656      80   xfs_trans_read_buf+0x1ef/0x430
 24)     3576      96   xfs_btree_read_buf_block+0x5e/0xd0
 25)     3480      96   xfs_btree_lookup_get_block+0x83/0xf0
 26)     3384     176   xfs_btree_lookup+0xd7/0x490
 27)     3208      16   xfs_alloc_lookup_eq+0x19/0x20
 28)     3192     112   xfs_alloc_fixup_trees+0x2b5/0x350
 29)     3080     224   xfs_alloc_ag_vextent_near+0x631/0xb60
 30)     2856      32   xfs_alloc_ag_vextent+0xd5/0x100
 31)     2824      96   xfs_alloc_vextent+0x2a4/0x5f0
 32)     2728     256   xfs_bmap_btalloc+0x257/0x720
 33)     2472      16   xfs_bmap_alloc+0x21/0x40
 34)     2456     432   xfs_bmapi+0x9b7/0x1150
 35)     2024     192   xfs_iomap_write_allocate+0x17d/0x350
 36)     1832     144   xfs_map_blocks+0x1e2/0x270
 37)     1688     208   xfs_vm_writepage+0x19f/0x500
 38)     1480      32   __writepage+0x17/0x40
 39)     1448     304   write_cache_pages+0x21d/0x4d0
 40)     1144      96   generic_writepages+0x51/0x80
 41)     1048      48   xfs_vm_writepages+0x5d/0x80
 42)     1000      16   do_writepages+0x21/0x40
 43)      984      96   writeback_single_inode+0x10e/0x270
 44)      888      96   writeback_sb_inodes+0xdb/0x1b0
 45)      792     208   wb_writeback+0x1bf/0x420
 46)      584     160   wb_do_writeback+0x9f/0x270
 47)      424     144   bdi_writeback_thread+0xaa/0x270
 48)      280      96   kthread+0x96/0xa0
 49)      184     184   kernel_thread_helper+0x4/0x10

So from ->writepage, there is about 3.5k of stack usage here.  2.5k
of that is in XFS, and the worst I've seen is around 4k before
getting to the IO subsystem, which in the worst case I've seen
consumed 2.5k of stack. IOWs, I've seen stack usage from .writepage
down to IO take over 6k of stack space on x86_64....


Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 2/5] mm: vmscan: Do not writeback filesystem pages in kswapd except in high priority
  2011-07-14 13:17         ` Mel Gorman
@ 2011-07-15  3:12           ` Dave Chinner
  0 siblings, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2011-07-15  3:12 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Thu, Jul 14, 2011 at 02:17:45PM +0100, Mel Gorman wrote:
> On Thu, Jul 14, 2011 at 09:52:21PM +1000, Dave Chinner wrote:
> > On Thu, Jul 14, 2011 at 07:29:47AM +0100, Mel Gorman wrote:
> > > On Thu, Jul 14, 2011 at 09:37:43AM +1000, Dave Chinner wrote:
> > > > On Wed, Jul 13, 2011 at 03:31:24PM +0100, Mel Gorman wrote:
> > > > > It is preferable that no dirty pages are dispatched for cleaning from
> > > > > the page reclaim path. At normal priorities, this patch prevents kswapd
> > > > > writing pages.
> > > > > 
> > > > > However, page reclaim does have a requirement that pages be freed
> > > > > in a particular zone. If it is failing to make sufficient progress
> > > > > (reclaiming < SWAP_CLUSTER_MAX at any given priority), the priority
> > > > > is raised to scan more pages. A priority of DEF_PRIORITY - 3 is
> > > > > considered to be the point where kswapd is getting into trouble
> > > > > reclaiming pages. If this priority is reached, kswapd will dispatch
> > > > > pages for writing.
> > > > > 
> > > > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > > 
> > > > Seems reasonable, but btrfs still will ignore this writeback from
> > > > kswapd, and it doesn't fall over.
> > > 
> > > At least there are no reports of it falling over :)
> > 
> > However you want to spin it.
> 
> I regret that it is coming across as spin.

Shit, sorry, I didn't mean it that way. I forgot to add the smiley
at the end of that comment. It was meant in jest and not to be
derogatory - I do understand your concerns.

> > > > Given that data point, I'd like to
> > > > see the results when you stop kswapd from doing writeback altogether
> > > > as well.
> > > > 
> > > 
> > > The results for this test will be identical because the ftrace results
> > > show that kswapd is already writing 0 filesystem pages.
> > 
> > You mean these numbers:
> > 
> > Kswapd reclaim write file async I/O           4483       4286 0          1          0          0
> > 
> > Which shows that kswapd, under this workload has been improved to
> > the point that it doesn't need to do IO. Yes, you've addressed the
> > one problematic workload, but the numbers do not provide the answers
> > to the fundamental questions that have been raised during
> > discussions. i.e. do we even need IO at all from reclaim?
> 
> I don't know, and at best will only be able to test with a single
> disk, which is why I wanted to separate this series from completely
> preventing kswapd from writing pages. I may be able to get access to
> a machine with more disks but it'll take time.

That, to me, seems like a major problem, and explains why swapping
was affecting your results - you've got your test filesystem and
your swap partition on the same spindle. In the server admin world,
that's the first thing anyone concerned with performance avoids and
as such I tend to avoid doing that, too.

The lack of spindles/bandwidth used in testing the mm code is also
potentially another reason why XFS tends to show up mm problems.
That is, most testing and production use of XFS occurs on disk
subsystems much more bandwidth than a single spindle, and hence the
effects of bad IO show up much more obviously than for a single
spindle.

> > > Where it makes a difference is when the system is under enough
> > > pressure that it is failing to reclaim any memory and is in danger
> > > of prematurely triggering the OOM killer. Andrea outlined some of
> > > the concerns before at http://lkml.org/lkml/2010/6/15/246
> > 
> > So put the system under more pressure such that with this patch
> > series memory reclaim still writes from kswapd. Can you even get it
> > to that stage, and if you can, does the system OOM more or less if
> > you don't do file IO from reclaim?
> 
> I can set up such a test; it'll be at least next week before I
> configure such a test and get it queued. It'll probably take a few
> days to run then because more iterations will be required to pinpoint
> where the OOM threshold is.  I know from the past that pushing a
> system near OOM causes a non-deterministic number of triggers that
> depend heavily on what was killed so the only real choice is to start
> light and increase the load until boom, which is time-consuming.
> 
> Even then, the test will be inconclusive because it'll be just one
> or two machines that I'll have to test on.

Which is why I have a bunch of test VMs with different
CPU/RAM/platform configs.  I regularly use 1p/1GB x86-64, 1p/2GB
i686 (to stress highmem), 2p/2GB, 8p/4GB and 8p/16GB x86-64 VMs. I
have a bunch of different disk images for the VMs to work off,
located on storage from shared single SATA spindles to a 16TB volume
to a short-stroked, 1GB/s, 5kiops, 12 disk dm RAID-0 setup.

I mix and match the VMs with the disk images all the time - this is
one of the benefits of using a virtualised test environment. One
slightly beefy piece of hardware that costs $10k can be used to test
many, many different configurations. That's why I complain about
corner cases all the time ;)

> There will be important
> corner cases that I won't be able to test for.  For example;
> 
>   o small lowest zone that is critical for operation of some reason and
>     the pages must be cleaned from there even though there is a large
>     amount of memory overall

That's the i686 highmem case, using a large amount of memory (e.g.
4GB or more) to make sure that the highmem zone is much larger than
the lowmem zone. inode caching uses low memory, so directory
intensive operations on large sets of files (e.g. 10 million)
tend to stress low memory availability.

>   o small highest zone causing high kswapd usage as it fails to balance
>     continually due to pages being dirtied constantly and the window
>     between when flushers clean the page and kswapd reclaims the page
>     being too big. I might be able to simulate this one but bugs of
>     this nature tend to be workload specific and affect some machines
>     worse than others

And that is also testable with i686 highmem, but simply use smaller
amounts of ram (say 1.5GB). Use page cache pressure to fill and
dirty highmem, and inode cache pressure to fill lowmem.

Guess what one of my ad hoc tests for XFS shrinker balancing is.  :)

>   o Machines with many nodes and dirty pages spread semi-randomly
>     on all nodes. If the flusher thread is not cleaning pages from
>     a particular node that is under memory pressure due to affinity,
>     processes will stall for long periods of time until the relevant
>     inodes expire and gets cleaned. This will be particularly
>     problematic if zone_reclaim is enabled

And you can create large node-count virtual machines via the kvm
-numa option. I haven't been doing this as yet because getting stuff
working well on single node SMP needs to be done first.

So, like you, I really only have one or two test machines available
locally, but I've been creative in working around that
limitation.... :/

> > It's that next step that I'm asking you to test now. What form
> > potential changes take or when they are released is irrelevant to me
> > at this point, because we still haven't determined if the
> > fundamental concept is completely sound or not. If the concept is
> > sound I'm quite happy to wait until the implementation is fully
> > baked before it gets rolled out....
> 
> I'll setup a suitable test next week then.

Sounds great. Thanks Mel.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
  2011-07-15  2:22         ` Dave Chinner
@ 2011-07-18  2:22           ` Dave Chinner
  2011-07-18  3:06             ` Dave Chinner
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2011-07-18  2:22 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Christoph Hellwig, Mel Gorman, Linux-MM, LKML, XFS,
	Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
	Minchan Kim

On Fri, Jul 15, 2011 at 12:22:26PM +1000, Dave Chinner wrote:
> On Thu, Jul 14, 2011 at 01:46:34PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Thu, 14 Jul 2011 00:46:43 -0400
> > Christoph Hellwig <hch@infradead.org> wrote:
> > 
> > > On Thu, Jul 14, 2011 at 10:38:01AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > +			/*
> > > > > +			 * Only kswapd can writeback filesystem pages to
> > > > > +			 * avoid risk of stack overflow
> > > > > +			 */
> > > > > +			if (page_is_file_cache(page) && !current_is_kswapd()) {
> > > > > +				inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > > > > +				goto keep_locked;
> > > > > +			}
> > > > > +
> > > > 
> > > > 
> > > > This will cause tons of memcg OOM kill because we have no help of kswapd (now).
> > > 
> > > XFS and btrfs already disable writeback from memcg context, as does ext4
> > > for the typical non-overwrite workloads, and none has fallen apart.
> > > 
> > > In fact there's no way we can enable them as the memcg calling contexts
> > > tend to have massive stack usage.
> > > 
> > 
> > Hmm, XFS/btrfs adds pages to radix-tree in deep stack ?
> 
> Here's an example writeback stack trace. Notice how deep it is from
> the __writepage() call?
....
> 
> So from ->writepage, there is about 3.5k of stack usage here.  2.5k
> of that is in XFS, and the worst I've seen is around 4k before
> getting to the IO subsystem, which in the worst case I've seen
> consumed 2.5k of stack. IOWs, I've seen stack usage from .writepage
> down to IO take over 6k of stack space on x86_64....

BTW, here's a stack frame that indicates swap IO:

dave@test-4:~$ cat /sys/kernel/debug/tracing/stack_trace
        Depth    Size   Location    (46 entries)
        -----    ----   --------
  0)     5080      40   zone_statistics+0xad/0xc0
  1)     5040     272   get_page_from_freelist+0x2ad/0x7e0
  2)     4768     288   __alloc_pages_nodemask+0x133/0x7b0
  3)     4480      48   kmem_getpages+0x62/0x160
  4)     4432     112   cache_grow+0x2d1/0x300
  5)     4320      80   cache_alloc_refill+0x219/0x260
  6)     4240      64   kmem_cache_alloc+0x182/0x190
  7)     4176      16   mempool_alloc_slab+0x15/0x20
  8)     4160     144   mempool_alloc+0x63/0x140
  9)     4016      16   scsi_sg_alloc+0x4c/0x60
 10)     4000     112   __sg_alloc_table+0x66/0x140
 11)     3888      32   scsi_init_sgtable+0x33/0x90
 12)     3856      48   scsi_init_io+0x31/0xc0
 13)     3808      32   scsi_setup_fs_cmnd+0x79/0xe0
 14)     3776     112   sd_prep_fn+0x150/0xa90
 15)     3664      64   blk_peek_request+0xc7/0x230
 16)     3600      96   scsi_request_fn+0x68/0x500
 17)     3504      16   __blk_run_queue+0x1b/0x20
 18)     3488      96   __make_request+0x2cb/0x310
 19)     3392     192   generic_make_request+0x26d/0x500
 20)     3200      96   submit_bio+0x64/0xe0
 21)     3104      48   swap_writepage+0x83/0xd0
 22)     3056     112   pageout+0x122/0x2f0
 23)     2944     192   shrink_page_list+0x458/0x5f0
 24)     2752     192   shrink_inactive_list+0x1ec/0x410
 25)     2560     224   shrink_zone+0x468/0x500
 26)     2336     144   do_try_to_free_pages+0x2b7/0x3f0
 27)     2192     176   try_to_free_pages+0xa4/0x120
 28)     2016     288   __alloc_pages_nodemask+0x43f/0x7b0
 29)     1728      48   kmem_getpages+0x62/0x160
 30)     1680     128   fallback_alloc+0x192/0x240
 31)     1552      96   ____cache_alloc_node+0x9a/0x170
 32)     1456      16   __kmalloc+0x17d/0x200
 33)     1440     128   kmem_alloc+0x77/0xf0
 34)     1312     128   xfs_log_commit_cil+0x95/0x3d0
 35)     1184      96   _xfs_trans_commit+0x1e9/0x2a0
 36)     1088     208   xfs_create+0x57a/0x640
 37)      880      96   xfs_vn_mknod+0xa1/0x1b0
 38)      784      16   xfs_vn_create+0x10/0x20
 39)      768      64   vfs_create+0xb1/0xe0
 40)      704      96   do_last+0x5f5/0x770
 41)      608     144   path_openat+0xd5/0x400
 42)      464     224   do_filp_open+0x49/0xa0
 43)      240      96   do_sys_open+0x107/0x1e0
 44)      144      16   sys_open+0x20/0x30
 45)      128     128   system_call_fastpath+0x16/0x1b


That's pretty damn bad. From kmem_alloc to the top of the stack is
more than 3.5k through the direct reclaim swap IO path. That, to me,
kind of indicates that even doing swap IO on dirty anonymous pages
from direct reclaim risks overflowing the 8k stack on x86_64....

Umm, hold on a second, WTF is my standard
create-lots-of-zero-length-inodes-in-parallel test doing swapping?
Oh, shit, it's also running about 50% slower (50-60k files/s instead
of 110-120k files/s)....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
  2011-07-18  2:22           ` Dave Chinner
@ 2011-07-18  3:06             ` Dave Chinner
  0 siblings, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2011-07-18  3:06 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Christoph Hellwig, Mel Gorman, Linux-MM, LKML, XFS,
	Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
	Minchan Kim

On Mon, Jul 18, 2011 at 12:22:26PM +1000, Dave Chinner wrote:
> On Fri, Jul 15, 2011 at 12:22:26PM +1000, Dave Chinner wrote:
> > On Thu, Jul 14, 2011 at 01:46:34PM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Thu, 14 Jul 2011 00:46:43 -0400
> > > Christoph Hellwig <hch@infradead.org> wrote:
> > > 
> > > > On Thu, Jul 14, 2011 at 10:38:01AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > > +			/*
> > > > > > +			 * Only kswapd can writeback filesystem pages to
> > > > > > +			 * avoid risk of stack overflow
> > > > > > +			 */
> > > > > > +			if (page_is_file_cache(page) && !current_is_kswapd()) {
> > > > > > +				inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > > > > > +				goto keep_locked;
> > > > > > +			}
> > > > > > +
> > > > > 
> > > > > 
> > > > > This will cause tons of memcg OOM kill because we have no help of kswapd (now).
> > > > 
> > > > XFS and btrfs already disable writeback from memcg context, as does ext4
> > > > for the typical non-overwrite workloads, and none has fallen apart.
> > > > 
> > > > In fact there's no way we can enable them as the memcg calling contexts
> > > > tend to have massive stack usage.
> > > > 
> > > 
> > > Hmm, XFS/btrfs adds pages to radix-tree in deep stack ?
> > 
> > Here's an example writeback stack trace. Notice how deep it is from
> > the __writepage() call?
> ....
> > 
> > So from ->writepage, there is about 3.5k of stack usage here.  2.5k
> > of that is in XFS, and the worst I've seen is around 4k before
> > getting to the IO subsystem, which in the worst case I've seen
> > consumed 2.5k of stack. IOWs, I've seen stack usage from .writepage
> > down to IO take over 6k of stack space on x86_64....
> 
> BTW, here's a stack frame that indicates swap IO:
> 
> dave@test-4:~$ cat /sys/kernel/debug/tracing/stack_trace
>         Depth    Size   Location    (46 entries)
>         -----    ----   --------
>   0)     5080      40   zone_statistics+0xad/0xc0
>   1)     5040     272   get_page_from_freelist+0x2ad/0x7e0
>   2)     4768     288   __alloc_pages_nodemask+0x133/0x7b0
>   3)     4480      48   kmem_getpages+0x62/0x160
>   4)     4432     112   cache_grow+0x2d1/0x300
>   5)     4320      80   cache_alloc_refill+0x219/0x260
>   6)     4240      64   kmem_cache_alloc+0x182/0x190
>   7)     4176      16   mempool_alloc_slab+0x15/0x20
>   8)     4160     144   mempool_alloc+0x63/0x140
>   9)     4016      16   scsi_sg_alloc+0x4c/0x60
>  10)     4000     112   __sg_alloc_table+0x66/0x140
>  11)     3888      32   scsi_init_sgtable+0x33/0x90
>  12)     3856      48   scsi_init_io+0x31/0xc0
>  13)     3808      32   scsi_setup_fs_cmnd+0x79/0xe0
>  14)     3776     112   sd_prep_fn+0x150/0xa90
>  15)     3664      64   blk_peek_request+0xc7/0x230
>  16)     3600      96   scsi_request_fn+0x68/0x500
>  17)     3504      16   __blk_run_queue+0x1b/0x20
>  18)     3488      96   __make_request+0x2cb/0x310
>  19)     3392     192   generic_make_request+0x26d/0x500
>  20)     3200      96   submit_bio+0x64/0xe0
>  21)     3104      48   swap_writepage+0x83/0xd0
>  22)     3056     112   pageout+0x122/0x2f0
>  23)     2944     192   shrink_page_list+0x458/0x5f0
>  24)     2752     192   shrink_inactive_list+0x1ec/0x410
>  25)     2560     224   shrink_zone+0x468/0x500
>  26)     2336     144   do_try_to_free_pages+0x2b7/0x3f0
>  27)     2192     176   try_to_free_pages+0xa4/0x120
>  28)     2016     288   __alloc_pages_nodemask+0x43f/0x7b0
>  29)     1728      48   kmem_getpages+0x62/0x160
>  30)     1680     128   fallback_alloc+0x192/0x240
>  31)     1552      96   ____cache_alloc_node+0x9a/0x170
>  32)     1456      16   __kmalloc+0x17d/0x200
>  33)     1440     128   kmem_alloc+0x77/0xf0
>  34)     1312     128   xfs_log_commit_cil+0x95/0x3d0
>  35)     1184      96   _xfs_trans_commit+0x1e9/0x2a0
>  36)     1088     208   xfs_create+0x57a/0x640
>  37)      880      96   xfs_vn_mknod+0xa1/0x1b0
>  38)      784      16   xfs_vn_create+0x10/0x20
>  39)      768      64   vfs_create+0xb1/0xe0
>  40)      704      96   do_last+0x5f5/0x770
>  41)      608     144   path_openat+0xd5/0x400
>  42)      464     224   do_filp_open+0x49/0xa0
>  43)      240      96   do_sys_open+0x107/0x1e0
>  44)      144      16   sys_open+0x20/0x30
>  45)      128     128   system_call_fastpath+0x16/0x1b
> 
> 
> That's pretty damn bad. From kmem_alloc to the top of the stack is
> more than 3.5k through the direct reclaim swap IO path. That, to me,
> kind of indicates that even doing swap IO on dirty anonymous pages
> from direct reclaim risks overflowing the 8k stack on x86_64....
> 
> Umm, hold on a second, WTF is my standard
> create-lots-of-zero-length-inodes-in-parallel test doing swapping?
> Oh, shit, it's also running about 50% slower (50-60k files/s instead
> of 110-120k files/s)....

It's the memory demand caused by the stack tracer causing the
swapping, and the slowdown is just the overhead of the tracer.
2.6.38 doesn't swap very much at all, 2.6.39 swaps a bit more, and
3.0-rc7 is about the same....

IOWs the act of measuring stack usage causes the worst case stack
usage for that workload on 2.6.39 and 3.0-rc7.

Cheers,

Dave
-- 
Dave Chinner
david@fromorbit.com

end of thread, other threads:[~2011-07-18  3:06 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-13 14:31 [RFC PATCH 0/5] Reduce filesystem writeback from page reclaim (again) Mel Gorman
2011-07-13 14:31 ` [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim Mel Gorman
2011-07-13 23:34   ` Dave Chinner
2011-07-14  6:17     ` Mel Gorman
2011-07-14  1:38   ` KAMEZAWA Hiroyuki
2011-07-14  4:46     ` Christoph Hellwig
2011-07-14  4:46       ` KAMEZAWA Hiroyuki
2011-07-14 15:07         ` Christoph Hellwig
2011-07-14 23:55           ` KAMEZAWA Hiroyuki
2011-07-15  2:22         ` Dave Chinner
2011-07-18  2:22           ` Dave Chinner
2011-07-18  3:06             ` Dave Chinner
2011-07-14  6:19     ` Mel Gorman
2011-07-14  6:17       ` KAMEZAWA Hiroyuki
2011-07-13 14:31 ` [PATCH 2/5] mm: vmscan: Do not writeback filesystem pages in kswapd except in high priority Mel Gorman
2011-07-13 23:37   ` Dave Chinner
2011-07-14  6:29     ` Mel Gorman
2011-07-14 11:52       ` Dave Chinner
2011-07-14 13:17         ` Mel Gorman
2011-07-15  3:12           ` Dave Chinner
2011-07-13 14:31 ` [PATCH 3/5] mm: vmscan: Throttle reclaim if encountering too many dirty pages under writeback Mel Gorman
2011-07-13 23:41   ` Dave Chinner
2011-07-14  6:33     ` Mel Gorman
2011-07-13 14:31 ` [PATCH 4/5] mm: vmscan: Immediately reclaim end-of-LRU dirty pages when writeback completes Mel Gorman
2011-07-13 16:40   ` Johannes Weiner
2011-07-13 17:15     ` Mel Gorman
2011-07-13 14:31 ` [PATCH 5/5] mm: writeback: Prioritise dirty inodes encountered by direct reclaim for background flushing Mel Gorman
2011-07-13 21:39   ` Jan Kara
2011-07-14  0:09     ` Dave Chinner
2011-07-14  7:03     ` Mel Gorman
2011-07-13 23:56   ` Dave Chinner
2011-07-14  7:30     ` Mel Gorman
2011-07-14 15:09   ` Christoph Hellwig
2011-07-14 15:49     ` Mel Gorman
2011-07-13 15:31 ` [RFC PATCH 0/5] Reduce filesystem writeback from page reclaim (again) Mel Gorman
2011-07-14  0:33 ` Dave Chinner
2011-07-14  4:51   ` Christoph Hellwig
2011-07-14  7:37   ` Mel Gorman
