From: Andrew Morton <akpm@osdl.org>
To: Neil Brown <neilb@suse.de>
Cc: linux-kernel@vger.kernel.org
Subject: Re: [PATCH - RFC] allow setting vm_dirty below 1% for large memory machines
Date: Tue, 9 Jan 2007 02:10:17 -0800
Message-ID: <20070109021017.447b682d.akpm@osdl.org>
In-Reply-To: <17827.22798.625018.673326@notabene.brown>

On Tue, 9 Jan 2007 19:57:50 +1100
Neil Brown <neilb@suse.de> wrote:

> 
> Imagine a machine with lots of memory - say 100Gig.
> 
> Suppose there is one (largish) filesystem that is ext3 (or maybe
> reiser) with the default data=ordered.
> 
> Suppose this filesystem is being written to steadily so that the
> maximum amount of memory is always dirty.  With the default
> vm.dirty_ratio of 40%, this could be 40Gig.
> 
> When the journal triggers a commit, all the dirty data needs to be
> flushed out in order to adhere to the "data=ordered" semantics.
> This can take a while.
> 
> While this is happening, some small updates such as 'atime' update can
> block waiting for the journal to be unlocked again after the flush.

Actually, ext3 doesn't work that way.  The atime update will go into the
"running transaction", which is an instance of transaction_t that is separate
from the committing transaction.

But there are situations (i.e., journal free-space exhaustion) where things
can go synchronous.  They're more likely to occur during metadata storms
though, and perhaps indicate an undersized journal.

But yeah, overall point agreed with.

> Waiting for 40gig to flush for an atime update to complete is clearly
> unsatisfactory. 
> 
> We can reduce the amount of dirty memory by lowering vm.dirty_ratio,
> but setting it down to 1 still allows 1Gig of dirty data, which can
> cause unpleasant pauses (and this was on a kernel where '1' still meant
> something.  In current kernels, '5' is the effective minimum).
> 
> So this patch removes the minimum of '5' and introduces a new tunable
> 'vm.dirty_kb' which sets an upper limit in Kibibytes.

kibibytes?  We're feeding the kernel catfood now?

> This allows the amount of dirty memory to be limited to - say - 50M
> which should flush fast enough.
> 
> So: is this patch acceptable?  And should a lower default value for
> vm_dirty_kb be used?
> 
> 
> Some of the details in the above description might not be 100%
> accurate (I'm not sure of the exact connection between atime updates
> and journal commits).  The symptoms are:
>   While generating constant write traffic on a machine with > 20Gig
>   of RAM, performing assorted read-only operations can sometimes
>   produce a pause of tens of seconds.
>   The pause can be removed by:
>     - mounting noatime
>     - mounting data=writeback
>     - setting vm.dirty_kb to 1000 with this patch.

Could be IO scheduler borkage, could be ext3 borkage.  A well-timed sysrq-T
will tell us, and is worth doing (please).

Does increasing the journal size help?

> @@ -149,15 +154,21 @@ get_dirty_limits(long *pbackground, long
>  	if (dirty_ratio > unmapped_ratio / 2)
>  		dirty_ratio = unmapped_ratio / 2;
>  
> -	if (dirty_ratio < 5)
> -		dirty_ratio = 5;
> -
>  	background_ratio = dirty_background_ratio;
>  	if (background_ratio >= dirty_ratio)
>  		background_ratio = dirty_ratio / 2;
> +	if (dirty_background_ratio && !background_ratio)
> +		background_ratio = 1;
>  
> -	background = (background_ratio * available_memory) / 100;
>  	dirty = (dirty_ratio * available_memory) / 100;
> +	if (dirty > vm_dirty_kb / (PAGE_SIZE/1024))
> +		dirty = vm_dirty_kb / (PAGE_SIZE/1024);
> +	if (dirty_ratio == 0)
> +		background = 0;
> +	else if (background_ratio >= dirty_ratio)
> +		background = dirty / 2;
> +	else
> +		background = dirty * background_ratio / dirty_ratio;
>  	tsk = current;
>  	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
>  		background += background / 4;
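
For readers following the arithmetic in the hunk above, here is a minimal
userspace sketch of it. It is a model under stated assumptions, not the
kernel function: `model_dirty_limits` and `struct dirty_limits` are
hypothetical names, the unmapped_ratio clamp and the PF_LESS_THROTTLE
adjustment are left out, and the page size is assumed to be at least 1024
bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model only: these names are hypothetical, not kernel symbols. */
struct dirty_limits {
	uint64_t background;	/* background writeback threshold, in pages */
	uint64_t dirty;		/* hard dirty limit, in pages */
};

static struct dirty_limits
model_dirty_limits(uint64_t available_pages, uint64_t page_size,
		   unsigned int dirty_ratio,
		   unsigned int dirty_background_ratio,
		   uint64_t dirty_kb)
{
	struct dirty_limits lim;
	unsigned int background_ratio = dirty_background_ratio;
	uint64_t cap_pages;

	assert(page_size >= 1024);	/* assumption of this sketch */
	cap_pages = dirty_kb / (page_size / 1024);	/* KiB -> pages */

	if (background_ratio >= dirty_ratio)
		background_ratio = dirty_ratio / 2;
	if (dirty_background_ratio && !background_ratio)
		background_ratio = 1;

	/* Percentage-based limit, then the absolute cap from dirty_kb. */
	lim.dirty = (dirty_ratio * available_pages) / 100;
	if (lim.dirty > cap_pages)
		lim.dirty = cap_pages;

	/* Scale the background threshold from the (possibly capped) limit. */
	if (dirty_ratio == 0)
		lim.background = 0;
	else if (background_ratio >= dirty_ratio)
		lim.background = lim.dirty / 2;
	else
		lim.background = lim.dirty * background_ratio / dirty_ratio;
	return lim;
}
```

With 100Gig of 4K pages, a dirty ratio of 40, a background ratio of 10 and a
cap of 51200 KiB (the 50M mentioned earlier), the cap wins and the background
threshold keeps the same 1:4 proportion to it.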

It would be better if we can avoid creating the second global variable.  Is
it not possible to remove dirty_ratio?  Make everything work off
vm_dirty_kb and do arithmetricks at the /proc/sys/vm/dirty_ratio interface?

We should perform the same conversion to dirty_background_ratio, I suspect.
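
One way the conversion at the sysctl boundary might look, as a rough sketch
only: the kernel would keep just the kilobyte value, and the percentage
interface would translate on write and on read. `ratio_to_kb`, `kb_to_ratio`
and `total_kb` are hypothetical names; which memory total to use as the base
is an assumption left open here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helpers. total_kb stands in for whatever memory total the
 * kernel would use as the base of the percentage.
 */

/* Write side: a percentage written to dirty_ratio becomes a KiB limit. */
static uint64_t ratio_to_kb(unsigned int ratio, uint64_t total_kb)
{
	return total_kb * ratio / 100;
}

/* Read side: report the stored KiB limit as a whole percentage. */
static unsigned int kb_to_ratio(uint64_t kb, uint64_t total_kb)
{
	assert(total_kb != 0);
	return (unsigned int)(kb * 100 / total_kb);
}
```

Note that the read side is lossy in this sketch: any limit below 1% of
total_kb reads back as 0, so the percentage file could not faithfully report
a limit that was set through the kilobyte tunable.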

And these guys should be `long', not `int'.  Otherwise things will go
pearshaped at 2 tabbybytes.
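
To spell out the arithmetic behind that limit: a signed 32-bit int holding a
count of kilobytes tops out at 2^31 - 1 KiB, which is just under 2 TiB. A
small sketch, assuming a 32-bit int; `kb_fits_in_int` is a hypothetical
helper, not a kernel function.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

static_assert(sizeof(int) == 4, "sketch assumes a 32-bit int");

/* Returns 1 if a limit of `bytes` bytes, expressed in KiB, fits in an int. */
static int kb_fits_in_int(uint64_t bytes)
{
	return bytes / 1024 <= (uint64_t)INT_MAX;
}
```

Here 2 TiB (1ULL << 41 bytes) is 2^31 KiB and does not fit, while one
kilobyte less does. A `long' is 64 bits on 64-bit Linux, which is where
machines of this size would be running.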

