LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
From: Trond Myklebust <trond.myklebust@fys.uio.no>
To: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <clameter@sgi.com>,
	Andrew Morton <akpm@osdl.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH] nfs: fix congestion control
Date: Fri, 19 Jan 2007 11:51:46 -0500	[thread overview]
Message-ID: <1169225506.5775.15.camel@lade.trondhjem.org> (raw)
In-Reply-To: <1169212022.6197.148.camel@twins>

On Fri, 2007-01-19 at 14:07 +0100, Peter Zijlstra wrote:
> On Fri, 2007-01-19 at 10:34 +0100, Peter Zijlstra wrote:
> > On Thu, 2007-01-18 at 10:49 -0500, Trond Myklebust wrote:
> > 
> > > After the dirty page has been written to unstable storage, it marks the
> > > inode using I_DIRTY_DATASYNC, which should then ensure that the VFS
> > > calls write_inode() on the next pass through __sync_single_inode.
> > 
> > > I'd rather like to see fs/fs-writeback.c do this correctly (assuming
> > > that it is incorrect now).
> > 
> > balance_dirty_pages()
> >   wbc.sync_mode = WB_SYNC_NONE;
> >   writeback_inodes()
> >     sync_sb_inodes()
> >       __writeback_single_inode()
> >         __sync_single_inode()
> >           write_inode()
> >             nfs_write_inode()
> > 
> > Ah, yes, I see. That ought to work.
> > 
> > /me goes verify he didn't mess it up himself...
> 
> And of course I did. This seems to work.
> 
> The problem I had was that throttle_vm_writeout() got stuck on commit
> pages. But that can be solved with a much much simpler and better
> solution. Force all swap traffic to be |FLUSH_STABLE, unstable swap
> space doesn't make sense anyway :-)

Yep. Since we don't care too much about metadata updates on the swap
file (the file size should never change), we could perhaps make the
swapper write out the pages with DATA_SYNC rather than the full
FILE_SYNC.

> So with that out of the way I now have this

Looks much better. Just one obvious buglet...

> ---
> 
> Subject: nfs: fix congestion control
> 
> The current NFS client congestion logic is severly broken, it marks the backing
> device congested during each nfs_writepages() call but doesn't mirror this in
> nfs_writepage() which makes for deadlocks. Also it implements its own waitqueue.
> 
> Replace this by a more regular congestion implementation that puts a cap on the
> number of active writeback pages and uses the bdi congestion waitqueue.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
> ---
>  fs/nfs/write.c              |  113 ++++++++++++++++++++++++++++----------------
>  include/linux/backing-dev.h |    1 
>  include/linux/nfs_fs_sb.h   |    1 
>  mm/backing-dev.c            |   16 ++++++
>  4 files changed, 90 insertions(+), 41 deletions(-)
> 
> Index: linux-2.6-git/fs/nfs/write.c
> ===================================================================
> --- linux-2.6-git.orig/fs/nfs/write.c	2007-01-19 13:57:49.000000000 +0100
> +++ linux-2.6-git/fs/nfs/write.c	2007-01-19 14:03:30.000000000 +0100
> @@ -89,8 +89,6 @@ static struct kmem_cache *nfs_wdata_cach
>  static mempool_t *nfs_wdata_mempool;
>  static mempool_t *nfs_commit_mempool;
>  
> -static DECLARE_WAIT_QUEUE_HEAD(nfs_write_congestion);
> -
>  struct nfs_write_data *nfs_commit_alloc(void)
>  {
>  	struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOFS);
> @@ -245,6 +243,39 @@ static int wb_priority(struct writeback_
>  }
>  
>  /*
> + * NFS congestion control
> + */
> +
> +static unsigned long nfs_congestion_pages;
> +
> +#define NFS_CONGESTION_ON_THRESH 	nfs_congestion_pages
> +#define NFS_CONGESTION_OFF_THRESH	\
> +	(NFS_CONGESTION_ON_THRESH - (NFS_CONGESTION_ON_THRESH >> 2))
> +
> +static inline void nfs_set_page_writeback(struct page *page)
> +{
> +	if (!test_set_page_writeback(page)) {
> +		struct inode *inode = page->mapping->host;
> +		struct nfs_server *nfss = NFS_SERVER(inode);
> +
> +		if (atomic_inc_return(&nfss->writeback) > NFS_CONGESTION_ON_THRESH)
> +			set_bdi_congested(&nfss->backing_dev_info, WRITE);
> +	}
> +}
> +
> +static inline void nfs_end_page_writeback(struct page *page)
> +{
> +	struct inode *inode = page->mapping->host;
> +	struct nfs_server *nfss = NFS_SERVER(inode);
> +
> +	end_page_writeback(page);
> +	if (atomic_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH) {
> +		clear_bdi_congested(&nfss->backing_dev_info, WRITE);
> +		congestion_end(WRITE);
> +	}
> +}
> +
> +/*
>   * Find an associated nfs write request, and prepare to flush it out
>   * Returns 1 if there was no write request, or if the request was
>   * already tagged by nfs_set_page_dirty.Returns 0 if the request
> @@ -281,7 +312,7 @@ static int nfs_page_mark_flush(struct pa
>  	spin_unlock(req_lock);
>  	if (test_and_set_bit(PG_FLUSHING, &req->wb_flags) == 0) {
>  		nfs_mark_request_dirty(req);
> -		set_page_writeback(page);
> +		nfs_set_page_writeback(page);
>  	}
>  	ret = test_bit(PG_NEED_FLUSH, &req->wb_flags);
>  	nfs_unlock_request(req);
> @@ -336,13 +367,8 @@ int nfs_writepage(struct page *page, str
>  	return err; 
>  }
>  
> -/*
> - * Note: causes nfs_update_request() to block on the assumption
> - * 	 that the writeback is generated due to memory pressure.
> - */
>  int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc)
>  {
> -	struct backing_dev_info *bdi = mapping->backing_dev_info;
>  	struct inode *inode = mapping->host;
>  	int err;
>  
> @@ -351,11 +377,6 @@ int nfs_writepages(struct address_space 
>  	err = generic_writepages(mapping, wbc);
>  	if (err)
>  		return err;
> -	while (test_and_set_bit(BDI_write_congested, &bdi->state) != 0) {
> -		if (wbc->nonblocking)
> -			return 0;
> -		nfs_wait_on_write_congestion(mapping, 0);
> -	}
>  	err = nfs_flush_mapping(mapping, wbc, wb_priority(wbc));
>  	if (err < 0)
>  		goto out;
> @@ -369,9 +390,6 @@ int nfs_writepages(struct address_space 
>  	if (err > 0)
>  		err = 0;
>  out:
> -	clear_bit(BDI_write_congested, &bdi->state);
> -	wake_up_all(&nfs_write_congestion);
> -	congestion_end(WRITE);
>  	return err;
>  }
>  
> @@ -401,7 +419,7 @@ static int nfs_inode_add_request(struct 
>  }
>  
>  /*
> - * Insert a write request into an inode
> + * Remove a write request from an inode
>   */
>  static void nfs_inode_remove_request(struct nfs_page *req)
>  {
> @@ -585,8 +603,8 @@ static inline int nfs_scan_commit(struct
>  
>  static int nfs_wait_on_write_congestion(struct address_space *mapping, int intr)
>  {
> +	struct inode *inode = mapping->host;
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
> -	DEFINE_WAIT(wait);
>  	int ret = 0;
>  
>  	might_sleep();
> @@ -594,31 +612,27 @@ static int nfs_wait_on_write_congestion(
>  	if (!bdi_write_congested(bdi))
>  		return 0;
>  
> -	nfs_inc_stats(mapping->host, NFSIOS_CONGESTIONWAIT);
> +	nfs_inc_stats(inode, NFSIOS_CONGESTIONWAIT);
>  
> -	if (intr) {
> -		struct rpc_clnt *clnt = NFS_CLIENT(mapping->host);
> -		sigset_t oldset;
> -
> -		rpc_clnt_sigmask(clnt, &oldset);
> -		prepare_to_wait(&nfs_write_congestion, &wait, TASK_INTERRUPTIBLE);
> -		if (bdi_write_congested(bdi)) {
> -			if (signalled())
> -				ret = -ERESTARTSYS;
> -			else
> -				schedule();
> +	do {
> +		if (intr) {
> +			struct rpc_clnt *clnt = NFS_CLIENT(inode);
> +			sigset_t oldset;
> +
> +			rpc_clnt_sigmask(clnt, &oldset);
> +			ret = congestion_wait_interruptible(WRITE, HZ/10);
> +			rpc_clnt_sigunmask(clnt, &oldset);
> +			if (ret == -ERESTARTSYS)
> +				return ret;
> +			ret = 0;
> +		} else {
> +			congestion_wait(WRITE, HZ/10);
>  		}
> -		rpc_clnt_sigunmask(clnt, &oldset);
> -	} else {
> -		prepare_to_wait(&nfs_write_congestion, &wait, TASK_UNINTERRUPTIBLE);
> -		if (bdi_write_congested(bdi))
> -			schedule();
> -	}
> -	finish_wait(&nfs_write_congestion, &wait);
> +	} while (bdi_write_congested(bdi));
> +
>  	return ret;
>  }
>  
> -
>  /*
>   * Try to update any existing write request, or create one if there is none.
>   * In order to match, the request's credentials must match those of
> @@ -779,7 +793,7 @@ int nfs_updatepage(struct file *file, st
>  
>  static void nfs_writepage_release(struct nfs_page *req)
>  {
> -	end_page_writeback(req->wb_page);
> +	nfs_end_page_writeback(req->wb_page);
>  
>  #if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4)
>  	if (!PageError(req->wb_page)) {
> @@ -1095,12 +1109,12 @@ static void nfs_writeback_done_full(stru
>  			ClearPageUptodate(page);
>  			SetPageError(page);
>  			req->wb_context->error = task->tk_status;
> -			end_page_writeback(page);
> +			nfs_end_page_writeback(page);
>  			nfs_inode_remove_request(req);
>  			dprintk(", error = %d\n", task->tk_status);
>  			goto next;
>  		}
> -		end_page_writeback(page);
> +		nfs_end_page_writeback(page);
>  
>  #if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4)
>  		if (data->args.stable != NFS_UNSTABLE || data->verf.committed == NFS_FILE_SYNC) {
> @@ -1565,6 +1579,23 @@ int __init nfs_init_writepagecache(void)
>  	if (nfs_commit_mempool == NULL)
>  		return -ENOMEM;
>  
> +	/*
> +	 * NFS congestion size, scale with available memory.
> +	 *
> +	 *  64MB:    8192k
> +	 * 128MB:   11585k
> +	 * 256MB:   16384k
> +	 * 512MB:   23170k
> +	 *   1GB:   32768k
> +	 *   2GB:   46340k
> +	 *   4GB:   65536k
> +	 *   8GB:   92681k
> +	 *  16GB:  131072k
> +	 *
> +	 * This allows larger machines to have larger/more transfers.
> +	 */
> +	nfs_congestion_size = 32*int_sqrt(totalram_pages);
> +
          ^^^^^^^^^^^^^^^^^^^ nfs_congestion_pages?


>  	return 0;
>  }
>  
> Index: linux-2.6-git/mm/backing-dev.c
> ===================================================================
> --- linux-2.6-git.orig/mm/backing-dev.c	2007-01-19 13:57:49.000000000 +0100
> +++ linux-2.6-git/mm/backing-dev.c	2007-01-19 13:57:52.000000000 +0100
> @@ -55,6 +55,22 @@ long congestion_wait(int rw, long timeou
>  }
>  EXPORT_SYMBOL(congestion_wait);
>  
> +long congestion_wait_interruptible(int rw, long timeout)
> +{
> +	long ret;
> +	DEFINE_WAIT(wait);
> +	wait_queue_head_t *wqh = &congestion_wqh[rw];
> +
> +	prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE);
> +	if (signal_pending(current))
> +		ret = -ERESTARTSYS;
> +	else
> +		ret = io_schedule_timeout(timeout);
> +	finish_wait(wqh, &wait);
> +	return ret;
> +}
> +EXPORT_SYMBOL(congestion_wait_interruptible);
> +
>  /**
>   * congestion_end - wake up sleepers on a congested backing_dev_info
>   * @rw: READ or WRITE
> Index: linux-2.6-git/include/linux/nfs_fs_sb.h
> ===================================================================
> --- linux-2.6-git.orig/include/linux/nfs_fs_sb.h	2007-01-19 13:57:49.000000000 +0100
> +++ linux-2.6-git/include/linux/nfs_fs_sb.h	2007-01-19 13:57:52.000000000 +0100
> @@ -82,6 +82,7 @@ struct nfs_server {
>  	struct rpc_clnt *	client_acl;	/* ACL RPC client handle */
>  	struct nfs_iostats *	io_stats;	/* I/O statistics */
>  	struct backing_dev_info	backing_dev_info;
> +	atomic_t		writeback;	/* number of writeback pages */
>  	int			flags;		/* various flags */
>  	unsigned int		caps;		/* server capabilities */
>  	unsigned int		rsize;		/* read size */
> Index: linux-2.6-git/include/linux/backing-dev.h
> ===================================================================
> --- linux-2.6-git.orig/include/linux/backing-dev.h	2007-01-19 13:57:49.000000000 +0100
> +++ linux-2.6-git/include/linux/backing-dev.h	2007-01-19 13:57:52.000000000 +0100
> @@ -93,6 +93,7 @@ static inline int bdi_rw_congested(struc
>  void clear_bdi_congested(struct backing_dev_info *bdi, int rw);
>  void set_bdi_congested(struct backing_dev_info *bdi, int rw);
>  long congestion_wait(int rw, long timeout);
> +long congestion_wait_interruptible(int rw, long timeout);
>  void congestion_end(int rw);
>  
>  #define bdi_cap_writeback_dirty(bdi) \
> 
> 


  reply	other threads:[~2007-01-19 16:52 UTC|newest]

Thread overview: 110+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-01-16  5:47 [RFC 0/8] Cpuset aware writeback Christoph Lameter
2007-01-16  5:47 ` [RFC 1/8] Convert higest_possible_node_id() into nr_node_ids Christoph Lameter
2007-01-16 22:05   ` Andi Kleen
2007-01-17  3:14     ` Christoph Lameter
2007-01-17  4:15       ` Andi Kleen
2007-01-17  4:23         ` Christoph Lameter
2007-01-16  5:47 ` [RFC 2/8] Add a map to inodes to track dirty pages per node Christoph Lameter
2007-01-16  5:47 ` [RFC 3/8] Add a nodemask to pdflush functions Christoph Lameter
2007-01-16  5:48 ` [RFC 4/8] Per cpuset dirty ratio handling and writeout Christoph Lameter
2007-01-16  5:48 ` [RFC 5/8] Make writeout during reclaim cpuset aware Christoph Lameter
2007-01-16 22:07   ` Andi Kleen
2007-01-17  4:20     ` Paul Jackson
2007-01-17  4:28       ` Andi Kleen
2007-01-17  4:36         ` Paul Jackson
2007-01-17  5:59           ` Andi Kleen
2007-01-17  6:19             ` Christoph Lameter
2007-01-17  4:23     ` Christoph Lameter
2007-01-16  5:48 ` [RFC 6/8] Throttle vm writeout per cpuset Christoph Lameter
2007-01-16  5:48 ` [RFC 7/8] Exclude unreclaimable pages from dirty ration calculation Christoph Lameter
2007-01-18 15:48   ` Nikita Danilov
2007-01-18 19:56     ` Christoph Lameter
2007-01-16  5:48 ` [RFC 8/8] Reduce inode memory usage for systems with a high MAX_NUMNODES Christoph Lameter
2007-01-16 19:52   ` Paul Menage
2007-01-16 20:00     ` Christoph Lameter
2007-01-16 20:06       ` Paul Menage
2007-01-16 20:51         ` Christoph Lameter
2007-01-16  7:38 ` [RFC 0/8] Cpuset aware writeback Peter Zijlstra
2007-01-16 20:10   ` Christoph Lameter
2007-01-16  9:25 ` Paul Jackson
2007-01-16 17:13   ` Christoph Lameter
2007-01-16 21:53 ` Andrew Morton
2007-01-16 22:08   ` [PATCH] nfs: fix congestion control Peter Zijlstra
2007-01-16 22:27     ` Trond Myklebust
2007-01-17  2:41       ` Peter Zijlstra
2007-01-17  6:15         ` Trond Myklebust
2007-01-17  8:49           ` Peter Zijlstra
2007-01-17 13:50             ` Trond Myklebust
2007-01-17 14:29               ` Peter Zijlstra
2007-01-17 14:45                 ` Trond Myklebust
2007-01-17 20:05     ` Christoph Lameter
2007-01-17 21:52       ` Peter Zijlstra
2007-01-17 21:54         ` Trond Myklebust
2007-01-18 13:27           ` Peter Zijlstra
2007-01-18 15:49             ` Trond Myklebust
2007-01-19  9:33               ` Peter Zijlstra
2007-01-19 13:07                 ` Peter Zijlstra
2007-01-19 16:51                   ` Trond Myklebust [this message]
2007-01-19 17:54                     ` Peter Zijlstra
2007-01-19 17:20                   ` Christoph Lameter
2007-01-19 17:57                     ` Peter Zijlstra
2007-01-19 18:02                       ` Christoph Lameter
2007-01-19 18:26                       ` Trond Myklebust
2007-01-19 18:27                         ` Christoph Lameter
2007-01-20  7:01                         ` [PATCH] nfs: fix congestion control -v3 Peter Zijlstra
2007-01-22 16:12                           ` Trond Myklebust
2007-01-25 15:32                             ` [PATCH] nfs: fix congestion control -v4 Peter Zijlstra
2007-01-26  5:02                               ` Andrew Morton
2007-01-26  8:00                                 ` Peter Zijlstra
2007-01-26  8:50                                   ` Peter Zijlstra
2007-01-26  5:09                               ` Andrew Morton
2007-01-26  5:31                                 ` Christoph Lameter
2007-01-26  6:04                                   ` Andrew Morton
2007-01-26  6:53                                     ` Christoph Lameter
2007-01-26  8:03                                     ` Peter Zijlstra
2007-01-26  8:51                                       ` Andrew Morton
2007-01-26  9:01                                         ` Peter Zijlstra
2007-02-20 12:59                                         ` Peter Zijlstra
2007-01-22 17:59                           ` [PATCH] nfs: fix congestion control -v3 Christoph Lameter
2007-01-17 23:15     ` [PATCH] nfs: fix congestion control Christoph Hellwig
2007-01-16 22:15   ` [RFC 0/8] Cpuset aware writeback Christoph Lameter
2007-01-16 23:40     ` Andrew Morton
2007-01-17  0:16       ` Christoph Lameter
2007-01-17  1:07         ` Andrew Morton
2007-01-17  1:30           ` Christoph Lameter
2007-01-17  2:34             ` Andrew Morton
2007-01-17  3:40               ` Christoph Lameter
2007-01-17  4:02                 ` Paul Jackson
2007-01-17  4:05                 ` Andrew Morton
2007-01-17  6:27                   ` Christoph Lameter
2007-01-17  7:00                     ` Andrew Morton
2007-01-17  8:01                       ` Paul Jackson
2007-01-17  9:57                         ` Andrew Morton
2007-01-17 19:43                       ` Christoph Lameter
2007-01-17 22:10                         ` Andrew Morton
2007-01-18  1:10                           ` Christoph Lameter
2007-01-18  1:25                             ` Andrew Morton
2007-01-18  5:21                               ` Christoph Lameter
2007-01-16 23:44   ` David Chinner
2007-01-16 22:01 ` Andi Kleen
2007-01-16 22:18   ` Christoph Lameter
2007-02-02  1:38 ` Ethan Solomita
2007-02-02  2:16   ` Christoph Lameter
2007-02-02  4:03     ` Andrew Morton
2007-02-02  5:29       ` Christoph Lameter
2007-02-02  6:02         ` Neil Brown
2007-02-02  6:17           ` Christoph Lameter
2007-02-02  6:41             ` Neil Brown
2007-02-02  7:12         ` Andrew Morton
2007-03-21 21:11     ` Ethan Solomita
2007-03-21 21:29       ` Christoph Lameter
2007-03-21 21:52         ` Andrew Morton
2007-03-21 21:57           ` Christoph Lameter
2007-04-19  2:07         ` Ethan Solomita
2007-04-19  2:55           ` Christoph Lameter
2007-04-19  7:52             ` Ethan Solomita
2007-04-19 16:03               ` Christoph Lameter
2007-04-21  1:37             ` Ethan Solomita
2007-04-21  1:48               ` Christoph Lameter
2007-04-21  8:15                 ` Ethan Solomita
2007-04-21 15:40                   ` Christoph Lameter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1169225506.5775.15.camel@lade.trondhjem.org \
    --to=trond.myklebust@fys.uio.no \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@osdl.org \
    --cc=clameter@sgi.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --subject='Re: [PATCH] nfs: fix congestion control' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).