LKML Archive on lore.kernel.org
* [PATCH 0/3] [RFC][PATCH] clustered writeback
       [not found] <388214369.05937@ustc.edu.cn>
@ 2007-08-27 11:21 ` Fengguang Wu
       [not found]   ` <388214369.33461@ustc.edu.cn>
                     ` (2 more replies)
  2007-08-27 12:03 ` [PATCH 0/3] [RFC][PATCH] clustered writeback Arjan van de Ven
  1 sibling, 3 replies; 7+ messages in thread
From: Fengguang Wu @ 2007-08-27 11:21 UTC (permalink / raw)
  To: Chris Mason; +Cc: Andrew Morton, David Chinner, Michael Rubin, linux-kernel

Chris,

This is one possible implementation of the clustered writeback idea.
It runs OK on ext3 (compiling, syncing, etc.).

The patch is based on 2.6.23-rc3-mm1 and the writeback patches here:
http://lkml.org/lkml/2007/8/19/10

By default, with many dirty inodes, it works as follows:
- store dirty inodes in a radix tree, indexed by their inode numbers
- sweep the whole inode number space once every 25s, split into 5 passes
- each pass walks only 1/5 of the inode number space
- pull all inodes with dirty-age larger than 5s to the io dispatching queue

Because it does the work in small batches of 10 inodes, when the system has
<=10 dirty inodes, its behavior will reduce to:
- do a full sweep *at once* on every 25s
Which means the disk will flicker once every 25s, not bad :)
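
To make the schedule above concrete, here is a minimal userspace sketch of
how far a sweep should have advanced at a given moment. It is only a model
of the description, not the patch code: the name sweep_target() and its
parameters are hypothetical, and it assumes linear progress through the
inode number space.

```c
#include <assert.h>

/*
 * Hypothetical model of the sweep schedule: return the inode number the
 * sweep should have reached once `elapsed` seconds of a `period`-second
 * sweep have passed. Assumes max_ino * elapsed fits in 64 bits.
 */
static unsigned long sweep_target(unsigned long max_ino,
				  unsigned long elapsed,
				  unsigned long period)
{
	assert(period > 0);

	/* Past the deadline: finish the whole inode number space. */
	if (elapsed >= period)
		return max_ino;

	/* Otherwise advance in proportion to the time already spent. */
	return (unsigned long)((unsigned long long)max_ino * elapsed / period);
}
```

With a 25s period and wakeups every 5s, each call moves the target forward
by roughly 1/5 of the inode number space.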


The implications for the majority of users could be:
- medium-to-heavy writes become less seeky
- dirty inodes get synced earlier (before: 30s; now: 5-30s)
- less panic for the 'atime' mount option (a future work)

Fengguang
-- 


* [PATCH 1/3] writeback: introduce queue_dirty()
       [not found]   ` <388214369.33461@ustc.edu.cn>
@ 2007-08-27 11:21     ` Fengguang Wu
  0 siblings, 0 replies; 7+ messages in thread
From: Fengguang Wu @ 2007-08-27 11:21 UTC (permalink / raw)
  To: Chris Mason; +Cc: Andrew Morton, David Chinner, Michael Rubin, linux-kernel

[-- Attachment #1: writeback-queue_dirty.patch --]
[-- Type: text/plain, Size: 1485 bytes --]

Introduce queue_dirty() to enqueue a newly dirtied inode.
It helps remove duplicate code.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
 fs/fs-writeback.c |   21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

--- linux-2.6.23-rc3-mm1.orig/fs/fs-writeback.c
+++ linux-2.6.23-rc3-mm1/fs/fs-writeback.c
@@ -24,6 +24,15 @@
 #include <linux/buffer_head.h>
 #include "internal.h"
 
+/*
+ * Enqueue a newly dirtied inode.
+ */
+static void queue_dirty(struct inode *inode)
+{
+	inode->dirtied_when = jiffies;
+	list_move(&inode->i_list, &inode->i_sb->s_dirty);
+}
+
 /**
  *	__mark_inode_dirty -	internal function
  *	@inode: inode to mark
@@ -121,10 +130,8 @@ void __mark_inode_dirty(struct inode *in
 		 * If the inode was already on s_dirty/s_io/s_more_io, don't
 		 * reposition it (that would break s_dirty time-ordering).
 		 */
-		if (!was_dirty) {
-			inode->dirtied_when = jiffies;
-			list_move(&inode->i_list, &sb->s_dirty);
-		}
+		if (!was_dirty)
+			queue_dirty(inode);
 	}
 out:
 	spin_unlock(&inode_lock);
@@ -466,10 +473,8 @@ int generic_sync_sb_inodes(struct super_
 		err = __writeback_single_inode(inode, wbc);
 		if (!ret)
 			ret = err;
-		if (wbc->sync_mode == WB_SYNC_HOLD) {
-			inode->dirtied_when = jiffies;
-			list_move(&inode->i_list, &sb->s_dirty);
-		}
+		if (wbc->sync_mode == WB_SYNC_HOLD)
+			queue_dirty(inode);
 		if (current_is_pdflush())
 			writeback_release(bdi);
 		if (wbc->pages_skipped != pages_skipped) {

-- 


* [PATCH 2/3] writeback: introduce dirty_volatile_interval
       [not found]   ` <388214369.67692@ustc.edu.cn>
@ 2007-08-27 11:21     ` Fengguang Wu
  0 siblings, 0 replies; 7+ messages in thread
From: Fengguang Wu @ 2007-08-27 11:21 UTC (permalink / raw)
  To: Chris Mason
  Cc: Andrew Morton, Jens Axboe, David Chinner, Ken Chen,
	Michael Rubin, linux-kernel

[-- Attachment #1: writeback-young_protect_interval.patch --]
[-- Type: text/plain, Size: 2004 bytes --]

Introduce dirty_volatile_interval, the minimal time an inode stays dirty.
Inodes dirtied less than dirty_volatile_interval ago will not be
considered for syncing by kupdate-style writeback.

This new parameter will be used in clustered writeback.
The old dirty_expire_interval is still (but less strictly) respected.
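
As an illustration of the age test this parameter implies, here is a small
userspace sketch. The helper names tick_after() and is_volatile() are
hypothetical; tick_after() is modeled on the kernel's time_after() macro so
that the comparison stays correct when the tick counter wraps around.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* True if tick a is later than tick b, tolerating counter wraparound. */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/*
 * Hypothetical check: an inode is "volatile" (too young to sync) if it
 * was dirtied less than `volatile_ticks` ticks before `now`.
 */
static bool is_volatile(unsigned long dirtied_when, unsigned long now,
			unsigned long volatile_ticks)
{
	/* The signed-difference trick only works for short intervals. */
	assert(volatile_ticks <= LONG_MAX);

	return tick_after(dirtied_when, now - volatile_ticks);
}
```

An inode that is exactly volatile_ticks old is no longer volatile under
this check.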

Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: Ken Chen <kenchen@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
 include/linux/writeback.h |    1 +
 kernel/sysctl.c           |    7 +++++++
 mm/page-writeback.c       |    5 +++++
 3 files changed, 13 insertions(+)

--- linux-2.6.23-rc3-mm1.orig/include/linux/writeback.h
+++ linux-2.6.23-rc3-mm1/include/linux/writeback.h
@@ -101,6 +101,7 @@ extern int dirty_background_ratio;
 extern int vm_dirty_ratio;
 extern int dirty_writeback_interval;
 extern int dirty_expire_interval;
+extern int dirty_volatile_interval;
 extern int block_dump;
 extern int laptop_mode;
 
--- linux-2.6.23-rc3-mm1.orig/mm/page-writeback.c
+++ linux-2.6.23-rc3-mm1/mm/page-writeback.c
@@ -85,6 +85,11 @@ int dirty_writeback_interval = 5 * HZ;
 int dirty_expire_interval = 30 * HZ;
 
 /*
+ * The shortest number of jiffies for which data should remain dirty
+ */
+int dirty_volatile_interval = 5 * HZ;
+
+/*
  * Flag that makes the machine dump writes/reads and block dirtyings.
  */
 int block_dump;
--- linux-2.6.23-rc3-mm1.orig/kernel/sysctl.c
+++ linux-2.6.23-rc3-mm1/kernel/sysctl.c
@@ -837,6 +837,13 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= &proc_dointvec_userhz_jiffies,
 	},
 	{
+		.procname	= "dirty_volatile_centisecs",
+		.data		= &dirty_volatile_interval,
+		.maxlen		= sizeof(dirty_volatile_interval),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_userhz_jiffies,
+	},
+	{
 		.ctl_name	= VM_NR_PDFLUSH_THREADS,
 		.procname	= "nr_pdflush_threads",
 		.data		= &nr_pdflush_threads,

-- 


* [PATCH 3/3] writeback: writeback clustering by inode number
       [not found]   ` <388214370.31498@ustc.edu.cn>
@ 2007-08-27 11:21     ` Fengguang Wu
  0 siblings, 0 replies; 7+ messages in thread
From: Fengguang Wu @ 2007-08-27 11:21 UTC (permalink / raw)
  To: Chris Mason
  Cc: Andrew Morton, Jens Axboe, David Chinner, Ken Chen,
	Michael Rubin, linux-kernel

[-- Attachment #1: writeback-ino-clustering.patch --]
[-- Type: text/plain, Size: 6784 bytes --]

Organize dirty inodes by on-disk location instead of by dirty time.
This makes write-intensive workloads more seek-friendly.

There are 2 candidates for this feature:
	1) XFS style piggybacking
	   write all expired (age > 30s) inodes, plus the ones near them (any age)
	2) elevator style sweep scanning
	   on each kupdate wakeup, walk 1/5 of the partition and write all
	   non-volatile (age > 5s) inodes in that range

This patch implements the second scheme. Its merits are:
	- more predictable behavior
	- can smooth out time-bursty writes
However, it does not help with space-bursty writes caused by hot write areas.

For many filesystems (ext3/XFS/...), the inode number is a good location hint
for the inode. However, an inode's data blocks do not necessarily lie close to
the inode itself (e.g. XFS), hence the location hint should be extended
somehow in the future.
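
The elevator style scheme can be sketched in userspace as a scan over dirty
inodes sorted by inode number. This is an illustration under assumptions,
not the patch itself: struct dirty_ino and sweep_range() are hypothetical,
a sorted array stands in for the radix tree, and ages are in seconds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a dirty inode. */
struct dirty_ino {
	unsigned long ino;	/* inode number, used as location hint */
	unsigned long age;	/* seconds since the inode was dirtied */
};

/*
 * Collect up to `cap` inode numbers with ino in [begin, end) and
 * age > min_age from `v`, which must be sorted by ascending ino.
 * Returns the number of entries written to `out`.
 */
static size_t sweep_range(const struct dirty_ino *v, size_t n,
			  unsigned long begin, unsigned long end,
			  unsigned long min_age,
			  unsigned long *out, size_t cap)
{
	size_t i, found = 0;

	assert(begin <= end);

	for (i = 0; i < n && v[i].ino < end && found < cap; i++) {
		/* Skip inodes before the window and young volatile ones. */
		if (v[i].ino < begin || v[i].age <= min_age)
			continue;
		out[found++] = v[i].ino;
	}
	return found;
}
```

Because the result is ordered by inode number, the inodes queued in one
pass tend to be near each other on disk for filesystems where the inode
number is a usable location hint.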

Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: Ken Chen <kenchen@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
 fs/fs-writeback.c  |  129 +++++++++++++++++++++++++++++++++++++++++++
 fs/super.c         |    2 
 include/linux/fs.h |   10 +++
 3 files changed, 141 insertions(+)

--- linux-2.6.23-rc3-mm1.orig/fs/super.c
+++ linux-2.6.23-rc3-mm1/fs/super.c
@@ -61,6 +61,8 @@ static struct super_block *alloc_super(s
 			s = NULL;
 			goto out;
 		}
+		INIT_RADIX_TREE(&s->s_dirty_tree.inode_tree, GFP_ATOMIC);
+		INIT_LIST_HEAD(&s->s_dirty_tree.inode_list);
 		INIT_LIST_HEAD(&s->s_dirty);
 		INIT_LIST_HEAD(&s->s_io);
 		INIT_LIST_HEAD(&s->s_more_io);
--- linux-2.6.23-rc3-mm1.orig/include/linux/fs.h
+++ linux-2.6.23-rc3-mm1/include/linux/fs.h
@@ -978,6 +978,15 @@ extern int send_sigurg(struct fown_struc
 extern struct list_head super_blocks;
 extern spinlock_t sb_lock;
 
+struct dirty_inode_tree {
+	struct list_head	inode_list;
+	struct radix_tree_root	inode_tree;
+	unsigned long		nr_inodes;
+	unsigned long		max_index;
+	unsigned long		start_jiffies; /* when the scan started? */
+	unsigned long		next_index;    /* where it is in the scan? */
+};
+
 #define sb_entry(list)	list_entry((list), struct super_block, s_list)
 #define S_BIAS (1<<30)
 struct super_block {
@@ -1007,6 +1016,7 @@ struct super_block {
 	struct xattr_handler	**s_xattr;
 
 	struct list_head	s_inodes;	/* all inodes */
+	struct dirty_inode_tree	s_dirty_tree;
 	struct list_head	s_dirty;	/* dirty inodes */
 	struct list_head	s_io;		/* parked for writeback */
 	struct list_head	s_more_io;	/* parked for more writeback */
--- linux-2.6.23-rc3-mm1.orig/fs/fs-writeback.c
+++ linux-2.6.23-rc3-mm1/fs/fs-writeback.c
@@ -25,12 +25,139 @@
 #include "internal.h"
 
 /*
+ * Add @inode to its superblock's radix tree of dirty inodes.
+ * The radix tree is indexed by inode number.
+ */
+static void add_to_dirty_tree(struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+	struct dirty_inode_tree *dt = &sb->s_dirty_tree;
+	int e;
+
+	e = radix_tree_preload(GFP_ATOMIC);
+	if (!e) {
+		e = radix_tree_insert(&dt->inode_tree, inode->i_ino, inode);
+		/*
+		 * Inode numbers are unique, but the inode itself might be
+		 * somehow redirtied and resent to us. So it's safe to ignore
+		 * the conflict.
+		 */
+		if (!e) {
+			__iget(inode);
+			dt->nr_inodes++;
+			if (dt->max_index < inode->i_ino)
+				dt->max_index = inode->i_ino;
+		}
+		list_move(&inode->i_list, &sb->s_dirty_tree.inode_list);
+		radix_tree_preload_end();
+	}
+}
+
+#define DIRTY_SCAN_BATCH	10
+#define DIRTY_SCAN_ALL		LONG_MAX
+#define DIRTY_SCAN_REMAINING	(LONG_MAX-1)
+
+/*
+ * Scan the dirty inode tree and pull some inodes onto s_io.
+ * It could go beyond @end - it is a soft/approx limit.
+ */
+static unsigned long scan_dirty_tree(struct super_block *sb,
+					unsigned long begin, unsigned long end)
+{
+	struct dirty_inode_tree *dt = &sb->s_dirty_tree;
+	struct inode *inodes[DIRTY_SCAN_BATCH];
+	struct inode *inode = NULL;
+	int i, j;
+	void *p;
+
+	while (begin < end) {
+		j = radix_tree_gang_lookup(&dt->inode_tree, (void **)inodes,
+						begin, DIRTY_SCAN_BATCH);
+		if (!j)
+			break;
+		for (i = 0; i < j; i++) {
+			inode = inodes[i];
+			if (end != DIRTY_SCAN_ALL) {
+				/* skip young volatile ones */
+				if (time_after(inode->dirtied_when,
+					jiffies - dirty_volatile_interval)) {
+					inodes[i] = NULL;
+					continue;
+				}
+			}
+
+			dt->nr_inodes--;
+			p = radix_tree_delete(&dt->inode_tree, inode->i_ino);
+			BUG_ON(!p);
+
+			if (!(inode->i_state & I_SYNC))
+				list_move(&inode->i_list, &sb->s_io);
+		}
+		begin = inode->i_ino + 1;
+
+		spin_unlock(&inode_lock);
+		for (i = 0; i < j; i++)
+			if (inodes[i])
+				iput(inodes[i]);
+		cond_resched();
+		spin_lock(&inode_lock);
+	}
+
+	return begin;
+}
+
+/*
+ * Move a cluster of dirty inodes to the io dispatch queue.
+ */
+static void move_cluster_inodes(struct super_block *sb,
+				unsigned long *older_than_this)
+{
+	struct dirty_inode_tree *dt = &sb->s_dirty_tree;
+	int scan_interval = dirty_expire_interval - dirty_volatile_interval;
+	unsigned long begin;
+	unsigned long end;
+
+	if (!older_than_this) {
+		/*
+		 * Be aggressive: either it is a sync(), or we fall into
+		 * background writeback because kupdate-style writebacks
+		 * could not catch up with fast writers.
+		 */
+		begin = 0;
+		end = DIRTY_SCAN_ALL;
+	} else if (time_after_eq(jiffies,
+				dt->start_jiffies + scan_interval)) {
+		begin = dt->next_index;
+		end = DIRTY_SCAN_REMAINING; /* complete this scan */
+	} else {
+		unsigned long time_total = max(scan_interval, 1);
+		unsigned long time_delta = jiffies - dt->start_jiffies;
+		unsigned long scan_total = dt->max_index;
+		unsigned long scan_delta = scan_total * time_delta / time_total;
+
+		begin = dt->next_index;
+		end = scan_delta;
+	}
+
+	begin = scan_dirty_tree(sb, begin, end);
+
+	if (end >= DIRTY_SCAN_REMAINING) { /* wrap around and setup next scan */
+		dt->next_index = 0;
+		dt->start_jiffies = jiffies;
+	} else
+		dt->next_index = begin;
+}
+
+
+/*
  * Enqueue a newly dirtied inode.
  */
 static void queue_dirty(struct inode *inode)
 {
 	inode->dirtied_when = jiffies;
 	list_move(&inode->i_list, &inode->i_sb->s_dirty);
+	if (dirty_volatile_interval <= dirty_expire_interval/2)
+		add_to_dirty_tree(inode);
 }
 
 /**
@@ -212,11 +339,13 @@ static void queue_io(struct super_block 
 {
 	list_splice_init(&sb->s_more_io, sb->s_io.prev);
 	move_expired_inodes(&sb->s_dirty, &sb->s_io, older_than_this);
+	move_cluster_inodes(sb, older_than_this);
 }
 
 int sb_has_dirty_inodes(struct super_block *sb)
 {
 	return !list_empty(&sb->s_dirty) ||
+	       !list_empty(&sb->s_dirty_tree.inode_list) ||
 	       !list_empty(&sb->s_io) ||
 	       !list_empty(&sb->s_more_io);
 }

-- 


* Re: [PATCH 0/3] [RFC][PATCH] clustered writeback
       [not found] <388214369.05937@ustc.edu.cn>
  2007-08-27 11:21 ` [PATCH 0/3] [RFC][PATCH] clustered writeback Fengguang Wu
@ 2007-08-27 12:03 ` Arjan van de Ven
       [not found]   ` <388217659.90061@ustc.edu.cn>
  2007-08-27 12:43   ` Chris Mason
  1 sibling, 2 replies; 7+ messages in thread
From: Arjan van de Ven @ 2007-08-27 12:03 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Chris Mason, Andrew Morton, David Chinner, Michael Rubin, linux-kernel

On Mon, 27 Aug 2007 19:21:52 +0800
> 
> Because it does the work in small batches of 10 inodes, when the
> system has <=10 dirty inodes, its behavior will reduce to:
> - do a full sweep *at once* on every 25s
> Which means the disk will flicker once every 25s, not bad :)

25 seconds is quite not good already though.... it takes a disk a
second or two of no activity to go into low power mode, every 25
seconds means you now have at least a 10% constant power cost....

I don't know the right answer (well other than "make sure inodes aren't
dirty", which involves fixing apps to not do as much file operations,
as well as relatime) but just "every 25s is no big deal" isn't really
the case ;-(
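
The 10% figure can be checked with a rough duty-cycle calculation. This is
only a sketch: awake_permille() is a hypothetical helper, and the half
second of busy time used in the note below is an assumption, not a number
from this thread.

```c
#include <assert.h>

/*
 * Share of time (in permille) the disk stays out of low-power mode if
 * it is woken once per `period_ms`, stays busy for `busy_ms`, and then
 * needs `idle_ms` of inactivity before it powers down again.
 * Assumes all values are small enough that awake_ms * 1000 fits in an
 * unsigned int.
 */
static unsigned int awake_permille(unsigned int period_ms,
				   unsigned int busy_ms,
				   unsigned int idle_ms)
{
	unsigned int awake_ms = busy_ms + idle_ms;

	assert(period_ms > 0);

	/* The disk never gets a chance to sleep. */
	if (awake_ms >= period_ms)
		return 1000;

	return awake_ms * 1000u / period_ms;
}
```

With a 25s period, a 2s idle timeout and an assumed 0.5s of write activity,
this gives 100 permille, i.e. about 10% of the time awake.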


* Re: [PATCH 0/3] [RFC][PATCH] clustered writeback
       [not found]   ` <388217659.90061@ustc.edu.cn>
@ 2007-08-27 12:27     ` Fengguang Wu
  0 siblings, 0 replies; 7+ messages in thread
From: Fengguang Wu @ 2007-08-27 12:27 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Chris Mason, Andrew Morton, David Chinner, Michael Rubin, linux-kernel

On Mon, Aug 27, 2007 at 05:03:36AM -0700, Arjan van de Ven wrote:
> On Mon, 27 Aug 2007 19:21:52 +0800
> > 
> > Because it does the work in small batches of 10 inodes, when the
> > system has <=10 dirty inodes, its behavior will reduce to:
> > - do a full sweep *at once* on every 25s
> > Which means the disk will flicker once every 25s, not bad :)
> 
> 25 seconds is quite not good already though.... it takes a disk a
> second or two of no activity to go into low power mode, every 25
> seconds means you now have at least a 10% constant power cost....
> 
> I don't know the right answer (well other than "make sure inodes aren't
> dirty", which involves fixing apps to not do as much file operations,
> as well as relatime) but just "every 25s is no big deal" isn't really
> the case ;-(

Yeah, 25s may be too frequent... What I meant is that the old behavior
could be "write 1-3 inodes every 5s" if the inodes are dirtied at
random times. Now it becomes "write 10 inodes every 25s". So it is
actually better ;-)

It's interesting that we want writeback to be smooth under heavy load
and 'bursty' under light load. Increasing dirty_expire_centisecs
and decreasing dirty_writeback_centisecs could help with that somewhat.



* Re: [PATCH 0/3] [RFC][PATCH] clustered writeback
  2007-08-27 12:03 ` [PATCH 0/3] [RFC][PATCH] clustered writeback Arjan van de Ven
       [not found]   ` <388217659.90061@ustc.edu.cn>
@ 2007-08-27 12:43   ` Chris Mason
  1 sibling, 0 replies; 7+ messages in thread
From: Chris Mason @ 2007-08-27 12:43 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Fengguang Wu, Andrew Morton, David Chinner, Michael Rubin, linux-kernel

On Mon, 27 Aug 2007 05:03:36 -0700
Arjan van de Ven <arjan@infradead.org> wrote:

> On Mon, 27 Aug 2007 19:21:52 +0800
> > 
> > Because it does the work in small batches of 10 inodes, when the
> > system has <=10 dirty inodes, its behavior will reduce to:
> > - do a full sweep *at once* on every 25s
> > Which means the disk will flicker once every 25s, not bad :)
> 
> 25 seconds is quite not good already though.... it takes a disk a
> second or two of no activity to go into low power mode, every 25
> seconds means you now have at least a 10% constant power cost....
> 
> I don't know the right answer (well other than "make sure inodes
> aren't dirty", which involves fixing apps to not do as much file
> operations, as well as relatime) but just "every 25s is no big deal"
> isn't really the case ;-(

But fixing this isn't the job of this patch... It needs something like
the laptop mode logic where it says "ohhhh, the disk is awake, let's send
stuff down".

kupdate hitting on the disk isn't really a new problem, I'd rather
address it with a different patch series.

-chris

