LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH 00/23] per device dirty throttling -v8
@ 2007-08-03 12:37 Peter Zijlstra
  2007-08-03 12:37 ` [PATCH 01/23] nfs: remove congestion_end() Peter Zijlstra
                   ` (25 more replies)
  0 siblings, 26 replies; 188+ messages in thread
From: Peter Zijlstra @ 2007-08-03 12:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

Per device dirty throttling patches

These patches aim to improve balance_dirty_pages() and directly address three
issues:
  1) inter device starvation
  2) stacked device deadlocks
  3) inter process starvation

1 and 2 are a direct result from removing the global dirty limit and using
per device dirty limits. By giving each device its own dirty limit is will
no longer starve another device, and the cyclic dependancy on the dirty limit
is broken.

In order to efficiently distribute the dirty limit across the independant
devices a floating proportion is used, this will allocate a share of the total
limit proportional to the device's recent activity.

3 is done by also scaling the dirty limit proportional to the current task's
recent dirty rate.

Changes since -v7:
 - perpcu_counter renames (partially suggested by Linus)
 - percpu_counter error handling
 - bdi_init error handling
 - fwd port to .23-rc1-mm


---
#
# cleanups
#
nfs_congestion_fixup.patch
#
# percpu_counter rework
#
percpu_counter_add.patch
percpu_counter_batch.patch
percpu_counter_add64.patch
percpu_counter_set.patch
percpu_counter_sum_positive.patch
percpu_counter_sum.patch
percpu_counter_init.patch
percpu_counter_init_irq.patch
#
# per BDI dirty pages
#
bdi_init.patch
bdi_init_container.patch
bdi_init_mtd.patch
mtd-bdi-fixups.patch
bdi_mtdconcat.patch
bdi_stat.patch
bdi_stat_reclaimable.patch
bdi_stat_writeback.patch
bdi_stat_sysfs.patch
#
# floating proportions
#
proportions.patch
proportions_single.patch
#
# per BDI dirty
#
writeback-balance-per-backing_dev.patch
dirty_pages2.patch
#
# debug foo
#
bdi_stat_debug.patch



^ permalink raw reply	[flat|nested] 188+ messages in thread

* [PATCH 01/23] nfs: remove congestion_end()
  2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
@ 2007-08-03 12:37 ` Peter Zijlstra
  2007-08-03 12:37 ` [PATCH 02/23] lib: percpu_counter_add Peter Zijlstra
                   ` (24 subsequent siblings)
  25 siblings, 0 replies; 188+ messages in thread
From: Peter Zijlstra @ 2007-08-03 12:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: nfs_congestion_fixup.patch --]
[-- Type: text/plain, Size: 1965 bytes --]

Its redundant, clear_bdi_congested() already wakes the waiters.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/nfs/write.c              |    5 ++---
 include/linux/backing-dev.h |    1 -
 mm/backing-dev.c            |   13 -------------
 3 files changed, 2 insertions(+), 17 deletions(-)

Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -235,10 +235,8 @@ static void nfs_end_page_writeback(struc
 	struct nfs_server *nfss = NFS_SERVER(inode);
 
 	end_page_writeback(page);
-	if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH) {
+	if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
 		clear_bdi_congested(&nfss->backing_dev_info, WRITE);
-		congestion_end(WRITE);
-	}
 }
 
 /*
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -93,7 +93,6 @@ static inline int bdi_rw_congested(struc
 void clear_bdi_congested(struct backing_dev_info *bdi, int rw);
 void set_bdi_congested(struct backing_dev_info *bdi, int rw);
 long congestion_wait(int rw, long timeout);
-void congestion_end(int rw);
 
 #define bdi_cap_writeback_dirty(bdi) \
 	(!((bdi)->capabilities & BDI_CAP_NO_WRITEBACK))
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c
+++ linux-2.6/mm/backing-dev.c
@@ -54,16 +54,3 @@ long congestion_wait(int rw, long timeou
 	return ret;
 }
 EXPORT_SYMBOL(congestion_wait);
-
-/**
- * congestion_end - wake up sleepers on a congested backing_dev_info
- * @rw: READ or WRITE
- */
-void congestion_end(int rw)
-{
-	wait_queue_head_t *wqh = &congestion_wqh[rw];
-
-	if (waitqueue_active(wqh))
-		wake_up(wqh);
-}
-EXPORT_SYMBOL(congestion_end);

--


^ permalink raw reply	[flat|nested] 188+ messages in thread

* [PATCH 02/23] lib: percpu_counter_add
  2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
  2007-08-03 12:37 ` [PATCH 01/23] nfs: remove congestion_end() Peter Zijlstra
@ 2007-08-03 12:37 ` Peter Zijlstra
  2007-08-03 12:37 ` [PATCH 03/23] lib: percpu_counter variable batch Peter Zijlstra
                   ` (23 subsequent siblings)
  25 siblings, 0 replies; 188+ messages in thread
From: Peter Zijlstra @ 2007-08-03 12:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: percpu_counter_add.patch --]
[-- Type: text/plain, Size: 6689 bytes --]

 s/percpu_counter_mod/percpu_counter_add/

Because its a better name, _mod implies modulo.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/ext2/balloc.c               |    4 ++--
 fs/ext2/ialloc.c               |    2 +-
 fs/ext3/balloc.c               |    4 ++--
 fs/ext3/resize.c               |    4 ++--
 fs/ext4/balloc.c               |    4 ++--
 fs/ext4/resize.c               |    4 ++--
 include/linux/percpu_counter.h |    8 ++++----
 lib/percpu_counter.c           |    4 ++--
 8 files changed, 17 insertions(+), 17 deletions(-)

Index: linux-2.6/fs/ext2/balloc.c
===================================================================
--- linux-2.6.orig/fs/ext2/balloc.c
+++ linux-2.6/fs/ext2/balloc.c
@@ -99,7 +99,7 @@ static void release_blocks(struct super_
 	if (count) {
 		struct ext2_sb_info *sbi = EXT2_SB(sb);
 
-		percpu_counter_mod(&sbi->s_freeblocks_counter, count);
+		percpu_counter_add(&sbi->s_freeblocks_counter, count);
 		sb->s_dirt = 1;
 	}
 }
@@ -1328,7 +1328,7 @@ allocated:
 	}
 
 	group_adjust_blocks(sb, group_no, gdp, gdp_bh, -num);
-	percpu_counter_mod(&sbi->s_freeblocks_counter, -num);
+	percpu_counter_add(&sbi->s_freeblocks_counter, -num);
 
 	mark_buffer_dirty(bitmap_bh);
 	if (sb->s_flags & MS_SYNCHRONOUS)
Index: linux-2.6/fs/ext2/ialloc.c
===================================================================
--- linux-2.6.orig/fs/ext2/ialloc.c
+++ linux-2.6/fs/ext2/ialloc.c
@@ -542,7 +542,7 @@ got:
 		goto fail;
 	}
 
-	percpu_counter_mod(&sbi->s_freeinodes_counter, -1);
+	percpu_counter_add(&sbi->s_freeinodes_counter, -1);
 	if (S_ISDIR(mode))
 		percpu_counter_inc(&sbi->s_dirs_counter);
 
Index: linux-2.6/fs/ext3/balloc.c
===================================================================
--- linux-2.6.orig/fs/ext3/balloc.c
+++ linux-2.6/fs/ext3/balloc.c
@@ -570,7 +570,7 @@ do_more:
 		cpu_to_le16(le16_to_cpu(desc->bg_free_blocks_count) +
 			group_freed);
 	spin_unlock(sb_bgl_lock(sbi, block_group));
-	percpu_counter_mod(&sbi->s_freeblocks_counter, count);
+	percpu_counter_add(&sbi->s_freeblocks_counter, count);
 
 	/* We dirtied the bitmap block */
 	BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
@@ -1633,7 +1633,7 @@ allocated:
 	gdp->bg_free_blocks_count =
 			cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count)-num);
 	spin_unlock(sb_bgl_lock(sbi, group_no));
-	percpu_counter_mod(&sbi->s_freeblocks_counter, -num);
+	percpu_counter_add(&sbi->s_freeblocks_counter, -num);
 
 	BUFFER_TRACE(gdp_bh, "journal_dirty_metadata for group descriptor");
 	err = ext3_journal_dirty_metadata(handle, gdp_bh);
Index: linux-2.6/fs/ext3/resize.c
===================================================================
--- linux-2.6.orig/fs/ext3/resize.c
+++ linux-2.6/fs/ext3/resize.c
@@ -884,9 +884,9 @@ int ext3_group_add(struct super_block *s
 		input->reserved_blocks);
 
 	/* Update the free space counts */
-	percpu_counter_mod(&sbi->s_freeblocks_counter,
+	percpu_counter_add(&sbi->s_freeblocks_counter,
 			   input->free_blocks_count);
-	percpu_counter_mod(&sbi->s_freeinodes_counter,
+	percpu_counter_add(&sbi->s_freeinodes_counter,
 			   EXT3_INODES_PER_GROUP(sb));
 
 	ext3_journal_dirty_metadata(handle, sbi->s_sbh);
Index: linux-2.6/fs/ext4/balloc.c
===================================================================
--- linux-2.6.orig/fs/ext4/balloc.c
+++ linux-2.6/fs/ext4/balloc.c
@@ -587,7 +587,7 @@ do_more:
 		cpu_to_le16(le16_to_cpu(desc->bg_free_blocks_count) +
 			group_freed);
 	spin_unlock(sb_bgl_lock(sbi, block_group));
-	percpu_counter_mod(&sbi->s_freeblocks_counter, count);
+	percpu_counter_add(&sbi->s_freeblocks_counter, count);
 
 	/* We dirtied the bitmap block */
 	BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
@@ -1656,7 +1656,7 @@ allocated:
 	gdp->bg_free_blocks_count =
 			cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count)-num);
 	spin_unlock(sb_bgl_lock(sbi, group_no));
-	percpu_counter_mod(&sbi->s_freeblocks_counter, -num);
+	percpu_counter_add(&sbi->s_freeblocks_counter, -num);
 
 	BUFFER_TRACE(gdp_bh, "journal_dirty_metadata for group descriptor");
 	err = ext4_journal_dirty_metadata(handle, gdp_bh);
Index: linux-2.6/fs/ext4/resize.c
===================================================================
--- linux-2.6.orig/fs/ext4/resize.c
+++ linux-2.6/fs/ext4/resize.c
@@ -893,9 +893,9 @@ int ext4_group_add(struct super_block *s
 		input->reserved_blocks);
 
 	/* Update the free space counts */
-	percpu_counter_mod(&sbi->s_freeblocks_counter,
+	percpu_counter_add(&sbi->s_freeblocks_counter,
 			   input->free_blocks_count);
-	percpu_counter_mod(&sbi->s_freeinodes_counter,
+	percpu_counter_add(&sbi->s_freeinodes_counter,
 			   EXT4_INODES_PER_GROUP(sb));
 
 	ext4_journal_dirty_metadata(handle, sbi->s_sbh);
Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h
+++ linux-2.6/include/linux/percpu_counter.h
@@ -32,7 +32,7 @@ struct percpu_counter {
 
 void percpu_counter_init(struct percpu_counter *fbc, s64 amount);
 void percpu_counter_destroy(struct percpu_counter *fbc);
-void percpu_counter_mod(struct percpu_counter *fbc, s32 amount);
+void percpu_counter_add(struct percpu_counter *fbc, s32 amount);
 s64 percpu_counter_sum(struct percpu_counter *fbc);
 
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
@@ -71,7 +71,7 @@ static inline void percpu_counter_destro
 }
 
 static inline void
-percpu_counter_mod(struct percpu_counter *fbc, s32 amount)
+percpu_counter_add(struct percpu_counter *fbc, s32 amount)
 {
 	preempt_disable();
 	fbc->count += amount;
@@ -97,12 +97,12 @@ static inline s64 percpu_counter_sum(str
 
 static inline void percpu_counter_inc(struct percpu_counter *fbc)
 {
-	percpu_counter_mod(fbc, 1);
+	percpu_counter_add(fbc, 1);
 }
 
 static inline void percpu_counter_dec(struct percpu_counter *fbc)
 {
-	percpu_counter_mod(fbc, -1);
+	percpu_counter_add(fbc, -1);
 }
 
 #endif /* _LINUX_PERCPU_COUNTER_H */
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c
+++ linux-2.6/lib/percpu_counter.c
@@ -14,7 +14,7 @@ static LIST_HEAD(percpu_counters);
 static DEFINE_MUTEX(percpu_counters_lock);
 #endif
 
-void percpu_counter_mod(struct percpu_counter *fbc, s32 amount)
+void percpu_counter_add(struct percpu_counter *fbc, s32 amount)
 {
 	long count;
 	s32 *pcount;
@@ -32,7 +32,7 @@ void percpu_counter_mod(struct percpu_co
 	}
 	put_cpu();
 }
-EXPORT_SYMBOL(percpu_counter_mod);
+EXPORT_SYMBOL(percpu_counter_add);
 
 /*
  * Add up all the per-cpu counts, return the result.  This is a more accurate

--


^ permalink raw reply	[flat|nested] 188+ messages in thread

* [PATCH 03/23] lib: percpu_counter variable batch
  2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
  2007-08-03 12:37 ` [PATCH 01/23] nfs: remove congestion_end() Peter Zijlstra
  2007-08-03 12:37 ` [PATCH 02/23] lib: percpu_counter_add Peter Zijlstra
@ 2007-08-03 12:37 ` Peter Zijlstra
  2007-08-03 12:37 ` [PATCH 04/23] lib: make percpu_counter_add take s64 Peter Zijlstra
                   ` (22 subsequent siblings)
  25 siblings, 0 replies; 188+ messages in thread
From: Peter Zijlstra @ 2007-08-03 12:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: percpu_counter_batch.patch --]
[-- Type: text/plain, Size: 2503 bytes --]

Because the current batch setup has an quadric error bound on the counter,
allow for an alternative setup.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/percpu_counter.h |   10 +++++++++-
 lib/percpu_counter.c           |    6 +++---
 2 files changed, 12 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h	2007-05-23 20:34:12.000000000 +0200
+++ linux-2.6/include/linux/percpu_counter.h	2007-05-23 20:36:06.000000000 +0200
@@ -32,9 +32,14 @@ struct percpu_counter {
 
 void percpu_counter_init(struct percpu_counter *fbc, s64 amount);
 void percpu_counter_destroy(struct percpu_counter *fbc);
-void percpu_counter_add(struct percpu_counter *fbc, s32 amount);
+void __percpu_counter_add(struct percpu_counter *fbc, s32 amount, s32 batch);
 s64 percpu_counter_sum(struct percpu_counter *fbc);
 
+static inline void percpu_counter_add(struct percpu_counter *fbc, s32 amount)
+{
+	__percpu_counter_add(fbc, amount, FBC_BATCH);
+}
+
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
 	return fbc->count;
@@ -70,6 +75,9 @@ static inline void percpu_counter_destro
 {
 }
 
+#define __percpu_counter_add(fbc, amount, batch) \
+	percpu_counter_add(fbc, amount)
+
 static inline void
 percpu_counter_add(struct percpu_counter *fbc, s32 amount)
 {
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c	2007-05-23 20:34:12.000000000 +0200
+++ linux-2.6/lib/percpu_counter.c	2007-05-23 20:36:21.000000000 +0200
@@ -14,7 +14,7 @@ static LIST_HEAD(percpu_counters);
 static DEFINE_MUTEX(percpu_counters_lock);
 #endif
 
-void percpu_counter_add(struct percpu_counter *fbc, s32 amount)
+void __percpu_counter_add(struct percpu_counter *fbc, s32 amount, s32 batch)
 {
 	long count;
 	s32 *pcount;
@@ -22,7 +22,7 @@ void percpu_counter_add(struct percpu_co
 
 	pcount = per_cpu_ptr(fbc->counters, cpu);
 	count = *pcount + amount;
-	if (count >= FBC_BATCH || count <= -FBC_BATCH) {
+	if (count >= batch || count <= -batch) {
 		spin_lock(&fbc->lock);
 		fbc->count += count;
 		*pcount = 0;
@@ -32,7 +32,7 @@ void percpu_counter_add(struct percpu_co
 	}
 	put_cpu();
 }
-EXPORT_SYMBOL(percpu_counter_add);
+EXPORT_SYMBOL(__percpu_counter_add);
 
 /*
  * Add up all the per-cpu counts, return the result.  This is a more accurate

--


^ permalink raw reply	[flat|nested] 188+ messages in thread

* [PATCH 04/23] lib: make percpu_counter_add take s64
  2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
                   ` (2 preceding siblings ...)
  2007-08-03 12:37 ` [PATCH 03/23] lib: percpu_counter variable batch Peter Zijlstra
@ 2007-08-03 12:37 ` Peter Zijlstra
  2007-08-03 12:37 ` [PATCH 05/23] lib: percpu_counter_set Peter Zijlstra
                   ` (21 subsequent siblings)
  25 siblings, 0 replies; 188+ messages in thread
From: Peter Zijlstra @ 2007-08-03 12:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: percpu_counter_add64.patch --]
[-- Type: text/plain, Size: 1864 bytes --]

percpu_counter is a s64 counter, make _add consitent.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/percpu_counter.h |    6 +++---
 lib/percpu_counter.c           |    4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h
+++ linux-2.6/include/linux/percpu_counter.h
@@ -32,10 +32,10 @@ struct percpu_counter {
 
 void percpu_counter_init(struct percpu_counter *fbc, s64 amount);
 void percpu_counter_destroy(struct percpu_counter *fbc);
-void __percpu_counter_add(struct percpu_counter *fbc, s32 amount, s32 batch);
+void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
 s64 percpu_counter_sum(struct percpu_counter *fbc);
 
-static inline void percpu_counter_add(struct percpu_counter *fbc, s32 amount)
+static inline void percpu_counter_add(struct percpu_counter *fbc, s64 amount)
 {
 	__percpu_counter_add(fbc, amount, FBC_BATCH);
 }
@@ -79,7 +79,7 @@ static inline void percpu_counter_destro
 	percpu_counter_add(fbc, amount)
 
 static inline void
-percpu_counter_add(struct percpu_counter *fbc, s32 amount)
+percpu_counter_add(struct percpu_counter *fbc, s64 amount)
 {
 	preempt_disable();
 	fbc->count += amount;
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c
+++ linux-2.6/lib/percpu_counter.c
@@ -14,9 +14,9 @@ static LIST_HEAD(percpu_counters);
 static DEFINE_MUTEX(percpu_counters_lock);
 #endif
 
-void __percpu_counter_add(struct percpu_counter *fbc, s32 amount, s32 batch)
+void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
 {
-	long count;
+	s64 count;
 	s32 *pcount;
 	int cpu = get_cpu();
 

--


^ permalink raw reply	[flat|nested] 188+ messages in thread

* [PATCH 05/23] lib: percpu_counter_set
  2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
                   ` (3 preceding siblings ...)
  2007-08-03 12:37 ` [PATCH 04/23] lib: make percpu_counter_add take s64 Peter Zijlstra
@ 2007-08-03 12:37 ` Peter Zijlstra
  2007-08-03 12:37 ` [PATCH 06/23] lib: percpu_counter_sum_positive Peter Zijlstra
                   ` (20 subsequent siblings)
  25 siblings, 0 replies; 188+ messages in thread
From: Peter Zijlstra @ 2007-08-03 12:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: percpu_counter_set.patch --]
[-- Type: text/plain, Size: 1791 bytes --]

Provide a method to set a percpu counter to a specified value.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/percpu_counter.h |    6 ++++++
 lib/percpu_counter.c           |   14 ++++++++++++++
 2 files changed, 20 insertions(+)

Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h
+++ linux-2.6/include/linux/percpu_counter.h
@@ -32,6 +32,7 @@ struct percpu_counter {
 
 void percpu_counter_init(struct percpu_counter *fbc, s64 amount);
 void percpu_counter_destroy(struct percpu_counter *fbc);
+void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
 s64 percpu_counter_sum(struct percpu_counter *fbc);
 
@@ -75,6 +76,11 @@ static inline void percpu_counter_destro
 {
 }
 
+static inline void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
+{
+	fbc->count = amount;
+}
+
 #define __percpu_counter_add(fbc, amount, batch) \
 	percpu_counter_add(fbc, amount)
 
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c
+++ linux-2.6/lib/percpu_counter.c
@@ -14,6 +14,20 @@ static LIST_HEAD(percpu_counters);
 static DEFINE_MUTEX(percpu_counters_lock);
 #endif
 
+void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
+{
+	int cpu;
+
+	spin_lock(&fbc->lock);
+	for_each_possible_cpu(cpu) {
+		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+		*pcount = 0;
+	}
+	fbc->count = amount;
+	spin_unlock(&fbc->lock);
+}
+EXPORT_SYMBOL(percpu_counter_set);
+
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
 {
 	s64 count;

--


^ permalink raw reply	[flat|nested] 188+ messages in thread

* [PATCH 06/23] lib: percpu_counter_sum_positive
  2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
                   ` (4 preceding siblings ...)
  2007-08-03 12:37 ` [PATCH 05/23] lib: percpu_counter_set Peter Zijlstra
@ 2007-08-03 12:37 ` Peter Zijlstra
  2007-08-03 12:37 ` [PATCH 07/23] lib: percpu_count_sum() Peter Zijlstra
                   ` (19 subsequent siblings)
  25 siblings, 0 replies; 188+ messages in thread
From: Peter Zijlstra @ 2007-08-03 12:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: percpu_counter_sum_positive.patch --]
[-- Type: text/plain, Size: 4665 bytes --]

 s/percpu_counter_sum/&_positive/

Because its consitent with percpu_counter_read*

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/ext3/super.c                |    4 ++--
 fs/ext4/super.c                |    4 ++--
 fs/file_table.c                |    2 +-
 include/linux/percpu_counter.h |    4 ++--
 lib/percpu_counter.c           |    4 ++--
 5 files changed, 9 insertions(+), 9 deletions(-)

Index: linux-2.6/fs/ext3/super.c
===================================================================
--- linux-2.6.orig/fs/ext3/super.c
+++ linux-2.6/fs/ext3/super.c
@@ -2472,13 +2472,13 @@ static int ext3_statfs (struct dentry * 
 	buf->f_type = EXT3_SUPER_MAGIC;
 	buf->f_bsize = sb->s_blocksize;
 	buf->f_blocks = le32_to_cpu(es->s_blocks_count) - sbi->s_overhead_last;
-	buf->f_bfree = percpu_counter_sum(&sbi->s_freeblocks_counter);
+	buf->f_bfree = percpu_counter_sum_positive(&sbi->s_freeblocks_counter);
 	es->s_free_blocks_count = cpu_to_le32(buf->f_bfree);
 	buf->f_bavail = buf->f_bfree - le32_to_cpu(es->s_r_blocks_count);
 	if (buf->f_bfree < le32_to_cpu(es->s_r_blocks_count))
 		buf->f_bavail = 0;
 	buf->f_files = le32_to_cpu(es->s_inodes_count);
-	buf->f_ffree = percpu_counter_sum(&sbi->s_freeinodes_counter);
+	buf->f_ffree = percpu_counter_sum_positive(&sbi->s_freeinodes_counter);
 	es->s_free_inodes_count = cpu_to_le32(buf->f_ffree);
 	buf->f_namelen = EXT3_NAME_LEN;
 	fsid = le64_to_cpup((void *)es->s_uuid) ^
Index: linux-2.6/fs/ext4/super.c
===================================================================
--- linux-2.6.orig/fs/ext4/super.c
+++ linux-2.6/fs/ext4/super.c
@@ -2592,13 +2592,13 @@ static int ext4_statfs (struct dentry * 
 	buf->f_type = EXT4_SUPER_MAGIC;
 	buf->f_bsize = sb->s_blocksize;
 	buf->f_blocks = ext4_blocks_count(es) - sbi->s_overhead_last;
-	buf->f_bfree = percpu_counter_sum(&sbi->s_freeblocks_counter);
+	buf->f_bfree = percpu_counter_sum_positive(&sbi->s_freeblocks_counter);
 	es->s_free_blocks_count = cpu_to_le32(buf->f_bfree);
 	buf->f_bavail = buf->f_bfree - ext4_r_blocks_count(es);
 	if (buf->f_bfree < ext4_r_blocks_count(es))
 		buf->f_bavail = 0;
 	buf->f_files = le32_to_cpu(es->s_inodes_count);
-	buf->f_ffree = percpu_counter_sum(&sbi->s_freeinodes_counter);
+	buf->f_ffree = percpu_counter_sum_positive(&sbi->s_freeinodes_counter);
 	es->s_free_inodes_count = cpu_to_le32(buf->f_ffree);
 	buf->f_namelen = EXT4_NAME_LEN;
 	fsid = le64_to_cpup((void *)es->s_uuid) ^
Index: linux-2.6/fs/file_table.c
===================================================================
--- linux-2.6.orig/fs/file_table.c
+++ linux-2.6/fs/file_table.c
@@ -98,7 +98,7 @@ struct file *get_empty_filp(void)
 		 * percpu_counters are inaccurate.  Do an expensive check before
 		 * we go and fail.
 		 */
-		if (percpu_counter_sum(&nr_files) >= files_stat.max_files)
+		if (percpu_counter_sum_positive(&nr_files) >= files_stat.max_files)
 			goto over;
 	}
 
Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h
+++ linux-2.6/include/linux/percpu_counter.h
@@ -34,7 +34,7 @@ void percpu_counter_init(struct percpu_c
 void percpu_counter_destroy(struct percpu_counter *fbc);
 void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
-s64 percpu_counter_sum(struct percpu_counter *fbc);
+s64 percpu_counter_sum_positive(struct percpu_counter *fbc);
 
 static inline void percpu_counter_add(struct percpu_counter *fbc, s64 amount)
 {
@@ -102,7 +102,7 @@ static inline s64 percpu_counter_read_po
 	return fbc->count;
 }
 
-static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
+static inline s64 percpu_counter_sum_positive(struct percpu_counter *fbc)
 {
 	return percpu_counter_read_positive(fbc);
 }
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c
+++ linux-2.6/lib/percpu_counter.c
@@ -52,7 +52,7 @@ EXPORT_SYMBOL(__percpu_counter_add);
  * Add up all the per-cpu counts, return the result.  This is a more accurate
  * but much slower version of percpu_counter_read_positive()
  */
-s64 percpu_counter_sum(struct percpu_counter *fbc)
+s64 percpu_counter_sum_positive(struct percpu_counter *fbc)
 {
 	s64 ret;
 	int cpu;
@@ -66,7 +66,7 @@ s64 percpu_counter_sum(struct percpu_cou
 	spin_unlock(&fbc->lock);
 	return ret < 0 ? 0 : ret;
 }
-EXPORT_SYMBOL(percpu_counter_sum);
+EXPORT_SYMBOL(percpu_counter_sum_positive);
 
 void percpu_counter_init(struct percpu_counter *fbc, s64 amount)
 {

--


^ permalink raw reply	[flat|nested] 188+ messages in thread

* [PATCH 07/23] lib: percpu_count_sum()
  2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
                   ` (5 preceding siblings ...)
  2007-08-03 12:37 ` [PATCH 06/23] lib: percpu_counter_sum_positive Peter Zijlstra
@ 2007-08-03 12:37 ` Peter Zijlstra
  2007-08-03 12:37 ` [PATCH 08/23] lib: percpu_counter_init error handling Peter Zijlstra
                   ` (18 subsequent siblings)
  25 siblings, 0 replies; 188+ messages in thread
From: Peter Zijlstra @ 2007-08-03 12:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: percpu_counter_sum.patch --]
[-- Type: text/plain, Size: 2497 bytes --]

Provide an accurate version of percpu_counter_read.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/percpu_counter.h |   18 +++++++++++++++++-
 lib/percpu_counter.c           |    6 +++---
 2 files changed, 20 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h
+++ linux-2.6/include/linux/percpu_counter.h
@@ -34,13 +34,24 @@ void percpu_counter_init(struct percpu_c
 void percpu_counter_destroy(struct percpu_counter *fbc);
 void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
-s64 percpu_counter_sum_positive(struct percpu_counter *fbc);
+s64 __percpu_counter_sum(struct percpu_counter *fbc);
 
 static inline void percpu_counter_add(struct percpu_counter *fbc, s64 amount)
 {
 	__percpu_counter_add(fbc, amount, FBC_BATCH);
 }
 
+static inline s64 percpu_counter_sum_positive(struct percpu_counter *fbc)
+{
+	s64 ret = __percpu_counter_sum(fbc);
+	return ret < 0 ? 0 : ret;
+}
+
+static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
+{
+	return __percpu_counter_sum(fbc);
+}
+
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
 	return fbc->count;
@@ -107,6 +118,11 @@ static inline s64 percpu_counter_sum_pos
 	return percpu_counter_read_positive(fbc);
 }
 
+static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
+{
+	return percpu_counter_read(fbc);
+}
+
 #endif	/* CONFIG_SMP */
 
 static inline void percpu_counter_inc(struct percpu_counter *fbc)
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c
+++ linux-2.6/lib/percpu_counter.c
@@ -52,7 +52,7 @@ EXPORT_SYMBOL(__percpu_counter_add);
  * Add up all the per-cpu counts, return the result.  This is a more accurate
  * but much slower version of percpu_counter_read_positive()
  */
-s64 percpu_counter_sum_positive(struct percpu_counter *fbc)
+s64 __percpu_counter_sum(struct percpu_counter *fbc)
 {
 	s64 ret;
 	int cpu;
@@ -64,9 +64,9 @@ s64 percpu_counter_sum_positive(struct p
 		ret += *pcount;
 	}
 	spin_unlock(&fbc->lock);
-	return ret < 0 ? 0 : ret;
+	return ret;
 }
-EXPORT_SYMBOL(percpu_counter_sum_positive);
+EXPORT_SYMBOL(__percpu_counter_sum);
 
 void percpu_counter_init(struct percpu_counter *fbc, s64 amount)
 {

--


^ permalink raw reply	[flat|nested] 188+ messages in thread

* [PATCH 08/23] lib: percpu_counter_init error handling
  2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
                   ` (6 preceding siblings ...)
  2007-08-03 12:37 ` [PATCH 07/23] lib: percpu_count_sum() Peter Zijlstra
@ 2007-08-03 12:37 ` Peter Zijlstra
  2007-08-03 12:37 ` [PATCH 09/23] lib: percpu_counter_init_irq Peter Zijlstra
                   ` (17 subsequent siblings)
  25 siblings, 0 replies; 188+ messages in thread
From: Peter Zijlstra @ 2007-08-03 12:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: percpu_counter_init.patch --]
[-- Type: text/plain, Size: 5592 bytes --]

alloc_percpu can fail, propagate that error.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/ext2/super.c                |   11 ++++++++---
 fs/ext3/super.c                |   11 ++++++++---
 fs/ext4/super.c                |   11 ++++++++---
 include/linux/percpu_counter.h |    5 +++--
 lib/percpu_counter.c           |    8 +++++++-
 5 files changed, 34 insertions(+), 12 deletions(-)

Index: linux-2.6/fs/ext2/super.c
===================================================================
--- linux-2.6.orig/fs/ext2/super.c
+++ linux-2.6/fs/ext2/super.c
@@ -725,6 +725,7 @@ static int ext2_fill_super(struct super_
 	int db_count;
 	int i, j;
 	__le32 features;
+	int err;
 
 	sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
 	if (!sbi)
@@ -996,12 +997,16 @@ static int ext2_fill_super(struct super_
 	sbi->s_rsv_window_head.rsv_goal_size = 0;
 	ext2_rsv_window_add(sb, &sbi->s_rsv_window_head);
 
-	percpu_counter_init(&sbi->s_freeblocks_counter,
+	err = percpu_counter_init(&sbi->s_freeblocks_counter,
 				ext2_count_free_blocks(sb));
-	percpu_counter_init(&sbi->s_freeinodes_counter,
+	err |= percpu_counter_init(&sbi->s_freeinodes_counter,
 				ext2_count_free_inodes(sb));
-	percpu_counter_init(&sbi->s_dirs_counter,
+	err |= percpu_counter_init(&sbi->s_dirs_counter,
 				ext2_count_dirs(sb));
+	if (err) {
+		printk(KERN_ERR "EXT2-fs: insufficient memory\n");
+		goto failed_mount3;
+	}
 	/*
 	 * set up enough so that it can read an inode
 	 */
Index: linux-2.6/fs/ext3/super.c
===================================================================
--- linux-2.6.orig/fs/ext3/super.c
+++ linux-2.6/fs/ext3/super.c
@@ -1485,6 +1485,7 @@ static int ext3_fill_super (struct super
 	int i;
 	int needs_recovery;
 	__le32 features;
+	int err;
 
 	sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
 	if (!sbi)
@@ -1745,12 +1746,16 @@ static int ext3_fill_super (struct super
 	get_random_bytes(&sbi->s_next_generation, sizeof(u32));
 	spin_lock_init(&sbi->s_next_gen_lock);
 
-	percpu_counter_init(&sbi->s_freeblocks_counter,
+	err = percpu_counter_init(&sbi->s_freeblocks_counter,
 		ext3_count_free_blocks(sb));
-	percpu_counter_init(&sbi->s_freeinodes_counter,
+	err |= percpu_counter_init(&sbi->s_freeinodes_counter,
 		ext3_count_free_inodes(sb));
-	percpu_counter_init(&sbi->s_dirs_counter,
+	err |= percpu_counter_init(&sbi->s_dirs_counter,
 		ext3_count_dirs(sb));
+	if (err) {
+		printk(KERN_ERR "EXT3-fs: insufficient memory\n");
+		goto failed_mount3;
+	}
 
 	/* per fileystem reservation list head & lock */
 	spin_lock_init(&sbi->s_rsv_window_lock);
Index: linux-2.6/fs/ext4/super.c
===================================================================
--- linux-2.6.orig/fs/ext4/super.c
+++ linux-2.6/fs/ext4/super.c
@@ -1576,6 +1576,7 @@ static int ext4_fill_super (struct super
 	int needs_recovery;
 	__le32 features;
 	__u64 blocks_count;
+	int err;
 
 	sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
 	if (!sbi)
@@ -1857,12 +1858,16 @@ static int ext4_fill_super (struct super
 	get_random_bytes(&sbi->s_next_generation, sizeof(u32));
 	spin_lock_init(&sbi->s_next_gen_lock);
 
-	percpu_counter_init(&sbi->s_freeblocks_counter,
+	err = percpu_counter_init(&sbi->s_freeblocks_counter,
 		ext4_count_free_blocks(sb));
-	percpu_counter_init(&sbi->s_freeinodes_counter,
+	err |= percpu_counter_init(&sbi->s_freeinodes_counter,
 		ext4_count_free_inodes(sb));
-	percpu_counter_init(&sbi->s_dirs_counter,
+	err |= percpu_counter_init(&sbi->s_dirs_counter,
 		ext4_count_dirs(sb));
+	if (err) {
+		printk(KERN_ERR "EXT4-fs: insufficient memory\n");
+		goto failed_mount3;
+	}
 
 	/* per fileystem reservation list head & lock */
 	spin_lock_init(&sbi->s_rsv_window_lock);
Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h
+++ linux-2.6/include/linux/percpu_counter.h
@@ -30,7 +30,7 @@ struct percpu_counter {
 #define FBC_BATCH	(NR_CPUS*4)
 #endif
 
-void percpu_counter_init(struct percpu_counter *fbc, s64 amount);
+int percpu_counter_init(struct percpu_counter *fbc, s64 amount);
 void percpu_counter_destroy(struct percpu_counter *fbc);
 void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
@@ -78,9 +78,10 @@ struct percpu_counter {
 	s64 count;
 };
 
-static inline void percpu_counter_init(struct percpu_counter *fbc, s64 amount)
+static inline int percpu_counter_init(struct percpu_counter *fbc, s64 amount)
 {
 	fbc->count = amount;
+	return 0;
 }
 
 static inline void percpu_counter_destroy(struct percpu_counter *fbc)
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c
+++ linux-2.6/lib/percpu_counter.c
@@ -68,21 +68,27 @@ s64 __percpu_counter_sum(struct percpu_c
 }
 EXPORT_SYMBOL(__percpu_counter_sum);
 
-void percpu_counter_init(struct percpu_counter *fbc, s64 amount)
+int percpu_counter_init(struct percpu_counter *fbc, s64 amount)
 {
 	spin_lock_init(&fbc->lock);
 	fbc->count = amount;
 	fbc->counters = alloc_percpu(s32);
+	if (!fbc->counters)
+		return -ENOMEM;
 #ifdef CONFIG_HOTPLUG_CPU
 	mutex_lock(&percpu_counters_lock);
 	list_add(&fbc->list, &percpu_counters);
 	mutex_unlock(&percpu_counters_lock);
 #endif
+	return 0;
 }
 EXPORT_SYMBOL(percpu_counter_init);
 
 void percpu_counter_destroy(struct percpu_counter *fbc)
 {
+	if (!fbc->counters)
+		return;
+
 	free_percpu(fbc->counters);
 #ifdef CONFIG_HOTPLUG_CPU
 	mutex_lock(&percpu_counters_lock);

--


^ permalink raw reply	[flat|nested] 188+ messages in thread

* [PATCH 09/23] lib: percpu_counter_init_irq
  2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
                   ` (7 preceding siblings ...)
  2007-08-03 12:37 ` [PATCH 08/23] lib: percpu_counter_init error handling Peter Zijlstra
@ 2007-08-03 12:37 ` Peter Zijlstra
  2007-08-03 12:37 ` [PATCH 10/23] mm: bdi init hooks Peter Zijlstra
                   ` (16 subsequent siblings)
  25 siblings, 0 replies; 188+ messages in thread
From: Peter Zijlstra @ 2007-08-03 12:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: percpu_counter_init_irq.patch --]
[-- Type: text/plain, Size: 1935 bytes --]

provide a way to init percpu_counters that are supposed to be used from irq
safe contexts.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/percpu_counter.h |    3 +++
 lib/percpu_counter.c           |   12 ++++++++++++
 2 files changed, 15 insertions(+)

Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h
+++ linux-2.6/include/linux/percpu_counter.h
@@ -31,6 +31,7 @@ struct percpu_counter {
 #endif
 
 int percpu_counter_init(struct percpu_counter *fbc, s64 amount);
+int percpu_counter_init_irq(struct percpu_counter *fbc, s64 amount);
 void percpu_counter_destroy(struct percpu_counter *fbc);
 void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
@@ -84,6 +85,8 @@ static inline int percpu_counter_init(st
 	return 0;
 }
 
+#define percpu_counter_init_irq percpu_counter_init
+
 static inline void percpu_counter_destroy(struct percpu_counter *fbc)
 {
 }
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c
+++ linux-2.6/lib/percpu_counter.c
@@ -68,6 +68,8 @@ s64 __percpu_counter_sum(struct percpu_c
 }
 EXPORT_SYMBOL(__percpu_counter_sum);
 
+static struct lock_class_key percpu_counter_irqsafe;
+
 int percpu_counter_init(struct percpu_counter *fbc, s64 amount)
 {
 	spin_lock_init(&fbc->lock);
@@ -84,6 +86,16 @@ int percpu_counter_init(struct percpu_co
 }
 EXPORT_SYMBOL(percpu_counter_init);
 
+int percpu_counter_init_irq(struct percpu_counter *fbc, s64 amount)
+{
+	int err;
+
+	err = percpu_counter_init(fbc, amount);
+	if (!err)
+		lockdep_set_class(&fbc->lock, &percpu_counter_irqsafe);
+	return err;
+}
+
 void percpu_counter_destroy(struct percpu_counter *fbc)
 {
 	if (!fbc->counters)

--


^ permalink raw reply	[flat|nested] 188+ messages in thread

* [PATCH 10/23] mm: bdi init hooks
  2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
                   ` (8 preceding siblings ...)
  2007-08-03 12:37 ` [PATCH 09/23] lib: percpu_counter_init_irq Peter Zijlstra
@ 2007-08-03 12:37 ` Peter Zijlstra
  2007-08-03 12:37 ` [PATCH 11/23] containers: " Peter Zijlstra
                   ` (15 subsequent siblings)
  25 siblings, 0 replies; 188+ messages in thread
From: Peter Zijlstra @ 2007-08-03 12:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: bdi_init.patch --]
[-- Type: text/plain, Size: 14221 bytes --]

provide BDI constructor/destructor hooks

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 block/ll_rw_blk.c               |   13 ++++++++++---
 drivers/block/rd.c              |   20 +++++++++++++++++++-
 drivers/char/mem.c              |    5 +++++
 fs/char_dev.c                   |    1 +
 fs/configfs/configfs_internal.h |    2 ++
 fs/configfs/inode.c             |    8 ++++++++
 fs/configfs/mount.c             |    9 +++++++++
 fs/fuse/inode.c                 |    9 +++++++++
 fs/hugetlbfs/inode.c            |    9 ++++++++-
 fs/nfs/client.c                 |    6 ++++++
 fs/ocfs2/dlm/dlmfs.c            |    9 ++++++++-
 fs/ramfs/inode.c                |   12 +++++++++++-
 fs/sysfs/inode.c                |    5 +++++
 fs/sysfs/mount.c                |    4 ++++
 fs/sysfs/sysfs.h                |    1 +
 include/linux/backing-dev.h     |    8 ++++++++
 mm/readahead.c                  |    6 ++++++
 mm/shmem.c                      |    6 ++++++
 mm/swap.c                       |    4 ++++
 19 files changed, 130 insertions(+), 7 deletions(-)

Index: linux-2.6/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/block/ll_rw_blk.c
+++ linux-2.6/block/ll_rw_blk.c
@@ -1780,6 +1780,7 @@ static void blk_release_queue(struct kob
 
 	blk_trace_shutdown(q);
 
+	bdi_destroy(&q->backing_dev_info);
 	kmem_cache_free(requestq_cachep, q);
 }
 
@@ -1833,21 +1834,27 @@ static struct kobj_type queue_ktype;
 struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 {
 	struct request_queue *q;
+	int err;
 
 	q = kmem_cache_alloc_node(requestq_cachep,
 				gfp_mask | __GFP_ZERO, node_id);
 	if (!q)
 		return NULL;
 
+	q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
+	q->backing_dev_info.unplug_io_data = q;
+	err = bdi_init(&q->backing_dev_info);
+	if (err) {
+		kmem_cache_free(requestq_cachep, q);
+		return NULL;
+	}
+
 	init_timer(&q->unplug_timer);
 
 	snprintf(q->kobj.name, KOBJ_NAME_LEN, "%s", "queue");
 	q->kobj.ktype = &queue_ktype;
 	kobject_init(&q->kobj);
 
-	q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
-	q->backing_dev_info.unplug_io_data = q;
-
 	mutex_init(&q->sysfs_lock);
 
 	return q;
Index: linux-2.6/drivers/block/rd.c
===================================================================
--- linux-2.6.orig/drivers/block/rd.c
+++ linux-2.6/drivers/block/rd.c
@@ -411,6 +411,9 @@ static void __exit rd_cleanup(void)
 		blk_cleanup_queue(rd_queue[i]);
 	}
 	unregister_blkdev(RAMDISK_MAJOR, "ramdisk");
+
+	bdi_destroy(&rd_file_backing_dev_info);
+	bdi_destroy(&rd_backing_dev_info);
 }
 
 /*
@@ -419,7 +422,19 @@ static void __exit rd_cleanup(void)
 static int __init rd_init(void)
 {
 	int i;
-	int err = -ENOMEM;
+	int err;
+
+	err = bdi_init(&rd_backing_dev_info);
+	if (err)
+		goto out2;
+
+	err = bdi_init(&rd_file_backing_dev_info);
+	if (err) {
+		bdi_destroy(&rd_backing_dev_info);
+		goto out2;
+	}
+
+	err = -ENOMEM;
 
 	if (rd_blocksize > PAGE_SIZE || rd_blocksize < 512 ||
 			(rd_blocksize & (rd_blocksize-1))) {
@@ -473,6 +488,9 @@ out:
 		put_disk(rd_disks[i]);
 		blk_cleanup_queue(rd_queue[i]);
 	}
+	bdi_destroy(&rd_backing_dev_info);
+	bdi_destroy(&rd_file_backing_dev_info);
+out2:
 	return err;
 }
 
Index: linux-2.6/drivers/char/mem.c
===================================================================
--- linux-2.6.orig/drivers/char/mem.c
+++ linux-2.6/drivers/char/mem.c
@@ -984,6 +984,11 @@ static struct class *mem_class;
 static int __init chr_dev_init(void)
 {
 	int i;
+	int err;
+
+	err = bdi_init(&zero_bdi);
+	if (err)
+		return err;
 
 	if (register_chrdev(MEM_MAJOR,"mem",&memory_fops))
 		printk("unable to get major %d for memory devs\n", MEM_MAJOR);
Index: linux-2.6/fs/char_dev.c
===================================================================
--- linux-2.6.orig/fs/char_dev.c
+++ linux-2.6/fs/char_dev.c
@@ -545,6 +545,7 @@ static struct kobject *base_probe(dev_t 
 void __init chrdev_init(void)
 {
 	cdev_map = kobj_map_init(base_probe, &chrdevs_lock);
+	bdi_init(&directly_mappable_cdev_bdi);
 }
 
 
Index: linux-2.6/fs/fuse/inode.c
===================================================================
--- linux-2.6.orig/fs/fuse/inode.c
+++ linux-2.6/fs/fuse/inode.c
@@ -418,6 +418,7 @@ static int fuse_show_options(struct seq_
 static struct fuse_conn *new_conn(void)
 {
 	struct fuse_conn *fc;
+	int err;
 
 	fc = kzalloc(sizeof(*fc), GFP_KERNEL);
 	if (fc) {
@@ -433,10 +434,17 @@ static struct fuse_conn *new_conn(void)
 		atomic_set(&fc->num_waiting, 0);
 		fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
 		fc->bdi.unplug_io_fn = default_unplug_io_fn;
+		err = bdi_init(&fc->bdi);
+		if (err) {
+			kfree(fc);
+			fc = NULL;
+			goto out;
+		}
 		fc->reqctr = 0;
 		fc->blocked = 1;
 		get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
 	}
+out:
 	return fc;
 }
 
@@ -446,6 +454,7 @@ void fuse_conn_put(struct fuse_conn *fc)
 		if (fc->destroy_req)
 			fuse_request_free(fc->destroy_req);
 		mutex_destroy(&fc->inst_mutex);
+		bdi_destroy(&fc->bdi);
 		kfree(fc);
 	}
 }
Index: linux-2.6/fs/nfs/client.c
===================================================================
--- linux-2.6.orig/fs/nfs/client.c
+++ linux-2.6/fs/nfs/client.c
@@ -632,6 +632,7 @@ static void nfs_server_set_fsinfo(struct
 	if (server->rsize > NFS_MAX_FILE_IO_SIZE)
 		server->rsize = NFS_MAX_FILE_IO_SIZE;
 	server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+
 	server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
 
 	if (server->wsize > max_rpc_payload)
@@ -682,6 +683,10 @@ static int nfs_probe_fsinfo(struct nfs_s
 		goto out_error;
 
 	nfs_server_set_fsinfo(server, &fsinfo);
+	error = bdi_init(&server->backing_dev_info);
+	if (error)
+		goto out_error;
+
 
 	/* Get some general file system info */
 	if (server->namelen == 0) {
@@ -761,6 +766,7 @@ void nfs_free_server(struct nfs_server *
 	nfs_put_client(server->nfs_client);
 
 	nfs_free_iostats(server->io_stats);
+	bdi_destroy(&server->backing_dev_info);
 	kfree(server);
 	nfs_release_automount_timer();
 	dprintk("<-- nfs_free_server()\n");
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -34,6 +34,14 @@ struct backing_dev_info {
 	void *unplug_io_data;
 };
 
+static inline int bdi_init(struct backing_dev_info *bdi)
+{
+	return 0;
+}
+
+static inline void bdi_destroy(struct backing_dev_info *bdi)
+{
+}
 
 /*
  * Flags in backing_dev_info::capability
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -965,11 +965,15 @@ static int __init init_hugetlbfs_fs(void
 	int error;
 	struct vfsmount *vfsmount;
 
+	error = bdi_init(&hugetlbfs_backing_dev_info);
+	if (error)
+		return error;
+
 	hugetlbfs_inode_cachep = kmem_cache_create("hugetlbfs_inode_cache",
 					sizeof(struct hugetlbfs_inode_info),
 					0, 0, init_once);
 	if (hugetlbfs_inode_cachep == NULL)
-		return -ENOMEM;
+		return out2;
 
 	error = register_filesystem(&hugetlbfs_fs_type);
 	if (error)
@@ -987,6 +991,8 @@ static int __init init_hugetlbfs_fs(void
  out:
 	if (error)
 		kmem_cache_destroy(hugetlbfs_inode_cachep);
+ out2:
+	bdi_destroy(&hugetlbfs_backing_dev_info);
 	return error;
 }
 
@@ -994,6 +1000,7 @@ static void __exit exit_hugetlbfs_fs(voi
 {
 	kmem_cache_destroy(hugetlbfs_inode_cachep);
 	unregister_filesystem(&hugetlbfs_fs_type);
+	bdi_destroy(&hugetlbfs_backing_dev_info);
 }
 
 module_init(init_hugetlbfs_fs)
Index: linux-2.6/fs/ocfs2/dlm/dlmfs.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/dlm/dlmfs.c
+++ linux-2.6/fs/ocfs2/dlm/dlmfs.c
@@ -588,13 +588,17 @@ static int __init init_dlmfs_fs(void)
 
 	dlmfs_print_version();
 
+	status = bdi_init(&dlmfs_backing_dev_info);
+	if (status)
+		return status;
+
 	dlmfs_inode_cache = kmem_cache_create("dlmfs_inode_cache",
 				sizeof(struct dlmfs_inode_private),
 				0, (SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT|
 					SLAB_MEM_SPREAD),
 				dlmfs_init_once);
 	if (!dlmfs_inode_cache)
-		return -ENOMEM;
+		goto bail;
 	cleanup_inode = 1;
 
 	user_dlm_worker = create_singlethread_workqueue("user_dlm");
@@ -611,6 +615,7 @@ bail:
 			kmem_cache_destroy(dlmfs_inode_cache);
 		if (cleanup_worker)
 			destroy_workqueue(user_dlm_worker);
+		bdi_destroy(&dlmfs_backing_dev_info);
 	} else
 		printk("OCFS2 User DLM kernel interface loaded\n");
 	return status;
@@ -624,6 +629,8 @@ static void __exit exit_dlmfs_fs(void)
 	destroy_workqueue(user_dlm_worker);
 
 	kmem_cache_destroy(dlmfs_inode_cache);
+
+	bdi_destroy(&dlmfs_backing_dev_info);
 }
 
 MODULE_AUTHOR("Oracle");
Index: linux-2.6/fs/configfs/configfs_internal.h
===================================================================
--- linux-2.6.orig/fs/configfs/configfs_internal.h
+++ linux-2.6/fs/configfs/configfs_internal.h
@@ -56,6 +56,8 @@ extern int configfs_is_root(struct confi
 
 extern struct inode * configfs_new_inode(mode_t mode, struct configfs_dirent *);
 extern int configfs_create(struct dentry *, int mode, int (*init)(struct inode *));
+extern int configfs_inode_init(void);
+extern void configfs_inode_exit(void);
 
 extern int configfs_create_file(struct config_item *, const struct configfs_attribute *);
 extern int configfs_make_dirent(struct configfs_dirent *,
Index: linux-2.6/fs/configfs/inode.c
===================================================================
--- linux-2.6.orig/fs/configfs/inode.c
+++ linux-2.6/fs/configfs/inode.c
@@ -256,4 +256,12 @@ void configfs_hash_and_remove(struct den
 	mutex_unlock(&dir->d_inode->i_mutex);
 }
 
+int __init configfs_inode_init(void)
+{
+	return bdi_init(&configfs_backing_dev_info);
+}
 
+void __exit configfs_inode_exit(void)
+{
+	bdi_destroy(&configfs_backing_dev_info);
+}
Index: linux-2.6/fs/configfs/mount.c
===================================================================
--- linux-2.6.orig/fs/configfs/mount.c
+++ linux-2.6/fs/configfs/mount.c
@@ -154,8 +154,16 @@ static int __init configfs_init(void)
 		subsystem_unregister(&config_subsys);
 		kmem_cache_destroy(configfs_dir_cachep);
 		configfs_dir_cachep = NULL;
+		goto out;
 	}
 
+	err = configfs_inode_init();
+	if (err) {
+		unregister_filesystem(&configfs_fs_type);
+		subsystem_unregister(&config_subsys);
+		kmem_cache_destroy(configfs_dir_cachep);
+		configfs_dir_cachep = NULL;
+	}
 out:
 	return err;
 }
@@ -166,6 +174,7 @@ static void __exit configfs_exit(void)
 	subsystem_unregister(&config_subsys);
 	kmem_cache_destroy(configfs_dir_cachep);
 	configfs_dir_cachep = NULL;
+	configfs_inode_exit();
 }
 
 MODULE_AUTHOR("Oracle");
Index: linux-2.6/fs/ramfs/inode.c
===================================================================
--- linux-2.6.orig/fs/ramfs/inode.c
+++ linux-2.6/fs/ramfs/inode.c
@@ -223,7 +223,17 @@ module_exit(exit_ramfs_fs)
 
 int __init init_rootfs(void)
 {
-	return register_filesystem(&rootfs_fs_type);
+	int err;
+
+	err = bdi_init(&ramfs_backing_dev_info);
+	if (err)
+		return err;
+
+	err = register_filesystem(&rootfs_fs_type);
+	if (err)
+		bdi_destroy(&ramfs_backing_dev_info);
+
+	return err;
 }
 
 MODULE_LICENSE("GPL");
Index: linux-2.6/fs/sysfs/inode.c
===================================================================
--- linux-2.6.orig/fs/sysfs/inode.c
+++ linux-2.6/fs/sysfs/inode.c
@@ -34,6 +34,11 @@ static const struct inode_operations sys
 	.setattr	= sysfs_setattr,
 };
 
+int __init sysfs_inode_init(void)
+{
+	return bdi_init(&sysfs_backing_dev_info);
+}
+
 int sysfs_setattr(struct dentry * dentry, struct iattr * iattr)
 {
 	struct inode * inode = dentry->d_inode;
Index: linux-2.6/fs/sysfs/mount.c
===================================================================
--- linux-2.6.orig/fs/sysfs/mount.c
+++ linux-2.6/fs/sysfs/mount.c
@@ -90,6 +90,10 @@ int __init sysfs_init(void)
 	if (!sysfs_dir_cachep)
 		goto out;
 
+	err = sysfs_inode_init();
+	if (err)
+		goto out_err;
+
 	err = register_filesystem(&sysfs_fs_type);
 	if (!err) {
 		sysfs_mount = kern_mount(&sysfs_fs_type);
Index: linux-2.6/fs/sysfs/sysfs.h
===================================================================
--- linux-2.6.orig/fs/sysfs/sysfs.h
+++ linux-2.6/fs/sysfs/sysfs.h
@@ -78,6 +78,7 @@ extern int sysfs_addrm_finish(struct sys
 
 extern struct inode * sysfs_get_inode(struct sysfs_dirent *sd);
 extern void sysfs_instantiate(struct dentry *dentry, struct inode *inode);
+extern int sysfs_inode_init(void);
 
 extern void release_sysfs_dirent(struct sysfs_dirent * sd);
 extern struct sysfs_dirent *sysfs_find_dirent(struct sysfs_dirent *parent_sd,
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -2460,6 +2460,10 @@ static int __init init_tmpfs(void)
 {
 	int error;
 
+	error = bdi_init(&shmem_backing_dev_info);
+	if (error)
+		goto out4;
+
 	error = init_inodecache();
 	if (error)
 		goto out3;
@@ -2484,6 +2488,8 @@ out1:
 out2:
 	destroy_inodecache();
 out3:
+	bdi_destroy(&shmem_backing_dev_info);
+out4:
 	shm_mnt = ERR_PTR(error);
 	return error;
 }
Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c
+++ linux-2.6/mm/swap.c
@@ -548,6 +548,10 @@ void __init swap_setup(void)
 {
 	unsigned long megs = num_physpages >> (20 - PAGE_SHIFT);
 
+#ifdef CONFIG_SWAP
+	bdi_init(swapper_space.backing_dev_info);
+#endif
+
 	/* Use a smaller cluster for small-memory machines */
 	if (megs < 16)
 		page_cluster = 2;
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c
+++ linux-2.6/mm/readahead.c
@@ -234,6 +234,12 @@ unsigned long max_sane_readahead(unsigne
 		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
 }
 
+static int __init readahead_init(void)
+{
+	return bdi_init(&default_backing_dev_info);
+}
+subsys_initcall(readahead_init);
+
 /*
  * Submit IO for the read-ahead request in file_ra_state.
  */

--


^ permalink raw reply	[flat|nested] 188+ messages in thread

* [PATCH 11/23] containers: bdi init hooks
  2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
                   ` (9 preceding siblings ...)
  2007-08-03 12:37 ` [PATCH 10/23] mm: bdi init hooks Peter Zijlstra
@ 2007-08-03 12:37 ` Peter Zijlstra
  2007-08-03 12:37 ` [PATCH 12/23] mtd: " Peter Zijlstra
                   ` (14 subsequent siblings)
  25 siblings, 0 replies; 188+ messages in thread
From: Peter Zijlstra @ 2007-08-03 12:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: bdi_init_container.patch --]
[-- Type: text/plain, Size: 1541 bytes --]

split off from the large bdi_init patch because containers are not slated
for mainline any time soon.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/container.c |   14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/container.c
===================================================================
--- linux-2.6.orig/kernel/container.c
+++ linux-2.6/kernel/container.c
@@ -567,12 +567,13 @@ static int container_populate_dir(struct
 static struct inode_operations container_dir_inode_operations;
 static struct file_operations proc_containerstats_operations;
 
+static struct backing_dev_info container_backing_dev_info = {
+	.capabilities	= BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK,
+};
+
 static struct inode *container_new_inode(mode_t mode, struct super_block *sb)
 {
 	struct inode *inode = new_inode(sb);
-	static struct backing_dev_info container_backing_dev_info = {
-		.capabilities	= BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK,
-	};
 
 	if (inode) {
 		inode->i_mode = mode;
@@ -2261,6 +2262,10 @@ int __init container_init(void)
 	int i;
 	struct proc_dir_entry *entry;
 
+	err = bdi_init(&container_backing_dev_info);
+	if (err)
+		return err;
+
 	for (i = 0; i < CONTAINER_SUBSYS_COUNT; i++) {
 		struct container_subsys *ss = subsys[i];
 		if (!ss->early_init)
@@ -2276,6 +2281,9 @@ int __init container_init(void)
 		entry->proc_fops = &proc_containerstats_operations;
 
 out:
+	if (err)
+		bdi_destroy(&container_backing_dev_info);
+
 	return err;
 }
 

--


^ permalink raw reply	[flat|nested] 188+ messages in thread

* [PATCH 12/23] mtd: bdi init hooks
  2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
                   ` (10 preceding siblings ...)
  2007-08-03 12:37 ` [PATCH 11/23] containers: " Peter Zijlstra
@ 2007-08-03 12:37 ` Peter Zijlstra
  2007-08-03 12:37 ` [PATCH 13/23] mtd: clean up the backing_dev_info usage Peter Zijlstra
                   ` (13 subsequent siblings)
  25 siblings, 0 replies; 188+ messages in thread
From: Peter Zijlstra @ 2007-08-03 12:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: bdi_init_mtd.patch --]
[-- Type: text/plain, Size: 1122 bytes --]

split off because the relevant mtd changes seem particular to -mm

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 drivers/mtd/mtdcore.c |    9 +++++++++
 1 file changed, 9 insertions(+)

Index: linux-2.6/drivers/mtd/mtdcore.c
===================================================================
--- linux-2.6.orig/drivers/mtd/mtdcore.c
+++ linux-2.6/drivers/mtd/mtdcore.c
@@ -48,6 +48,7 @@ static LIST_HEAD(mtd_notifiers);
 int add_mtd_device(struct mtd_info *mtd)
 {
 	int i;
+	int err;
 
 	if (!mtd->backing_dev_info) {
 		switch (mtd->type) {
@@ -62,6 +63,9 @@ int add_mtd_device(struct mtd_info *mtd)
 			break;
 		}
 	}
+	err = bdi_init(mtd->backing_dev_info);
+	if (err)
+		return 1;
 
 	BUG_ON(mtd->writesize == 0);
 	mutex_lock(&mtd_table_mutex);
@@ -102,6 +106,7 @@ int add_mtd_device(struct mtd_info *mtd)
 		}
 
 	mutex_unlock(&mtd_table_mutex);
+	bdi_destroy(mtd->backing_dev_info);
 	return 1;
 }
 
@@ -144,6 +149,10 @@ int del_mtd_device (struct mtd_info *mtd
 	}
 
 	mutex_unlock(&mtd_table_mutex);
+
+	if (mtd->backing_dev_info)
+		bdi_destroy(mtd->backing_dev_info);
+
 	return ret;
 }
 

--


^ permalink raw reply	[flat|nested] 188+ messages in thread

* [PATCH 13/23] mtd: clean up the backing_dev_info usage
  2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
                   ` (11 preceding siblings ...)
  2007-08-03 12:37 ` [PATCH 12/23] mtd: " Peter Zijlstra
@ 2007-08-03 12:37 ` Peter Zijlstra
  2007-08-03 12:37 ` [PATCH 14/23] mtd: give mtdconcat devices their own backing_dev_info Peter Zijlstra
                   ` (12 subsequent siblings)
  25 siblings, 0 replies; 188+ messages in thread
From: Peter Zijlstra @ 2007-08-03 12:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: mtd-bdi-fixups.patch --]
[-- Type: text/plain, Size: 1890 bytes --]

Give each mtd device its own backing_dev_info instance.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 drivers/mtd/mtdcore.c   |    8 +++++---
 include/linux/mtd/mtd.h |    2 ++
 2 files changed, 7 insertions(+), 3 deletions(-)

Index: linux-2.6/drivers/mtd/mtdcore.c
===================================================================
--- linux-2.6.orig/drivers/mtd/mtdcore.c
+++ linux-2.6/drivers/mtd/mtdcore.c
@@ -19,6 +19,7 @@
 #include <linux/init.h>
 #include <linux/mtd/compatmac.h>
 #include <linux/proc_fs.h>
+#include <linux/backing-dev.h>
 
 #include <linux/mtd/mtd.h>
 #include "internal.h"
@@ -53,15 +54,16 @@ int add_mtd_device(struct mtd_info *mtd)
 	if (!mtd->backing_dev_info) {
 		switch (mtd->type) {
 		case MTD_RAM:
-			mtd->backing_dev_info = &mtd_bdi_rw_mappable;
+			mtd->mtd_backing_dev_info = mtd_bdi_rw_mappable;
 			break;
 		case MTD_ROM:
-			mtd->backing_dev_info = &mtd_bdi_ro_mappable;
+			mtd->mtd_backing_dev_info = mtd_bdi_ro_mappable;
 			break;
 		default:
-			mtd->backing_dev_info = &mtd_bdi_unmappable;
+			mtd->mtd_backing_dev_info = mtd_bdi_unmappable;
 			break;
 		}
+		mtd->backing_dev_info = &mtd->mtd_backing_dev_info;
 	}
 	err = bdi_init(mtd->backing_dev_info);
 	if (err)
Index: linux-2.6/include/linux/mtd/mtd.h
===================================================================
--- linux-2.6.orig/include/linux/mtd/mtd.h
+++ linux-2.6/include/linux/mtd/mtd.h
@@ -13,6 +13,7 @@
 #include <linux/module.h>
 #include <linux/uio.h>
 #include <linux/notifier.h>
+#include <linux/backing-dev.h>
 
 #include <linux/mtd/compatmac.h>
 #include <mtd/mtd-abi.h>
@@ -154,6 +155,7 @@ struct mtd_info {
 	 * - provides mmap capabilities
 	 */
 	struct backing_dev_info *backing_dev_info;
+	struct backing_dev_info mtd_backing_dev_info;
 
 
 	int (*read) (struct mtd_info *mtd, loff_t from, size_t len, size_t *retlen, u_char *buf);

--


^ permalink raw reply	[flat|nested] 188+ messages in thread

* [PATCH 14/23] mtd: give mtdconcat devices their own backing_dev_info
  2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
                   ` (12 preceding siblings ...)
  2007-08-03 12:37 ` [PATCH 13/23] mtd: clean up the backing_dev_info usage Peter Zijlstra
@ 2007-08-03 12:37 ` Peter Zijlstra
  2007-08-03 12:37 ` [PATCH 15/23] mm: scalable bdi statistics counters Peter Zijlstra
                   ` (11 subsequent siblings)
  25 siblings, 0 replies; 188+ messages in thread
From: Peter Zijlstra @ 2007-08-03 12:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds,
	Robert Kaiser

[-- Attachment #1: bdi_mtdconcat.patch --]
[-- Type: text/plain, Size: 3282 bytes --]

These are actual devices, give them their own BDI.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Kaiser <rkaiser@sysgo.de>
---
 drivers/mtd/mtdconcat.c |   28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

Index: linux-2.6/drivers/mtd/mtdconcat.c
===================================================================
--- linux-2.6.orig/drivers/mtd/mtdconcat.c	2007-04-22 18:55:17.000000000 +0200
+++ linux-2.6/drivers/mtd/mtdconcat.c	2007-04-22 19:01:42.000000000 +0200
@@ -32,6 +32,7 @@ struct mtd_concat {
 	struct mtd_info mtd;
 	int num_subdev;
 	struct mtd_info **subdev;
+	struct backing_dev_info backing_dev_info;
 };
 
 /*
@@ -782,10 +783,9 @@ struct mtd_info *mtd_concat_create(struc
 
 	for (i = 1; i < num_devs; i++) {
 		if (concat->mtd.type != subdev[i]->type) {
-			kfree(concat);
 			printk("Incompatible device type on \"%s\"\n",
 			       subdev[i]->name);
-			return NULL;
+			goto error;
 		}
 		if (concat->mtd.flags != subdev[i]->flags) {
 			/*
@@ -794,10 +794,9 @@ struct mtd_info *mtd_concat_create(struc
 			 */
 			if ((concat->mtd.flags ^ subdev[i]->
 			     flags) & ~MTD_WRITEABLE) {
-				kfree(concat);
 				printk("Incompatible device flags on \"%s\"\n",
 				       subdev[i]->name);
-				return NULL;
+				goto error;
 			} else
 				/* if writeable attribute differs,
 				   make super device writeable */
@@ -809,9 +808,12 @@ struct mtd_info *mtd_concat_create(struc
 		 * - copy-mapping is still permitted
 		 */
 		if (concat->mtd.backing_dev_info !=
-		    subdev[i]->backing_dev_info)
+		    subdev[i]->backing_dev_info) {
+			concat->backing_dev_info = default_backing_dev_info;
+			bdi_init(&concat->backing_dev_info);
 			concat->mtd.backing_dev_info =
-				&default_backing_dev_info;
+				&concat->backing_dev_info;
+		}
 
 		concat->mtd.size += subdev[i]->size;
 		concat->mtd.ecc_stats.badblocks +=
@@ -821,10 +823,9 @@ struct mtd_info *mtd_concat_create(struc
 		    concat->mtd.oobsize    !=  subdev[i]->oobsize ||
 		    !concat->mtd.read_oob  != !subdev[i]->read_oob ||
 		    !concat->mtd.write_oob != !subdev[i]->write_oob) {
-			kfree(concat);
 			printk("Incompatible OOB or ECC data on \"%s\"\n",
 			       subdev[i]->name);
-			return NULL;
+			goto error;
 		}
 		concat->subdev[i] = subdev[i];
 
@@ -903,11 +904,10 @@ struct mtd_info *mtd_concat_create(struc
 		    kmalloc(num_erase_region *
 			    sizeof (struct mtd_erase_region_info), GFP_KERNEL);
 		if (!erase_region_p) {
-			kfree(concat);
 			printk
 			    ("memory allocation error while creating erase region list"
 			     " for device \"%s\"\n", name);
-			return NULL;
+			goto error;
 		}
 
 		/*
@@ -968,6 +968,12 @@ struct mtd_info *mtd_concat_create(struc
 	}
 
 	return &concat->mtd;
+
+error:
+	if (concat->mtd.backing_dev_info == &concat->backing_dev_info)
+		bdi_destroy(&concat->backing_dev_info);
+	kfree(concat);
+	return NULL;
 }
 
 /*
@@ -977,6 +983,8 @@ struct mtd_info *mtd_concat_create(struc
 void mtd_concat_destroy(struct mtd_info *mtd)
 {
 	struct mtd_concat *concat = CONCAT(mtd);
+	if (concat->mtd.backing_dev_info == &concat->backing_dev_info)
+		bdi_destroy(&concat->backing_dev_info);
 	if (concat->mtd.numeraseregions)
 		kfree(concat->mtd.eraseregions);
 	kfree(concat);

--


^ permalink raw reply	[flat|nested] 188+ messages in thread

* [PATCH 15/23] mm: scalable bdi statistics counters.
  2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
                   ` (13 preceding siblings ...)
  2007-08-03 12:37 ` [PATCH 14/23] mtd: give mtdconcat devices their own backing_dev_info Peter Zijlstra
@ 2007-08-03 12:37 ` Peter Zijlstra
  2007-08-03 12:37 ` [PATCH 16/23] mm: count reclaimable pages per BDI Peter Zijlstra
                   ` (10 subsequent siblings)
  25 siblings, 0 replies; 188+ messages in thread
From: Peter Zijlstra @ 2007-08-03 12:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: bdi_stat.patch --]
[-- Type: text/plain, Size: 4063 bytes --]

Provide scalable per backing_dev_info statistics counters.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/backing-dev.h |   85 ++++++++++++++++++++++++++++++++++++++++++--
 mm/backing-dev.c            |   27 +++++++++++++
 2 files changed, 109 insertions(+), 3 deletions(-)

Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -8,6 +8,8 @@
 #ifndef _LINUX_BACKING_DEV_H
 #define _LINUX_BACKING_DEV_H
 
+#include <linux/percpu_counter.h>
+#include <linux/log2.h>
 #include <asm/atomic.h>
 
 struct page;
@@ -24,6 +26,12 @@ enum bdi_state {
 
 typedef int (congested_fn)(void *, int);
 
+enum bdi_stat_item {
+	NR_BDI_STAT_ITEMS
+};
+
+#define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
+
 struct backing_dev_info {
 	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
 	unsigned long state;	/* Always use atomic bitops on this */
@@ -32,15 +40,86 @@ struct backing_dev_info {
 	void *congested_data;	/* Pointer to aux data for congested func */
 	void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
 	void *unplug_io_data;
+
+	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 };
 
-static inline int bdi_init(struct backing_dev_info *bdi)
+int bdi_init(struct backing_dev_info *bdi);
+void bdi_destroy(struct backing_dev_info *bdi);
+
+static inline void __mod_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item, s32 amount)
 {
-	return 0;
+	__percpu_counter_add(&bdi->bdi_stat[item], amount, BDI_STAT_BATCH);
 }
 
-static inline void bdi_destroy(struct backing_dev_info *bdi)
+static inline void __inc_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
 {
+	__mod_bdi_stat(bdi, item, 1);
+}
+
+static inline void inc_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__inc_bdi_stat(bdi, item);
+	local_irq_restore(flags);
+}
+
+static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	__mod_bdi_stat(bdi, item, -1);
+}
+
+static inline void dec_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__dec_bdi_stat(bdi, item);
+	local_irq_restore(flags);
+}
+
+static inline s64 bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	return percpu_counter_read_positive(&bdi->bdi_stat[item]);
+}
+
+static inline s64 __bdi_stat_sum(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	return percpu_counter_sum_positive(&bdi->bdi_stat[item]);
+}
+
+static inline s64 bdi_stat_sum(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	s64 sum;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	sum = __bdi_stat_sum(bdi, item);
+	local_irq_restore(flags);
+
+	return sum;
+}
+
+/*
+ * maximal error of a stat counter.
+ */
+static inline unsigned long bdi_stat_error(struct backing_dev_info *bdi)
+{
+#ifdef CONFIG_SMP
+	return nr_cpu_ids * BDI_STAT_BATCH;
+#else
+	return 1;
+#endif
 }
 
 /*
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c
+++ linux-2.6/mm/backing-dev.c
@@ -5,6 +5,33 @@
 #include <linux/sched.h>
 #include <linux/module.h>
 
+int bdi_init(struct backing_dev_info *bdi)
+{
+	int i, j;
+	int err;
+
+	for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
+		err = percpu_counter_init_irq(&bdi->bdi_stat[i], 0);
+		if (err) {
+			for (j = 0; j < i; j++)
+				perpcu_counter_destroy(&bdi->bdi_stat[i]);
+			break;
+		}
+	}
+
+	return err;
+}
+EXPORT_SYMBOL(bdi_init);
+
+void bdi_destroy(struct backing_dev_info *bdi)
+{
+	int i;
+
+	for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
+		percpu_counter_destroy(&bdi->bdi_stat[i]);
+}
+EXPORT_SYMBOL(bdi_destroy);
+
 static wait_queue_head_t congestion_wqh[2] = {
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])

--


^ permalink raw reply	[flat|nested] 188+ messages in thread

* [PATCH 16/23] mm: count reclaimable pages per BDI
  2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
                   ` (14 preceding siblings ...)
  2007-08-03 12:37 ` [PATCH 15/23] mm: scalable bdi statistics counters Peter Zijlstra
@ 2007-08-03 12:37 ` Peter Zijlstra
  2007-08-03 12:37 ` [PATCH 17/23] mm: count writeback " Peter Zijlstra
                   ` (9 subsequent siblings)
  25 siblings, 0 replies; 188+ messages in thread
From: Peter Zijlstra @ 2007-08-03 12:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: bdi_stat_reclaimable.patch --]
[-- Type: text/plain, Size: 4062 bytes --]

Count per BDI reclaimable pages; nr_reclaimable = nr_dirty + nr_unstable.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/buffer.c                 |    2 ++
 fs/nfs/write.c              |    7 +++++++
 include/linux/backing-dev.h |    1 +
 mm/page-writeback.c         |    4 ++++
 mm/truncate.c               |    2 ++
 5 files changed, 16 insertions(+)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -697,6 +697,8 @@ static int __set_page_dirty(struct page 
 
 		if (mapping_cap_account_dirty(mapping)) {
 			__inc_zone_page_state(page, NR_FILE_DIRTY);
+			__inc_bdi_stat(mapping->backing_dev_info,
+					BDI_RECLAIMABLE);
 			task_io_account_write(PAGE_CACHE_SIZE);
 		}
 		radix_tree_tag_set(&mapping->page_tree,
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -827,6 +827,8 @@ int __set_page_dirty_nobuffers(struct pa
 			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
 			if (mapping_cap_account_dirty(mapping)) {
 				__inc_zone_page_state(page, NR_FILE_DIRTY);
+				__inc_bdi_stat(mapping->backing_dev_info,
+						BDI_RECLAIMABLE);
 				task_io_account_write(PAGE_CACHE_SIZE);
 			}
 			radix_tree_tag_set(&mapping->page_tree,
@@ -961,6 +963,8 @@ int clear_page_dirty_for_io(struct page 
 		 */
 		if (TestClearPageDirty(page)) {
 			dec_zone_page_state(page, NR_FILE_DIRTY);
+			dec_bdi_stat(mapping->backing_dev_info,
+					BDI_RECLAIMABLE);
 			return 1;
 		}
 		return 0;
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -72,6 +72,8 @@ void cancel_dirty_page(struct page *page
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
 			dec_zone_page_state(page, NR_FILE_DIRTY);
+			dec_bdi_stat(mapping->backing_dev_info,
+					BDI_RECLAIMABLE);
 			if (account_size)
 				task_io_account_cancelled_write(account_size);
 		}
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -464,6 +464,7 @@ nfs_mark_request_commit(struct nfs_page 
 			NFS_PAGE_TAG_COMMIT);
 	spin_unlock(&inode->i_lock);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 }
 
@@ -550,6 +551,8 @@ static void nfs_cancel_commit_list(struc
 	while(!list_empty(head)) {
 		req = nfs_list_entry(head->next);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+				BDI_RECLAIMABLE);
 		nfs_list_remove_request(req);
 		clear_bit(PG_NEED_COMMIT, &(req)->wb_flags);
 		nfs_inode_remove_request(req);
@@ -1210,6 +1213,8 @@ nfs_commit_list(struct inode *inode, str
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+				BDI_RECLAIMABLE);
 		nfs_clear_page_tag_locked(req);
 	}
 	return -ENOMEM;
@@ -1235,6 +1240,8 @@ static void nfs_commit_done(struct rpc_t
 		nfs_list_remove_request(req);
 		clear_bit(PG_NEED_COMMIT, &(req)->wb_flags);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+				BDI_RECLAIMABLE);
 
 		dprintk("NFS: commit (%s/%Ld %d@%Ld)",
 			req->wb_context->path.dentry->d_inode->i_sb->s_id,
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -27,6 +27,7 @@ enum bdi_state {
 typedef int (congested_fn)(void *, int);
 
 enum bdi_stat_item {
+	BDI_RECLAIMABLE,
 	NR_BDI_STAT_ITEMS
 };
 

--


^ permalink raw reply	[flat|nested] 188+ messages in thread

* [PATCH 17/23] mm: count writeback pages per BDI
  2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
                   ` (15 preceding siblings ...)
  2007-08-03 12:37 ` [PATCH 16/23] mm: count reclaimable pages per BDI Peter Zijlstra
@ 2007-08-03 12:37 ` Peter Zijlstra
  2007-08-09 19:15   ` Christoph Lameter
  2007-08-03 12:37 ` [PATCH 18/23] mm: expose BDI statistics in sysfs Peter Zijlstra
                   ` (8 subsequent siblings)
  25 siblings, 1 reply; 188+ messages in thread
From: Peter Zijlstra @ 2007-08-03 12:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: bdi_stat_writeback.patch --]
[-- Type: text/plain, Size: 1930 bytes --]

Count per BDI writeback pages.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/backing-dev.h |    1 +
 mm/page-writeback.c         |   12 ++++++++++--
 2 files changed, 11 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -981,14 +981,18 @@ int test_clear_page_writeback(struct pag
 	int ret;
 
 	if (mapping) {
+		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
 
 		write_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestClearPageWriteback(page);
-		if (ret)
+		if (ret) {
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
+			if (bdi_cap_writeback_dirty(bdi))
+				__dec_bdi_stat(bdi, BDI_WRITEBACK);
+		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
 		ret = TestClearPageWriteback(page);
@@ -1004,14 +1008,18 @@ int test_set_page_writeback(struct page 
 	int ret;
 
 	if (mapping) {
+		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
 
 		write_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestSetPageWriteback(page);
-		if (!ret)
+		if (!ret) {
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
+			if (bdi_cap_writeback_dirty(bdi))
+				__inc_bdi_stat(bdi, BDI_WRITEBACK);
+		}
 		if (!PageDirty(page))
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -28,6 +28,7 @@ typedef int (congested_fn)(void *, int);
 
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
+	BDI_WRITEBACK,
 	NR_BDI_STAT_ITEMS
 };
 

--


^ permalink raw reply	[flat|nested] 188+ messages in thread

* [PATCH 18/23] mm: expose BDI statistics in sysfs.
  2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
                   ` (16 preceding siblings ...)
  2007-08-03 12:37 ` [PATCH 17/23] mm: count writeback " Peter Zijlstra
@ 2007-08-03 12:37 ` Peter Zijlstra
  2007-08-03 12:37 ` [PATCH 19/23] lib: floating proportions Peter Zijlstra
                   ` (7 subsequent siblings)
  25 siblings, 0 replies; 188+ messages in thread
From: Peter Zijlstra @ 2007-08-03 12:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: bdi_stat_sysfs.patch --]
[-- Type: text/plain, Size: 1963 bytes --]

Expose the per BDI stats in /sys/block/<dev>/queue/*

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 block/ll_rw_blk.c |   29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

Index: linux-2.6/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/block/ll_rw_blk.c
+++ linux-2.6/block/ll_rw_blk.c
@@ -3977,6 +3977,23 @@ static ssize_t queue_max_hw_sectors_show
 	return queue_var_show(max_hw_sectors_kb, (page));
 }
 
+static ssize_t queue_nr_reclaimable_show(struct request_queue *q, char *page)
+{
+	unsigned long long nr_reclaimable =
+		bdi_stat(&q->backing_dev_info, BDI_RECLAIMABLE);
+
+	return sprintf(page, "%llu\n",
+			nr_reclaimable >> (PAGE_CACHE_SHIFT - 10));
+}
+
+static ssize_t queue_nr_writeback_show(struct request_queue *q, char *page)
+{
+	unsigned long long nr_writeback =
+		bdi_stat(&q->backing_dev_info, BDI_WRITEBACK);
+
+	return sprintf(page, "%llu\n",
+			nr_writeback >> (PAGE_CACHE_SHIFT - 10));
+}
 
 static struct queue_sysfs_entry queue_requests_entry = {
 	.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
@@ -4001,6 +4018,16 @@ static struct queue_sysfs_entry queue_ma
 	.show = queue_max_hw_sectors_show,
 };
 
+static struct queue_sysfs_entry queue_reclaimable_entry = {
+	.attr = {.name = "reclaimable_kb", .mode = S_IRUGO },
+	.show = queue_nr_reclaimable_show,
+};
+
+static struct queue_sysfs_entry queue_writeback_entry = {
+	.attr = {.name = "writeback_kb", .mode = S_IRUGO },
+	.show = queue_nr_writeback_show,
+};
+
 static struct queue_sysfs_entry queue_iosched_entry = {
 	.attr = {.name = "scheduler", .mode = S_IRUGO | S_IWUSR },
 	.show = elv_iosched_show,
@@ -4012,6 +4039,8 @@ static struct attribute *default_attrs[]
 	&queue_ra_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
 	&queue_max_sectors_entry.attr,
+	&queue_reclaimable_entry.attr,
+	&queue_writeback_entry.attr,
 	&queue_iosched_entry.attr,
 	NULL,
 };

--


^ permalink raw reply	[flat|nested] 188+ messages in thread

* [PATCH 19/23] lib: floating proportions
  2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
                   ` (17 preceding siblings ...)
  2007-08-03 12:37 ` [PATCH 18/23] mm: expose BDI statistics in sysfs Peter Zijlstra
@ 2007-08-03 12:37 ` Peter Zijlstra
  2007-08-03 12:37 ` [PATCH 20/23] lib: floating proportions _single Peter Zijlstra
                   ` (6 subsequent siblings)
  25 siblings, 0 replies; 188+ messages in thread
From: Peter Zijlstra @ 2007-08-03 12:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: proportions.patch --]
[-- Type: text/plain, Size: 10094 bytes --]

Given a set of objects, floating proportions aims to efficiently give the
proportional 'activity' of a single item as compared to the whole set. Where
'activity' is a measure of a temporal property of the items.

It is efficient in that it need not inspect any other items of the set
in order to provide the answer. It is not even needed to know how many
other items there are.

It has one parameter, and that is the period of 'time' over which the 
'activity' is measured.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/proportions.h |   81 +++++++++++++
 lib/Makefile                |    3 
 lib/proportions.c           |  264 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 347 insertions(+), 1 deletion(-)

Index: linux-2.6/lib/proportions.c
===================================================================
--- /dev/null
+++ linux-2.6/lib/proportions.c
@@ -0,0 +1,264 @@
+/*
+ * Floating proportions
+ *
+ *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * Description:
+ *
+ * The floating proportion is a time derivative with an exponentially decaying
+ * history:
+ *
+ *   p_{j} = \Sum_{i=0} (dx_{j}/dt_{-i}) / 2^(1+i)
+ *
+ * Where j is an element from {prop_local}, x_{j} is j's number of events,
+ * and i the time period over which the differential is taken. So d/dt_{-i} is
+ * the differential over the i-th last period.
+ *
+ * The decaying history gives smooth transitions. The time differential carries
+ * the notion of speed.
+ *
+ * The denominator is 2^(1+i) because we want the series to be normalised, ie.
+ *
+ *   \Sum_{i=0} 1/2^(1+i) = 1
+ *
+ * Further more, if we measure time (t) in the same events as x; so that:
+ *
+ *   t = \Sum_{j} x_{j}
+ *
+ * we get that:
+ *
+ *   \Sum_{j} p_{j} = 1
+ *
+ * Writing this in an iterative fashion we get (dropping the 'd's):
+ *
+ *   if (++x_{j}, ++t > period)
+ *     t /= 2;
+ *     for_each (j)
+ *       x_{j} /= 2;
+ *
+ * so that:
+ *
+ *   p_{j} = x_{j} / t;
+ *
+ * We optimize away the '/= 2' for the global time delta by noting that:
+ *
+ *   if (++t > period) t /= 2:
+ *
+ * Can be approximated by:
+ *
+ *   period/2 + (++t % period/2)
+ *
+ * [ Furthermore, when we choose period to be 2^n it can be written in terms of
+ *   binary operations and wraparound artefacts disappear. ]
+ *
+ * Also note that this yields a natural counter of the elapsed periods:
+ *
+ *   c = t / (period/2)
+ *
+ * [ Its monotonic increasing property can be applied to mitigate the wrap-
+ *   around issue. ]
+ *
+ * This allows us to do away with the loop over all prop_locals on each period
+ * expiration. By remembering the period count under which it was last accessed
+ * as c_{j}, we can obtain the number of 'missed' cycles from:
+ *
+ *   c - c_{j}
+ *
+ * We can then lazily catch up to the global period count every time we are
+ * going to use x_{j}, by doing:
+ *
+ *   x_{j} /= 2^(c - c_{j}), c_{j} = c
+ */
+
+#include <linux/proportions.h>
+#include <linux/rcupdate.h>
+
+int prop_descriptor_init(struct prop_descriptor *pd, int shift)
+{
+	int err;
+
+	pd->index = 0;
+	pd->pg[0].shift = shift;
+	mutex_init(&pd->mutex);
+	err = percpu_counter_init_irq(&pd->pg[0].events, 0);
+	if (err)
+		goto out;
+
+	err = percpu_counter_init_irq(&pd->pg[1].events, 0);
+	if (err)
+		percpu_counter_destroy(&pd->pg[0].events);
+
+out:
+	return err;
+}
+
+/*
+ * We have two copies, and flip between them to make it seem like an atomic
+ * update. The update is not really atomic wrt the events counter, but
+ * it is internally consistent with the bit layout depending on shift.
+ *
+ * We copy the events count, move the bits around and flip the index.
+ */
+void prop_change_shift(struct prop_descriptor *pd, int shift)
+{
+	int index;
+	int offset;
+	u64 events;
+	unsigned long flags;
+
+	mutex_lock(&pd->mutex);
+
+	index = pd->index ^ 1;
+	offset = pd->pg[pd->index].shift - shift;
+	if (!offset)
+		goto out;
+
+	pd->pg[index].shift = shift;
+
+	local_irq_save(flags);
+	events = percpu_counter_sum(
+			&pd->pg[pd->index].events);
+	if (offset < 0)
+		events <<= -offset;
+	else
+		events >>= offset;
+	percpu_counter_set(&pd->pg[index].events, events);
+
+	/*
+	 * ensure the new pg is fully written before the switch
+	 */
+	smp_wmb();
+	pd->index = index;
+	local_irq_restore(flags);
+
+	synchronize_rcu();
+
+out:
+	mutex_unlock(&pd->mutex);
+}
+
+/*
+ * wrap the access to the data in an rcu_read_lock() section;
+ * this is used to track the active references.
+ */
+struct prop_global *prop_get_global(struct prop_descriptor *pd)
+{
+	int index;
+
+	rcu_read_lock();
+	index = pd->index;
+	/*
+	 * match the wmb from vcd_flip()
+	 */
+	smp_rmb();
+	return &pd->pg[index];
+}
+
+void prop_put_global(struct prop_descriptor *pd, struct prop_global *pg)
+{
+	rcu_read_unlock();
+}
+
+static void prop_adjust_shift(struct prop_local *pl, int new_shift)
+{
+	int offset = pl->shift - new_shift;
+
+	if (!offset)
+		return;
+
+	if (offset < 0)
+		pl->period <<= -offset;
+	else
+		pl->period >>= offset;
+
+	pl->shift = new_shift;
+}
+
+int prop_local_init(struct prop_local *pl)
+{
+	spin_lock_init(&pl->lock);
+	pl->shift = 0;
+	pl->period = 0;
+	return percpu_counter_init_irq(&pl->events, 0);
+}
+
+void prop_local_destroy(struct prop_local *pl)
+{
+	percpu_counter_destroy(&pl->events);
+}
+
+/*
+ * Catch up with missed period expirations.
+ *
+ *   until (c_{j} == c)
+ *     x_{j} -= x_{j}/2;
+ *     c_{j}++;
+ */
+void prop_norm(struct prop_global *pg,
+		struct prop_local *pl)
+{
+	unsigned long period = 1UL << (pg->shift - 1);
+	unsigned long period_mask = ~(period - 1);
+	unsigned long global_period;
+	unsigned long flags;
+
+	global_period = percpu_counter_read(&pg->events);
+	global_period &= period_mask;
+
+	/*
+	 * Fast path - check if the local and global period count still match
+	 * outside of the lock.
+	 */
+	if (pl->period == global_period)
+		return;
+
+	spin_lock_irqsave(&pl->lock, flags);
+	prop_adjust_shift(pl, pg->shift);
+	/*
+	 * For each missed period, we half the local counter.
+	 * basically:
+	 *   pl->events >> (global_period - pl->period);
+	 *
+	 * but since the distributed nature of percpu counters make division
+	 * rather hard, use a regular subtraction loop. This is safe, because
+	 * the events will only every be incremented, hence the subtraction
+	 * can never result in a negative number.
+	 */
+	while (pl->period != global_period) {
+		unsigned long val = percpu_counter_read(&pl->events);
+		unsigned long half = (val + 1) >> 1;
+
+		/*
+		 * Half of zero won't be much less, break out.
+		 * This limits the loop to shift iterations, even
+		 * if we missed a million.
+		 */
+		if (!val)
+			break;
+
+		percpu_counter_add(&pl->events, -half);
+		pl->period += period;
+	}
+	pl->period = global_period;
+	spin_unlock_irqrestore(&pl->lock, flags);
+}
+
+/*
+ * Obtain an fraction of this proportion
+ *
+ *   p_{j} = x_{j} / (period/2 + t % period/2)
+ */
+void prop_fraction(struct prop_global *pg, struct prop_local *pl,
+		long *numerator, long *denominator)
+{
+	unsigned long period_2 = 1UL << (pg->shift - 1);
+	unsigned long counter_mask = period_2 - 1;
+	unsigned long global_count;
+
+	prop_norm(pg, pl);
+	*numerator = percpu_counter_read_positive(&pl->events);
+
+	global_count = percpu_counter_read(&pg->events);
+	*denominator = period_2 + (global_count & counter_mask);
+}
+
Index: linux-2.6/include/linux/proportions.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/proportions.h
@@ -0,0 +1,81 @@
+/*
+ * FLoating proportions
+ *
+ *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * This file contains the public data structure and API definitions.
+ */
+
+#ifndef _LINUX_PROPORTIONS_H
+#define _LINUX_PROPORTIONS_H
+
+#include <linux/percpu_counter.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+
+struct prop_global {
+	/*
+	 * The period over which we differentiate
+	 *
+	 *   period = 2^shift
+	 */
+	int shift;
+	/*
+	 * The total event counter aka 'time'.
+	 *
+	 * Treated as an unsigned long; the lower 'shift - 1' bits are the
+	 * counter bits, the remaining upper bits the period counter.
+	 */
+	struct percpu_counter events;
+};
+
+/*
+ * global proportion descriptor
+ *
+ * this is needed to consitently flip prop_global structures.
+ */
+struct prop_descriptor {
+	int index;
+	struct prop_global pg[2];
+	struct mutex mutex;		/* serialize the prop_global switch */
+};
+
+int prop_descriptor_init(struct prop_descriptor *pd, int shift);
+void prop_change_shift(struct prop_descriptor *pd, int new_shift);
+struct prop_global *prop_get_global(struct prop_descriptor *pd);
+void prop_put_global(struct prop_descriptor *pd, struct prop_global *pg);
+
+struct prop_local {
+	/*
+	 * the local events counter
+	 */
+	struct percpu_counter events;
+
+	/*
+	 * snapshot of the last seen global state
+	 */
+	int shift;
+	unsigned long period;
+	spinlock_t lock;		/* protect the snapshot state */
+};
+
+int prop_local_init(struct prop_local *pl);
+void prop_local_destroy(struct prop_local *pl);
+
+void prop_norm(struct prop_global *pg, struct prop_local *pl);
+
+/*
+ *   ++x_{j}, ++t
+ */
+static inline
+void __prop_inc(struct prop_global *pg, struct prop_local *pl)
+{
+	prop_norm(pg, pl);
+	percpu_counter_add(&pl->events, 1);
+	percpu_counter_add(&pg->events, 1);
+}
+
+void prop_fraction(struct prop_global *pg, struct prop_local *pl,
+		long *numerator, long *denominator);
+
+#endif /* _LINUX_PROPORTIONS_H */
Index: linux-2.6/lib/Makefile
===================================================================
--- linux-2.6.orig/lib/Makefile
+++ linux-2.6/lib/Makefile
@@ -5,7 +5,8 @@
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 rbtree.o radix-tree.o dump_stack.o \
 	 idr.o int_sqrt.o bitmap.o extable.o prio_tree.o \
-	 sha1.o irq_regs.o reciprocal_div.o argv_split.o
+	 sha1.o irq_regs.o reciprocal_div.o argv_split.o \
+	 proportions.o
 
 lib-$(CONFIG_MMU) += ioremap.o pagewalk.o
 lib-$(CONFIG_SMP) += cpumask.o

--


^ permalink raw reply	[flat|nested] 188+ messages in thread

* [PATCH 20/23] lib: floating proportions _single
  2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
                   ` (18 preceding siblings ...)
  2007-08-03 12:37 ` [PATCH 19/23] lib: floating proportions Peter Zijlstra
@ 2007-08-03 12:37 ` Peter Zijlstra
  2007-08-03 12:37 ` [PATCH 21/23] mm: per device dirty threshold Peter Zijlstra
                   ` (5 subsequent siblings)
  25 siblings, 0 replies; 188+ messages in thread
From: Peter Zijlstra @ 2007-08-03 12:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: proportions_single.patch --]
[-- Type: text/plain, Size: 9491 bytes --]

Provide a prop_local that does not use a percpu variable for its counter.
This is useful for items that are not (or infrequently) accessed from
multiple context and/or are plenty enought that the percpu_counter overhead
will hurt (tasks).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/proportions.h |  113 +++++++++++++++++++++++++++++++++++++--
 lib/proportions.c           |  125 ++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 220 insertions(+), 18 deletions(-)

Index: linux-2.6/include/linux/proportions.h
===================================================================
--- linux-2.6.orig/include/linux/proportions.h
+++ linux-2.6/include/linux/proportions.h
@@ -45,7 +45,11 @@ void prop_change_shift(struct prop_descr
 struct prop_global *prop_get_global(struct prop_descriptor *pd);
 void prop_put_global(struct prop_descriptor *pd, struct prop_global *pg);
 
-struct prop_local {
+/*
+ * ----- PERCPU ------
+ */
+
+struct prop_local_percpu {
 	/*
 	 * the local events counter
 	 */
@@ -59,23 +63,118 @@ struct prop_local {
 	spinlock_t lock;		/* protect the snapshot state */
 };
 
-int prop_local_init(struct prop_local *pl);
-void prop_local_destroy(struct prop_local *pl);
+int prop_local_init_percpu(struct prop_local_percpu *pl);
+void prop_local_destroy_percpu(struct prop_local_percpu *pl);
 
-void prop_norm(struct prop_global *pg, struct prop_local *pl);
+void prop_norm_percpu(struct prop_global *pg, struct prop_local_percpu *pl);
 
 /*
  *   ++x_{j}, ++t
  */
 static inline
-void __prop_inc(struct prop_global *pg, struct prop_local *pl)
+void __prop_inc_percpu(struct prop_global *pg, struct prop_local_percpu *pl)
 {
-	prop_norm(pg, pl);
+	prop_norm_percpu(pg, pl);
 	percpu_counter_add(&pl->events, 1);
 	percpu_counter_add(&pg->events, 1);
 }
 
-void prop_fraction(struct prop_global *pg, struct prop_local *pl,
+void prop_fraction_percpu(struct prop_global *pg, struct prop_local_percpu *pl,
+		long *numerator, long *denominator);
+
+/*
+ * ----- SINGLE ------
+ */
+
+struct prop_local_single {
+	/*
+	 * the local events counter
+	 */
+	unsigned long events;
+
+	/*
+	 * snapshot of the last seen global state
+	 * and a lock protecting this state
+	 */
+	int shift;
+	unsigned long period;
+	spinlock_t lock;		/* protect the snapshot state */
+};
+
+int prop_local_init_single(struct prop_local_single *pl);
+void prop_local_destroy_single(struct prop_local_single *pl);
+
+void prop_norm_single(struct prop_global *pg, struct prop_local_single *pl);
+
+/*
+ *   ++x_{j}, ++t
+ */
+static inline
+void __prop_inc_single(struct prop_global *pg, struct prop_local_single *pl)
+{
+	prop_norm_single(pg, pl);
+	pl->events++;
+	percpu_counter_add(&pg->events, 1);
+}
+
+void prop_fraction_single(struct prop_global *pg, struct prop_local_single *pl,
 		long *numerator, long *denominator);
 
+/*
+ * ----- GLUE ------
+ */
+
+#undef TYPE_EQUAL
+#define TYPE_EQUAL(expr, type) \
+	__builtin_types_compatible_p(typeof(expr), type)
+
+extern int __bad_prop_local(void);
+
+#define prop_local_init(prop_local)					\
+({	int err;							\
+	if (TYPE_EQUAL(*(prop_local), struct prop_local_percpu))	\
+		err = prop_local_init_percpu(				\
+			(struct prop_local_percpu *)(prop_local));	\
+	else if (TYPE_EQUAL(*(prop_local), struct prop_local_single))	\
+		err = prop_local_init_single(				\
+			(struct prop_local_single *)(prop_local));	\
+	else __bad_prop_local();					\
+	err;								\
+})
+
+#define prop_local_destroy(prop_local)					\
+do {									\
+	if (TYPE_EQUAL(*(prop_local), struct prop_local_percpu))	\
+		prop_local_destroy_percpu(				\
+			(struct prop_local_percpu *)(prop_local));	\
+	else if (TYPE_EQUAL(*(prop_local), struct prop_local_single))	\
+		prop_local_destroy_single(				\
+			(struct prop_local_single *)(prop_local));	\
+	else __bad_prop_local();					\
+} while (0)
+
+#define __prop_inc(prop_global, prop_local)				\
+do {									\
+	if (TYPE_EQUAL(*(prop_local), struct prop_local_percpu))	\
+		__prop_inc_percpu(prop_global,				\
+			(struct prop_local_percpu *)(prop_local)); 	\
+	else if (TYPE_EQUAL(*(prop_local), struct prop_local_single))	\
+		__prop_inc_single(prop_global,				\
+			(struct prop_local_single *)(prop_local)); 	\
+	else __bad_prop_local();					\
+} while (0)
+
+#define prop_fraction(prop_global, prop_local, num, denom)		\
+do {									\
+	if (TYPE_EQUAL(*(prop_local), struct prop_local_percpu))	\
+		prop_fraction_percpu(prop_global,			\
+			(struct prop_local_percpu *)(prop_local),	\
+			num, denom);					\
+	else if (TYPE_EQUAL(*(prop_local), struct prop_local_single))	\
+		prop_fraction_single(prop_global,			\
+			(struct prop_local_single *)(prop_local),	\
+			num, denom);					\
+	else __bad_prop_local();					\
+} while (0)
+
 #endif /* _LINUX_PROPORTIONS_H */
Index: linux-2.6/lib/proportions.c
===================================================================
--- linux-2.6.orig/lib/proportions.c
+++ linux-2.6/lib/proportions.c
@@ -158,22 +158,31 @@ void prop_put_global(struct prop_descrip
 	rcu_read_unlock();
 }
 
-static void prop_adjust_shift(struct prop_local *pl, int new_shift)
+static void
+__prop_adjust_shift(int *pl_shift, unsigned long *pl_period, int new_shift)
 {
-	int offset = pl->shift - new_shift;
+	int offset = *pl_shift - new_shift;
 
 	if (!offset)
 		return;
 
 	if (offset < 0)
-		pl->period <<= -offset;
+		*pl_period <<= -offset;
 	else
-		pl->period >>= offset;
+		*pl_period >>= offset;
 
-	pl->shift = new_shift;
+	*pl_shift = new_shift;
 }
 
-int prop_local_init(struct prop_local *pl)
+#define prop_adjust_shift(prop_local, pg_shift)			\
+	__prop_adjust_shift(&(prop_local)->shift,		\
+			    &(prop_local)->period, pg_shift)
+
+/*
+ * PERCPU
+ */
+
+int prop_local_init_percpu(struct prop_local_percpu *pl)
 {
 	spin_lock_init(&pl->lock);
 	pl->shift = 0;
@@ -181,7 +190,7 @@ int prop_local_init(struct prop_local *p
 	return percpu_counter_init_irq(&pl->events, 0);
 }
 
-void prop_local_destroy(struct prop_local *pl)
+void prop_local_destroy_percpu(struct prop_local_percpu *pl)
 {
 	percpu_counter_destroy(&pl->events);
 }
@@ -193,8 +202,7 @@ void prop_local_destroy(struct prop_loca
  *     x_{j} -= x_{j}/2;
  *     c_{j}++;
  */
-void prop_norm(struct prop_global *pg,
-		struct prop_local *pl)
+void prop_norm_percpu(struct prop_global *pg, struct prop_local_percpu *pl)
 {
 	unsigned long period = 1UL << (pg->shift - 1);
 	unsigned long period_mask = ~(period - 1);
@@ -247,17 +255,112 @@ void prop_norm(struct prop_global *pg,
  *
  *   p_{j} = x_{j} / (period/2 + t % period/2)
  */
-void prop_fraction(struct prop_global *pg, struct prop_local *pl,
+void prop_fraction_percpu(struct prop_global *pg, struct prop_local_percpu *pl,
 		long *numerator, long *denominator)
 {
 	unsigned long period_2 = 1UL << (pg->shift - 1);
 	unsigned long counter_mask = period_2 - 1;
 	unsigned long global_count;
 
-	prop_norm(pg, pl);
+	prop_norm_percpu(pg, pl);
 	*numerator = percpu_counter_read_positive(&pl->events);
 
 	global_count = percpu_counter_read(&pg->events);
 	*denominator = period_2 + (global_count & counter_mask);
 }
 
+/*
+ * SINGLE
+ */
+
+int prop_local_init_single(struct prop_local_single *pl)
+{
+	spin_lock_init(&pl->lock);
+	pl->shift = 0;
+	pl->period = 0;
+	pl->events = 0;
+	return 0;
+}
+
+void prop_local_destroy_single(struct prop_local_single *pl)
+{
+}
+
+/*
+ * Catch up with missed period expirations.
+ *
+ *   until (c_{j} == c)
+ *     x_{j} -= x_{j}/2;
+ *     c_{j}++;
+ */
+void prop_norm_single(struct prop_global *pg, struct prop_local_single *pl)
+{
+	unsigned long period = 1UL << (pg->shift - 1);
+	unsigned long period_mask = ~(period - 1);
+	unsigned long global_period;
+	unsigned long flags;
+
+	global_period = percpu_counter_read(&pg->events);
+	global_period &= period_mask;
+
+	/*
+	 * Fast path - check if the local and global period count still match
+	 * outside of the lock.
+	 */
+	if (pl->period == global_period)
+		return;
+
+	spin_lock_irqsave(&pl->lock, flags);
+	prop_adjust_shift(pl, pg->shift);
+	/*
+	 * For each missed period, we half the local counter.
+	 * basically:
+	 *   pl->events >> (global_period - pl->period);
+	 *
+	 * but since the distributed nature of single counters make division
+	 * rather hard, use a regular subtraction loop. This is safe, because
+	 * the events will only every be incremented, hence the subtraction
+	 * can never result in a negative number.
+	 */
+	while (pl->period != global_period) {
+		unsigned long val = pl->events;
+		unsigned long half = (val + 1) >> 1;
+
+		/*
+		 * Half of zero won't be much less, break out.
+		 * This limits the loop to shift iterations, even
+		 * if we missed a million.
+		 */
+		if (!val)
+			break;
+
+		/*
+		 * Iff shift >32 half might exceed the limits of
+		 * the regular single_counter_mod.
+		 */
+		pl->events -= half;
+		pl->period += period;
+	}
+	pl->period = global_period;
+	spin_unlock_irqrestore(&pl->lock, flags);
+}
+
+/*
+ * Obtain an fraction of this proportion
+ *
+ *   p_{j} = x_{j} / (period/2 + t % period/2)
+ */
+void prop_fraction_single(struct prop_global *pg, struct prop_local_single *pl,
+		long *numerator, long *denominator)
+{
+	unsigned long period_2 = 1UL << (pg->shift - 1);
+	unsigned long counter_mask = period_2 - 1;
+	unsigned long global_count;
+
+	prop_norm_single(pg, pl);
+	*numerator = pl->events;
+
+	global_count = percpu_counter_read(&pg->events);
+	*denominator = period_2 + (global_count & counter_mask);
+}
+

--


^ permalink raw reply	[flat|nested] 188+ messages in thread

* [PATCH 21/23] mm: per device dirty threshold
  2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
                   ` (19 preceding siblings ...)
  2007-08-03 12:37 ` [PATCH 20/23] lib: floating proportions _single Peter Zijlstra
@ 2007-08-03 12:37 ` Peter Zijlstra
  2007-08-03 12:37 ` [PATCH 22/23] mm: dirty balancing for tasks Peter Zijlstra
                   ` (4 subsequent siblings)
  25 siblings, 0 replies; 188+ messages in thread
From: Peter Zijlstra @ 2007-08-03 12:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: writeback-balance-per-backing_dev.patch --]
[-- Type: text/plain, Size: 13711 bytes --]

Scale writeback cache per backing device, proportional to its writeout speed.

By decoupling the BDI dirty thresholds a number of problems we currently have
will go away, namely:

 - mutual interference starvation (for any number of BDIs);
 - deadlocks with stacked BDIs (loop, FUSE and local NFS mounts).

It might be that all dirty pages are for a single BDI while other BDIs are
idling. By giving each BDI a 'fair' share of the dirty limit, each one can have
dirty pages outstanding and make progress.

A global threshold also creates a deadlock for stacked BDIs; when A writes to
B, and A generates enough dirty pages to get throttled, B will never start
writeback until the dirty pages go away. Again, by giving each BDI its own
'independent' dirty limit, this problem is avoided.

So the problem is to determine how to distribute the total dirty limit across
the BDIs fairly and efficiently. A DBI that has a large dirty limit but does
not have any dirty pages outstanding is a waste.

What is done is to keep a floating proportion between the DBIs based on
writeback completions. This way faster/more active devices get a larger share
than slower/idle devices.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/backing-dev.h |    4 
 kernel/sysctl.c             |    6 +
 mm/backing-dev.c            |   19 +++-
 mm/page-writeback.c         |  205 +++++++++++++++++++++++++++++++++++++-------
 4 files changed, 199 insertions(+), 35 deletions(-)

Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -10,6 +10,7 @@
 
 #include <linux/percpu_counter.h>
 #include <linux/log2.h>
+#include <linux/proportions.h>
 #include <asm/atomic.h>
 
 struct page;
@@ -44,6 +45,9 @@ struct backing_dev_info {
 	void *unplug_io_data;
 
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
+
+	struct prop_local_percpu completions;
+	int dirty_exceeded;
 };
 
 int bdi_init(struct backing_dev_info *bdi);
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -2,6 +2,7 @@
  * mm/page-writeback.c
  *
  * Copyright (C) 2002, Linus Torvalds.
+ * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
  *
  * Contains functions related to writing back dirty pages at the
  * address_space level.
@@ -49,8 +50,6 @@
  */
 static long ratelimit_pages = 32;
 
-static int dirty_exceeded __cacheline_aligned_in_smp;	/* Dirty mem may be over limit */
-
 /*
  * When balance_dirty_pages decides that the caller needs to perform some
  * non-background writeback, this is how many pages it will attempt to write.
@@ -103,6 +102,106 @@ EXPORT_SYMBOL(laptop_mode);
 static void background_writeout(unsigned long _min_pages);
 
 /*
+ * Scale the writeback cache size proportional to the relative writeout speeds.
+ *
+ * We do this by keeping a floating proportion between BDIs, based on page
+ * writeback completions [end_page_writeback()]. Those devices that write out
+ * pages fastest will get the larger share, while the slower will get a smaller
+ * share.
+ *
+ * We use page writeout completions because we are interested in getting rid of
+ * dirty pages. Having them written out is the primary goal.
+ *
+ * We introduce a concept of time, a period over which we measure these events,
+ * because demand can/will vary over time. The length of this period itself is
+ * measured in page writeback completions.
+ *
+ */
+static struct prop_descriptor vm_completions;
+
+static unsigned long determine_dirtyable_memory(void);
+
+/*
+ * couple the period to the dirty_ratio:
+ *
+ *   period/2 ~ roundup_pow_of_two(dirty limit)
+ */
+static int calc_period_shift(void)
+{
+	unsigned long dirty_total;
+
+	dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / 100;
+	return 2 + ilog2(dirty_total - 1);
+}
+
+/*
+ * update the period when the dirty ratio changes.
+ */
+int dirty_ratio_handler(ctl_table *table, int write,
+		struct file *filp, void __user *buffer, size_t *lenp,
+		loff_t *ppos)
+{
+	int old_ratio = vm_dirty_ratio;
+	int ret = proc_dointvec_minmax(table, write, filp, buffer, lenp, ppos);
+	if (ret == 0 && write && vm_dirty_ratio != old_ratio) {
+		int shift = calc_period_shift();
+		prop_change_shift(&vm_completions, shift);
+	}
+	return ret;
+}
+
+/*
+ * Increment the BDI's writeout completion count and the global writeout
+ * completion count. Called from test_clear_page_writeback().
+ */
+static void __bdi_writeout_inc(struct backing_dev_info *bdi)
+{
+	struct prop_global *pg = prop_get_global(&vm_completions);
+	__prop_inc(pg, &bdi->completions);
+	prop_put_global(&vm_completions, pg);
+}
+
+/*
+ * Obtain an accurate fraction of the BDI's portion.
+ */
+static void bdi_writeout_fraction(struct backing_dev_info *bdi,
+		long *numerator, long *denominator)
+{
+	if (bdi_cap_writeback_dirty(bdi)) {
+		struct prop_global *pg = prop_get_global(&vm_completions);
+		prop_fraction(pg, &bdi->completions, numerator, denominator);
+		prop_put_global(&vm_completions, pg);
+	} else {
+		*numerator = 0;
+		*denominator = 1;
+	}
+}
+
+/*
+ * Clip the earned share of dirty pages to that which is actually available.
+ * This avoids exceeding the total dirty_limit when the floating averages
+ * fluctuate too quickly.
+ */
+static void
+clip_bdi_dirty_limit(struct backing_dev_info *bdi, long dirty, long *pbdi_dirty)
+{
+	long avail_dirty;
+
+	avail_dirty = dirty -
+		(global_page_state(NR_FILE_DIRTY) +
+		 global_page_state(NR_WRITEBACK) +
+		 global_page_state(NR_UNSTABLE_NFS));
+
+	if (avail_dirty < 0)
+		avail_dirty = 0;
+
+	avail_dirty += bdi_stat(bdi, BDI_RECLAIMABLE) +
+		bdi_stat(bdi, BDI_WRITEBACK);
+
+	*pbdi_dirty = min(*pbdi_dirty, avail_dirty);
+}
+
+/*
  * Work out the current dirty-memory clamping and background writeout
  * thresholds.
  *
@@ -158,8 +257,8 @@ static unsigned long determine_dirtyable
 }
 
 static void
-get_dirty_limits(long *pbackground, long *pdirty,
-					struct address_space *mapping)
+get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
+		 struct backing_dev_info *bdi)
 {
 	int background_ratio;		/* Percentages */
 	int dirty_ratio;
@@ -193,6 +292,22 @@ get_dirty_limits(long *pbackground, long
 	}
 	*pbackground = background;
 	*pdirty = dirty;
+
+	if (bdi) {
+		long long bdi_dirty = dirty;
+		long numerator, denominator;
+
+		/*
+		 * Calculate this BDI's share of the dirty ratio.
+		 */
+		bdi_writeout_fraction(bdi, &numerator, &denominator);
+
+		bdi_dirty *= numerator;
+		do_div(bdi_dirty, denominator);
+
+		*pbdi_dirty = bdi_dirty;
+		clip_bdi_dirty_limit(bdi, dirty, pbdi_dirty);
+	}
 }
 
 /*
@@ -204,9 +319,11 @@ get_dirty_limits(long *pbackground, long
  */
 static void balance_dirty_pages(struct address_space *mapping)
 {
-	long nr_reclaimable;
+	long bdi_nr_reclaimable;
+	long bdi_nr_writeback;
 	long background_thresh;
 	long dirty_thresh;
+	long bdi_thresh;
 	unsigned long pages_written = 0;
 	unsigned long write_chunk = sync_writeback_pages();
 
@@ -221,15 +338,15 @@ static void balance_dirty_pages(struct a
 			.range_cyclic	= 1,
 		};
 
-		get_dirty_limits(&background_thresh, &dirty_thresh, mapping);
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
-		if (nr_reclaimable + global_page_state(NR_WRITEBACK) <=
-			dirty_thresh)
+		get_dirty_limits(&background_thresh, &dirty_thresh,
+				&bdi_thresh, bdi);
+		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
+		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
 				break;
 
-		if (!dirty_exceeded)
-			dirty_exceeded = 1;
+		if (!bdi->dirty_exceeded)
+			bdi->dirty_exceeded = 1;
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
@@ -237,16 +354,37 @@ static void balance_dirty_pages(struct a
 		 * written to the server's write cache, but has not yet
 		 * been flushed to permanent storage.
 		 */
-		if (nr_reclaimable) {
+		if (bdi_nr_reclaimable) {
 			writeback_inodes(&wbc);
-			get_dirty_limits(&background_thresh,
-					 	&dirty_thresh, mapping);
-			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
-			if (nr_reclaimable +
-				global_page_state(NR_WRITEBACK)
-					<= dirty_thresh)
-						break;
+
+			get_dirty_limits(&background_thresh, &dirty_thresh,
+				       &bdi_thresh, bdi);
+
+			/*
+			 * In order to avoid the stacked BDI deadlock we need
+			 * to ensure we accurately count the 'dirty' pages when
+			 * the threshold is low.
+			 *
+			 * Otherwise it would be possible to get thresh+n pages
+			 * reported dirty, even though there are thresh-m pages
+			 * actually dirty; with m+n sitting in the percpu
+			 * deltas.
+			 */
+			if (bdi_thresh < 2*bdi_stat_error(bdi)) {
+				bdi_nr_reclaimable =
+					bdi_stat_sum(bdi, BDI_RECLAIMABLE);
+				bdi_nr_writeback =
+					bdi_stat_sum(bdi, BDI_WRITEBACK);
+			} else {
+				bdi_nr_reclaimable =
+					bdi_stat(bdi, BDI_RECLAIMABLE);
+				bdi_nr_writeback =
+					bdi_stat(bdi, BDI_WRITEBACK);
+			}
+
+			if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
+				break;
+
 			pages_written += write_chunk - wbc.nr_to_write;
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
@@ -254,9 +392,9 @@ static void balance_dirty_pages(struct a
 		congestion_wait(WRITE, HZ/10);
 	}
 
-	if (nr_reclaimable + global_page_state(NR_WRITEBACK)
-		<= dirty_thresh && dirty_exceeded)
-			dirty_exceeded = 0;
+	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
+			bdi->dirty_exceeded)
+		bdi->dirty_exceeded = 0;
 
 	if (writeback_in_progress(bdi))
 		return;		/* pdflush is already working this queue */
@@ -270,7 +408,9 @@ static void balance_dirty_pages(struct a
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
 	if ((laptop_mode && pages_written) ||
-	     (!laptop_mode && (nr_reclaimable > background_thresh)))
+			(!laptop_mode && (global_page_state(NR_FILE_DIRTY)
+					  + global_page_state(NR_UNSTABLE_NFS)
+					  > background_thresh)))
 		pdflush_operation(background_writeout, 0);
 }
 
@@ -306,7 +446,7 @@ void balance_dirty_pages_ratelimited_nr(
 	unsigned long *p;
 
 	ratelimit = ratelimit_pages;
-	if (dirty_exceeded)
+	if (mapping->backing_dev_info->dirty_exceeded)
 		ratelimit = 8;
 
 	/*
@@ -342,7 +482,7 @@ void throttle_vm_writeout(gfp_t gfp_mask
 	}
 
         for ( ; ; ) {
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
+		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
 
                 /*
                  * Boost the allowable dirty threshold a bit for page
@@ -377,7 +517,7 @@ static void background_writeout(unsigned
 		long background_thresh;
 		long dirty_thresh;
 
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
+		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
 		if (global_page_state(NR_FILE_DIRTY) +
 			global_page_state(NR_UNSTABLE_NFS) < background_thresh
 				&& min_pages <= 0)
@@ -580,9 +720,14 @@ static struct notifier_block __cpuinitda
  */
 void __init page_writeback_init(void)
 {
+	int shift;
+
 	mod_timer(&wb_timer, jiffies + dirty_writeback_interval);
 	writeback_set_ratelimit();
 	register_cpu_notifier(&ratelimit_nb);
+
+	shift = calc_period_shift();
+	prop_descriptor_init(&vm_completions, shift);
 }
 
 /**
@@ -988,8 +1133,10 @@ int test_clear_page_writeback(struct pag
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
-			if (bdi_cap_writeback_dirty(bdi))
+			if (bdi_cap_writeback_dirty(bdi)) {
 				__dec_bdi_stat(bdi, BDI_WRITEBACK);
+				__bdi_writeout_inc(bdi);
+			}
 		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c
+++ linux-2.6/mm/backing-dev.c
@@ -12,11 +12,17 @@ int bdi_init(struct backing_dev_info *bd
 
 	for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
 		err = percpu_counter_init_irq(&bdi->bdi_stat[i], 0);
-		if (err) {
-			for (j = 0; j < i; j++)
-				perpcu_counter_destroy(&bdi->bdi_stat[i]);
-			break;
-		}
+		if (err)
+			goto err;
+	}
+
+	bdi->dirty_exceeded = 0;
+	err = prop_local_init(&bdi->completions);
+
+	if (err) {
+err:
+		for (j = 0; j < i; j++)
+			percpu_counter_destroy(&bdi->bdi_stat[i]);
 	}
 
 	return err;
@@ -29,6 +35,8 @@ void bdi_destroy(struct backing_dev_info
 
 	for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
 		percpu_counter_destroy(&bdi->bdi_stat[i]);
+
+	prop_local_destroy(&bdi->completions);
 }
 EXPORT_SYMBOL(bdi_destroy);
 
@@ -81,3 +89,4 @@ long congestion_wait(int rw, long timeou
 	return ret;
 }
 EXPORT_SYMBOL(congestion_wait);
+
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -174,6 +174,10 @@ extern ctl_table inotify_table[];
 int sysctl_legacy_va_layout;
 #endif
 
+extern int dirty_ratio_handler(ctl_table *table, int write,
+		struct file *filp, void __user *buffer, size_t *lenp,
+		loff_t *ppos);
+
 extern int prove_locking;
 extern int lock_stat;
 
@@ -831,7 +835,7 @@ static ctl_table vm_table[] = {
 		.data		= &vm_dirty_ratio,
 		.maxlen		= sizeof(vm_dirty_ratio),
 		.mode		= 0644,
-		.proc_handler	= &proc_dointvec_minmax,
+		.proc_handler	= &dirty_ratio_handler,
 		.strategy	= &sysctl_intvec,
 		.extra1		= &zero,
 		.extra2		= &one_hundred,

--


^ permalink raw reply	[flat|nested] 188+ messages in thread

* [PATCH 22/23] mm: dirty balancing for tasks
  2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
                   ` (20 preceding siblings ...)
  2007-08-03 12:37 ` [PATCH 21/23] mm: per device dirty threshold Peter Zijlstra
@ 2007-08-03 12:37 ` Peter Zijlstra
  2007-08-03 12:37 ` [PATCH 23/23] debug: sysfs files for the current ratio/size/total Peter Zijlstra
                   ` (3 subsequent siblings)
  25 siblings, 0 replies; 188+ messages in thread
From: Peter Zijlstra @ 2007-08-03 12:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: dirty_pages2.patch --]
[-- Type: text/plain, Size: 5322 bytes --]

Based on ideas of Andrew:
  http://marc.info/?l=linux-kernel&m=102912915020543&w=2

Scale the bdi dirty limit inversly with the tasks dirty rate.
This makes heavy writers have a lower dirty limit than the occasional writer. 

Andrea proposed something similar:
  http://lwn.net/Articles/152277/

The main disadvantage to his patch is that he uses an unrelated quantity to
measure time, which leaves him with a workload dependant tunable. Other than
that the two approached appear quite similar.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |    2 +
 kernel/exit.c         |    1 
 kernel/fork.c         |    8 +++++++
 mm/page-writeback.c   |   56 +++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 66 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -86,6 +86,7 @@ struct sched_param {
 #include <linux/timer.h>
 #include <linux/hrtimer.h>
 #include <linux/task_io_accounting.h>
+#include <linux/proportions.h>
 
 #include <asm/processor.h>
 
@@ -1188,6 +1189,7 @@ struct task_struct {
 #ifdef CONFIG_FAULT_INJECTION
 	int make_it_fail;
 #endif
+	struct prop_local_single dirties;
 };
 
 /*
Index: linux-2.6/kernel/exit.c
===================================================================
--- linux-2.6.orig/kernel/exit.c
+++ linux-2.6/kernel/exit.c
@@ -161,6 +161,7 @@ repeat:
 	ptrace_unlink(p);
 	BUG_ON(!list_empty(&p->ptrace_list) || !list_empty(&p->ptrace_children));
 	__exit_signal(p);
+	prop_local_destroy(&p->dirties);
 
 	/*
 	 * If we are the last non-leader member of the thread
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -163,6 +163,7 @@ static struct task_struct *dup_task_stru
 {
 	struct task_struct *tsk;
 	struct thread_info *ti;
+	int err;
 
 	prepare_to_copy(orig);
 
@@ -176,6 +177,13 @@ static struct task_struct *dup_task_stru
 		return NULL;
 	}
 
+	err = prop_local_init(&tsk->dirties);
+	if (err) {
+		free_thread_info(ti);
+		free_task_struct(tsk);
+		return NULL;
+	}
+
 	*tsk = *orig;
 	tsk->stack = ti;
 	setup_thread_stack(tsk, orig);
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -118,6 +118,7 @@ static void background_writeout(unsigned
  *
  */
 static struct prop_descriptor vm_completions;
+static struct prop_descriptor vm_dirties;
 
 static unsigned long determine_dirtyable_memory(void);
 
@@ -146,6 +147,7 @@ int dirty_ratio_handler(ctl_table *table
 	if (ret == 0 && write && vm_dirty_ratio != old_ratio) {
 		int shift = calc_period_shift();
 		prop_change_shift(&vm_completions, shift);
+		prop_change_shift(&vm_dirties, shift);
 	}
 	return ret;
 }
@@ -161,6 +163,16 @@ static void __bdi_writeout_inc(struct ba
 	prop_put_global(&vm_completions, pg);
 }
 
+static void task_dirty_inc(struct task_struct *tsk)
+{
+	unsigned long flags;
+	struct prop_global *pg = prop_get_global(&vm_dirties);
+	local_irq_save(flags);
+	__prop_inc(pg, &tsk->dirties);
+	local_irq_restore(flags);
+	prop_put_global(&vm_dirties, pg);
+}
+
 /*
  * Obtain an accurate fraction of the BDI's portion.
  */
@@ -201,6 +213,38 @@ clip_bdi_dirty_limit(struct backing_dev_
 	*pbdi_dirty = min(*pbdi_dirty, avail_dirty);
 }
 
+void task_dirties_fraction(struct task_struct *tsk,
+		long *numerator, long *denominator)
+{
+	struct prop_global *pg = prop_get_global(&vm_dirties);
+	prop_fraction(pg, &tsk->dirties, numerator, denominator);
+	prop_put_global(&vm_dirties, pg);
+}
+
+/*
+ * scale the dirty limit
+ *
+ * task specific dirty limit:
+ *
+ *   dirty -= (dirty/2) * p_{t}
+ */
+void task_dirty_limit(struct task_struct *tsk, long *pdirty)
+{
+	long numerator, denominator;
+	long dirty = *pdirty;
+	long long inv = dirty >> 1;
+
+	task_dirties_fraction(tsk, &numerator, &denominator);
+	inv *= numerator;
+	do_div(inv, denominator);
+
+	dirty -= inv;
+	if (dirty < *pdirty/2)
+		dirty = *pdirty/2;
+
+	*pdirty = dirty;
+}
+
 /*
  * Work out the current dirty-memory clamping and background writeout
  * thresholds.
@@ -307,6 +351,7 @@ get_dirty_limits(long *pbackground, long
 
 		*pbdi_dirty = bdi_dirty;
 		clip_bdi_dirty_limit(bdi, dirty, pbdi_dirty);
+		task_dirty_limit(current, pbdi_dirty);
 	}
 }
 
@@ -728,6 +773,7 @@ void __init page_writeback_init(void)
 
 	shift = calc_period_shift();
 	prop_descriptor_init(&vm_completions, shift);
+	prop_descriptor_init(&vm_dirties, shift);
 }
 
 /**
@@ -1006,7 +1052,7 @@ EXPORT_SYMBOL(redirty_page_for_writepage
  * If the mapping doesn't provide a set_page_dirty a_op, then
  * just fall through and assume that it wants buffer_heads.
  */
-int fastcall set_page_dirty(struct page *page)
+static int __set_page_dirty(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 
@@ -1024,6 +1070,14 @@ int fastcall set_page_dirty(struct page 
 	}
 	return 0;
 }
+
+int fastcall set_page_dirty(struct page *page)
+{
+	int ret = __set_page_dirty(page);
+	if (ret)
+		task_dirty_inc(current);
+	return ret;
+}
 EXPORT_SYMBOL(set_page_dirty);
 
 /*

--


^ permalink raw reply	[flat|nested] 188+ messages in thread

* [PATCH 23/23] debug: sysfs files for the current ratio/size/total
  2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
                   ` (21 preceding siblings ...)
  2007-08-03 12:37 ` [PATCH 22/23] mm: dirty balancing for tasks Peter Zijlstra
@ 2007-08-03 12:37 ` Peter Zijlstra
  2007-08-03 22:21 ` [PATCH 00/23] per device dirty throttling -v8 Linus Torvalds
                   ` (2 subsequent siblings)
  25 siblings, 0 replies; 188+ messages in thread
From: Peter Zijlstra @ 2007-08-03 12:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: bdi_stat_debug.patch --]
[-- Type: text/plain, Size: 4216 bytes --]

Expose the per bdi dirty limits in sysfs

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 block/ll_rw_blk.c   |   80 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/page-writeback.c |    4 +-
 2 files changed, 82 insertions(+), 2 deletions(-)

Index: linux-2.6/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/block/ll_rw_blk.c
+++ linux-2.6/block/ll_rw_blk.c
@@ -3995,6 +3995,56 @@ static ssize_t queue_nr_writeback_show(s
 			nr_writeback >> (PAGE_CACHE_SHIFT - 10));
 }
 
+extern void bdi_writeout_fraction(struct backing_dev_info *bdi,
+		long *numerator, long *denominator);
+
+static ssize_t queue_nr_cache_ratio_show(struct request_queue *q, char *page)
+{
+	long scale, div;
+
+	bdi_writeout_fraction(&q->backing_dev_info, &scale, &div);
+	scale *= 1024;
+	scale /= div;
+
+	return sprintf(page, "%ld\n", scale);
+}
+
+static ssize_t queue_nr_cache_num_show(struct request_queue *q, char *page)
+{
+	long scale, div;
+
+	bdi_writeout_fraction(&q->backing_dev_info, &scale, &div);
+
+	return sprintf(page, "%ld\n", scale);
+}
+
+static ssize_t queue_nr_cache_denom_show(struct request_queue *q, char *page)
+{
+	long scale, div;
+
+	bdi_writeout_fraction(&q->backing_dev_info, &scale, &div);
+
+	return sprintf(page, "%ld\n", div);
+}
+
+extern void
+get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
+		struct backing_dev_info *bdi);
+
+static ssize_t queue_nr_cache_size_show(struct request_queue *q, char *page)
+{
+	long background, dirty, bdi_dirty;
+	get_dirty_limits(&background, &dirty, &bdi_dirty, &q->backing_dev_info);
+	return sprintf(page, "%ld\n", bdi_dirty);
+}
+
+static ssize_t queue_nr_cache_total_show(struct request_queue *q, char *page)
+{
+	long background, dirty, bdi_dirty;
+	get_dirty_limits(&background, &dirty, &bdi_dirty, &q->backing_dev_info);
+	return sprintf(page, "%ld\n", dirty);
+}
+
 static struct queue_sysfs_entry queue_requests_entry = {
 	.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_requests_show,
@@ -4028,6 +4078,31 @@ static struct queue_sysfs_entry queue_wr
 	.show = queue_nr_writeback_show,
 };
 
+static struct queue_sysfs_entry queue_cache_ratio_entry = {
+	.attr = {.name = "cache_ratio", .mode = S_IRUGO },
+	.show = queue_nr_cache_ratio_show,
+};
+
+static struct queue_sysfs_entry queue_cache_num_entry = {
+	.attr = {.name = "cache_num", .mode = S_IRUGO },
+	.show = queue_nr_cache_num_show,
+};
+
+static struct queue_sysfs_entry queue_cache_denom_entry = {
+	.attr = {.name = "cache_denom", .mode = S_IRUGO },
+	.show = queue_nr_cache_denom_show,
+};
+
+static struct queue_sysfs_entry queue_cache_size_entry = {
+	.attr = {.name = "cache_size", .mode = S_IRUGO },
+	.show = queue_nr_cache_size_show,
+};
+
+static struct queue_sysfs_entry queue_cache_total_entry = {
+	.attr = {.name = "cache_total", .mode = S_IRUGO },
+	.show = queue_nr_cache_total_show,
+};
+
 static struct queue_sysfs_entry queue_iosched_entry = {
 	.attr = {.name = "scheduler", .mode = S_IRUGO | S_IWUSR },
 	.show = elv_iosched_show,
@@ -4041,6 +4116,11 @@ static struct attribute *default_attrs[]
 	&queue_max_sectors_entry.attr,
 	&queue_reclaimable_entry.attr,
 	&queue_writeback_entry.attr,
+	&queue_cache_ratio_entry.attr,
+	&queue_cache_num_entry.attr,
+	&queue_cache_denom_entry.attr,
+	&queue_cache_size_entry.attr,
+	&queue_cache_total_entry.attr,
 	&queue_iosched_entry.attr,
 	NULL,
 };
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -176,7 +176,7 @@ static void task_dirty_inc(struct task_s
 /*
  * Obtain an accurate fraction of the BDI's portion.
  */
-static void bdi_writeout_fraction(struct backing_dev_info *bdi,
+void bdi_writeout_fraction(struct backing_dev_info *bdi,
 		long *numerator, long *denominator)
 {
 	if (bdi_cap_writeback_dirty(bdi)) {
@@ -300,7 +300,7 @@ static unsigned long determine_dirtyable
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
-static void
+void
 get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
 		 struct backing_dev_info *bdi)
 {

--


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
                   ` (22 preceding siblings ...)
  2007-08-03 12:37 ` [PATCH 23/23] debug: sysfs files for the current ratio/size/total Peter Zijlstra
@ 2007-08-03 22:21 ` Linus Torvalds
  2007-08-04  6:32   ` Ingo Molnar
  2007-08-06 20:26 ` Miklos Szeredi
  2007-08-08 12:25 ` richard kennedy
  25 siblings, 1 reply; 188+ messages in thread
From: Linus Torvalds @ 2007-08-03 22:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard



On Fri, 3 Aug 2007, Peter Zijlstra wrote:
> 
> These patches aim to improve balance_dirty_pages() and directly address three
> issues:
>   1) inter device starvation
>   2) stacked device deadlocks
>   3) inter process starvation

Ok, the patches certainly look pretty enough, and you fixed the only thing 
I complained about last time (naming), so as far as I'm concerned it's now 
just a matter of whether it *works* or not. I guess being in -mm will help 
somewhat, but it would be good to have people with several disks etc 
actively test this out.

		Linus

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-03 22:21 ` [PATCH 00/23] per device dirty throttling -v8 Linus Torvalds
@ 2007-08-04  6:32   ` Ingo Molnar
  2007-08-04  7:07     ` Ingo Molnar
  2007-08-05 17:22     ` Brice Figureau
  0 siblings, 2 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-04  6:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 3 Aug 2007, Peter Zijlstra wrote:
> > 
> > These patches aim to improve balance_dirty_pages() and directly address three
> > issues:
> >   1) inter device starvation
> >   2) stacked device deadlocks
> >   3) inter process starvation
> 
> Ok, the patches certainly look pretty enough, and you fixed the only 
> thing I complained about last time (naming), so as far as I'm 
> concerned it's now just a matter of whether it *works* or not. I guess 
> being in -mm will help somewhat, but it would be good to have people 
> with several disks etc actively test this out.

There are positive reports in the never-ending "my system crawls like an 
XT when copying large files" bugzilla entry:

 http://bugzilla.kernel.org/show_bug.cgi?id=7372

 " vfs_cache_pressure=1
   TCQ   nr_requests
   8     128    not that bad
   1     128    snappiest configuration, almost no pauses
                (or unnoticable ones) "
 
 " 1) vfs_cache_pressure at 100, 2.6.21.5+per bdi throttling patch 
   Result is good, not as snappier as I'd want during a large copy but 
   still usable. No process seems stuck for agen, but there seems to be 
   some short (second or subsecond) moment where everything is stuck 
   (like if you run a top d 0.5, the screen is not updated on a regular
   basis).

   2) vfs_cache_pressure at 1, 2.6.21.5+per bdi throttling patch Result
   is at 2.6.17 level. It is the better combination since 2.6.17. "

 " 1) I've applied the patches posted by Peter Zijlstra in comment #76 
   to the 2.6.21-mm2 kernel to check if it removes the problem. My
   impression is that the problem is still there with those patches,
   although less visible then with the clean 2.6.21 kernel. "

so the whole problem area seems to be a "perfect storm" created by a 
combination of TCQ, IO scheduling and VM dirty handling weaknesses. Per 
device dirty throttling is a good step forward and it makes a very 
visible positive difference.

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04  6:32   ` Ingo Molnar
@ 2007-08-04  7:07     ` Ingo Molnar
  2007-08-04  7:44       ` david
                         ` (2 more replies)
  2007-08-05 17:22     ` Brice Figureau
  1 sibling, 3 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-04  7:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard


* Ingo Molnar <mingo@elte.hu> wrote:

> There are positive reports in the never-ending "my system crawls like 
> an XT when copying large files" bugzilla entry:
> 
>  http://bugzilla.kernel.org/show_bug.cgi?id=7372

i forgot this entry:

 " We recently upgraded our office to gigabit Ethernet and got some big 
   AMD64 / 3ware boxes for file and vmware servers... only to find them 
   almost useless under any kind of real load. I've built some patched 
   2.6.21.6 kernels (using the bdi throttling patch you mentioned) to 
   see if our various Debian Etch boxes run better. So far my testing 
   shows a *great* improvement over the stock Debian 2.6.18 kernel on 
   our configurations. "

and bdi has been in -mm in the past i think, so we also know (to a 
certain degree) that it does not hurt those workloads that are fine 
either.

[ my personal interest in this is the following regression: every time i
  start a large kernel build with DEBUG_INFO on a quad-core 4GB RAM box,
  i get up to 30 seconds complete pauses in Vim (and most other tasks),
  during plain editing of the source code. (which happens when Vim tries
  to write() to its swap/undo-file.) ]

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04  7:07     ` Ingo Molnar
@ 2007-08-04  7:44       ` david
  2007-08-04 16:01         ` Ray Lee
  2007-08-04 10:33       ` Ingo Molnar
  2007-08-04 16:15       ` Linus Torvalds
  2 siblings, 1 reply; 188+ messages in thread
From: david @ 2007-08-04  7:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, linux-mm, linux-kernel, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard

On Sat, 4 Aug 2007, Ingo Molnar wrote:

> * Ingo Molnar <mingo@elte.hu> wrote:
>
>> There are positive reports in the never-ending "my system crawls like
>> an XT when copying large files" bugzilla entry:
>>
>>  http://bugzilla.kernel.org/show_bug.cgi?id=7372
>
> i forgot this entry:
>
> " We recently upgraded our office to gigabit Ethernet and got some big
>   AMD64 / 3ware boxes for file and vmware servers... only to find them
>   almost useless under any kind of real load. I've built some patched
>   2.6.21.6 kernels (using the bdi throttling patch you mentioned) to
>   see if our various Debian Etch boxes run better. So far my testing
>   shows a *great* improvement over the stock Debian 2.6.18 kernel on
>   our configurations. "
>
> and bdi has been in -mm in the past i think, so we also know (to a
> certain degree) that it does not hurt those workloads that are fine
> either.
>
> [ my personal interest in this is the following regression: every time i
>  start a large kernel build with DEBUG_INFO on a quad-core 4GB RAM box,
>  i get up to 30 seconds complete pauses in Vim (and most other tasks),
>  during plain editing of the source code. (which happens when Vim tries
>  to write() to its swap/undo-file.) ]

I have an issue that sounds like it's related.

I've got a syslog server that's got two Opteron 246 cpu's, 16G ram, 2x140G 
15k rpm drives (fusion MPT hardware mirroring), 16x500G 7200rpm SATA 
drives on 3ware 9500 cards (software raid6) running 2.6.20.3 with hz set 
at default and preempt turned off.

I have syslog doing buffered writes to the SCSI drives and every 5 min a 
cron job copies the data to the raid array.

I've found that if I do anything significant on the large raid array that 
the system looses a significant amount of the UDP syslog traffic, even 
though there should be pleanty of ram and cpu (and the spindles involved 
in the writes are not being touched), even a grep can cause up to 40% 
losses in the syslog traffic. I've experimented with nice levels (nicing 
down the grep and nicing up the syslogd) without a noticable effect on the 
losses.

I've been planning to try a new kernel with hz=1000 to see if that would 
help, and after that experiment with the various preempt settings, but it 
sounds like the per-device queues may actually be more relavent to the 
problem.

what would you suggest I test, and in what order and combination?

David Lang

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04  7:07     ` Ingo Molnar
  2007-08-04  7:44       ` david
@ 2007-08-04 10:33       ` Ingo Molnar
  2007-08-04 16:17         ` Linus Torvalds
  2007-08-05  0:28         ` Andi Kleen
  2007-08-04 16:15       ` Linus Torvalds
  2 siblings, 2 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-04 10:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard


* Ingo Molnar <mingo@elte.hu> wrote:

> [ my personal interest in this is the following regression: every time 
>   i start a large kernel build with DEBUG_INFO on a quad-core 4GB RAM 
>   box, i get up to 30 seconds complete pauses in Vim (and most other 
>   tasks), during plain editing of the source code. (which happens when 
>   Vim tries to write() to its swap/undo-file.) ]

hm, it turns out that it's due to vim doing an occasional fsync not only 
on writeout, but during normal use too. "set nofsync" in the .vimrc 
solves this problem.

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04  7:44       ` david
@ 2007-08-04 16:01         ` Ray Lee
  2007-08-04 17:15           ` david
  2007-08-09  5:11           ` david
  0 siblings, 2 replies; 188+ messages in thread
From: Ray Lee @ 2007-08-04 16:01 UTC (permalink / raw)
  To: david
  Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra, linux-mm,
	linux-kernel, miklos, akpm, neilb, dgc, tomoki.sekiyama.qu,
	nikita, trond.myklebust, yingchao.zhou, richard, netdev

(adding netdev cc:)

On 8/4/07, david@lang.hm <david@lang.hm> wrote:
> On Sat, 4 Aug 2007, Ingo Molnar wrote:
>
> > * Ingo Molnar <mingo@elte.hu> wrote:
> >
> >> There are positive reports in the never-ending "my system crawls like
> >> an XT when copying large files" bugzilla entry:
> >>
> >>  http://bugzilla.kernel.org/show_bug.cgi?id=7372
> >
> > i forgot this entry:
> >
> > " We recently upgraded our office to gigabit Ethernet and got some big
> >   AMD64 / 3ware boxes for file and vmware servers... only to find them
> >   almost useless under any kind of real load. I've built some patched
> >   2.6.21.6 kernels (using the bdi throttling patch you mentioned) to
> >   see if our various Debian Etch boxes run better. So far my testing
> >   shows a *great* improvement over the stock Debian 2.6.18 kernel on
> >   our configurations. "
> >
> > and bdi has been in -mm in the past i think, so we also know (to a
> > certain degree) that it does not hurt those workloads that are fine
> > either.
> >
> > [ my personal interest in this is the following regression: every time i
> >  start a large kernel build with DEBUG_INFO on a quad-core 4GB RAM box,
> >  i get up to 30 seconds complete pauses in Vim (and most other tasks),
> >  during plain editing of the source code. (which happens when Vim tries
> >  to write() to its swap/undo-file.) ]
>
> I have an issue that sounds like it's related.
>
> I've got a syslog server that's got two Opteron 246 cpu's, 16G ram, 2x140G
> 15k rpm drives (fusion MPT hardware mirroring), 16x500G 7200rpm SATA
> drives on 3ware 9500 cards (software raid6) running 2.6.20.3 with hz set
> at default and preempt turned off.
>
> I have syslog doing buffered writes to the SCSI drives and every 5 min a
> cron job copies the data to the raid array.
>
> I've found that if I do anything significant on the large raid array that
> the system looses a significant amount of the UDP syslog traffic, even
> though there should be pleanty of ram and cpu (and the spindles involved
> in the writes are not being touched), even a grep can cause up to 40%
> losses in the syslog traffic. I've experimented with nice levels (nicing
> down the grep and nicing up the syslogd) without a noticable effect on the
> losses.
>
> I've been planning to try a new kernel with hz=1000 to see if that would
> help, and after that experiment with the various preempt settings, but it
> sounds like the per-device queues may actually be more relavent to the
> problem.
>
> what would you suggest I test, and in what order and combination?

At least on a surface level, your report has some similarities to
http://lkml.org/lkml/2007/5/21/84 . In that message, John Miller
mentions several things he tried without effect:

< - I increased the max allowed receive buffer through
< proc/sys/net/core/rmem_max and the application calls the right
< syscall. "netstat -su" does not show any "packet receive errors".
<
< - After getting "kernel: swapper: page allocation failure.
< order:0, mode:0x20", I increased /proc/sys/vm/min_free_kbytes
<
< - ixgb.txt in kernel network documentation suggests to increase
< net.core.netdev_max_backlog to 300000. This did not help.
<
< - I also had to increase net.core.optmem_max, because the default
< value was too small for 700 multicast groups.

As they're all pretty simple to test, it may be worthwhile to give
them a shot just to rule things out.

Ray

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04  7:07     ` Ingo Molnar
  2007-08-04  7:44       ` david
  2007-08-04 10:33       ` Ingo Molnar
@ 2007-08-04 16:15       ` Linus Torvalds
  2 siblings, 0 replies; 188+ messages in thread
From: Linus Torvalds @ 2007-08-04 16:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard



On Sat, 4 Aug 2007, Ingo Molnar wrote:
> 
> i forgot this entry:
> 
>  " We recently upgraded our office to gigabit Ethernet and got some big 
>    AMD64 / 3ware boxes for file and vmware servers... only to find them 
>    almost useless under any kind of real load. I've built some patched 
>    2.6.21.6 kernels (using the bdi throttling patch you mentioned) to 
>    see if our various Debian Etch boxes run better. So far my testing 
>    shows a *great* improvement over the stock Debian 2.6.18 kernel on 
>    our configurations. "

Well, quite frankly, there are other changes between 2.6.18 and 2.6.21 
that are more likely to be a big deal than Peter's patches. No offense to 
Peter, but we also cut the default dirty percentage by a factor of four in 
that timeframe, and that made a *huge* difference for some setups (and 
admittedly not so much on others ;)

> and bdi has been in -mm in the past i think, so we also know (to a 
> certain degree) that it does not hurt those workloads that are fine 
> either.

Hey, I'm not complaining. I think the code looks fine. I just want to make 
sure that it actually helps.

> [ my personal interest in this is the following regression: every time i
>   start a large kernel build with DEBUG_INFO on a quad-core 4GB RAM box,
>   i get up to 30 seconds complete pauses in Vim (and most other tasks),
>   during plain editing of the source code. (which happens when Vim tries
>   to write() to its swap/undo-file.) ]

So do the patches really end up helping your case? Or is this just why 
you're following it, and hoping they'll eventually do so?

		Linus

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 10:33       ` Ingo Molnar
@ 2007-08-04 16:17         ` Linus Torvalds
  2007-08-04 16:37           ` Ingo Molnar
  2007-08-04 16:41           ` Andrew Morton
  2007-08-05  0:28         ` Andi Kleen
  1 sibling, 2 replies; 188+ messages in thread
From: Linus Torvalds @ 2007-08-04 16:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard



On Sat, 4 Aug 2007, Ingo Molnar wrote:

> > [ my personal interest in this is the following regression: every time 
> >   i start a large kernel build with DEBUG_INFO on a quad-core 4GB RAM 
> >   box, i get up to 30 seconds complete pauses in Vim (and most other 
> >   tasks), during plain editing of the source code. (which happens when 
> >   Vim tries to write() to its swap/undo-file.) ]
> 
> hm, it turns out that it's due to vim doing an occasional fsync not only 
> on writeout, but during normal use too. "set nofsync" in the .vimrc 
> solves this problem.

Yes, that's independent. The fact is, ext3 *sucks* at fsync. I hate hate 
hate it. It's totally unusable, imnsho.

The whole point of fsync() is that it should sync only that one file, and 
avoid syncing all the other stuff that is going on, and ext3 violates 
that, because it ends up having to sync the whole log, or something like 
that. So even if vim really wants to sync a small file, you end up waiting 
for megabytes of data being written out.

I detest logging filesystems. 

			Linus

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 16:17         ` Linus Torvalds
@ 2007-08-04 16:37           ` Ingo Molnar
  2007-08-04 16:51             ` Andrew Morton
                               ` (4 more replies)
  2007-08-04 16:41           ` Andrew Morton
  1 sibling, 5 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-04 16:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > hm, it turns out that it's due to vim doing an occasional fsync not 
> > only on writeout, but during normal use too. "set nofsync" in the 
> > .vimrc solves this problem.
> 
> Yes, that's independent. The fact is, ext3 *sucks* at fsync. I hate 
> hate hate it. It's totally unusable, imnsho.

yeah, it's really ugly. But otherwise i've got no real complaint about 
ext3 - with the obligatory qualification that "noatime,nodiratime" in 
/etc/fstab is a must. This speeds up things very visibly - especially 
when lots of files are accessed. It's kind of weird that every Linux 
desktop and server is hurt by a noticeable IO performance slowdown due 
to the constant atime updates, while there's just two real users of it: 
tmpwatch [which can be configured to use ctime so it's not a big issue] 
and some backup tools. (Ok, and mail-notify too i guess.) Out of tens of 
thousands of applications. So for most file workloads we give Windows a 
20%-30% performance edge, for almost nothing. (for RAM-starved kernel 
builds the performance difference between atime and noatime+nodiratime 
setups is more on the order of 40%)

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 16:17         ` Linus Torvalds
  2007-08-04 16:37           ` Ingo Molnar
@ 2007-08-04 16:41           ` Andrew Morton
  2007-08-04 17:26             ` Nikita Danilov
  2007-08-04 19:16             ` Florian Weimer
  1 sibling, 2 replies; 188+ messages in thread
From: Andrew Morton @ 2007-08-04 16:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Peter Zijlstra, linux-mm, linux-kernel, miklos,
	neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard

On Sat, 4 Aug 2007 09:17:44 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Sat, 4 Aug 2007, Ingo Molnar wrote:
> 
> > > [ my personal interest in this is the following regression: every time 
> > >   i start a large kernel build with DEBUG_INFO on a quad-core 4GB RAM 
> > >   box, i get up to 30 seconds complete pauses in Vim (and most other 
> > >   tasks), during plain editing of the source code. (which happens when 
> > >   Vim tries to write() to its swap/undo-file.) ]
> > 
> > hm, it turns out that it's due to vim doing an occasional fsync not only 
> > on writeout, but during normal use too. "set nofsync" in the .vimrc 
> > solves this problem.
> 
> Yes, that's independent. The fact is, ext3 *sucks* at fsync. I hate hate 
> hate it. It's totally unusable, imnsho.
> 
> The whole point of fsync() is that it should sync only that one file, and 
> avoid syncing all the other stuff that is going on, and ext3 violates 
> that, because it ends up having to sync the whole log, or something like 
> that. So even if vim really wants to sync a small file, you end up waiting 
> for megabytes of data being written out.
> 
> I detest logging filesystems. 
> 

Well it's not a problem with journalling per-se.  Other journalling designs
may well not have this problem.

It's an unfortunate coupling:

- the ext3 journal contains metadata from all altered files

- ordered-mode needs to write back data for a file before committing its
  metdata to the journal.

- fsync of one file requires a commit for its metadata, which will commit
  metadata for all files

- hence we need to write back all data for all files which have metadata
  in the journal.


It's pretty much unfixable given the ext3 journalling design, and the
guarantees which data-ordered provides.

The easy preventive is to mount with data=writeback.  Maybe that should
have been the default.

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 16:37           ` Ingo Molnar
@ 2007-08-04 16:51             ` Andrew Morton
  2007-08-04 16:56               ` Ingo Molnar
  2007-08-04 17:02             ` Diego Calleja
                               ` (3 subsequent siblings)
  4 siblings, 1 reply; 188+ messages in thread
From: Andrew Morton @ 2007-08-04 16:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, linux-mm, linux-kernel, miklos,
	neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard

On Sat, 4 Aug 2007 18:37:33 +0200 Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > > hm, it turns out that it's due to vim doing an occasional fsync not 
> > > only on writeout, but during normal use too. "set nofsync" in the 
> > > .vimrc solves this problem.
> > 
> > Yes, that's independent. The fact is, ext3 *sucks* at fsync. I hate 
> > hate hate it. It's totally unusable, imnsho.
> 
> yeah, it's really ugly. But otherwise i've got no real complaint about 
> ext3 - with the obligatory qualification that "noatime,nodiratime" in 
> /etc/fstab is a must. This speeds up things very visibly - especially 
> when lots of files are accessed. It's kind of weird that every Linux 
> desktop and server is hurt by a noticeable IO performance slowdown due 
> to the constant atime updates,

Not just more IO: it will cause great gobs of blockdev pagecache to remain
in memory, too.


> while there's just two real users of it: 
> tmpwatch [which can be configured to use ctime so it's not a big issue] 
> and some backup tools. (Ok, and mail-notify too i guess.) Out of tens of 
> thousands of applications. So for most file workloads we give Windows a 
> 20%-30% performance edge, for almost nothing. (for RAM-starved kernel 
> builds the performance difference between atime and noatime+nodiratime 
> setups is more on the order of 40%)
> 
> 	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 16:51             ` Andrew Morton
@ 2007-08-04 16:56               ` Ingo Molnar
  2007-08-04 20:23                 ` Alan Cox
  0 siblings, 1 reply; 188+ messages in thread
From: Ingo Molnar @ 2007-08-04 16:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, linux-mm, linux-kernel, miklos,
	neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard


* Andrew Morton <akpm@linux-foundation.org> wrote:

> > yeah, it's really ugly. But otherwise i've got no real complaint 
> > about ext3 - with the obligatory qualification that 
> > "noatime,nodiratime" in /etc/fstab is a must. This speeds up things 
> > very visibly - especially when lots of files are accessed. It's kind 
> > of weird that every Linux desktop and server is hurt by a noticeable 
> > IO performance slowdown due to the constant atime updates,
> 
> Not just more IO: it will cause great gobs of blockdev pagecache to 
> remain in memory, too.

i tried to convince distro folks about it ... but there's fear, 
uncertainty and doubt about touching /etc/fstab and i suspect no major 
distro will do it until another does it - which is a catch-22 :-/ So i 
guess we should add a kernel config option that allows the kernel rpm 
maker to just disable atime by default. (re-enableable via boot-line and 
fstab entry too) [That new kernel config option would be disabled by 
default.] That makes it much easier to control and introduce.

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 16:37           ` Ingo Molnar
  2007-08-04 16:51             ` Andrew Morton
@ 2007-08-04 17:02             ` Diego Calleja
  2007-08-04 17:17               ` Ingo Molnar
  2007-08-04 17:39             ` Linus Torvalds
                               ` (2 subsequent siblings)
  4 siblings, 1 reply; 188+ messages in thread
From: Diego Calleja @ 2007-08-04 17:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, linux-mm, linux-kernel, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard

El Sat, 4 Aug 2007 18:37:33 +0200, Ingo Molnar <mingo@elte.hu> escribió:

> thousands of applications. So for most file workloads we give Windows a 
> 20%-30% performance edge, for almost nothing. (for RAM-starved kernel 
> builds the performance difference between atime and noatime+nodiratime 
> setups is more on the order of 40%)

Just curious - do you have numbers with relatime?

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 16:01         ` Ray Lee
@ 2007-08-04 17:15           ` david
  2007-08-09  5:11           ` david
  1 sibling, 0 replies; 188+ messages in thread
From: david @ 2007-08-04 17:15 UTC (permalink / raw)
  To: Ray Lee
  Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra, linux-mm,
	linux-kernel, miklos, akpm, neilb, dgc, tomoki.sekiyama.qu,
	nikita, trond.myklebust, yingchao.zhou, richard, netdev

On Sat, 4 Aug 2007, Ray Lee wrote:

> (adding netdev cc:)
>
> On 8/4/07, david@lang.hm <david@lang.hm> wrote:
>> On Sat, 4 Aug 2007, Ingo Molnar wrote:
>>
>>> * Ingo Molnar <mingo@elte.hu> wrote:
>>>
>>>> There are positive reports in the never-ending "my system crawls like
>>>> an XT when copying large files" bugzilla entry:
>>>>
>>>>  http://bugzilla.kernel.org/show_bug.cgi?id=7372
>>>
>>> i forgot this entry:
>>>
>>> " We recently upgraded our office to gigabit Ethernet and got some big
>>>   AMD64 / 3ware boxes for file and vmware servers... only to find them
>>>   almost useless under any kind of real load. I've built some patched
>>>   2.6.21.6 kernels (using the bdi throttling patch you mentioned) to
>>>   see if our various Debian Etch boxes run better. So far my testing
>>>   shows a *great* improvement over the stock Debian 2.6.18 kernel on
>>>   our configurations. "
>>>
>>> and bdi has been in -mm in the past i think, so we also know (to a
>>> certain degree) that it does not hurt those workloads that are fine
>>> either.
>>>
>>> [ my personal interest in this is the following regression: every time i
>>>  start a large kernel build with DEBUG_INFO on a quad-core 4GB RAM box,
>>>  i get up to 30 seconds complete pauses in Vim (and most other tasks),
>>>  during plain editing of the source code. (which happens when Vim tries
>>>  to write() to its swap/undo-file.) ]
>>
>> I have an issue that sounds like it's related.
>>
>> I've got a syslog server that's got two Opteron 246 cpu's, 16G ram, 2x140G
>> 15k rpm drives (fusion MPT hardware mirroring), 16x500G 7200rpm SATA
>> drives on 3ware 9500 cards (software raid6) running 2.6.20.3 with hz set
>> at default and preempt turned off.
>>
>> I have syslog doing buffered writes to the SCSI drives and every 5 min a
>> cron job copies the data to the raid array.
>>
>> I've found that if I do anything significant on the large raid array that
>> the system looses a significant amount of the UDP syslog traffic, even
>> though there should be pleanty of ram and cpu (and the spindles involved
>> in the writes are not being touched), even a grep can cause up to 40%
>> losses in the syslog traffic. I've experimented with nice levels (nicing
>> down the grep and nicing up the syslogd) without a noticable effect on the
>> losses.
>>
>> I've been planning to try a new kernel with hz=1000 to see if that would
>> help, and after that experiment with the various preempt settings, but it
>> sounds like the per-device queues may actually be more relavent to the
>> problem.
>>
>> what would you suggest I test, and in what order and combination?
>
> At least on a surface level, your report has some similarities to
> http://lkml.org/lkml/2007/5/21/84 . In that message, John Miller
> mentions several things he tried without effect:
>
> < - I increased the max allowed receive buffer through
> < proc/sys/net/core/rmem_max and the application calls the right
> < syscall. "netstat -su" does not show any "packet receive errors".
> <
> < - After getting "kernel: swapper: page allocation failure.
> < order:0, mode:0x20", I increased /proc/sys/vm/min_free_kbytes
> <
> < - ixgb.txt in kernel network documentation suggests to increase
> < net.core.netdev_max_backlog to 300000. This did not help.
> <
> < - I also had to increase net.core.optmem_max, because the default
> < value was too small for 700 multicast groups.
>
> As they're all pretty simple to test, it may be worthwhile to give
> them a shot just to rule things out.

I will try them later today.

I forgot to mention that the filesystems are ext2 for the mirrored high 
speed disks and xfs for the 8TB array.

David Lang

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 17:02             ` Diego Calleja
@ 2007-08-04 17:17               ` Ingo Molnar
  2007-08-04 17:38                 ` Diego Calleja
  2007-08-08 10:43                 ` Karel Zak
  0 siblings, 2 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-04 17:17 UTC (permalink / raw)
  To: Diego Calleja
  Cc: Linus Torvalds, Peter Zijlstra, linux-mm, linux-kernel, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard


* Diego Calleja <diegocg@gmail.com> wrote:

> El Sat, 4 Aug 2007 18:37:33 +0200, Ingo Molnar <mingo@elte.hu> escribió:
> 
> > thousands of applications. So for most file workloads we give 
> > Windows a 20%-30% performance edge, for almost nothing. (for 
> > RAM-starved kernel builds the performance difference between atime 
> > and noatime+nodiratime setups is more on the order of 40%)
> 
> Just curious - do you have numbers with relatime?

nope. Stupid question, i just tried it and got this:

 EXT3-fs: Unrecognized mount option "relatime" or missing value

i've got util-linux-2.13-0.46.fc6 and 2.6.22 on that box, shouldnt that 
be recent enough? As far as i can see it from the kernel-side code, this 
works on the general VFS level and hence should be supported by ext3 
already.

even relatime means one extra write IO after a file has been created, 
but at least for read-mostly files it avoids the continuous atime 
update.

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 16:41           ` Andrew Morton
@ 2007-08-04 17:26             ` Nikita Danilov
  2007-08-04 19:16             ` Florian Weimer
  1 sibling, 0 replies; 188+ messages in thread
From: Nikita Danilov @ 2007-08-04 17:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-mm,
	linux-kernel, miklos, neilb, dgc, tomoki.sekiyama.qu, nikita,
	trond.myklebust, yingchao.zhou, richard

Andrew Morton writes:

[...]

 > 
 > It's pretty much unfixable given the ext3 journalling design, and the
 > guarantees which data-ordered provides.

ZFS has intent log to handle this
(http://blogs.sun.com/realneel/entry/the_zfs_intent_log). Something like
that can --theoretically-- be added to ext3-style journalling.

Nikita.

 > 
 > The easy preventive is to mount with data=writeback.  Maybe that should
 > have been the default.


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 17:17               ` Ingo Molnar
@ 2007-08-04 17:38                 ` Diego Calleja
  2007-08-04 17:51                   ` Diego Calleja
  2007-08-08 10:43                 ` Karel Zak
  1 sibling, 1 reply; 188+ messages in thread
From: Diego Calleja @ 2007-08-04 17:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, linux-mm, linux-kernel, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard

El Sat, 4 Aug 2007 19:17:24 +0200, Ingo Molnar <mingo@elte.hu> escribió:

> i've got util-linux-2.13-0.46.fc6 and 2.6.22 on that box, shouldnt that 
> be recent enough? As far as i can see it from the kernel-side code, this 
> works on the general VFS level and hence should be supported by ext3 
> already.

Mmmh, "mount -o remount,noatime /" seems to Work For Me in Ubuntu
with util-linux/mount "2.12r-17ubuntu"...but then Google says [1] that
Ubuntu has been shipping with relatime enabled as default for months,
so it's probably patched (probably only in the kernel). So maybe upstream
util-linux hasn't merged the relatime patch.

[1]: http://lkml.org/lkml/2007/2/12/30

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 16:37           ` Ingo Molnar
  2007-08-04 16:51             ` Andrew Morton
  2007-08-04 17:02             ` Diego Calleja
@ 2007-08-04 17:39             ` Linus Torvalds
  2007-08-04 18:08               ` Jeff Garzik
  2007-08-05  0:26             ` Andi Kleen
  2007-08-09  6:25             ` Lionel Elie Mamane
  4 siblings, 1 reply; 188+ messages in thread
From: Linus Torvalds @ 2007-08-04 17:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard, david



On Sat, 4 Aug 2007, Ingo Molnar wrote:
> 
> yeah, it's really ugly. But otherwise i've got no real complaint about 
> ext3 - with the obligatory qualification that "noatime,nodiratime" in 
> /etc/fstab is a must.

I agree, we really should do something about atime.

But the fsync thing is a real issue. It literally makes ext3 almost 
unusable from a latency standpoint on many loads. I have a fast disk, and 
don't actually tend to have all that much going on normally, and it still 
hurts occasionally. 

One of the most common (and *best*) reasons for using fsync is for the 
mail spool. So anybody that uses local email will actually be doing a lot 
of fsync, and while you could try to thread the interfaces, I don't think 
a lot of mailers do.

So fsync ends up being a latency issue for something that a lot of people 
actually see, and something that you actually end up working with and you 
notice the latencies very clearly. Your editor auto-save feature is 
another good example of that exact same thing: the fsync actually is there 
for a very good reason, even if you apparently decided that you'd rather 
disable it.

But yeah, "noatime,data=writeback" will quite likely be *quite* noticeable 
(with different effects for different loads), but almost nobody actually 
runs that way.

I ended up using O_NOATIME for the individual object "open()" calls inside 
git, and it was an absolutely huge time-saver for the case of not having 
"noatime" in the mount options. Certainly more than your estimated 10% 
under some loads.

The "relatime" thing that David mentioned might well be very useful, but 
it's probably even less used than "noatime" is. And sadly, I don't really 
see that changing (unless we were to actually change the defaults inside 
the kernel).

			Linus

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 17:38                 ` Diego Calleja
@ 2007-08-04 17:51                   ` Diego Calleja
  0 siblings, 0 replies; 188+ messages in thread
From: Diego Calleja @ 2007-08-04 17:51 UTC (permalink / raw)
  To: Diego Calleja
  Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra, linux-mm,
	linux-kernel, miklos, akpm, neilb, dgc, tomoki.sekiyama.qu,
	nikita, trond.myklebust, yingchao.zhou, richard

El Sat, 4 Aug 2007 19:38:01 +0200, Diego Calleja <diegocg@gmail.com> escribió:

> Mmmh, "mount -o remount,noatime /" seems to Work For Me in Ubuntu
> with util-linux/mount "2.12r-17ubuntu"...but then Google says [1] that
> Ubuntu has been shipping with relatime enabled as default for months,
                                                           ^^^^^

Obviously, i meant "noatime"...(so it's unlikely that ubuntu has patched
anything to support relatime - it's not reflected in the changelogs at least)

> so it's probably patched (probably only in the kernel). So maybe upstream
> util-linux hasn't merged the relatime patch.
> 
> [1]: http://lkml.org/lkml/2007/2/12/30

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 17:39             ` Linus Torvalds
@ 2007-08-04 18:08               ` Jeff Garzik
  2007-08-04 19:12                 ` Jörn Engel
  2007-08-05 10:20                 ` Jakob Oestergaard
  0 siblings, 2 replies; 188+ messages in thread
From: Jeff Garzik @ 2007-08-04 18:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Peter Zijlstra, linux-mm, Linux Kernel Mailing List,
	miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, nikita,
	trond.myklebust, yingchao.zhou, richard, david

Linus Torvalds wrote:
> The "relatime" thing that David mentioned might well be very useful, but 
> it's probably even less used than "noatime" is. And sadly, I don't really 
> see that changing (unless we were to actually change the defaults inside 
> the kernel).


I actually vote for that.  IMO, distros should turn -on- atime updates 
when they know its needed.

	Jeff



^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 18:08               ` Jeff Garzik
@ 2007-08-04 19:12                 ` Jörn Engel
  2007-08-04 19:21                   ` Ingo Molnar
  2007-08-05 10:20                 ` Jakob Oestergaard
  1 sibling, 1 reply; 188+ messages in thread
From: Jörn Engel @ 2007-08-04 19:12 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

On Sat, 4 August 2007 14:08:40 -0400, Jeff Garzik wrote:
> Linus Torvalds wrote:
> >The "relatime" thing that David mentioned might well be very useful, but 
> >it's probably even less used than "noatime" is. And sadly, I don't really 
> >see that changing (unless we were to actually change the defaults inside 
> >the kernel).
> 
> I actually vote for that.  IMO, distros should turn -on- atime updates 
> when they know its needed.

If you mean "relatime" I concur.  "noatime" hurts mutt and others while
"relatime" has no known problems, afaics.

Jörn

-- 
Joern's library part 5:
http://www.faqs.org/faqs/compression-faq/part2/section-9.html

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 16:41           ` Andrew Morton
  2007-08-04 17:26             ` Nikita Danilov
@ 2007-08-04 19:16             ` Florian Weimer
  2007-08-05  6:00               ` Andrew Morton
                                 ` (2 more replies)
  1 sibling, 3 replies; 188+ messages in thread
From: Florian Weimer @ 2007-08-04 19:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-mm,
	linux-kernel, miklos, neilb, dgc, tomoki.sekiyama.qu, nikita,
	trond.myklebust, yingchao.zhou, richard

* Andrew Morton:

> The easy preventive is to mount with data=writeback.  Maybe that should
> have been the default.

The documentation I could find suggests that this may lead to a
security weakness (old data in blocks of a file that was grown just
before the crash leaks to a different user).  XFS overwrites that data
with zeros upon reboot, which tends to irritate users when it happens.

>From this point of view, data=ordered doesn't seem too bad.

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 19:12                 ` Jörn Engel
@ 2007-08-04 19:21                   ` Ingo Molnar
  2007-08-04 19:26                     ` Jörn Engel
  2007-08-04 20:11                     ` [PATCH 00/23] " Alan Cox
  0 siblings, 2 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-04 19:21 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Jeff Garzik, Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david


* Jörn Engel <joern@logfs.org> wrote:

> > I actually vote for that.  IMO, distros should turn -on- atime 
> > updates when they know its needed.
> 
> If you mean "relatime" I concur.  "noatime" hurts mutt and others 
> while "relatime" has no known problems, afaics.

so ... one app can keep 30,000+ apps hostage?

i use Mutt myself, on such a filesystem:

   /dev/md0 on / type ext3 (rw,noatime,nodiratime,user_xattr)

and i can see no problems, it notices new mails just fine.

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 19:21                   ` Ingo Molnar
@ 2007-08-04 19:26                     ` Jörn Engel
  2007-08-04 19:42                       ` Jörn Engel
  2007-08-04 19:47                       ` Linus Torvalds
  2007-08-04 20:11                     ` [PATCH 00/23] " Alan Cox
  1 sibling, 2 replies; 188+ messages in thread
From: Jörn Engel @ 2007-08-04 19:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jörn Engel, Jeff Garzik, Linus Torvalds, Peter Zijlstra,
	linux-mm, Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

On Sat, 4 August 2007 21:21:30 +0200, Ingo Molnar wrote:
> * Jörn Engel <joern@logfs.org> wrote:
> 
> > > I actually vote for that.  IMO, distros should turn -on- atime 
> > > updates when they know its needed.
> > 
> > If you mean "relatime" I concur.  "noatime" hurts mutt and others 
> > while "relatime" has no known problems, afaics.
> 
> so ... one app can keep 30,000+ apps hostage?
> 
> i use Mutt myself, on such a filesystem:
> 
>    /dev/md0 on / type ext3 (rw,noatime,nodiratime,user_xattr)
> 
> and i can see no problems, it notices new mails just fine.

Given the choice between only "atime" and "noatime" I'd agree with you.
Heck, I use it myself.  But "relatime" seems to combine the best of both
worlds.  It currently just suffers from mount not supporting it in any
relevant distro.

Jörn

-- 
Joern's library part 2:
http://www.art.net/~hopkins/Don/unix-haters/tirix/embarrassing-memo.html

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 19:26                     ` Jörn Engel
@ 2007-08-04 19:42                       ` Jörn Engel
  2007-08-05 20:36                         ` Christoph Hellwig
  2007-08-04 19:47                       ` Linus Torvalds
  1 sibling, 1 reply; 188+ messages in thread
From: Jörn Engel @ 2007-08-04 19:42 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Ingo Molnar, Jeff Garzik, Linus Torvalds, Peter Zijlstra,
	linux-mm, Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

On Sat, 4 August 2007 21:26:15 +0200, Jörn Engel wrote:
> 
> Given the choice between only "atime" and "noatime" I'd agree with you.
> Heck, I use it myself.  But "relatime" seems to combine the best of both
> worlds.  It currently just suffers from mount not supporting it in any
> relevant distro.

And here is a completely untested patch to enable it by default.  Ingo,
can you see how good this fares compared to "atime" and
"noatime,nodiratime"?

Jörn

-- 
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it.
-- Brian W. Kernighan

--- linux-2.6.22_relatime/fs/namespace.c~default_relatime	2007-05-16 02:01:39.000000000 +0200
+++ linux-2.6.22_relatime/fs/namespace.c	2007-08-04 21:36:20.000000000 +0200
@@ -1401,6 +1401,10 @@ long do_mount(char *dev_name, char *dir_
 	if (data_page)
 		((char *)data_page)[PAGE_SIZE - 1] = 0;
 
+#ifdef CONFIG_DEFAULT_RELATIME
+	flags |= MS_RELATIME;
+#endif
+
 	/* Separate the per-mountpoint flags */
 	if (flags & MS_NOSUID)
 		mnt_flags |= MNT_NOSUID;
--- linux-2.6.22_relatime/fs/Kconfig~default_relatime	2007-05-16 02:01:38.000000000 +0200
+++ linux-2.6.22_relatime/fs/Kconfig	2007-08-04 21:39:46.000000000 +0200
@@ -6,6 +6,15 @@ menu "File systems"
 
 if BLOCK
 
+config DEFAULT_RELATIME
+	bool "Mount all filesystems with 'relatime' by default"
+	default y
+	help
+	  Relatime only updates atime once after any file has been changed.
+	  Setting this should give a noticeable performance bonus.
+
+	  If unsure, say Y.
+
 config EXT2_FS
 	tristate "Second extended fs support"
 	help

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 19:26                     ` Jörn Engel
  2007-08-04 19:42                       ` Jörn Engel
@ 2007-08-04 19:47                       ` Linus Torvalds
  2007-08-04 19:49                         ` Linus Torvalds
                                           ` (3 more replies)
  1 sibling, 4 replies; 188+ messages in thread
From: Linus Torvalds @ 2007-08-04 19:47 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Ingo Molnar, Jeff Garzik, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david



On Sat, 4 Aug 2007, Jörn Engel wrote:
> 
> Given the choice between only "atime" and "noatime" I'd agree with you.
> Heck, I use it myself.  But "relatime" seems to combine the best of both
> worlds.  It currently just suffers from mount not supporting it in any
> relevant distro.

Well, we could make it the default for the kernel (possibly under a 
"fast-atime" config option), and then people can add "atime" or "noatime" 
as they wish, since mount has supported _those_ options for a long time.

		Linus

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 19:47                       ` Linus Torvalds
@ 2007-08-04 19:49                         ` Linus Torvalds
  2007-08-04 20:00                         ` Ingo Molnar
                                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 188+ messages in thread
From: Linus Torvalds @ 2007-08-04 19:49 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Ingo Molnar, Jeff Garzik, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david



On Sat, 4 Aug 2007, Linus Torvalds wrote:
> 
> Well, we could make it the default for the kernel (possibly under a 
> "fast-atime" config option), and then people can add "atime" or "noatime" 
> as they wish, since mount has supported _those_ options for a long time.

Side note: while I think the fsync() behaviour is more irritating than 
atime, that one is harder to fix. I think it's reasonable to have 
"relatime" as a default strategy for the kernel, but I don't think it's 
necessarily at all as reasonable to change a filesystem-specific ordering 
constraint.

			Linus

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 19:47                       ` Linus Torvalds
  2007-08-04 19:49                         ` Linus Torvalds
@ 2007-08-04 20:00                         ` Ingo Molnar
  2007-08-04 20:11                           ` Ingo Molnar
  2007-08-05  8:18                           ` [patch] add noatime/atime boot options, CONFIG_DEFAULT_NOATIME Ingo Molnar
  2007-08-04 20:13                         ` [PATCH 00/23] per device dirty throttling -v8 Arjan van de Ven
       [not found]                         ` <fa.7rstQpXif2z9y2n2HD+qxLFnueg@ifi.uio.no>
  3 siblings, 2 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-04 20:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jörn Engel, Jeff Garzik, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Well, we could make it the default for the kernel (possibly under a 
> "fast-atime" config option), and then people can add "atime" or 
> "noatime" as they wish, since mount has supported _those_ options for 
> a long time.

the patch below implements this, but there's a problem: we only have 
MNT_NOATIME, we have no MNT_ATIME option AFAICS. So there's no good way 
to detect it when a user _does_ want to have atime :-( Perhaps a boot 
option to turn this off? [sucks a bit but keeps the solution within the 
kernel.]

	Ingo

--------------------------------->
Subject: [patch] add CONFIG_FASTATIME
From: Ingo Molnar <mingo@elte.hu>

add the CONFIG_FASTATIME kernel option, which makes "relatime" the
default for all mounts.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 fs/Kconfig     |   10 ++++++++++
 fs/namespace.c |    4 ++++
 2 files changed, 14 insertions(+)

Index: linux/fs/Kconfig
===================================================================
--- linux.orig/fs/Kconfig
+++ linux/fs/Kconfig
@@ -2060,6 +2060,16 @@ config 9P_FS
 
 endmenu
 
+config FASTATIME
+	bool "Fast atime support by default"
+	default y
+	help
+	  If you say Y here, all your filesystems that do not have
+	  the "noatime" or "atime" mount option specified will get
+	  the "relatime" option by default, which speeds up atime
+	  updates. (atime will only be updated if ctime or mtime
+	  is more recent than atime)
+
 if BLOCK
 menu "Partition Types"
 
Index: linux/fs/namespace.c
===================================================================
--- linux.orig/fs/namespace.c
+++ linux/fs/namespace.c
@@ -1409,6 +1409,10 @@ long do_mount(char *dev_name, char *dir_
 		mnt_flags |= MNT_NODIRATIME;
 	if (flags & MS_RELATIME)
 		mnt_flags |= MNT_RELATIME;
+#ifdef CONFIG_FASTATIME
+	if (!(flags & (MNT_NOATIME | MNT_NODIRATIME)))
+		mnt_flags |= MNT_RELATIME;
+#endif
 
 	flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
 		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME);

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 20:00                         ` Ingo Molnar
@ 2007-08-04 20:11                           ` Ingo Molnar
  2007-08-04 20:13                             ` Arjan van de Ven
  2007-08-05  8:18                           ` [patch] add noatime/atime boot options, CONFIG_DEFAULT_NOATIME Ingo Molnar
  1 sibling, 1 reply; 188+ messages in thread
From: Ingo Molnar @ 2007-08-04 20:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: J?rn Engel, Jeff Garzik, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david


* Ingo Molnar <mingo@elte.hu> wrote:

> +#ifdef CONFIG_FASTATIME
> +	if (!(flags & (MNT_NOATIME | MNT_NODIRATIME)))
> +		mnt_flags |= MNT_RELATIME;
> +#endif

btw., "relatime" does not seem to make much of a difference, if i do 
this:

  ls -l x ; sync

on a "relatime" mounted filesystem ('x' is a regular file), then there's 
disk IO for every such command. Only if i mount it noatime,nodiratime do 
i get zero disk IO. Or my patch is wrong somehow.

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 19:21                   ` Ingo Molnar
  2007-08-04 19:26                     ` Jörn Engel
@ 2007-08-04 20:11                     ` Alan Cox
  2007-08-04 20:28                       ` Jeff Garzik
  2007-08-04 20:28                       ` Ingo Molnar
  1 sibling, 2 replies; 188+ messages in thread
From: Alan Cox @ 2007-08-04 20:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jörn Engel, Jeff Garzik, Linus Torvalds, Peter Zijlstra,
	linux-mm, Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

> i use Mutt myself, on such a filesystem:
> 
>    /dev/md0 on / type ext3 (rw,noatime,nodiratime,user_xattr)
> 
> and i can see no problems, it notices new mails just fine.

In some setups it will and in others it won't. Nor is it the only
application that has this requirement. Ext3 currently is a standards
compliant file system. Turn off atime and its very non standards
compliant, turn to relatime and its not standards compliant but nobody
will break (which is good)

Either change is a big user/kernel interface change and no major vendor
targets desktop as primary market so I'm not suprised they haven't done
this. The fix is to educate them further not to break the kernel.

There are several reasons for that
-	Distros will change the least conservative stuff first so we
	have the dedicated followers of fashion finding problems first
-	Existing systems won't suddenly change behaviour and break
	(and as the catastrophic failure case is backup failure we do
	not want to break them)

People just need to know about the performance differences - very few
realise its more than a fraction of a percent. I'm sure Gentoo will use
relatime the moment anyone knows its > 5% 8)

Alan

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 19:47                       ` Linus Torvalds
  2007-08-04 19:49                         ` Linus Torvalds
  2007-08-04 20:00                         ` Ingo Molnar
@ 2007-08-04 20:13                         ` Arjan van de Ven
  2007-08-04 21:48                           ` Theodore Tso
       [not found]                         ` <fa.7rstQpXif2z9y2n2HD+qxLFnueg@ifi.uio.no>
  3 siblings, 1 reply; 188+ messages in thread
From: Arjan van de Ven @ 2007-08-04 20:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jörn Engel, Ingo Molnar, Jeff Garzik, Peter Zijlstra,
	linux-mm, Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

On Sat, 2007-08-04 at 12:47 -0700, Linus Torvalds wrote:
> 
> On Sat, 4 Aug 2007, Jörn Engel wrote:
> > 
> > Given the choice between only "atime" and "noatime" I'd agree with you.
> > Heck, I use it myself.  But "relatime" seems to combine the best of both
> > worlds.  It currently just suffers from mount not supporting it in any
> > relevant distro.
> 
> Well, we could make it the default for the kernel (possibly under a 
> "fast-atime" config option), and then people can add "atime" or "noatime" 
> as they wish, since mount has supported _those_ options for a long time.


there is another trick possible (more involved though, Al will have to
jump in on that one I suspect): Have 2 types of "dirty inode" states;
one is the current dirty state (meaning the full range of ext3
transactions etc) and "lighter" state of "atime-dirty"; which will not
do the background syncs or journal transactions (so if your machine
crashes, you lose the atime update) but it does keep atime for most
normal cases and keeps it standard compliant "except after a crash".



^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 20:11                           ` Ingo Molnar
@ 2007-08-04 20:13                             ` Arjan van de Ven
  0 siblings, 0 replies; 188+ messages in thread
From: Arjan van de Ven @ 2007-08-04 20:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, J?rn Engel, Jeff Garzik, Peter Zijlstra,
	linux-mm, Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

On Sat, 2007-08-04 at 22:11 +0200, Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > +#ifdef CONFIG_FASTATIME
> > +	if (!(flags & (MNT_NOATIME | MNT_NODIRATIME)))
> > +		mnt_flags |= MNT_RELATIME;
> > +#endif
> 
> btw., "relatime" does not seem to make much of a difference, if i do 
> this:
> 
>   ls -l x ; sync
> 
> on a "relatime" mounted filesystem ('x' is a regular file), then there's 
> disk IO for every such command. Only if i mount it noatime,nodiratime do 
> i get zero disk IO. Or my patch is wrong somehow.

do we have reldiratime ?



^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 16:56               ` Ingo Molnar
@ 2007-08-04 20:23                 ` Alan Cox
  0 siblings, 0 replies; 188+ messages in thread
From: Alan Cox @ 2007-08-04 20:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Linus Torvalds, Peter Zijlstra, linux-mm,
	linux-kernel, miklos, neilb, dgc, tomoki.sekiyama.qu, nikita,
	trond.myklebust, yingchao.zhou, richard

> i tried to convince distro folks about it ... but there's fear, 
> uncertainty and doubt about touching /etc/fstab and i suspect no major 
> distro will do it until another does it - which is a catch-22 :-/ So i 

Thats what Gentoo is for ;)

> guess we should add a kernel config option that allows the kernel rpm 
> maker to just disable atime by default. (re-enableable via boot-line and 
> fstab entry too) [That new kernel config option would be disabled by 
> default.] That makes it much easier to control and introduce.

It makes it much more messy and awkward as the same system behaves in
arbitary different ways under different builds of the kernel.

If you want to sort this in Fedora for example you just need to package
and announce a desktop-tuning rpm which makes the relevant updates on
install and reverses them on remove. Stick the scheduler/vm tuning values
in as well and the disk queue tweaks.

Regardless of the kernel defaults people will install such a package
en-mass...

Alan

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 20:11                     ` [PATCH 00/23] " Alan Cox
@ 2007-08-04 20:28                       ` Jeff Garzik
  2007-08-04 21:47                         ` Alan Cox
  2007-08-07 18:55                         ` Bill Davidsen
  2007-08-04 20:28                       ` Ingo Molnar
  1 sibling, 2 replies; 188+ messages in thread
From: Jeff Garzik @ 2007-08-04 20:28 UTC (permalink / raw)
  To: Alan Cox
  Cc: Ingo Molnar, Jörn Engel, Linus Torvalds, Peter Zijlstra,
	linux-mm, Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

Alan Cox wrote:
> In some setups it will and in others it won't. Nor is it the only
> application that has this requirement. Ext3 currently is a standards
> compliant file system. Turn off atime and its very non standards
> compliant, turn to relatime and its not standards compliant but nobody
> will break (which is good)

Linux has always been a "POSIX unless its stupid" type of system.  For 
the upstream kernel, we should do the right thing -- noatime by default 
-- but allow distros and people that care about rigid compliance to 
easily change the default.


(from another message)
> If you want to sort this in Fedora for example you just need to package
> and announce a desktop-tuning rpm which makes the relevant updates on
> install and reverses them on remove. Stick the scheduler/vm tuning values
> in as well and the disk queue tweaks.
> 
> Regardless of the kernel defaults people will install such a package
> en-mass...

<chuckle>  Sounds like an effective idea :)

Though strictly in the context of atime vs. noatime, servers benefit 
from that too, not just desktop.

	Jeff



^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 20:11                     ` [PATCH 00/23] " Alan Cox
  2007-08-04 20:28                       ` Jeff Garzik
@ 2007-08-04 20:28                       ` Ingo Molnar
  2007-08-04 20:34                         ` Arjan van de Ven
                                           ` (3 more replies)
  1 sibling, 4 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-04 20:28 UTC (permalink / raw)
  To: Alan Cox
  Cc: J??rn Engel, Jeff Garzik, Linus Torvalds, Peter Zijlstra,
	linux-mm, Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david


* Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> Either change is a big user/kernel interface change and no major 
> vendor targets desktop as primary market so I'm not suprised they 
> haven't done this. [...]

earlier in the thread it was claimed that Ubuntu is now defaulting to 
noatime+nodiratime, and has done so for several months. Could be one of 
the reasons why:

   http://www.google.com/trends?q=fedora%2C+ubuntu

> People just need to know about the performance differences - very few 
> realise its more than a fraction of a percent. I'm sure Gentoo will 
> use relatime the moment anyone knows its > 5% 8)

noatime,nodiratime gave 50% of wall-clock kernel rpm build performance 
improvement for Dave Jones, on a beefy box. Unless i misunderstood what 
you meant under 'fraction of a percent' your numbers are _WAY_ off. 
Atime updates are a _huge everyday deal_, from laptops to servers. 
Everywhere on the planet. Give me a Linux desktop anywhere and i can 
tell you whether it has atimes on or off, just by clicking around and 
using apps (without looking at the mount options). That's how i notice 
it that i forgot to turn off atime on any newly installed system - the 
system has weird desktop lags and unnecessary disk trashing.

> [...] Ext3 currently is a standards compliant file system. Turn off 
> atime and its very non standards compliant, turn to relatime and its 
> not standards compliant but nobody will break (which is good)

come on! Any standards testsuite needs tons of tweaks to the system to 
run through to completion. Mounting the filesystem atime will just be 
one more item in the long list of (mostly silly) 'needed for standards 
compliance' items (most of which nobody configures). What matters are 
the apps, and nary any app depends on atime, and those people who depend 
on them can turn on atime just fine. (it's the same as for extended 
attributes for example - and attributes are infinitely _more_ useful 
than atime.)

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 20:28                       ` Ingo Molnar
@ 2007-08-04 20:34                         ` Arjan van de Ven
  2007-08-04 21:03                         ` Ingo Molnar
                                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 188+ messages in thread
From: Arjan van de Ven @ 2007-08-04 20:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan Cox, J??rn Engel, Jeff Garzik, Linus Torvalds,
	Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard, david

> > People just need to know about the performance differences - very few 
> > realise its more than a fraction of a percent. I'm sure Gentoo will 
> > use relatime the moment anyone knows its > 5% 8)
> 
> noatime,nodiratime gave 50% of wall-clock kernel rpm build performance 
> improvement for Dave Jones, on a beefy box. Unless i misunderstood what 
> you meant under 'fraction of a percent' your numbers are _WAY_ off.

it's also a Watt or so of power if you have the AHCI ALPM patches in the
kernel (which are pending mainline inclusion)...



^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 20:28                       ` Ingo Molnar
  2007-08-04 20:34                         ` Arjan van de Ven
@ 2007-08-04 21:03                         ` Ingo Molnar
  2007-08-04 21:51                           ` Alan Cox
  2007-08-04 21:48                         ` Alan Cox
  2007-08-04 22:39                         ` Ilpo Järvinen
  3 siblings, 1 reply; 188+ messages in thread
From: Ingo Molnar @ 2007-08-04 21:03 UTC (permalink / raw)
  To: Alan Cox
  Cc: J??rn Engel, Jeff Garzik, Linus Torvalds, Peter Zijlstra,
	linux-mm, Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david


* Ingo Molnar <mingo@elte.hu> wrote:

> noatime,nodiratime gave 50% of wall-clock kernel rpm build performance 
> improvement for Dave Jones, on a beefy box. Unless i misunderstood 
> what you meant under 'fraction of a percent' your numbers are _WAY_ 
> off. Atime updates are a _huge everyday deal_, from laptops to 
> servers. Everywhere on the planet. Give me a Linux desktop anywhere 
> and i can tell you whether it has atimes on or off, just by clicking 
> around and using apps (without looking at the mount options). That's 
> how i notice it that i forgot to turn off atime on any newly installed 
> system - the system has weird desktop lags and unnecessary disk 
> trashing.

i cannot over-emphasise how much of a deal it is in practice. Atime 
updates are by far the biggest IO performance deficiency that Linux has 
today. Getting rid of atime updates would give us more everyday Linux 
performance than all the pagecache speedups of the past 10 years, 
_combined_.

it's also perhaps the most stupid Unix design idea of all times. Unix is 
really nice and well done, but think about this a bit:

   ' For every file that is read from the disk, lets do a ... write to
     the disk! And, for every file that is already cached and which we
     read from the cache ... do a write to the disk! '

tell that concept to any rookie programmer who knows nothing about 
kernels and the answer will be: 'huh, what? That's gross!'. And Linux 
does this unconditionally for everything, and no, it's not only done on 
some high-security servers that need all sorts of auditing enabled that 
logs every file read - no, it's done by 99% of the Linux desktops and 
servers. For the sake of some lazy mailers that could now be using 
inotify, and for the sake of ... nothing much, really - forensics 
software perhaps.

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 20:28                       ` Jeff Garzik
@ 2007-08-04 21:47                         ` Alan Cox
  2007-08-04 23:51                           ` Claudio Martins
  2007-08-05  7:18                           ` Ingo Molnar
  2007-08-07 18:55                         ` Bill Davidsen
  1 sibling, 2 replies; 188+ messages in thread
From: Alan Cox @ 2007-08-04 21:47 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Ingo Molnar, Jörn Engel, Linus Torvalds, Peter Zijlstra,
	linux-mm, Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

> Linux has always been a "POSIX unless its stupid" type of system.  For 
> the upstream kernel, we should do the right thing -- noatime by default 
> -- but allow distros and people that care about rigid compliance to 
> easily change the default.

Linux has never been a "suprise your kernel interfaces all just changed
today" kernel, nor a "gosh you upgraded and didn't notice your backups
broke" kernel.


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 20:13                         ` [PATCH 00/23] per device dirty throttling -v8 Arjan van de Ven
@ 2007-08-04 21:48                           ` Theodore Tso
  2007-08-05 18:01                             ` Arjan van de Ven
  0 siblings, 1 reply; 188+ messages in thread
From: Theodore Tso @ 2007-08-04 21:48 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linus Torvalds, Jörn Engel, Ingo Molnar, Jeff Garzik,
	Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard, david

On Sat, Aug 04, 2007 at 01:13:19PM -0700, Arjan van de Ven wrote:
> there is another trick possible (more involved though, Al will have to
> jump in on that one I suspect): Have 2 types of "dirty inode" states;
> one is the current dirty state (meaning the full range of ext3
> transactions etc) and "lighter" state of "atime-dirty"; which will not
> do the background syncs or journal transactions (so if your machine
> crashes, you lose the atime update) but it does keep atime for most
> normal cases and keeps it standard compliant "except after a crash".

That would make us standards compliant (POSIX explicitly says that
what happens after a unclean shutdown is Unspecified) and it would
make things a heck of a lot faster.  However, there is a potential
problem which is that it will keep a large number of inodes pinned in
memory, which is its own problem.  So there would have to be some way
to force the atime updates to be merged when under memory pressure,
and and perhaps on some much longer background interval (i.e., every
hour or so).

							- Ted

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 20:28                       ` Ingo Molnar
  2007-08-04 20:34                         ` Arjan van de Ven
  2007-08-04 21:03                         ` Ingo Molnar
@ 2007-08-04 21:48                         ` Alan Cox
  2007-08-05  7:13                           ` Ingo Molnar
  2007-08-04 22:39                         ` Ilpo Järvinen
  3 siblings, 1 reply; 188+ messages in thread
From: Alan Cox @ 2007-08-04 21:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: J??rn Engel, Jeff Garzik, Linus Torvalds, Peter Zijlstra,
	linux-mm, Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

> > People just need to know about the performance differences - very few 
> > realise its more than a fraction of a percent. I'm sure Gentoo will 
> > use relatime the moment anyone knows its > 5% 8)
> 
> noatime,nodiratime gave 50% of wall-clock kernel rpm build performance 
> improvement for Dave Jones, on a beefy box. Unless i misunderstood what 
> you meant under 'fraction of a percent' your numbers are _WAY_ off.

What numbers - I didn't quote any performance numbers ?
 

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 21:03                         ` Ingo Molnar
@ 2007-08-04 21:51                           ` Alan Cox
  2007-08-05  7:21                             ` Ingo Molnar
                                               ` (2 more replies)
  0 siblings, 3 replies; 188+ messages in thread
From: Alan Cox @ 2007-08-04 21:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: J??rn Engel, Jeff Garzik, Linus Torvalds, Peter Zijlstra,
	linux-mm, Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

> i cannot over-emphasise how much of a deal it is in practice. Atime 
> updates are by far the biggest IO performance deficiency that Linux has 
> today. Getting rid of atime updates would give us more everyday Linux 
> performance than all the pagecache speedups of the past 10 years, 
> _combined_.
> 
> it's also perhaps the most stupid Unix design idea of all times. Unix is 
> really nice and well done, but think about this a bit:

Think about the user for a moment instead. 

Do things right. The job of the kernel is not to "correct" for
distribution policy decisions. The distributions need to change policy.
You do that by showing the distributions the numbers. 

With a Red Hat on if we can move from /dev/hda to /dev/sda in FC7 then we
can move from atime to noatime by default on FC8 with appropriate release
note warnings and having a couple of betas to find out what other than
mutt goes boom.

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 20:28                       ` Ingo Molnar
                                           ` (2 preceding siblings ...)
  2007-08-04 21:48                         ` Alan Cox
@ 2007-08-04 22:39                         ` Ilpo Järvinen
  3 siblings, 0 replies; 188+ messages in thread
From: Ilpo Järvinen @ 2007-08-04 22:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan Cox, J??rn Engel, Jeff Garzik, Linus Torvalds,
	Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard, david

On Sat, 4 Aug 2007, Ingo Molnar wrote:

> * Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> 
> > People just need to know about the performance differences - very few 
> > realise its more than a fraction of a percent. I'm sure Gentoo will 
> > use relatime the moment anyone knows its > 5% 8)
> 
> noatime,nodiratime gave 50% of wall-clock kernel rpm build performance 
> improvement for Dave Jones, on a beefy box. Unless i misunderstood what 
> you meant under 'fraction of a percent' your numbers are _WAY_ off. 
> Atime updates are a _huge everyday deal_, from laptops to servers. 
> Everywhere on the planet. Give me a Linux desktop anywhere and i can 
> tell you whether it has atimes on or off, just by clicking around and 
> using apps (without looking at the mount options). That's how i notice 
> it that i forgot to turn off atime on any newly installed system - the 
> system has weird desktop lags and unnecessary disk trashing.

...For me, I would say 50% is not enough to describe the _visible_ 
benefits... Not talking any specific number but past 10sec-1min+ lagging
in X is history, it's gone and I really don't miss it that much... :-) 
Cannot reproduce even a second long delay anymore in window focusing under 
considerable load as it's basically instantaneous (I can see that it's 
loaded but doesn't affect the feeling of responsiveness I'm now getting), 
even on some loads that I couldn't previously even dream of... I still
can get drawing lag a bit by pushing enough stuff to swap but still it's 
definately quite well under control, though rare 1-2 sec spikes in drawing 
appear due to swap loads I think. ...And this is 2.6.21.5 so no fancies 
ala Ingo's CFS or so yet...

...Thanks about this hint. :-)

-- 
 i.

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 21:47                         ` Alan Cox
@ 2007-08-04 23:51                           ` Claudio Martins
  2007-08-05  0:49                             ` Alan Cox
  2007-08-07 21:20                             ` Bill Davidsen
  2007-08-05  7:18                           ` Ingo Molnar
  1 sibling, 2 replies; 188+ messages in thread
From: Claudio Martins @ 2007-08-04 23:51 UTC (permalink / raw)
  To: Alan Cox
  Cc: Jeff Garzik, Ingo Molnar, Jörn Engel, Linus Torvalds,
	Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard, david

On Saturday 04 August 2007, Alan Cox wrote:
>
> Linux has never been a "suprise your kernel interfaces all just changed
> today" kernel, nor a "gosh you upgraded and didn't notice your backups
> broke" kernel.
>

 Can you give examples of backup solutions that rely on atime being updated?
I can understand backup tools using mtime/ctime for incremental backups (like 
tar + Amanda, etc), but I'm having trouble figuring out why someone would 
want to use atime for that.

 Best regards

Claudio


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 16:37           ` Ingo Molnar
                               ` (2 preceding siblings ...)
  2007-08-04 17:39             ` Linus Torvalds
@ 2007-08-05  0:26             ` Andi Kleen
  2007-08-05 15:00               ` Theodore Tso
                                 ` (2 more replies)
  2007-08-09  6:25             ` Lionel Elie Mamane
  4 siblings, 3 replies; 188+ messages in thread
From: Andi Kleen @ 2007-08-05  0:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, linux-mm, linux-kernel, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard

Ingo Molnar <mingo@elte.hu> writes:
> 
> yeah, it's really ugly. But otherwise i've got no real complaint about 
> ext3 - with the obligatory qualification that "noatime,nodiratime" in 
> /etc/fstab is a must. This speeds up things very visibly - especially 
> when lots of files are accessed. It's kind of weird that every Linux 
> desktop and server is hurt by a noticeable IO performance slowdown due 
> to the constant atime updates, while there's just two real users of it: 
> tmpwatch [which can be configured to use ctime so it's not a big issue] 
> and some backup tools. (Ok, and mail-notify too i guess.) Out of tens of 
> thousands of applications. So for most file workloads we give Windows a 
> 20%-30% performance edge, for almost nothing. (for RAM-starved kernel 
> builds the performance difference between atime and noatime+nodiratime 
> setups is more on the order of 40%)

I always thought the right solution would be to just sync atime only
very very lazily. This means if a inode is only dirty because of an
atime update put it on a "only write out when there is nothing to do
or the memory is really needed" list.

-Andi

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 10:33       ` Ingo Molnar
  2007-08-04 16:17         ` Linus Torvalds
@ 2007-08-05  0:28         ` Andi Kleen
  1 sibling, 0 replies; 188+ messages in thread
From: Andi Kleen @ 2007-08-05  0:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, linux-mm, linux-kernel, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard

Ingo Molnar <mingo@elte.hu> writes:

> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > [ my personal interest in this is the following regression: every time 
> >   i start a large kernel build with DEBUG_INFO on a quad-core 4GB RAM 
> >   box, i get up to 30 seconds complete pauses in Vim (and most other 
> >   tasks), during plain editing of the source code. (which happens when 
> >   Vim tries to write() to its swap/undo-file.) ]
> 
> hm, it turns out that it's due to vim doing an occasional fsync not only 
> on writeout, but during normal use too. "set nofsync" in the .vimrc 
> solves this problem.

It should probably be doing fdatasync() instead. Then ext3 could just
write the data blocks only, but only mess with the logs when the file 
size changed and mtime would be written out somewhat later.

[unless you have data logging enabled]

Does the problem go away when you change it to that? 

-Andi


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 23:51                           ` Claudio Martins
@ 2007-08-05  0:49                             ` Alan Cox
  2007-08-05  7:28                               ` Ingo Molnar
  2007-08-05 14:46                               ` Theodore Tso
  2007-08-07 21:20                             ` Bill Davidsen
  1 sibling, 2 replies; 188+ messages in thread
From: Alan Cox @ 2007-08-05  0:49 UTC (permalink / raw)
  To: Claudio Martins
  Cc: Jeff Garzik, Ingo Molnar, Jörn Engel, Linus Torvalds,
	Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard, david

>  Can you give examples of backup solutions that rely on atime being updated?
> I can understand backup tools using mtime/ctime for incremental backups (like 
> tar + Amanda, etc), but I'm having trouble figuring out why someone would 
> want to use atime for that.

HSM is the usual one, and to a large extent probably why Unix originally
had atime. Basically migrating less used files away so as to keep the
system disks tidy.

Its not something usally found on desktop boxes so it doesn't in anyway
argue against the distribution using noatime or relative atime, but on
big server boxes it matters


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 19:16             ` Florian Weimer
@ 2007-08-05  6:00               ` Andrew Morton
  2007-08-05  7:57                 ` Florian Weimer
  2007-08-05 22:46               ` Theodore Tso
  2007-08-06  0:24               ` David Chinner
  2 siblings, 1 reply; 188+ messages in thread
From: Andrew Morton @ 2007-08-05  6:00 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-mm,
	linux-kernel, miklos, neilb, dgc, tomoki.sekiyama.qu, nikita,
	trond.myklebust, yingchao.zhou, richard

On Sat, 04 Aug 2007 21:16:35 +0200 Florian Weimer <fw@deneb.enyo.de> wrote:

> * Andrew Morton:
> 
> > The easy preventive is to mount with data=writeback.  Maybe that should
> > have been the default.
> 
> The documentation I could find suggests that this may lead to a
> security weakness (old data in blocks of a file that was grown just
> before the crash leaks to a different user).

yup.  This problem also exists in ext2, reiserfs (unless using
ordered-mode), JFS, others.

>  XFS overwrites that data
> with zeros upon reboot, which tends to irritate users when it happens.

yup.

> >From this point of view, data=ordered doesn't seem too bad.

If your computer is used by multiple users who don't trust each other,
sure.  That covers, what?  About 2% of machines?

I was using data=writeback for a while on my most-thrashed disk.  The
results were a bit disappointing - not much difference.  ext2 is a lot
quicker.

(I don't use anything which is fsync-happy, btw).  (I used to have a patch
which sysctl-tunably turned fsync, msync, fdatasync into "return 0" for use
on the laptop but I seem to have lost it)




^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 21:48                         ` Alan Cox
@ 2007-08-05  7:13                           ` Ingo Molnar
  2007-08-05 13:22                             ` Diego Calleja
  2007-08-09  0:57                             ` Greg Trounson
  0 siblings, 2 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-05  7:13 UTC (permalink / raw)
  To: Alan Cox
  Cc: J??rn Engel, Jeff Garzik, Linus Torvalds, Peter Zijlstra,
	linux-mm, Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david


* Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> > > People just need to know about the performance differences - very 
> > > few realise its more than a fraction of a percent. I'm sure Gentoo 
> > > will use relatime the moment anyone knows its > 5% 8)
> > 
> > noatime,nodiratime gave 50% of wall-clock kernel rpm build 
> > performance improvement for Dave Jones, on a beefy box. Unless i 
> > misunderstood what you meant under 'fraction of a percent' your 
> > numbers are _WAY_ off.
> 
> What numbers - I didn't quote any performance numbers ?

ok, i misunderstood your "very few realise its more than a fraction of a 
percent" sentence, i thought you were saying it's a fraction of a 
percent.

Measurements show that noatime helps 20-30% on regular desktop 
workloads, easily 50% for kernel builds and much more than that (in 
excess of 100%) for file-read-intense workloads. We cannot just walk 
past such a _huge_ performance impact so easily without even reacting to 
the performance arguments, and i'm happy Ubuntu picked up 
noatime,nodiratime and is whipping up the floor with Fedora on the 
desktop.

just look at the spontaneous feedback this thread prompted:

| ...For me, I would say 50% is not enough to describe the _visible_ 
| benefits... Not talking any specific number but past 10sec-1min+ 
| lagging in X is history, it's gone and I really don't miss it that 
| much... :-) Cannot reproduce even a second long delay anymore in 
| window focusing under considerable load as it's basically 
| instantaneous (I can see that it's loaded but doesn't affect the 
| feeling of responsiveness I'm now getting), even on some loads that I 
| couldn't previously even dream of... I still can get drawing lag a bit 
| by pushing enough stuff to swap but still it's definately quite well 
| under control, though rare 1-2 sec spikes in drawing appear due to 
| swap loads I think. ...And this is 2.6.21.5 so no fancies ala Ingo's 
| CFS or so yet...
|
| ...Thanks about this hint. :-)

much of the hard performance work we put into the kernel and into 
userspace is basically masked by the atime stupidity. How many man-years 
did it take to implement prelink? It has less of an impact than noatime! 
How much effort did we put into smart readahead and bootup 
optimizations? It has less of an impact than noatime.

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 21:47                         ` Alan Cox
  2007-08-04 23:51                           ` Claudio Martins
@ 2007-08-05  7:18                           ` Ingo Molnar
  1 sibling, 0 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-05  7:18 UTC (permalink / raw)
  To: Alan Cox
  Cc: Jeff Garzik, Jörn Engel, Linus Torvalds, Peter Zijlstra,
	linux-mm, Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david


* Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> > Linux has always been a "POSIX unless its stupid" type of system.  
> > For the upstream kernel, we should do the right thing -- noatime by 
> > default -- but allow distros and people that care about rigid 
> > compliance to easily change the default.
> 
> Linux has never been a "suprise your kernel interfaces all just 
> changed today" kernel, nor a "gosh you upgraded and didn't notice your 
> backups broke" kernel.

HSM uses atime as a _hint_. The only even remotely valid argument is 
Mutt, and even that one could easily be fixed _it is not even installed 
by default on most distros_ and nobody but me uses it ;) [and i've been 
using Mutt on noatime filesystems for years] So basically a single type 
of package and use-case (against tens of thousands of packages) held all 
of Linux desktop IO performance hostage for 10 years, to the tune of a 
20-30-50-100% performance degradation (depending on the workload)? Wow. 

And the atime situation is _so_ obvious, what will we do in the much 
less obvious cases?

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 21:51                           ` Alan Cox
@ 2007-08-05  7:21                             ` Ingo Molnar
  2007-08-05  7:29                               ` Andrew Morton
                                                 ` (3 more replies)
  2007-08-05  7:37                             ` Ingo Molnar
  2007-08-07 19:09                             ` Bill Davidsen
  2 siblings, 4 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-05  7:21 UTC (permalink / raw)
  To: Alan Cox
  Cc: J??rn Engel, Jeff Garzik, Linus Torvalds, Peter Zijlstra,
	linux-mm, Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david


* Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> With a Red Hat on if we can move from /dev/hda to /dev/sda in FC7 then 
> we can move from atime to noatime by default on FC8 with appropriate 
> release note warnings and having a couple of betas to find out what 
> other than mutt goes boom.

btw., Mutt does not go boom, i use it myself. It works just fine and 
notices new mails even on a noatime,nodiratime filesystem.

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05  0:49                             ` Alan Cox
@ 2007-08-05  7:28                               ` Ingo Molnar
  2007-08-05 10:29                                 ` Jakob Oestergaard
  2007-08-05 12:46                                 ` Alan Cox
  2007-08-05 14:46                               ` Theodore Tso
  1 sibling, 2 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-05  7:28 UTC (permalink / raw)
  To: Alan Cox
  Cc: Claudio Martins, Jeff Garzik, Jörn Engel, Linus Torvalds,
	Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard, david


* Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> >  Can you give examples of backup solutions that rely on atime being 
> > updated? I can understand backup tools using mtime/ctime for 
> > incremental backups (like tar + Amanda, etc), but I'm having trouble 
> > figuring out why someone would want to use atime for that.
> 
> HSM is the usual one, and to a large extent probably why Unix 
> originally had atime. Basically migrating less used files away so as 
> to keep the system disks tidy.

atime is used as a _hint_, at most and HSM sure works just fine on an 
atime-incapable filesystem too. So it's the same deal as "add user_xattr 
mount option to the filesystem to make Beagle index faster". It's now: 
"if you use HSM storage add the atime mount option to make it slightly 
more intelligent. Expect huge IO slowdowns though."

The only remotely valid compatibility argument would be Mutt - but even 
that handles it just fine. (we broke way more software via noexec)

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05  7:21                             ` Ingo Molnar
@ 2007-08-05  7:29                               ` Andrew Morton
  2007-08-05  7:39                                 ` Ingo Molnar
  2007-08-05  8:53                               ` Willy Tarreau
                                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 188+ messages in thread
From: Andrew Morton @ 2007-08-05  7:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan Cox, J??rn Engel, Jeff Garzik, Linus Torvalds,
	Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard, david

On Sun, 5 Aug 2007 09:21:41 +0200 Ingo Molnar <mingo@elte.hu> wrote:

> even on a noatime,nodiratime filesystem

noatime is a superset of nodiratime, btw.

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 21:51                           ` Alan Cox
  2007-08-05  7:21                             ` Ingo Molnar
@ 2007-08-05  7:37                             ` Ingo Molnar
  2007-08-05  9:04                               ` Jeff Garzik
  2007-08-05 12:43                               ` Alan Cox
  2007-08-07 19:09                             ` Bill Davidsen
  2 siblings, 2 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-05  7:37 UTC (permalink / raw)
  To: Alan Cox
  Cc: J??rn Engel, Jeff Garzik, Linus Torvalds, Peter Zijlstra,
	linux-mm, Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david


* Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> > it's also perhaps the most stupid Unix design idea of all times. 
> > Unix is really nice and well done, but think about this a bit:
> 
> Think about the user for a moment instead.
> 
> Do things right. The job of the kernel is not to "correct" for 
> distribution policy decisions. The distributions need to change 
> policy. You do that by showing the distributions the numbers.

you try to put the blame into distribution makers' shoes but in reality, 
had the kernel stepped forward with a neat .config option sooner 
(combined with a neat boot option as well to turn it off), we'd have had 
noatime systems 10 years ago. A new entry into relnotes and done. It's 
_much less_ of a compatibility impact than many of the changes that 
happen in a new distro release. (new glibc, new compiler, new kernel) 

Distro makers did not dare to do this sooner because some kernel 
developers came forward with these mostly bogus arguments ... The impact 
of atime is far better understood by the kernel community, so it is the 
responsibility of _us_ to signal such things towards distributors, not 
the other way around.

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05  7:29                               ` Andrew Morton
@ 2007-08-05  7:39                                 ` Ingo Molnar
  0 siblings, 0 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-05  7:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alan Cox, J??rn Engel, Jeff Garzik, Linus Torvalds,
	Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard, david


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Sun, 5 Aug 2007 09:21:41 +0200 Ingo Molnar <mingo@elte.hu> wrote:
> 
> > even on a noatime,nodiratime filesystem
> 
> noatime is a superset of nodiratime, btw.

heh, indeed. I've been using this trick for 10 years on my desktops so 
it's an ancient thinko :)

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05  6:00               ` Andrew Morton
@ 2007-08-05  7:57                 ` Florian Weimer
  2007-08-05 20:43                   ` Christoph Hellwig
  0 siblings, 1 reply; 188+ messages in thread
From: Florian Weimer @ 2007-08-05  7:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-mm,
	linux-kernel, miklos, neilb, dgc, tomoki.sekiyama.qu, nikita,
	trond.myklebust, yingchao.zhou, richard

* Andrew Morton:

>>  XFS overwrites that data with zeros upon reboot, which tends to
>> irritate users when it happens.
>
> yup.
>
>> >From this point of view, data=ordered doesn't seem too bad.
>
> If your computer is used by multiple users who don't trust each other,
> sure.  That covers, what?  About 2% of machines?

I wasn't concerned so much with security, but with user experience.
For instance, some editors don't perform fsync-then-rename, but simply
truncate the file when saving (because they want to preserve hard
links).  With XFS, this tends to cause null bytes on crashes.  Since
ext3 has got a much larger install base, this would result in lots of
bug reports, I fear.

Without zeroing, the truncating editor might garble the file in a more
obvious way, but you've got the security issue (and I agree that this
is more of a PR issue).

^ permalink raw reply	[flat|nested] 188+ messages in thread

* [patch] add noatime/atime boot options, CONFIG_DEFAULT_NOATIME
  2007-08-04 20:00                         ` Ingo Molnar
  2007-08-04 20:11                           ` Ingo Molnar
@ 2007-08-05  8:18                           ` Ingo Molnar
  1 sibling, 0 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-05  8:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: J?rn Engel, Jeff Garzik, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david


here's an updated patch that implements a full spectrum of config, boot 
and sysctl parameters to make it easy for users and distros to make 
noatime the default. Tested on ext3, with and without atime.

for compatibility reasons the config option defaults to disabled, so 
this patch has no impact by default. If CONFIG_DEFAULT_NOATIME is 
enabled for a kernel then all filesystems will be noatime mounted. The 
boot and sysctl options are available unconditionally.

	Ingo

---------------------------->
Subject: [patch] add noatime/atime boot options, CONFIG_DEFAULT_NOATIME
From: Ingo Molnar <mingo@elte.hu>

add the "noatime" (and "atime") boot options to enable/disable atime
updates for all filesystems.

also add the CONFIG_DEFAULT_NOATIME kernel option (disabled by default
for compatibility reasons), which makes "noatime" the default for all
mounts without an extra kernel boot option.

also add the /proc/sys/kernel/mount_with_atime flag which can be changed
runtime to modify the behavior of subsequent new mounts.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 Documentation/kernel-parameters.txt |   12 +++++++
 fs/Kconfig                          |   21 +++++++++++++
 fs/namespace.c                      |   56 ++++++++++++++++++++++++++++++++++++
 include/linux/mount.h               |    2 +
 kernel/sysctl.c                     |    9 +++++
 5 files changed, 100 insertions(+)

Index: linux/Documentation/kernel-parameters.txt
===================================================================
--- linux.orig/Documentation/kernel-parameters.txt
+++ linux/Documentation/kernel-parameters.txt
@@ -303,6 +303,12 @@ and is between 256 and 4096 characters. 
 
 	atascsi=	[HW,SCSI] Atari SCSI
 
+	atime           [FS] default to enabled atime updates on all
+			filesystems.
+
+	atime=          [FS] default to enabled/disabled atime updates on all
+			filesystems.
+
 	atkbd.extra=	[HW] Enable extra LEDs and keys on IBM RapidAccess,
 			EzKey and similar keyboards
 
@@ -1100,6 +1106,12 @@ and is between 256 and 4096 characters. 
 	noasync		[HW,M68K] Disables async and sync negotiation for
 			all devices.
 
+	noatime         [FS] default to disabled atime updates on all
+			filesystems.
+
+	noatime=        [FS] default to disabled/enabled atime updates on all
+			filesystems.
+
 	nobats		[PPC] Do not use BATs for mapping kernel lowmem
 			on "Classic" PPC cores.
 
Index: linux/fs/Kconfig
===================================================================
--- linux.orig/fs/Kconfig
+++ linux/fs/Kconfig
@@ -2060,6 +2060,27 @@ config 9P_FS
 
 endmenu
 
+config DEFAULT_NOATIME
+	bool "Mount all filesystems with noatime by default"
+	help
+	  If you say Y here, all your filesystems will be mounted
+	  with the "noatime" mount option. This eliminates atime
+	  ('file last accessed' timestamp) updates (which otherwise
+	  is performed on every file access and generates a write
+	  IO to the inode) and thus speeds up IO.
+
+	  The mtime ('file last modified') and ctime ('file created')
+	  timestamp are unaffected by this change.
+
+	  Note: the overwhelming majority of applications make no
+	  use of atime. Known exceptions: the Mutt mail client can
+	  depend on it (for new mail notification) on multi-user
+	  machines and some HSM backup tools might also work better
+	  in the presence of atime.
+
+	  Use the "atime" kernel boot option to turn off this
+	  feature.
+
 if BLOCK
 menu "Partition Types"
 
Index: linux/fs/namespace.c
===================================================================
--- linux.orig/fs/namespace.c
+++ linux/fs/namespace.c
@@ -1362,6 +1362,60 @@ int copy_mount_options(const void __user
 }
 
 /*
+ * Allow users to disable (or enable) atime updates via a .config
+ * option or via the boot line, or via /proc/sys/fs/mount_with_atime:
+ */
+int mount_with_atime __read_mostly =
+#ifdef CONFIG_DEFAULT_NOATIME
+0
+#else
+1
+#endif
+;
+
+/*
+ * The "noatime=", "atime=", "noatime" and "atime" boot parameters:
+ */
+static int toggle_atime_updates(int val)
+{
+	mount_with_atime = val;
+
+	printk("Atime updates are: %s\n", val ? "on" : "off");
+
+	return 1;
+}
+
+static int __init set_atime_setup(char *str)
+{
+	int val;
+
+	get_option(&str, &val);
+	return toggle_atime_updates(val);
+}
+__setup("atime=", set_atime_setup);
+
+static int __init set_noatime_setup(char *str)
+{
+	int val;
+
+	get_option(&str, &val);
+	return toggle_atime_updates(!val);
+}
+__setup("noatime=", set_noatime_setup);
+
+static int __init set_atime(char *str)
+{
+	return toggle_atime_updates(1);
+}
+__setup("atime", set_atime);
+
+static int __init set_noatime(char *str)
+{
+	return toggle_atime_updates(0);
+}
+__setup("noatime", set_noatime);
+
+/*
  * Flags is a 32-bit value that allows up to 31 non-fs dependent flags to
  * be given to the mount() call (ie: read-only, no-dev, no-suid etc).
  *
@@ -1409,6 +1463,8 @@ long do_mount(char *dev_name, char *dir_
 		mnt_flags |= MNT_NODIRATIME;
 	if (flags & MS_RELATIME)
 		mnt_flags |= MNT_RELATIME;
+	if (!mount_with_atime && !(flags & (MNT_NOATIME | MNT_NODIRATIME)))
+		mnt_flags |= MNT_NOATIME;
 
 	flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
 		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME);
Index: linux/include/linux/mount.h
===================================================================
--- linux.orig/include/linux/mount.h
+++ linux/include/linux/mount.h
@@ -103,5 +103,7 @@ extern void shrink_submounts(struct vfsm
 extern spinlock_t vfsmount_lock;
 extern dev_t name_to_dev_t(char *name);
 
+extern int mount_with_atime;
+
 #endif
 #endif /* _LINUX_MOUNT_H */
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -30,6 +30,7 @@
 #include <linux/capability.h>
 #include <linux/smp_lock.h>
 #include <linux/fs.h>
+#include <linux/mount.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
 #include <linux/kobject.h>
@@ -1206,6 +1207,14 @@ static ctl_table fs_table[] = {
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
 	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "mount_with_atime",
+		.data		= &mount_with_atime,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 #if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
 	{
 		.ctl_name	= CTL_UNNUMBERED,

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05  7:21                             ` Ingo Molnar
  2007-08-05  7:29                               ` Andrew Morton
@ 2007-08-05  8:53                               ` Willy Tarreau
  2007-08-05 14:17                                 ` Jörn Engel
  2007-08-05 12:47                               ` Alan Cox
  2007-08-05 18:44                               ` Dave Jones
  3 siblings, 1 reply; 188+ messages in thread
From: Willy Tarreau @ 2007-08-05  8:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan Cox, J??rn Engel, Jeff Garzik, Linus Torvalds,
	Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard, david

On Sun, Aug 05, 2007 at 09:21:41AM +0200, Ingo Molnar wrote:
> 
> * Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> 
> > With a Red Hat on if we can move from /dev/hda to /dev/sda in FC7 then 
> > we can move from atime to noatime by default on FC8 with appropriate 
> > release note warnings and having a couple of betas to find out what 
> > other than mutt goes boom.
> 
> btw., Mutt does not go boom, i use it myself. It works just fine and 
> notices new mails even on a noatime,nodiratime filesystem.

IIRC, atime is used by mailers and by the shell to detect that new
mail has arrived and report it only once if there are several intances
watching the same mbox.

I too use mutt and noatime,nodiratime everywhere (same 10 year-old
thinko), and the only side effect is that when I have a new mail,
it is reported in all of my xterms until I read it, clearly something
I can live with (and sometimes it's even desirable).

In fact, mutt is pretty good at this. It updates atime and ctime itself
as soon as it opens the mbox, so the shell is happy and only reports
"you have mail" afterwards.

Well, I hope we're not getting too much off-topic here...

Willy


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05  7:37                             ` Ingo Molnar
@ 2007-08-05  9:04                               ` Jeff Garzik
  2007-08-05 12:43                               ` Alan Cox
  1 sibling, 0 replies; 188+ messages in thread
From: Jeff Garzik @ 2007-08-05  9:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan Cox, J??rn Engel, Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

Ingo Molnar wrote:
> Distro makers did not dare to do this sooner because some kernel 
> developers came forward with these mostly bogus arguments ... The impact 
> of atime is far better understood by the kernel community, so it is the 
> responsibility of _us_ to signal such things towards distributors, not 
> the other way around.


Pretty much.

AFAICS there was never a "policy decision" on the part of distro makers 
to begin with.  The kernel had its default -- atime -- and the distros 
ran with that.

	Jeff



^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 18:08               ` Jeff Garzik
  2007-08-04 19:12                 ` Jörn Engel
@ 2007-08-05 10:20                 ` Jakob Oestergaard
  2007-08-05 10:42                   ` Jeff Garzik
  1 sibling, 1 reply; 188+ messages in thread
From: Jakob Oestergaard @ 2007-08-05 10:20 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

On Sat, Aug 04, 2007 at 02:08:40PM -0400, Jeff Garzik wrote:
> Linus Torvalds wrote:
> >The "relatime" thing that David mentioned might well be very useful, but 
> >it's probably even less used than "noatime" is. And sadly, I don't really 
> >see that changing (unless we were to actually change the defaults inside 
> >the kernel).
> 
> 
> I actually vote for that.  IMO, distros should turn -on- atime updates 
> when they know its needed.

Oh dear.

Why not just make ext3 fsync() a no-op while you're at it?

Distros can turn it back on if it's needed...

Of course I'm not serious, but like atime, fsync() is something one
expects to work if it's there.  Disabling atime updates or making
fsync() a no-op will both result in silent failure which I am sure we
can agree is disasterous.

Why on earth would you cripple the kernel defaults for ext3 (which is a
fine FS for boot/root filesystems), when the *fundamental* problem you
really want to solve lie much deeper in the implementation of the
filesystem?  Noatime doesn't solve the problem, it just makes it "less
horrible".

If you really need different filesystem performance characteristics, you
can switch to another filesystem. There's plenty to choose from.

-- 

 / jakob


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05  7:28                               ` Ingo Molnar
@ 2007-08-05 10:29                                 ` Jakob Oestergaard
  2007-08-05 12:46                                 ` Alan Cox
  1 sibling, 0 replies; 188+ messages in thread
From: Jakob Oestergaard @ 2007-08-05 10:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan Cox, Claudio Martins, Jeff Garzik, Jörn Engel,
	Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

On Sun, Aug 05, 2007 at 09:28:05AM +0200, Ingo Molnar wrote:
> 
> * Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> 
> > >  Can you give examples of backup solutions that rely on atime being 
> > > updated? I can understand backup tools using mtime/ctime for 
> > > incremental backups (like tar + Amanda, etc), but I'm having trouble 
> > > figuring out why someone would want to use atime for that.
> > 
> > HSM is the usual one, and to a large extent probably why Unix 
> > originally had atime. Basically migrating less used files away so as 
> > to keep the system disks tidy.
> 
> atime is used as a _hint_, at most and HSM sure works just fine on an 
> atime-incapable filesystem too. So it's the same deal as "add user_xattr 
> mount option to the filesystem to make Beagle index faster". It's now: 
> "if you use HSM storage add the atime mount option to make it slightly 
> more intelligent. Expect huge IO slowdowns though."
> 
> The only remotely valid compatibility argument would be Mutt - but even 
> that handles it just fine. (we broke way more software via noexec)

I find it pretty normal to use tmpreaper to clear out unused files from
certain types of semi-temporary directory structures. Those files are
often only ever read. They'd start randomly disappearing while in use.

But then again, maybe I'm the only guy on the planet who uses tmpreaper.

-- 

 / jakob


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 10:20                 ` Jakob Oestergaard
@ 2007-08-05 10:42                   ` Jeff Garzik
  2007-08-05 10:58                     ` Jakob Oestergaard
  2007-08-05 23:43                     ` David Chinner
  0 siblings, 2 replies; 188+ messages in thread
From: Jeff Garzik @ 2007-08-05 10:42 UTC (permalink / raw)
  To: Jakob Oestergaard, Linus Torvalds, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu
  Cc: Ingo Molnar, Peter Zijlstra, linux-mm, Linux Kernel Mailing List,
	nikita, trond.myklebust, yingchao.zhou, richard, david

Jakob Oestergaard wrote:
> Oh dear.
> 
> Why not just make ext3 fsync() a no-op while you're at it?
> 
> Distros can turn it back on if it's needed...
> 
> Of course I'm not serious, but like atime, fsync() is something one

No, they are nothing alike, and you are just making yourself look silly 
if you compare them.  fsync has to do with fundamental guarantees about 
data.


> expects to work if it's there.  Disabling atime updates or making
> fsync() a no-op will both result in silent failure which I am sure we
> can agree is disasterous.

<rolls eyes>  Climb down from hyperbole mountain.

If you can show massive amounts of users that will actually be 
negatively impacted, please present hard evidence.

Otherwise all this is useless hot air.


> Why on earth would you cripple the kernel defaults for ext3 (which is a
> fine FS for boot/root filesystems), when the *fundamental* problem you
> really want to solve lie much deeper in the implementation of the
> filesystem?  Noatime doesn't solve the problem, it just makes it "less
> horrible".

atime updates -are- a fundamental problem, one you cannot solve by 
tweaking filesystem implementations.  No matter how much you try to hide 
or batch, atime dirties an inode each time on every read...  for a 
feature a tiny minority of programs care about, much less depend on.

Remember several filesystems lock atime to mtime, because they do not 
have a concept of atime, and programs continue to work just fine.  We 
already have field proof of how little atime matters in reality.

	Jeff



^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 10:42                   ` Jeff Garzik
@ 2007-08-05 10:58                     ` Jakob Oestergaard
  2007-08-05 12:46                       ` Ingo Molnar
  2007-08-05 23:43                     ` David Chinner
  1 sibling, 1 reply; 188+ messages in thread
From: Jakob Oestergaard @ 2007-08-05 10:58 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, miklos, akpm, neilb, dgc, tomoki.sekiyama.qu,
	Ingo Molnar, Peter Zijlstra, linux-mm, Linux Kernel Mailing List,
	nikita, trond.myklebust, yingchao.zhou, richard, david

On Sun, Aug 05, 2007 at 06:42:30AM -0400, Jeff Garzik wrote:
...
> If you can show massive amounts of users that will actually be 
> negatively impacted, please present hard evidence.
> 
> Otherwise all this is useless hot air.

Peace Jeff  :)

In another mail, I gave an example with tmpreaper clearing out unused
files; if some of those files are only read and never modified,
tmpreaper would start deleting files which were still frequently used.

That's a regression, the way I see it. As for 'massive amounts of
users', well, tmpreaper exists in most distros, so it's possible it has
other users than just me.

-- 

 / jakob


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05  7:37                             ` Ingo Molnar
  2007-08-05  9:04                               ` Jeff Garzik
@ 2007-08-05 12:43                               ` Alan Cox
  2007-08-05 12:54                                 ` Ingo Molnar
  1 sibling, 1 reply; 188+ messages in thread
From: Alan Cox @ 2007-08-05 12:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: J??rn Engel, Jeff Garzik, Linus Torvalds, Peter Zijlstra,
	linux-mm, Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

> you try to put the blame into distribution makers' shoes but in reality, 
> had the kernel stepped forward with a neat .config option sooner 
> (combined with a neat boot option as well to turn it off), we'd have had 
> noatime systems 10 years ago. A new entry into relnotes and done. It's 

Sorry Ingo, having been in the distribution business for over ten years I
have to disagree. Kernel options that magically totally change the kernel
API and behaviour are exactly what a vendor does *NOT* want to have.

> Distro makers did not dare to do this sooner because some kernel 
> developers came forward with these mostly bogus arguments ... The impact 
> of atime is far better understood by the kernel community, so it is the 
> responsibility of _us_ to signal such things towards distributors, not 
> the other way around.

You are trying to put a bogus divide between kernel community and
developer community. Yet you know perfectly well that a large part of the
kernel community yourself included work for distribution vendors and are
actively building the distribution kernels.

You are perfectly positioned to provide timing examples to the Fedora
development team and make the case for FC8 beta going out that way. You
are perfectly able to propose, build and submit a FC7 extras package of
tuning which people can try in the meantime, but you haven't do so.

Other people in this discussion can do likewise for Debian, SuSE etc.

Your argument appears to be "I can't be bothered to use the due processes
of the distribution but I can do it quickly with an ugly kernel hack".
That is not the right approach. Propose it with your presented numbers to
fedora-devel and I'll be happy to back up such a proposal for the next FC
as will many other kernel folk I'm sure.

Heck, go write a piece for LWN with the benchmark numbers and how to
change your atime options. You'll make Jon happy and lots of folks read
it and will give feedback on improvements as a result.

Alan

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05  7:28                               ` Ingo Molnar
  2007-08-05 10:29                                 ` Jakob Oestergaard
@ 2007-08-05 12:46                                 ` Alan Cox
  2007-08-05 12:58                                   ` Ingo Molnar
  1 sibling, 1 reply; 188+ messages in thread
From: Alan Cox @ 2007-08-05 12:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Claudio Martins, Jeff Garzik, Jörn Engel, Linus Torvalds,
	Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard, david

> The only remotely valid compatibility argument would be Mutt - but even 
> that handles it just fine. (we broke way more software via noexec)

And went through a sensible process of resolving it.

And its not just mutt. HSM stuff stops working which is a big deal as
stuff clogs up. The /tmp/ cleaning tools go wrong as well.

These are big deals because you seem intent on using a large hammer to
force a change that should be done properly by other means.

The /tmp cleaning for example can probably be done other ways in future
but the changes should be in place first.


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 10:58                     ` Jakob Oestergaard
@ 2007-08-05 12:46                       ` Ingo Molnar
  2007-08-05 13:46                         ` Jakob Oestergaard
  2007-08-05 16:45                         ` Linus Torvalds
  0 siblings, 2 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-05 12:46 UTC (permalink / raw)
  To: Jakob Oestergaard, Jeff Garzik, Linus Torvalds, miklos, akpm,
	neilb, dgc, tomoki.sekiyama.qu, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, nikita, trond.myklebust,
	yingchao.zhou, richard, david


* Jakob Oestergaard <jakob@unthought.net> wrote:

> > If you can show massive amounts of users that will actually be 
> > negatively impacted, please present hard evidence.
> > 
> > Otherwise all this is useless hot air.
> 
> Peace Jeff :)
> 
> In another mail, I gave an example with tmpreaper clearing out unused 
> files; if some of those files are only read and never modified, 
> tmpreaper would start deleting files which were still frequently used.
> 
> That's a regression, the way I see it. As for 'massive amounts of 
> users', well, tmpreaper exists in most distros, so it's possible it 
> has other users than just me.

you mean tmpwatch? The trivial change below fixes this. And with that 
we've come to the end of an extremely short list of atime dependencies.

	Ingo

--- /etc/cron.daily/tmpwatch.orig
+++ /etc/cron.daily/tmpwatch
@@ -1,9 +1,9 @@
 #! /bin/sh
-/usr/sbin/tmpwatch -x /tmp/.X11-unix -x /tmp/.XIM-unix -x /tmp/.font-unix \
+/usr/sbin/tmpwatch --mtime -x /tmp/.X11-unix -x /tmp/.XIM-unix -x /tmp/.font-unix \
 	-x /tmp/.ICE-unix -x /tmp/.Test-unix 10d /tmp
-/usr/sbin/tmpwatch 30d /var/tmp
+/usr/sbin/tmpwatch --mtime 30d /var/tmp
 for d in /var/{cache/man,catman}/{cat?,X11R6/cat?,local/cat?}; do
     if [ -d "$d" ]; then
-	/usr/sbin/tmpwatch -f 30d "$d"
+	/usr/sbin/tmpwatch --mtime -f 30d "$d"
     fi
 done

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05  7:21                             ` Ingo Molnar
  2007-08-05  7:29                               ` Andrew Morton
  2007-08-05  8:53                               ` Willy Tarreau
@ 2007-08-05 12:47                               ` Alan Cox
  2007-08-05 12:56                                 ` Ingo Molnar
  2007-08-05 18:44                               ` Dave Jones
  3 siblings, 1 reply; 188+ messages in thread
From: Alan Cox @ 2007-08-05 12:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: J??rn Engel, Jeff Garzik, Linus Torvalds, Peter Zijlstra,
	linux-mm, Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

> > we can move from atime to noatime by default on FC8 with appropriate 
> > release note warnings and having a couple of betas to find out what 
> > other than mutt goes boom.
> 
> btw., Mutt does not go boom, i use it myself. It works just fine and 
> notices new mails even on a noatime,nodiratime filesystem.

Configuration dependant, and also mutt and the shell will misreport new
mail with noatime on the mail spool. The shell should probably use
inotify of course but that change has to be made.

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 12:43                               ` Alan Cox
@ 2007-08-05 12:54                                 ` Ingo Molnar
  2007-08-05 13:37                                   ` Alan Cox
  0 siblings, 1 reply; 188+ messages in thread
From: Ingo Molnar @ 2007-08-05 12:54 UTC (permalink / raw)
  To: Alan Cox
  Cc: J??rn Engel, Jeff Garzik, Linus Torvalds, Peter Zijlstra,
	linux-mm, Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david


* Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> > you try to put the blame into distribution makers' shoes but in 
> > reality, had the kernel stepped forward with a neat .config option 
> > sooner (combined with a neat boot option as well to turn it off), 
> > we'd have had noatime systems 10 years ago. A new entry into 
> > relnotes and done. It's
> 
> Sorry Ingo, having been in the distribution business for over ten 
> years I have to disagree. Kernel options that magically totally change 
> the kernel API and behaviour are exactly what a vendor does *NOT* want 
> to have.

it's default off of course. A distro can turn it on or off.

> > Distro makers did not dare to do this sooner because some kernel 
> > developers came forward with these mostly bogus arguments ... The 
> > impact of atime is far better understood by the kernel community, so 
> > it is the responsibility of _us_ to signal such things towards 
> > distributors, not the other way around.
> 
> You are trying to put a bogus divide between kernel community and 
> developer community. Yet you know perfectly well that a large part of 
> the kernel community yourself included work for distribution vendors 
> and are actively building the distribution kernels.

i've periodically pushed for a noatime distro kernel for like ... 5-10
years and last time this argument came up [i brought it up 6 months ago]
most of the distro kernel developer actually recommended using noatime,
but it took only 1-2 kernel developers to come out with the
'compatibility' and 'compliance' boogeyman to scare the distro userspace
people away from changing /etc/fstab.

so yes, things like this needs a clear message from the kernel folks,
and a kernel option for that is a pretty good way of doing it.

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 12:47                               ` Alan Cox
@ 2007-08-05 12:56                                 ` Ingo Molnar
  0 siblings, 0 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-05 12:56 UTC (permalink / raw)
  To: Alan Cox
  Cc: J??rn Engel, Jeff Garzik, Linus Torvalds, Peter Zijlstra,
	linux-mm, Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david


* Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> > > we can move from atime to noatime by default on FC8 with 
> > > appropriate release note warnings and having a couple of betas to 
> > > find out what other than mutt goes boom.
> > 
> > btw., Mutt does not go boom, i use it myself. It works just fine and 
> > notices new mails even on a noatime,nodiratime filesystem.
> 
> Configuration dependant, and also mutt and the shell will misreport 
> new mail with noatime on the mail spool. The shell should probably use 
> inotify of course but that change has to be made.

just to quote from this same email thread:

| I too use mutt and noatime,nodiratime everywhere (same 10 year-old 
| thinko), and the only side effect is that when I have a new mail, it 
| is reported in all of my xterms until I read it, clearly something I 
| can live with (and sometimes it's even desirable).
|
| In fact, mutt is pretty good at this. It updates atime and ctime 
| itself as soon as it opens the mbox, so the shell is happy and only 
| reports "you have mail" afterwards.

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 12:46                                 ` Alan Cox
@ 2007-08-05 12:58                                   ` Ingo Molnar
  2007-08-05 13:29                                     ` Willy Tarreau
  0 siblings, 1 reply; 188+ messages in thread
From: Ingo Molnar @ 2007-08-05 12:58 UTC (permalink / raw)
  To: Alan Cox
  Cc: Claudio Martins, Jeff Garzik, Jörn Engel, Linus Torvalds,
	Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard, david


* Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> > The only remotely valid compatibility argument would be Mutt - but even 
> > that handles it just fine. (we broke way more software via noexec)
> 
> And went through a sensible process of resolving it.
>
> And its not just mutt. HSM stuff stops working which is a big deal as 
> stuff clogs up. The /tmp/ cleaning tools go wrong as well.

what OSS HSM software stops working and what is its failure mode? /tmp 
cleaning tools will work _just fine_ if we report back max(mtime,ctime) 
as atime - they'll zap more /tmp stuff as they used to. There's no 
guarantee for /tmp contents anyway if tmpwatch is running. Or the patch 
below.

	Ingo

--- /etc/cron.daily/tmpwatch.orig	2007-08-05 14:44:25.000000000 +0200
+++ /etc/cron.daily/tmpwatch	2007-08-05 14:45:10.000000000 +0200
@@ -1,9 +1,9 @@
 #! /bin/sh
-/usr/sbin/tmpwatch -x /tmp/.X11-unix -x /tmp/.XIM-unix -x /tmp/.font-unix \
+/usr/sbin/tmpwatch --mtime -x /tmp/.X11-unix -x /tmp/.XIM-unix -x /tmp/.font-unix \
 	-x /tmp/.ICE-unix -x /tmp/.Test-unix 10d /tmp
-/usr/sbin/tmpwatch 30d /var/tmp
+/usr/sbin/tmpwatch --mtime 30d /var/tmp
 for d in /var/{cache/man,catman}/{cat?,X11R6/cat?,local/cat?}; do
     if [ -d "$d" ]; then
-	/usr/sbin/tmpwatch -f 30d "$d"
+	/usr/sbin/tmpwatch --mtime -f 30d "$d"
     fi
 done

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05  7:13                           ` Ingo Molnar
@ 2007-08-05 13:22                             ` Diego Calleja
  2007-08-05 19:03                               ` david
  2007-08-06  6:58                               ` Ingo Molnar
  2007-08-09  0:57                             ` Greg Trounson
  1 sibling, 2 replies; 188+ messages in thread
From: Diego Calleja @ 2007-08-05 13:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan Cox, J??rn Engel, Jeff Garzik, Linus Torvalds,
	Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard, david

El Sun, 5 Aug 2007 09:13:20 +0200, Ingo Molnar <mingo@elte.hu> escribió:

> Measurements show that noatime helps 20-30% on regular desktop 
> workloads, easily 50% for kernel builds and much more than that (in 
> excess of 100%) for file-read-intense workloads. We cannot just walk 


And as everybody knows in servers is a popular practice to disable it.
According to an interview to the kernel.org admins....

"Beyond that, Peter noted, "very little fancy is going on, and that is good
because fancy is hard to maintain." He explained that the only fancy thing
being done is that all filesystems are mounted noatime meaning that the
system doesn't have to make writes to the filesystem for files which are
simply being read, "that cut the load average in half."

I bet that some people would consider such performance hit a bug...

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 12:58                                   ` Ingo Molnar
@ 2007-08-05 13:29                                     ` Willy Tarreau
  2007-08-06  6:57                                       ` Ingo Molnar
  0 siblings, 1 reply; 188+ messages in thread
From: Willy Tarreau @ 2007-08-05 13:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan Cox, Claudio Martins, Jeff Garzik, Jörn Engel,
	Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

On Sun, Aug 05, 2007 at 02:58:47PM +0200, Ingo Molnar wrote:
> 
> * Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> 
> > > The only remotely valid compatibility argument would be Mutt - but even 
> > > that handles it just fine. (we broke way more software via noexec)
> > 
> > And went through a sensible process of resolving it.
> >
> > And its not just mutt. HSM stuff stops working which is a big deal as 
> > stuff clogs up. The /tmp/ cleaning tools go wrong as well.
> 
> what OSS HSM software stops working and what is its failure mode? /tmp 
> cleaning tools will work _just fine_ if we report back max(mtime,ctime) 
> as atime - they'll zap more /tmp stuff as they used to. There's no 
> guarantee for /tmp contents anyway if tmpwatch is running. Or the patch 
> below.

Ingo,

In your example above, maybe it's the opposite, users know they can keep a
file in /tmp one more week by simply cat'ing it.

Changing the kernel in a non-easily reversible way is not kind to the users.
As you pointed it, there's no "atime" option in mount, and quite frankly,
having to reboot an NFS server to change a command line option which should
belong to fstab is quite gross. And yes, there may be people realying on
atime in specific environments. I remember having used it in the past to
automatically archive unused files. Those people might not be affected by
the drop in performance at all and would rather keep the feature.

I like Alan's idea of a package to automatically add "noatime" everywhere
in fstab, not only because it's easy to use, but because it will also teach
users how they can proceed on their other systems. Also, if you make the
package yourself, it will benefit from the "coolness factor" many people
see in everything that's done by renown persons (you know, the type of
people who regularly ask you if you use vi/emacs and what type of window
manager, and who then consider it must be good if you use it). I'll stop
ranting here, some of them may be reading ;-)

As a second step, once many people explicitly ask for "noatime" by default,
it will be time to add MS_ATIME to the kernel and to mount, and set NOATIME
as the default with big warnings. This will make everyone happy.

But expecting the admins to recompile their kernels or to reboot to change
the atime status is not acceptable IMHO. Moreover, they will not even know
they have to do this and they will feel frustrated because the system will
not do what they want.

I've already been bothered a lot by ext3 filesystems with dirindex enabled.
When you boot from an old CD and you cannot mount them, it's already quite
irritating (not to mention that tune2fs from the old CD does not know about
it either so you cannot disable the option). But it's even worse when you
plug an USB hard disk into an old server to start a backup and notice that
you cannot mount the disk without first upgrading your kernel !

For this reason, I think that the default noatime will be desirable only
after MS_ATIME is supported by both the kernel and the tools.

Cheers,
Willy


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 12:54                                 ` Ingo Molnar
@ 2007-08-05 13:37                                   ` Alan Cox
  2007-08-05 18:08                                     ` Ingo Molnar
  0 siblings, 1 reply; 188+ messages in thread
From: Alan Cox @ 2007-08-05 13:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: J??rn Engel, Jeff Garzik, Linus Torvalds, Peter Zijlstra,
	linux-mm, Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

> it's default off of course. A distro can turn it on or off.

...

> i've periodically pushed for a noatime distro kernel for like ... 5-10
> years and last time this argument came up [i brought it up 6 months ago]
> most of the distro kernel developer actually recommended using noatime,
> but it took only 1-2 kernel developers to come out with the
> 'compatibility' and 'compliance' boogeyman to scare the distro userspace
> people away from changing /etc/fstab.

And you honestly think that putting it in Kconfig as well as allowing
users to screw up horribly and creating incompatible defaults you can't
test for in a user space app where it matters is going to *change* this.

Do you really think anyone who said "noatime, compatibility, umm errr" is
going to say "noatime, compatibility, but hey its in Kconfig lets do it".
You argument doesn't hold up to minimal rational consideration. Posting
to the distribution devel list with: "Its a 50% performance win, we need
to fix these corner cases, here's a tmpwatch patch" is *exactly* what is
needed to change it, and Kconfig options are irrelevant to that.

Be serious and do this the proper way, propose it for FC8, go through the
proper due process. Otherwise the FC8 process will simply continue as
"umm err, compatibility" and it'll go nowhere.

You can't really complain about the CK scheduler and Con trying to do
stuff his own way without listening and then do this can you ? 

Alan

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 12:46                       ` Ingo Molnar
@ 2007-08-05 13:46                         ` Jakob Oestergaard
  2007-08-05 16:45                         ` Linus Torvalds
  1 sibling, 0 replies; 188+ messages in thread
From: Jakob Oestergaard @ 2007-08-05 13:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeff Garzik, Linus Torvalds, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, nikita, trond.myklebust,
	yingchao.zhou, richard, david

On Sun, Aug 05, 2007 at 02:46:48PM +0200, Ingo Molnar wrote:
> 
> * Jakob Oestergaard <jakob@unthought.net> wrote:
> 
> > > If you can show massive amounts of users that will actually be 
> > > negatively impacted, please present hard evidence.
> > > 
> > > Otherwise all this is useless hot air.
> > 
> > Peace Jeff :)
> > 
> > In another mail, I gave an example with tmpreaper clearing out unused 
> > files; if some of those files are only read and never modified, 
> > tmpreaper would start deleting files which were still frequently used.
> > 
> > That's a regression, the way I see it. As for 'massive amounts of 
> > users', well, tmpreaper exists in most distros, so it's possible it 
> > has other users than just me.
> 
> you mean tmpwatch?

Same same.

> The trivial change below fixes this. And with that 
> we've come to the end of an extremely short list of atime dependencies.

Please read what I wrote, not what you think I wrote.

If I only *read* those files, the mtime will not be updated, only the
atime.

And the files *will* then magically begin to disappear although they are
frequently used.

That will happen with a standard piece of software in a standard
configuration, in a scenario that may or may not be common... I have no
idea how common such a setup is - but I know how much it would suck to
have files magically disappearing because of a kernel upgrade  :)

-- 

 / jakob


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05  8:53                               ` Willy Tarreau
@ 2007-08-05 14:17                                 ` Jörn Engel
  2007-08-05 18:02                                   ` Arjan van de Ven
  0 siblings, 1 reply; 188+ messages in thread
From: Jörn Engel @ 2007-08-05 14:17 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Ingo Molnar, Alan Cox, J??rn Engel, Jeff Garzik, Linus Torvalds,
	Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard, david

On Sun, 5 August 2007 10:53:54 +0200, Willy Tarreau wrote:
> On Sun, Aug 05, 2007 at 09:21:41AM +0200, Ingo Molnar wrote:
> > 
> > btw., Mutt does not go boom, i use it myself. It works just fine and 
> > notices new mails even on a noatime,nodiratime filesystem.
> 
> IIRC, atime is used by mailers and by the shell to detect that new
> mail has arrived and report it only once if there are several intances
> watching the same mbox.
> 
> I too use mutt and noatime,nodiratime everywhere (same 10 year-old
> thinko), and the only side effect is that when I have a new mail,
> it is reported in all of my xterms until I read it, clearly something
> I can live with (and sometimes it's even desirable).
> 
> In fact, mutt is pretty good at this. It updates atime and ctime itself
> as soon as it opens the mbox, so the shell is happy and only reports
> "you have mail" afterwards.

For me mutt fails to recognize new mail.  And the difference might be
this:
http://www.google.de/search?q=enable-buffy-size

Jörn

-- 
Fancy algorithms are slow when n is small, and n is usually small.
Fancy algorithms have big constants. Until you know that n is
frequently going to be big, don't get fancy.
-- Rob Pike

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05  0:49                             ` Alan Cox
  2007-08-05  7:28                               ` Ingo Molnar
@ 2007-08-05 14:46                               ` Theodore Tso
  2007-08-05 17:55                                 ` Ingo Molnar
  2007-08-05 18:08                                 ` Arjan van de Ven
  1 sibling, 2 replies; 188+ messages in thread
From: Theodore Tso @ 2007-08-05 14:46 UTC (permalink / raw)
  To: Alan Cox
  Cc: Claudio Martins, Jeff Garzik, Ingo Molnar, Jörn Engel,
	Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

On Sun, Aug 05, 2007 at 01:49:26AM +0100, Alan Cox wrote:
> HSM is the usual one, and to a large extent probably why Unix originally
> had atime. Basically migrating less used files away so as to keep the
> system disks tidy.
> 
> Its not something usally found on desktop boxes so it doesn't in anyway
> argue against the distribution using noatime or relative atime, but on
> big server boxes it matters

In addition, big server boxes are usually not reading a huge *number*
of files per second.  The place where you see this as a problem is (a)
compilation, thanks to huge /usr/include hierarchies (and here things
have gotten worse over time as include files have gotten much more
complex than in the early Unix days), and (b) silly desktop apps that
want to scan huge numbers of XML files or who want to read every
single image file on the desktop or in an open file browser window to
show c00l icons.  Oh, and I guess I should include Maildir setups.

If you are always reading from the same small set of files (i.e., a
database workload), then those inodes only get updated every 5 seconds
(the traditional/default metadata update sync time, as well as the
default ext3 journal update time), it's no big deal.  Or if you are
running a mail server, most of the time the mail queue files are
getting updated anyway as you process them, and usually the mail is
delivered before 5 seconds is up anyway.  

So earlier, when Ingo characterized it as, "whenever you read from a
file, even one in memory cache.... do a write!", it's probably a bit
unfair.  Traditional Unix systems simply had very different workload
characteristics than many modern dekstop systems today.

							- Ted

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05  0:26             ` Andi Kleen
@ 2007-08-05 15:00               ` Theodore Tso
  2007-08-06 13:47                 ` Chris Mason
  2007-08-17  0:45                 ` Dave Jones
  2007-08-05 20:41               ` Christoph Hellwig
  2007-08-16 10:18               ` Helge Hafting
  2 siblings, 2 replies; 188+ messages in thread
From: Theodore Tso @ 2007-08-05 15:00 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra, linux-mm,
	linux-kernel, miklos, akpm, neilb, dgc, tomoki.sekiyama.qu,
	nikita, trond.myklebust, yingchao.zhou, richard

On Sun, Aug 05, 2007 at 02:26:53AM +0200, Andi Kleen wrote:
> I always thought the right solution would be to just sync atime only
> very very lazily. This means if a inode is only dirty because of an
> atime update put it on a "only write out when there is nothing to do
> or the memory is really needed" list.

As I've mentioend earlier, the memory balancing issues that arise when
we add an "atime dirty" bit scare me a little.  It can be addressed,
obviously, but at the cost of more code complexity.

An alternative is to simply have a tunable parameter, via either a
mount option or stashed in the superblock which controls atime's
granularity guarantee.  That is, only update the atime if it is older
than some set time that could be configurable as a mount option or in
the superblock.  Most of the time, an HSM system simply wants to know
if a file has been used sometime "recently", where recently might be
measured in hours or in days.

This is IMHO slightly better than relatime, since it keeps the spirit
of the atime update, while keeping the performance impact to a very
minimal (and tunable) level.

						- Ted

P.S.  Yet alternative is to specify noatime on an individual
file/directory basis.  We've had this capability for a *long* time,
and if a distro were to set noatime for all files in certain
hierarchies (i.e., /usr/include) and certain top-level directories
(since the chattr +A flag is inherited), I think folks would find that
this would reduce the I/O traffic of noatime by a huge amount.  This
also would be 100% POSIX compliant, since we are extending the
filesystem and setting certain files to use it.  But if users want to
know when was the last time they looked at a particular file in their
home directory, they would still have that facility.


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 12:46                       ` Ingo Molnar
  2007-08-05 13:46                         ` Jakob Oestergaard
@ 2007-08-05 16:45                         ` Linus Torvalds
  2007-08-05 19:09                           ` Ingo Molnar
  1 sibling, 1 reply; 188+ messages in thread
From: Linus Torvalds @ 2007-08-05 16:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jakob Oestergaard, Jeff Garzik, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, nikita, trond.myklebust,
	yingchao.zhou, richard, david


On Sun, 5 Aug 2007, Ingo Molnar wrote:
> 
> you mean tmpwatch? The trivial change below fixes this. And with that 
> we've come to the end of an extremely short list of atime dependencies.

You wouldn't even need these kinds of games.

What we could do is to make "relatime" updates a bit smarter.

A bit smarter would be:

 - update atime if the old atime is <= than mtime/ctime

   Logic: things like mailers can care about whether some new state has 
   been read or not. This is the current relatime.

 - update atime if the old atime is more than X seconds in the past 
   (defaulting to one day or something)

   Logic: things like tmpwatch and backup software may want to remove 
   stuff that hasn't been touched in a long time, but they sure don't care 
   about "exact" atime.

Now, you could also make the rule be that "X" depends on mtime/ctime, ie 
if a file has been "recently" created or modified, we keep more exact 
track of it and use one hour instead of one day, but if it's some old file 
that hasn't been modified in the last six months, we change X to a week. 
IOW, the "exactness" of atime is relative to how old the inode 
modifications are.

We could obviously do with an additional rule:

 - update atime if the inode is dirty anyway. Logic: there's no downside.

which just says that we'll make it exact if there is no reason not to.

			Linus

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04  6:32   ` Ingo Molnar
  2007-08-04  7:07     ` Ingo Molnar
@ 2007-08-05 17:22     ` Brice Figureau
  2007-08-05 22:17       ` Andi Kleen
  1 sibling, 1 reply; 188+ messages in thread
From: Brice Figureau @ 2007-08-05 17:22 UTC (permalink / raw)
  To: linux-kernel

Hi,

Ingo Molnar <mingo <at> elte.hu> writes:
> * Linus Torvalds <torvalds <at> linux-foundation.org> wrote:
> > On Fri, 3 Aug 2007, Peter Zijlstra wrote:
> > > 
> > > These patches aim to improve balance_dirty_pages() and directly address 
> > > three issues:
> > >   1) inter device starvation
> > >   2) stacked device deadlocks
> > >   3) inter process starvation
> > 
> > Ok, the patches certainly look pretty enough, and you fixed the only 
> > thing I complained about last time (naming), so as far as I'm 
> > concerned it's now just a matter of whether it *works* or not. I guess 
> > being in -mm will help somewhat, but it would be good to have people 
> > with several disks etc actively test this out.
> 
> There are positive reports in the never-ending "my system crawls like an 
> XT when copying large files" bugzilla entry:
> 
>  http://bugzilla.kernel.org/show_bug.cgi?id=7372
> 
>[ snipped part of the bug report ]
> 
> so the whole problem area seems to be a "perfect storm" created by a 
> combination of TCQ, IO scheduling and VM dirty handling weaknesses. Per 
> device dirty throttling is a good step forward and it makes a very 
> visible positive difference.


Foreword: I'm the OP of bug #7372. 

I just want to say/add that:
 1) I'm running the per-bdi patch since about 30 days on a master mysql server
under somewhat mild load without any adverse effect I could notice.

 2) I _still_ don't get the "performances" of 2.6.17, but since that's the
better combination I could get, I think there is IMHO progress in the right
direction (to be compared to no progress since 2.6.18, that's better :-)).

To be honest, a vanilla 2.6.17 not tuned at all (ie vfs_cache_pressure and other
knobs in /proc/sys/vm like swappiness and dirty_*) is still better than any
other upcoming kernel I tested. Thus I still think 2.6.18 added a big regression
(which unfortunately I couldn't find).
Read the full bug report for any background information if needed.

Unfortunately it isn't practical to git-bisect my issue as the server is a
production server that can't be rebooted/stopped whenever I want (and since I
found workarounds of the issue...).

Thanks for showing interest in this issue.

Please CC: me on any answers as I'm not subscribed to the list.

--
Brice Figureau


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 14:46                               ` Theodore Tso
@ 2007-08-05 17:55                                 ` Ingo Molnar
  2007-08-05 17:59                                   ` Jeff Garzik
  2007-08-05 18:08                                 ` Arjan van de Ven
  1 sibling, 1 reply; 188+ messages in thread
From: Ingo Molnar @ 2007-08-05 17:55 UTC (permalink / raw)
  To: Theodore Tso, Alan Cox, Claudio Martins, Jeff Garzik,
	Jörn Engel, Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david


* Theodore Tso <tytso@mit.edu> wrote:

> If you are always reading from the same small set of files (i.e., a 
> database workload), then those inodes only get updated every 5 seconds 
> (the traditional/default metadata update sync time, as well as the 
> default ext3 journal update time), it's no big deal.  Or if you are 
> running a mail server, most of the time the mail queue files are 
> getting updated anyway as you process them, and usually the mail is 
> delivered before 5 seconds is up anyway.
> 
> So earlier, when Ingo characterized it as, "whenever you read from a 
> file, even one in memory cache.... do a write!", it's probably a bit 
> unfair.  Traditional Unix systems simply had very different workload 
> characteristics than many modern dekstop systems today.

yeah, i didnt mean to say that it is _always_ a big issue, but "only a 
small number of files are read" is a very, very small minority of even 
the database server world.

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 17:55                                 ` Ingo Molnar
@ 2007-08-05 17:59                                   ` Jeff Garzik
  2007-08-05 18:09                                     ` Ingo Molnar
  0 siblings, 1 reply; 188+ messages in thread
From: Jeff Garzik @ 2007-08-05 17:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Theodore Tso, Alan Cox, Claudio Martins, Jörn Engel,
	Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

Ingo Molnar wrote:
> * Theodore Tso <tytso@mit.edu> wrote:
> 
>> If you are always reading from the same small set of files (i.e., a 
>> database workload), then those inodes only get updated every 5 seconds 
>> (the traditional/default metadata update sync time, as well as the 
>> default ext3 journal update time), it's no big deal.  Or if you are 
>> running a mail server, most of the time the mail queue files are 
>> getting updated anyway as you process them, and usually the mail is 
>> delivered before 5 seconds is up anyway.
>>
>> So earlier, when Ingo characterized it as, "whenever you read from a 
>> file, even one in memory cache.... do a write!", it's probably a bit 
>> unfair.  Traditional Unix systems simply had very different workload 
>> characteristics than many modern dekstop systems today.
> 
> yeah, i didnt mean to say that it is _always_ a big issue, but "only a 
> small number of files are read" is a very, very small minority of even 
> the database server world.

OTOH, consider a popular Linux task, web serving.  atime results in a 
lot of unnecessary disk traffic.

	Jeff




^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 21:48                           ` Theodore Tso
@ 2007-08-05 18:01                             ` Arjan van de Ven
  2007-08-05 20:34                               ` Christoph Hellwig
  0 siblings, 1 reply; 188+ messages in thread
From: Arjan van de Ven @ 2007-08-05 18:01 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Linus Torvalds, Jörn Engel, Ingo Molnar, Jeff Garzik,
	Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard, david

On Sat, 2007-08-04 at 17:48 -0400, Theodore Tso wrote:
> On Sat, Aug 04, 2007 at 01:13:19PM -0700, Arjan van de Ven wrote:
> > there is another trick possible (more involved though, Al will have to
> > jump in on that one I suspect): Have 2 types of "dirty inode" states;
> > one is the current dirty state (meaning the full range of ext3
> > transactions etc) and "lighter" state of "atime-dirty"; which will not
> > do the background syncs or journal transactions (so if your machine
> > crashes, you lose the atime update) but it does keep atime for most
> > normal cases and keeps it standard compliant "except after a crash".
> 
> That would make us standards compliant (POSIX explicitly says that
> what happens after a unclean shutdown is Unspecified) and it would
> make things a heck of a lot faster.  However, there is a potential
> problem which is that it will keep a large number of inodes pinned in
> memory, which is its own problem.  So there would have to be some way
> to force the atime updates to be merged when under memory pressure,
> and and perhaps on some much longer background interval (i.e., every
> hour or so).

on the journalling side this would be one transaction (not 5 milion)
and... since inodes are grouped on disk, you can even get some better
coalescing this way... 

Wonder if we could do inode-grouping smartly; eg if we HAVE to write
inode X, also write out the atime-dirty inodes in range X-Y to X+Y
(where Y is some tunable) in the same IO..


-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 14:17                                 ` Jörn Engel
@ 2007-08-05 18:02                                   ` Arjan van de Ven
  2007-08-05 18:37                                     ` Jörn Engel
  0 siblings, 1 reply; 188+ messages in thread
From: Arjan van de Ven @ 2007-08-05 18:02 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Willy Tarreau, Ingo Molnar, Alan Cox, Jeff Garzik,
	Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

On Sun, 2007-08-05 at 16:17 +0200, Jörn Engel wrote:
> On Sun, 5 August 2007 10:53:54 +0200, Willy Tarreau wrote:
> > On Sun, Aug 05, 2007 at 09:21:41AM +0200, Ingo Molnar wrote:
> > > 
> > > btw., Mutt does not go boom, i use it myself. It works just fine and 
> > > notices new mails even on a noatime,nodiratime filesystem.
> > 
> > IIRC, atime is used by mailers and by the shell to detect that new
> > mail has arrived and report it only once if there are several intances
> > watching the same mbox.
> > 
> > I too use mutt and noatime,nodiratime everywhere (same 10 year-old
> > thinko), and the only side effect is that when I have a new mail,
> > it is reported in all of my xterms until I read it, clearly something
> > I can live with (and sometimes it's even desirable).
> > 
> > In fact, mutt is pretty good at this. It updates atime and ctime itself
> > as soon as it opens the mbox, so the shell is happy and only reports
> > "you have mail" afterwards.
> 
> For me mutt fails to recognize new mail.  And the difference might be
> this:

but does it work with relatime ?



^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 14:46                               ` Theodore Tso
  2007-08-05 17:55                                 ` Ingo Molnar
@ 2007-08-05 18:08                                 ` Arjan van de Ven
  1 sibling, 0 replies; 188+ messages in thread
From: Arjan van de Ven @ 2007-08-05 18:08 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Alan Cox, Claudio Martins, Jeff Garzik, Ingo Molnar,
	Jörn Engel, Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david


> 
> In addition, big server boxes are usually not reading a huge *number*
> of files per second.  The place where you see this as a problem is (a)
> compilation, thanks to huge /usr/include hierarchies (and here things
> have gotten worse over time as include files have gotten much more
> complex than in the early Unix days), and (b) silly desktop apps that
> want to scan huge numbers of XML files or who want to read every
> single image file on the desktop or in an open file browser window to
> show c00l icons.  Oh, and I guess I should include Maildir setups.
> 
> If you are always reading from the same small set of files (i.e., a
> database workload), then those inodes only get updated every 5 seconds
> (the traditional/default metadata update sync time, as well as the
> default ext3 journal update time), it's no big deal.  Or if you are
> running a mail server, most of the time the mail queue files are
> getting updated anyway as you process them, and usually the mail is
> delivered before 5 seconds is up anyway.  


it's just one of those things that get compounded with journaling
filesystems though..... a single async write that happens "sometime in
the future" is one thing... having a full transaction (which acts as
barrier and synchronisation point) is something totally worse.

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 13:37                                   ` Alan Cox
@ 2007-08-05 18:08                                     ` Ingo Molnar
  2007-08-05 19:11                                       ` Alan Cox
  2007-08-08 18:22                                       ` Bill Davidsen
  0 siblings, 2 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-05 18:08 UTC (permalink / raw)
  To: Alan Cox
  Cc: J??rn Engel, Jeff Garzik, Linus Torvalds, Peter Zijlstra,
	linux-mm, Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david


* Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> And you honestly think that putting it in Kconfig as well as allowing 
> users to screw up horribly and creating incompatible defaults you

So far you've not offered one realistic scenario of "screw up horribly". 
People have been using noatime for a long time and there are no horror 
stories about that. _Which_ OSS HSM software relies on atime?

> can't test for in a user space app where it matters is going to 
> *change* this.

The patch i posted today adds /proc/sys/kernel/mount_with_atime. That 
can be tested by user-space, if it truly cares about atime.

> Do you really think anyone who said "noatime, compatibility, umm errr" 
> is going to say "noatime, compatibility, but hey its in Kconfig lets 
> do it". You argument doesn't hold up to minimal rational 
> consideration. Posting to the distribution devel list with: "Its a 50% 
> performance win, we need to fix these corner cases, here's a tmpwatch 
> patch" is *exactly* what is needed to change it, and Kconfig options 
> are irrelevant to that.

i did exactly that 6 months ago, check your email folders. I went by the 
"process". But it doesnt really matter anymore, Ubuntu has done the step 
and Fedora will be forced to do it too. But it's sad that it took us 10 
years. I'd like to remind you again:

|| ...For me, I would say 50% is not enough to describe the _visible_ 
|| benefits... Not talking any specific number but past 10sec-1min+ 
|| lagging in X is history, it's gone and I really don't miss it that 
|| much... :-) Cannot reproduce even a second long delay anymore in 
|| window focusing under considerable load as it's basically 
|| instantaneous (I can see that it's loaded but doesn't affect the 
|| feeling of responsiveness I'm now getting), even on some loads that I 
|| couldn't previously even dream of... [...]

we really have to ask ourselves whether the "process" is correct if 
advantages to the user of this order of magnitude can be brushed aside 
with simple "this breaks binary-only HSM" and "it's not standards 
compliant" arguments.

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 17:59                                   ` Jeff Garzik
@ 2007-08-05 18:09                                     ` Ingo Molnar
  0 siblings, 0 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-05 18:09 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Theodore Tso, Alan Cox, Claudio Martins, Jörn Engel,
	Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david


* Jeff Garzik <jeff@garzik.org> wrote:

> > yeah, i didnt mean to say that it is _always_ a big issue, but "only 
> > a small number of files are read" is a very, very small minority of 
> > even the database server world.
> 
> OTOH, consider a popular Linux task, web serving.  atime results in a 
> lot of unnecessary disk traffic.

it's a big, noticeable effect on 99% of the Linux boxes.

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 18:02                                   ` Arjan van de Ven
@ 2007-08-05 18:37                                     ` Jörn Engel
  2007-08-05 20:21                                       ` Jörn Engel
  0 siblings, 1 reply; 188+ messages in thread
From: Jörn Engel @ 2007-08-05 18:37 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Jörn Engel, Willy Tarreau, Ingo Molnar, Alan Cox,
	Jeff Garzik, Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

On Sun, 5 August 2007 11:02:33 -0700, Arjan van de Ven wrote:
> 
> but does it work with relatime ?

Like a greased penguin.  I had to reboot with my ugly patch posted
earlier in the patch to actually test it, though.  Relatime suffers from
a distribution problem, nothing else.

Guess I should throw in a kernel compile test as well, just to get a
feel for the performance.

Jörn

-- 
Homo Sapiens is a goal, not a description.
-- unknown

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05  7:21                             ` Ingo Molnar
                                                 ` (2 preceding siblings ...)
  2007-08-05 12:47                               ` Alan Cox
@ 2007-08-05 18:44                               ` Dave Jones
  2007-08-05 18:58                                 ` adi
  2007-08-06  6:39                                 ` Ingo Molnar
  3 siblings, 2 replies; 188+ messages in thread
From: Dave Jones @ 2007-08-05 18:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan Cox, J??rn Engel, Jeff Garzik, Linus Torvalds,
	Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard, david

On Sun, Aug 05, 2007 at 09:21:41AM +0200, Ingo Molnar wrote:
 > * Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
 > 
 > > With a Red Hat on if we can move from /dev/hda to /dev/sda in FC7 then 
 > > we can move from atime to noatime by default on FC8 with appropriate 
 > > release note warnings and having a couple of betas to find out what 
 > > other than mutt goes boom.
 > 
 > btw., Mutt does not go boom, i use it myself. It works just fine and 
 > notices new mails even on a noatime,nodiratime filesystem.
 
It still fails miserably for me.

If I hit 'C' and '?' I get a list of my mail folders, with some of them
marked 'N' if they have new mail.  Without atime, those N's never show
up and every mbox looks like it has no new mail.

	Dave

-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 18:44                               ` Dave Jones
@ 2007-08-05 18:58                                 ` adi
  2007-08-06  6:39                                 ` Ingo Molnar
  1 sibling, 0 replies; 188+ messages in thread
From: adi @ 2007-08-05 18:58 UTC (permalink / raw)
  To: Dave Jones, Ingo Molnar, Alan Cox, J??rn Engel, Jeff Garzik,
	Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

On Sun, Aug 05, 2007 at 02:44:08PM -0400, Dave Jones wrote:
> It still fails miserably for me.
> 
> If I hit 'C' and '?' I get a list of my mail folders, with some of them
> marked 'N' if they have new mail.  Without atime, those N's never show
> up and every mbox looks like it has no new mail.

This is true for one using mbox_type=mbox (i.e unix native mailbox
format). Maildir type should work just fine as mutt will noticed
that new mail has arrived on 'new' subdir (according to maildir spec).

Then yes, it is configuration dependent.

Regards,

P.Y. Adi Prasaja

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 13:22                             ` Diego Calleja
@ 2007-08-05 19:03                               ` david
  2007-08-06  6:52                                 ` Ingo Molnar
  2007-08-10  4:04                                 ` Bill Davidsen
  2007-08-06  6:58                               ` Ingo Molnar
  1 sibling, 2 replies; 188+ messages in thread
From: david @ 2007-08-05 19:03 UTC (permalink / raw)
  To: Diego Calleja
  Cc: Ingo Molnar, Alan Cox, J??rn Engel, Jeff Garzik, Linus Torvalds,
	Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1680 bytes --]

On Sun, 5 Aug 2007, Diego Calleja wrote:

> El Sun, 5 Aug 2007 09:13:20 +0200, Ingo Molnar <mingo@elte.hu> escribió:
>
>> Measurements show that noatime helps 20-30% on regular desktop
>> workloads, easily 50% for kernel builds and much more than that (in
>> excess of 100%) for file-read-intense workloads. We cannot just walk
>
>
> And as everybody knows in servers is a popular practice to disable it.
> According to an interview to the kernel.org admins....
>
> "Beyond that, Peter noted, "very little fancy is going on, and that is good
> because fancy is hard to maintain." He explained that the only fancy thing
> being done is that all filesystems are mounted noatime meaning that the
> system doesn't have to make writes to the filesystem for files which are
> simply being read, "that cut the load average in half."
>
> I bet that some people would consider such performance hit a bug...
>

actually, it's popular practice to disable it by people who know how big a 
hit it is and know how few programs use it.

i've been a linux sysadmin for 10 years, and have known about noatime for 
at least 7 years, but I always thought of it in the catagory of 'use it 
only on your performance critical machines where you are trying to extract 
every ounce of performance, and keep an eye out for things misbehaving'

I never imagined that itwas the 20%+ hit that is being described, and with 
so little impact, or I would have switched to it across the board years 
ago.

I'll bet there are a lot of admins out there in the same boat.

adding an option in the kernel to change the default sounds like a very 
good first step, even if the default isn't changed today.

David Lang

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 16:45                         ` Linus Torvalds
@ 2007-08-05 19:09                           ` Ingo Molnar
  2007-08-05 19:22                             ` [patch] implement smarter atime updates support Ingo Molnar
  2007-08-05 19:29                             ` [PATCH 00/23] per device dirty throttling -v8 Alan Cox
  0 siblings, 2 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-05 19:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jakob Oestergaard, Jeff Garzik, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, nikita, trond.myklebust,
	yingchao.zhou, richard, david


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sun, 5 Aug 2007, Ingo Molnar wrote:
> > 
> > you mean tmpwatch? The trivial change below fixes this. And with that 
> > we've come to the end of an extremely short list of atime dependencies.
> 
> You wouldn't even need these kinds of games.
> 
> What we could do is to make "relatime" updates a bit smarter.
> 
> A bit smarter would be:
> 
>  - update atime if the old atime is <= than mtime/ctime
> 
>    Logic: things like mailers can care about whether some new state has 
>    been read or not. This is the current relatime.
> 
>  - update atime if the old atime is more than X seconds in the past 
>    (defaulting to one day or something)
> 
>    Logic: things like tmpwatch and backup software may want to remove 
>    stuff that hasn't been touched in a long time, but they sure don't care 
>    about "exact" atime.

ok, i've implemented this and it's working fine. Check out the 
relatime_need_update() function for the details of the logic. Atime 
update frequency is 1 day with that, and we update at least once after 
every modification as well, for the mailer logic.

tested it by moving the date forward:

  # date
  Sun Aug  5 22:55:14 CEST 2007
  # date -s "Tue Aug  7 22:55:14 CEST 2007"
  Tue Aug  7 22:55:14 CEST 2007

access to a file did not generate disk IO before the date was set, and 
it generated exactly one IO after the date was set.

( should i perhaps reduce the number of boot options and only use a
  single "norelatime_default" boot option to turn this off? )

	Ingo

------------------------------------>
Subject: [patch] add norelatime/relatime boot options, CONFIG_DEFAULT_RELATIME
From: Ingo Molnar <mingo@elte.hu>

change relatime updates to be performed once per day. This makes
relatime a compatible solution for HSM, mailer-notification and
tmpwatch applications too.

also add the CONFIG_DEFAULT_RELATIME kernel option, which makes
"norelatime" the default for all mounts without an extra kernel
boot option.

add the "norelatime" (and "relatime") boot options to enable/disable
relatime updates for all filesystems.

also add the /proc/sys/kernel/mount_with_relatime flag which can be changed
runtime to modify the behavior of subsequent new mounts.

tested by moving the date forward:

   # date
   Sun Aug  5 22:55:14 CEST 2007
   # date -s "Tue Aug  7 22:55:14 CEST 2007"
   Tue Aug  7 22:55:14 CEST 2007

access to a file did not generate disk IO before the date was set, and
it generated exactly one IO after the date was set.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 Documentation/kernel-parameters.txt |   12 +++++++
 fs/Kconfig                          |   17 ++++++++++
 fs/inode.c                          |   48 ++++++++++++++++++++--------
 fs/namespace.c                      |   61 ++++++++++++++++++++++++++++++++++++
 include/linux/mount.h               |    2 +
 kernel/sysctl.c                     |    9 +++++
 6 files changed, 136 insertions(+), 13 deletions(-)

Index: linux/Documentation/kernel-parameters.txt
===================================================================
--- linux.orig/Documentation/kernel-parameters.txt
+++ linux/Documentation/kernel-parameters.txt
@@ -303,6 +303,12 @@ and is between 256 and 4096 characters. 
 
 	atascsi=	[HW,SCSI] Atari SCSI
 
+	relatime        [FS] default to enabled relatime updates on all
+			filesystems.
+
+	relatime=       [FS] default to enabled/disabled relatime updates on
+			all filesystems.
+
 	atkbd.extra=	[HW] Enable extra LEDs and keys on IBM RapidAccess,
 			EzKey and similar keyboards
 
@@ -1100,6 +1106,12 @@ and is between 256 and 4096 characters. 
 	noasync		[HW,M68K] Disables async and sync negotiation for
 			all devices.
 
+	norelatime      [FS] default to disabled relatime updates on all
+			filesystems.
+
+	norelatime=     [FS] default to disabled/enabled relatime updates
+			on all filesystems.
+
 	nobats		[PPC] Do not use BATs for mapping kernel lowmem
 			on "Classic" PPC cores.
 
Index: linux/fs/Kconfig
===================================================================
--- linux.orig/fs/Kconfig
+++ linux/fs/Kconfig
@@ -2060,6 +2060,23 @@ config 9P_FS
 
 endmenu
 
+config DEFAULT_RELATIME
+	bool "Mount all filesystems with relatime by default"
+	default y
+	help
+	  If you say Y here, all your filesystems will be mounted
+	  with the "relatime" mount option. This eliminates many atime
+	  ('file last accessed' timestamp) updates (which otherwise
+	  is performed on every file access and generates a write
+	  IO to the inode) and thus speeds up IO. Atime is still updated,
+	  but only once per day.
+
+	  The mtime ('file last modified') and ctime ('file created')
+	  timestamp are unaffected by this change.
+
+	  Use the "norelatime" kernel boot option to turn off this
+	  feature.
+
 if BLOCK
 menu "Partition Types"
 
Index: linux/fs/inode.c
===================================================================
--- linux.orig/fs/inode.c
+++ linux/fs/inode.c
@@ -1162,6 +1162,36 @@ sector_t bmap(struct inode * inode, sect
 }
 EXPORT_SYMBOL(bmap);
 
+/*
+ * With relative atime, only update atime if the
+ * previous atime is earlier than either the ctime or
+ * mtime.
+ */
+static int relatime_need_update(struct inode *inode, struct timespec now)
+{
+	/*
+	 * Is mtime younger than atime? If yes, update atime:
+	 */
+	if (timespec_compare(&inode->i_mtime, &inode->i_atime) >= 0)
+		return 1;
+	/*
+	 * Is ctime younger than atime? If yes, update atime:
+	 */
+	if (timespec_compare(&inode->i_ctime, &inode->i_atime) >= 0)
+		return 1;
+
+	/*
+	 * Is the previous atime value older than a day? If yes,
+	 * update atime:
+	 */
+	if ((long)(now.tv_sec - inode->i_atime.tv_sec) >= 24*60*60)
+		return 1;
+	/*
+	 * Good, we can skip the atime update:
+	 */
+	return 0;
+}
+
 /**
  *	touch_atime	-	update the access time
  *	@mnt: mount the inode is accessed on
@@ -1191,22 +1221,14 @@ void touch_atime(struct vfsmount *mnt, s
 			return;
 		if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode))
 			return;
-
-		if (mnt->mnt_flags & MNT_RELATIME) {
-			/*
-			 * With relative atime, only update atime if the
-			 * previous atime is earlier than either the ctime or
-			 * mtime.
-			 */
-			if (timespec_compare(&inode->i_mtime,
-						&inode->i_atime) < 0 &&
-			    timespec_compare(&inode->i_ctime,
-						&inode->i_atime) < 0)
+	}
+	now = current_fs_time(inode->i_sb);
+	if (mnt) {
+		if (mnt->mnt_flags & MNT_RELATIME)
+			if (!relatime_need_update(inode, now))
 				return;
-		}
 	}
 
-	now = current_fs_time(inode->i_sb);
 	if (timespec_equal(&inode->i_atime, &now))
 		return;
 
Index: linux/fs/namespace.c
===================================================================
--- linux.orig/fs/namespace.c
+++ linux/fs/namespace.c
@@ -1107,6 +1107,8 @@ int do_add_mount(struct vfsmount *newmnt
 		goto unlock;
 
 	newmnt->mnt_flags = mnt_flags;
+	WARN_ON_ONCE(newmnt->mnt_flags & MNT_RELATIME);
+
 	if ((err = graft_tree(newmnt, nd)))
 		goto unlock;
 
@@ -1362,6 +1364,60 @@ int copy_mount_options(const void __user
 }
 
 /*
+ * Allow users to disable (or enable) atime updates via a .config
+ * option or via the boot line, or via /proc/sys/fs/mount_with_relatime:
+ */
+int mount_with_relatime __read_mostly =
+#ifdef CONFIG_DEFAULT_RELATIME
+1
+#else
+0
+#endif
+;
+
+/*
+ * The "norelatime=", "atime=", "norelatime" and "relatime" boot parameters:
+ */
+static int toggle_relatime_updates(int val)
+{
+	mount_with_relatime = val;
+
+	printk("Relative atime updates are: %s\n", val ? "on" : "off");
+
+	return 1;
+}
+
+static int __init set_relatime_setup(char *str)
+{
+	int val;
+
+	get_option(&str, &val);
+	return toggle_relatime_updates(val);
+}
+__setup("relatime=", set_relatime_setup);
+
+static int __init set_norelatime_setup(char *str)
+{
+	int val;
+
+	get_option(&str, &val);
+	return toggle_relatime_updates(!val);
+}
+__setup("norelatime=", set_norelatime_setup);
+
+static int __init set_relatime(char *str)
+{
+	return toggle_relatime_updates(1);
+}
+__setup("relatime", set_relatime);
+
+static int __init set_norelatime(char *str)
+{
+	return toggle_relatime_updates(0);
+}
+__setup("norelatime", set_norelatime);
+
+/*
  * Flags is a 32-bit value that allows up to 31 non-fs dependent flags to
  * be given to the mount() call (ie: read-only, no-dev, no-suid etc).
  *
@@ -1409,6 +1465,11 @@ long do_mount(char *dev_name, char *dir_
 		mnt_flags |= MNT_NODIRATIME;
 	if (flags & MS_RELATIME)
 		mnt_flags |= MNT_RELATIME;
+	else if (mount_with_relatime &&
+				!(flags & (MNT_NOATIME | MNT_NODIRATIME))) {
+		mnt_flags |= MNT_RELATIME;
+		flags |= MS_RELATIME;
+	}
 
 	flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
 		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME);
Index: linux/include/linux/mount.h
===================================================================
--- linux.orig/include/linux/mount.h
+++ linux/include/linux/mount.h
@@ -103,5 +103,7 @@ extern void shrink_submounts(struct vfsm
 extern spinlock_t vfsmount_lock;
 extern dev_t name_to_dev_t(char *name);
 
+extern int mount_with_relatime;
+
 #endif
 #endif /* _LINUX_MOUNT_H */
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -30,6 +30,7 @@
 #include <linux/capability.h>
 #include <linux/smp_lock.h>
 #include <linux/fs.h>
+#include <linux/mount.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
 #include <linux/kobject.h>
@@ -1206,6 +1207,14 @@ static ctl_table fs_table[] = {
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
 	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "mount_with_relatime",
+		.data		= &mount_with_relatime,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 #if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
 	{
 		.ctl_name	= CTL_UNNUMBERED,

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 18:08                                     ` Ingo Molnar
@ 2007-08-05 19:11                                       ` Alan Cox
  2007-08-08 18:22                                       ` Bill Davidsen
  1 sibling, 0 replies; 188+ messages in thread
From: Alan Cox @ 2007-08-05 19:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: J??rn Engel, Jeff Garzik, Linus Torvalds, Peter Zijlstra,
	linux-mm, Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

On Sun, 5 Aug 2007 20:08:26 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> 
> > And you honestly think that putting it in Kconfig as well as allowing 
> > users to screw up horribly and creating incompatible defaults you
> 
> So far you've not offered one realistic scenario of "screw up horribly". 
> People have been using noatime for a long time and there are no horror 
> stories about that. _Which_ OSS HSM software relies on atime?

Whats this about "OSS". OSS or proprietary. And you've been given one
example already - tmpwatch. Although its more of a trash compactor than
HSM.

> > can't test for in a user space app where it matters is going to 
> > *change* this.
> 
> The patch i posted today adds /proc/sys/kernel/mount_with_atime. That 
> can be tested by user-space, if it truly cares about atime.

We have an existing API and ABI thank you. See man mount.

> > Do you really think anyone who said "noatime, compatibility, umm errr" 
> > is going to say "noatime, compatibility, but hey its in Kconfig lets 
> > do it". You argument doesn't hold up to minimal rational 
> > consideration. Posting to the distribution devel list with: "Its a 50% 
> > performance win, we need to fix these corner cases, here's a tmpwatch 
> > patch" is *exactly* what is needed to change it, and Kconfig options 
> > are irrelevant to that.
> 
> i did exactly that 6 months ago, check your email folders. I went by the 
> "process". But it doesnt really matter anymore, Ubuntu has done the step 

And your Kconfig argument is still not rational. A question I note you
chose not to answer. Anyway if Ubuntu has switched to noatime by default
(or relatime) and hasn't used a Kconfig line that proves my whole point -
we don't need one and its pointless to add so.

> we really have to ask ourselves whether the "process" is correct if 
> advantages to the user of this order of magnitude can be brushed aside 
> with simple "this breaks binary-only HSM" and "it's not standards 
> compliant" arguments.

Thats a discussion to have with your distribution development team. The
kernel provides the required facilities already. Open source means
everyone can do cool stuff as they see fit and natural selection will do
the rest.

Look I agree entirely with you that relatime, or noatime + minor package
patches is the right thing to do for FC8. I've also pointed out you can
build and release tuning packages for FC 7 and they'll make the
distribution. FC8 beta 1 approaches so now is the time to be talking to
the distribution people and to the ever kernel building Dave Jones about
it.

But none of this makes stupid Kconfig hacks the right answer.

Alan

^ permalink raw reply	[flat|nested] 188+ messages in thread

* [patch] implement smarter atime updates support
  2007-08-05 19:09                           ` Ingo Molnar
@ 2007-08-05 19:22                             ` Ingo Molnar
  2007-08-05 19:28                               ` [patch] implement smarter atime updates support, v2 Ingo Molnar
  2007-08-05 19:53                               ` [patch] implement smarter atime updates support Arjan van de Ven
  2007-08-05 19:29                             ` [PATCH 00/23] per device dirty throttling -v8 Alan Cox
  1 sibling, 2 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-05 19:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jakob Oestergaard, Jeff Garzik, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, nikita, trond.myklebust,
	yingchao.zhou, richard, david


* Ingo Molnar <mingo@elte.hu> wrote:

> tested it by moving the date forward:
> 
>   # date
>   Sun Aug  5 22:55:14 CEST 2007
>   # date -s "Tue Aug  7 22:55:14 CEST 2007"
>   Tue Aug  7 22:55:14 CEST 2007
> 
> access to a file did not generate disk IO before the date was set, and 
> it generated exactly one IO after the date was set.
> 
> ( should i perhaps reduce the number of boot options and only use a
>   single "norelatime_default" boot option to turn this off? )

ok, cleaned it up some more: only a single, consistent boot option and 
all the switches (be that config, boot or sysctl) are now called 
"default_relatime". Also, got rid of that #ifdef ugliness in namespace.c 
via a cleaner Kconfig solution (suggested by Peter Zijlstra).

	Ingo

---------------------------->
Subject: [patch] implement smarter atime updates support
From: Ingo Molnar <mingo@elte.hu>

change relatime updates to be performed once per day. This makes
relatime a compatible solution for HSM, mailer-notification and
tmpwatch applications too.

also add the CONFIG_DEFAULT_RELATIME kernel option, which makes
"norelatime" the default for all mounts without an extra kernel
boot option.

add the "default_relatime=0" boot option to turn this off.

also add the /proc/sys/kernel/default_relatime flag which can be changed
runtime to modify the behavior of subsequent new mounts.

tested by moving the date forward:

   # date
   Sun Aug  5 22:55:14 CEST 2007
   # date -s "Tue Aug  7 22:55:14 CEST 2007"
   Tue Aug  7 22:55:14 CEST 2007

access to a file did not generate disk IO before the date was set, and
it generated exactly one IO after the date was set.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 Documentation/kernel-parameters.txt |    4 +++
 fs/Kconfig                          |   22 ++++++++++++++++
 fs/inode.c                          |   48 ++++++++++++++++++++++++++----------
 fs/namespace.c                      |   25 ++++++++++++++++++
 include/linux/mount.h               |    2 +
 kernel/sysctl.c                     |    9 ++++++
 6 files changed, 97 insertions(+), 13 deletions(-)

Index: linux/Documentation/kernel-parameters.txt
===================================================================
--- linux.orig/Documentation/kernel-parameters.txt
+++ linux/Documentation/kernel-parameters.txt
@@ -525,6 +525,10 @@ and is between 256 and 4096 characters. 
 			This is a 16-member array composed of values
 			ranging from 0-255.
 
+	default_relatime=
+			[FS] mount all filesystems with relative atime
+			updates by default.
+
 	default_utf8=   [VT]
 			Format=<0|1>
 			Set system-wide default UTF-8 mode for all tty's.
Index: linux/fs/Kconfig
===================================================================
--- linux.orig/fs/Kconfig
+++ linux/fs/Kconfig
@@ -2060,6 +2060,28 @@ config 9P_FS
 
 endmenu
 
+config DEFAULT_RELATIME
+	bool "Mount all filesystems with relatime by default"
+	default y
+	help
+	  If you say Y here, all your filesystems will be mounted
+	  with the "relatime" mount option. This eliminates many atime
+	  ('file last accessed' timestamp) updates (which otherwise
+	  is performed on every file access and generates a write
+	  IO to the inode) and thus speeds up IO. Atime is still updated,
+	  but only once per day.
+
+	  The mtime ('file last modified') and ctime ('file created')
+	  timestamp are unaffected by this change.
+
+	  Use the "norelatime" kernel boot option to turn off this
+	  feature.
+
+config DEFAULT_RELATIME_VAL
+	int
+	default "1" if DEFAULT_RELATIME
+	default "0"
+
 if BLOCK
 menu "Partition Types"
 
Index: linux/fs/inode.c
===================================================================
--- linux.orig/fs/inode.c
+++ linux/fs/inode.c
@@ -1162,6 +1162,36 @@ sector_t bmap(struct inode * inode, sect
 }
 EXPORT_SYMBOL(bmap);
 
+/*
+ * With relative atime, only update atime if the
+ * previous atime is earlier than either the ctime or
+ * mtime.
+ */
+static int relatime_need_update(struct inode *inode, struct timespec now)
+{
+	/*
+	 * Is mtime younger than atime? If yes, update atime:
+	 */
+	if (timespec_compare(&inode->i_mtime, &inode->i_atime) >= 0)
+		return 1;
+	/*
+	 * Is ctime younger than atime? If yes, update atime:
+	 */
+	if (timespec_compare(&inode->i_ctime, &inode->i_atime) >= 0)
+		return 1;
+
+	/*
+	 * Is the previous atime value older than a day? If yes,
+	 * update atime:
+	 */
+	if ((long)(now.tv_sec - inode->i_atime.tv_sec) >= 24*60*60)
+		return 1;
+	/*
+	 * Good, we can skip the atime update:
+	 */
+	return 0;
+}
+
 /**
  *	touch_atime	-	update the access time
  *	@mnt: mount the inode is accessed on
@@ -1191,22 +1221,14 @@ void touch_atime(struct vfsmount *mnt, s
 			return;
 		if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode))
 			return;
-
-		if (mnt->mnt_flags & MNT_RELATIME) {
-			/*
-			 * With relative atime, only update atime if the
-			 * previous atime is earlier than either the ctime or
-			 * mtime.
-			 */
-			if (timespec_compare(&inode->i_mtime,
-						&inode->i_atime) < 0 &&
-			    timespec_compare(&inode->i_ctime,
-						&inode->i_atime) < 0)
+	}
+	now = current_fs_time(inode->i_sb);
+	if (mnt) {
+		if (mnt->mnt_flags & MNT_RELATIME)
+			if (!relatime_need_update(inode, now))
 				return;
-		}
 	}
 
-	now = current_fs_time(inode->i_sb);
 	if (timespec_equal(&inode->i_atime, &now))
 		return;
 
Index: linux/fs/namespace.c
===================================================================
--- linux.orig/fs/namespace.c
+++ linux/fs/namespace.c
@@ -1107,6 +1107,8 @@ int do_add_mount(struct vfsmount *newmnt
 		goto unlock;
 
 	newmnt->mnt_flags = mnt_flags;
+	WARN_ON_ONCE(newmnt->mnt_flags & MNT_RELATIME);
+
 	if ((err = graft_tree(newmnt, nd)))
 		goto unlock;
 
@@ -1362,6 +1364,24 @@ int copy_mount_options(const void __user
 }
 
 /*
+ * Allow users to disable (or enable) atime updates via a .config
+ * option or via the boot line, or via /proc/sys/fs/default_relatime:
+ */
+int default_relatime __read_mostly = CONFIG_DEFAULT_RELATIME_VAL;
+
+static int __init set_default_relatime(char *str)
+{
+	get_option(&str, &default_relatime);
+
+	printk(KERN_INFO "Mount all filesystems with"
+		"default relative atime updates: %s.\n",
+		default_relatime ? "enabled" : "disabled");
+
+	return 1;
+}
+__setup("default_relatime=", set_default_relatime);
+
+/*
  * Flags is a 32-bit value that allows up to 31 non-fs dependent flags to
  * be given to the mount() call (ie: read-only, no-dev, no-suid etc).
  *
@@ -1409,6 +1429,11 @@ long do_mount(char *dev_name, char *dir_
 		mnt_flags |= MNT_NODIRATIME;
 	if (flags & MS_RELATIME)
 		mnt_flags |= MNT_RELATIME;
+	else if (default_relatime &&
+				!(flags & (MNT_NOATIME | MNT_NODIRATIME))) {
+		mnt_flags |= MNT_RELATIME;
+		flags |= MS_RELATIME;
+	}
 
 	flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
 		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME);
Index: linux/include/linux/mount.h
===================================================================
--- linux.orig/include/linux/mount.h
+++ linux/include/linux/mount.h
@@ -103,5 +103,7 @@ extern void shrink_submounts(struct vfsm
 extern spinlock_t vfsmount_lock;
 extern dev_t name_to_dev_t(char *name);
 
+extern int default_relatime;
+
 #endif
 #endif /* _LINUX_MOUNT_H */
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -30,6 +30,7 @@
 #include <linux/capability.h>
 #include <linux/smp_lock.h>
 #include <linux/fs.h>
+#include <linux/mount.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
 #include <linux/kobject.h>
@@ -1206,6 +1207,14 @@ static ctl_table fs_table[] = {
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
 	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "default_relatime",
+		.data		= &default_relatime,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 #if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
 	{
 		.ctl_name	= CTL_UNNUMBERED,

^ permalink raw reply	[flat|nested] 188+ messages in thread

* [patch] implement smarter atime updates support, v2
  2007-08-05 19:22                             ` [patch] implement smarter atime updates support Ingo Molnar
@ 2007-08-05 19:28                               ` Ingo Molnar
  2007-08-05 20:42                                 ` Theodore Tso
  2007-08-05 19:53                               ` [patch] implement smarter atime updates support Arjan van de Ven
  1 sibling, 1 reply; 188+ messages in thread
From: Ingo Molnar @ 2007-08-05 19:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jakob Oestergaard, Jeff Garzik, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, nikita, trond.myklebust,
	yingchao.zhou, richard, david


new version:

added the relatime_interval sysctl that allows the changing of the atime 
update frequency. (default: 1 day / 86400 seconds)

	Ingo

-------------------------->
Subject: [patch] [patch] implement smarter atime updates support
From: Ingo Molnar <mingo@elte.hu>

change relatime updates to be performed once per day. This makes
relatime a compatible solution for HSM, mailer-notification and
tmpwatch applications too.

also add the CONFIG_DEFAULT_RELATIME kernel option, which makes
"norelatime" the default for all mounts without an extra kernel
boot option.

add the "default_relatime=0" boot option to turn this off.

also add the /proc/sys/kernel/default_relatime flag which can be changed
runtime to modify the behavior of subsequent new mounts.

tested by moving the date forward:

   # date
   Sun Aug  5 22:55:14 CEST 2007
   # date -s "Tue Aug  7 22:55:14 CEST 2007"
   Tue Aug  7 22:55:14 CEST 2007

access to a file did not generate disk IO before the date was set, and
it generated exactly one IO after the date was set.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 Documentation/kernel-parameters.txt |    8 +++++
 fs/Kconfig                          |   22 ++++++++++++++
 fs/inode.c                          |   53 +++++++++++++++++++++++++++---------
 fs/namespace.c                      |   24 ++++++++++++++++
 include/linux/mount.h               |    3 ++
 kernel/sysctl.c                     |   17 +++++++++++
 6 files changed, 114 insertions(+), 13 deletions(-)

Index: linux/Documentation/kernel-parameters.txt
===================================================================
--- linux.orig/Documentation/kernel-parameters.txt
+++ linux/Documentation/kernel-parameters.txt
@@ -525,6 +525,10 @@ and is between 256 and 4096 characters. 
 			This is a 16-member array composed of values
 			ranging from 0-255.
 
+	default_relatime=
+			[FS] mount all filesystems with relative atime
+			updates by default.
+
 	default_utf8=   [VT]
 			Format=<0|1>
 			Set system-wide default UTF-8 mode for all tty's.
@@ -1468,6 +1472,10 @@ and is between 256 and 4096 characters. 
 			Format: <reboot_mode>[,<reboot_mode2>[,...]]
 			See arch/*/kernel/reboot.c or arch/*/kernel/process.c			
 
+	relatime_interval=
+			[FS] relative atime update frequency, in seconds.
+			(default: 1 day: 86400 seconds)
+
 	reserve=	[KNL,BUGS] Force the kernel to ignore some iomem area
 
 	reservetop=	[X86-32]
Index: linux/fs/Kconfig
===================================================================
--- linux.orig/fs/Kconfig
+++ linux/fs/Kconfig
@@ -2060,6 +2060,28 @@ config 9P_FS
 
 endmenu
 
+config DEFAULT_RELATIME
+	bool "Mount all filesystems with relatime by default"
+	default y
+	help
+	  If you say Y here, all your filesystems will be mounted
+	  with the "relatime" mount option. This eliminates many atime
+	  ('file last accessed' timestamp) updates (which otherwise
+	  is performed on every file access and generates a write
+	  IO to the inode) and thus speeds up IO. Atime is still updated,
+	  but only once per day.
+
+	  The mtime ('file last modified') and ctime ('file created')
+	  timestamp are unaffected by this change.
+
+	  Use the "norelatime" kernel boot option to turn off this
+	  feature.
+
+config DEFAULT_RELATIME_VAL
+	int
+	default "1" if DEFAULT_RELATIME
+	default "0"
+
 if BLOCK
 menu "Partition Types"
 
Index: linux/fs/inode.c
===================================================================
--- linux.orig/fs/inode.c
+++ linux/fs/inode.c
@@ -1162,6 +1162,41 @@ sector_t bmap(struct inode * inode, sect
 }
 EXPORT_SYMBOL(bmap);
 
+/*
+ * Relative atime updates frequency (default: 1 day):
+ */
+int relatime_interval __read_mostly = 24*60*60;
+
+/*
+ * With relative atime, only update atime if the
+ * previous atime is earlier than either the ctime or
+ * mtime.
+ */
+static int relatime_need_update(struct inode *inode, struct timespec now)
+{
+	/*
+	 * Is mtime younger than atime? If yes, update atime:
+	 */
+	if (timespec_compare(&inode->i_mtime, &inode->i_atime) >= 0)
+		return 1;
+	/*
+	 * Is ctime younger than atime? If yes, update atime:
+	 */
+	if (timespec_compare(&inode->i_ctime, &inode->i_atime) >= 0)
+		return 1;
+
+	/*
+	 * Is the previous atime value older than a day? If yes,
+	 * update atime:
+	 */
+	if ((long)(now.tv_sec - inode->i_atime.tv_sec) >= relatime_interval)
+		return 1;
+	/*
+	 * Good, we can skip the atime update:
+	 */
+	return 0;
+}
+
 /**
  *	touch_atime	-	update the access time
  *	@mnt: mount the inode is accessed on
@@ -1191,22 +1226,14 @@ void touch_atime(struct vfsmount *mnt, s
 			return;
 		if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode))
 			return;
-
-		if (mnt->mnt_flags & MNT_RELATIME) {
-			/*
-			 * With relative atime, only update atime if the
-			 * previous atime is earlier than either the ctime or
-			 * mtime.
-			 */
-			if (timespec_compare(&inode->i_mtime,
-						&inode->i_atime) < 0 &&
-			    timespec_compare(&inode->i_ctime,
-						&inode->i_atime) < 0)
+	}
+	now = current_fs_time(inode->i_sb);
+	if (mnt) {
+		if (mnt->mnt_flags & MNT_RELATIME)
+			if (!relatime_need_update(inode, now))
 				return;
-		}
 	}
 
-	now = current_fs_time(inode->i_sb);
 	if (timespec_equal(&inode->i_atime, &now))
 		return;
 
Index: linux/fs/namespace.c
===================================================================
--- linux.orig/fs/namespace.c
+++ linux/fs/namespace.c
@@ -1107,6 +1107,7 @@ int do_add_mount(struct vfsmount *newmnt
 		goto unlock;
 
 	newmnt->mnt_flags = mnt_flags;
+
 	if ((err = graft_tree(newmnt, nd)))
 		goto unlock;
 
@@ -1362,6 +1363,24 @@ int copy_mount_options(const void __user
 }
 
 /*
+ * Allow users to disable (or enable) atime updates via a .config
+ * option or via the boot line, or via /proc/sys/fs/default_relatime:
+ */
+int default_relatime __read_mostly = CONFIG_DEFAULT_RELATIME_VAL;
+
+static int __init set_default_relatime(char *str)
+{
+	get_option(&str, &default_relatime);
+
+	printk(KERN_INFO "Mount all filesystems with"
+		"default relative atime updates: %s.\n",
+		default_relatime ? "enabled" : "disabled");
+
+	return 1;
+}
+__setup("default_relatime=", set_default_relatime);
+
+/*
  * Flags is a 32-bit value that allows up to 31 non-fs dependent flags to
  * be given to the mount() call (ie: read-only, no-dev, no-suid etc).
  *
@@ -1409,6 +1428,11 @@ long do_mount(char *dev_name, char *dir_
 		mnt_flags |= MNT_NODIRATIME;
 	if (flags & MS_RELATIME)
 		mnt_flags |= MNT_RELATIME;
+	else if (default_relatime &&
+				!(flags & (MNT_NOATIME | MNT_NODIRATIME))) {
+		mnt_flags |= MNT_RELATIME;
+		flags |= MS_RELATIME;
+	}
 
 	flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
 		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME);
Index: linux/include/linux/mount.h
===================================================================
--- linux.orig/include/linux/mount.h
+++ linux/include/linux/mount.h
@@ -103,5 +103,8 @@ extern void shrink_submounts(struct vfsm
 extern spinlock_t vfsmount_lock;
 extern dev_t name_to_dev_t(char *name);
 
+extern int default_relatime;
+extern int relatime_interval;
+
 #endif
 #endif /* _LINUX_MOUNT_H */
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -30,6 +30,7 @@
 #include <linux/capability.h>
 #include <linux/smp_lock.h>
 #include <linux/fs.h>
+#include <linux/mount.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
 #include <linux/kobject.h>
@@ -1206,6 +1207,22 @@ static ctl_table fs_table[] = {
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
 	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "default_relatime",
+		.data		= &default_relatime,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "relatime_interval",
+		.data		= &relatime_interval,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 #if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
 	{
 		.ctl_name	= CTL_UNNUMBERED,

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 19:09                           ` Ingo Molnar
  2007-08-05 19:22                             ` [patch] implement smarter atime updates support Ingo Molnar
@ 2007-08-05 19:29                             ` Alan Cox
  2007-08-05 19:32                               ` Ingo Molnar
  1 sibling, 1 reply; 188+ messages in thread
From: Alan Cox @ 2007-08-05 19:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Jakob Oestergaard, Jeff Garzik, miklos, akpm,
	neilb, dgc, tomoki.sekiyama.qu, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, nikita, trond.myklebust,
	yingchao.zhou, richard, david

> change relatime updates to be performed once per day. This makes
> relatime a compatible solution for HSM, mailer-notification and
> tmpwatch applications too.

Sweet
> 

> also add the CONFIG_DEFAULT_RELATIME kernel option, which makes
> "norelatime" the default for all mounts without an extra kernel
> boot option.

Should be a mount option.


> +	relatime        [FS] default to enabled relatime updates on all
> +			filesystems.
> +
> +	relatime=       [FS] default to enabled/disabled relatime updates on
> +			all filesystems.
> +

Double patch

>  	atkbd.extra=	[HW] Enable extra LEDs and keys on IBM RapidAccess,
>  			EzKey and similar keyboards
>  
> @@ -1100,6 +1106,12 @@ and is between 256 and 4096 characters. 
>  	noasync		[HW,M68K] Disables async and sync negotiation for
>  			all devices.
>  
> +	norelatime      [FS] default to disabled relatime updates on all
> +			filesystems.
> +
> +	norelatime=     [FS] default to disabled/enabled relatime updates
> +			on all filesystems.
> +

Double patch

> +config DEFAULT_RELATIME
> +	bool "Mount all filesystems with relatime by default"
> +	default y

Changes behaviour so probably should default n. Better yet it should be
the mount option so its flexible and strongly encouraged for vendors.

>  /*
> + * Allow users to disable (or enable) atime updates via a .config
> + * option or via the boot line, or via /proc/sys/fs/mount_with_relatime:
> + */
> +int mount_with_relatime __read_mostly =
> +#ifdef CONFIG_DEFAULT_RELATIME
> +1
> +#else
> +0
> +#endif
> +;

This ifdef mess would go away for a mount option

> +/*
> + * The "norelatime=", "atime=", "norelatime" and "relatime" boot parameters:
> + */
> +static int toggle_relatime_updates(int val)
> +{
> +	mount_with_relatime = val;
> +
> +	printk("Relative atime updates are: %s\n", val ? "on" : "off");
> +
> +	return 1;
> +}
> +
> +static int __init set_relatime_setup(char *str)
> +{
> +	int val;
> +
> +	get_option(&str, &val);
> +	return toggle_relatime_updates(val);
> +}
> +__setup("relatime=", set_relatime_setup);
> +
> +static int __init set_norelatime_setup(char *str)
> +{
> +	int val;
> +
> +	get_option(&str, &val);
> +	return toggle_relatime_updates(!val);
> +}
> +__setup("norelatime=", set_norelatime_setup);
> +
> +static int __init set_relatime(char *str)
> +{
> +	return toggle_relatime_updates(1);
> +}
> +__setup("relatime", set_relatime);
> +
> +static int __init set_norelatime(char *str)
> +{
> +	return toggle_relatime_updates(0);
> +}
> +__setup("norelatime", set_norelatime);


All the above chunk is unneccessary as it can be a mount option. That
avoids tons of messy extra code and complication. Users are far safer
editing fstab than grub.conf.

> +	{
> +		.ctl_name	= CTL_UNNUMBERED,
> +		.procname	= "mount_with_relatime",
> +		.data		= &mount_with_relatime,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= &proc_dointvec,
> +	},

More code you don't need if you just leave it as a mount option.

I'd much rather see the small clean patch for this as a mount option.
Leave the rest to users/distros/lwn and it'll just happen now you've
sorted the compabitility problems.

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 19:29                             ` [PATCH 00/23] per device dirty throttling -v8 Alan Cox
@ 2007-08-05 19:32                               ` Ingo Molnar
  0 siblings, 0 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-05 19:32 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Jakob Oestergaard, Jeff Garzik, miklos, akpm,
	neilb, dgc, tomoki.sekiyama.qu, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, nikita, trond.myklebust,
	yingchao.zhou, richard, david


* Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> > also add the CONFIG_DEFAULT_RELATIME kernel option, which makes 
> > "norelatime" the default for all mounts without an extra kernel boot 
> > option.
> 
> Should be a mount option.

it is already a mount option too.

> > +	relatime        [FS] default to enabled relatime updates on all
> > +			filesystems.
> > +
> > +	relatime=       [FS] default to enabled/disabled relatime updates on
> > +			all filesystems.
> > +
> 
> Double patch

no - it was not a double patch, i made all the common variants valid 
boot options: "relatime", "relatime=0/1", "norelatime" and 
"norelatime=0/1". Anyway, this is mooth, in the latest (v2) version 
there's only a single boot parameter.

> > +config DEFAULT_RELATIME
> > +	bool "Mount all filesystems with relatime by default"
> > +	default y
> 
> Changes behaviour so probably should default n. Better yet it should 
> be the mount option so its flexible and strongly encouraged for 
> vendors.

relatime is a mount option already. And distros can disable it if they 
want. (they are conscious about their kernel config selections anyway.)

> > +0
> > +#endif
> > +;
> 
> This ifdef mess would go away for a mount option

i fixed that in v2.

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [patch] implement smarter atime updates support
  2007-08-05 19:22                             ` [patch] implement smarter atime updates support Ingo Molnar
  2007-08-05 19:28                               ` [patch] implement smarter atime updates support, v2 Ingo Molnar
@ 2007-08-05 19:53                               ` Arjan van de Ven
  2007-08-05 20:04                                 ` Alan Cox
  1 sibling, 1 reply; 188+ messages in thread
From: Arjan van de Ven @ 2007-08-05 19:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Jakob Oestergaard, Jeff Garzik, miklos, akpm,
	neilb, dgc, tomoki.sekiyama.qu, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, nikita, trond.myklebust,
	yingchao.zhou, richard, david


> +static int relatime_need_update(struct inode *inode, struct timespec now)
> +{
> +	/*
> +	 * Is mtime younger than atime? If yes, update atime:
> +	 */
> +	if (timespec_compare(&inode->i_mtime, &inode->i_atime) >= 0)
> +		return 1;
> +	/*
> +	 * Is ctime younger than atime? If yes, update atime:
> +	 */
> +	if (timespec_compare(&inode->i_ctime, &inode->i_atime) >= 0)
> +		return 1;
> +
> +	/*
> +	 * Is the previous atime value older than a day? If yes,
> +	 * update atime:
> +	 */
> +	if ((long)(now.tv_sec - inode->i_atime.tv_sec) >= 24*60*60)
> +		return 1;


you might want to add

	/* 
	 * if the inode is dirty already, do the atime update since
	 * we'll be doing the disk IO anyway to clean the inode.
	 */
	if (inode->i_state & I_DIRTY)
		return 1;



^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [patch] implement smarter atime updates support
  2007-08-05 19:53                               ` [patch] implement smarter atime updates support Arjan van de Ven
@ 2007-08-05 20:04                                 ` Alan Cox
  2007-08-05 20:22                                   ` Arjan van de Ven
  0 siblings, 1 reply; 188+ messages in thread
From: Alan Cox @ 2007-08-05 20:04 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, Linus Torvalds, Jakob Oestergaard, Jeff Garzik,
	miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, Peter Zijlstra,
	linux-mm, Linux Kernel Mailing List, nikita, trond.myklebust,
	yingchao.zhou, richard, david

O> you might want to add
> 
> 	/* 
> 	 * if the inode is dirty already, do the atime update since
> 	 * we'll be doing the disk IO anyway to clean the inode.
> 	 */
> 	if (inode->i_state & I_DIRTY)
> 		return 1;

This makes the actual result somewhat less predictable. Is that wise ?
Right now its clear what happens based on what user sequence of events
and that this is easily repeatable.

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 18:37                                     ` Jörn Engel
@ 2007-08-05 20:21                                       ` Jörn Engel
  2007-08-05 20:33                                         ` Andrew Morton
  0 siblings, 1 reply; 188+ messages in thread
From: Jörn Engel @ 2007-08-05 20:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, Willy Tarreau, Jörn Engel, Alan Cox,
	Jeff Garzik, Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

On Sun, 5 August 2007 20:37:14 +0200, Jörn Engel wrote:
> 
> Guess I should throw in a kernel compile test as well, just to get a
> feel for the performance.

Three runs each of noatime, relatime and atime, both with cold caches
and with warm caches.  Scripts below.  Run on a Thinkpad T40, 1.5GHz,
2GiB RAM, 60GB 2.5" IDE disk, ext3.

Biggest difference between atime and noatime (median run, cold cache) is
~2.3%, nowhere near the numbers claimed by Ingo.  Ingo, how did you
measure 10% and more?

noatime, cold cache	relatime, cold cache	atime, cold cache
	                	                
real    2m10.242s	real    2m10.549s	real    2m10.388s
user    1m46.886s	user    1m46.680s	user    1m47.000s
sys     0m8.243s	sys     0m8.423s	sys     0m8.239s
	                	                
real    2m11.270s	real    2m11.212s	real    2m14.280s
user    1m46.940s	user    1m46.776s	user    1m46.670s
sys     0m8.139s	sys     0m8.283s	sys     0m8.503s
	                	                
real    2m11.601s	real    2m14.861s	real    2m14.335s
user    1m46.920s	user    1m47.103s	user    1m46.846s
sys     0m8.246s	sys     0m8.266s	sys     0m8.349s
	                	                
noatime, warm cache	relatime, warm cache	atime, warm cache
	                	                
real    1m55.894s	real    1m56.053s	real    1m56.905s
user    1m46.683s	user    1m46.600s	user    1m46.853s
sys     0m8.186s	sys     0m8.349s	sys     0m8.249s
	                	                
real    1m55.823s	real    1m56.093s	real    1m57.077s
user    1m46.583s	user    1m46.913s	user    1m46.590s
sys     0m8.259s	sys     0m7.966s	sys     0m8.523s
	                	                
real    1m55.789s	real    1m56.214s	real    1m57.224s
user    1m46.803s	user    1m46.753s	user    1m46.953s
sys     0m8.053s	sys     0m8.113s	sys     0m8.113s

Jörn

-- 
Data expands to fill the space available for storage.
-- Parkinson's Law

Cold cache script:
#!/bin/sh
make distclean
echo 1 > /proc/sys/vm/drop_caches
echo 2 > /proc/sys/vm/drop_caches
echo 3 > /proc/sys/vm/drop_caches
make allnoconfig
time make

Warm cache script:
#!/bin/sh
make distclean
make allnoconfig
rgrep laksdflkdsaflkadsfja .
time make

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [patch] implement smarter atime updates support
  2007-08-05 20:04                                 ` Alan Cox
@ 2007-08-05 20:22                                   ` Arjan van de Ven
  0 siblings, 0 replies; 188+ messages in thread
From: Arjan van de Ven @ 2007-08-05 20:22 UTC (permalink / raw)
  To: Alan Cox
  Cc: Ingo Molnar, Linus Torvalds, Jakob Oestergaard, Jeff Garzik,
	miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, Peter Zijlstra,
	linux-mm, Linux Kernel Mailing List, nikita, trond.myklebust,
	yingchao.zhou, richard, david

On Sun, 2007-08-05 at 21:04 +0100, Alan Cox wrote:
> O> you might want to add
> > 
> > 	/* 
> > 	 * if the inode is dirty already, do the atime update since
> > 	 * we'll be doing the disk IO anyway to clean the inode.
> > 	 */
> > 	if (inode->i_state & I_DIRTY)
> > 		return 1;
> 
> This makes the actual result somewhat less predictable. Is that wise ?
> Right now its clear what happens based on what user sequence of events
> and that this is easily repeatable.

I can see the repeatability argument; on the flipside, having a system
of "opportunistic atime", eg as good as you can go cheaply, but with
minimum guarantees has some attraction as well. For example one could
imagine a system where the inode gets it's atime updated anyway, just
not flagged for writing back to disk. If it later undergoes some event
that would cause it to go to disk, it gets preserved...

otoh that's even more unpredictable since VM pressure could drop this
update early.

For the dirty case, such drawbacks don't exist; it's just one more step
of "when we can cheaply".

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 20:21                                       ` Jörn Engel
@ 2007-08-05 20:33                                         ` Andrew Morton
  0 siblings, 0 replies; 188+ messages in thread
From: Andrew Morton @ 2007-08-05 20:33 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Ingo Molnar, Arjan van de Ven, Willy Tarreau, Alan Cox,
	Jeff Garzik, Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

On Sun, 5 Aug 2007 22:21:12 +0200 Jörn Engel <joern@logfs.org> wrote:

> On Sun, 5 August 2007 20:37:14 +0200, Jörn Engel wrote:
> > 
> > Guess I should throw in a kernel compile test as well, just to get a
> > feel for the performance.
> 
> Three runs each of noatime, relatime and atime, both with cold caches
> and with warm caches.  Scripts below.  Run on a Thinkpad T40, 1.5GHz,
> 2GiB RAM, 60GB 2.5" IDE disk, ext3.
> 
> Biggest difference between atime and noatime (median run, cold cache) is
> ~2.3%, nowhere near the numbers claimed by Ingo.  Ingo, how did you
> measure 10% and more?

Ingo had CONFIG_DEBUG_INFO=y, which generates heaps more writeout,
but no additional atime updates.

Ingo had a faster computer ;)  That will generate many more MB/sec
write traffic, so the cost of those atime seeks becomes proportionally
higher.  Basically: you're CPU-limited, Ingo is seek-limited.

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 18:01                             ` Arjan van de Ven
@ 2007-08-05 20:34                               ` Christoph Hellwig
  0 siblings, 0 replies; 188+ messages in thread
From: Christoph Hellwig @ 2007-08-05 20:34 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Theodore Tso, Linus Torvalds, J?rn Engel, Ingo Molnar,
	Jeff Garzik, Peter Zijlstra, linux-mm, Linux Kernel Mailing List,
	miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, nikita,
	trond.myklebust, yingchao.zhou, richard, david

On Sun, Aug 05, 2007 at 11:01:18AM -0700, Arjan van de Ven wrote:
> 
> on the journalling side this would be one transaction (not 5 milion)
> and... since inodes are grouped on disk, you can even get some better
> coalescing this way... 
> 
> Wonder if we could do inode-grouping smartly; eg if we HAVE to write
> inode X, also write out the atime-dirty inodes in range X-Y to X+Y
> (where Y is some tunable) in the same IO..

We already have filesystems in the tree that do such advances things as
inode writeback clustering for more than ten years :)

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 19:42                       ` Jörn Engel
@ 2007-08-05 20:36                         ` Christoph Hellwig
  2007-08-06 18:03                           ` Chuck Ebbert
                                             ` (2 more replies)
  0 siblings, 3 replies; 188+ messages in thread
From: Christoph Hellwig @ 2007-08-05 20:36 UTC (permalink / raw)
  To: J??rn Engel
  Cc: Ingo Molnar, Jeff Garzik, Linus Torvalds, Peter Zijlstra,
	linux-mm, Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

On Sat, Aug 04, 2007 at 09:42:59PM +0200, J??rn Engel wrote:
> On Sat, 4 August 2007 21:26:15 +0200, J??rn Engel wrote:
> > 
> > Given the choice between only "atime" and "noatime" I'd agree with you.
> > Heck, I use it myself.  But "relatime" seems to combine the best of both
> > worlds.  It currently just suffers from mount not supporting it in any
> > relevant distro.
> 
> And here is a completely untested patch to enable it by default.  Ingo,
> can you see how good this fares compared to "atime" and
> "noatime,nodiratime"?

Umm, no f**king way.  atime selection is 100% policy and belongs into
userspace.  Add to that the problem that we can't actually re-enable
atimes because of the way the vfs-level mount flags API is designed.
Instead of doing such a fugly kernel patch just talk to the handfull
of distributions that matter to update their defaults.


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05  0:26             ` Andi Kleen
  2007-08-05 15:00               ` Theodore Tso
@ 2007-08-05 20:41               ` Christoph Hellwig
  2007-08-06 10:42                 ` Andi Kleen
  2007-08-16 10:18               ` Helge Hafting
  2 siblings, 1 reply; 188+ messages in thread
From: Christoph Hellwig @ 2007-08-05 20:41 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra, linux-mm,
	linux-kernel, miklos, akpm, neilb, dgc, tomoki.sekiyama.qu,
	nikita, trond.myklebust, yingchao.zhou, richard

On Sun, Aug 05, 2007 at 02:26:53AM +0200, Andi Kleen wrote:
> I always thought the right solution would be to just sync atime only
> very very lazily. This means if a inode is only dirty because of an
> atime update put it on a "only write out when there is nothing to do
> or the memory is really needed" list.

Which is the policy I implemented for XFS a while ago.


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [patch] implement smarter atime updates support, v2
  2007-08-05 19:28                               ` [patch] implement smarter atime updates support, v2 Ingo Molnar
@ 2007-08-05 20:42                                 ` Theodore Tso
  2007-08-06  5:36                                   ` Ingo Molnar
  0 siblings, 1 reply; 188+ messages in thread
From: Theodore Tso @ 2007-08-05 20:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Jakob Oestergaard, Jeff Garzik, miklos, akpm,
	neilb, dgc, tomoki.sekiyama.qu, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, nikita, trond.myklebust,
	yingchao.zhou, richard, david

On Sun, Aug 05, 2007 at 09:28:38PM +0200, Ingo Molnar wrote:
> 
> added the relatime_interval sysctl that allows the changing of the atime 
> update frequency. (default: 1 day / 86400 seconds)

What if you specify the interval as a per-mount option?  i.e., 

	mount -o relatime=86400 /dev/sda2 /u1

If you had this, I don't think we would need the sysctl tuning parameter.

							- Ted

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05  7:57                 ` Florian Weimer
@ 2007-08-05 20:43                   ` Christoph Hellwig
  0 siblings, 0 replies; 188+ messages in thread
From: Christoph Hellwig @ 2007-08-05 20:43 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Andrew Morton, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	linux-mm, linux-kernel, miklos, neilb, dgc, tomoki.sekiyama.qu,
	nikita, trond.myklebust, yingchao.zhou, richard

On Sun, Aug 05, 2007 at 09:57:02AM +0200, Florian Weimer wrote:
> For instance, some editors don't perform fsync-then-rename, but simply
> truncate the file when saving (because they want to preserve hard
> links).  With XFS, this tends to cause null bytes on crashes.  Since
> ext3 has got a much larger install base, this would result in lots of
> bug reports, I fear.

XFS has recently been changed to only updated the on-disk i_size after
data writeback has finished to get rid of this irritation.


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 17:22     ` Brice Figureau
@ 2007-08-05 22:17       ` Andi Kleen
  2007-08-06  8:40         ` Brice Figureau
  0 siblings, 1 reply; 188+ messages in thread
From: Andi Kleen @ 2007-08-05 22:17 UTC (permalink / raw)
  To: Brice Figureau; +Cc: linux-kernel

Brice Figureau <brice+lklm@daysofwonder.com> writes:
> 
>  2) I _still_ don't get the "performances" of 2.6.17, but since that's the
> better combination I could get, I think there is IMHO progress in the right
> direction (to be compared to no progress since 2.6.18, that's better :-)).

If you could characterize your workload well (e.g. how many disks,
what file systems, what load on mysql) perhaps it would be possible
to reproduce the problem with a test program or a mysql driver.
Then it could be bisected.

-Andi

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 19:16             ` Florian Weimer
  2007-08-05  6:00               ` Andrew Morton
@ 2007-08-05 22:46               ` Theodore Tso
  2007-08-06  0:24               ` David Chinner
  2 siblings, 0 replies; 188+ messages in thread
From: Theodore Tso @ 2007-08-05 22:46 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Andrew Morton, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	linux-mm, linux-kernel, miklos, neilb, dgc, tomoki.sekiyama.qu,
	nikita, trond.myklebust, yingchao.zhou, richard

On Sat, Aug 04, 2007 at 09:16:35PM +0200, Florian Weimer wrote:
> * Andrew Morton:
> 
> > The easy preventive is to mount with data=writeback.  Maybe that should
> > have been the default.
> 
> The documentation I could find suggests that this may lead to a
> security weakness (old data in blocks of a file that was grown just
> before the crash leaks to a different user).  XFS overwrites that data
> with zeros upon reboot, which tends to irritate users when it happens.
> 
> From this point of view, data=ordered doesn't seem too bad.

The other alternative which addresses the security concern is
data=journal, which if you have a big enough journal, can sometimes be
*faster* than data=ordered or even data=writeback, because it reduces
seeking.  The problem is that it's workload dependent which is better;
if the workload is very, very heavy on data writes, each data block
ends up getting writen twice, once to the journal and once to the
final location on disk, and so this halves your total max write
bandwidth.  But if the workload doesn't do as much writing, and is
very seeky, and or is very, very, fsync()-centric (like a mailhub),
data=journal is probably the right answer.

						- Ted

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 10:42                   ` Jeff Garzik
  2007-08-05 10:58                     ` Jakob Oestergaard
@ 2007-08-05 23:43                     ` David Chinner
  1 sibling, 0 replies; 188+ messages in thread
From: David Chinner @ 2007-08-05 23:43 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Jakob Oestergaard, Linus Torvalds, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, Ingo Molnar, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, nikita, trond.myklebust,
	yingchao.zhou, richard, david

On Sun, Aug 05, 2007 at 06:42:30AM -0400, Jeff Garzik wrote:
> Jakob Oestergaard wrote:
> >Oh dear.
> >
> >Why not just make ext3 fsync() a no-op while you're at it?
> >
> >Distros can turn it back on if it's needed...
> >
> >Of course I'm not serious, but like atime, fsync() is something one
> 
> No, they are nothing alike, and you are just making yourself look silly 
> if you compare them.  fsync has to do with fundamental guarantees about 
> data.

Hi Jeff - just as a point to note, I think you should check the spec
for fsync before stating that:

"It is explicitly intended that a null implementation is permitted."

and

"... fsync() might or might not actually cause data to be written where it is
safe from a power failure."

http://www.opengroup.org/onlinepubs/009695399/functions/fsync.html

So fsync() does not have to provide the fundamental guarantees you think
it should.

Note - I'm not saying that this is at all sane (it's crazy, IMO), I'm just
pointing out that a "nofsync" mount option to avoid fsync overhead is a
legal thing to do....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 19:16             ` Florian Weimer
  2007-08-05  6:00               ` Andrew Morton
  2007-08-05 22:46               ` Theodore Tso
@ 2007-08-06  0:24               ` David Chinner
  2 siblings, 0 replies; 188+ messages in thread
From: David Chinner @ 2007-08-06  0:24 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Andrew Morton, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	linux-mm, linux-kernel, miklos, neilb, dgc, tomoki.sekiyama.qu,
	nikita, trond.myklebust, yingchao.zhou, richard

On Sat, Aug 04, 2007 at 09:16:35PM +0200, Florian Weimer wrote:
> * Andrew Morton:
> 
> > The easy preventive is to mount with data=writeback.  Maybe that should
> > have been the default.
> 
> The documentation I could find suggests that this may lead to a
> security weakness (old data in blocks of a file that was grown just
> before the crash leaks to a different user).  XFS overwrites that data
> with zeros upon reboot, which tends to irritate users when it happens.

XFS has never overwritten data on reboot. It leaves holes when the kernel has
failed to write out data. A hole == zeros so XFS does not expose stale data in
this situation. As it is, the underlying XFS problem (lack of synchronisation
between inode size update and data writes has been mostly fixed in 2.6.22 by
only updating the file size to be written to disk on data I/O completion.

FWIW, fsync() would prevent this from happening, but many application writers
seem strangely reluctant to put fsync() calls into code to ensure the data
they write is safely on disk.....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [patch] implement smarter atime updates support, v2
  2007-08-05 20:42                                 ` Theodore Tso
@ 2007-08-06  5:36                                   ` Ingo Molnar
  0 siblings, 0 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-06  5:36 UTC (permalink / raw)
  To: Theodore Tso, Linus Torvalds, Jakob Oestergaard, Jeff Garzik,
	miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, Peter Zijlstra,
	linux-mm, Linux Kernel Mailing List, nikita, trond.myklebust,
	yingchao.zhou, richard, david


* Theodore Tso <tytso@mit.edu> wrote:

> On Sun, Aug 05, 2007 at 09:28:38PM +0200, Ingo Molnar wrote:
> > 
> > added the relatime_interval sysctl that allows the changing of the 
> > atime update frequency. (default: 1 day / 86400 seconds)
> 
> What if you specify the interval as a per-mount option?  i.e.,
> 
> 	mount -o relatime=86400 /dev/sda2 /u1
> 
> If you had this, I don't think we would need the sysctl tuning 
> parameter.

it's much more flexible if there are _more_ options available. People 
can thus make use of the feature earlier, use it even on distros that 
dont support it yet, etc.

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 18:44                               ` Dave Jones
  2007-08-05 18:58                                 ` adi
@ 2007-08-06  6:39                                 ` Ingo Molnar
  2007-08-06 15:59                                   ` Dave Jones
  1 sibling, 1 reply; 188+ messages in thread
From: Ingo Molnar @ 2007-08-06  6:39 UTC (permalink / raw)
  To: Dave Jones, Alan Cox, J??rn Engel, Jeff Garzik, Linus Torvalds,
	Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard, david


* Dave Jones <davej@redhat.com> wrote:

>  > btw., Mutt does not go boom, i use it myself. It works just fine 
>  > and notices new mails even on a noatime,nodiratime filesystem.
>  
> It still fails miserably for me.
> 
> If I hit 'C' and '?' I get a list of my mail folders, with some of 
> them marked 'N' if they have new mail.  Without atime, those N's never 
> show up and every mbox looks like it has no new mail.

does it work with the "atime on steroids" patch below? (no need to 
configure anything, just apply the patch and go.)

	Ingo

----------------------->
Subject: [patch] [patch] implement smarter atime updates support
From: Ingo Molnar <mingo@elte.hu>

change relatime updates to be performed once per day. This makes
relatime a compatible solution for HSM, mailer-notification and
tmpwatch applications too.

also add the CONFIG_DEFAULT_RELATIME kernel option, which makes
"norelatime" the default for all mounts without an extra kernel
boot option.

add the "default_relatime=0" boot option to turn this off.

also add the /proc/sys/kernel/default_relatime flag which can be changed
runtime to modify the behavior of subsequent new mounts.

tested by moving the date forward:

   # date
   Sun Aug  5 22:55:14 CEST 2007
   # date -s "Tue Aug  7 22:55:14 CEST 2007"
   Tue Aug  7 22:55:14 CEST 2007

access to a file did not generate disk IO before the date was set, and
it generated exactly one IO after the date was set.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 Documentation/kernel-parameters.txt |    8 +++++
 fs/Kconfig                          |   22 ++++++++++++++
 fs/inode.c                          |   53 +++++++++++++++++++++++++++---------
 fs/namespace.c                      |   24 ++++++++++++++++
 include/linux/mount.h               |    3 ++
 kernel/sysctl.c                     |   17 +++++++++++
 6 files changed, 114 insertions(+), 13 deletions(-)

Index: linux/Documentation/kernel-parameters.txt
===================================================================
--- linux.orig/Documentation/kernel-parameters.txt
+++ linux/Documentation/kernel-parameters.txt
@@ -525,6 +525,10 @@ and is between 256 and 4096 characters. 
 			This is a 16-member array composed of values
 			ranging from 0-255.
 
+	default_relatime=
+			[FS] mount all filesystems with relative atime
+			updates by default.
+
 	default_utf8=   [VT]
 			Format=<0|1>
 			Set system-wide default UTF-8 mode for all tty's.
@@ -1468,6 +1472,10 @@ and is between 256 and 4096 characters. 
 			Format: <reboot_mode>[,<reboot_mode2>[,...]]
 			See arch/*/kernel/reboot.c or arch/*/kernel/process.c			
 
+	relatime_interval=
+			[FS] relative atime update frequency, in seconds.
+			(default: 1 day: 86400 seconds)
+
 	reserve=	[KNL,BUGS] Force the kernel to ignore some iomem area
 
 	reservetop=	[X86-32]
Index: linux/fs/Kconfig
===================================================================
--- linux.orig/fs/Kconfig
+++ linux/fs/Kconfig
@@ -2060,6 +2060,28 @@ config 9P_FS
 
 endmenu
 
+config DEFAULT_RELATIME
+	bool "Mount all filesystems with relatime by default"
+	default y
+	help
+	  If you say Y here, all your filesystems will be mounted
+	  with the "relatime" mount option. This eliminates many atime
+	  ('file last accessed' timestamp) updates (which otherwise
+	  is performed on every file access and generates a write
+	  IO to the inode) and thus speeds up IO. Atime is still updated,
+	  but only once per day.
+
+	  The mtime ('file last modified') and ctime ('file created')
+	  timestamp are unaffected by this change.
+
+	  Use the "norelatime" kernel boot option to turn off this
+	  feature.
+
+config DEFAULT_RELATIME_VAL
+	int
+	default "1" if DEFAULT_RELATIME
+	default "0"
+
 if BLOCK
 menu "Partition Types"
 
Index: linux/fs/inode.c
===================================================================
--- linux.orig/fs/inode.c
+++ linux/fs/inode.c
@@ -1162,6 +1162,41 @@ sector_t bmap(struct inode * inode, sect
 }
 EXPORT_SYMBOL(bmap);
 
+/*
+ * Relative atime updates frequency (default: 1 day):
+ */
+int relatime_interval __read_mostly = 24*60*60;
+
+/*
+ * With relative atime, only update atime if the
+ * previous atime is earlier than either the ctime or
+ * mtime.
+ */
+static int relatime_need_update(struct inode *inode, struct timespec now)
+{
+	/*
+	 * Is mtime younger than atime? If yes, update atime:
+	 */
+	if (timespec_compare(&inode->i_mtime, &inode->i_atime) >= 0)
+		return 1;
+	/*
+	 * Is ctime younger than atime? If yes, update atime:
+	 */
+	if (timespec_compare(&inode->i_ctime, &inode->i_atime) >= 0)
+		return 1;
+
+	/*
+	 * Is the previous atime value older than a day? If yes,
+	 * update atime:
+	 */
+	if ((long)(now.tv_sec - inode->i_atime.tv_sec) >= relatime_interval)
+		return 1;
+	/*
+	 * Good, we can skip the atime update:
+	 */
+	return 0;
+}
+
 /**
  *	touch_atime	-	update the access time
  *	@mnt: mount the inode is accessed on
@@ -1191,22 +1226,14 @@ void touch_atime(struct vfsmount *mnt, s
 			return;
 		if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode))
 			return;
-
-		if (mnt->mnt_flags & MNT_RELATIME) {
-			/*
-			 * With relative atime, only update atime if the
-			 * previous atime is earlier than either the ctime or
-			 * mtime.
-			 */
-			if (timespec_compare(&inode->i_mtime,
-						&inode->i_atime) < 0 &&
-			    timespec_compare(&inode->i_ctime,
-						&inode->i_atime) < 0)
+	}
+	now = current_fs_time(inode->i_sb);
+	if (mnt) {
+		if (mnt->mnt_flags & MNT_RELATIME)
+			if (!relatime_need_update(inode, now))
 				return;
-		}
 	}
 
-	now = current_fs_time(inode->i_sb);
 	if (timespec_equal(&inode->i_atime, &now))
 		return;
 
Index: linux/fs/namespace.c
===================================================================
--- linux.orig/fs/namespace.c
+++ linux/fs/namespace.c
@@ -1107,6 +1107,7 @@ int do_add_mount(struct vfsmount *newmnt
 		goto unlock;
 
 	newmnt->mnt_flags = mnt_flags;
+
 	if ((err = graft_tree(newmnt, nd)))
 		goto unlock;
 
@@ -1362,6 +1363,24 @@ int copy_mount_options(const void __user
 }
 
 /*
+ * Allow users to disable (or enable) atime updates via a .config
+ * option or via the boot line, or via /proc/sys/fs/default_relatime:
+ */
+int default_relatime __read_mostly = CONFIG_DEFAULT_RELATIME_VAL;
+
+static int __init set_default_relatime(char *str)
+{
+	get_option(&str, &default_relatime);
+
+	printk(KERN_INFO "Mount all filesystems with"
+		"default relative atime updates: %s.\n",
+		default_relatime ? "enabled" : "disabled");
+
+	return 1;
+}
+__setup("default_relatime=", set_default_relatime);
+
+/*
  * Flags is a 32-bit value that allows up to 31 non-fs dependent flags to
  * be given to the mount() call (ie: read-only, no-dev, no-suid etc).
  *
@@ -1409,6 +1428,11 @@ long do_mount(char *dev_name, char *dir_
 		mnt_flags |= MNT_NODIRATIME;
 	if (flags & MS_RELATIME)
 		mnt_flags |= MNT_RELATIME;
+	else if (default_relatime &&
+				!(flags & (MNT_NOATIME | MNT_NODIRATIME))) {
+		mnt_flags |= MNT_RELATIME;
+		flags |= MS_RELATIME;
+	}
 
 	flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
 		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME);
Index: linux/include/linux/mount.h
===================================================================
--- linux.orig/include/linux/mount.h
+++ linux/include/linux/mount.h
@@ -103,5 +103,8 @@ extern void shrink_submounts(struct vfsm
 extern spinlock_t vfsmount_lock;
 extern dev_t name_to_dev_t(char *name);
 
+extern int default_relatime;
+extern int relatime_interval;
+
 #endif
 #endif /* _LINUX_MOUNT_H */
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -30,6 +30,7 @@
 #include <linux/capability.h>
 #include <linux/smp_lock.h>
 #include <linux/fs.h>
+#include <linux/mount.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
 #include <linux/kobject.h>
@@ -1206,6 +1207,22 @@ static ctl_table fs_table[] = {
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
 	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "default_relatime",
+		.data		= &default_relatime,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "relatime_interval",
+		.data		= &relatime_interval,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 #if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
 	{
 		.ctl_name	= CTL_UNNUMBERED,

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 19:03                               ` david
@ 2007-08-06  6:52                                 ` Ingo Molnar
  2007-08-10  4:04                                 ` Bill Davidsen
  1 sibling, 0 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-06  6:52 UTC (permalink / raw)
  To: david
  Cc: Diego Calleja, Alan Cox, J??rn Engel, Jeff Garzik,
	Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard


* david@lang.hm <david@lang.hm> wrote:

> i've been a linux sysadmin for 10 years, and have known about noatime 
> for at least 7 years, but I always thought of it in the catagory of 
> 'use it only on your performance critical machines where you are 
> trying to extract every ounce of performance, and keep an eye out for 
> things misbehaving'
> 
> I never imagined that itwas the 20%+ hit that is being described, and 
> with so little impact, or I would have switched to it across the board 
> years ago.
> 
> I'll bet there are a lot of admins out there in the same boat.
> 
> adding an option in the kernel to change the default sounds like a 
> very good first step, even if the default isn't changed today.

yep - but note that this was a gradual effect along the years, today the 
assymetry between CPU performance and disk-seek performance is 
proportionally larger than 10 years ago. Today CPUs are nearly 100 times 
faster than 10 years ago, but disk seeks got only 2-3 times faster. (and 
even that only if you have a high rpm disk - most desktops dont.)

10 years ago noatime was a nifty hack that made a difference if you had 
lots of files. But it still was a problem with no immediate easy 
solution and people developed their counter-arguments. Today the same 
counter-arguments are used, but the situation has evolved alot.

and note that often this has a bigger everyday effect than the tweaking 
of CPU scheduling, IO scheduling or swapping behavior (!). My desktop 
systems rarely swap, have plenty of CPU power to spare, but atime 
updates still have a noticeable latency impact, regardless of the memory 
pressure. Linux has _lots_ of "performance reserves", so people dont 
normally notice when comparing it to other OSs, but still we should not 
be so wasteful with our IO performance, for such a fundamental thing as 
reading files.

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 13:29                                     ` Willy Tarreau
@ 2007-08-06  6:57                                       ` Ingo Molnar
  2007-08-06 13:12                                         ` Willy Tarreau
  0 siblings, 1 reply; 188+ messages in thread
From: Ingo Molnar @ 2007-08-06  6:57 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Alan Cox, Claudio Martins, Jeff Garzik, Jörn Engel,
	Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david


* Willy Tarreau <w@1wt.eu> wrote:

> In your example above, maybe it's the opposite, users know they can 
> keep a file in /tmp one more week by simply cat'ing it.

sure - and i'm not arguing that noatime should the kernel-wide default. 
In every single patch i sent it was a .config option (and a boot option 
_and_ a sysctl option that i think you missed) that a user/distro 
enables or disabled. But i think the /tmp argument is not very strong: 
/tmp is fundamentally volatile, and you can grow dependencies on pretty 
much _any_ aspect of the kernel. So the question isnt "is there impact" 
(there is, at least for noatime), the question is "is it still worth 
doing it".

> Changing the kernel in a non-easily reversible way is not kind to the 
> users.

none of my patches did any of that...

anyway, my latest patch doesnt do noatime, it does the "more intelligent 
relatime" approach.

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 13:22                             ` Diego Calleja
  2007-08-05 19:03                               ` david
@ 2007-08-06  6:58                               ` Ingo Molnar
  1 sibling, 0 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-06  6:58 UTC (permalink / raw)
  To: Diego Calleja
  Cc: Alan Cox, J??rn Engel, Jeff Garzik, Linus Torvalds,
	Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard, david


* Diego Calleja <diegocg@gmail.com> wrote:

> > Measurements show that noatime helps 20-30% on regular desktop 
> > workloads, easily 50% for kernel builds and much more than that (in 
> > excess of 100%) for file-read-intense workloads. We cannot just walk
> 
> And as everybody knows in servers is a popular practice to disable it. 
> According to an interview to the kernel.org admins....

yeah - but i'd be surprised if more than 1% of all Linux servers out 
there had noatime.

> "Beyond that, Peter noted, "very little fancy is going on, and that is 
> good because fancy is hard to maintain." He explained that the only 
> fancy thing being done is that all filesystems are mounted noatime 
> meaning that the system doesn't have to make writes to the filesystem 
> for files which are simply being read, "that cut the load average in 
> half."

nice quote :-)

> I bet that some people would consider such performance hit a bug...

yeah.

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 22:17       ` Andi Kleen
@ 2007-08-06  8:40         ` Brice Figureau
  2007-08-14  1:44           ` Stewart Smith
  0 siblings, 1 reply; 188+ messages in thread
From: Brice Figureau @ 2007-08-06  8:40 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

Hi Andi,

On Mon, 2007-08-06 at 00:17 +0200, Andi Kleen wrote:
> Brice Figureau <brice+lklm@daysofwonder.com> writes:
> > 
> >  2) I _still_ don't get the "performances" of 2.6.17, but since that's the
> > better combination I could get, I think there is IMHO progress in the right
> > direction (to be compared to no progress since 2.6.18, that's better :-)).
> 
> If you could characterize your workload well (e.g. how many disks,
> what file systems, what load on mysql) perhaps it would be possible
> to reproduce the problem with a test program or a mysql driver.
> Then it could be bisected.

My server is a Dell Poweredge 2850 (bi-Xeon EM64T 3GHz running without
HT, 4GB of RAM), with a Perc 4/Di (a LSI megaraid with a BBU of 256MB). 
The hardware RAID card has 2 channels, one is connected to 2 10k RPM
146GB SCSI disk that are mirrored in a RAID 1 array on which the system
resides (/dev/sda). The second channel is connected to 4 10k RPM 146GB
disks, on a RAID 10 array which contains the database files and database
logs (/dev/sdb).

The kernel and userspace are 64bits.
Above the hardware RAID arrays there is LVM2 with two physical groups
(one per array). The RAID10 has only one logical volume.

The database volume (the RAID10) is an ext3 volume mounted with
rw,noexec,nosuid,nodev,noatime,data=writeback.

The I/O scheduler on all arrays is deadline.

/proc knobs with values other than defaults are:
/proc/sys/vm/swappiness = 2
/proc/sys/vm/dirty_background_ratio = 1
/proc/sys/vm/dirty_ratio = 2
/proc/sys/vm/vfs_cache_pressure = 1

The only thing running on the server is mysql. 
Mysql memory footprint is about 90% of physical RAM. Mysql is configured
to use exclusively InnoDB.

Mysql accesses its database files in O_DIRECT mode.
Since the database fits in RAM, the only kind of access Mysql is doing
is writing to the innodb log, the mysql binlog and finally to the innodb
database files.
There are certainly a whole lot of fsync'ing happening.
All the database reads are done from the innodb in-RAM cache.

During all my kernel tests (see the original bug report) the machine was
not swapping (so that's not the reason of the stuttering).

If that helps:
db1:~# cat /proc/meminfo 
MemTotal:      4052420 kB
MemFree:         23972 kB
Buffers:         54420 kB
Cached:         168096 kB
SwapCached:    1541744 kB
Active:        3723468 kB
Inactive:       157180 kB
SwapTotal:    11863960 kB
SwapFree:     10193064 kB
Dirty:             320 kB
Writeback:           0 kB
AnonPages:     3657744 kB
Mapped:          20508 kB
Slab:           119964 kB
SReclaimable:   103564 kB
SUnreclaim:      16400 kB
PageTables:       9408 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:  13890168 kB
Committed_AS:  3826764 kB
VmallocTotal: 34359738367 kB
VmallocUsed:    268604 kB
VmallocChunk: 34359469435 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     2048 kB

An typical iostat (taken every 2s under light load):
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     2.00    0.00    3.50     0.00    44.00    12.57     0.00    0.00   0.00   0.00
sdb               0.00     9.00    0.50   27.00     4.00   288.00    10.62     0.01    0.36   0.36   1.00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb               0.00   223.50    7.50  185.50    60.00  5964.00    31.21     0.15    0.78   0.56  10.80

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     1.00    0.00    1.00     0.00    15.92    16.00     0.00    0.00   0.00   0.00
sdb               0.00   198.01   19.90  156.22   159.20  2833.83    16.99     0.04    0.24   0.20   3.58

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb               0.00     5.00    0.50   17.00     4.00   176.00    10.29     0.01    0.69   0.69   1.20

Would it help if I try blktrace on this server to capture the I/O ?
I enabled it while compiling the kernel, but I don't know yet how to use
it:
any pointer on how to activate it and capture useful information?

Many thanks,
-- 
Brice Figureau <brice+lklm@daysofwonder.com>


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 20:41               ` Christoph Hellwig
@ 2007-08-06 10:42                 ` Andi Kleen
  0 siblings, 0 replies; 188+ messages in thread
From: Andi Kleen @ 2007-08-06 10:42 UTC (permalink / raw)
  To: Christoph Hellwig, Andi Kleen, Ingo Molnar, Linus Torvalds,
	Peter Zijlstra, linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard

On Sun, Aug 05, 2007 at 09:41:12PM +0100, Christoph Hellwig wrote:
> On Sun, Aug 05, 2007 at 02:26:53AM +0200, Andi Kleen wrote:
> > I always thought the right solution would be to just sync atime only
> > very very lazily. This means if a inode is only dirty because of an
> > atime update put it on a "only write out when there is nothing to do
> > or the memory is really needed" list.
> 
> Which is the policy I implemented for XFS a while ago.

How would that work? I didn't think XFS had separate inode lists.

-Andi

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-06  6:57                                       ` Ingo Molnar
@ 2007-08-06 13:12                                         ` Willy Tarreau
  0 siblings, 0 replies; 188+ messages in thread
From: Willy Tarreau @ 2007-08-06 13:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan Cox, Claudio Martins, Jeff Garzik, Jörn Engel,
	Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

On Mon, Aug 06, 2007 at 08:57:12AM +0200, Ingo Molnar wrote:
> 
> * Willy Tarreau <w@1wt.eu> wrote:
> 
> > In your example above, maybe it's the opposite, users know they can 
> > keep a file in /tmp one more week by simply cat'ing it.
> 
> sure - and i'm not arguing that noatime should the kernel-wide default. 
> In every single patch i sent it was a .config option (and a boot option 
> _and_ a sysctl option that i think you missed) that a user/distro 
> enables or disabled. But i think the /tmp argument is not very strong: 
> /tmp is fundamentally volatile, and you can grow dependencies on pretty 
> much _any_ aspect of the kernel. So the question isnt "is there impact" 
> (there is, at least for noatime), the question is "is it still worth 
> doing it".
> 
> > Changing the kernel in a non-easily reversible way is not kind to the 
> > users.
> 
> none of my patches did any of that...

I did not notice you talked about a sysctl. A sysctl provides the ability
to switch the behaviour without rebooting, while both the config option
and the command line require a reboot.

> anyway, my latest patch doesnt do noatime, it does the "more intelligent 
> relatime" approach.

... which is not equivalent noatime in the initial example.

Regards,
Willy


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 15:00               ` Theodore Tso
@ 2007-08-06 13:47                 ` Chris Mason
  2007-08-17  0:45                 ` Dave Jones
  1 sibling, 0 replies; 188+ messages in thread
From: Chris Mason @ 2007-08-06 13:47 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Andi Kleen, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
	linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard

On Sun, 5 Aug 2007 11:00:29 -0400
Theodore Tso <tytso@mit.edu> wrote:

> On Sun, Aug 05, 2007 at 02:26:53AM +0200, Andi Kleen wrote:
> > I always thought the right solution would be to just sync atime only
> > very very lazily. This means if a inode is only dirty because of an
> > atime update put it on a "only write out when there is nothing to do
> > or the memory is really needed" list.
> 
> As I've mentioend earlier, the memory balancing issues that arise when
> we add an "atime dirty" bit scare me a little.  It can be addressed,
> obviously, but at the cost of more code complexity.

ext3 and reiser both use a dirty_inode method to make sure that we
don't actually have dirty inodes.  This way, kswapd doesn't get stuck
on the log and is able to do real work.

It would be interesting to see a comparison of relatime with a kinoded
that is willing to get stuck on the log.  The FS would need a few
tweaks so that write_inode() could know if it really needed to log or
not, but for testing you could just drop ext3_dirty_inode and have
ext3_write_inode do real work.

Then just change kswapd to kick a new kinoded and benchmark away.  A
real patch would have to look for places where mark_inode_dirty was
used and expected the dirty_inode callback to log things right away,
but for testing its good enough.

-chris

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-06  6:39                                 ` Ingo Molnar
@ 2007-08-06 15:59                                   ` Dave Jones
  2007-08-06 16:16                                     ` Ingo Molnar
  0 siblings, 1 reply; 188+ messages in thread
From: Dave Jones @ 2007-08-06 15:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan Cox, J??rn Engel, Jeff Garzik, Linus Torvalds,
	Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard, david

On Mon, Aug 06, 2007 at 08:39:09AM +0200, Ingo Molnar wrote:
 > 
 > * Dave Jones <davej@redhat.com> wrote:
 > 
 > >  > btw., Mutt does not go boom, i use it myself. It works just fine 
 > >  > and notices new mails even on a noatime,nodiratime filesystem.
 > >  
 > > It still fails miserably for me.
 > > 
 > > If I hit 'C' and '?' I get a list of my mail folders, with some of 
 > > them marked 'N' if they have new mail.  Without atime, those N's never 
 > > show up and every mbox looks like it has no new mail.
 > 
 > does it work with the "atime on steroids" patch below? (no need to 
 > configure anything, just apply the patch and go.)

people have reported that relatime does work, but my util-linux
isn't new enough to support it, so I've never got it to work.
I'll give your diff a try later, though as it seems to be
equivalent I expect it'll work.

	Dave

-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-06 15:59                                   ` Dave Jones
@ 2007-08-06 16:16                                     ` Ingo Molnar
  0 siblings, 0 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-06 16:16 UTC (permalink / raw)
  To: Dave Jones, Alan Cox, J??rn Engel, Jeff Garzik, Linus Torvalds,
	Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard, david


* Dave Jones <davej@redhat.com> wrote:

>  > does it work with the "atime on steroids" patch below? (no need to 
>  > configure anything, just apply the patch and go.)
> 
> people have reported that relatime does work, but my util-linux isn't 
> new enough to support it, so I've never got it to work. I'll give your 
> diff a try later, though as it seems to be equivalent I expect it'll 
> work.

would still be nice if you could test it and report back :)

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 20:36                         ` Christoph Hellwig
@ 2007-08-06 18:03                           ` Chuck Ebbert
  2007-08-06 18:53                             ` Jeff Garzik
  2007-08-06 19:37                             ` Alan Cox
  2007-08-08 21:10                           ` Martin J. Bligh
  2007-08-14  9:57                           ` Helge Hafting
  2 siblings, 2 replies; 188+ messages in thread
From: Chuck Ebbert @ 2007-08-06 18:03 UTC (permalink / raw)
  To: Christoph Hellwig, J??rn Engel, Ingo Molnar, Jeff Garzik,
	Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

On 08/05/2007 04:36 PM, Christoph Hellwig wrote:
> 
> Umm, no f**king way.  atime selection is 100% policy and belongs into
> userspace.  Add to that the problem that we can't actually re-enable
> atimes because of the way the vfs-level mount flags API is designed.
> Instead of doing such a fugly kernel patch just talk to the handfull
> of distributions that matter to update their defaults.
> 

We already tried that here. The response: "If noatime is so great, why
isn't it the default in the kernel?"

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-06 18:03                           ` Chuck Ebbert
@ 2007-08-06 18:53                             ` Jeff Garzik
  2007-08-06 19:37                             ` Alan Cox
  1 sibling, 0 replies; 188+ messages in thread
From: Jeff Garzik @ 2007-08-06 18:53 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: Christoph Hellwig, J??rn Engel, Ingo Molnar, Linus Torvalds,
	Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard, david

Chuck Ebbert wrote:
> On 08/05/2007 04:36 PM, Christoph Hellwig wrote:
>> Umm, no f**king way.  atime selection is 100% policy and belongs into
>> userspace.  Add to that the problem that we can't actually re-enable
>> atimes because of the way the vfs-level mount flags API is designed.
>> Instead of doing such a fugly kernel patch just talk to the handfull
>> of distributions that matter to update their defaults.

> We already tried that here. The response: "If noatime is so great, why
> isn't it the default in the kernel?"

Yes, and around and around we go :/

	Jeff



^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-06 18:03                           ` Chuck Ebbert
  2007-08-06 18:53                             ` Jeff Garzik
@ 2007-08-06 19:37                             ` Alan Cox
  2007-08-06 19:46                               ` Chuck Ebbert
  1 sibling, 1 reply; 188+ messages in thread
From: Alan Cox @ 2007-08-06 19:37 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: Christoph Hellwig, J??rn Engel, Ingo Molnar, Jeff Garzik,
	Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

> We already tried that here. The response: "If noatime is so great, why
> isn't it the default in the kernel?"

Ok so we have a pile of people @redhat.com sitting on linux-kernel
complaining about Red Hat distributions not taking it up. Guys - can
we just fix it internally please like sensible folk ?

Ingo's latest 'not quite noatime' seems to cure mutt/tmpwatch so it might
finally make sense to do so.

Alan

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-06 19:37                             ` Alan Cox
@ 2007-08-06 19:46                               ` Chuck Ebbert
  2007-08-07  7:05                                 ` Ingo Molnar
  0 siblings, 1 reply; 188+ messages in thread
From: Chuck Ebbert @ 2007-08-06 19:46 UTC (permalink / raw)
  To: Alan Cox
  Cc: Christoph Hellwig, J??rn Engel, Ingo Molnar, Jeff Garzik,
	Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

On 08/06/2007 03:37 PM, Alan Cox wrote:
>> We already tried that here. The response: "If noatime is so great, why
>> isn't it the default in the kernel?"
> 
> Ok so we have a pile of people @redhat.com sitting on linux-kernel
> complaining about Red Hat distributions not taking it up. Guys - can
> we just fix it internally please like sensible folk ?
> 
> Ingo's latest 'not quite noatime' seems to cure mutt/tmpwatch so it might
> finally make sense to do so.

Do we report max(ctime, mtime) as the atime by default when noatime
is set or do we still need that to be done?


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
                   ` (23 preceding siblings ...)
  2007-08-03 22:21 ` [PATCH 00/23] per device dirty throttling -v8 Linus Torvalds
@ 2007-08-06 20:26 ` Miklos Szeredi
  2007-08-08 12:25 ` richard kennedy
  25 siblings, 0 replies; 188+ messages in thread
From: Miklos Szeredi @ 2007-08-06 20:26 UTC (permalink / raw)
  To: a.p.zijlstra
  Cc: linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, a.p.zijlstra, nikita, trond.myklebust,
	yingchao.zhou, richard, torvalds

> Per device dirty throttling patches

Andrew, may I inquire about your plans with this?

> These patches aim to improve balance_dirty_pages() and directly address three
> issues:
>   1) inter device starvation
>   2) stacked device deadlocks

This one interests me most, due to various real life, reported
problems with fuse filesystems.  For this reason I'd really like to
get this or a subset of it into mainline as soon as possible.

This patchset (or rather the -v7 version) has been running on my
laptop for a couple of weeks without problems.  I've also verified
that it solves the fuse and loop issues.

I have some qualms about the complexity of various parts though.
Especially the "proportions" library, which I'm having problems
understanding.  I'm not sure that this level of sophistication is
really needed to solve the issues with the old code.

Miklos

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-06 19:46                               ` Chuck Ebbert
@ 2007-08-07  7:05                                 ` Ingo Molnar
  0 siblings, 0 replies; 188+ messages in thread
From: Ingo Molnar @ 2007-08-07  7:05 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: Alan Cox, Christoph Hellwig, J??rn Engel, Jeff Garzik,
	Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david


* Chuck Ebbert <cebbert@redhat.com> wrote:

> > Ingo's latest 'not quite noatime' seems to cure mutt/tmpwatch so it 
> > might finally make sense to do so.
> 
> Do we report max(ctime, mtime) as the atime by default when noatime is 
> set or do we still need that to be done?

noatime is unchanged by my patch (it is not the same as the 'improved 
relatime' mode my patch activates), but it would make sense to do your 
change, independently.

	Ingo

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 20:28                       ` Jeff Garzik
  2007-08-04 21:47                         ` Alan Cox
@ 2007-08-07 18:55                         ` Bill Davidsen
  2007-08-07 19:35                           ` Alan Cox
  1 sibling, 1 reply; 188+ messages in thread
From: Bill Davidsen @ 2007-08-07 18:55 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm

Jeff Garzik wrote:
> Alan Cox wrote:
>> In some setups it will and in others it won't. Nor is it the only
>> application that has this requirement. Ext3 currently is a standards
>> compliant file system. Turn off atime and its very non standards
>> compliant, turn to relatime and its not standards compliant but nobody
>> will break (which is good)
> 
> Linux has always been a "POSIX unless its stupid" type of system.  For 
> the upstream kernel, we should do the right thing -- noatime by default 
> -- but allow distros and people that care about rigid compliance to 
> easily change the default.
> 
However, relatime has the POSIX behavior without the overhead. Therefore 
that (and maybe reldiratime?) are a far better choice. I don't see a big 
problem with some version of utils not supporting it, since it can be in 
the kernel and will be in the utils soon enough. We have lived without 
it this long, sounds as if we could live a bit longer.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 21:51                           ` Alan Cox
  2007-08-05  7:21                             ` Ingo Molnar
  2007-08-05  7:37                             ` Ingo Molnar
@ 2007-08-07 19:09                             ` Bill Davidsen
  2 siblings, 0 replies; 188+ messages in thread
From: Bill Davidsen @ 2007-08-07 19:09 UTC (permalink / raw)
  To: Alan Cox
  Cc: Ingo Molnar, J??rn Engel, Jeff Garzik, Linus Torvalds,
	Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard, david

Alan Cox wrote:
>> i cannot over-emphasise how much of a deal it is in practice. Atime 
>> updates are by far the biggest IO performance deficiency that Linux has 
>> today. Getting rid of atime updates would give us more everyday Linux 
>> performance than all the pagecache speedups of the past 10 years, 
>> _combined_.
>>
>> it's also perhaps the most stupid Unix design idea of all times. Unix is 
>> really nice and well done, but think about this a bit:
> 
> Think about the user for a moment instead. 
> 
> Do things right. The job of the kernel is not to "correct" for
> distribution policy decisions. The distributions need to change policy.
> You do that by showing the distributions the numbers. 
> 
> With a Red Hat on if we can move from /dev/hda to /dev/sda in FC7 then we
> can move from atime to noatime by default on FC8 with appropriate release
> note warnings and having a couple of betas to find out what other than
> mutt goes boom.

Is there really enough benefit between relatime and noatime to justify 
that? If atime doesn't get updated at all it *will* impact operations, 
and unless there's a real performance gain the path which provides at 
least nominal POSIX compliance seems best.

Plauger's law of least astonishment.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-07 18:55                         ` Bill Davidsen
@ 2007-08-07 19:35                           ` Alan Cox
  2007-08-08 17:44                             ` Bill Davidsen
  0 siblings, 1 reply; 188+ messages in thread
From: Alan Cox @ 2007-08-07 19:35 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: linux-kernel, linux-mm

> However, relatime has the POSIX behavior without the overhead. Therefore 

No. relatime has approximately SuS behaviour. Its not the same as
"correct" behaviour.


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 23:51                           ` Claudio Martins
  2007-08-05  0:49                             ` Alan Cox
@ 2007-08-07 21:20                             ` Bill Davidsen
  1 sibling, 0 replies; 188+ messages in thread
From: Bill Davidsen @ 2007-08-07 21:20 UTC (permalink / raw)
  To: Claudio Martins
  Cc: Alan Cox, Jeff Garzik, Ingo Molnar, Jörn Engel,
	Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

Claudio Martins wrote:
> On Saturday 04 August 2007, Alan Cox wrote:
>> Linux has never been a "suprise your kernel interfaces all just changed
>> today" kernel, nor a "gosh you upgraded and didn't notice your backups
>> broke" kernel.
>>
> 
>  Can you give examples of backup solutions that rely on atime being updated?
> I can understand backup tools using mtime/ctime for incremental backups (like 
> tar + Amanda, etc), but I'm having trouble figuring out why someone would 
> want to use atime for that.
> 
Programs which migrate unused files or delete them are the usual cases.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 17:17               ` Ingo Molnar
  2007-08-04 17:38                 ` Diego Calleja
@ 2007-08-08 10:43                 ` Karel Zak
  1 sibling, 0 replies; 188+ messages in thread
From: Karel Zak @ 2007-08-08 10:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Diego Calleja, Linus Torvalds, Peter Zijlstra, linux-mm,
	linux-kernel, miklos, akpm, neilb, dgc, tomoki.sekiyama.qu,
	nikita, trond.myklebust, yingchao.zhou, richard

On Sat, Aug 04, 2007 at 07:17:24PM +0200, Ingo Molnar wrote:
> 
> * Diego Calleja <diegocg@gmail.com> wrote:
> 
> > El Sat, 4 Aug 2007 18:37:33 +0200, Ingo Molnar <mingo@elte.hu> escribió:
> > 
> > > thousands of applications. So for most file workloads we give 
> > > Windows a 20%-30% performance edge, for almost nothing. (for 
> > > RAM-starved kernel builds the performance difference between atime 
> > > and noatime+nodiratime setups is more on the order of 40%)
> > 
> > Just curious - do you have numbers with relatime?
> 
> nope. Stupid question, i just tried it and got this:
> 
>  EXT3-fs: Unrecognized mount option "relatime" or missing value
> 
> i've got util-linux-2.13-0.46.fc6 and 2.6.22 on that box, shouldnt that 

 The relatime patch has been applied to util-lilnux-ng-2.13 (now -rc3),
 you will see it in Fedora 8 (and probably in the others distros).

    Karel

-- 
 Karel Zak  <kzak@redhat.com>

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
                   ` (24 preceding siblings ...)
  2007-08-06 20:26 ` Miklos Szeredi
@ 2007-08-08 12:25 ` richard kennedy
  2007-08-08 13:54   ` Andi Kleen
  25 siblings, 1 reply; 188+ messages in thread
From: richard kennedy @ 2007-08-08 12:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	torvalds

On Fri, 2007-08-03 at 14:37 +0200, Peter Zijlstra wrote:
> Per device dirty throttling patches
> 
> These patches aim to improve balance_dirty_pages() and directly address three
> issues:
>   1) inter device starvation
>   2) stacked device deadlocks
>   3) inter process starvation
<snip>
Hi Peter,
I've been testing your patch with a simple test case that copies a 3GB
file from sda -> sda, and copies a 1GB file from sda -> sdb.
the script is roughly this :-

dd bs=64k if=[sda]/data3g of=[sda]/temp_data3g &
sleep 60
dd bs=64k if=[sda]/data1g of=[sdb]/temp_data1g &
wait
sleep 200

On my amd64x2 desktop machine where sda is a sata 250 GB drive & sdb is
an ide 300 GB drive.

Running this test 5 times gives
2.6.23-rc1-mm2
1GB copy MB/s	3GB copy MB/s
16.2		16.1
15.2		14.6
17.3		14.6
18.0		14.5
19.0		14.6

2.6.23-rc1-mm2+pddt_patch
1GB copy MB/s	3GB copy MB/s
23.0		14.7
24.0		14.6
20.4		14.8
22.6		14.5
23.2		14.5

This is on a standard desktop machine so there are lots of other
processes running on it, and although there is a degree of variability
in the numbers,they are very repeatable and your patch always out
performs the stock mm2.
looks good to me

Richard
  







^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-08 12:25 ` richard kennedy
@ 2007-08-08 13:54   ` Andi Kleen
  2007-08-10  4:17     ` Bill Davidsen
  0 siblings, 1 reply; 188+ messages in thread
From: Andi Kleen @ 2007-08-08 13:54 UTC (permalink / raw)
  To: richard kennedy
  Cc: Peter Zijlstra, linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	torvalds

richard kennedy <richard@rsk.demon.co.uk> writes:
> 
> This is on a standard desktop machine so there are lots of other
> processes running on it, and although there is a degree of variability
> in the numbers,they are very repeatable and your patch always out
> performs the stock mm2.
> looks good to me

iirc the goal of this is less to get better performance, but to avoid long user visible
latencies.  Of course if it's faster it's great too, but that's only secondary.

-Andi

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-07 19:35                           ` Alan Cox
@ 2007-08-08 17:44                             ` Bill Davidsen
  0 siblings, 0 replies; 188+ messages in thread
From: Bill Davidsen @ 2007-08-08 17:44 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel, linux-mm

Alan Cox wrote:
>> However, relatime has the POSIX behavior without the overhead. Therefore 
> 
> No. relatime has approximately SuS behaviour. Its not the same as
> "correct" behaviour.
> 
Actually correct, but in terms of what can or does break, relatime seems 
a lot closer than noatime, I can't (personally) come up with any 
scenario where real applications would see something which would change 
behavior adversely.

Making noatime a default in the kernel requiring a boot option to 
restore current behavior seems to be a turn toward the "it doesn't 
really work right but it's *fast*" model. If vendors wanted noatime they 
are smart enough to enable it. Now with relatime giving most of the 
benefits and few (of any) of the side effects, I would expect a change.

By all means relatime by default in FC8, but not noatime, and let those 
who find some measurable benefit from noatime use it.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 18:08                                     ` Ingo Molnar
  2007-08-05 19:11                                       ` Alan Cox
@ 2007-08-08 18:22                                       ` Bill Davidsen
  2007-08-08 19:39                                         ` Jeff Garzik
  1 sibling, 1 reply; 188+ messages in thread
From: Bill Davidsen @ 2007-08-08 18:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan Cox, J??rn Engel, Jeff Garzik, Linus Torvalds,
	Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard, david

Ingo Molnar wrote:

> || ...For me, I would say 50% is not enough to describe the _visible_ 
> || benefits... Not talking any specific number but past 10sec-1min+ 
> || lagging in X is history, it's gone and I really don't miss it that 
> || much... :-) Cannot reproduce even a second long delay anymore in 
> || window focusing under considerable load as it's basically 
> || instantaneous (I can see that it's loaded but doesn't affect the 
> || feeling of responsiveness I'm now getting), even on some loads that I 
> || couldn't previously even dream of... [...]
> 
> we really have to ask ourselves whether the "process" is correct if 
> advantages to the user of this order of magnitude can be brushed aside 
> with simple "this breaks binary-only HSM" and "it's not standards 
> compliant" arguments.
> 
Being standards compliant is not an argument it's a design goal, a 
requirement. Standards compliance is like pregant, you are or you're 
not. And to deliberately ignore standards for speed is saying "it's too 
hard to do it right, I'll do it wrong and it will be faster." The answer 
is to do it smarter, with solutions like relatime (which can be enhanced 
as Linus noted) which provide performance benefits without ignoring 
standards, or use of a filesystem which does a better job. But when it 
goes in the kernel the choice of having per-filesystem behavior either 
vanishes or becomes an exercise in complex and as-yet unwritten mount 
options.

There are certainly ways to improve ext3, not journaling atime updates 
would certainly be one, less frequent updates of dirty inodes, whatever. 
But if a user wants to give up standards compliance it should be a 
deliberate choice, not something which the average user will not 
understand or learn to do.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-08 18:22                                       ` Bill Davidsen
@ 2007-08-08 19:39                                         ` Jeff Garzik
  2007-08-08 20:31                                           ` Bill Davidsen
  2007-08-08 23:18                                           ` Alan Cox
  0 siblings, 2 replies; 188+ messages in thread
From: Jeff Garzik @ 2007-08-08 19:39 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Ingo Molnar, Alan Cox, J??rn Engel, Linus Torvalds,
	Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard, david

Bill Davidsen wrote:
> Being standards compliant is not an argument it's a design goal, a 
> requirement. Standards compliance is like pregant, you are or you're 

Linux history says different.  There was always the "final 1%" of 
compliance that required silliness we really did not want to bother with.

	Jeff



^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-08 19:39                                         ` Jeff Garzik
@ 2007-08-08 20:31                                           ` Bill Davidsen
  2007-08-08 23:18                                           ` Alan Cox
  1 sibling, 0 replies; 188+ messages in thread
From: Bill Davidsen @ 2007-08-08 20:31 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Ingo Molnar, Alan Cox, J??rn Engel, Linus Torvalds,
	Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard, david

Jeff Garzik wrote:
> Bill Davidsen wrote:
>> Being standards compliant is not an argument it's a design goal, a 
>> requirement. Standards compliance is like pregant, you are or you're 
>
> Linux history says different.  There was always the "final 1%" of 
> compliance that required silliness we really did not want to bother with. 

This is not 1%, this is a user-visible change in behavior, relative to 
all previous Linux versions. There has been a way for ages to trade 
performance for standards for users or distributions, and standards have 
been chosen. Given that there is now a way to get virtually all of the 
performance without giving up atime completely, why the sudden attempt 
to change to a less satisfactory default?

I could understand a push to quickly get relatime with a few 
enhancements (the functionality if not the exact code) into 
distributions, even as a default, but forcing user or distribution 
changes just to retain the same dehavior doesn't seem reasonable. It 
assumes that vendors and users are so stupid they can't understand why 
benchmark results and more important than standards. People who run 
servers are smart enough to decide if their application will run as 
expected without atime.

People have lived with this compromise for a very long time, and it 
seems that a far more balanced solution will be in the kernel soon.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 20:36                         ` Christoph Hellwig
  2007-08-06 18:03                           ` Chuck Ebbert
@ 2007-08-08 21:10                           ` Martin J. Bligh
  2007-08-08 21:21                             ` Andrew Morton
  2007-08-14  9:57                           ` Helge Hafting
  2 siblings, 1 reply; 188+ messages in thread
From: Martin J. Bligh @ 2007-08-08 21:10 UTC (permalink / raw)
  To: Christoph Hellwig, J??rn Engel, Ingo Molnar, Jeff Garzik,
	Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

Christoph Hellwig wrote:
> On Sat, Aug 04, 2007 at 09:42:59PM +0200, J??rn Engel wrote:
>   
>> On Sat, 4 August 2007 21:26:15 +0200, J??rn Engel wrote:
>>     
>>> Given the choice between only "atime" and "noatime" I'd agree with you.
>>> Heck, I use it myself.  But "relatime" seems to combine the best of both
>>> worlds.  It currently just suffers from mount not supporting it in any
>>> relevant distro.
>>>       
>> And here is a completely untested patch to enable it by default.  Ingo,
>> can you see how good this fares compared to "atime" and
>> "noatime,nodiratime"?
>>     
>
> Umm, no f**king way.  atime selection is 100% policy and belongs into
> userspace.  Add to that the problem that we can't actually re-enable
> atimes because of the way the vfs-level mount flags API is designed.
> Instead of doing such a fugly kernel patch just talk to the handfull
> of distributions that matter to update their defaults.
>   

 From what I've seen the problem seems to be that the inode
gets marked dirty when we update atime.

Why isn't this easily fixable by just adding an additional dirty
flag that says atime has changed? Then we only cause a write
when we remove the inode from the inode cache, if only atime
is updated.

Unlike relatime, there's no user-visible change (unless the
machine crashes without clean unmount, but not sure anyone
cares that much about that cornercase). Atime changes are
thus kept in-ram until umount / inode reclaim.


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-08 21:10                           ` Martin J. Bligh
@ 2007-08-08 21:21                             ` Andrew Morton
  2007-08-09  0:54                               ` Martin Bligh
  2007-08-10  0:21                               ` Bill Davidsen
  0 siblings, 2 replies; 188+ messages in thread
From: Andrew Morton @ 2007-08-08 21:21 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Christoph Hellwig, J??rn Engel, Ingo Molnar, Jeff Garzik,
	Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

On Wed, 08 Aug 2007 14:10:15 -0700
"Martin J. Bligh" <mbligh@mbligh.org> wrote:

> Why isn't this easily fixable by just adding an additional dirty
> flag that says atime has changed? Then we only cause a write
> when we remove the inode from the inode cache, if only atime
> is updated.

I think that could be made to work, and it would fix the performance
issue.

It is a behaviour change.  At present ext3 (for example) commits everything
every five seconds.  After a change like this, a crash+recovery could cause
a file's atime to go backwards by an arbitrarily large time interval - it
could easily be months.



^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-08 19:39                                         ` Jeff Garzik
  2007-08-08 20:31                                           ` Bill Davidsen
@ 2007-08-08 23:18                                           ` Alan Cox
  1 sibling, 0 replies; 188+ messages in thread
From: Alan Cox @ 2007-08-08 23:18 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Bill Davidsen, Ingo Molnar, J??rn Engel, Linus Torvalds,
	Peter Zijlstra, linux-mm, Linux Kernel Mailing List, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, richard, david

On Wed, 08 Aug 2007 15:39:52 -0400
Jeff Garzik <jeff@garzik.org> wrote:

> Bill Davidsen wrote:
> > Being standards compliant is not an argument it's a design goal, a 
> > requirement. Standards compliance is like pregant, you are or you're 
> 
> Linux history says different.  There was always the "final 1%" of 
> compliance that required silliness we really did not want to bother with.

This isn't about the 1% however. Its about API and ABI. Changing the
default is a fairly evil ABI change. Telling everyone relatime is cool on
desktops and defaulting it in the distro is not an ABI change and is very
sensible

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-08 21:21                             ` Andrew Morton
@ 2007-08-09  0:54                               ` Martin Bligh
  2007-08-11 23:14                                 ` Valerie Henson
  2007-08-10  0:21                               ` Bill Davidsen
  1 sibling, 1 reply; 188+ messages in thread
From: Martin Bligh @ 2007-08-09  0:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, J??rn Engel, Ingo Molnar, Jeff Garzik,
	Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

Andrew Morton wrote:
> On Wed, 08 Aug 2007 14:10:15 -0700
> "Martin J. Bligh" <mbligh@mbligh.org> wrote:
> 
>> Why isn't this easily fixable by just adding an additional dirty
>> flag that says atime has changed? Then we only cause a write
>> when we remove the inode from the inode cache, if only atime
>> is updated.
> 
> I think that could be made to work, and it would fix the performance
> issue.
> 
> It is a behaviour change.  At present ext3 (for example) commits everything
> every five seconds.  After a change like this, a crash+recovery could cause
> a file's atime to go backwards by an arbitrarily large time interval - it
> could easily be months.

A second pdflush / workqueue at a slower rate would alleviate that.

Yes, it's a semantic change ... but only in an incredibly small
corner-case ?


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05  7:13                           ` Ingo Molnar
  2007-08-05 13:22                             ` Diego Calleja
@ 2007-08-09  0:57                             ` Greg Trounson
  2007-08-09  1:26                               ` david
  2007-08-09  2:33                               ` Andi Kleen
  1 sibling, 2 replies; 188+ messages in thread
From: Greg Trounson @ 2007-08-09  0:57 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linux Kernel Mailing List

Ingo Molnar wrote:
> * Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> 
>>>> People just need to know about the performance differences - very 
>>>> few realise its more than a fraction of a percent. I'm sure Gentoo 
>>>> will use relatime the moment anyone knows its > 5% 8)
>>> noatime,nodiratime gave 50% of wall-clock kernel rpm build 
>>> performance improvement for Dave Jones, on a beefy box. Unless i 
>>> misunderstood what you meant under 'fraction of a percent' your 
>>> numbers are _WAY_ off.
>> What numbers - I didn't quote any performance numbers ?
> 
> ok, i misunderstood your "very few realise its more than a fraction of a 
> percent" sentence, i thought you were saying it's a fraction of a 
> percent.
> 
> Measurements show that noatime helps 20-30% on regular desktop 
> workloads, easily 50% for kernel builds and much more than that (in 
> excess of 100%) for file-read-intense workloads. We cannot just walk 
> past such a _huge_ performance impact so easily without even reacting to 
> the performance arguments, and i'm happy Ubuntu picked up 
> noatime,nodiratime and is whipping up the floor with Fedora on the 
> desktop.
> 

Sorry I'm just not seeing those gains here.  With my filesystems mounted with atime 
defaults the Quake sources build in 1m28.856s.  A test with ls -ltu verifies that atime is 
working as expected.  When I remount my filesystems with:
mount [fs] -o remount,noatime,nodiratime
I get a compile time of 1m23.368s, a mere 6% improvement.

This is on a dual-core Athlon 4200+ box running 2.6.21, so I would have thought this to be 
close to a best-case file I/O test.

Greg

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-09  0:57                             ` Greg Trounson
@ 2007-08-09  1:26                               ` david
  2007-08-09  2:33                               ` Andi Kleen
  1 sibling, 0 replies; 188+ messages in thread
From: david @ 2007-08-09  1:26 UTC (permalink / raw)
  To: Greg Trounson; +Cc: Ingo Molnar, Linux Kernel Mailing List

On Thu, 9 Aug 2007, Greg Trounson wrote:

>>  Measurements show that noatime helps 20-30% on regular desktop workloads,
>>  easily 50% for kernel builds and much more than that (in excess of 100%)
>>  for file-read-intense workloads. We cannot just walk past such a _huge_
>>  performance impact so easily without even reacting to the performance
>>  arguments, and i'm happy Ubuntu picked up noatime,nodiratime and is
>>  whipping up the floor with Fedora on the desktop.
>> 
>
> Sorry I'm just not seeing those gains here.  With my filesystems mounted with 
> atime defaults the Quake sources build in 1m28.856s.  A test with ls -ltu 
> verifies that atime is working as expected.  When I remount my filesystems 
> with:
> mount [fs] -o remount,noatime,nodiratime
> I get a compile time of 1m23.368s, a mere 6% improvement.
>
> This is on a dual-core Athlon 4200+ box running 2.6.21, so I would have 
> thought this to be close to a best-case file I/O test.

what sort of disks does this box have? and what filesystem? slower 
disks/filesystems can result in this showing a larger difference.

however 6% is a fairly significant gain.

David Lang

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-09  0:57                             ` Greg Trounson
  2007-08-09  1:26                               ` david
@ 2007-08-09  2:33                               ` Andi Kleen
  1 sibling, 0 replies; 188+ messages in thread
From: Andi Kleen @ 2007-08-09  2:33 UTC (permalink / raw)
  To: Greg Trounson; +Cc: Ingo Molnar, Linux Kernel Mailing List

Greg Trounson <gregt@maths.otago.ac.nz> writes:

> mount [fs] -o remount,noatime,nodiratime

nodiratime is implied in noatime.

> I get a compile time of 1m23.368s, a mere 6% improvement.

6% is nothing to sneeze at. A lot of optimizations would kill for less

-Andi

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 16:01         ` Ray Lee
  2007-08-04 17:15           ` david
@ 2007-08-09  5:11           ` david
  1 sibling, 0 replies; 188+ messages in thread
From: david @ 2007-08-09  5:11 UTC (permalink / raw)
  To: Ray Lee
  Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra, linux-mm,
	linux-kernel, miklos, akpm, neilb, dgc, tomoki.sekiyama.qu,
	nikita, trond.myklebust, yingchao.zhou, richard, netdev

On Sat, 4 Aug 2007, Ray Lee wrote:

> On 8/4/07, david@lang.hm <david@lang.hm> wrote:
>> On Sat, 4 Aug 2007, Ingo Molnar wrote:
>>
> At least on a surface level, your report has some similarities to
> http://lkml.org/lkml/2007/5/21/84 . In that message, John Miller
> mentions several things he tried without effect:
>
> < - I increased the max allowed receive buffer through
> < proc/sys/net/core/rmem_max and the application calls the right
> < syscall. "netstat -su" does not show any "packet receive errors".

mercury1:/proc/sys/net/core# cat rmem_*
124928
131071
mercury1:/proc/sys/net/core# netstat -su
Udp:
     697853177 packets received
     10025642 packets to unknown port received.
     191726680 packet receive errors
     63194 packets sent
     RcvbufErrors: 191726680
UdpLite:
mercury1:/proc/sys/net/core# echo "512000" >rmem_max

> < - After getting "kernel: swapper: page allocation failure.
> < order:0, mode:0x20", I increased /proc/sys/vm/min_free_kbytes

I have not seen any similar errors

> < - ixgb.txt in kernel network documentation suggests to increase
> < net.core.netdev_max_backlog to 300000. This did not help.

mercury1:/proc/sys/net/core# cat netdev_*
300
1000
mercury1:/proc/sys/net/core# echo "300000" >netdev_max_backlog

> < - I also had to increase net.core.optmem_max, because the default
> < value was too small for 700 multicast groups.

I'm not running multicast.

> As they're all pretty simple to test, it may be worthwhile to give
> them a shot just to rule things out.

unfortunantly the load is not high enough right now to see a real 
difference (it's only doing ~1400 logs/sec) I'll catch it at a higher load 
point to see if these make any difference.

David Lang

> Ray
>

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-04 16:37           ` Ingo Molnar
                               ` (3 preceding siblings ...)
  2007-08-05  0:26             ` Andi Kleen
@ 2007-08-09  6:25             ` Lionel Elie Mamane
  2007-08-09 15:02               ` Chuck Ebbert
  4 siblings, 1 reply; 188+ messages in thread
From: Lionel Elie Mamane @ 2007-08-09  6:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, linux-mm, linux-kernel, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingcha

On Sat, Aug 04, 2007 at 06:37:33PM +0200, Ingo Molnar wrote:
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:

>> The fact is, ext3 *sucks* at fsync. I hate hate hate it. It's
>> totally unusable, imnsho.

> yeah, it's really ugly. But otherwise i've got no real complaint
> about ext3 - with the obligatory qualification that
> "noatime,nodiratime" in /etc/fstab is a must. This speeds up things
> very visibly (...). So for most file workloads we give Windows a
> 20%-30% performance edge, for almost nothing.

It has been years since I used MS Windows much, but from my memories
of my these days, I was under the impression that it (at least the NT
line, the only surviving line these days) also maintained "last
accessed" times. Except I only ever saw it at "right now" because the
file explorer ... accesses the file before getting this metadata or
something like that (when you right-click on a file and ask for its
properties). It has creation and last modification time, too.

So, if my memories are correct, there is no performance edge to be
conceded by having atime (but one to be gained by not having atime).

-- 
Lionel

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-09  6:25             ` Lionel Elie Mamane
@ 2007-08-09 15:02               ` Chuck Ebbert
  2007-08-09 16:22                 ` Diego Calleja
  0 siblings, 1 reply; 188+ messages in thread
From: Chuck Ebbert @ 2007-08-09 15:02 UTC (permalink / raw)
  To: Lionel Elie Mamane, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
	linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingcha

On 08/09/2007 02:25 AM, Lionel Elie Mamane wrote:
> 
>> yeah, it's really ugly. But otherwise i've got no real complaint
>> about ext3 - with the obligatory qualification that
>> "noatime,nodiratime" in /etc/fstab is a must. This speeds up things
>> very visibly (...). So for most file workloads we give Windows a
>> 20%-30% performance edge, for almost nothing.
> 
> It has been years since I used MS Windows much, but from my memories
> of my these days, I was under the impression that it (at least the NT
> line, the only surviving line these days) also maintained "last
> accessed" times. Except I only ever saw it at "right now" because the
> file explorer ... accesses the file before getting this metadata or
> something like that (when you right-click on a file and ask for its
> properties). It has creation and last modification time, too.
> 

NT maintains atimes by default, at least up to XP. You have to edit the
registry to turn them off, and it is a single global switch -- not per
mountpoint like Unix.

And it makes a huge difference there, too.

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-09 15:02               ` Chuck Ebbert
@ 2007-08-09 16:22                 ` Diego Calleja
  0 siblings, 0 replies; 188+ messages in thread
From: Diego Calleja @ 2007-08-09 16:22 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: Lionel Elie Mamane, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
	linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingcha

El Thu, 09 Aug 2007 11:02:38 -0400, Chuck Ebbert <cebbert@redhat.com> escribió:

> NT maintains atimes by default, at least up to XP. You have to edit the
> registry to turn them off, and it is a single global switch -- not per
> mountpoint like Unix.
> 
> And it makes a huge difference there, too.

In windows Vista they've disabled atime updates by default.

And XP maintains atimes, but it uses a trick to avoid the performance
penalty we suffer in linux, similar to what Andi Kleen suggested: they
keep atime updates in memory for one hour, and only sync to disk after
that time - of course they also sync it if there's a oportunity to do it, like
when updating mtime.

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 17/23] mm: count writeback pages per BDI
  2007-08-03 12:37 ` [PATCH 17/23] mm: count writeback " Peter Zijlstra
@ 2007-08-09 19:15   ` Christoph Lameter
  2007-08-09 19:23     ` Peter Zijlstra
  0 siblings, 1 reply; 188+ messages in thread
From: Christoph Lameter @ 2007-08-09 19:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, torvalds

On Fri, 3 Aug 2007, Peter Zijlstra wrote:

>  						page_index(page),
>  						PAGECACHE_TAG_WRITEBACK);
> +			if (bdi_cap_writeback_dirty(bdi))
> +				__dec_bdi_stat(bdi, BDI_WRITEBACK);

Why are these not incremented and decremented in the exact location of 
NR_WRITEBACK?

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 17/23] mm: count writeback pages per BDI
  2007-08-09 19:15   ` Christoph Lameter
@ 2007-08-09 19:23     ` Peter Zijlstra
  2007-08-09 19:27       ` Christoph Lameter
  0 siblings, 1 reply; 188+ messages in thread
From: Peter Zijlstra @ 2007-08-09 19:23 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, torvalds

On Thu, 2007-08-09 at 12:15 -0700, Christoph Lameter wrote:
> On Fri, 3 Aug 2007, Peter Zijlstra wrote:
> 
> >  						page_index(page),
> >  						PAGECACHE_TAG_WRITEBACK);
> > +			if (bdi_cap_writeback_dirty(bdi))
> > +				__dec_bdi_stat(bdi, BDI_WRITEBACK);
> 
> Why are these not incremented and decremented in the exact location of 
> NR_WRITEBACK?

int test_clear_page_writeback(struct page *page)
{
	struct address_space *mapping = page_mapping(page);
	int ret;

	if (mapping) {
		struct backing_dev_info *bdi = mapping->backing_dev_info;
		unsigned long flags;

		write_lock_irqsave(&mapping->tree_lock, flags);
		ret = TestClearPageWriteback(page);
		if (ret) {
			radix_tree_tag_clear(&mapping->page_tree,
						page_index(page),
						PAGECACHE_TAG_WRITEBACK);
			if (bdi_cap_writeback_dirty(bdi)) {
				__dec_bdi_stat(bdi, BDI_WRITEBACK);
				__bdi_writeout_inc(bdi);
			}
		}
		write_unlock_irqrestore(&mapping->tree_lock, flags);
	} else {
		ret = TestClearPageWriteback(page);
	}
	if (ret)
		dec_zone_page_state(page, NR_WRITEBACK);
	return ret;
}

Less conditionals. We already have a branch for mapping, why create
another?


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 17/23] mm: count writeback pages per BDI
  2007-08-09 19:23     ` Peter Zijlstra
@ 2007-08-09 19:27       ` Christoph Lameter
  2007-08-13  8:36         ` Peter Zijlstra
  0 siblings, 1 reply; 188+ messages in thread
From: Christoph Lameter @ 2007-08-09 19:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, torvalds

On Thu, 9 Aug 2007, Peter Zijlstra wrote:

> Less conditionals. We already have a branch for mapping, why create
> another?

Ah. Okay. This also avoids an interrupt enable disable since you can use 
__ functions. Hmmm... Would be good if we could move the vmstat 
NR_WRITEBACK update there too. Can a page without a mapping be under 
writeback? (Direct I/O?)



^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-08 21:21                             ` Andrew Morton
  2007-08-09  0:54                               ` Martin Bligh
@ 2007-08-10  0:21                               ` Bill Davidsen
  1 sibling, 0 replies; 188+ messages in thread
From: Bill Davidsen @ 2007-08-10  0:21 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm

Andrew Morton wrote:
> On Wed, 08 Aug 2007 14:10:15 -0700
> "Martin J. Bligh" <mbligh@mbligh.org> wrote:
> 
>> Why isn't this easily fixable by just adding an additional dirty
>> flag that says atime has changed? Then we only cause a write
>> when we remove the inode from the inode cache, if only atime
>> is updated.
> 
> I think that could be made to work, and it would fix the performance
> issue.
> 
> It is a behaviour change.  At present ext3 (for example) commits everything
> every five seconds.  After a change like this, a crash+recovery could cause
> a file's atime to go backwards by an arbitrarily large time interval - it
> could easily be months.
> 
I would think that (really) updating atime on open would be enough, 
hopefully without being too much. The "lazyatime" thing I was playing 
with only updated on open, final close, write, and fork.

I like the idea of updating once in a while, but one of the benefits of 
noatime is allowing drives to spin down via inactivity. If something 
does get done in the area of less but non-zero atime tracking, perhaps 
that could be taken into account. I have to check what "laptop_mode 
actually does, since my laptops are old installs.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 19:03                               ` david
  2007-08-06  6:52                                 ` Ingo Molnar
@ 2007-08-10  4:04                                 ` Bill Davidsen
  2007-08-11  5:19                                   ` Valdis.Kletnieks
  1 sibling, 1 reply; 188+ messages in thread
From: Bill Davidsen @ 2007-08-10  4:04 UTC (permalink / raw)
  To: david
  Cc: Diego Calleja, Ingo Molnar, Alan Cox, J??rn Engel, Jeff Garzik,
	Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard

david@lang.hm wrote:
> On Sun, 5 Aug 2007, Diego Calleja wrote:
> 
>> El Sun, 5 Aug 2007 09:13:20 +0200, Ingo Molnar <mingo@elte.hu> escribió:
>>
>>> Measurements show that noatime helps 20-30% on regular desktop
>>> workloads, easily 50% for kernel builds and much more than that (in
>>> excess of 100%) for file-read-intense workloads. We cannot just walk
>>
>>
>> And as everybody knows in servers is a popular practice to disable it.
>> According to an interview to the kernel.org admins....
>>
>> "Beyond that, Peter noted, "very little fancy is going on, and that is 
>> good
>> because fancy is hard to maintain." He explained that the only fancy 
>> thing
>> being done is that all filesystems are mounted noatime meaning that the
>> system doesn't have to make writes to the filesystem for files which are
>> simply being read, "that cut the load average in half."
>>
>> I bet that some people would consider such performance hit a bug...
>>
> 
> actually, it's popular practice to disable it by people who know how big 
> a hit it is and know how few programs use it.
> 
> i've been a linux sysadmin for 10 years, and have known about noatime 
> for at least 7 years, but I always thought of it in the catagory of 'use 
> it only on your performance critical machines where you are trying to 
> extract every ounce of performance, and keep an eye out for things 
> misbehaving'
> 
> I never imagined that itwas the 20%+ hit that is being described, and 
> with so little impact, or I would have switched to it across the board 
> years ago.
> 
To get that magnitude you need slow disk with very fast CPU. It helps 
most of systems where the disk hardware is marginal or worse for the i/o 
load. Don't take that as typical.

> I'll bet there are a lot of admins out there in the same boat.
> 
> adding an option in the kernel to change the default sounds like a very 
> good first step, even if the default isn't changed today.
> 

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-08 13:54   ` Andi Kleen
@ 2007-08-10  4:17     ` Bill Davidsen
  0 siblings, 0 replies; 188+ messages in thread
From: Bill Davidsen @ 2007-08-10  4:17 UTC (permalink / raw)
  To: Andi Kleen
  Cc: richard kennedy, Peter Zijlstra, linux-mm, linux-kernel, miklos,
	akpm, neilb, dgc, tomoki.sekiyama.qu, nikita, trond.myklebust,
	yingchao.zhou, torvalds

Andi Kleen wrote:
> richard kennedy <richard@rsk.demon.co.uk> writes:
>> This is on a standard desktop machine so there are lots of other
>> processes running on it, and although there is a degree of variability
>> in the numbers,they are very repeatable and your patch always out
>> performs the stock mm2.
>> looks good to me
> 
> iirc the goal of this is less to get better performance, but to avoid long user visible
> latencies.  Of course if it's faster it's great too, but that's only secondary.
> 
What a trade-off, if you want to get rid of long latency you have to 
live with better throughput. I can live with that. ;-)

Your point well taken, not the intent of the patch, but it may indicate 
where a performance bottleneck happens as well.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-10  4:04                                 ` Bill Davidsen
@ 2007-08-11  5:19                                   ` Valdis.Kletnieks
  0 siblings, 0 replies; 188+ messages in thread
From: Valdis.Kletnieks @ 2007-08-11  5:19 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: david, Diego Calleja, Ingo Molnar, Alan Cox, J??rn Engel,
	Jeff Garzik, Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard

[-- Attachment #1: Type: text/plain, Size: 690 bytes --]

On Fri, 10 Aug 2007 00:04:45 EDT, Bill Davidsen said:

> > I never imagined that itwas the 20%+ hit that is being described, and 
> > with so little impact, or I would have switched to it across the board 
> > years ago.
> > 
> To get that magnitude you need slow disk with very fast CPU. It helps 
> most of systems where the disk hardware is marginal or worse for the i/o 
> load. Don't take that as typical.

I suspect that almost every single laptop with a Core2 Duo in it falls into
that classification, and it's getting worse every year, as we see more
disparity between CPU speeds (increasing) and disk seek times (basically nailed
to the floor for the last decade).


[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-09  0:54                               ` Martin Bligh
@ 2007-08-11 23:14                                 ` Valerie Henson
  0 siblings, 0 replies; 188+ messages in thread
From: Valerie Henson @ 2007-08-11 23:14 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Andrew Morton, Christoph Hellwig, J??rn Engel, Ingo Molnar,
	Jeff Garzik, Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

On Wed, Aug 08, 2007 at 05:54:57PM -0700, Martin Bligh wrote:
> Andrew Morton wrote:
> >On Wed, 08 Aug 2007 14:10:15 -0700
> >"Martin J. Bligh" <mbligh@mbligh.org> wrote:
> >
> >>Why isn't this easily fixable by just adding an additional dirty
> >>flag that says atime has changed? Then we only cause a write
> >>when we remove the inode from the inode cache, if only atime
> >>is updated.
> >
> >I think that could be made to work, and it would fix the performance
> >issue.
> >
> >It is a behaviour change.  At present ext3 (for example) commits everything
> >every five seconds.  After a change like this, a crash+recovery could cause
> >a file's atime to go backwards by an arbitrarily large time interval - it
> >could easily be months.
> 
> A second pdflush / workqueue at a slower rate would alleviate that.

This becomes delayed atime writes.  I'm not sure that it's better to
batch up the writes and do them all in one big seeky go, or to trickle
them out as they are done.  Best of all is not to do them at all.

Note when talking about saving up atime updates to write out that the
final write is going to be sloooooow.  Inodes are typically 128 bytes,
and you may have to do a seek between every one.  Currents disks can
do on the order of 100 seeks a second.  So do a find on 1000 files and
you've just created 10 seconds of I/O hanging out in memory.

-VAL

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 17/23] mm: count writeback pages per BDI
  2007-08-09 19:27       ` Christoph Lameter
@ 2007-08-13  8:36         ` Peter Zijlstra
  0 siblings, 0 replies; 188+ messages in thread
From: Peter Zijlstra @ 2007-08-13  8:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, torvalds

[-- Attachment #1: Type: text/plain, Size: 691 bytes --]

On Thu, 2007-08-09 at 12:27 -0700, Christoph Lameter wrote:
> On Thu, 9 Aug 2007, Peter Zijlstra wrote:
> 
> > Less conditionals. We already have a branch for mapping, why create
> > another?
> 
> Ah. Okay. This also avoids an interrupt enable disable since you can use 
> __ functions. Hmmm... Would be good if we could move the vmstat 
> NR_WRITEBACK update there too. Can a page without a mapping be under 
> writeback? (Direct I/O?)

DIO still uses the mapping afaik (it needs to invalidate the page before
and after the OP).

But you could put the increment in both paths, and use the irq disable
from the mapping branch - which should be the most frequent case anyway.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-06  8:40         ` Brice Figureau
@ 2007-08-14  1:44           ` Stewart Smith
  2007-08-14  2:25             ` Andi Kleen
  0 siblings, 1 reply; 188+ messages in thread
From: Stewart Smith @ 2007-08-14  1:44 UTC (permalink / raw)
  To: Brice Figureau; +Cc: Andi Kleen, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 967 bytes --]

On Mon, 2007-08-06 at 10:40 +0200, Brice Figureau wrote:
> Mysql accesses its database files in O_DIRECT mode.

binlog is written using buffered IO.

for InnoDB, binlog is synced first, then innodb log. on restart (in 5.0)
these are synced back up so you don't get inconsistencies.

and from a quick look at the innobase source, only data file is using
O_DIRECT.

> Since the database fits in RAM, the only kind of access Mysql is doing
> is writing to the innodb log, the mysql binlog and finally to the innodb
> database files.
> There are certainly a whole lot of fsync'ing happening.

yes. Keep in mind that the binlog grows in file size too... so this has
to sync all the metadata as well (ick, i know).
-- 
Stewart Smith, Senior Software Engineer
MySQL AB, www.mysql.com
Office: +14082136540 Ext: 6616
VoIP: 6616@sip.us.mysql.com
Mobile: +61 4 3 8844 332

Jumpstart your cluster:
http://www.mysql.com/consulting/packaged/cluster.html

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 827 bytes --]

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-14  1:44           ` Stewart Smith
@ 2007-08-14  2:25             ` Andi Kleen
  2007-08-14  7:59               ` Brice Figureau
  0 siblings, 1 reply; 188+ messages in thread
From: Andi Kleen @ 2007-08-14  2:25 UTC (permalink / raw)
  To: Stewart Smith; +Cc: Brice Figureau, Andi Kleen, linux-kernel

On Tue, Aug 14, 2007 at 11:44:56AM +1000, Stewart Smith wrote:
> > Since the database fits in RAM, the only kind of access Mysql is doing
> > is writing to the innodb log, the mysql binlog and finally to the innodb
> > database files.
> > There are certainly a whole lot of fsync'ing happening.
> 
> yes. Keep in mind that the binlog grows in file size too... so this has
> to sync all the metadata as well (ick, i know).

It might be an interesting experiment to see if it still happens
with the file system remounted as ext2. ext2 has a much more 
benign fsync than ext3.

-Andi

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-14  2:25             ` Andi Kleen
@ 2007-08-14  7:59               ` Brice Figureau
  0 siblings, 0 replies; 188+ messages in thread
From: Brice Figureau @ 2007-08-14  7:59 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Stewart Smith, linux-kernel

On Tue, 2007-08-14 at 04:25 +0200, Andi Kleen wrote:
> On Tue, Aug 14, 2007 at 11:44:56AM +1000, Stewart Smith wrote:
> > > Since the database fits in RAM, the only kind of access Mysql is doing
> > > is writing to the innodb log, the mysql binlog and finally to the innodb
> > > database files.
> > > There are certainly a whole lot of fsync'ing happening.
> > 
> > yes. Keep in mind that the binlog grows in file size too... so this has
> > to sync all the metadata as well (ick, i know).

Back in the first days of my original bug report I moved the binlogs to
another disk and it didn't change anything to my issue.

On Tue, 2007-08-14 at 04:25 +0200, Andi Kleen wrote:
> It might be an interesting experiment to see if it still happens
> with the file system remounted as ext2. ext2 has a much more 
> benign fsync than ext3.

Is it possible to perform a live remount of the fs on ext2 ?

Beside that, the RAID card has a battery backed RAM in write-back mode,
I was told that fsync don't really hurt in this case (moreover the fs is
mounted in journal=writeback mode).

I'll post soon blktrace files in the original bug report, this will show
exactly what is the disk workload in the baseline case _and_ in the
underload atypical case. Maybe that will help to shed some lights on the
issue?

Anyway, thanks,
-- 
Brice Figureau <brice+lklm@daysofwonder.com>


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 20:36                         ` Christoph Hellwig
  2007-08-06 18:03                           ` Chuck Ebbert
  2007-08-08 21:10                           ` Martin J. Bligh
@ 2007-08-14  9:57                           ` Helge Hafting
  2 siblings, 0 replies; 188+ messages in thread
From: Helge Hafting @ 2007-08-14  9:57 UTC (permalink / raw)
  To: Christoph Hellwig, J??rn Engel, Ingo Molnar, Jeff Garzik,
	Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, david

Christoph Hellwig wrote:
> Umm, no f**king way.  atime selection is 100% policy and belongs into
> userspace.  Add to that the problem that we can't actually re-enable
> atimes because of the way the vfs-level mount flags API is designed.
> Instead of doing such a fugly kernel patch just talk to the handfull
> of distributions that matter to update their defaults.
>   

Indeed.  Just change /bin/mount so it defaults to "noatime"
unless there is an explicit "atime". Similiar for diratime.
Problem solved.

Helge Hafting

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: per device dirty throttling -v8
       [not found]                                   ` <fa.uq0BQtrgp66a08hpsF+vrqXUNC4@ifi.uio.no>
@ 2007-08-15 18:16                                     ` david.balazic
  0 siblings, 0 replies; 188+ messages in thread
From: david.balazic @ 2007-08-15 18:16 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="us-ascii", Size: 1303 bytes --]

On Aug 4, 10:15 pm, Arjan van de Ven <ar...@infradead.org> wrote:
> On Sat, 2007-08-04 at 12:47 -0700, Linus Torvalds wrote:
>
> > On Sat, 4 Aug 2007, Jörn Engel wrote:
>
> > > Given the choice between only "atime" and "noatime" I'd agree with you.
> > > Heck, I use it myself.  But "relatime" seems to combine the best of both
> > > worlds.  It currently just suffers from mount not supporting it in any
> > > relevant distro.
>
> > Well, we could make it the default for the kernel (possibly under a
> > "fast-atime" config option), and then people can add "atime" or "noatime"
> > as they wish, since mount has supported _those_ options for a long time.
>
> there is another trick possible (more involved though, Al will have to
> jump in on that one I suspect): Have 2 types of "dirty inode" states;
> one is the current dirty state (meaning the full range of ext3
> transactions etc) and "lighter" state of "atime-dirty"; which will not
> do the background syncs or journal transactions (so if your machine
> crashes, you lose the atime update) but it does keep atime for most
> normal cases and keeps it standard compliant "except after a crash".

Am I one of the few that thinks this would be a win-win solution ? :-(
I guess it requires a lot more coding than relatime.

Regards,
David Balažic


^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05  0:26             ` Andi Kleen
  2007-08-05 15:00               ` Theodore Tso
  2007-08-05 20:41               ` Christoph Hellwig
@ 2007-08-16 10:18               ` Helge Hafting
  2 siblings, 0 replies; 188+ messages in thread
From: Helge Hafting @ 2007-08-16 10:18 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra, linux-mm,
	linux-kernel, miklos, akpm, neilb, dgc, tomoki.sekiyama.qu,
	nikita, trond.myklebust, yingchao.zhou, richard

Andi Kleen wrote:
> I always thought the right solution would be to just sync atime only
> very very lazily. This means if a inode is only dirty because of an
> atime update put it on a "only write out when there is nothing to do
> or the memory is really needed" list.
>   
Seems like a good idea.  atimes will then be written only by
memory pressure - or umount.  The atimes could be wrong after
a crash, but loosing atimes only is not something
I'd worry about.

Helge Hafting

^ permalink raw reply	[flat|nested] 188+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v8
  2007-08-05 15:00               ` Theodore Tso
  2007-08-06 13:47                 ` Chris Mason
@ 2007-08-17  0:45                 ` Dave Jones
  1 sibling, 0 replies; 188+ messages in thread
From: Dave Jones @ 2007-08-17  0:45 UTC (permalink / raw)
  To: Theodore Tso; +Cc: linux-kernel

On Sun, Aug 05, 2007 at 11:00:29AM -0400, Theodore Tso wrote:

 > P.S.  Yet alternative is to specify noatime on an individual
 > file/directory basis.  We've had this capability for a *long* time,
 > and if a distro were to set noatime for all files in certain
 > hierarchies (i.e., /usr/include) and certain top-level directories
 > (since the chattr +A flag is inherited)

This came across my mind again earlier, and I went digging.
Can you explain how this works?

I've eyeballed the ext2/ext3 code, and feel like I'm missing something obvious.
I'm guessing that for eg, with /usr/include/stdio.h, we check the inodes
for all four parts of path, and if any of them are +A we avoid the
atime update ?  If so, where does that inheritance happen in the code?

	Dave

-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 188+ messages in thread

end of thread, other threads:[~2007-08-17  0:45 UTC | newest]

Thread overview: 188+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-08-03 12:37 [PATCH 00/23] per device dirty throttling -v8 Peter Zijlstra
2007-08-03 12:37 ` [PATCH 01/23] nfs: remove congestion_end() Peter Zijlstra
2007-08-03 12:37 ` [PATCH 02/23] lib: percpu_counter_add Peter Zijlstra
2007-08-03 12:37 ` [PATCH 03/23] lib: percpu_counter variable batch Peter Zijlstra
2007-08-03 12:37 ` [PATCH 04/23] lib: make percpu_counter_add take s64 Peter Zijlstra
2007-08-03 12:37 ` [PATCH 05/23] lib: percpu_counter_set Peter Zijlstra
2007-08-03 12:37 ` [PATCH 06/23] lib: percpu_counter_sum_positive Peter Zijlstra
2007-08-03 12:37 ` [PATCH 07/23] lib: percpu_count_sum() Peter Zijlstra
2007-08-03 12:37 ` [PATCH 08/23] lib: percpu_counter_init error handling Peter Zijlstra
2007-08-03 12:37 ` [PATCH 09/23] lib: percpu_counter_init_irq Peter Zijlstra
2007-08-03 12:37 ` [PATCH 10/23] mm: bdi init hooks Peter Zijlstra
2007-08-03 12:37 ` [PATCH 11/23] containers: " Peter Zijlstra
2007-08-03 12:37 ` [PATCH 12/23] mtd: " Peter Zijlstra
2007-08-03 12:37 ` [PATCH 13/23] mtd: clean up the backing_dev_info usage Peter Zijlstra
2007-08-03 12:37 ` [PATCH 14/23] mtd: give mtdconcat devices their own backing_dev_info Peter Zijlstra
2007-08-03 12:37 ` [PATCH 15/23] mm: scalable bdi statistics counters Peter Zijlstra
2007-08-03 12:37 ` [PATCH 16/23] mm: count reclaimable pages per BDI Peter Zijlstra
2007-08-03 12:37 ` [PATCH 17/23] mm: count writeback " Peter Zijlstra
2007-08-09 19:15   ` Christoph Lameter
2007-08-09 19:23     ` Peter Zijlstra
2007-08-09 19:27       ` Christoph Lameter
2007-08-13  8:36         ` Peter Zijlstra
2007-08-03 12:37 ` [PATCH 18/23] mm: expose BDI statistics in sysfs Peter Zijlstra
2007-08-03 12:37 ` [PATCH 19/23] lib: floating proportions Peter Zijlstra
2007-08-03 12:37 ` [PATCH 20/23] lib: floating proportions _single Peter Zijlstra
2007-08-03 12:37 ` [PATCH 21/23] mm: per device dirty threshold Peter Zijlstra
2007-08-03 12:37 ` [PATCH 22/23] mm: dirty balancing for tasks Peter Zijlstra
2007-08-03 12:37 ` [PATCH 23/23] debug: sysfs files for the current ratio/size/total Peter Zijlstra
2007-08-03 22:21 ` [PATCH 00/23] per device dirty throttling -v8 Linus Torvalds
2007-08-04  6:32   ` Ingo Molnar
2007-08-04  7:07     ` Ingo Molnar
2007-08-04  7:44       ` david
2007-08-04 16:01         ` Ray Lee
2007-08-04 17:15           ` david
2007-08-09  5:11           ` david
2007-08-04 10:33       ` Ingo Molnar
2007-08-04 16:17         ` Linus Torvalds
2007-08-04 16:37           ` Ingo Molnar
2007-08-04 16:51             ` Andrew Morton
2007-08-04 16:56               ` Ingo Molnar
2007-08-04 20:23                 ` Alan Cox
2007-08-04 17:02             ` Diego Calleja
2007-08-04 17:17               ` Ingo Molnar
2007-08-04 17:38                 ` Diego Calleja
2007-08-04 17:51                   ` Diego Calleja
2007-08-08 10:43                 ` Karel Zak
2007-08-04 17:39             ` Linus Torvalds
2007-08-04 18:08               ` Jeff Garzik
2007-08-04 19:12                 ` Jörn Engel
2007-08-04 19:21                   ` Ingo Molnar
2007-08-04 19:26                     ` Jörn Engel
2007-08-04 19:42                       ` Jörn Engel
2007-08-05 20:36                         ` Christoph Hellwig
2007-08-06 18:03                           ` Chuck Ebbert
2007-08-06 18:53                             ` Jeff Garzik
2007-08-06 19:37                             ` Alan Cox
2007-08-06 19:46                               ` Chuck Ebbert
2007-08-07  7:05                                 ` Ingo Molnar
2007-08-08 21:10                           ` Martin J. Bligh
2007-08-08 21:21                             ` Andrew Morton
2007-08-09  0:54                               ` Martin Bligh
2007-08-11 23:14                                 ` Valerie Henson
2007-08-10  0:21                               ` Bill Davidsen
2007-08-14  9:57                           ` Helge Hafting
2007-08-04 19:47                       ` Linus Torvalds
2007-08-04 19:49                         ` Linus Torvalds
2007-08-04 20:00                         ` Ingo Molnar
2007-08-04 20:11                           ` Ingo Molnar
2007-08-04 20:13                             ` Arjan van de Ven
2007-08-05  8:18                           ` [patch] add noatime/atime boot options, CONFIG_DEFAULT_NOATIME Ingo Molnar
2007-08-04 20:13                         ` [PATCH 00/23] per device dirty throttling -v8 Arjan van de Ven
2007-08-04 21:48                           ` Theodore Tso
2007-08-05 18:01                             ` Arjan van de Ven
2007-08-05 20:34                               ` Christoph Hellwig
     [not found]                         ` <fa.7rstQpXif2z9y2n2HD+qxLFnueg@ifi.uio.no>
     [not found]                           ` <fa.6VOZrceT65Vh8CIRIta0zSg2V38@ifi.uio.no>
     [not found]                             ` <fa.xcZCTa5cHDOhrcyXZ2gZbzbu7g0@ifi.uio.no>
     [not found]                               ` <fa.JjZwG90x+07YaOx8h5VLN+9AL/8@ifi.uio.no>
     [not found]                                 ` <fa.V9U4mAEXVjNqblhzu7GRmxif7Uw@ifi.uio.no>
     [not found]                                   ` <fa.uq0BQtrgp66a08hpsF+vrqXUNC4@ifi.uio.no>
2007-08-15 18:16                                     ` david.balazic
2007-08-04 20:11                     ` [PATCH 00/23] " Alan Cox
2007-08-04 20:28                       ` Jeff Garzik
2007-08-04 21:47                         ` Alan Cox
2007-08-04 23:51                           ` Claudio Martins
2007-08-05  0:49                             ` Alan Cox
2007-08-05  7:28                               ` Ingo Molnar
2007-08-05 10:29                                 ` Jakob Oestergaard
2007-08-05 12:46                                 ` Alan Cox
2007-08-05 12:58                                   ` Ingo Molnar
2007-08-05 13:29                                     ` Willy Tarreau
2007-08-06  6:57                                       ` Ingo Molnar
2007-08-06 13:12                                         ` Willy Tarreau
2007-08-05 14:46                               ` Theodore Tso
2007-08-05 17:55                                 ` Ingo Molnar
2007-08-05 17:59                                   ` Jeff Garzik
2007-08-05 18:09                                     ` Ingo Molnar
2007-08-05 18:08                                 ` Arjan van de Ven
2007-08-07 21:20                             ` Bill Davidsen
2007-08-05  7:18                           ` Ingo Molnar
2007-08-07 18:55                         ` Bill Davidsen
2007-08-07 19:35                           ` Alan Cox
2007-08-08 17:44                             ` Bill Davidsen
2007-08-04 20:28                       ` Ingo Molnar
2007-08-04 20:34                         ` Arjan van de Ven
2007-08-04 21:03                         ` Ingo Molnar
2007-08-04 21:51                           ` Alan Cox
2007-08-05  7:21                             ` Ingo Molnar
2007-08-05  7:29                               ` Andrew Morton
2007-08-05  7:39                                 ` Ingo Molnar
2007-08-05  8:53                               ` Willy Tarreau
2007-08-05 14:17                                 ` Jörn Engel
2007-08-05 18:02                                   ` Arjan van de Ven
2007-08-05 18:37                                     ` Jörn Engel
2007-08-05 20:21                                       ` Jörn Engel
2007-08-05 20:33                                         ` Andrew Morton
2007-08-05 12:47                               ` Alan Cox
2007-08-05 12:56                                 ` Ingo Molnar
2007-08-05 18:44                               ` Dave Jones
2007-08-05 18:58                                 ` adi
2007-08-06  6:39                                 ` Ingo Molnar
2007-08-06 15:59                                   ` Dave Jones
2007-08-06 16:16                                     ` Ingo Molnar
2007-08-05  7:37                             ` Ingo Molnar
2007-08-05  9:04                               ` Jeff Garzik
2007-08-05 12:43                               ` Alan Cox
2007-08-05 12:54                                 ` Ingo Molnar
2007-08-05 13:37                                   ` Alan Cox
2007-08-05 18:08                                     ` Ingo Molnar
2007-08-05 19:11                                       ` Alan Cox
2007-08-08 18:22                                       ` Bill Davidsen
2007-08-08 19:39                                         ` Jeff Garzik
2007-08-08 20:31                                           ` Bill Davidsen
2007-08-08 23:18                                           ` Alan Cox
2007-08-07 19:09                             ` Bill Davidsen
2007-08-04 21:48                         ` Alan Cox
2007-08-05  7:13                           ` Ingo Molnar
2007-08-05 13:22                             ` Diego Calleja
2007-08-05 19:03                               ` david
2007-08-06  6:52                                 ` Ingo Molnar
2007-08-10  4:04                                 ` Bill Davidsen
2007-08-11  5:19                                   ` Valdis.Kletnieks
2007-08-06  6:58                               ` Ingo Molnar
2007-08-09  0:57                             ` Greg Trounson
2007-08-09  1:26                               ` david
2007-08-09  2:33                               ` Andi Kleen
2007-08-04 22:39                         ` Ilpo Järvinen
2007-08-05 10:20                 ` Jakob Oestergaard
2007-08-05 10:42                   ` Jeff Garzik
2007-08-05 10:58                     ` Jakob Oestergaard
2007-08-05 12:46                       ` Ingo Molnar
2007-08-05 13:46                         ` Jakob Oestergaard
2007-08-05 16:45                         ` Linus Torvalds
2007-08-05 19:09                           ` Ingo Molnar
2007-08-05 19:22                             ` [patch] implement smarter atime updates support Ingo Molnar
2007-08-05 19:28                               ` [patch] implement smarter atime updates support, v2 Ingo Molnar
2007-08-05 20:42                                 ` Theodore Tso
2007-08-06  5:36                                   ` Ingo Molnar
2007-08-05 19:53                               ` [patch] implement smarter atime updates support Arjan van de Ven
2007-08-05 20:04                                 ` Alan Cox
2007-08-05 20:22                                   ` Arjan van de Ven
2007-08-05 19:29                             ` [PATCH 00/23] per device dirty throttling -v8 Alan Cox
2007-08-05 19:32                               ` Ingo Molnar
2007-08-05 23:43                     ` David Chinner
2007-08-05  0:26             ` Andi Kleen
2007-08-05 15:00               ` Theodore Tso
2007-08-06 13:47                 ` Chris Mason
2007-08-17  0:45                 ` Dave Jones
2007-08-05 20:41               ` Christoph Hellwig
2007-08-06 10:42                 ` Andi Kleen
2007-08-16 10:18               ` Helge Hafting
2007-08-09  6:25             ` Lionel Elie Mamane
2007-08-09 15:02               ` Chuck Ebbert
2007-08-09 16:22                 ` Diego Calleja
2007-08-04 16:41           ` Andrew Morton
2007-08-04 17:26             ` Nikita Danilov
2007-08-04 19:16             ` Florian Weimer
2007-08-05  6:00               ` Andrew Morton
2007-08-05  7:57                 ` Florian Weimer
2007-08-05 20:43                   ` Christoph Hellwig
2007-08-05 22:46               ` Theodore Tso
2007-08-06  0:24               ` David Chinner
2007-08-05  0:28         ` Andi Kleen
2007-08-04 16:15       ` Linus Torvalds
2007-08-05 17:22     ` Brice Figureau
2007-08-05 22:17       ` Andi Kleen
2007-08-06  8:40         ` Brice Figureau
2007-08-14  1:44           ` Stewart Smith
2007-08-14  2:25             ` Andi Kleen
2007-08-14  7:59               ` Brice Figureau
2007-08-06 20:26 ` Miklos Szeredi
2007-08-08 12:25 ` richard kennedy
2007-08-08 13:54   ` Andi Kleen
2007-08-10  4:17     ` Bill Davidsen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).