LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [performance bug] kernel building regression on 64 LCPUs machine
@ 2011-01-19  1:55 Alex,Shi
  2011-01-19  2:03 ` Shaohua Li
                   ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Alex,Shi @ 2011-01-19  1:55 UTC (permalink / raw)
  To: czoccolo; +Cc: vgoyal, jaxboe, linux-kernel, Li, Shaohua, Chen, Tim C

Shaohua and I tested kernel building performance on latest kernel. and
found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file
system. We find this performance dropping is due to commit
749ef9f8423054e326f. If we revert this patch or just change the
WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be
recovered. 

iostat report show with the commit, read request merge number increased
and write request merge dropped. The total request size increased and
queue length dropped. So we tested another patch: only change WRITE_SYNC
to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected.

we didn't test deadline IO mode, just test cfq. seems insert write
request into sync queue effect much read performance, but we don't know
details. What's your comments of this? 

iostat of .37 kernel:
rrqm/s   wrqm/s r/s     w/s     rMB/s  wMB/s avgrq-sz avgqu-sz   await  svctm  %util
22.5772 96.46 92.3742 14.747  1.0048  0.439474 34.8557 0.18078 3.8076 0.30447 2.94302
iostat of commit reverted .37:
26.6223 80.21 107.875 6.03538 1.51415 0 41.3275 0.153385 3.80569 0.377231 3.22323

vmstat report show, read bandwidth dropping: 
vmstat of .37:
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
3.4 52.6 0.0 64303937.0 16466.7 121544.5 0.0 0.0 2102.7 1914.6 7414.1 3185.7 2.0 1.0 80.3 16.7 0.0
vmstat of revert all from .37
2.2 35.8 0.0 64306767.4 17265.6 126101.2 0.0 0.0 2415.8 1619.1 8532.2 3556.2 2.5 1.1 83.0 13.3 0.0

Regards
Alex

===
diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 34a4861..27ac2f3 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -294,7 +294,7 @@ void journal_commit_transaction(journal_t *journal)
 	int first_tag = 0;
 	int tag_flag;
 	int i;
-	int write_op = WRITE_SYNC;
+	int write_op = WRITE;
 
 	/*
 	 * First job: lock down the current transaction and wait for
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index f3ad159..69ff08e 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -329,7 +329,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	int tag_bytes = journal_tag_bytes(journal);
 	struct buffer_head *cbh = NULL; /* For transactional checksums */
 	__u32 crc32_sum = ~0;
-	int write_op = WRITE_SYNC;
+	int write_op = WRITE;
 
 	/*
 	 * First job: lock down the current transaction and wait for



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-01-19  1:55 [performance bug] kernel building regression on 64 LCPUs machine Alex,Shi
@ 2011-01-19  2:03 ` Shaohua Li
  2011-01-19 12:56   ` Jan Kara
  2011-01-20 15:16   ` Vivek Goyal
  2011-01-19 14:32 ` Ted Ts'o
  2011-01-21  7:23 ` Corrado Zoccolo
  2 siblings, 2 replies; 38+ messages in thread
From: Shaohua Li @ 2011-01-19  2:03 UTC (permalink / raw)
  To: Shi, Alex, jack, tytso
  Cc: czoccolo, vgoyal, jaxboe, linux-kernel, Chen, Tim C

add Jan and Theodore to the loop.

On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote:
> Shaohua and I tested kernel building performance on latest kernel. and
> found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file
> system. We find this performance dropping is due to commit
> 749ef9f8423054e326f. If we revert this patch or just change the
> WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be
> recovered. 
> 
> iostat report show with the commit, read request merge number increased
> and write request merge dropped. The total request size increased and
> queue length dropped. So we tested another patch: only change WRITE_SYNC
> to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected.
since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue.

> we didn't test deadline IO mode, just test cfq. seems insert write
> request into sync queue effect much read performance, but we don't know
> details. What's your comments of this? 
> 
> iostat of .37 kernel:
> rrqm/s   wrqm/s r/s     w/s     rMB/s  wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> 22.5772 96.46 92.3742 14.747  1.0048  0.439474 34.8557 0.18078 3.8076 0.30447 2.94302
> iostat of commit reverted .37:
> 26.6223 80.21 107.875 6.03538 1.51415 0 41.3275 0.153385 3.80569 0.377231 3.22323
> 
> vmstat report show, read bandwidth dropping: 
> vmstat of .37:
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
> 3.4 52.6 0.0 64303937.0 16466.7 121544.5 0.0 0.0 2102.7 1914.6 7414.1 3185.7 2.0 1.0 80.3 16.7 0.0
> vmstat of revert all from .37
> 2.2 35.8 0.0 64306767.4 17265.6 126101.2 0.0 0.0 2415.8 1619.1 8532.2 3556.2 2.5 1.1 83.0 13.3 0.0
> 
> Regards
> Alex
> 
> ===
> diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> index 34a4861..27ac2f3 100644
> --- a/fs/jbd/commit.c
> +++ b/fs/jbd/commit.c
> @@ -294,7 +294,7 @@ void journal_commit_transaction(journal_t *journal)
>  	int first_tag = 0;
>  	int tag_flag;
>  	int i;
> -	int write_op = WRITE_SYNC;
> +	int write_op = WRITE;
>  
>  	/*
>  	 * First job: lock down the current transaction and wait for
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index f3ad159..69ff08e 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -329,7 +329,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  	int tag_bytes = journal_tag_bytes(journal);
>  	struct buffer_head *cbh = NULL; /* For transactional checksums */
>  	__u32 crc32_sum = ~0;
> -	int write_op = WRITE_SYNC;
> +	int write_op = WRITE;
>  
>  	/*
>  	 * First job: lock down the current transaction and wait for
> 
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-01-19  2:03 ` Shaohua Li
@ 2011-01-19 12:56   ` Jan Kara
  2011-01-20  7:52     ` Alex,Shi
  2011-01-20 15:16   ` Vivek Goyal
  1 sibling, 1 reply; 38+ messages in thread
From: Jan Kara @ 2011-01-19 12:56 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Shi, Alex, jack, tytso, czoccolo, vgoyal, jaxboe, linux-kernel,
	Chen, Tim C

On Wed 19-01-11 10:03:26, Shaohua Li wrote:
> add Jan and Theodore to the loop.
  Thanks.

> On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote:
> > Shaohua and I tested kernel building performance on latest kernel. and
> > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file
> > system. We find this performance dropping is due to commit
> > 749ef9f8423054e326f. If we revert this patch or just change the
> > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be
> > recovered. 
> > 
> > iostat report show with the commit, read request merge number increased
> > and write request merge dropped. The total request size increased and
> > queue length dropped. So we tested another patch: only change WRITE_SYNC
> > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected.
> since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue.
> 
> > we didn't test deadline IO mode, just test cfq. seems insert write
> > request into sync queue effect much read performance, but we don't know
> > details. What's your comments of this? 
  Indeed it seems that the optimization of the case where we wait for the
transaction is negatively impacting the performance when we are not. Does
patch below help for your load? It refines the logic when WRITE_SYNC
is needed (of course, we should also test whether the patch works for fsync
heavy loads as well).
  The patch is mostly a proof of concept and only lightly tested so be
careful...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
---

>From 6060f56ac919ac6b146057787af219bbf2503ea0 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Wed, 19 Jan 2011 13:45:04 +0100
Subject: [PATCH] jbd2: Refine commit writeout logic

Currently we write out all journal buffers in WRITE_SYNC mode. This improves
performance for fsync heavy workloads but hinders performance when writes
are mostly asynchronous. So add possibility for callers starting a transaction
commit to specify whether they are going to wait for the commit and submit
journal writes in WRITE_SYNC mode only in that case.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/fsync.c       |    2 +-
 fs/ext4/super.c       |    2 +-
 fs/jbd2/checkpoint.c  |    2 +-
 fs/jbd2/commit.c      |    4 ++--
 fs/jbd2/journal.c     |   19 ++++++++++---------
 fs/jbd2/transaction.c |   13 ++++++-------
 fs/ocfs2/aops.c       |    2 +-
 fs/ocfs2/super.c      |    2 +-
 include/linux/jbd2.h  |   18 ++++++++++--------
 9 files changed, 33 insertions(+), 31 deletions(-)

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 7829b28..19434da 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -198,7 +198,7 @@ int ext4_sync_file(struct file *file, int datasync)
 		return ext4_force_commit(inode->i_sb);
 
 	commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
-	if (jbd2_log_start_commit(journal, commit_tid)) {
+	if (jbd2_log_start_commit(journal, commit_tid, true)) {
 		/*
 		 * When the journal is on a different device than the
 		 * fs data disk, we need to issue the barrier in
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 0f10ccd..eee8486 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4121,7 +4121,7 @@ static int ext4_sync_fs(struct super_block *sb, int wait)
 
 	trace_ext4_sync_fs(sb, wait);
 	flush_workqueue(sbi->dio_unwritten_wq);
-	if (jbd2_journal_start_commit(sbi->s_journal, &target)) {
+	if (jbd2_journal_start_commit(sbi->s_journal, &target, true)) {
 		if (wait)
 			jbd2_log_wait_commit(sbi->s_journal, target);
 	}
diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 6a79fd0..3436d53 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -309,7 +309,7 @@ static int __process_buffer(journal_t *journal, struct journal_head *jh,
 			       "Waiting for Godot: block %llu\n",
 			       journal->j_devname,
 			       (unsigned long long) bh->b_blocknr);
-		jbd2_log_start_commit(journal, tid);
+		jbd2_log_start_commit(journal, tid, true);
 		jbd2_log_wait_commit(journal, tid);
 		ret = 1;
 	} else if (!buffer_dirty(bh)) {
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index f3ad159..19973eb 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -329,7 +329,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	int tag_bytes = journal_tag_bytes(journal);
 	struct buffer_head *cbh = NULL; /* For transactional checksums */
 	__u32 crc32_sum = ~0;
-	int write_op = WRITE_SYNC;
+	int write_op = WRITE;
 
 	/*
 	 * First job: lock down the current transaction and wait for
@@ -368,7 +368,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	 * we unplug the device. We don't do explicit unplugging in here,
 	 * instead we rely on sync_buffer() doing the unplug for us.
 	 */
-	if (commit_transaction->t_synchronous_commit)
+	if (tid_geq(journal->j_commit_waited, commit_transaction->t_tid))
 		write_op = WRITE_SYNC_PLUG;
 	trace_jbd2_commit_locking(journal, commit_transaction);
 	stats.run.rs_wait = commit_transaction->t_max_wait;
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 9e46869..e278fa2 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -475,7 +475,7 @@ int __jbd2_log_space_left(journal_t *journal)
 /*
  * Called under j_state_lock.  Returns true if a transaction commit was started.
  */
-int __jbd2_log_start_commit(journal_t *journal, tid_t target)
+int __jbd2_log_start_commit(journal_t *journal, tid_t target, bool will_wait)
 {
 	/*
 	 * Are we already doing a recent enough commit?
@@ -485,7 +485,8 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t target)
 		 * We want a new commit: OK, mark the request and wakeup the
 		 * commit thread.  We do _not_ do the commit ourselves.
 		 */
-
+		if (will_wait && !tid_geq(journal->j_commit_waited, target))
+			journal->j_commit_waited = target;
 		journal->j_commit_request = target;
 		jbd_debug(1, "JBD: requesting commit %d/%d\n",
 			  journal->j_commit_request,
@@ -496,12 +497,12 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t target)
 	return 0;
 }
 
-int jbd2_log_start_commit(journal_t *journal, tid_t tid)
+int jbd2_log_start_commit(journal_t *journal, tid_t tid, bool will_wait)
 {
 	int ret;
 
 	write_lock(&journal->j_state_lock);
-	ret = __jbd2_log_start_commit(journal, tid);
+	ret = __jbd2_log_start_commit(journal, tid, will_wait);
 	write_unlock(&journal->j_state_lock);
 	return ret;
 }
@@ -524,7 +525,7 @@ int jbd2_journal_force_commit_nested(journal_t *journal)
 	read_lock(&journal->j_state_lock);
 	if (journal->j_running_transaction && !current->journal_info) {
 		transaction = journal->j_running_transaction;
-		__jbd2_log_start_commit(journal, transaction->t_tid);
+		__jbd2_log_start_commit(journal, transaction->t_tid, true);
 	} else if (journal->j_committing_transaction)
 		transaction = journal->j_committing_transaction;
 
@@ -544,7 +545,7 @@ int jbd2_journal_force_commit_nested(journal_t *journal)
  * if a transaction is going to be committed (or is currently already
  * committing), and fills its tid in at *ptid
  */
-int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid)
+int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid, bool will_wait)
 {
 	int ret = 0;
 
@@ -552,7 +553,7 @@ int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid)
 	if (journal->j_running_transaction) {
 		tid_t tid = journal->j_running_transaction->t_tid;
 
-		__jbd2_log_start_commit(journal, tid);
+		__jbd2_log_start_commit(journal, tid, will_wait);
 		/* There's a running transaction and we've just made sure
 		 * it's commit has been scheduled. */
 		if (ptid)
@@ -1559,7 +1560,7 @@ int jbd2_journal_flush(journal_t *journal)
 	/* Force everything buffered to the log... */
 	if (journal->j_running_transaction) {
 		transaction = journal->j_running_transaction;
-		__jbd2_log_start_commit(journal, transaction->t_tid);
+		__jbd2_log_start_commit(journal, transaction->t_tid, true);
 	} else if (journal->j_committing_transaction)
 		transaction = journal->j_committing_transaction;
 
@@ -1675,7 +1676,7 @@ void __jbd2_journal_abort_hard(journal_t *journal)
 	journal->j_flags |= JBD2_ABORT;
 	transaction = journal->j_running_transaction;
 	if (transaction)
-		__jbd2_log_start_commit(journal, transaction->t_tid);
+		__jbd2_log_start_commit(journal, transaction->t_tid, false);
 	write_unlock(&journal->j_state_lock);
 }
 
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 3948932..3d5215e 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -222,7 +222,7 @@ repeat:
 		atomic_sub(nblocks, &transaction->t_outstanding_credits);
 		prepare_to_wait(&journal->j_wait_transaction_locked, &wait,
 				TASK_UNINTERRUPTIBLE);
-		__jbd2_log_start_commit(journal, transaction->t_tid);
+		__jbd2_log_start_commit(journal, transaction->t_tid, false);
 		read_unlock(&journal->j_state_lock);
 		schedule();
 		finish_wait(&journal->j_wait_transaction_locked, &wait);
@@ -465,7 +465,7 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, int gfp_mask)
 	spin_unlock(&transaction->t_handle_lock);
 
 	jbd_debug(2, "restarting handle %p\n", handle);
-	__jbd2_log_start_commit(journal, transaction->t_tid);
+	__jbd2_log_start_commit(journal, transaction->t_tid, false);
 	read_unlock(&journal->j_state_lock);
 
 	lock_map_release(&handle->h_lockdep_map);
@@ -1361,8 +1361,6 @@ int jbd2_journal_stop(handle_t *handle)
 		}
 	}
 
-	if (handle->h_sync)
-		transaction->t_synchronous_commit = 1;
 	current->journal_info = NULL;
 	atomic_sub(handle->h_buffer_credits,
 		   &transaction->t_outstanding_credits);
@@ -1383,15 +1381,16 @@ int jbd2_journal_stop(handle_t *handle)
 
 		jbd_debug(2, "transaction too old, requesting commit for "
 					"handle %p\n", handle);
-		/* This is non-blocking */
-		jbd2_log_start_commit(journal, transaction->t_tid);
-
 		/*
 		 * Special case: JBD2_SYNC synchronous updates require us
 		 * to wait for the commit to complete.
 		 */
 		if (handle->h_sync && !(current->flags & PF_MEMALLOC))
 			wait_for_commit = 1;
+
+		/* This is non-blocking */
+		jbd2_log_start_commit(journal, transaction->t_tid,
+				      wait_for_commit);
 	}
 
 	/*
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 1fbb0e2..d493f32 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -1659,7 +1659,7 @@ static int ocfs2_try_to_free_truncate_log(struct ocfs2_super *osb,
 		goto out;
 	}
 
-	if (jbd2_journal_start_commit(osb->journal->j_journal, &target)) {
+	if (jbd2_journal_start_commit(osb->journal->j_journal, &target, true)) {
 		jbd2_log_wait_commit(osb->journal->j_journal, target);
 		ret = 1;
 	}
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 31c3ffd..aab448c 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -414,7 +414,7 @@ static int ocfs2_sync_fs(struct super_block *sb, int wait)
 	}
 
 	if (jbd2_journal_start_commit(OCFS2_SB(sb)->journal->j_journal,
-				      &target)) {
+				      &target, wait)) {
 		if (wait)
 			jbd2_log_wait_commit(OCFS2_SB(sb)->journal->j_journal,
 					     target);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 27e79c2..46aaf45 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -631,11 +631,6 @@ struct transaction_s
 	 */
 	atomic_t		t_handle_count;
 
-	/*
-	 * This transaction is being forced and some process is
-	 * waiting for it to finish.
-	 */
-	unsigned int t_synchronous_commit:1;
 	unsigned int t_flushed_data_blocks:1;
 
 	/*
@@ -900,6 +895,13 @@ struct journal_s
 	tid_t			j_commit_request;
 
 	/*
+	 * Sequence number of the most recent transaction someone is waiting
+	 * for to commit.
+	 * [j_state_lock]
+	 */
+	tid_t			j_commit_waited;
+
+	/*
 	 * Journal uuid: identifies the object (filesystem, LVM volume etc)
 	 * backed by this journal.  This will eventually be replaced by an array
 	 * of uuids, allowing us to index multiple devices within a single
@@ -1200,9 +1202,9 @@ extern void	jbd2_journal_switch_revoke_table(journal_t *journal);
  */
 
 int __jbd2_log_space_left(journal_t *); /* Called with journal locked */
-int jbd2_log_start_commit(journal_t *journal, tid_t tid);
-int __jbd2_log_start_commit(journal_t *journal, tid_t tid);
-int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
+int jbd2_log_start_commit(journal_t *journal, tid_t tid, bool will_wait);
+int __jbd2_log_start_commit(journal_t *journal, tid_t tid, bool will_wait);
+int jbd2_journal_start_commit(journal_t *journal, tid_t *tid, bool will_wait);
 int jbd2_journal_force_commit_nested(journal_t *journal);
 int jbd2_log_wait_commit(journal_t *journal, tid_t tid);
 int jbd2_log_do_checkpoint(journal_t *journal);
-- 
1.7.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-01-19  1:55 [performance bug] kernel building regression on 64 LCPUs machine Alex,Shi
  2011-01-19  2:03 ` Shaohua Li
@ 2011-01-19 14:32 ` Ted Ts'o
  2011-01-20  2:12   ` Shaohua Li
  2011-01-21  7:23 ` Corrado Zoccolo
  2 siblings, 1 reply; 38+ messages in thread
From: Ted Ts'o @ 2011-01-19 14:32 UTC (permalink / raw)
  To: Alex,Shi; +Cc: czoccolo, vgoyal, jaxboe, linux-kernel, Li, Shaohua, Chen, Tim C

On Wed, Jan 19, 2011 at 09:55:48AM +0800, Alex,Shi wrote:
> Shaohua and I tested kernel building performance on latest kernel. and
> found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file
> system. 

What kerenl version were you comparing against?  And what were you
using to benchmark this performance drop?  I'm curious what your
workload was; I'm guesisng it was some kind of mixed read/write
workload?  Were there any fsync()'s involved?

Thanks, regards,

						- Ted

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-01-19 14:32 ` Ted Ts'o
@ 2011-01-20  2:12   ` Shaohua Li
  0 siblings, 0 replies; 38+ messages in thread
From: Shaohua Li @ 2011-01-20  2:12 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Shi, Alex, czoccolo, vgoyal, jaxboe, linux-kernel, Chen, Tim C

On Wed, 2011-01-19 at 22:32 +0800, Ted Ts'o wrote:
> On Wed, Jan 19, 2011 at 09:55:48AM +0800, Alex,Shi wrote:
> > Shaohua and I tested kernel building performance on latest kernel. and
> > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file
> > system. 
> 
> What kerenl version were you comparing against?  And what were you
> using to benchmark this performance drop?  I'm curious what your
> workload was; I'm guesisng it was some kind of mixed read/write
> workload?  Were there any fsync()'s involved?
2.6.36 vs 2.6.37. this exists in 2.6.37-rc1. it's a kbuild, no a mixed
read/write workload. kbuild hasn't fsync involved

Thanks,
Shaohua



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-01-19 12:56   ` Jan Kara
@ 2011-01-20  7:52     ` Alex,Shi
  0 siblings, 0 replies; 38+ messages in thread
From: Alex,Shi @ 2011-01-20  7:52 UTC (permalink / raw)
  To: Jan Kara
  Cc: Li, Shaohua, tytso, czoccolo, vgoyal, jaxboe, linux-kernel, Chen, Tim C

On Wed, 2011-01-19 at 20:56 +0800, Jan Kara wrote:
> On Wed 19-01-11 10:03:26, Shaohua Li wrote:
> > add Jan and Theodore to the loop.
>   Thanks.
> 
> > On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote:
> > > Shaohua and I tested kernel building performance on latest kernel. and
> > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file
> > > system. We find this performance dropping is due to commit
> > > 749ef9f8423054e326f. If we revert this patch or just change the
> > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be
> > > recovered.
> > >
> > > iostat report show with the commit, read request merge number increased
> > > and write request merge dropped. The total request size increased and
> > > queue length dropped. So we tested another patch: only change WRITE_SYNC
> > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected.
> > since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue.
> >
> > > we didn't test deadline IO mode, just test cfq. seems insert write
> > > request into sync queue effect much read performance, but we don't know
> > > details. What's your comments of this?
>   Indeed it seems that the optimization of the case where we wait for the
> transaction is negatively impacting the performance when we are not. Does
> patch below help for your load? It refines the logic when WRITE_SYNC
> is needed (of course, we should also test whether the patch works for fsync
> heavy loads as well).
>   The patch is mostly a proof of concept and only lightly tested so be
> careful...
> 

I tested the patch after remove t_synchronous_commit lines in
include/trace/events/jbd2.h on 2.6.38-rc1 kernel. 
but did not find clear improvement on kbuild. 


         


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-01-19  2:03 ` Shaohua Li
  2011-01-19 12:56   ` Jan Kara
@ 2011-01-20 15:16   ` Vivek Goyal
  2011-01-21  7:17     ` Shaohua Li
  2011-01-26  8:15     ` Shaohua Li
  1 sibling, 2 replies; 38+ messages in thread
From: Vivek Goyal @ 2011-01-20 15:16 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Shi, Alex, jack, tytso, czoccolo, jaxboe, linux-kernel, Chen, Tim C

On Wed, Jan 19, 2011 at 10:03:26AM +0800, Shaohua Li wrote:
> add Jan and Theodore to the loop.
> 
> On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote:
> > Shaohua and I tested kernel building performance on latest kernel. and
> > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file
> > system. We find this performance dropping is due to commit
> > 749ef9f8423054e326f. If we revert this patch or just change the
> > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be
> > recovered. 
> > 
> > iostat report show with the commit, read request merge number increased
> > and write request merge dropped. The total request size increased and
> > queue length dropped. So we tested another patch: only change WRITE_SYNC
> > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected.
> since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue.
> 

Yep, it does sound like reduce write merging. But moving journal commits
back to WRITE, then fsync performance will drop as there will be idling
introduced between fsync thread and journalling thread. So that does
not sound like a good idea either.

Secondly, in presence of mixed workload (some other sync read happening)
WRITES can get less bandwidth and sync workload much more. So by 
marking journal commits as WRITES you might increase the delay there
in completion in presence of other sync workload.

So Jan Kara's approach makes sense that if somebody is waiting on
commit then make it WRITE_SYNC otherwise make it WRITE. Not sure why
did it not work for you. Is it possible to run some traces and do
more debugging that figure out what's happening.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-01-20 15:16   ` Vivek Goyal
@ 2011-01-21  7:17     ` Shaohua Li
  2011-01-26  8:15     ` Shaohua Li
  1 sibling, 0 replies; 38+ messages in thread
From: Shaohua Li @ 2011-01-21  7:17 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Shi, Alex, jack, tytso, czoccolo, jaxboe, linux-kernel, Chen, Tim C

On Thu, 2011-01-20 at 23:16 +0800, Vivek Goyal wrote:
> On Wed, Jan 19, 2011 at 10:03:26AM +0800, Shaohua Li wrote:
> > add Jan and Theodore to the loop.
> > 
> > On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote:
> > > Shaohua and I tested kernel building performance on latest kernel. and
> > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file
> > > system. We find this performance dropping is due to commit
> > > 749ef9f8423054e326f. If we revert this patch or just change the
> > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be
> > > recovered. 
> > > 
> > > iostat report show with the commit, read request merge number increased
> > > and write request merge dropped. The total request size increased and
> > > queue length dropped. So we tested another patch: only change WRITE_SYNC
> > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected.
> > since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue.
> > 
> 
> Yep, it does sound like reduce write merging. But moving journal commits
> back to WRITE, then fsync performance will drop as there will be idling
> introduced between fsync thread and journalling thread. So that does
> not sound like a good idea either.
> 
> Secondly, in presence of mixed workload (some other sync read happening)
> WRITES can get less bandwidth and sync workload much more. So by 
> marking journal commits as WRITES you might increase the delay there
> in completion in presence of other sync workload.
> 
> So Jan Kara's approach makes sense that if somebody is waiting on
> commit then make it WRITE_SYNC otherwise make it WRITE. Not sure why
> did it not work for you. Is it possible to run some traces and do
> more debugging that figure out what's happening.
I'll debug and see if I can find anything.

Thanks,
Shaohua



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-01-19  1:55 [performance bug] kernel building regression on 64 LCPUs machine Alex,Shi
  2011-01-19  2:03 ` Shaohua Li
  2011-01-19 14:32 ` Ted Ts'o
@ 2011-01-21  7:23 ` Corrado Zoccolo
  2011-01-21  7:47   ` Alex,Shi
  2011-01-21  8:20   ` Shaohua Li
  2 siblings, 2 replies; 38+ messages in thread
From: Corrado Zoccolo @ 2011-01-21  7:23 UTC (permalink / raw)
  To: Alex,Shi; +Cc: vgoyal, jaxboe, linux-kernel, Li, Shaohua, Chen, Tim C

On Wed, Jan 19, 2011 at 2:55 AM, Alex,Shi <alex.shi@intel.com> wrote:
> Shaohua and I tested kernel building performance on latest kernel. and
> found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file
> system. We find this performance dropping is due to commit
> 749ef9f8423054e326f. If we revert this patch or just change the
> WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be
> recovered.
>
> iostat report show with the commit, read request merge number increased
> and write request merge dropped. The total request size increased and
> queue length dropped. So we tested another patch: only change WRITE_SYNC
> to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected.
>
> we didn't test deadline IO mode, just test cfq. seems insert write
> request into sync queue effect much read performance, but we don't know
> details. What's your comments of this?
>
> iostat of .37 kernel:
> rrqm/s   wrqm/s r/s     w/s     rMB/s  wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> 22.5772 96.46 92.3742 14.747  1.0048  0.439474 34.8557 0.18078 3.8076 0.30447 2.94302
> iostat of commit reverted .37:
> 26.6223 80.21 107.875 6.03538 1.51415 0 41.3275 0.153385 3.80569 0.377231 3.22323

>From these numbers, it seems to me that with the patch reverted, the
write bandwidth is really low, and probably you are keeping most
written files in the buffer cache during the whole compile, while the
non-reverted kernel is making progress in writing out the files. So
the 'improved' read bandwidth is due to unfairness w.r.t. writes.
Does the result change if you add a final sync and time that as well,
in order to see the full time to make it on disk?

I think that in a more extreme test where you end up filling all the
buffer cache with written data, you will see much longer stalls with
the revert than without.

Thanks,
Corrado

>
> vmstat report show, read bandwidth dropping:
> vmstat of .37:
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
> 3.4 52.6 0.0 64303937.0 16466.7 121544.5 0.0 0.0 2102.7 1914.6 7414.1 3185.7 2.0 1.0 80.3 16.7 0.0
> vmstat of revert all from .37
> 2.2 35.8 0.0 64306767.4 17265.6 126101.2 0.0 0.0 2415.8 1619.1 8532.2 3556.2 2.5 1.1 83.0 13.3 0.0
>
> Regards
> Alex
>
> ===
> diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> index 34a4861..27ac2f3 100644
> --- a/fs/jbd/commit.c
> +++ b/fs/jbd/commit.c
> @@ -294,7 +294,7 @@ void journal_commit_transaction(journal_t *journal)
>        int first_tag = 0;
>        int tag_flag;
>        int i;
> -       int write_op = WRITE_SYNC;
> +       int write_op = WRITE;
>
>        /*
>         * First job: lock down the current transaction and wait for
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index f3ad159..69ff08e 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -329,7 +329,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>        int tag_bytes = journal_tag_bytes(journal);
>        struct buffer_head *cbh = NULL; /* For transactional checksums */
>        __u32 crc32_sum = ~0;
> -       int write_op = WRITE_SYNC;
> +       int write_op = WRITE;
>
>        /*
>         * First job: lock down the current transaction and wait for
>
>
>



-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-01-21  7:23 ` Corrado Zoccolo
@ 2011-01-21  7:47   ` Alex,Shi
  2011-01-21  7:52     ` Alex,Shi
  2011-01-21  8:20   ` Shaohua Li
  1 sibling, 1 reply; 38+ messages in thread
From: Alex,Shi @ 2011-01-21  7:47 UTC (permalink / raw)
  To: Corrado Zoccolo; +Cc: vgoyal, jaxboe, linux-kernel, Li, Shaohua, Chen, Tim C

On Fri, 2011-01-21 at 15:23 +0800, Corrado Zoccolo wrote:
> On Wed, Jan 19, 2011 at 2:55 AM, Alex,Shi <alex.shi@intel.com> wrote:
> > Shaohua and I tested kernel building performance on latest kernel. and
> > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file
> > system. We find this performance dropping is due to commit
> > 749ef9f8423054e326f. If we revert this patch or just change the
> > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be
> > recovered.
> >
> > iostat report show with the commit, read request merge number increased
> > and write request merge dropped. The total request size increased and
> > queue length dropped. So we tested another patch: only change WRITE_SYNC
> > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected.
> >
> > we didn't test deadline IO mode, just test cfq. seems insert write
> > request into sync queue effect much read performance, but we don't know
> > details. What's your comments of this?
> >
> > iostat of .37 kernel:
> > rrqm/s   wrqm/s r/s     w/s     rMB/s  wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> > 22.5772 96.46 92.3742 14.747  1.0048  0.439474 34.8557 0.18078 3.8076 0.30447 2.94302
> > iostat of commit reverted .37:
> > 26.6223 80.21 107.875 6.03538 1.51415 0 41.3275 0.153385 3.80569 0.377231 3.22323
> 
> From these numbers, it seems to me that with the patch reverted, the
> write bandwidth is really low, and probably you are keeping most
> written files in the buffer cache during the whole compile, while the
> non-reverted kernel is making progress in writing out the files. So
> the 'improved' read bandwidth is due to unfairness w.r.t. writes.
> Does the result change if you add a final sync and time that as well,
> in order to see the full time to make it on disk?

Agree with your guess, but kbuild is such kind of benchmark, we can not
change its behavior. :( 


> 
> I think that in a more extreme test where you end up filling all the
> buffer cache with written data, you will see much longer stalls with
> the revert than without.

Have to do this? and if so, it is not kbuild. :) 

BTW, the Jan's patch has a little improvement on kbuild. In many time
testing, it seems about 3% improving. 

The average iostat output of Jan's patch: 
 rrqm/s   wrqm/s r/s     w/s     rMB/s  wMB/s avgrq-sz avgqu-sz   await  svctm  %util
23.419726 87.450685 96.164521 6.748493 1.046438 0.370137 45.182192 0.200685 6.848767 0.394110 3.072192

> 
> Thanks,
> Corrado
> 
> >
> > vmstat report show, read bandwidth dropping:
> > vmstat of .37:
> >  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
> > 3.4 52.6 0.0 64303937.0 16466.7 121544.5 0.0 0.0 2102.7 1914.6 7414.1 3185.7 2.0 1.0 80.3 16.7 0.0
> > vmstat of revert all from .37
> > 2.2 35.8 0.0 64306767.4 17265.6 126101.2 0.0 0.0 2415.8 1619.1 8532.2 3556.2 2.5 1.1 83.0 13.3 0.0
> >
> > Regards
> > Alex
> >
> > ===
> > diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> > index 34a4861..27ac2f3 100644
> > --- a/fs/jbd/commit.c
> > +++ b/fs/jbd/commit.c
> > @@ -294,7 +294,7 @@ void journal_commit_transaction(journal_t *journal)
> >        int first_tag = 0;
> >        int tag_flag;
> >        int i;
> > -       int write_op = WRITE_SYNC;
> > +       int write_op = WRITE;
> >
> >        /*
> >         * First job: lock down the current transaction and wait for
> > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > index f3ad159..69ff08e 100644
> > --- a/fs/jbd2/commit.c
> > +++ b/fs/jbd2/commit.c
> > @@ -329,7 +329,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> >        int tag_bytes = journal_tag_bytes(journal);
> >        struct buffer_head *cbh = NULL; /* For transactional checksums */
> >        __u32 crc32_sum = ~0;
> > -       int write_op = WRITE_SYNC;
> > +       int write_op = WRITE;
> >
> >        /*
> >         * First job: lock down the current transaction and wait for
> >
> >
> >
> 
> 
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-01-21  7:47   ` Alex,Shi
@ 2011-01-21  7:52     ` Alex,Shi
  2011-01-21  8:13       ` Corrado Zoccolo
  0 siblings, 1 reply; 38+ messages in thread
From: Alex,Shi @ 2011-01-21  7:52 UTC (permalink / raw)
  To: Corrado Zoccolo, Jan Kara, tytso
  Cc: vgoyal, jaxboe, linux-kernel, Li, Shaohua, Chen, Tim C

Sorry for forgetting Jan and tytso. added again. 

On Fri, 2011-01-21 at 15:47 +0800, Alex,Shi wrote:
> On Fri, 2011-01-21 at 15:23 +0800, Corrado Zoccolo wrote:
> > On Wed, Jan 19, 2011 at 2:55 AM, Alex,Shi <alex.shi@intel.com> wrote:
> > > Shaohua and I tested kernel building performance on latest kernel. and
> > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file
> > > system. We find this performance dropping is due to commit
> > > 749ef9f8423054e326f. If we revert this patch or just change the
> > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be
> > > recovered.
> > >
> > > iostat report show with the commit, read request merge number increased
> > > and write request merge dropped. The total request size increased and
> > > queue length dropped. So we tested another patch: only change WRITE_SYNC
> > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected.
> > >
> > > we didn't test deadline IO mode, just test cfq. seems insert write
> > > request into sync queue effect much read performance, but we don't know
> > > details. What's your comments of this?
> > >
> > > iostat of .37 kernel:
> > > rrqm/s   wrqm/s r/s     w/s     rMB/s  wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> > > 22.5772 96.46 92.3742 14.747  1.0048  0.439474 34.8557 0.18078 3.8076 0.30447 2.94302
> > > iostat of commit reverted .37:
> > > 26.6223 80.21 107.875 6.03538 1.51415 0 41.3275 0.153385 3.80569 0.377231 3.22323
> > 
> > From these numbers, it seems to me that with the patch reverted, the
> > write bandwidth is really low, and probably you are keeping most
> > written files in the buffer cache during the whole compile, while the
> > non-reverted kernel is making progress in writing out the files. So
> > the 'improved' read bandwidth is due to unfairness w.r.t. writes.
> > Does the result change if you add a final sync and time that as well,
> > in order to see the full time to make it on disk?
> 
> Agree with your guess, but kbuild is such kind of benchmark, we can not
> change its behavior. :( 
> 
> 
> > 
> > I think that in a more extreme test where you end up filling all the
> > buffer cache with written data, you will see much longer stalls with
> > the revert than without.
> 
> Have to do this? and if so, it is not kbuild. :) 
> 
> BTW, the Jan's patch has a little improvement on kbuild. In many time
> testing, it seems about 3% improving. 
> 
> The average iostat output of Jan's patch: 
>  rrqm/s   wrqm/s r/s     w/s     rMB/s  wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> 23.419726 87.450685 96.164521 6.748493 1.046438 0.370137 45.182192 0.200685 6.848767 0.394110 3.072192
> 
> > 
> > Thanks,
> > Corrado
> > 
> > >
> > > vmstat report show, read bandwidth dropping:
> > > vmstat of .37:
> > >  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
> > > 3.4 52.6 0.0 64303937.0 16466.7 121544.5 0.0 0.0 2102.7 1914.6 7414.1 3185.7 2.0 1.0 80.3 16.7 0.0
> > > vmstat of revert all from .37
> > > 2.2 35.8 0.0 64306767.4 17265.6 126101.2 0.0 0.0 2415.8 1619.1 8532.2 3556.2 2.5 1.1 83.0 13.3 0.0
> > >
> > > Regards
> > > Alex
> > >
> > > ===
> > > diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> > > index 34a4861..27ac2f3 100644
> > > --- a/fs/jbd/commit.c
> > > +++ b/fs/jbd/commit.c
> > > @@ -294,7 +294,7 @@ void journal_commit_transaction(journal_t *journal)
> > >        int first_tag = 0;
> > >        int tag_flag;
> > >        int i;
> > > -       int write_op = WRITE_SYNC;
> > > +       int write_op = WRITE;
> > >
> > >        /*
> > >         * First job: lock down the current transaction and wait for
> > > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > > index f3ad159..69ff08e 100644
> > > --- a/fs/jbd2/commit.c
> > > +++ b/fs/jbd2/commit.c
> > > @@ -329,7 +329,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> > >        int tag_bytes = journal_tag_bytes(journal);
> > >        struct buffer_head *cbh = NULL; /* For transactional checksums */
> > >        __u32 crc32_sum = ~0;
> > > -       int write_op = WRITE_SYNC;
> > > +       int write_op = WRITE;
> > >
> > >        /*
> > >         * First job: lock down the current transaction and wait for
> > >
> > >
> > >
> > 
> > 
> > 
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-01-21  7:52     ` Alex,Shi
@ 2011-01-21  8:13       ` Corrado Zoccolo
  0 siblings, 0 replies; 38+ messages in thread
From: Corrado Zoccolo @ 2011-01-21  8:13 UTC (permalink / raw)
  To: Alex,Shi
  Cc: Jan Kara, tytso, vgoyal, jaxboe, linux-kernel, Li, Shaohua, Chen, Tim C

On Fri, Jan 21, 2011 at 8:52 AM, Alex,Shi <alex.shi@intel.com> wrote:
> Sorry for forgetting Jan and tytso. added again.
>
> On Fri, 2011-01-21 at 15:47 +0800, Alex,Shi wrote:
>> On Fri, 2011-01-21 at 15:23 +0800, Corrado Zoccolo wrote:
>> > On Wed, Jan 19, 2011 at 2:55 AM, Alex,Shi <alex.shi@intel.com> wrote:
>> > > Shaohua and I tested kernel building performance on latest kernel. and
>> > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file
>> > > system. We find this performance dropping is due to commit
>> > > 749ef9f8423054e326f. If we revert this patch or just change the
>> > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be
>> > > recovered.
>> > >
>> > > iostat report show with the commit, read request merge number increased
>> > > and write request merge dropped. The total request size increased and
>> > > queue length dropped. So we tested another patch: only change WRITE_SYNC
>> > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected.
>> > >
>> > > we didn't test deadline IO mode, just test cfq. seems insert write
>> > > request into sync queue effect much read performance, but we don't know
>> > > details. What's your comments of this?
>> > >
>> > > iostat of .37 kernel:
>> > > rrqm/s   wrqm/s r/s     w/s     rMB/s  wMB/s avgrq-sz avgqu-sz   await  svctm  %util
>> > > 22.5772 96.46 92.3742 14.747  1.0048  0.439474 34.8557 0.18078 3.8076 0.30447 2.94302
>> > > iostat of commit reverted .37:
>> > > 26.6223 80.21 107.875 6.03538 1.51415 0 41.3275 0.153385 3.80569 0.377231 3.22323
>> >
>> > From these numbers, it seems to me that with the patch reverted, the
>> > write bandwidth is really low, and probably you are keeping most
>> > written files in the buffer cache during the whole compile, while the
>> > non-reverted kernel is making progress in writing out the files. So
>> > the 'improved' read bandwidth is due to unfairness w.r.t. writes.
>> > Does the result change if you add a final sync and time that as well,
>> > in order to see the full time to make it on disk?
>>
>> Agree with your guess, but kbuild is such kind of benchmark, we can not
>> change its behavior. :(
Why not? Benchmarks should try to model real workloads, and after I
build a kernel, I run lilo, that performs a sync (and typically takes
several seconds). If I used grub, I would pay the sync while
unmounting the filesystem, but I would experience the multi-second
delay anyway.

>>
>>
>> >
>> > I think that in a more extreme test where you end up filling all the
>> > buffer cache with written data, you will see much longer stalls with
>> > the revert than without.
>>
>> Have to do this? and if so, it is not kbuild. :)
It might be, on a machine with fewer ram. Can you try running with
just 1G of ram?
>>
>> BTW, the Jan's patch has a little improvement on kbuild. In many time
>> testing, it seems about 3% improving.
The fact is, does it improve (or at least is a noop) also on
constrained memory systems?

>> The average iostat output of Jan's patch:
>>  rrqm/s   wrqm/s r/s     w/s     rMB/s  wMB/s avgrq-sz avgqu-sz   await  svctm  %util
>> 23.419726 87.450685 96.164521 6.748493 1.046438 0.370137 45.182192 0.200685 6.848767 0.394110 3.072192
>>
>> >
>> > Thanks,
>> > Corrado
>> >
>> > >
>> > > vmstat report show, read bandwidth dropping:
>> > > vmstat of .37:
>> > >  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>> > > 3.4 52.6 0.0 64303937.0 16466.7 121544.5 0.0 0.0 2102.7 1914.6 7414.1 3185.7 2.0 1.0 80.3 16.7 0.0
>> > > vmstat of revert all from .37
>> > > 2.2 35.8 0.0 64306767.4 17265.6 126101.2 0.0 0.0 2415.8 1619.1 8532.2 3556.2 2.5 1.1 83.0 13.3 0.0
>> > >
>> > > Regards
>> > > Alex
>> > >
>> > > ===
>> > > diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
>> > > index 34a4861..27ac2f3 100644
>> > > --- a/fs/jbd/commit.c
>> > > +++ b/fs/jbd/commit.c
>> > > @@ -294,7 +294,7 @@ void journal_commit_transaction(journal_t *journal)
>> > >        int first_tag = 0;
>> > >        int tag_flag;
>> > >        int i;
>> > > -       int write_op = WRITE_SYNC;
>> > > +       int write_op = WRITE;
>> > >
>> > >        /*
>> > >         * First job: lock down the current transaction and wait for
>> > > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
>> > > index f3ad159..69ff08e 100644
>> > > --- a/fs/jbd2/commit.c
>> > > +++ b/fs/jbd2/commit.c
>> > > @@ -329,7 +329,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>> > >        int tag_bytes = journal_tag_bytes(journal);
>> > >        struct buffer_head *cbh = NULL; /* For transactional checksums */
>> > >        __u32 crc32_sum = ~0;
>> > > -       int write_op = WRITE_SYNC;
>> > > +       int write_op = WRITE;
>> > >
>> > >        /*
>> > >         * First job: lock down the current transaction and wait for
>> > >
>> > >
>> > >
>> >
>> >
>> >
>>
>
>
>



-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-01-21  7:23 ` Corrado Zoccolo
  2011-01-21  7:47   ` Alex,Shi
@ 2011-01-21  8:20   ` Shaohua Li
  1 sibling, 0 replies; 38+ messages in thread
From: Shaohua Li @ 2011-01-21  8:20 UTC (permalink / raw)
  To: Corrado Zoccolo, Jan Kara, Ted Ts'o
  Cc: Shi, Alex, vgoyal, jaxboe, linux-kernel, Chen, Tim C

On Fri, 2011-01-21 at 15:23 +0800, Corrado Zoccolo wrote:
> On Wed, Jan 19, 2011 at 2:55 AM, Alex,Shi <alex.shi@intel.com> wrote:
> > Shaohua and I tested kernel building performance on latest kernel. and
> > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file
> > system. We find this performance dropping is due to commit
> > 749ef9f8423054e326f. If we revert this patch or just change the
> > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be
> > recovered.
> >
> > iostat report show with the commit, read request merge number increased
> > and write request merge dropped. The total request size increased and
> > queue length dropped. So we tested another patch: only change WRITE_SYNC
> > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected.
> >
> > we didn't test deadline IO mode, just test cfq. seems insert write
> > request into sync queue effect much read performance, but we don't know
> > details. What's your comments of this?
> >
> > iostat of .37 kernel:
> > rrqm/s   wrqm/s r/s     w/s     rMB/s  wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> > 22.5772 96.46 92.3742 14.747  1.0048  0.439474 34.8557 0.18078 3.8076 0.30447 2.94302
> > iostat of commit reverted .37:
> > 26.6223 80.21 107.875 6.03538 1.51415 0 41.3275 0.153385 3.80569 0.377231 3.22323
> 
> From these numbers, it seems to me that with the patch reverted, the
> write bandwidth is really low, and probably you are keeping most
> written files in the buffer cache during the whole compile, while the
> non-reverted kernel is making progress in writing out the files. So
> the 'improved' read bandwidth is due to unfairness w.r.t. writes.
> Does the result change if you add a final sync and time that as well,
> in order to see the full time to make it on disk?
> 
> I think that in a more extreme test where you end up filling all the
> buffer cache with written data, you will see much longer stalls with
> the revert than without.
Not just bandwidth issue. cpu might have more idle. so I bet even the
'kbuild + sync' workload still has regression

And the no-write-merge of WRITE_SYNC isn't good, at least we should use
WRITE_SYNC_PLUG.

Thanks,
Shaohua

> >
> > vmstat report show, read bandwidth dropping:
> > vmstat of .37:
> >  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
> > 3.4 52.6 0.0 64303937.0 16466.7 121544.5 0.0 0.0 2102.7 1914.6 7414.1 3185.7 2.0 1.0 80.3 16.7 0.0
> > vmstat of revert all from .37
> > 2.2 35.8 0.0 64306767.4 17265.6 126101.2 0.0 0.0 2415.8 1619.1 8532.2 3556.2 2.5 1.1 83.0 13.3 0.0
> >
> > Regards
> > Alex
> >
> > ===
> > diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> > index 34a4861..27ac2f3 100644
> > --- a/fs/jbd/commit.c
> > +++ b/fs/jbd/commit.c
> > @@ -294,7 +294,7 @@ void journal_commit_transaction(journal_t *journal)
> >        int first_tag = 0;
> >        int tag_flag;
> >        int i;
> > -       int write_op = WRITE_SYNC;
> > +       int write_op = WRITE;
> >
> >        /*
> >         * First job: lock down the current transaction and wait for
> > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > index f3ad159..69ff08e 100644
> > --- a/fs/jbd2/commit.c
> > +++ b/fs/jbd2/commit.c
> > @@ -329,7 +329,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> >        int tag_bytes = journal_tag_bytes(journal);
> >        struct buffer_head *cbh = NULL; /* For transactional checksums */
> >        __u32 crc32_sum = ~0;
> > -       int write_op = WRITE_SYNC;
> > +       int write_op = WRITE;
> >
> >        /*
> >         * First job: lock down the current transaction and wait for
> >
> >
> >
> 
> 
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-01-20 15:16   ` Vivek Goyal
  2011-01-21  7:17     ` Shaohua Li
@ 2011-01-26  8:15     ` Shaohua Li
  2011-02-12  9:21       ` Alex,Shi
  1 sibling, 1 reply; 38+ messages in thread
From: Shaohua Li @ 2011-01-26  8:15 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Shi, Alex, jack, tytso, czoccolo, jaxboe, linux-kernel, Chen, Tim C

On Thu, Jan 20, 2011 at 11:16:56PM +0800, Vivek Goyal wrote:
> On Wed, Jan 19, 2011 at 10:03:26AM +0800, Shaohua Li wrote:
> > add Jan and Theodore to the loop.
> >
> > On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote:
> > > Shaohua and I tested kernel building performance on latest kernel. and
> > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file
> > > system. We find this performance dropping is due to commit
> > > 749ef9f8423054e326f. If we revert this patch or just change the
> > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be
> > > recovered.
> > >
> > > iostat report show with the commit, read request merge number increased
> > > and write request merge dropped. The total request size increased and
> > > queue length dropped. So we tested another patch: only change WRITE_SYNC
> > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected.
> > since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue.
> >
> 
> Yep, it does sound like reduce write merging. But moving journal commits
> back to WRITE, then fsync performance will drop as there will be idling
> introduced between fsync thread and journalling thread. So that does
> not sound like a good idea either.
> 
> Secondly, in presence of mixed workload (some other sync read happening)
> WRITES can get less bandwidth and sync workload much more. So by
> marking journal commits as WRITES you might increase the delay there
> in completion in presence of other sync workload.
> 
> So Jan Kara's approach makes sense that if somebody is waiting on
> commit then make it WRITE_SYNC otherwise make it WRITE. Not sure why
> did it not work for you. Is it possible to run some traces and do
> more debugging that figure out what's happening.
Sorry for the long delay.

Looks fedora enables ccache by default. While our kbuild test is on ext4 disk
but rootfs is on ext3 where ccache cache files live. Jan's patch only covers
ext4, maybe this is the reason.
I changed jbd to use WRITE for journal_commit_transaction. With the change and
Jan's patch, the test seems fine.

Jan,
can you send a patch with similar change for ext3? So we can do more tests.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-01-26  8:15     ` Shaohua Li
@ 2011-02-12  9:21       ` Alex,Shi
  2011-02-12 18:25         ` Corrado Zoccolo
  0 siblings, 1 reply; 38+ messages in thread
From: Alex,Shi @ 2011-02-12  9:21 UTC (permalink / raw)
  To: Li, Shaohua
  Cc: Vivek Goyal, jack, tytso, czoccolo, jaxboe, linux-kernel, Chen, Tim C

On Wed, 2011-01-26 at 16:15 +0800, Li, Shaohua wrote:
> On Thu, Jan 20, 2011 at 11:16:56PM +0800, Vivek Goyal wrote:
> > On Wed, Jan 19, 2011 at 10:03:26AM +0800, Shaohua Li wrote:
> > > add Jan and Theodore to the loop.
> > >
> > > On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote:
> > > > Shaohua and I tested kernel building performance on latest kernel. and
> > > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file
> > > > system. We find this performance dropping is due to commit
> > > > 749ef9f8423054e326f. If we revert this patch or just change the
> > > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be
> > > > recovered.
> > > >
> > > > iostat report show with the commit, read request merge number increased
> > > > and write request merge dropped. The total request size increased and
> > > > queue length dropped. So we tested another patch: only change WRITE_SYNC
> > > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected.
> > > since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue.
> > >
> > 
> > Yep, it does sound like reduce write merging. But moving journal commits
> > back to WRITE, then fsync performance will drop as there will be idling
> > introduced between fsync thread and journalling thread. So that does
> > not sound like a good idea either.
> > 
> > Secondly, in presence of mixed workload (some other sync read happening)
> > WRITES can get less bandwidth and sync workload much more. So by
> > marking journal commits as WRITES you might increase the delay there
> > in completion in presence of other sync workload.
> > 
> > So Jan Kara's approach makes sense that if somebody is waiting on
> > commit then make it WRITE_SYNC otherwise make it WRITE. Not sure why
> > did it not work for you. Is it possible to run some traces and do
> > more debugging that figure out what's happening.
> Sorry for the long delay.
> 
> Looks fedora enables ccache by default. While our kbuild test is on ext4 disk
> but rootfs is on ext3 where ccache cache files live. Jan's patch only covers
> ext4, maybe this is the reason.
> I changed jbd to use WRITE for journal_commit_transaction. With the change and
> Jan's patch, the test seems fine.
Let me clarify the bug situation again. 
With the following scenarios, the regression is clear.
1, ccache_dir setup at rootfs that format is ext3 on /dev/sda1; 2,
kbuild on /dev/sdb1 with ext4. 
but if we disable the ccache, only do kbuild on sdb1 with ext4. There is
no regressions whenever with or without Jan's patch. 
So, problem focus on the ccache scenario, (from fedora 11, ccache is
default setting). 

If we compare the vmstat output with or without ccache, there is too
many write when ccache enabled. According the result, it should to do
some tunning on ext3 fs. 


vmstat average output per 10 seconds, without ccache
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
26.8 0.5 0.0 63930192.3 9677.0 96544.9 0.0 0.0 2486.9 337.9 17729.9 4496.4 17.5 2.5 79.8 0.2 0.0 

vmstat average output per 10 seconds, with ccache
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
2.4 40.7 0.0 64316231.0 17260.6 119533.8 0.0 0.0 2477.6 1493.1 8606.4 3565.2 2.5 1.1 83.0 13.5 0.0


> 
> Jan,
> can you send a patch with similar change for ext3? So we can do more tests.
> 
> Thanks,
> Shaohua




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-02-12  9:21       ` Alex,Shi
@ 2011-02-12 18:25         ` Corrado Zoccolo
  2011-02-14  2:25           ` Alex,Shi
  0 siblings, 1 reply; 38+ messages in thread
From: Corrado Zoccolo @ 2011-02-12 18:25 UTC (permalink / raw)
  To: Alex,Shi
  Cc: Li, Shaohua, Vivek Goyal, jack, tytso, jaxboe, linux-kernel, Chen, Tim C

On Sat, Feb 12, 2011 at 10:21 AM, Alex,Shi <alex.shi@intel.com> wrote:
> On Wed, 2011-01-26 at 16:15 +0800, Li, Shaohua wrote:
>> On Thu, Jan 20, 2011 at 11:16:56PM +0800, Vivek Goyal wrote:
>> > On Wed, Jan 19, 2011 at 10:03:26AM +0800, Shaohua Li wrote:
>> > > add Jan and Theodore to the loop.
>> > >
>> > > On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote:
>> > > > Shaohua and I tested kernel building performance on latest kernel. and
>> > > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file
>> > > > system. We find this performance dropping is due to commit
>> > > > 749ef9f8423054e326f. If we revert this patch or just change the
>> > > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be
>> > > > recovered.
>> > > >
>> > > > iostat report show with the commit, read request merge number increased
>> > > > and write request merge dropped. The total request size increased and
>> > > > queue length dropped. So we tested another patch: only change WRITE_SYNC
>> > > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected.
>> > > since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue.
>> > >
>> >
>> > Yep, it does sound like reduce write merging. But moving journal commits
>> > back to WRITE, then fsync performance will drop as there will be idling
>> > introduced between fsync thread and journalling thread. So that does
>> > not sound like a good idea either.
>> >
>> > Secondly, in presence of mixed workload (some other sync read happening)
>> > WRITES can get less bandwidth and sync workload much more. So by
>> > marking journal commits as WRITES you might increase the delay there
>> > in completion in presence of other sync workload.
>> >
>> > So Jan Kara's approach makes sense that if somebody is waiting on
>> > commit then make it WRITE_SYNC otherwise make it WRITE. Not sure why
>> > did it not work for you. Is it possible to run some traces and do
>> > more debugging that figure out what's happening.
>> Sorry for the long delay.
>>
>> Looks fedora enables ccache by default. While our kbuild test is on ext4 disk
>> but rootfs is on ext3 where ccache cache files live. Jan's patch only covers
>> ext4, maybe this is the reason.
>> I changed jbd to use WRITE for journal_commit_transaction. With the change and
>> Jan's patch, the test seems fine.
> Let me clarify the bug situation again.
> With the following scenarios, the regression is clear.
> 1, ccache_dir setup at rootfs that format is ext3 on /dev/sda1; 2,
> kbuild on /dev/sdb1 with ext4.
> but if we disable the ccache, only do kbuild on sdb1 with ext4. There is
> no regressions whenever with or without Jan's patch.
> So, problem focus on the ccache scenario, (from fedora 11, ccache is
> default setting).
>
> If we compare the vmstat output with or without ccache, there is too
> many write when ccache enabled. According the result, it should to do
> some tunning on ext3 fs.
Is ext3 configured with data ordered or writeback?
I think ccache might be performing fsyncs, and this is a bad workload
for ext3, especially in ordered mode.
It might be that my patch introduced a regression in ext3 fsync
performance, but I don't understand how reverting only the change in
jbd2 (that is the ext4 specific journaling daemon) could restore it.
The two partitions are on different disks, so each one should be
isolated from the I/O perspective (do they share a single
controller?). The only interaction I see happens at the VM level,
since changing performance of any of the two changes the rate at which
pages can be cleaned.

Corrado
>
>
> vmstat average output per 10 seconds, without ccache
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
> 26.8 0.5 0.0 63930192.3 9677.0 96544.9 0.0 0.0 2486.9 337.9 17729.9 4496.4 17.5 2.5 79.8 0.2 0.0
>
> vmstat average output per 10 seconds, with ccache
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
> 2.4 40.7 0.0 64316231.0 17260.6 119533.8 0.0 0.0 2477.6 1493.1 8606.4 3565.2 2.5 1.1 83.0 13.5 0.0
>
>
>>
>> Jan,
>> can you send a patch with similar change for ext3? So we can do more tests.
>>
>> Thanks,
>> Shaohua
>
>
>
>



-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-02-12 18:25         ` Corrado Zoccolo
@ 2011-02-14  2:25           ` Alex,Shi
  2011-02-15  1:10             ` Shaohua Li
  0 siblings, 1 reply; 38+ messages in thread
From: Alex,Shi @ 2011-02-14  2:25 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Li, Shaohua, Vivek Goyal, jack, tytso, jaxboe, linux-kernel, Chen, Tim C

On Sun, 2011-02-13 at 02:25 +0800, Corrado Zoccolo wrote:
> On Sat, Feb 12, 2011 at 10:21 AM, Alex,Shi <alex.shi@intel.com> wrote:
> > On Wed, 2011-01-26 at 16:15 +0800, Li, Shaohua wrote:
> >> On Thu, Jan 20, 2011 at 11:16:56PM +0800, Vivek Goyal wrote:
> >> > On Wed, Jan 19, 2011 at 10:03:26AM +0800, Shaohua Li wrote:
> >> > > add Jan and Theodore to the loop.
> >> > >
> >> > > On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote:
> >> > > > Shaohua and I tested kernel building performance on latest kernel. and
> >> > > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file
> >> > > > system. We find this performance dropping is due to commit
> >> > > > 749ef9f8423054e326f. If we revert this patch or just change the
> >> > > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be
> >> > > > recovered.
> >> > > >
> >> > > > iostat report show with the commit, read request merge number increased
> >> > > > and write request merge dropped. The total request size increased and
> >> > > > queue length dropped. So we tested another patch: only change WRITE_SYNC
> >> > > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected.
> >> > > since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue.
> >> > >
> >> >
> >> > Yep, it does sound like reduce write merging. But moving journal commits
> >> > back to WRITE, then fsync performance will drop as there will be idling
> >> > introduced between fsync thread and journalling thread. So that does
> >> > not sound like a good idea either.
> >> >
> >> > Secondly, in presence of mixed workload (some other sync read happening)
> >> > WRITES can get less bandwidth and sync workload much more. So by
> >> > marking journal commits as WRITES you might increase the delay there
> >> > in completion in presence of other sync workload.
> >> >
> >> > So Jan Kara's approach makes sense that if somebody is waiting on
> >> > commit then make it WRITE_SYNC otherwise make it WRITE. Not sure why
> >> > did it not work for you. Is it possible to run some traces and do
> >> > more debugging that figure out what's happening.
> >> Sorry for the long delay.
> >>
> >> Looks fedora enables ccache by default. While our kbuild test is on ext4 disk
> >> but rootfs is on ext3 where ccache cache files live. Jan's patch only covers
> >> ext4, maybe this is the reason.
> >> I changed jbd to use WRITE for journal_commit_transaction. With the change and
> >> Jan's patch, the test seems fine.
> > Let me clarify the bug situation again.
> > With the following scenarios, the regression is clear.
> > 1, ccache_dir setup at rootfs that format is ext3 on /dev/sda1; 2,
> > kbuild on /dev/sdb1 with ext4.
> > but if we disable the ccache, only do kbuild on sdb1 with ext4. There is
> > no regressions whenever with or without Jan's patch.
> > So, problem focus on the ccache scenario, (from fedora 11, ccache is
> > default setting).
> >
> > If we compare the vmstat output with or without ccache, there is too
> > many write when ccache enabled. According the result, it should to do
> > some tunning on ext3 fs.
> Is ext3 configured with data ordered or writeback?

The ext3 on sda and ext4 on sdb are both used 'ordered' mounting mode. 

> I think ccache might be performing fsyncs, and this is a bad workload
> for ext3, especially in ordered mode.
> It might be that my patch introduced a regression in ext3 fsync
> performance, but I don't understand how reverting only the change in
> jbd2 (that is the ext4 specific journaling daemon) could restore it.
> The two partitions are on different disks, so each one should be
> isolated from the I/O perspective (do they share a single
> controller?).

No, sda/sdb use separated controller. 
 
>  The only interaction I see happens at the VM level,
> since changing performance of any of the two changes the rate at which
> pages can be cleaned.
> 
> Corrado
> >
> >
> > vmstat average output per 10 seconds, without ccache
> > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> >  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
> > 26.8 0.5 0.0 63930192.3 9677.0 96544.9 0.0 0.0 2486.9 337.9 17729.9 4496.4 17.5 2.5 79.8 0.2 0.0
> >
> > vmstat average output per 10 seconds, with ccache
> > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> >  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
> > 2.4 40.7 0.0 64316231.0 17260.6 119533.8 0.0 0.0 2477.6 1493.1 8606.4 3565.2 2.5 1.1 83.0 13.5 0.0
> >
> >
> >>
> >> Jan,
> >> can you send a patch with similar change for ext3? So we can do more tests.
> >>
> >> Thanks,
> >> Shaohua
> >
> >
> >
> >
> 
> 
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-02-14  2:25           ` Alex,Shi
@ 2011-02-15  1:10             ` Shaohua Li
  2011-02-21 16:49               ` Jan Kara
  0 siblings, 1 reply; 38+ messages in thread
From: Shaohua Li @ 2011-02-15  1:10 UTC (permalink / raw)
  To: Shi, Alex, Jan Kara
  Cc: Corrado Zoccolo, Vivek Goyal, jack, tytso, jaxboe, linux-kernel,
	Chen, Tim C

On Mon, 2011-02-14 at 10:25 +0800, Shi, Alex wrote:
> On Sun, 2011-02-13 at 02:25 +0800, Corrado Zoccolo wrote:
> > On Sat, Feb 12, 2011 at 10:21 AM, Alex,Shi <alex.shi@intel.com> wrote:
> > > On Wed, 2011-01-26 at 16:15 +0800, Li, Shaohua wrote:
> > >> On Thu, Jan 20, 2011 at 11:16:56PM +0800, Vivek Goyal wrote:
> > >> > On Wed, Jan 19, 2011 at 10:03:26AM +0800, Shaohua Li wrote:
> > >> > > add Jan and Theodore to the loop.
> > >> > >
> > >> > > On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote:
> > >> > > > Shaohua and I tested kernel building performance on latest kernel. and
> > >> > > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file
> > >> > > > system. We find this performance dropping is due to commit
> > >> > > > 749ef9f8423054e326f. If we revert this patch or just change the
> > >> > > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be
> > >> > > > recovered.
> > >> > > >
> > >> > > > iostat report show with the commit, read request merge number increased
> > >> > > > and write request merge dropped. The total request size increased and
> > >> > > > queue length dropped. So we tested another patch: only change WRITE_SYNC
> > >> > > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected.
> > >> > > since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue.
> > >> > >
> > >> >
> > >> > Yep, it does sound like reduce write merging. But moving journal commits
> > >> > back to WRITE, then fsync performance will drop as there will be idling
> > >> > introduced between fsync thread and journalling thread. So that does
> > >> > not sound like a good idea either.
> > >> >
> > >> > Secondly, in presence of mixed workload (some other sync read happening)
> > >> > WRITES can get less bandwidth and sync workload much more. So by
> > >> > marking journal commits as WRITES you might increase the delay there
> > >> > in completion in presence of other sync workload.
> > >> >
> > >> > So Jan Kara's approach makes sense that if somebody is waiting on
> > >> > commit then make it WRITE_SYNC otherwise make it WRITE. Not sure why
> > >> > did it not work for you. Is it possible to run some traces and do
> > >> > more debugging that figure out what's happening.
> > >> Sorry for the long delay.
> > >>
> > >> Looks fedora enables ccache by default. While our kbuild test is on ext4 disk
> > >> but rootfs is on ext3 where ccache cache files live. Jan's patch only covers
> > >> ext4, maybe this is the reason.
> > >> I changed jbd to use WRITE for journal_commit_transaction. With the change and
> > >> Jan's patch, the test seems fine.
> > > Let me clarify the bug situation again.
> > > With the following scenarios, the regression is clear.
> > > 1, ccache_dir setup at rootfs that format is ext3 on /dev/sda1; 2,
> > > kbuild on /dev/sdb1 with ext4.
> > > but if we disable the ccache, only do kbuild on sdb1 with ext4. There is
> > > no regressions whenever with or without Jan's patch.
> > > So, problem focus on the ccache scenario, (from fedora 11, ccache is
> > > default setting).
> > >
> > > If we compare the vmstat output with or without ccache, there is too
> > > many write when ccache enabled. According the result, it should to do
> > > some tunning on ext3 fs.
> > Is ext3 configured with data ordered or writeback?
> 
> The ext3 on sda and ext4 on sdb are both used 'ordered' mounting mode. 
> 
> > I think ccache might be performing fsyncs, and this is a bad workload
> > for ext3, especially in ordered mode.
> > It might be that my patch introduced a regression in ext3 fsync
> > performance, but I don't understand how reverting only the change in
> > jbd2 (that is the ext4 specific journaling daemon) could restore it.
> > The two partitions are on different disks, so each one should be
> > isolated from the I/O perspective (do they share a single
> > controller?).
> 
> No, sda/sdb use separated controller. 
>  
> >  The only interaction I see happens at the VM level,
> > since changing performance of any of the two changes the rate at which
> > pages can be cleaned.
> > 
> > Corrado
> > >
> > >
> > > vmstat average output per 10 seconds, without ccache
> > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > >  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
> > > 26.8 0.5 0.0 63930192.3 9677.0 96544.9 0.0 0.0 2486.9 337.9 17729.9 4496.4 17.5 2.5 79.8 0.2 0.0
> > >
> > > vmstat average output per 10 seconds, with ccache
> > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > >  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
> > > 2.4 40.7 0.0 64316231.0 17260.6 119533.8 0.0 0.0 2477.6 1493.1 8606.4 3565.2 2.5 1.1 83.0 13.5 0.0
> > >
> > >
> > >>
> > >> Jan,
> > >> can you send a patch with similar change for ext3? So we can do more tests.
Hi Jan,
can you send a patch with both ext3 and ext4 changes? Our test shows
your patch has positive effect, but need confirm with the ext3 change.

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-02-15  1:10             ` Shaohua Li
@ 2011-02-21 16:49               ` Jan Kara
  2011-02-23  8:24                 ` Alex,Shi
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2011-02-21 16:49 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Shi, Alex, Jan Kara, Corrado Zoccolo, Vivek Goyal, tytso, jaxboe,
	linux-kernel, Chen, Tim C

[-- Attachment #1: Type: text/plain, Size: 5448 bytes --]

On Tue 15-02-11 09:10:01, Shaohua Li wrote:
> On Mon, 2011-02-14 at 10:25 +0800, Shi, Alex wrote:
> > On Sun, 2011-02-13 at 02:25 +0800, Corrado Zoccolo wrote:
> > > On Sat, Feb 12, 2011 at 10:21 AM, Alex,Shi <alex.shi@intel.com> wrote:
> > > > On Wed, 2011-01-26 at 16:15 +0800, Li, Shaohua wrote:
> > > >> On Thu, Jan 20, 2011 at 11:16:56PM +0800, Vivek Goyal wrote:
> > > >> > On Wed, Jan 19, 2011 at 10:03:26AM +0800, Shaohua Li wrote:
> > > >> > > add Jan and Theodore to the loop.
> > > >> > >
> > > >> > > On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote:
> > > >> > > > Shaohua and I tested kernel building performance on latest kernel. and
> > > >> > > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file
> > > >> > > > system. We find this performance dropping is due to commit
> > > >> > > > 749ef9f8423054e326f. If we revert this patch or just change the
> > > >> > > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be
> > > >> > > > recovered.
> > > >> > > >
> > > >> > > > iostat report show with the commit, read request merge number increased
> > > >> > > > and write request merge dropped. The total request size increased and
> > > >> > > > queue length dropped. So we tested another patch: only change WRITE_SYNC
> > > >> > > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected.
> > > >> > > since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue.
> > > >> > >
> > > >> >
> > > >> > Yep, it does sound like reduce write merging. But moving journal commits
> > > >> > back to WRITE, then fsync performance will drop as there will be idling
> > > >> > introduced between fsync thread and journalling thread. So that does
> > > >> > not sound like a good idea either.
> > > >> >
> > > >> > Secondly, in presence of mixed workload (some other sync read happening)
> > > >> > WRITES can get less bandwidth and sync workload much more. So by
> > > >> > marking journal commits as WRITES you might increase the delay there
> > > >> > in completion in presence of other sync workload.
> > > >> >
> > > >> > So Jan Kara's approach makes sense that if somebody is waiting on
> > > >> > commit then make it WRITE_SYNC otherwise make it WRITE. Not sure why
> > > >> > did it not work for you. Is it possible to run some traces and do
> > > >> > more debugging that figure out what's happening.
> > > >> Sorry for the long delay.
> > > >>
> > > >> Looks fedora enables ccache by default. While our kbuild test is on ext4 disk
> > > >> but rootfs is on ext3 where ccache cache files live. Jan's patch only covers
> > > >> ext4, maybe this is the reason.
> > > >> I changed jbd to use WRITE for journal_commit_transaction. With the change and
> > > >> Jan's patch, the test seems fine.
> > > > Let me clarify the bug situation again.
> > > > With the following scenarios, the regression is clear.
> > > > 1, ccache_dir setup at rootfs that format is ext3 on /dev/sda1; 2,
> > > > kbuild on /dev/sdb1 with ext4.
> > > > but if we disable the ccache, only do kbuild on sdb1 with ext4. There is
> > > > no regressions whenever with or without Jan's patch.
> > > > So, problem focus on the ccache scenario, (from fedora 11, ccache is
> > > > default setting).
> > > >
> > > > If we compare the vmstat output with or without ccache, there is too
> > > > many write when ccache enabled. According the result, it should to do
> > > > some tunning on ext3 fs.
> > > Is ext3 configured with data ordered or writeback?
> > 
> > The ext3 on sda and ext4 on sdb are both used 'ordered' mounting mode. 
> > 
> > > I think ccache might be performing fsyncs, and this is a bad workload
> > > for ext3, especially in ordered mode.
> > > It might be that my patch introduced a regression in ext3 fsync
> > > performance, but I don't understand how reverting only the change in
> > > jbd2 (that is the ext4 specific journaling daemon) could restore it.
> > > The two partitions are on different disks, so each one should be
> > > isolated from the I/O perspective (do they share a single
> > > controller?).
> > 
> > No, sda/sdb use separated controller. 
> >  
> > >  The only interaction I see happens at the VM level,
> > > since changing performance of any of the two changes the rate at which
> > > pages can be cleaned.
> > > 
> > > Corrado
> > > >
> > > >
> > > > vmstat average output per 10 seconds, without ccache
> > > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > > >  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
> > > > 26.8 0.5 0.0 63930192.3 9677.0 96544.9 0.0 0.0 2486.9 337.9 17729.9 4496.4 17.5 2.5 79.8 0.2 0.0
> > > >
> > > > vmstat average output per 10 seconds, with ccache
> > > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > > >  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
> > > > 2.4 40.7 0.0 64316231.0 17260.6 119533.8 0.0 0.0 2477.6 1493.1 8606.4 3565.2 2.5 1.1 83.0 13.5 0.0
> > > >
> > > >
> > > >>
> > > >> Jan,
> > > >> can you send a patch with similar change for ext3? So we can do more tests.
> Hi Jan,
> can you send a patch with both ext3 and ext4 changes? Our test shows
> your patch has positive effect, but need confirm with the ext3 change.
  Sure. Patches for both ext3 & ext4 are attached. Sorry, it took me a
while to get to this.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0001-jbd2-Refine-commit-writeout-logic.patch --]
[-- Type: text/x-patch, Size: 11044 bytes --]

>From 5674f84e70db8274d5aac56a41439ea3b3bcb46e Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Wed, 19 Jan 2011 13:45:04 +0100
Subject: [PATCH 1/2] jbd2: Refine commit writeout logic

Currently we write out all journal buffers in WRITE_SYNC mode. This improves
performance for fsync heavy workloads but hinders performance when writes
are mostly asynchronous. So add possibility for callers starting a transaction
commit to specify whether they are going to wait for the commit and submit
journal writes in WRITE_SYNC mode only in that case.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/fsync.c       |    2 +-
 fs/ext4/super.c       |    2 +-
 fs/jbd2/checkpoint.c  |    2 +-
 fs/jbd2/commit.c      |    4 ++--
 fs/jbd2/journal.c     |   19 ++++++++++---------
 fs/jbd2/transaction.c |   13 ++++++-------
 fs/ocfs2/aops.c       |    2 +-
 fs/ocfs2/super.c      |    2 +-
 include/linux/jbd2.h  |   18 ++++++++++--------
 9 files changed, 33 insertions(+), 31 deletions(-)

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 7829b28..19434da 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -198,7 +198,7 @@ int ext4_sync_file(struct file *file, int datasync)
 		return ext4_force_commit(inode->i_sb);
 
 	commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
-	if (jbd2_log_start_commit(journal, commit_tid)) {
+	if (jbd2_log_start_commit(journal, commit_tid, true)) {
 		/*
 		 * When the journal is on a different device than the
 		 * fs data disk, we need to issue the barrier in
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 48ce561..0aeb877 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4113,7 +4113,7 @@ static int ext4_sync_fs(struct super_block *sb, int wait)
 
 	trace_ext4_sync_fs(sb, wait);
 	flush_workqueue(sbi->dio_unwritten_wq);
-	if (jbd2_journal_start_commit(sbi->s_journal, &target)) {
+	if (jbd2_journal_start_commit(sbi->s_journal, &target, true)) {
 		if (wait)
 			jbd2_log_wait_commit(sbi->s_journal, target);
 	}
diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 6a79fd0..3436d53 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -309,7 +309,7 @@ static int __process_buffer(journal_t *journal, struct journal_head *jh,
 			       "Waiting for Godot: block %llu\n",
 			       journal->j_devname,
 			       (unsigned long long) bh->b_blocknr);
-		jbd2_log_start_commit(journal, tid);
+		jbd2_log_start_commit(journal, tid, true);
 		jbd2_log_wait_commit(journal, tid);
 		ret = 1;
 	} else if (!buffer_dirty(bh)) {
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index f3ad159..19973eb 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -329,7 +329,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	int tag_bytes = journal_tag_bytes(journal);
 	struct buffer_head *cbh = NULL; /* For transactional checksums */
 	__u32 crc32_sum = ~0;
-	int write_op = WRITE_SYNC;
+	int write_op = WRITE;
 
 	/*
 	 * First job: lock down the current transaction and wait for
@@ -368,7 +368,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	 * we unplug the device. We don't do explicit unplugging in here,
 	 * instead we rely on sync_buffer() doing the unplug for us.
 	 */
-	if (commit_transaction->t_synchronous_commit)
+	if (tid_geq(journal->j_commit_waited, commit_transaction->t_tid))
 		write_op = WRITE_SYNC_PLUG;
 	trace_jbd2_commit_locking(journal, commit_transaction);
 	stats.run.rs_wait = commit_transaction->t_max_wait;
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 9e46869..e278fa2 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -475,7 +475,7 @@ int __jbd2_log_space_left(journal_t *journal)
 /*
  * Called under j_state_lock.  Returns true if a transaction commit was started.
  */
-int __jbd2_log_start_commit(journal_t *journal, tid_t target)
+int __jbd2_log_start_commit(journal_t *journal, tid_t target, bool will_wait)
 {
 	/*
 	 * Are we already doing a recent enough commit?
@@ -485,7 +485,8 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t target)
 		 * We want a new commit: OK, mark the request and wakeup the
 		 * commit thread.  We do _not_ do the commit ourselves.
 		 */
-
+		if (will_wait && !tid_geq(journal->j_commit_waited, target))
+			journal->j_commit_waited = target;
 		journal->j_commit_request = target;
 		jbd_debug(1, "JBD: requesting commit %d/%d\n",
 			  journal->j_commit_request,
@@ -496,12 +497,12 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t target)
 	return 0;
 }
 
-int jbd2_log_start_commit(journal_t *journal, tid_t tid)
+int jbd2_log_start_commit(journal_t *journal, tid_t tid, bool will_wait)
 {
 	int ret;
 
 	write_lock(&journal->j_state_lock);
-	ret = __jbd2_log_start_commit(journal, tid);
+	ret = __jbd2_log_start_commit(journal, tid, will_wait);
 	write_unlock(&journal->j_state_lock);
 	return ret;
 }
@@ -524,7 +525,7 @@ int jbd2_journal_force_commit_nested(journal_t *journal)
 	read_lock(&journal->j_state_lock);
 	if (journal->j_running_transaction && !current->journal_info) {
 		transaction = journal->j_running_transaction;
-		__jbd2_log_start_commit(journal, transaction->t_tid);
+		__jbd2_log_start_commit(journal, transaction->t_tid, true);
 	} else if (journal->j_committing_transaction)
 		transaction = journal->j_committing_transaction;
 
@@ -544,7 +545,7 @@ int jbd2_journal_force_commit_nested(journal_t *journal)
  * if a transaction is going to be committed (or is currently already
  * committing), and fills its tid in at *ptid
  */
-int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid)
+int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid, bool will_wait)
 {
 	int ret = 0;
 
@@ -552,7 +553,7 @@ int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid)
 	if (journal->j_running_transaction) {
 		tid_t tid = journal->j_running_transaction->t_tid;
 
-		__jbd2_log_start_commit(journal, tid);
+		__jbd2_log_start_commit(journal, tid, will_wait);
 		/* There's a running transaction and we've just made sure
 		 * it's commit has been scheduled. */
 		if (ptid)
@@ -1559,7 +1560,7 @@ int jbd2_journal_flush(journal_t *journal)
 	/* Force everything buffered to the log... */
 	if (journal->j_running_transaction) {
 		transaction = journal->j_running_transaction;
-		__jbd2_log_start_commit(journal, transaction->t_tid);
+		__jbd2_log_start_commit(journal, transaction->t_tid, true);
 	} else if (journal->j_committing_transaction)
 		transaction = journal->j_committing_transaction;
 
@@ -1675,7 +1676,7 @@ void __jbd2_journal_abort_hard(journal_t *journal)
 	journal->j_flags |= JBD2_ABORT;
 	transaction = journal->j_running_transaction;
 	if (transaction)
-		__jbd2_log_start_commit(journal, transaction->t_tid);
+		__jbd2_log_start_commit(journal, transaction->t_tid, false);
 	write_unlock(&journal->j_state_lock);
 }
 
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index faad2bd..c48e6e8 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -222,7 +222,7 @@ repeat:
 		atomic_sub(nblocks, &transaction->t_outstanding_credits);
 		prepare_to_wait(&journal->j_wait_transaction_locked, &wait,
 				TASK_UNINTERRUPTIBLE);
-		__jbd2_log_start_commit(journal, transaction->t_tid);
+		__jbd2_log_start_commit(journal, transaction->t_tid, false);
 		read_unlock(&journal->j_state_lock);
 		schedule();
 		finish_wait(&journal->j_wait_transaction_locked, &wait);
@@ -465,7 +465,7 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, int gfp_mask)
 	spin_unlock(&transaction->t_handle_lock);
 
 	jbd_debug(2, "restarting handle %p\n", handle);
-	__jbd2_log_start_commit(journal, transaction->t_tid);
+	__jbd2_log_start_commit(journal, transaction->t_tid, false);
 	read_unlock(&journal->j_state_lock);
 
 	lock_map_release(&handle->h_lockdep_map);
@@ -1361,8 +1361,6 @@ int jbd2_journal_stop(handle_t *handle)
 		}
 	}
 
-	if (handle->h_sync)
-		transaction->t_synchronous_commit = 1;
 	current->journal_info = NULL;
 	atomic_sub(handle->h_buffer_credits,
 		   &transaction->t_outstanding_credits);
@@ -1383,15 +1381,16 @@ int jbd2_journal_stop(handle_t *handle)
 
 		jbd_debug(2, "transaction too old, requesting commit for "
 					"handle %p\n", handle);
-		/* This is non-blocking */
-		jbd2_log_start_commit(journal, transaction->t_tid);
-
 		/*
 		 * Special case: JBD2_SYNC synchronous updates require us
 		 * to wait for the commit to complete.
 		 */
 		if (handle->h_sync && !(current->flags & PF_MEMALLOC))
 			wait_for_commit = 1;
+
+		/* This is non-blocking */
+		jbd2_log_start_commit(journal, transaction->t_tid,
+				      wait_for_commit);
 	}
 
 	/*
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 1fbb0e2..d493f32 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -1659,7 +1659,7 @@ static int ocfs2_try_to_free_truncate_log(struct ocfs2_super *osb,
 		goto out;
 	}
 
-	if (jbd2_journal_start_commit(osb->journal->j_journal, &target)) {
+	if (jbd2_journal_start_commit(osb->journal->j_journal, &target, true)) {
 		jbd2_log_wait_commit(osb->journal->j_journal, target);
 		ret = 1;
 	}
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 38f986d..45d9f82 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -414,7 +414,7 @@ static int ocfs2_sync_fs(struct super_block *sb, int wait)
 	}
 
 	if (jbd2_journal_start_commit(OCFS2_SB(sb)->journal->j_journal,
-				      &target)) {
+				      &target, wait)) {
 		if (wait)
 			jbd2_log_wait_commit(OCFS2_SB(sb)->journal->j_journal,
 					     target);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 27e79c2..46aaf45 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -631,11 +631,6 @@ struct transaction_s
 	 */
 	atomic_t		t_handle_count;
 
-	/*
-	 * This transaction is being forced and some process is
-	 * waiting for it to finish.
-	 */
-	unsigned int t_synchronous_commit:1;
 	unsigned int t_flushed_data_blocks:1;
 
 	/*
@@ -900,6 +895,13 @@ struct journal_s
 	tid_t			j_commit_request;
 
 	/*
+	 * Sequence number of the most recent transaction someone is waiting
+	 * for to commit.
+	 * [j_state_lock]
+	 */
+	tid_t			j_commit_waited;
+
+	/*
 	 * Journal uuid: identifies the object (filesystem, LVM volume etc)
 	 * backed by this journal.  This will eventually be replaced by an array
 	 * of uuids, allowing us to index multiple devices within a single
@@ -1200,9 +1202,9 @@ extern void	jbd2_journal_switch_revoke_table(journal_t *journal);
  */
 
 int __jbd2_log_space_left(journal_t *); /* Called with journal locked */
-int jbd2_log_start_commit(journal_t *journal, tid_t tid);
-int __jbd2_log_start_commit(journal_t *journal, tid_t tid);
-int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
+int jbd2_log_start_commit(journal_t *journal, tid_t tid, bool will_wait);
+int __jbd2_log_start_commit(journal_t *journal, tid_t tid, bool will_wait);
+int jbd2_journal_start_commit(journal_t *journal, tid_t *tid, bool will_wait);
 int jbd2_journal_force_commit_nested(journal_t *journal);
 int jbd2_log_wait_commit(journal_t *journal, tid_t tid);
 int jbd2_log_do_checkpoint(journal_t *journal);
-- 
1.7.1


[-- Attachment #3: 0002-jbd-Refine-commit-writeout-logic.patch --]
[-- Type: text/x-patch, Size: 9843 bytes --]

>From ff2be0c11564d1253d9420ee0f805f2ba8f62e9f Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Mon, 21 Feb 2011 17:25:37 +0100
Subject: [PATCH 2/2] jbd: Refine commit writeout logic

Currently we write out all journal buffers in WRITE_SYNC mode. This improves
performance for fsync heavy workloads but hinders performance when writes
are mostly asynchronous. So add possibility for callers starting a transaction
commit to specify whether they are going to wait for the commit and submit
journal writes in WRITE_SYNC mode only in that case.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext3/fsync.c      |    2 +-
 fs/ext3/super.c      |    2 +-
 fs/jbd/checkpoint.c  |    2 +-
 fs/jbd/commit.c      |    4 ++--
 fs/jbd/journal.c     |   19 ++++++++++---------
 fs/jbd/transaction.c |    9 ++++-----
 include/linux/jbd.h  |   21 ++++++++++++---------
 7 files changed, 31 insertions(+), 28 deletions(-)

diff --git a/fs/ext3/fsync.c b/fs/ext3/fsync.c
index 09b13bb..5396dd6 100644
--- a/fs/ext3/fsync.c
+++ b/fs/ext3/fsync.c
@@ -81,7 +81,7 @@ int ext3_sync_file(struct file *file, int datasync)
 	if (test_opt(inode->i_sb, BARRIER) &&
 	    !journal_trans_will_send_data_barrier(journal, commit_tid))
 		needs_barrier = 1;
-	log_start_commit(journal, commit_tid);
+	log_start_commit(journal, commit_tid, true);
 	ret = log_wait_commit(journal, commit_tid);
 
 	/*
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 85c8cc8..58a4424 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -2497,7 +2497,7 @@ static int ext3_sync_fs(struct super_block *sb, int wait)
 {
 	tid_t target;
 
-	if (journal_start_commit(EXT3_SB(sb)->s_journal, &target)) {
+	if (journal_start_commit(EXT3_SB(sb)->s_journal, &target, true)) {
 		if (wait)
 			log_wait_commit(EXT3_SB(sb)->s_journal, target);
 	}
diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c
index e4b87bc..26ca2c3 100644
--- a/fs/jbd/checkpoint.c
+++ b/fs/jbd/checkpoint.c
@@ -297,7 +297,7 @@ static int __process_buffer(journal_t *journal, struct journal_head *jh,
 
 		spin_unlock(&journal->j_list_lock);
 		jbd_unlock_bh_state(bh);
-		log_start_commit(journal, tid);
+		log_start_commit(journal, tid, true);
 		log_wait_commit(journal, tid);
 		ret = 1;
 	} else if (!buffer_dirty(bh)) {
diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 34a4861..ac188cb 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -294,7 +294,7 @@ void journal_commit_transaction(journal_t *journal)
 	int first_tag = 0;
 	int tag_flag;
 	int i;
-	int write_op = WRITE_SYNC;
+	int write_op = WRITE;
 
 	/*
 	 * First job: lock down the current transaction and wait for
@@ -332,7 +332,7 @@ void journal_commit_transaction(journal_t *journal)
 	 * we unplug the device. We don't do explicit unplugging in here,
 	 * instead we rely on sync_buffer() doing the unplug for us.
 	 */
-	if (commit_transaction->t_synchronous_commit)
+	if (tid_geq(journal->j_commit_waited, commit_transaction->t_tid))
 		write_op = WRITE_SYNC_PLUG;
 	spin_lock(&commit_transaction->t_handle_lock);
 	while (commit_transaction->t_updates) {
diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index da1b5e4..5743b4c 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -434,7 +434,7 @@ int __log_space_left(journal_t *journal)
 /*
  * Called under j_state_lock.  Returns true if a transaction commit was started.
  */
-int __log_start_commit(journal_t *journal, tid_t target)
+int __log_start_commit(journal_t *journal, tid_t target, bool will_wait)
 {
 	/*
 	 * Are we already doing a recent enough commit?
@@ -444,7 +444,8 @@ int __log_start_commit(journal_t *journal, tid_t target)
 		 * We want a new commit: OK, mark the request and wakeup the
 		 * commit thread.  We do _not_ do the commit ourselves.
 		 */
-
+		if (will_wait && !tid_geq(journal->j_commit_waited, target))
+			journal->j_commit_waited = target;
 		journal->j_commit_request = target;
 		jbd_debug(1, "JBD: requesting commit %d/%d\n",
 			  journal->j_commit_request,
@@ -455,12 +456,12 @@ int __log_start_commit(journal_t *journal, tid_t target)
 	return 0;
 }
 
-int log_start_commit(journal_t *journal, tid_t tid)
+int log_start_commit(journal_t *journal, tid_t tid, bool will_wait)
 {
 	int ret;
 
 	spin_lock(&journal->j_state_lock);
-	ret = __log_start_commit(journal, tid);
+	ret = __log_start_commit(journal, tid, will_wait);
 	spin_unlock(&journal->j_state_lock);
 	return ret;
 }
@@ -483,7 +484,7 @@ int journal_force_commit_nested(journal_t *journal)
 	spin_lock(&journal->j_state_lock);
 	if (journal->j_running_transaction && !current->journal_info) {
 		transaction = journal->j_running_transaction;
-		__log_start_commit(journal, transaction->t_tid);
+		__log_start_commit(journal, transaction->t_tid, true);
 	} else if (journal->j_committing_transaction)
 		transaction = journal->j_committing_transaction;
 
@@ -503,7 +504,7 @@ int journal_force_commit_nested(journal_t *journal)
  * if a transaction is going to be committed (or is currently already
  * committing), and fills its tid in at *ptid
  */
-int journal_start_commit(journal_t *journal, tid_t *ptid)
+int journal_start_commit(journal_t *journal, tid_t *ptid, bool will_wait)
 {
 	int ret = 0;
 
@@ -511,7 +512,7 @@ int journal_start_commit(journal_t *journal, tid_t *ptid)
 	if (journal->j_running_transaction) {
 		tid_t tid = journal->j_running_transaction->t_tid;
 
-		__log_start_commit(journal, tid);
+		__log_start_commit(journal, tid, will_wait);
 		/* There's a running transaction and we've just made sure
 		 * it's commit has been scheduled. */
 		if (ptid)
@@ -1439,7 +1440,7 @@ int journal_flush(journal_t *journal)
 	/* Force everything buffered to the log... */
 	if (journal->j_running_transaction) {
 		transaction = journal->j_running_transaction;
-		__log_start_commit(journal, transaction->t_tid);
+		__log_start_commit(journal, transaction->t_tid, true);
 	} else if (journal->j_committing_transaction)
 		transaction = journal->j_committing_transaction;
 
@@ -1573,7 +1574,7 @@ static void __journal_abort_hard(journal_t *journal)
 	journal->j_flags |= JFS_ABORT;
 	transaction = journal->j_running_transaction;
 	if (transaction)
-		__log_start_commit(journal, transaction->t_tid);
+		__log_start_commit(journal, transaction->t_tid, false);
 	spin_unlock(&journal->j_state_lock);
 }
 
diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index 5b2e4c3..a12c0a3 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -178,7 +178,7 @@ repeat_locked:
 		spin_unlock(&transaction->t_handle_lock);
 		prepare_to_wait(&journal->j_wait_transaction_locked, &wait,
 				TASK_UNINTERRUPTIBLE);
-		__log_start_commit(journal, transaction->t_tid);
+		__log_start_commit(journal, transaction->t_tid, false);
 		spin_unlock(&journal->j_state_lock);
 		schedule();
 		finish_wait(&journal->j_wait_transaction_locked, &wait);
@@ -411,7 +411,7 @@ int journal_restart(handle_t *handle, int nblocks)
 	spin_unlock(&transaction->t_handle_lock);
 
 	jbd_debug(2, "restarting handle %p\n", handle);
-	__log_start_commit(journal, transaction->t_tid);
+	__log_start_commit(journal, transaction->t_tid, false);
 	spin_unlock(&journal->j_state_lock);
 
 	lock_map_release(&handle->h_lockdep_map);
@@ -1431,8 +1431,6 @@ int journal_stop(handle_t *handle)
 		}
 	}
 
-	if (handle->h_sync)
-		transaction->t_synchronous_commit = 1;
 	current->journal_info = NULL;
 	spin_lock(&journal->j_state_lock);
 	spin_lock(&transaction->t_handle_lock);
@@ -1463,7 +1461,8 @@ int journal_stop(handle_t *handle)
 		jbd_debug(2, "transaction too old, requesting commit for "
 					"handle %p\n", handle);
 		/* This is non-blocking */
-		__log_start_commit(journal, transaction->t_tid);
+		__log_start_commit(journal, transaction->t_tid,
+			handle->h_sync && !(current->flags & PF_MEMALLOC));
 		spin_unlock(&journal->j_state_lock);
 
 		/*
diff --git a/include/linux/jbd.h b/include/linux/jbd.h
index e069650..c38f73e 100644
--- a/include/linux/jbd.h
+++ b/include/linux/jbd.h
@@ -541,12 +541,6 @@ struct transaction_s
 	 * How many handles used this transaction? [t_handle_lock]
 	 */
 	int t_handle_count;
-
-	/*
-	 * This transaction is being forced and some process is
-	 * waiting for it to finish.
-	 */
-	unsigned int t_synchronous_commit:1;
 };
 
 /**
@@ -594,6 +588,8 @@ struct transaction_s
  *  transaction
  * @j_commit_request: Sequence number of the most recent transaction wanting
  *     commit
+ * @j_commit_waited: Sequence number of the most recent transaction someone
+ *     is waiting for to commit.
  * @j_uuid: Uuid of client object.
  * @j_task: Pointer to the current commit thread for this journal
  * @j_max_transaction_buffers:  Maximum number of metadata buffers to allow in a
@@ -762,6 +758,13 @@ struct journal_s
 	tid_t			j_commit_request;
 
 	/*
+	 * Sequence number of the most recent transaction someone is waiting
+	 * for to commit.
+	 * [j_state_lock]
+	 */
+	tid_t                   j_commit_waited;
+
+	/*
 	 * Journal uuid: identifies the object (filesystem, LVM volume etc)
 	 * backed by this journal.  This will eventually be replaced by an array
 	 * of uuids, allowing us to index multiple devices within a single
@@ -985,9 +988,9 @@ extern void	journal_switch_revoke_table(journal_t *journal);
  */
 
 int __log_space_left(journal_t *); /* Called with journal locked */
-int log_start_commit(journal_t *journal, tid_t tid);
-int __log_start_commit(journal_t *journal, tid_t tid);
-int journal_start_commit(journal_t *journal, tid_t *tid);
+int log_start_commit(journal_t *journal, tid_t tid, bool will_wait);
+int __log_start_commit(journal_t *journal, tid_t tid, bool will_wait);
+int journal_start_commit(journal_t *journal, tid_t *tid, bool will_wait);
 int journal_force_commit_nested(journal_t *journal);
 int log_wait_commit(journal_t *journal, tid_t tid);
 int log_do_checkpoint(journal_t *journal);
-- 
1.7.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-02-21 16:49               ` Jan Kara
@ 2011-02-23  8:24                 ` Alex,Shi
  2011-02-24 12:13                   ` Jan Kara
  0 siblings, 1 reply; 38+ messages in thread
From: Alex,Shi @ 2011-02-23  8:24 UTC (permalink / raw)
  To: Jan Kara
  Cc: Li, Shaohua, Corrado Zoccolo, Vivek Goyal, tytso, jaxboe,
	linux-kernel, Chen, Tim C

Though these patches can not totally recovered the problem, but they are
quite helpful with ccache enabled situation. It increase 10% performance
on 38-rc1 kernel. 
I have tried to enabled they to latest rc6 kernel but failed. the vmstat output is here: 
with patches:
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
1.5 24.7 0.0 64316199.8 17240.8 153376.9 0.0 0.0 1777.1 1788.0 6479.0 2605.3 1.7 0.9 89.1 8.2 0.0
original 38-rc1 kernel:
2.4 32.3 0.0 63653302.9 17170.6 153125.3 0.0 0.0 1579.7 1834.1 6016.4 2407.0 1.5 0.7 86.6 10.1 0.0

It reduce write blocks clearly. 

> > can you send a patch with both ext3 and ext4 changes? Our test shows
> > your patch has positive effect, but need confirm with the ext3 change.
>   Sure. Patches for both ext3 & ext4 are attached. Sorry, it took me a
> while to get to this.
> 
> 									Honza



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-02-23  8:24                 ` Alex,Shi
@ 2011-02-24 12:13                   ` Jan Kara
  2011-02-25  0:44                     ` Alex Shi
  2011-02-26 14:45                     ` Corrado Zoccolo
  0 siblings, 2 replies; 38+ messages in thread
From: Jan Kara @ 2011-02-24 12:13 UTC (permalink / raw)
  To: Alex,Shi
  Cc: Jan Kara, Li, Shaohua, Corrado Zoccolo, Vivek Goyal, tytso,
	jaxboe, linux-kernel, Chen, Tim C

[-- Attachment #1: Type: text/plain, Size: 673 bytes --]

On Wed 23-02-11 16:24:47, Alex,Shi wrote:
> Though these patches can not totally recovered the problem, but they are
> quite helpful with ccache enabled situation. It increase 10% performance
> on 38-rc1 kernel. 
  OK and what was the original performance drop with WRITE_SYNC change?

> I have tried to enabled they to latest rc6 kernel but failed. the vmstat output is here: 
> with patches:
  I'm attaching patches rebased on top of latest Linus's tree. 
Corrado, could you possibly run your fsync-heavy tests so that we see
whether there isn't negative impact of my patches on your fsync-heavy
workload? Thanks.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0001-jbd2-Refine-commit-writeout-logic.patch --]
[-- Type: text/x-patch, Size: 10946 bytes --]

>From 555d33d45bd38a8da6b187a15ad5eb9e9c4f1ab9 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Wed, 19 Jan 2011 13:45:04 +0100
Subject: [PATCH 1/2] jbd2: Refine commit writeout logic

Currently we write out all journal buffers in WRITE_SYNC mode. This improves
performance for fsync heavy workloads but hinders performance when writes
are mostly asynchronous. So add possibility for callers starting a transaction
commit to specify whether they are going to wait for the commit and submit
journal writes in WRITE_SYNC mode only in that case.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/fsync.c       |    2 +-
 fs/ext4/super.c       |    2 +-
 fs/jbd2/checkpoint.c  |    2 +-
 fs/jbd2/commit.c      |    4 ++--
 fs/jbd2/journal.c     |   19 ++++++++++---------
 fs/jbd2/transaction.c |   15 +++++++++------
 fs/ocfs2/aops.c       |    2 +-
 fs/ocfs2/super.c      |    2 +-
 include/linux/jbd2.h  |   18 ++++++++++--------
 9 files changed, 36 insertions(+), 30 deletions(-)

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 7829b28..19434da 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -198,7 +198,7 @@ int ext4_sync_file(struct file *file, int datasync)
 		return ext4_force_commit(inode->i_sb);
 
 	commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
-	if (jbd2_log_start_commit(journal, commit_tid)) {
+	if (jbd2_log_start_commit(journal, commit_tid, true)) {
 		/*
 		 * When the journal is on a different device than the
 		 * fs data disk, we need to issue the barrier in
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index f6a318f..bfce0d6 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4115,7 +4115,7 @@ static int ext4_sync_fs(struct super_block *sb, int wait)
 
 	trace_ext4_sync_fs(sb, wait);
 	flush_workqueue(sbi->dio_unwritten_wq);
-	if (jbd2_journal_start_commit(sbi->s_journal, &target)) {
+	if (jbd2_journal_start_commit(sbi->s_journal, &target, true)) {
 		if (wait)
 			jbd2_log_wait_commit(sbi->s_journal, target);
 	}
diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 6a79fd0..3436d53 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -309,7 +309,7 @@ static int __process_buffer(journal_t *journal, struct journal_head *jh,
 			       "Waiting for Godot: block %llu\n",
 			       journal->j_devname,
 			       (unsigned long long) bh->b_blocknr);
-		jbd2_log_start_commit(journal, tid);
+		jbd2_log_start_commit(journal, tid, true);
 		jbd2_log_wait_commit(journal, tid);
 		ret = 1;
 	} else if (!buffer_dirty(bh)) {
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index f3ad159..19973eb 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -329,7 +329,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	int tag_bytes = journal_tag_bytes(journal);
 	struct buffer_head *cbh = NULL; /* For transactional checksums */
 	__u32 crc32_sum = ~0;
-	int write_op = WRITE_SYNC;
+	int write_op = WRITE;
 
 	/*
 	 * First job: lock down the current transaction and wait for
@@ -368,7 +368,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	 * we unplug the device. We don't do explicit unplugging in here,
 	 * instead we rely on sync_buffer() doing the unplug for us.
 	 */
-	if (commit_transaction->t_synchronous_commit)
+	if (tid_geq(journal->j_commit_waited, commit_transaction->t_tid))
 		write_op = WRITE_SYNC_PLUG;
 	trace_jbd2_commit_locking(journal, commit_transaction);
 	stats.run.rs_wait = commit_transaction->t_max_wait;
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 97e7346..04835d6 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -476,7 +476,7 @@ int __jbd2_log_space_left(journal_t *journal)
  * Called with j_state_lock locked for writing.
  * Returns true if a transaction commit was started.
  */
-int __jbd2_log_start_commit(journal_t *journal, tid_t target)
+int __jbd2_log_start_commit(journal_t *journal, tid_t target, bool will_wait)
 {
 	/*
 	 * Are we already doing a recent enough commit?
@@ -486,7 +486,8 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t target)
 		 * We want a new commit: OK, mark the request and wakeup the
 		 * commit thread.  We do _not_ do the commit ourselves.
 		 */
-
+		if (will_wait && !tid_geq(journal->j_commit_waited, target))
+			journal->j_commit_waited = target;
 		journal->j_commit_request = target;
 		jbd_debug(1, "JBD: requesting commit %d/%d\n",
 			  journal->j_commit_request,
@@ -497,12 +498,12 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t target)
 	return 0;
 }
 
-int jbd2_log_start_commit(journal_t *journal, tid_t tid)
+int jbd2_log_start_commit(journal_t *journal, tid_t tid, bool will_wait)
 {
 	int ret;
 
 	write_lock(&journal->j_state_lock);
-	ret = __jbd2_log_start_commit(journal, tid);
+	ret = __jbd2_log_start_commit(journal, tid, will_wait);
 	write_unlock(&journal->j_state_lock);
 	return ret;
 }
@@ -539,7 +540,7 @@ int jbd2_journal_force_commit_nested(journal_t *journal)
 	tid = transaction->t_tid;
 	read_unlock(&journal->j_state_lock);
 	if (need_to_start)
-		jbd2_log_start_commit(journal, tid);
+		jbd2_log_start_commit(journal, tid, true);
 	jbd2_log_wait_commit(journal, tid);
 	return 1;
 }
@@ -549,7 +550,7 @@ int jbd2_journal_force_commit_nested(journal_t *journal)
  * if a transaction is going to be committed (or is currently already
  * committing), and fills its tid in at *ptid
  */
-int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid)
+int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid, bool will_wait)
 {
 	int ret = 0;
 
@@ -557,7 +558,7 @@ int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid)
 	if (journal->j_running_transaction) {
 		tid_t tid = journal->j_running_transaction->t_tid;
 
-		__jbd2_log_start_commit(journal, tid);
+		__jbd2_log_start_commit(journal, tid, will_wait);
 		/* There's a running transaction and we've just made sure
 		 * it's commit has been scheduled. */
 		if (ptid)
@@ -1564,7 +1565,7 @@ int jbd2_journal_flush(journal_t *journal)
 	/* Force everything buffered to the log... */
 	if (journal->j_running_transaction) {
 		transaction = journal->j_running_transaction;
-		__jbd2_log_start_commit(journal, transaction->t_tid);
+		__jbd2_log_start_commit(journal, transaction->t_tid, true);
 	} else if (journal->j_committing_transaction)
 		transaction = journal->j_committing_transaction;
 
@@ -1680,7 +1681,7 @@ void __jbd2_journal_abort_hard(journal_t *journal)
 	journal->j_flags |= JBD2_ABORT;
 	transaction = journal->j_running_transaction;
 	if (transaction)
-		__jbd2_log_start_commit(journal, transaction->t_tid);
+		__jbd2_log_start_commit(journal, transaction->t_tid, false);
 	write_unlock(&journal->j_state_lock);
 }
 
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 1d11910..340dd5e 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -226,7 +226,7 @@ repeat:
 		need_to_start = !tid_geq(journal->j_commit_request, tid);
 		read_unlock(&journal->j_state_lock);
 		if (need_to_start)
-			jbd2_log_start_commit(journal, tid);
+			jbd2_log_start_commit(journal, tid, false);
 		schedule();
 		finish_wait(&journal->j_wait_transaction_locked, &wait);
 		goto repeat;
@@ -469,8 +469,12 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, int gfp_mask)
 	spin_unlock(&transaction->t_handle_lock);
 
 	jbd_debug(2, "restarting handle %p\n", handle);
+<<<<<<< HEAD
 	tid = transaction->t_tid;
 	need_to_start = !tid_geq(journal->j_commit_request, tid);
+=======
+	__jbd2_log_start_commit(journal, transaction->t_tid, false);
+>>>>>>> jbd2: Refine commit writeout logic
 	read_unlock(&journal->j_state_lock);
 	if (need_to_start)
 		jbd2_log_start_commit(journal, tid);
@@ -1368,8 +1372,6 @@ int jbd2_journal_stop(handle_t *handle)
 		}
 	}
 
-	if (handle->h_sync)
-		transaction->t_synchronous_commit = 1;
 	current->journal_info = NULL;
 	atomic_sub(handle->h_buffer_credits,
 		   &transaction->t_outstanding_credits);
@@ -1390,15 +1392,16 @@ int jbd2_journal_stop(handle_t *handle)
 
 		jbd_debug(2, "transaction too old, requesting commit for "
 					"handle %p\n", handle);
-		/* This is non-blocking */
-		jbd2_log_start_commit(journal, transaction->t_tid);
-
 		/*
 		 * Special case: JBD2_SYNC synchronous updates require us
 		 * to wait for the commit to complete.
 		 */
 		if (handle->h_sync && !(current->flags & PF_MEMALLOC))
 			wait_for_commit = 1;
+
+		/* This is non-blocking */
+		jbd2_log_start_commit(journal, transaction->t_tid,
+				      wait_for_commit);
 	}
 
 	/*
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 1fbb0e2..d493f32 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -1659,7 +1659,7 @@ static int ocfs2_try_to_free_truncate_log(struct ocfs2_super *osb,
 		goto out;
 	}
 
-	if (jbd2_journal_start_commit(osb->journal->j_journal, &target)) {
+	if (jbd2_journal_start_commit(osb->journal->j_journal, &target, true)) {
 		jbd2_log_wait_commit(osb->journal->j_journal, target);
 		ret = 1;
 	}
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 38f986d..45d9f82 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -414,7 +414,7 @@ static int ocfs2_sync_fs(struct super_block *sb, int wait)
 	}
 
 	if (jbd2_journal_start_commit(OCFS2_SB(sb)->journal->j_journal,
-				      &target)) {
+				      &target, wait)) {
 		if (wait)
 			jbd2_log_wait_commit(OCFS2_SB(sb)->journal->j_journal,
 					     target);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 27e79c2..46aaf45 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -631,11 +631,6 @@ struct transaction_s
 	 */
 	atomic_t		t_handle_count;
 
-	/*
-	 * This transaction is being forced and some process is
-	 * waiting for it to finish.
-	 */
-	unsigned int t_synchronous_commit:1;
 	unsigned int t_flushed_data_blocks:1;
 
 	/*
@@ -900,6 +895,13 @@ struct journal_s
 	tid_t			j_commit_request;
 
 	/*
+	 * Sequence number of the most recent transaction someone is waiting
+	 * for to commit.
+	 * [j_state_lock]
+	 */
+	tid_t			j_commit_waited;
+
+	/*
 	 * Journal uuid: identifies the object (filesystem, LVM volume etc)
 	 * backed by this journal.  This will eventually be replaced by an array
 	 * of uuids, allowing us to index multiple devices within a single
@@ -1200,9 +1202,9 @@ extern void	jbd2_journal_switch_revoke_table(journal_t *journal);
  */
 
 int __jbd2_log_space_left(journal_t *); /* Called with journal locked */
-int jbd2_log_start_commit(journal_t *journal, tid_t tid);
-int __jbd2_log_start_commit(journal_t *journal, tid_t tid);
-int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
+int jbd2_log_start_commit(journal_t *journal, tid_t tid, bool will_wait);
+int __jbd2_log_start_commit(journal_t *journal, tid_t tid, bool will_wait);
+int jbd2_journal_start_commit(journal_t *journal, tid_t *tid, bool will_wait);
 int jbd2_journal_force_commit_nested(journal_t *journal);
 int jbd2_log_wait_commit(journal_t *journal, tid_t tid);
 int jbd2_log_do_checkpoint(journal_t *journal);
-- 
1.7.1


[-- Attachment #3: 0002-jbd-Refine-commit-writeout-logic.patch --]
[-- Type: text/x-patch, Size: 9843 bytes --]

>From 3e4f42155513df7b6d36584e7b8383b86f425467 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Mon, 21 Feb 2011 17:25:37 +0100
Subject: [PATCH 2/2] jbd: Refine commit writeout logic

Currently we write out all journal buffers in WRITE_SYNC mode. This improves
performance for fsync heavy workloads but hinders performance when writes
are mostly asynchronous. So add possibility for callers starting a transaction
commit to specify whether they are going to wait for the commit and submit
journal writes in WRITE_SYNC mode only in that case.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext3/fsync.c      |    2 +-
 fs/ext3/super.c      |    2 +-
 fs/jbd/checkpoint.c  |    2 +-
 fs/jbd/commit.c      |    4 ++--
 fs/jbd/journal.c     |   19 ++++++++++---------
 fs/jbd/transaction.c |    9 ++++-----
 include/linux/jbd.h  |   21 ++++++++++++---------
 7 files changed, 31 insertions(+), 28 deletions(-)

diff --git a/fs/ext3/fsync.c b/fs/ext3/fsync.c
index 09b13bb..5396dd6 100644
--- a/fs/ext3/fsync.c
+++ b/fs/ext3/fsync.c
@@ -81,7 +81,7 @@ int ext3_sync_file(struct file *file, int datasync)
 	if (test_opt(inode->i_sb, BARRIER) &&
 	    !journal_trans_will_send_data_barrier(journal, commit_tid))
 		needs_barrier = 1;
-	log_start_commit(journal, commit_tid);
+	log_start_commit(journal, commit_tid, true);
 	ret = log_wait_commit(journal, commit_tid);
 
 	/*
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 85c8cc8..58a4424 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -2497,7 +2497,7 @@ static int ext3_sync_fs(struct super_block *sb, int wait)
 {
 	tid_t target;
 
-	if (journal_start_commit(EXT3_SB(sb)->s_journal, &target)) {
+	if (journal_start_commit(EXT3_SB(sb)->s_journal, &target, true)) {
 		if (wait)
 			log_wait_commit(EXT3_SB(sb)->s_journal, target);
 	}
diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c
index e4b87bc..26ca2c3 100644
--- a/fs/jbd/checkpoint.c
+++ b/fs/jbd/checkpoint.c
@@ -297,7 +297,7 @@ static int __process_buffer(journal_t *journal, struct journal_head *jh,
 
 		spin_unlock(&journal->j_list_lock);
 		jbd_unlock_bh_state(bh);
-		log_start_commit(journal, tid);
+		log_start_commit(journal, tid, true);
 		log_wait_commit(journal, tid);
 		ret = 1;
 	} else if (!buffer_dirty(bh)) {
diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 34a4861..ac188cb 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -294,7 +294,7 @@ void journal_commit_transaction(journal_t *journal)
 	int first_tag = 0;
 	int tag_flag;
 	int i;
-	int write_op = WRITE_SYNC;
+	int write_op = WRITE;
 
 	/*
 	 * First job: lock down the current transaction and wait for
@@ -332,7 +332,7 @@ void journal_commit_transaction(journal_t *journal)
 	 * we unplug the device. We don't do explicit unplugging in here,
 	 * instead we rely on sync_buffer() doing the unplug for us.
 	 */
-	if (commit_transaction->t_synchronous_commit)
+	if (tid_geq(journal->j_commit_waited, commit_transaction->t_tid))
 		write_op = WRITE_SYNC_PLUG;
 	spin_lock(&commit_transaction->t_handle_lock);
 	while (commit_transaction->t_updates) {
diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index da1b5e4..5743b4c 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -434,7 +434,7 @@ int __log_space_left(journal_t *journal)
 /*
  * Called under j_state_lock.  Returns true if a transaction commit was started.
  */
-int __log_start_commit(journal_t *journal, tid_t target)
+int __log_start_commit(journal_t *journal, tid_t target, bool will_wait)
 {
 	/*
 	 * Are we already doing a recent enough commit?
@@ -444,7 +444,8 @@ int __log_start_commit(journal_t *journal, tid_t target)
 		 * We want a new commit: OK, mark the request and wakeup the
 		 * commit thread.  We do _not_ do the commit ourselves.
 		 */
-
+		if (will_wait && !tid_geq(journal->j_commit_waited, target))
+			journal->j_commit_waited = target;
 		journal->j_commit_request = target;
 		jbd_debug(1, "JBD: requesting commit %d/%d\n",
 			  journal->j_commit_request,
@@ -455,12 +456,12 @@ int __log_start_commit(journal_t *journal, tid_t target)
 	return 0;
 }
 
-int log_start_commit(journal_t *journal, tid_t tid)
+int log_start_commit(journal_t *journal, tid_t tid, bool will_wait)
 {
 	int ret;
 
 	spin_lock(&journal->j_state_lock);
-	ret = __log_start_commit(journal, tid);
+	ret = __log_start_commit(journal, tid, will_wait);
 	spin_unlock(&journal->j_state_lock);
 	return ret;
 }
@@ -483,7 +484,7 @@ int journal_force_commit_nested(journal_t *journal)
 	spin_lock(&journal->j_state_lock);
 	if (journal->j_running_transaction && !current->journal_info) {
 		transaction = journal->j_running_transaction;
-		__log_start_commit(journal, transaction->t_tid);
+		__log_start_commit(journal, transaction->t_tid, true);
 	} else if (journal->j_committing_transaction)
 		transaction = journal->j_committing_transaction;
 
@@ -503,7 +504,7 @@ int journal_force_commit_nested(journal_t *journal)
  * if a transaction is going to be committed (or is currently already
  * committing), and fills its tid in at *ptid
  */
-int journal_start_commit(journal_t *journal, tid_t *ptid)
+int journal_start_commit(journal_t *journal, tid_t *ptid, bool will_wait)
 {
 	int ret = 0;
 
@@ -511,7 +512,7 @@ int journal_start_commit(journal_t *journal, tid_t *ptid)
 	if (journal->j_running_transaction) {
 		tid_t tid = journal->j_running_transaction->t_tid;
 
-		__log_start_commit(journal, tid);
+		__log_start_commit(journal, tid, will_wait);
 		/* There's a running transaction and we've just made sure
 		 * it's commit has been scheduled. */
 		if (ptid)
@@ -1439,7 +1440,7 @@ int journal_flush(journal_t *journal)
 	/* Force everything buffered to the log... */
 	if (journal->j_running_transaction) {
 		transaction = journal->j_running_transaction;
-		__log_start_commit(journal, transaction->t_tid);
+		__log_start_commit(journal, transaction->t_tid, true);
 	} else if (journal->j_committing_transaction)
 		transaction = journal->j_committing_transaction;
 
@@ -1573,7 +1574,7 @@ static void __journal_abort_hard(journal_t *journal)
 	journal->j_flags |= JFS_ABORT;
 	transaction = journal->j_running_transaction;
 	if (transaction)
-		__log_start_commit(journal, transaction->t_tid);
+		__log_start_commit(journal, transaction->t_tid, false);
 	spin_unlock(&journal->j_state_lock);
 }
 
diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index 5b2e4c3..a12c0a3 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -178,7 +178,7 @@ repeat_locked:
 		spin_unlock(&transaction->t_handle_lock);
 		prepare_to_wait(&journal->j_wait_transaction_locked, &wait,
 				TASK_UNINTERRUPTIBLE);
-		__log_start_commit(journal, transaction->t_tid);
+		__log_start_commit(journal, transaction->t_tid, false);
 		spin_unlock(&journal->j_state_lock);
 		schedule();
 		finish_wait(&journal->j_wait_transaction_locked, &wait);
@@ -411,7 +411,7 @@ int journal_restart(handle_t *handle, int nblocks)
 	spin_unlock(&transaction->t_handle_lock);
 
 	jbd_debug(2, "restarting handle %p\n", handle);
-	__log_start_commit(journal, transaction->t_tid);
+	__log_start_commit(journal, transaction->t_tid, false);
 	spin_unlock(&journal->j_state_lock);
 
 	lock_map_release(&handle->h_lockdep_map);
@@ -1431,8 +1431,6 @@ int journal_stop(handle_t *handle)
 		}
 	}
 
-	if (handle->h_sync)
-		transaction->t_synchronous_commit = 1;
 	current->journal_info = NULL;
 	spin_lock(&journal->j_state_lock);
 	spin_lock(&transaction->t_handle_lock);
@@ -1463,7 +1461,8 @@ int journal_stop(handle_t *handle)
 		jbd_debug(2, "transaction too old, requesting commit for "
 					"handle %p\n", handle);
 		/* This is non-blocking */
-		__log_start_commit(journal, transaction->t_tid);
+		__log_start_commit(journal, transaction->t_tid,
+			handle->h_sync && !(current->flags & PF_MEMALLOC));
 		spin_unlock(&journal->j_state_lock);
 
 		/*
diff --git a/include/linux/jbd.h b/include/linux/jbd.h
index e069650..c38f73e 100644
--- a/include/linux/jbd.h
+++ b/include/linux/jbd.h
@@ -541,12 +541,6 @@ struct transaction_s
 	 * How many handles used this transaction? [t_handle_lock]
 	 */
 	int t_handle_count;
-
-	/*
-	 * This transaction is being forced and some process is
-	 * waiting for it to finish.
-	 */
-	unsigned int t_synchronous_commit:1;
 };
 
 /**
@@ -594,6 +588,8 @@ struct transaction_s
  *  transaction
  * @j_commit_request: Sequence number of the most recent transaction wanting
  *     commit
+ * @j_commit_waited: Sequence number of the most recent transaction someone
+ *     is waiting for to commit.
  * @j_uuid: Uuid of client object.
  * @j_task: Pointer to the current commit thread for this journal
  * @j_max_transaction_buffers:  Maximum number of metadata buffers to allow in a
@@ -762,6 +758,13 @@ struct journal_s
 	tid_t			j_commit_request;
 
 	/*
+	 * Sequence number of the most recent transaction someone is waiting
+	 * for to commit.
+	 * [j_state_lock]
+	 */
+	tid_t                   j_commit_waited;
+
+	/*
 	 * Journal uuid: identifies the object (filesystem, LVM volume etc)
 	 * backed by this journal.  This will eventually be replaced by an array
 	 * of uuids, allowing us to index multiple devices within a single
@@ -985,9 +988,9 @@ extern void	journal_switch_revoke_table(journal_t *journal);
  */
 
 int __log_space_left(journal_t *); /* Called with journal locked */
-int log_start_commit(journal_t *journal, tid_t tid);
-int __log_start_commit(journal_t *journal, tid_t tid);
-int journal_start_commit(journal_t *journal, tid_t *tid);
+int log_start_commit(journal_t *journal, tid_t tid, bool will_wait);
+int __log_start_commit(journal_t *journal, tid_t tid, bool will_wait);
+int journal_start_commit(journal_t *journal, tid_t *tid, bool will_wait);
 int journal_force_commit_nested(journal_t *journal);
 int log_wait_commit(journal_t *journal, tid_t tid);
 int log_do_checkpoint(journal_t *journal);
-- 
1.7.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-02-24 12:13                   ` Jan Kara
@ 2011-02-25  0:44                     ` Alex Shi
  2011-02-26 14:45                     ` Corrado Zoccolo
  1 sibling, 0 replies; 38+ messages in thread
From: Alex Shi @ 2011-02-25  0:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: Li, Shaohua, Corrado Zoccolo, Vivek Goyal, tytso, jaxboe,
	linux-kernel, Chen, Tim C

Jan Kara wrote:
> On Wed 23-02-11 16:24:47, Alex,Shi wrote:
>   
>> Though these patches can not totally recovered the problem, but they are
>> quite helpful with ccache enabled situation. It increase 10% performance
>> on 38-rc1 kernel. 
>>     
>   OK and what was the original performance drop with WRITE_SYNC change?
>   
The original drop is 30%.
>   
>> I have tried to enabled they to latest rc6 kernel but failed. the vmstat output is here: 
>> with patches:
>>     
>   I'm attaching patches rebased on top of latest Linus's tree. 
> Corrado, could you possibly run your fsync-heavy tests so that we see
> whether there isn't negative impact of my patches on your fsync-heavy
> workload? Thanks.
>
> 								Honza
>   


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-02-24 12:13                   ` Jan Kara
  2011-02-25  0:44                     ` Alex Shi
@ 2011-02-26 14:45                     ` Corrado Zoccolo
  2011-03-01 19:56                       ` Jeff Moyer
  1 sibling, 1 reply; 38+ messages in thread
From: Corrado Zoccolo @ 2011-02-26 14:45 UTC (permalink / raw)
  To: Jan Kara, Jeff Moyer
  Cc: Alex,Shi, Li, Shaohua, Vivek Goyal, tytso, jaxboe, linux-kernel,
	Chen, Tim C

On Thu, Feb 24, 2011 at 1:13 PM, Jan Kara <jack@suse.cz> wrote:
> On Wed 23-02-11 16:24:47, Alex,Shi wrote:
>> Though these patches can not totally recovered the problem, but they are
>> quite helpful with ccache enabled situation. It increase 10% performance
>> on 38-rc1 kernel.
>  OK and what was the original performance drop with WRITE_SYNC change?
>
>> I have tried to enabled they to latest rc6 kernel but failed. the vmstat output is here:
>> with patches:
>  I'm attaching patches rebased on top of latest Linus's tree.
> Corrado, could you possibly run your fsync-heavy tests so that we see
> whether there isn't negative impact of my patches on your fsync-heavy
> workload? Thanks.
The workload was actually Jeff's, and the stalls that my change tried
to mitigate showed up on his enterprise class storage. Adding him so
he can test it.

Corrado
>                                                                Honza
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
>



-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-02-26 14:45                     ` Corrado Zoccolo
@ 2011-03-01 19:56                       ` Jeff Moyer
  2011-03-02  9:42                         ` Jan Kara
  0 siblings, 1 reply; 38+ messages in thread
From: Jeff Moyer @ 2011-03-01 19:56 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Jan Kara, Alex,Shi, Li, Shaohua, Vivek Goyal, tytso, jaxboe,
	linux-kernel, Chen, Tim C

Corrado Zoccolo <czoccolo@gmail.com> writes:

> On Thu, Feb 24, 2011 at 1:13 PM, Jan Kara <jack@suse.cz> wrote:
>> On Wed 23-02-11 16:24:47, Alex,Shi wrote:
>>> Though these patches can not totally recovered the problem, but they are
>>> quite helpful with ccache enabled situation. It increase 10% performance
>>> on 38-rc1 kernel.
>>  OK and what was the original performance drop with WRITE_SYNC change?
>>
>>> I have tried to enabled they to latest rc6 kernel but failed. the vmstat output is here:
>>> with patches:
>>  I'm attaching patches rebased on top of latest Linus's tree.
>> Corrado, could you possibly run your fsync-heavy tests so that we see
>> whether there isn't negative impact of my patches on your fsync-heavy
>> workload? Thanks.
> The workload was actually Jeff's, and the stalls that my change tried
> to mitigate showed up on his enterprise class storage. Adding him so
> he can test it.

Sorry for the late reply.  You can use either fs_mark or iozone to
generate an fsync-heavy workload.  The test I did was to mix this with a
sequential reader.  If you can point me at patches, I should be able to
test this.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-03-01 19:56                       ` Jeff Moyer
@ 2011-03-02  9:42                         ` Jan Kara
  2011-03-02 16:13                           ` Jeff Moyer
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2011-03-02  9:42 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Corrado Zoccolo, Jan Kara, Alex,Shi, Li, Shaohua, Vivek Goyal,
	tytso, jaxboe, linux-kernel, Chen, Tim C

On Tue 01-03-11 14:56:43, Jeff Moyer wrote:
> Corrado Zoccolo <czoccolo@gmail.com> writes:
> 
> > On Thu, Feb 24, 2011 at 1:13 PM, Jan Kara <jack@suse.cz> wrote:
> >> On Wed 23-02-11 16:24:47, Alex,Shi wrote:
> >>> Though these patches can not totally recovered the problem, but they are
> >>> quite helpful with ccache enabled situation. It increase 10% performance
> >>> on 38-rc1 kernel.
> >>  OK and what was the original performance drop with WRITE_SYNC change?
> >>
> >>> I have tried to enabled they to latest rc6 kernel but failed. the vmstat output is here:
> >>> with patches:
> >>  I'm attaching patches rebased on top of latest Linus's tree.
> >> Corrado, could you possibly run your fsync-heavy tests so that we see
> >> whether there isn't negative impact of my patches on your fsync-heavy
> >> workload? Thanks.
> > The workload was actually Jeff's, and the stalls that my change tried
> > to mitigate showed up on his enterprise class storage. Adding him so
> > he can test it.
> 
> Sorry for the late reply.  You can use either fs_mark or iozone to
> generate an fsync-heavy workload.  The test I did was to mix this with a
> sequential reader.  If you can point me at patches, I should be able to
> test this.
  The latest version of patches is attached to:
https://lkml.org/lkml/2011/2/24/125

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-03-02  9:42                         ` Jan Kara
@ 2011-03-02 16:13                           ` Jeff Moyer
  2011-03-02 21:17                             ` Jan Kara
  0 siblings, 1 reply; 38+ messages in thread
From: Jeff Moyer @ 2011-03-02 16:13 UTC (permalink / raw)
  To: Jan Kara
  Cc: Corrado Zoccolo, Alex,Shi, Li, Shaohua, Vivek Goyal, tytso,
	jaxboe, linux-kernel, Chen, Tim C

Jan Kara <jack@suse.cz> writes:

> On Tue 01-03-11 14:56:43, Jeff Moyer wrote:
>> Corrado Zoccolo <czoccolo@gmail.com> writes:
>> 
>> > On Thu, Feb 24, 2011 at 1:13 PM, Jan Kara <jack@suse.cz> wrote:
>> >> On Wed 23-02-11 16:24:47, Alex,Shi wrote:
>> >>> Though these patches can not totally recovered the problem, but they are
>> >>> quite helpful with ccache enabled situation. It increase 10% performance
>> >>> on 38-rc1 kernel.
>> >>  OK and what was the original performance drop with WRITE_SYNC change?
>> >>
>> >>> I have tried to enabled they to latest rc6 kernel but failed. the vmstat output is here:
>> >>> with patches:
>> >>  I'm attaching patches rebased on top of latest Linus's tree.
>> >> Corrado, could you possibly run your fsync-heavy tests so that we see
>> >> whether there isn't negative impact of my patches on your fsync-heavy
>> >> workload? Thanks.
>> > The workload was actually Jeff's, and the stalls that my change tried
>> > to mitigate showed up on his enterprise class storage. Adding him so
>> > he can test it.
>> 
>> Sorry for the late reply.  You can use either fs_mark or iozone to
>> generate an fsync-heavy workload.  The test I did was to mix this with a
>> sequential reader.  If you can point me at patches, I should be able to
>> test this.
>   The latest version of patches is attached to:
> https://lkml.org/lkml/2011/2/24/125

Perhaps you should fix up the merge conflicts, first?  ;-)

+<<<<<<< HEAD
        tid = transaction->t_tid;
        need_to_start = !tid_geq(journal->j_commit_request, tid);
+=======
+       __jbd2_log_start_commit(journal, transaction->t_tid, false);
+>>>>>>> jbd2: Refine commit writeout logic

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-03-02 16:13                           ` Jeff Moyer
@ 2011-03-02 21:17                             ` Jan Kara
  2011-03-02 21:20                               ` Jeff Moyer
  2011-03-03  1:14                               ` Jeff Moyer
  0 siblings, 2 replies; 38+ messages in thread
From: Jan Kara @ 2011-03-02 21:17 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Jan Kara, Corrado Zoccolo, Alex,Shi, Li, Shaohua, Vivek Goyal,
	tytso, jaxboe, linux-kernel, Chen, Tim C

[-- Attachment #1: Type: text/plain, Size: 1933 bytes --]

On Wed 02-03-11 11:13:53, Jeff Moyer wrote:
> Jan Kara <jack@suse.cz> writes:
> > On Tue 01-03-11 14:56:43, Jeff Moyer wrote:
> >> Corrado Zoccolo <czoccolo@gmail.com> writes:
> >> 
> >> > On Thu, Feb 24, 2011 at 1:13 PM, Jan Kara <jack@suse.cz> wrote:
> >> >> On Wed 23-02-11 16:24:47, Alex,Shi wrote:
> >> >>> Though these patches can not totally recovered the problem, but they are
> >> >>> quite helpful with ccache enabled situation. It increase 10% performance
> >> >>> on 38-rc1 kernel.
> >> >>  OK and what was the original performance drop with WRITE_SYNC change?
> >> >>
> >> >>> I have tried to enabled they to latest rc6 kernel but failed. the vmstat output is here:
> >> >>> with patches:
> >> >>  I'm attaching patches rebased on top of latest Linus's tree.
> >> >> Corrado, could you possibly run your fsync-heavy tests so that we see
> >> >> whether there isn't negative impact of my patches on your fsync-heavy
> >> >> workload? Thanks.
> >> > The workload was actually Jeff's, and the stalls that my change tried
> >> > to mitigate showed up on his enterprise class storage. Adding him so
> >> > he can test it.
> >> 
> >> Sorry for the late reply.  You can use either fs_mark or iozone to
> >> generate an fsync-heavy workload.  The test I did was to mix this with a
> >> sequential reader.  If you can point me at patches, I should be able to
> >> test this.
> >   The latest version of patches is attached to:
> > https://lkml.org/lkml/2011/2/24/125
> 
> Perhaps you should fix up the merge conflicts, first?  ;-)
> 
> +<<<<<<< HEAD
>         tid = transaction->t_tid;
>         need_to_start = !tid_geq(journal->j_commit_request, tid);
> +=======
> +       __jbd2_log_start_commit(journal, transaction->t_tid, false);
> +>>>>>>> jbd2: Refine commit writeout logic
  Doh, how embarrassing ;). Attached is a new version which compiles and
seems to run OK.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0001-jbd2-Refine-commit-writeout-logic.patch --]
[-- Type: text/x-patch, Size: 10820 bytes --]

>From 2d4018f7fe647a440f488d71cf4ceffccaecb02e Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Wed, 19 Jan 2011 13:45:04 +0100
Subject: [PATCH 1/2] jbd2: Refine commit writeout logic

Currently we write out all journal buffers in WRITE_SYNC mode. This improves
performance for fsync heavy workloads but hinders performance when writes
are mostly asynchronous. So add possibility for callers starting a transaction
commit to specify whether they are going to wait for the commit and submit
journal writes in WRITE_SYNC mode only in that case.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/fsync.c       |    2 +-
 fs/ext4/super.c       |    2 +-
 fs/jbd2/checkpoint.c  |    2 +-
 fs/jbd2/commit.c      |    4 ++--
 fs/jbd2/journal.c     |   19 ++++++++++---------
 fs/jbd2/transaction.c |   13 ++++++-------
 fs/ocfs2/aops.c       |    2 +-
 fs/ocfs2/super.c      |    2 +-
 include/linux/jbd2.h  |   18 ++++++++++--------
 9 files changed, 33 insertions(+), 31 deletions(-)

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 7829b28..19434da 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -198,7 +198,7 @@ int ext4_sync_file(struct file *file, int datasync)
 		return ext4_force_commit(inode->i_sb);
 
 	commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
-	if (jbd2_log_start_commit(journal, commit_tid)) {
+	if (jbd2_log_start_commit(journal, commit_tid, true)) {
 		/*
 		 * When the journal is on a different device than the
 		 * fs data disk, we need to issue the barrier in
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index f6a318f..bfce0d6 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4115,7 +4115,7 @@ static int ext4_sync_fs(struct super_block *sb, int wait)
 
 	trace_ext4_sync_fs(sb, wait);
 	flush_workqueue(sbi->dio_unwritten_wq);
-	if (jbd2_journal_start_commit(sbi->s_journal, &target)) {
+	if (jbd2_journal_start_commit(sbi->s_journal, &target, true)) {
 		if (wait)
 			jbd2_log_wait_commit(sbi->s_journal, target);
 	}
diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 6a79fd0..3436d53 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -309,7 +309,7 @@ static int __process_buffer(journal_t *journal, struct journal_head *jh,
 			       "Waiting for Godot: block %llu\n",
 			       journal->j_devname,
 			       (unsigned long long) bh->b_blocknr);
-		jbd2_log_start_commit(journal, tid);
+		jbd2_log_start_commit(journal, tid, true);
 		jbd2_log_wait_commit(journal, tid);
 		ret = 1;
 	} else if (!buffer_dirty(bh)) {
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index f3ad159..19973eb 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -329,7 +329,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	int tag_bytes = journal_tag_bytes(journal);
 	struct buffer_head *cbh = NULL; /* For transactional checksums */
 	__u32 crc32_sum = ~0;
-	int write_op = WRITE_SYNC;
+	int write_op = WRITE;
 
 	/*
 	 * First job: lock down the current transaction and wait for
@@ -368,7 +368,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	 * we unplug the device. We don't do explicit unplugging in here,
 	 * instead we rely on sync_buffer() doing the unplug for us.
 	 */
-	if (commit_transaction->t_synchronous_commit)
+	if (tid_geq(journal->j_commit_waited, commit_transaction->t_tid))
 		write_op = WRITE_SYNC_PLUG;
 	trace_jbd2_commit_locking(journal, commit_transaction);
 	stats.run.rs_wait = commit_transaction->t_max_wait;
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 97e7346..04835d6 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -476,7 +476,7 @@ int __jbd2_log_space_left(journal_t *journal)
  * Called with j_state_lock locked for writing.
  * Returns true if a transaction commit was started.
  */
-int __jbd2_log_start_commit(journal_t *journal, tid_t target)
+int __jbd2_log_start_commit(journal_t *journal, tid_t target, bool will_wait)
 {
 	/*
 	 * Are we already doing a recent enough commit?
@@ -486,7 +486,8 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t target)
 		 * We want a new commit: OK, mark the request and wakeup the
 		 * commit thread.  We do _not_ do the commit ourselves.
 		 */
-
+		if (will_wait && !tid_geq(journal->j_commit_waited, target))
+			journal->j_commit_waited = target;
 		journal->j_commit_request = target;
 		jbd_debug(1, "JBD: requesting commit %d/%d\n",
 			  journal->j_commit_request,
@@ -497,12 +498,12 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t target)
 	return 0;
 }
 
-int jbd2_log_start_commit(journal_t *journal, tid_t tid)
+int jbd2_log_start_commit(journal_t *journal, tid_t tid, bool will_wait)
 {
 	int ret;
 
 	write_lock(&journal->j_state_lock);
-	ret = __jbd2_log_start_commit(journal, tid);
+	ret = __jbd2_log_start_commit(journal, tid, will_wait);
 	write_unlock(&journal->j_state_lock);
 	return ret;
 }
@@ -539,7 +540,7 @@ int jbd2_journal_force_commit_nested(journal_t *journal)
 	tid = transaction->t_tid;
 	read_unlock(&journal->j_state_lock);
 	if (need_to_start)
-		jbd2_log_start_commit(journal, tid);
+		jbd2_log_start_commit(journal, tid, true);
 	jbd2_log_wait_commit(journal, tid);
 	return 1;
 }
@@ -549,7 +550,7 @@ int jbd2_journal_force_commit_nested(journal_t *journal)
  * if a transaction is going to be committed (or is currently already
  * committing), and fills its tid in at *ptid
  */
-int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid)
+int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid, bool will_wait)
 {
 	int ret = 0;
 
@@ -557,7 +558,7 @@ int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid)
 	if (journal->j_running_transaction) {
 		tid_t tid = journal->j_running_transaction->t_tid;
 
-		__jbd2_log_start_commit(journal, tid);
+		__jbd2_log_start_commit(journal, tid, will_wait);
 		/* There's a running transaction and we've just made sure
 		 * it's commit has been scheduled. */
 		if (ptid)
@@ -1564,7 +1565,7 @@ int jbd2_journal_flush(journal_t *journal)
 	/* Force everything buffered to the log... */
 	if (journal->j_running_transaction) {
 		transaction = journal->j_running_transaction;
-		__jbd2_log_start_commit(journal, transaction->t_tid);
+		__jbd2_log_start_commit(journal, transaction->t_tid, true);
 	} else if (journal->j_committing_transaction)
 		transaction = journal->j_committing_transaction;
 
@@ -1680,7 +1681,7 @@ void __jbd2_journal_abort_hard(journal_t *journal)
 	journal->j_flags |= JBD2_ABORT;
 	transaction = journal->j_running_transaction;
 	if (transaction)
-		__jbd2_log_start_commit(journal, transaction->t_tid);
+		__jbd2_log_start_commit(journal, transaction->t_tid, false);
 	write_unlock(&journal->j_state_lock);
 }
 
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 1d11910..2211dae 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -226,7 +226,7 @@ repeat:
 		need_to_start = !tid_geq(journal->j_commit_request, tid);
 		read_unlock(&journal->j_state_lock);
 		if (need_to_start)
-			jbd2_log_start_commit(journal, tid);
+			jbd2_log_start_commit(journal, tid, false);
 		schedule();
 		finish_wait(&journal->j_wait_transaction_locked, &wait);
 		goto repeat;
@@ -473,7 +473,7 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, int gfp_mask)
 	need_to_start = !tid_geq(journal->j_commit_request, tid);
 	read_unlock(&journal->j_state_lock);
 	if (need_to_start)
-		jbd2_log_start_commit(journal, tid);
+		jbd2_log_start_commit(journal, tid, false);
 
 	lock_map_release(&handle->h_lockdep_map);
 	handle->h_buffer_credits = nblocks;
@@ -1368,8 +1368,6 @@ int jbd2_journal_stop(handle_t *handle)
 		}
 	}
 
-	if (handle->h_sync)
-		transaction->t_synchronous_commit = 1;
 	current->journal_info = NULL;
 	atomic_sub(handle->h_buffer_credits,
 		   &transaction->t_outstanding_credits);
@@ -1390,15 +1388,16 @@ int jbd2_journal_stop(handle_t *handle)
 
 		jbd_debug(2, "transaction too old, requesting commit for "
 					"handle %p\n", handle);
-		/* This is non-blocking */
-		jbd2_log_start_commit(journal, transaction->t_tid);
-
 		/*
 		 * Special case: JBD2_SYNC synchronous updates require us
 		 * to wait for the commit to complete.
 		 */
 		if (handle->h_sync && !(current->flags & PF_MEMALLOC))
 			wait_for_commit = 1;
+
+		/* This is non-blocking */
+		jbd2_log_start_commit(journal, transaction->t_tid,
+				      wait_for_commit);
 	}
 
 	/*
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 1fbb0e2..d493f32 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -1659,7 +1659,7 @@ static int ocfs2_try_to_free_truncate_log(struct ocfs2_super *osb,
 		goto out;
 	}
 
-	if (jbd2_journal_start_commit(osb->journal->j_journal, &target)) {
+	if (jbd2_journal_start_commit(osb->journal->j_journal, &target, true)) {
 		jbd2_log_wait_commit(osb->journal->j_journal, target);
 		ret = 1;
 	}
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 38f986d..45d9f82 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -414,7 +414,7 @@ static int ocfs2_sync_fs(struct super_block *sb, int wait)
 	}
 
 	if (jbd2_journal_start_commit(OCFS2_SB(sb)->journal->j_journal,
-				      &target)) {
+				      &target, wait)) {
 		if (wait)
 			jbd2_log_wait_commit(OCFS2_SB(sb)->journal->j_journal,
 					     target);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 27e79c2..46aaf45 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -631,11 +631,6 @@ struct transaction_s
 	 */
 	atomic_t		t_handle_count;
 
-	/*
-	 * This transaction is being forced and some process is
-	 * waiting for it to finish.
-	 */
-	unsigned int t_synchronous_commit:1;
 	unsigned int t_flushed_data_blocks:1;
 
 	/*
@@ -900,6 +895,13 @@ struct journal_s
 	tid_t			j_commit_request;
 
 	/*
+	 * Sequence number of the most recent transaction someone is waiting
+	 * for to commit.
+	 * [j_state_lock]
+	 */
+	tid_t			j_commit_waited;
+
+	/*
 	 * Journal uuid: identifies the object (filesystem, LVM volume etc)
 	 * backed by this journal.  This will eventually be replaced by an array
 	 * of uuids, allowing us to index multiple devices within a single
@@ -1200,9 +1202,9 @@ extern void	jbd2_journal_switch_revoke_table(journal_t *journal);
  */
 
 int __jbd2_log_space_left(journal_t *); /* Called with journal locked */
-int jbd2_log_start_commit(journal_t *journal, tid_t tid);
-int __jbd2_log_start_commit(journal_t *journal, tid_t tid);
-int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
+int jbd2_log_start_commit(journal_t *journal, tid_t tid, bool will_wait);
+int __jbd2_log_start_commit(journal_t *journal, tid_t tid, bool will_wait);
+int jbd2_journal_start_commit(journal_t *journal, tid_t *tid, bool will_wait);
 int jbd2_journal_force_commit_nested(journal_t *journal);
 int jbd2_log_wait_commit(journal_t *journal, tid_t tid);
 int jbd2_log_do_checkpoint(journal_t *journal);
-- 
1.7.1


[-- Attachment #3: 0002-jbd-Refine-commit-writeout-logic.patch --]
[-- Type: text/x-patch, Size: 9843 bytes --]

>From 4334f23305e74c6b050c701d97803790dc04cbe0 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Mon, 21 Feb 2011 17:25:37 +0100
Subject: [PATCH 2/2] jbd: Refine commit writeout logic

Currently we write out all journal buffers in WRITE_SYNC mode. This improves
performance for fsync heavy workloads but hinders performance when writes
are mostly asynchronous. So add possibility for callers starting a transaction
commit to specify whether they are going to wait for the commit and submit
journal writes in WRITE_SYNC mode only in that case.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext3/fsync.c      |    2 +-
 fs/ext3/super.c      |    2 +-
 fs/jbd/checkpoint.c  |    2 +-
 fs/jbd/commit.c      |    4 ++--
 fs/jbd/journal.c     |   19 ++++++++++---------
 fs/jbd/transaction.c |    9 ++++-----
 include/linux/jbd.h  |   21 ++++++++++++---------
 7 files changed, 31 insertions(+), 28 deletions(-)

diff --git a/fs/ext3/fsync.c b/fs/ext3/fsync.c
index 09b13bb..5396dd6 100644
--- a/fs/ext3/fsync.c
+++ b/fs/ext3/fsync.c
@@ -81,7 +81,7 @@ int ext3_sync_file(struct file *file, int datasync)
 	if (test_opt(inode->i_sb, BARRIER) &&
 	    !journal_trans_will_send_data_barrier(journal, commit_tid))
 		needs_barrier = 1;
-	log_start_commit(journal, commit_tid);
+	log_start_commit(journal, commit_tid, true);
 	ret = log_wait_commit(journal, commit_tid);
 
 	/*
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 85c8cc8..58a4424 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -2497,7 +2497,7 @@ static int ext3_sync_fs(struct super_block *sb, int wait)
 {
 	tid_t target;
 
-	if (journal_start_commit(EXT3_SB(sb)->s_journal, &target)) {
+	if (journal_start_commit(EXT3_SB(sb)->s_journal, &target, true)) {
 		if (wait)
 			log_wait_commit(EXT3_SB(sb)->s_journal, target);
 	}
diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c
index e4b87bc..26ca2c3 100644
--- a/fs/jbd/checkpoint.c
+++ b/fs/jbd/checkpoint.c
@@ -297,7 +297,7 @@ static int __process_buffer(journal_t *journal, struct journal_head *jh,
 
 		spin_unlock(&journal->j_list_lock);
 		jbd_unlock_bh_state(bh);
-		log_start_commit(journal, tid);
+		log_start_commit(journal, tid, true);
 		log_wait_commit(journal, tid);
 		ret = 1;
 	} else if (!buffer_dirty(bh)) {
diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 34a4861..ac188cb 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -294,7 +294,7 @@ void journal_commit_transaction(journal_t *journal)
 	int first_tag = 0;
 	int tag_flag;
 	int i;
-	int write_op = WRITE_SYNC;
+	int write_op = WRITE;
 
 	/*
 	 * First job: lock down the current transaction and wait for
@@ -332,7 +332,7 @@ void journal_commit_transaction(journal_t *journal)
 	 * we unplug the device. We don't do explicit unplugging in here,
 	 * instead we rely on sync_buffer() doing the unplug for us.
 	 */
-	if (commit_transaction->t_synchronous_commit)
+	if (tid_geq(journal->j_commit_waited, commit_transaction->t_tid))
 		write_op = WRITE_SYNC_PLUG;
 	spin_lock(&commit_transaction->t_handle_lock);
 	while (commit_transaction->t_updates) {
diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index da1b5e4..5743b4c 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -434,7 +434,7 @@ int __log_space_left(journal_t *journal)
 /*
  * Called under j_state_lock.  Returns true if a transaction commit was started.
  */
-int __log_start_commit(journal_t *journal, tid_t target)
+int __log_start_commit(journal_t *journal, tid_t target, bool will_wait)
 {
 	/*
 	 * Are we already doing a recent enough commit?
@@ -444,7 +444,8 @@ int __log_start_commit(journal_t *journal, tid_t target)
 		 * We want a new commit: OK, mark the request and wakeup the
 		 * commit thread.  We do _not_ do the commit ourselves.
 		 */
-
+		if (will_wait && !tid_geq(journal->j_commit_waited, target))
+			journal->j_commit_waited = target;
 		journal->j_commit_request = target;
 		jbd_debug(1, "JBD: requesting commit %d/%d\n",
 			  journal->j_commit_request,
@@ -455,12 +456,12 @@ int __log_start_commit(journal_t *journal, tid_t target)
 	return 0;
 }
 
-int log_start_commit(journal_t *journal, tid_t tid)
+int log_start_commit(journal_t *journal, tid_t tid, bool will_wait)
 {
 	int ret;
 
 	spin_lock(&journal->j_state_lock);
-	ret = __log_start_commit(journal, tid);
+	ret = __log_start_commit(journal, tid, will_wait);
 	spin_unlock(&journal->j_state_lock);
 	return ret;
 }
@@ -483,7 +484,7 @@ int journal_force_commit_nested(journal_t *journal)
 	spin_lock(&journal->j_state_lock);
 	if (journal->j_running_transaction && !current->journal_info) {
 		transaction = journal->j_running_transaction;
-		__log_start_commit(journal, transaction->t_tid);
+		__log_start_commit(journal, transaction->t_tid, true);
 	} else if (journal->j_committing_transaction)
 		transaction = journal->j_committing_transaction;
 
@@ -503,7 +504,7 @@ int journal_force_commit_nested(journal_t *journal)
  * if a transaction is going to be committed (or is currently already
  * committing), and fills its tid in at *ptid
  */
-int journal_start_commit(journal_t *journal, tid_t *ptid)
+int journal_start_commit(journal_t *journal, tid_t *ptid, bool will_wait)
 {
 	int ret = 0;
 
@@ -511,7 +512,7 @@ int journal_start_commit(journal_t *journal, tid_t *ptid)
 	if (journal->j_running_transaction) {
 		tid_t tid = journal->j_running_transaction->t_tid;
 
-		__log_start_commit(journal, tid);
+		__log_start_commit(journal, tid, will_wait);
 		/* There's a running transaction and we've just made sure
 		 * it's commit has been scheduled. */
 		if (ptid)
@@ -1439,7 +1440,7 @@ int journal_flush(journal_t *journal)
 	/* Force everything buffered to the log... */
 	if (journal->j_running_transaction) {
 		transaction = journal->j_running_transaction;
-		__log_start_commit(journal, transaction->t_tid);
+		__log_start_commit(journal, transaction->t_tid, true);
 	} else if (journal->j_committing_transaction)
 		transaction = journal->j_committing_transaction;
 
@@ -1573,7 +1574,7 @@ static void __journal_abort_hard(journal_t *journal)
 	journal->j_flags |= JFS_ABORT;
 	transaction = journal->j_running_transaction;
 	if (transaction)
-		__log_start_commit(journal, transaction->t_tid);
+		__log_start_commit(journal, transaction->t_tid, false);
 	spin_unlock(&journal->j_state_lock);
 }
 
diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index 5b2e4c3..a12c0a3 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -178,7 +178,7 @@ repeat_locked:
 		spin_unlock(&transaction->t_handle_lock);
 		prepare_to_wait(&journal->j_wait_transaction_locked, &wait,
 				TASK_UNINTERRUPTIBLE);
-		__log_start_commit(journal, transaction->t_tid);
+		__log_start_commit(journal, transaction->t_tid, false);
 		spin_unlock(&journal->j_state_lock);
 		schedule();
 		finish_wait(&journal->j_wait_transaction_locked, &wait);
@@ -411,7 +411,7 @@ int journal_restart(handle_t *handle, int nblocks)
 	spin_unlock(&transaction->t_handle_lock);
 
 	jbd_debug(2, "restarting handle %p\n", handle);
-	__log_start_commit(journal, transaction->t_tid);
+	__log_start_commit(journal, transaction->t_tid, false);
 	spin_unlock(&journal->j_state_lock);
 
 	lock_map_release(&handle->h_lockdep_map);
@@ -1431,8 +1431,6 @@ int journal_stop(handle_t *handle)
 		}
 	}
 
-	if (handle->h_sync)
-		transaction->t_synchronous_commit = 1;
 	current->journal_info = NULL;
 	spin_lock(&journal->j_state_lock);
 	spin_lock(&transaction->t_handle_lock);
@@ -1463,7 +1461,8 @@ int journal_stop(handle_t *handle)
 		jbd_debug(2, "transaction too old, requesting commit for "
 					"handle %p\n", handle);
 		/* This is non-blocking */
-		__log_start_commit(journal, transaction->t_tid);
+		__log_start_commit(journal, transaction->t_tid,
+			handle->h_sync && !(current->flags & PF_MEMALLOC));
 		spin_unlock(&journal->j_state_lock);
 
 		/*
diff --git a/include/linux/jbd.h b/include/linux/jbd.h
index e069650..c38f73e 100644
--- a/include/linux/jbd.h
+++ b/include/linux/jbd.h
@@ -541,12 +541,6 @@ struct transaction_s
 	 * How many handles used this transaction? [t_handle_lock]
 	 */
 	int t_handle_count;
-
-	/*
-	 * This transaction is being forced and some process is
-	 * waiting for it to finish.
-	 */
-	unsigned int t_synchronous_commit:1;
 };
 
 /**
@@ -594,6 +588,8 @@ struct transaction_s
  *  transaction
  * @j_commit_request: Sequence number of the most recent transaction wanting
  *     commit
+ * @j_commit_waited: Sequence number of the most recent transaction someone
+ *     is waiting for to commit.
  * @j_uuid: Uuid of client object.
  * @j_task: Pointer to the current commit thread for this journal
  * @j_max_transaction_buffers:  Maximum number of metadata buffers to allow in a
@@ -762,6 +758,13 @@ struct journal_s
 	tid_t			j_commit_request;
 
 	/*
+	 * Sequence number of the most recent transaction someone is waiting
+	 * for to commit.
+	 * [j_state_lock]
+	 */
+	tid_t                   j_commit_waited;
+
+	/*
 	 * Journal uuid: identifies the object (filesystem, LVM volume etc)
 	 * backed by this journal.  This will eventually be replaced by an array
 	 * of uuids, allowing us to index multiple devices within a single
@@ -985,9 +988,9 @@ extern void	journal_switch_revoke_table(journal_t *journal);
  */
 
 int __log_space_left(journal_t *); /* Called with journal locked */
-int log_start_commit(journal_t *journal, tid_t tid);
-int __log_start_commit(journal_t *journal, tid_t tid);
-int journal_start_commit(journal_t *journal, tid_t *tid);
+int log_start_commit(journal_t *journal, tid_t tid, bool will_wait);
+int __log_start_commit(journal_t *journal, tid_t tid, bool will_wait);
+int journal_start_commit(journal_t *journal, tid_t *tid, bool will_wait);
 int journal_force_commit_nested(journal_t *journal);
 int log_wait_commit(journal_t *journal, tid_t tid);
 int log_do_checkpoint(journal_t *journal);
-- 
1.7.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-03-02 21:17                             ` Jan Kara
@ 2011-03-02 21:20                               ` Jeff Moyer
  2011-03-03  1:14                               ` Jeff Moyer
  1 sibling, 0 replies; 38+ messages in thread
From: Jeff Moyer @ 2011-03-02 21:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: Corrado Zoccolo, Alex,Shi, Li, Shaohua, Vivek Goyal, tytso,
	jaxboe, linux-kernel, Chen, Tim C

Jan Kara <jack@suse.cz> writes:

> On Wed 02-03-11 11:13:53, Jeff Moyer wrote:
>> Jan Kara <jack@suse.cz> writes:
>> > On Tue 01-03-11 14:56:43, Jeff Moyer wrote:
>> >> Corrado Zoccolo <czoccolo@gmail.com> writes:
>> >> 
>> >> > On Thu, Feb 24, 2011 at 1:13 PM, Jan Kara <jack@suse.cz> wrote:
>> >> >> On Wed 23-02-11 16:24:47, Alex,Shi wrote:
>> >> >>> Though these patches can not totally recovered the problem, but they are
>> >> >>> quite helpful with ccache enabled situation. It increase 10% performance
>> >> >>> on 38-rc1 kernel.
>> >> >>  OK and what was the original performance drop with WRITE_SYNC change?
>> >> >>
>> >> >>> I have tried to enabled they to latest rc6 kernel but failed. the vmstat output is here:
>> >> >>> with patches:
>> >> >>  I'm attaching patches rebased on top of latest Linus's tree.
>> >> >> Corrado, could you possibly run your fsync-heavy tests so that we see
>> >> >> whether there isn't negative impact of my patches on your fsync-heavy
>> >> >> workload? Thanks.
>> >> > The workload was actually Jeff's, and the stalls that my change tried
>> >> > to mitigate showed up on his enterprise class storage. Adding him so
>> >> > he can test it.
>> >> 
>> >> Sorry for the late reply.  You can use either fs_mark or iozone to
>> >> generate an fsync-heavy workload.  The test I did was to mix this with a
>> >> sequential reader.  If you can point me at patches, I should be able to
>> >> test this.
>> >   The latest version of patches is attached to:
>> > https://lkml.org/lkml/2011/2/24/125
>> 
>> Perhaps you should fix up the merge conflicts, first?  ;-)
>> 
>> +<<<<<<< HEAD
>>         tid = transaction->t_tid;
>>         need_to_start = !tid_geq(journal->j_commit_request, tid);
>> +=======
>> +       __jbd2_log_start_commit(journal, transaction->t_tid, false);
>> +>>>>>>> jbd2: Refine commit writeout logic
>   Doh, how embarrassing ;). Attached is a new version which compiles and
> seems to run OK.
>
> 								Honza

Thanks, Jan.  I should have results for you tomorrow.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-03-02 21:17                             ` Jan Kara
  2011-03-02 21:20                               ` Jeff Moyer
@ 2011-03-03  1:14                               ` Jeff Moyer
  2011-03-04 15:32                                 ` Jan Kara
  1 sibling, 1 reply; 38+ messages in thread
From: Jeff Moyer @ 2011-03-03  1:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Corrado Zoccolo, Alex,Shi, Li, Shaohua, Vivek Goyal, tytso,
	jaxboe, linux-kernel, Chen, Tim C

Hi, Jan,

So, the results are in.  The test workload is an fs_mark process writing
out 64k files and fsyncing each file after it's written.  Concurrently
with this is a fio job running a buffered sequential reader (bsr).  Each
data point is the average of 10 runs, after throwing out the first run.
File system mount options are left at their defaults, which means that
barriers are on.  The storage is an HP EVA, connected to the host via a
single 4Gb FC path.

ext3 looks marginally better with your patches.  We get better files/sec
AND better throughput from the buffered reader.  For ext4, the results
are less encouraging.  We see a drop in files/sec, and an increase in
throughput for the sequential reader.  So, the fsync-ing is being
starved a bit more than before.

        ||       ext3        ||       ext4        ||
        || fs_mark | fio bsr || fs_mark | fio bsr ||
--------++---------+---------++---------+---------||
vanilla || 517.535 | 178187  || 408.547 | 277130  ||
patched || 540.34  | 182312  || 342.813 | 294655  ||
====================================================
%diff   ||  +4.4% |    +2.3% ||  -16.1% |  +6.3%  ||

I'm tired right now, but I'll have a look at your ext4 patch in the
morning and see if I can come up with a good reason for this drop.

Let me know if you have any questions.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-03-03  1:14                               ` Jeff Moyer
@ 2011-03-04 15:32                                 ` Jan Kara
  2011-03-04 15:40                                   ` Jeff Moyer
  2011-03-04 15:50                                   ` Jeff Moyer
  0 siblings, 2 replies; 38+ messages in thread
From: Jan Kara @ 2011-03-04 15:32 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Jan Kara, Corrado Zoccolo, Alex,Shi, Li, Shaohua, Vivek Goyal,
	tytso, jaxboe, linux-kernel, Chen, Tim C

  Hi Jeff,
On Wed 02-03-11 20:14:13, Jeff Moyer wrote:
> So, the results are in.  The test workload is an fs_mark process writing
> out 64k files and fsyncing each file after it's written.  Concurrently
> with this is a fio job running a buffered sequential reader (bsr).  Each
> data point is the average of 10 runs, after throwing out the first run.
> File system mount options are left at their defaults, which means that
> barriers are on.  The storage is an HP EVA, connected to the host via a
> single 4Gb FC path.
  Thanks a lot for testing! BTW: fs_mark runs in a single thread or do you
use more threads?

> ext3 looks marginally better with your patches.  We get better files/sec
> AND better throughput from the buffered reader.  For ext4, the results
> are less encouraging.  We see a drop in files/sec, and an increase in
> throughput for the sequential reader.  So, the fsync-ing is being
> starved a bit more than before.
> 
>         ||       ext3        ||       ext4        ||
>         || fs_mark | fio bsr || fs_mark | fio bsr ||
> --------++---------+---------++---------+---------||
> vanilla || 517.535 | 178187  || 408.547 | 277130  ||
> patched || 540.34  | 182312  || 342.813 | 294655  ||
> ====================================================
> %diff   ||  +4.4% |    +2.3% ||  -16.1% |  +6.3%  ||
Interesting. I'm surprised ext3 and ext4 results differ this much. I'm more
than happy with ext3 results since I just wanted to verify that fsync load
doesn't degrade too much with the improved logic preferring non-fsync load
more than we used to.

I'm not so happy with ext4 results. The difference between ext3 and ext4
might be that amount of data written by kjournald in ext3 is considerably
larger if it ends up pushing out data (because of data=ordered mode) as
well. With ext4, all data are written by filemap_fdatawrite() from fsync
because of delayed allocation. And thus maybe for ext4 WRITE_SYNC_PLUG
is hurting us with your fast storage and small amount of written data? With
WRITE_SYNC, data would be already on it's way to storage before we get to
wait for them...

Or it could be that we really send more data in WRITE mode rather than in
WRITE_SYNC mode with the patch on ext4 (that should be verifiable with
blktrace). But I wonder how that could happen...

							Bye
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-03-04 15:32                                 ` Jan Kara
@ 2011-03-04 15:40                                   ` Jeff Moyer
  2011-03-04 15:50                                   ` Jeff Moyer
  1 sibling, 0 replies; 38+ messages in thread
From: Jeff Moyer @ 2011-03-04 15:40 UTC (permalink / raw)
  To: Jan Kara
  Cc: Corrado Zoccolo, Alex,Shi, Li, Shaohua, Vivek Goyal, tytso,
	jaxboe, linux-kernel, Chen, Tim C

Jan Kara <jack@suse.cz> writes:

>   Hi Jeff,
> On Wed 02-03-11 20:14:13, Jeff Moyer wrote:
>> So, the results are in.  The test workload is an fs_mark process writing
>> out 64k files and fsyncing each file after it's written.  Concurrently
>> with this is a fio job running a buffered sequential reader (bsr).  Each
>> data point is the average of 10 runs, after throwing out the first run.
>> File system mount options are left at their defaults, which means that
>> barriers are on.  The storage is an HP EVA, connected to the host via a
>> single 4Gb FC path.
>   Thanks a lot for testing! BTW: fs_mark runs in a single thread or do you
> use more threads?

I use a single fs_mark thread.  FWIW, I also tested just fs_mark, and
those numbers look good.

>> ext3 looks marginally better with your patches.  We get better files/sec
>> AND better throughput from the buffered reader.  For ext4, the results
>> are less encouraging.  We see a drop in files/sec, and an increase in
>> throughput for the sequential reader.  So, the fsync-ing is being
>> starved a bit more than before.
>> 
>>         ||       ext3        ||       ext4        ||
>>         || fs_mark | fio bsr || fs_mark | fio bsr ||
>> --------++---------+---------++---------+---------||
>> vanilla || 517.535 | 178187  || 408.547 | 277130  ||
>> patched || 540.34  | 182312  || 342.813 | 294655  ||
>> ====================================================
>> %diff   ||  +4.4% |    +2.3% ||  -16.1% |  +6.3%  ||
> Interesting. I'm surprised ext3 and ext4 results differ this much. I'm more
> than happy with ext3 results since I just wanted to verify that fsync load
> doesn't degrade too much with the improved logic preferring non-fsync load
> more than we used to.
>
> I'm not so happy with ext4 results. The difference between ext3 and ext4
> might be that amount of data written by kjournald in ext3 is considerably
> larger if it ends up pushing out data (because of data=ordered mode) as
> well. With ext4, all data are written by filemap_fdatawrite() from fsync
> because of delayed allocation. And thus maybe for ext4 WRITE_SYNC_PLUG
> is hurting us with your fast storage and small amount of written data? With
> WRITE_SYNC, data would be already on it's way to storage before we get to
> wait for them...
>
> Or it could be that we really send more data in WRITE mode rather than in
> WRITE_SYNC mode with the patch on ext4 (that should be verifiable with
> blktrace). But I wonder how that could happen...

Yeah, I've collected blktrace data and I'll get to evaluating that.
Sorry, I ran out of time yesterday.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-03-04 15:32                                 ` Jan Kara
  2011-03-04 15:40                                   ` Jeff Moyer
@ 2011-03-04 15:50                                   ` Jeff Moyer
  2011-03-04 18:27                                     ` Jeff Moyer
  1 sibling, 1 reply; 38+ messages in thread
From: Jeff Moyer @ 2011-03-04 15:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: Corrado Zoccolo, Alex,Shi, Li, Shaohua, Vivek Goyal, tytso,
	jaxboe, linux-kernel, Chen, Tim C

Jan Kara <jack@suse.cz> writes:

> I'm not so happy with ext4 results. The difference between ext3 and ext4
> might be that amount of data written by kjournald in ext3 is considerably
> larger if it ends up pushing out data (because of data=ordered mode) as
> well. With ext4, all data are written by filemap_fdatawrite() from fsync
> because of delayed allocation. And thus maybe for ext4 WRITE_SYNC_PLUG
> is hurting us with your fast storage and small amount of written data? With
> WRITE_SYNC, data would be already on it's way to storage before we get to
> wait for them...

> Or it could be that we really send more data in WRITE mode rather than in
> WRITE_SYNC mode with the patch on ext4 (that should be verifiable with
> blktrace). But I wonder how that could happen...

It looks like this is the case, the I/O isn't coming down as
synchronous.  I'm seeing a lot of writes, very few write sync's, which
means that the write stream will be preempted by the incoming reads.

Time to audit that fsync path and make sure it's marked properly, I
guess.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-03-04 15:50                                   ` Jeff Moyer
@ 2011-03-04 18:27                                     ` Jeff Moyer
  2011-03-22  7:38                                       ` Alex,Shi
  0 siblings, 1 reply; 38+ messages in thread
From: Jeff Moyer @ 2011-03-04 18:27 UTC (permalink / raw)
  To: Jan Kara
  Cc: Corrado Zoccolo, Alex,Shi, Li, Shaohua, Vivek Goyal, tytso,
	jaxboe, linux-kernel, Chen, Tim C

Jeff Moyer <jmoyer@redhat.com> writes:

> Jan Kara <jack@suse.cz> writes:
>
>> I'm not so happy with ext4 results. The difference between ext3 and ext4
>> might be that amount of data written by kjournald in ext3 is considerably
>> larger if it ends up pushing out data (because of data=ordered mode) as
>> well. With ext4, all data are written by filemap_fdatawrite() from fsync
>> because of delayed allocation. And thus maybe for ext4 WRITE_SYNC_PLUG
>> is hurting us with your fast storage and small amount of written data? With
>> WRITE_SYNC, data would be already on it's way to storage before we get to
>> wait for them...
>
>> Or it could be that we really send more data in WRITE mode rather than in
>> WRITE_SYNC mode with the patch on ext4 (that should be verifiable with
>> blktrace). But I wonder how that could happen...
>
> It looks like this is the case, the I/O isn't coming down as
> synchronous.  I'm seeing a lot of writes, very few write sync's, which
> means that the write stream will be preempted by the incoming reads.
>
> Time to audit that fsync path and make sure it's marked properly, I
> guess.

OK, I spoke too soon.  Here's the blktrace summary information (I re-ran
the tests using 3 samples, the blktrace is from the last run of the
three in each case):

Vanilla
-------
fs_mark: 307.288 files/sec
fio: 286509 KB/s

Total (sde):
 Reads Queued:     341,558,   84,994MiB  Writes Queued:       1,561K,    6,244MiB
 Read Dispatches:  341,493,   84,994MiB  Write Dispatches:  648,046,    6,244MiB
 Reads Requeued:         0               Writes Requeued:        27
 Reads Completed:  341,491,   84,994MiB  Writes Completed:  648,021,    6,244MiB
 Read Merges:           65,    2,780KiB  Write Merges:      913,076,    3,652MiB
 IO unplugs:       578,102               Timer unplugs:           0

Throughput (R/W): 282,797KiB/s / 20,776KiB/s
Events (sde): 16,724,303 entries

Patched
-------
fs_mark: 278.587 files/sec
fio: 298007 KB/s

Total (sde):
 Reads Queued:     345,407,   86,834MiB  Writes Queued:       1,566K,    6,264MiB
 Read Dispatches:  345,391,   86,834MiB  Write Dispatches:  327,404,    6,264MiB
 Reads Requeued:         0               Writes Requeued:        33
 Reads Completed:  345,391,   86,834MiB  Writes Completed:  327,371,    6,264MiB
 Read Merges:           16,    1,576KiB  Write Merges:        1,238K,    4,954MiB
 IO unplugs:       580,308               Timer unplugs:           0

Throughput (R/W): 288,771KiB/s / 20,832KiB/s
Events (sde): 14,030,610 entries

So, it appears we flush out writes much more aggressively without the
patch in place.  I'm not sure why the write bandwidth looks to be higher
in the patched case... odd.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-03-04 18:27                                     ` Jeff Moyer
@ 2011-03-22  7:38                                       ` Alex,Shi
  2011-03-22 16:14                                         ` Jan Kara
  0 siblings, 1 reply; 38+ messages in thread
From: Alex,Shi @ 2011-03-22  7:38 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Jan Kara, Corrado Zoccolo, Li, Shaohua, Vivek Goyal, tytso,
	jaxboe, linux-kernel, Chen, Tim C


On Sat, 2011-03-05 at 02:27 +0800, Jeff Moyer wrote:
> Jeff Moyer <jmoyer@redhat.com> writes:
> 
> > Jan Kara <jack@suse.cz> writes:
> >
> >> I'm not so happy with ext4 results. The difference between ext3 and ext4
> >> might be that amount of data written by kjournald in ext3 is considerably
> >> larger if it ends up pushing out data (because of data=ordered mode) as
> >> well. With ext4, all data are written by filemap_fdatawrite() from fsync
> >> because of delayed allocation. And thus maybe for ext4 WRITE_SYNC_PLUG
> >> is hurting us with your fast storage and small amount of written data? With
> >> WRITE_SYNC, data would be already on it's way to storage before we get to
> >> wait for them...
> >
> >> Or it could be that we really send more data in WRITE mode rather than in
> >> WRITE_SYNC mode with the patch on ext4 (that should be verifiable with
> >> blktrace). But I wonder how that could happen...
> >
> > It looks like this is the case, the I/O isn't coming down as
> > synchronous.  I'm seeing a lot of writes, very few write sync's, which
> > means that the write stream will be preempted by the incoming reads.
> >
> > Time to audit that fsync path and make sure it's marked properly, I
> > guess.
> 
> OK, I spoke too soon.  Here's the blktrace summary information (I re-ran
> the tests using 3 samples, the blktrace is from the last run of the
> three in each case):
> 
> Vanilla
> -------
> fs_mark: 307.288 files/sec
> fio: 286509 KB/s
> 
> Total (sde):
>  Reads Queued:     341,558,   84,994MiB  Writes Queued:       1,561K,    6,244MiB
>  Read Dispatches:  341,493,   84,994MiB  Write Dispatches:  648,046,    6,244MiB
>  Reads Requeued:         0               Writes Requeued:        27
>  Reads Completed:  341,491,   84,994MiB  Writes Completed:  648,021,    6,244MiB
>  Read Merges:           65,    2,780KiB  Write Merges:      913,076,    3,652MiB
>  IO unplugs:       578,102               Timer unplugs:           0
> 
> Throughput (R/W): 282,797KiB/s / 20,776KiB/s
> Events (sde): 16,724,303 entries
> 
> Patched
> -------
> fs_mark: 278.587 files/sec
> fio: 298007 KB/s
> 
> Total (sde):
>  Reads Queued:     345,407,   86,834MiB  Writes Queued:       1,566K,    6,264MiB
>  Read Dispatches:  345,391,   86,834MiB  Write Dispatches:  327,404,    6,264MiB
>  Reads Requeued:         0               Writes Requeued:        33
>  Reads Completed:  345,391,   86,834MiB  Writes Completed:  327,371,    6,264MiB
>  Read Merges:           16,    1,576KiB  Write Merges:        1,238K,    4,954MiB
>  IO unplugs:       580,308               Timer unplugs:           0
> 
> Throughput (R/W): 288,771KiB/s / 20,832KiB/s
> Events (sde): 14,030,610 entries
> 
> So, it appears we flush out writes much more aggressively without the
> patch in place.  I'm not sure why the write bandwidth looks to be higher
> in the patched case... odd.
> 

Jan:
Do you have new idea on this? 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-03-22  7:38                                       ` Alex,Shi
@ 2011-03-22 16:14                                         ` Jan Kara
  2011-03-22 17:46                                           ` Jeff Moyer
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2011-03-22 16:14 UTC (permalink / raw)
  To: Alex,Shi
  Cc: Jeff Moyer, Jan Kara, Corrado Zoccolo, Li, Shaohua, Vivek Goyal,
	tytso, jaxboe, linux-kernel, Chen, Tim C

[-- Attachment #1: Type: text/plain, Size: 4560 bytes --]

On Tue 22-03-11 15:38:19, Alex,Shi wrote:
> On Sat, 2011-03-05 at 02:27 +0800, Jeff Moyer wrote:
> > Jeff Moyer <jmoyer@redhat.com> writes:
> > 
> > > Jan Kara <jack@suse.cz> writes:
> > >
> > >> I'm not so happy with ext4 results. The difference between ext3 and ext4
> > >> might be that amount of data written by kjournald in ext3 is considerably
> > >> larger if it ends up pushing out data (because of data=ordered mode) as
> > >> well. With ext4, all data are written by filemap_fdatawrite() from fsync
> > >> because of delayed allocation. And thus maybe for ext4 WRITE_SYNC_PLUG
> > >> is hurting us with your fast storage and small amount of written data? With
> > >> WRITE_SYNC, data would be already on it's way to storage before we get to
> > >> wait for them...
> > >
> > >> Or it could be that we really send more data in WRITE mode rather than in
> > >> WRITE_SYNC mode with the patch on ext4 (that should be verifiable with
> > >> blktrace). But I wonder how that could happen...
> > >
> > > It looks like this is the case, the I/O isn't coming down as
> > > synchronous.  I'm seeing a lot of writes, very few write sync's, which
> > > means that the write stream will be preempted by the incoming reads.
> > >
> > > Time to audit that fsync path and make sure it's marked properly, I
> > > guess.
> > 
> > OK, I spoke too soon.  Here's the blktrace summary information (I re-ran
> > the tests using 3 samples, the blktrace is from the last run of the
> > three in each case):
> > 
> > Vanilla
> > -------
> > fs_mark: 307.288 files/sec
> > fio: 286509 KB/s
> > 
> > Total (sde):
> >  Reads Queued:     341,558,   84,994MiB  Writes Queued:       1,561K,    6,244MiB
> >  Read Dispatches:  341,493,   84,994MiB  Write Dispatches:  648,046,    6,244MiB
> >  Reads Requeued:         0               Writes Requeued:        27
> >  Reads Completed:  341,491,   84,994MiB  Writes Completed:  648,021,    6,244MiB
> >  Read Merges:           65,    2,780KiB  Write Merges:      913,076,    3,652MiB
> >  IO unplugs:       578,102               Timer unplugs:           0
> > 
> > Throughput (R/W): 282,797KiB/s / 20,776KiB/s
> > Events (sde): 16,724,303 entries
> > 
> > Patched
> > -------
> > fs_mark: 278.587 files/sec
> > fio: 298007 KB/s
> > 
> > Total (sde):
> >  Reads Queued:     345,407,   86,834MiB  Writes Queued:       1,566K,    6,264MiB
> >  Read Dispatches:  345,391,   86,834MiB  Write Dispatches:  327,404,    6,264MiB
> >  Reads Requeued:         0               Writes Requeued:        33
> >  Reads Completed:  345,391,   86,834MiB  Writes Completed:  327,371,    6,264MiB
> >  Read Merges:           16,    1,576KiB  Write Merges:        1,238K,    4,954MiB
> >  IO unplugs:       580,308               Timer unplugs:           0
> > 
> > Throughput (R/W): 288,771KiB/s / 20,832KiB/s
> > Events (sde): 14,030,610 entries
> > 
> > So, it appears we flush out writes much more aggressively without the
> > patch in place.  I'm not sure why the write bandwidth looks to be higher
> > in the patched case... odd.
> 
> Jan:
> Do you have new idea on this? 
  I was looking at the block traces for quite some time but I couldn't find
the reason why fs_mark is slower with my patch. Actually, looking at the
data now, I don't even understand how fs_mark can report lower files/sec
values.

Both block traces were taken for 300 seconds. From the above stats, we see
that on both kernels, we wrote 6.2 GB over that time. Looking at more
detailed stats I made, fs_mark processes wrote 4094 MB on vanilla kernel
and 4107 MB on the patched kernel. Given that they just sequentially create
and fsync 64 KB files, files/sec ratio should be about the same with both
kernels. So I'm really missing how fs_mark arrives at different files/sec
values or how with such different values it happens that the amount written
is actually the same. Anyone has any idea? 

Looking at how fs_mark works and at differences in trace files - could it
be that the difference is caused by a difference in how log files each
fs_mark thread is writing are flushed? Or possibly by IO caused unlinking
of created files somehow leaking to the time of the next measured fs_mark
run in one case and not the other one? Jeff, I suppose the log files of
fs_mark processes are on the same device as the test directory, aren't
they - that might explain the flusher thread doing IO? The patch below
should limit the interactions. If you have time to test fs_mark with this
patch applied - does it make any difference?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: fs_mark.c.diff --]
[-- Type: text/x-patch, Size: 462 bytes --]

--- a/fs_mark.c.orig	2011-03-22 17:04:52.194716633 +0100
+++ b/fs_mark.c	2011-03-22 17:06:11.910716645 +0100
@@ -1353,6 +1353,11 @@ int main(int argc, char **argv, char **e
 		print_iteration_stats(log_file_fp, &iteration_stats,
 				      files_written);
 		loops_done++;
+		/*
+		 * Flush dirty data to avoid interaction between unlink / log
+		 * file handling and the next iteration
+		 */
+		sync();
 
 	} while (do_fill_fs || (loop_count > loops_done));
 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-03-22 16:14                                         ` Jan Kara
@ 2011-03-22 17:46                                           ` Jeff Moyer
  2011-03-24  6:45                                             ` Alex,Shi
  2011-03-28 19:48                                             ` Jan Kara
  0 siblings, 2 replies; 38+ messages in thread
From: Jeff Moyer @ 2011-03-22 17:46 UTC (permalink / raw)
  To: Jan Kara
  Cc: Alex,Shi, Corrado Zoccolo, Li, Shaohua, Vivek Goyal, tytso,
	jaxboe, linux-kernel, Chen, Tim C

Jan Kara <jack@suse.cz> writes:

> Jeff, I suppose the log files of fs_mark processes are on the same
> device as the test directory, aren't they - that might explain the
> flusher thread doing IO? The patch below

No.  My test environments are sane.  The file system is on a completely
separate disk and it is unmounted and remounted in between each run.
That should take care of any lost flushes.

Did you receive my last couple of private emails commenting on the trace
data?

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-03-22 17:46                                           ` Jeff Moyer
@ 2011-03-24  6:45                                             ` Alex,Shi
  2011-03-28 19:48                                             ` Jan Kara
  1 sibling, 0 replies; 38+ messages in thread
From: Alex,Shi @ 2011-03-24  6:45 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Jan Kara, Corrado Zoccolo, Li, Shaohua, Vivek Goyal, tytso,
	jaxboe, linux-kernel, Chen, Tim C

On Wed, 2011-03-23 at 01:46 +0800, Jeff Moyer wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> > Jeff, I suppose the log files of fs_mark processes are on the same
> > device as the test directory, aren't they - that might explain the
> > flusher thread doing IO? The patch below
> 
> No.  My test environments are sane.  The file system is on a completely
> separate disk and it is unmounted and remounted in between each run.
> That should take care of any lost flushes.

Jeff, why not give a try with Jan's fs_mark patch? Maybe the result show
something change. :) 

> 
> Did you receive my last couple of private emails commenting on the trace
> data?
> 
> Cheers,
> Jeff



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [performance bug] kernel building regression on 64 LCPUs machine
  2011-03-22 17:46                                           ` Jeff Moyer
  2011-03-24  6:45                                             ` Alex,Shi
@ 2011-03-28 19:48                                             ` Jan Kara
  1 sibling, 0 replies; 38+ messages in thread
From: Jan Kara @ 2011-03-28 19:48 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Jan Kara, Alex,Shi, Corrado Zoccolo, Li, Shaohua, Vivek Goyal,
	tytso, jaxboe, linux-kernel, Chen, Tim C

On Tue 22-03-11 13:46:06, Jeff Moyer wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> > Jeff, I suppose the log files of fs_mark processes are on the same
> > device as the test directory, aren't they - that might explain the
> > flusher thread doing IO? The patch below
> 
> No.  My test environments are sane.  The file system is on a completely
> separate disk and it is unmounted and remounted in between each run.
> That should take care of any lost flushes.
  OK, I understand but let me clear out one thing. What fs_mark seems to be
doing is:
for given number of iterations { (-L option, default 1)
  fork each thread {
    start timer
    for i = 1 .. 10000 { (could be changed by option -n)
      write file i
    }
    stop timer
    for i = 1 .. 10000 {
      unlink file i
    }
    write statistics to log file
    exit
  }
  read all log files, write combined results to another log file
}

I see from blktrace data that indeed you are running more than one
iteration of the fs_mark process so the problem I was wondering about is
whether unlinking or writing of log files cannot interefere with the IO
happening while the timer is running.

Now log files are rather small so I don't really think they cause any
problem but they could be the data flusher thread writes - they are stored
in the current directory (unless you use -l option). So I my question was
aiming at where these files are stored - do you specify -l option and if
not, what is the current directory of fs_mark process in your setup?

I think unlinks could possibly cause problems and that's why I suggested we
add sync() to the parent process before it runs the next iteration...

> Did you receive my last couple of private emails commenting on the trace
> data?
  Yes, I did get them. I'll reply to them in a while.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2011-03-28 19:49 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-19  1:55 [performance bug] kernel building regression on 64 LCPUs machine Alex,Shi
2011-01-19  2:03 ` Shaohua Li
2011-01-19 12:56   ` Jan Kara
2011-01-20  7:52     ` Alex,Shi
2011-01-20 15:16   ` Vivek Goyal
2011-01-21  7:17     ` Shaohua Li
2011-01-26  8:15     ` Shaohua Li
2011-02-12  9:21       ` Alex,Shi
2011-02-12 18:25         ` Corrado Zoccolo
2011-02-14  2:25           ` Alex,Shi
2011-02-15  1:10             ` Shaohua Li
2011-02-21 16:49               ` Jan Kara
2011-02-23  8:24                 ` Alex,Shi
2011-02-24 12:13                   ` Jan Kara
2011-02-25  0:44                     ` Alex Shi
2011-02-26 14:45                     ` Corrado Zoccolo
2011-03-01 19:56                       ` Jeff Moyer
2011-03-02  9:42                         ` Jan Kara
2011-03-02 16:13                           ` Jeff Moyer
2011-03-02 21:17                             ` Jan Kara
2011-03-02 21:20                               ` Jeff Moyer
2011-03-03  1:14                               ` Jeff Moyer
2011-03-04 15:32                                 ` Jan Kara
2011-03-04 15:40                                   ` Jeff Moyer
2011-03-04 15:50                                   ` Jeff Moyer
2011-03-04 18:27                                     ` Jeff Moyer
2011-03-22  7:38                                       ` Alex,Shi
2011-03-22 16:14                                         ` Jan Kara
2011-03-22 17:46                                           ` Jeff Moyer
2011-03-24  6:45                                             ` Alex,Shi
2011-03-28 19:48                                             ` Jan Kara
2011-01-19 14:32 ` Ted Ts'o
2011-01-20  2:12   ` Shaohua Li
2011-01-21  7:23 ` Corrado Zoccolo
2011-01-21  7:47   ` Alex,Shi
2011-01-21  7:52     ` Alex,Shi
2011-01-21  8:13       ` Corrado Zoccolo
2011-01-21  8:20   ` Shaohua Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).