LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [patch 0/7] cpuset writeback throttling
@ 2008-10-30 19:23 David Rientjes
  2008-10-30 19:23 ` [patch 1/7] cpusets: add dirty map to struct address_space David Rientjes
                   ` (8 more replies)
  0 siblings, 9 replies; 46+ messages in thread
From: David Rientjes @ 2008-10-30 19:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Nick Piggin, Peter Zijlstra, Paul Menage,
	Derek Fults, linux-kernel

Andrew,

This is the revised cpuset writeback throttling patchset posted to LKML 
on Tuesday, October 27.

The comments from Peter Zijlstra have been addressed.  His concurrent 
page cache patchset is not currently in -mm, so we can still serialize 
updating a struct address_space's dirty_nodes on its tree_lock.  When his 
patchset is merged, the patch at the end of this message can be used to 
introduce the necessary synchronization.

This patchset applies nicely to 2.6.28-rc2-mm1 with the exception of the 
first patch due to the alloc_inode() refactoring to inode_init_always() in
e9110864c440736beb484c2c74dedc307168b14e from linux-next and additions to 
include/linux/cpuset.h from 
oom-print-triggering-tasks-cpuset-and-mems-allowed.patch (oops :).

Please consider this for inclusion in the -mm tree.

A simple way of testing this change is to create a large file that exceeds 
the amount of memory allocated to a specific cpuset.  Then, mmap and 
modify the large file (such as in the following program) while running a 
latency sensitive task in a disjoint cpuset.  Notice the writeout 
throttling that doesn't interfere with the latency sensitive task.

#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
	void *addr;
	unsigned long length;
	unsigned long i;
	int fd;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <filename> <length>\n",
			argv[0]);
		exit(1);
	}

	fd = open(argv[1], O_RDWR, 0644);
	if (fd < 0) {
		fprintf(stderr, "Cannot open file %s\n", argv[1]);
		exit(1);
	}

	length = strtoul(argv[2], NULL, 0);
	if (!length) {
		fprintf(stderr, "Invalid length %s\n", argv[2]);
		exit(1);
	}

	addr = mmap(0, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
		    0);
	if (addr == MAP_FAILED) {
		fprintf(stderr, "mmap() failed\n");
		exit(1);
	}

	for (;;) {
		for (i = 0; i < length; i++)
			(*(char *)(addr + i))++;
		msync(addr, length, MS_SYNC);
	}
	return 0;
}



The following patch can be applied once the struct address_space's 
tree_lock is removed to protect the attachment of mapping->dirty_nodes.
---
diff --git a/fs/inode.c b/fs/inode.c
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -223,6 +223,9 @@ void inode_init_once(struct inode *inode)
 	INIT_LIST_HEAD(&inode->inotify_watches);
 	mutex_init(&inode->inotify_mutex);
 #endif
+#if MAX_NUMNODES > BITS_PER_LONG
+	spin_lock_init(&inode->i_data.dirty_nodes_lock);
+#endif
 }
 
 EXPORT_SYMBOL(inode_init_once);
diff --git a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -554,6 +554,7 @@ struct address_space {
 	nodemask_t		dirty_nodes;	/* nodes with dirty pages */
 #else
 	nodemask_t		*dirty_nodes;	/* pointer to mask, if dirty */
+	spinlock_t		dirty_nodes_lock; /* protects the above */
 #endif
 #endif
 } __attribute__((aligned(sizeof(long))));
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2413,25 +2413,27 @@ EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
 #if MAX_NUMNODES > BITS_PER_LONG
 /*
  * Special functions for NUMA systems with a large number of nodes.  The
- * nodemask is pointed to from the address_space structure.  The attachment of
- * the dirty_nodes nodemask is protected by the tree_lock.  The nodemask is
- * freed only when the inode is cleared (and therefore unused, thus no locking
- * is necessary).
+ * nodemask is pointed to from the address_space structure.
  */
 void cpuset_update_dirty_nodes(struct address_space *mapping,
 			       struct page *page)
 {
-	nodemask_t *nodes = mapping->dirty_nodes;
+	nodemask_t *nodes;
 	int node = page_to_nid(page);
 
+	spin_lock_irq(&mapping->dirty_nodes_lock);
+	nodes = mapping->dirty_nodes;
 	if (!nodes) {
 		nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);
-		if (!nodes)
+		if (!nodes) {
+			spin_unlock_irq(&mapping->dirty_nodes_lock);
 			return;
+		}
 
 		*nodes = NODE_MASK_NONE;
 		mapping->dirty_nodes = nodes;
 	}
+	spin_unlock_irq(&mapping->dirty_nodes_lock);
 	node_set(node, *nodes);
 }
 
@@ -2446,8 +2448,8 @@ void cpuset_clear_dirty_nodes(struct address_space *mapping)
 }
 
 /*
- * Called without tree_lock.  The nodemask is only freed when the inode is
- * cleared and therefore this is safe.
+ * The nodemask is only freed when the inode is cleared and therefore this
+ * requires no locking.
  */
 int cpuset_intersects_dirty_nodes(struct address_space *mapping,
 				  nodemask_t *mask)

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [patch 1/7] cpusets: add dirty map to struct address_space
  2008-10-30 19:23 [patch 0/7] cpuset writeback throttling David Rientjes
@ 2008-10-30 19:23 ` David Rientjes
  2008-11-04 21:09   ` Andrew Morton
  2008-10-30 19:23 ` [patch 2/7] pdflush: allow the passing of a nodemask parameter David Rientjes
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 46+ messages in thread
From: David Rientjes @ 2008-10-30 19:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Nick Piggin, Peter Zijlstra, Paul Menage,
	Derek Fults, linux-kernel

From: Christoph Lameter <cl@linux-foundation.org>

In a NUMA system it is helpful to know where the dirty pages of a mapping
are located.  That way we will be able to implement writeout for
applications that are constrained to a portion of the memory of the
system as required by cpusets.

This patch implements the management of dirty node maps for an address
space through the following functions:

cpuset_clear_dirty_nodes(mapping)	Clear the map of dirty nodes

cpuset_update_nodes(mapping, page)	Record a node in the dirty nodes
					map

cpuset_init_dirty_nodes(mapping)	Initialization of the map


The dirty map may be stored either directly in the mapping (for NUMA
systems with less then BITS_PER_LONG nodes) or separately allocated for
systems with a large number of nodes (f.e. ia64 with 1024 nodes).

Updating the dirty map may involve allocating it first for large
configurations.  Therefore, we protect the allocation and setting of a
node in the map through the tree_lock.  The tree_lock is already taken
when a page is dirtied so there is no additional locking overhead if we
insert the updating of the nodemask there.

The dirty map is only cleared (or freed) when the inode is cleared.  At
that point no pages are attached to the inode anymore and therefore it
can be done without any locking. The dirty map therefore records all nodes
that have been used for dirty pages by that inode until the inode is no
longer used.

If the mapping for an inode's set of dirty nodes does not intersect the
set under writeback, the inode is requeued for later.  This is the
mechanism for enforcing the writeback control's set of nodes policy.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Menage <menage@google.com>
Cc: Derek Fults <dfults@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 fs/buffer.c               |    2 +
 fs/fs-writeback.c         |    7 ++++++
 fs/inode.c                |    3 ++
 include/linux/cpuset.h    |   51 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h        |    7 ++++++
 include/linux/writeback.h |    2 +
 kernel/cpuset.c           |   53 +++++++++++++++++++++++++++++++++++++++++++++
 mm/page-writeback.c       |    2 +
 8 files changed, 127 insertions(+), 0 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -41,6 +41,7 @@
 #include <linux/bitops.h>
 #include <linux/mpage.h>
 #include <linux/bit_spinlock.h>
+#include <linux/cpuset.h>
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
 
@@ -719,6 +720,7 @@ static int __set_page_dirty(struct page *page,
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
+	cpuset_update_dirty_nodes(mapping, page);
 	spin_unlock_irq(&mapping->tree_lock);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -23,6 +23,7 @@
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
 #include <linux/buffer_head.h>
+#include <linux/cpuset.h>
 #include "internal.h"
 
 
@@ -487,6 +488,12 @@ void generic_sync_sb_inodes(struct super_block *sb,
 			continue;		/* blockdev has wrong queue */
 		}
 
+		if (!cpuset_intersects_dirty_nodes(mapping, wbc->nodes)) {
+			/* No node pages under writeback */
+			requeue_io(inode);
+			continue;
+		}
+
 		/* Was this inode dirtied after sync_sb_inodes was called? */
 		if (time_after(inode->dirtied_when, start))
 			break;
diff --git a/fs/inode.c b/fs/inode.c
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -22,6 +22,7 @@
 #include <linux/bootmem.h>
 #include <linux/inotify.h>
 #include <linux/mount.h>
+#include <linux/cpuset.h>
 
 /*
  * This is needed for the following functions:
@@ -167,6 +168,7 @@ static struct inode *alloc_inode(struct super_block *sb)
 		mapping->assoc_mapping = NULL;
 		mapping->backing_dev_info = &default_backing_dev_info;
 		mapping->writeback_index = 0;
+		cpuset_init_dirty_nodes(mapping);
 
 		/*
 		 * If the block_device provides a backing_dev_info for client
@@ -271,6 +273,7 @@ void clear_inode(struct inode *inode)
 		bd_forget(inode);
 	if (S_ISCHR(inode->i_mode) && inode->i_cdev)
 		cd_forget(inode);
+	cpuset_clear_dirty_nodes(inode->i_mapping);
 	inode->i_state = I_CLEAR;
 }
 
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -80,6 +80,36 @@ extern int current_cpuset_is_being_rebound(void);
 
 extern void rebuild_sched_domains(void);
 
+/*
+ * We need macros since struct address_space is not defined yet
+ */
+#if MAX_NUMNODES <= BITS_PER_LONG
+#define cpuset_init_dirty_nodes(__mapping)				\
+	(__mapping)->dirty_nodes = NODE_MASK_NONE
+
+#define cpuset_update_dirty_nodes(__mapping, __page)			\
+	node_set(page_to_nid(__page), (__mapping)->dirty_nodes);
+
+#define cpuset_clear_dirty_nodes(__mapping)				\
+	(__mapping)->dirty_nodes = NODE_MASK_NONE
+
+#define cpuset_intersects_dirty_nodes(__mapping, __nodemask_ptr)	\
+	(!(__nodemask_ptr) ||						\
+	 nodes_intersects((__mapping)->dirty_nodes, *(__nodemask_ptr)))
+
+#else
+struct address_space;
+
+#define cpuset_init_dirty_nodes(__mapping)				\
+	(__mapping)->dirty_nodes = NULL
+
+extern void cpuset_update_dirty_nodes(struct address_space *mapping,
+				      struct page *page);
+extern void cpuset_clear_dirty_nodes(struct address_space *mapping);
+extern int cpuset_intersects_dirty_nodes(struct address_space *mapping,
+					 nodemask_t *mask);
+#endif
+
 #else /* !CONFIG_CPUSETS */
 
 static inline int cpuset_init_early(void) { return 0; }
@@ -163,6 +193,27 @@ static inline void rebuild_sched_domains(void)
 	partition_sched_domains(1, NULL, NULL);
 }
 
+struct address_space;
+
+static inline void cpuset_init_dirty_nodes(struct address_space *mapping)
+{
+}
+
+static inline void cpuset_update_dirty_nodes(struct address_space *mapping,
+					     struct page *page)
+{
+}
+
+static inline void cpuset_clear_dirty_nodes(struct address_space *mapping)
+{
+}
+
+static inline int cpuset_intersects_dirty_nodes(struct address_space *mapping,
+						nodemask_t *mask)
+{
+	return 1;
+}
+
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -544,6 +544,13 @@ struct address_space {
 	spinlock_t		private_lock;	/* for use by the address_space */
 	struct list_head	private_list;	/* ditto */
 	struct address_space	*assoc_mapping;	/* ditto */
+#ifdef CONFIG_CPUSETS
+#if MAX_NUMNODES <= BITS_PER_LONG
+	nodemask_t		dirty_nodes;	/* nodes with dirty pages */
+#else
+	nodemask_t		*dirty_nodes;	/* pointer to mask, if dirty */
+#endif
+#endif
 } __attribute__((aligned(sizeof(long))));
 	/*
 	 * On most architectures that alignment is already the case; but
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -72,6 +72,8 @@ struct writeback_control {
 	 * so we use a single control to update them
 	 */
 	unsigned no_nrwrite_index_update:1;
+
+	nodemask_t *nodes;		/* Nodemask to writeback */
 };
 
 /*
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -16,6 +16,7 @@
  *  2006 Rework by Paul Menage to use generic cgroups
  *  2008 Rework of the scheduler domains and CPU hotplug handling
  *       by Max Krasnyansky
+ *  2008 Cpuset writeback by Christoph Lameter
  *
  *  This file is subject to the terms and conditions of the GNU General Public
  *  License.  See the file COPYING in the main directory of the Linux
@@ -2323,6 +2324,58 @@ int cpuset_mem_spread_node(void)
 }
 EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
 
+#if MAX_NUMNODES > BITS_PER_LONG
+/*
+ * Special functions for NUMA systems with a large number of nodes.  The
+ * nodemask is pointed to from the address_space structure.  The attachment of
+ * the dirty_nodes nodemask is protected by the tree_lock.  The nodemask is
+ * freed only when the inode is cleared (and therefore unused, thus no locking
+ * is necessary).
+ */
+void cpuset_update_dirty_nodes(struct address_space *mapping,
+			       struct page *page)
+{
+	nodemask_t *nodes = mapping->dirty_nodes;
+	int node = page_to_nid(page);
+
+	if (!nodes) {
+		nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);
+		if (!nodes)
+			return;
+
+		*nodes = NODE_MASK_NONE;
+		mapping->dirty_nodes = nodes;
+	}
+	node_set(node, *nodes);
+}
+
+void cpuset_clear_dirty_nodes(struct address_space *mapping)
+{
+	nodemask_t *nodes = mapping->dirty_nodes;
+
+	if (nodes) {
+		mapping->dirty_nodes = NULL;
+		kfree(nodes);
+	}
+}
+
+/*
+ * Called without tree_lock.  The nodemask is only freed when the inode is
+ * cleared and therefore this is safe.
+ */
+int cpuset_intersects_dirty_nodes(struct address_space *mapping,
+				  nodemask_t *mask)
+{
+	nodemask_t *dirty_nodes = mapping->dirty_nodes;
+
+	if (!mask)
+		return 1;
+	if (!dirty_nodes)
+		return 0;
+	return nodes_intersects(*dirty_nodes, *mask);
+}
+#endif
+
 /**
  * cpuset_mems_allowed_intersects - Does @tsk1's mems_allowed intersect @tsk2's?
  * @tsk1: pointer to task_struct of some task.
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -34,6 +34,7 @@
 #include <linux/syscalls.h>
 #include <linux/buffer_head.h>
 #include <linux/pagevec.h>
+#include <linux/cpuset.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -1104,6 +1105,7 @@ int __set_page_dirty_nobuffers(struct page *page)
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
+		cpuset_update_dirty_nodes(mapping, page);
 		spin_unlock_irq(&mapping->tree_lock);
 		if (mapping->host) {
 			/* !PageAnon && !swapper_space */

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [patch 2/7] pdflush: allow the passing of a nodemask parameter
  2008-10-30 19:23 [patch 0/7] cpuset writeback throttling David Rientjes
  2008-10-30 19:23 ` [patch 1/7] cpusets: add dirty map to struct address_space David Rientjes
@ 2008-10-30 19:23 ` David Rientjes
  2008-10-30 19:23 ` [patch 3/7] mm: make page writeback obey cpuset constraints David Rientjes
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 46+ messages in thread
From: David Rientjes @ 2008-10-30 19:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Nick Piggin, Peter Zijlstra, Paul Menage,
	Derek Fults, linux-kernel

From: Christoph Lameter <cl@linux-foundation.org>

If we want to support nodemask specific writeout, then we need a way to
communicate the set of nodes that an operation should affect.

So add a nodemask_t parameter to the pdflush functions and also store
the nodemask in the pdflush control structure.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Menage <menage@google.com>
Cc: Derek Fults <dfults@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 fs/buffer.c               |    2 +-
 fs/super.c                |    4 ++--
 fs/sync.c                 |    8 ++++----
 include/linux/writeback.h |    5 +++--
 mm/page-writeback.c       |   18 +++++++++---------
 mm/pdflush.c              |   15 ++++++++++++---
 mm/vmscan.c               |    2 +-
 7 files changed, 32 insertions(+), 22 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -364,7 +364,7 @@ static void free_more_memory(void)
 	struct zone *zone;
 	int nid;
 
-	wakeup_pdflush(1024);
+	wakeup_pdflush(1024, NULL);
 	yield();
 
 	for_each_online_node(nid) {
diff --git a/fs/super.c b/fs/super.c
--- a/fs/super.c
+++ b/fs/super.c
@@ -646,7 +646,7 @@ int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
 	return 0;
 }
 
-static void do_emergency_remount(unsigned long foo)
+static void do_emergency_remount(unsigned long foo, nodemask_t *bar)
 {
 	struct super_block *sb;
 
@@ -674,7 +674,7 @@ static void do_emergency_remount(unsigned long foo)
 
 void emergency_remount(void)
 {
-	pdflush_operation(do_emergency_remount, 0);
+	pdflush_operation(do_emergency_remount, 0, NULL);
 }
 
 /*
diff --git a/fs/sync.c b/fs/sync.c
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -21,9 +21,9 @@
  * sync everything.  Start out by waking pdflush, because that writes back
  * all queues in parallel.
  */
-static void do_sync(unsigned long wait)
+static void do_sync(unsigned long wait, nodemask_t *unused)
 {
-	wakeup_pdflush(0);
+	wakeup_pdflush(0, NULL);
 	sync_inodes(0);		/* All mappings, inodes and their blockdevs */
 	DQUOT_SYNC(NULL);
 	sync_supers();		/* Write the superblocks */
@@ -38,13 +38,13 @@ static void do_sync(unsigned long wait)
 
 asmlinkage long sys_sync(void)
 {
-	do_sync(1);
+	do_sync(1, NULL);
 	return 0;
 }
 
 void emergency_sync(void)
 {
-	pdflush_operation(do_sync, 0);
+	pdflush_operation(do_sync, 0, NULL);
 }
 
 /*
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -102,7 +102,7 @@ static inline void inode_sync_wait(struct inode *inode)
 /*
  * mm/page-writeback.c
  */
-int wakeup_pdflush(long nr_pages);
+int wakeup_pdflush(long nr_pages, nodemask_t *nodes);
 void laptop_io_completion(void);
 void laptop_sync_completion(void);
 void throttle_vm_writeout(gfp_t gfp_mask);
@@ -143,7 +143,8 @@ balance_dirty_pages_ratelimited(struct address_space *mapping)
 typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
 				void *data);
 
-int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0);
+int pdflush_operation(void (*fn)(unsigned long, nodemask_t *nodes),
+		      unsigned long arg0, nodemask_t *nodes);
 int generic_writepages(struct address_space *mapping,
 		       struct writeback_control *wbc);
 int write_cache_pages(struct address_space *mapping,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -106,7 +106,7 @@ EXPORT_SYMBOL(laptop_mode);
 /* End of sysctl-exported parameters */
 
 
-static void background_writeout(unsigned long _min_pages);
+static void background_writeout(unsigned long _min_pages, nodemask_t *nodes);
 
 /*
  * Scale the writeback cache size proportional to the relative writeout speeds.
@@ -528,7 +528,7 @@ static void balance_dirty_pages(struct address_space *mapping)
 			(!laptop_mode && (global_page_state(NR_FILE_DIRTY)
 					  + global_page_state(NR_UNSTABLE_NFS)
 					  > background_thresh)))
-		pdflush_operation(background_writeout, 0);
+		pdflush_operation(background_writeout, 0, NULL);
 }
 
 void set_page_dirty_balance(struct page *page, int page_mkwrite)
@@ -616,7 +616,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
  * writeback at least _min_pages, and keep writing until the amount of dirty
  * memory is less than the background threshold, or until we're all clean.
  */
-static void background_writeout(unsigned long _min_pages)
+static void background_writeout(unsigned long _min_pages, nodemask_t *unused)
 {
 	long min_pages = _min_pages;
 	struct writeback_control wbc = {
@@ -658,12 +658,12 @@ static void background_writeout(unsigned long _min_pages)
  * the whole world.  Returns 0 if a pdflush thread was dispatched.  Returns
  * -1 if all pdflush threads were busy.
  */
-int wakeup_pdflush(long nr_pages)
+int wakeup_pdflush(long nr_pages, nodemask_t *nodes)
 {
 	if (nr_pages == 0)
 		nr_pages = global_page_state(NR_FILE_DIRTY) +
 				global_page_state(NR_UNSTABLE_NFS);
-	return pdflush_operation(background_writeout, nr_pages);
+	return pdflush_operation(background_writeout, nr_pages, nodes);
 }
 
 static void wb_timer_fn(unsigned long unused);
@@ -687,7 +687,7 @@ static DEFINE_TIMER(laptop_mode_wb_timer, laptop_timer_fn, 0, 0);
  * older_than_this takes precedence over nr_to_write.  So we'll only write back
  * all dirty pages if they are all attached to "old" mappings.
  */
-static void wb_kupdate(unsigned long arg)
+static void wb_kupdate(unsigned long arg, nodemask_t *unused)
 {
 	unsigned long oldest_jif;
 	unsigned long start_jif;
@@ -746,18 +746,18 @@ int dirty_writeback_centisecs_handler(ctl_table *table, int write,
 
 static void wb_timer_fn(unsigned long unused)
 {
-	if (pdflush_operation(wb_kupdate, 0) < 0)
+	if (pdflush_operation(wb_kupdate, 0, NULL) < 0)
 		mod_timer(&wb_timer, jiffies + HZ); /* delay 1 second */
 }
 
-static void laptop_flush(unsigned long unused)
+static void laptop_flush(unsigned long unused, nodemask_t *unused2)
 {
 	sys_sync();
 }
 
 static void laptop_timer_fn(unsigned long unused)
 {
-	pdflush_operation(laptop_flush, 0);
+	pdflush_operation(laptop_flush, 0, NULL);
 }
 
 /*
diff --git a/mm/pdflush.c b/mm/pdflush.c
--- a/mm/pdflush.c
+++ b/mm/pdflush.c
@@ -83,10 +83,12 @@ static unsigned long last_empty_jifs;
  */
 struct pdflush_work {
 	struct task_struct *who;	/* The thread */
-	void (*fn)(unsigned long);	/* A callback function */
+	void (*fn)(unsigned long, nodemask_t *); /* A callback function */
 	unsigned long arg0;		/* An argument to the callback */
 	struct list_head list;		/* On pdflush_list, when idle */
 	unsigned long when_i_went_to_sleep;
+	int have_nodes;			/* Nodes were specified */
+	nodemask_t nodes;		/* Nodes to writeback */
 };
 
 static int __pdflush(struct pdflush_work *my_work)
@@ -124,7 +126,8 @@ static int __pdflush(struct pdflush_work *my_work)
 		}
 		spin_unlock_irq(&pdflush_lock);
 
-		(*my_work->fn)(my_work->arg0);
+		(*my_work->fn)(my_work->arg0,
+			       my_work->have_nodes ? &my_work->nodes : NULL);
 
 		/*
 		 * Thread creation: For how long have there been zero
@@ -198,7 +201,8 @@ static int pdflush(void *dummy)
  * Returns zero if it indeed managed to find a worker thread, and passed your
  * payload to it.
  */
-int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0)
+int pdflush_operation(void (*fn)(unsigned long, nodemask_t *),
+		      unsigned long arg0, nodemask_t *nodes)
 {
 	unsigned long flags;
 	int ret = 0;
@@ -217,6 +221,11 @@ int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0)
 			last_empty_jifs = jiffies;
 		pdf->fn = fn;
 		pdf->arg0 = arg0;
+		if (nodes) {
+			pdf->nodes = *nodes;
+			pdf->have_nodes = 1;
+		} else
+			pdf->have_nodes = 0;
 		wake_up_process(pdf->who);
 	}
 	spin_unlock_irqrestore(&pdflush_lock, flags);
diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1400,7 +1400,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		 */
 		if (total_scanned > sc->swap_cluster_max +
 					sc->swap_cluster_max / 2) {
-			wakeup_pdflush(laptop_mode ? 0 : total_scanned);
+			wakeup_pdflush(laptop_mode ? 0 : total_scanned, NULL);
 			sc->may_writepage = 1;
 		}
 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [patch 3/7] mm: make page writeback obey cpuset constraints
  2008-10-30 19:23 [patch 0/7] cpuset writeback throttling David Rientjes
  2008-10-30 19:23 ` [patch 1/7] cpusets: add dirty map to struct address_space David Rientjes
  2008-10-30 19:23 ` [patch 2/7] pdflush: allow the passing of a nodemask parameter David Rientjes
@ 2008-10-30 19:23 ` David Rientjes
  2008-10-30 19:23 ` [patch 4/7] mm: cpuset aware reclaim writeout David Rientjes
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 46+ messages in thread
From: David Rientjes @ 2008-10-30 19:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Nick Piggin, Peter Zijlstra, Paul Menage,
	Derek Fults, linux-kernel

From: Christoph Lameter <cl@linux-foundation.org>

Currently dirty throttling does not work properly in a cpuset.

If, for example, a cpuset contains only 1/10th of available memory then
all of the memory of a cpuset can be dirtied without any writes being
triggered.  If all of the cpuset's memory is dirty then only 10% of total
memory is dirty.  The background writeback threshold is usually set at
10% and the synchrononous threshold at 40%.  So we are still below the
global limits while the dirty ratio in the cpuset is 100%!  Writeback
throttling and background writeout do not work at all in such scenarios.

This patch makes dirty writeout cpuset aware.  When determining the dirty
limits in get_dirty_limits(), we calculate values based on the nodes that
are reachable from the current process (that has been dirtying the page).
Then we can trigger writeout based on the dirty ratio of the memory in
the cpuset.

We trigger writeout in a a cpuset specific way.  We go through the dirty
inodes and search for inodes that have dirty pages on the nodes of the
active cpuset.  If an inode fulfills that requirement then we begin
writeout of the dirty pages of that inode.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Menage <menage@google.com>
Cc: Derek Fults <dfults@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/writeback.h |   13 +++-
 mm/backing-dev.c          |   12 ++--
 mm/page-writeback.c       |  164 ++++++++++++++++++++++++++++++---------------
 3 files changed, 126 insertions(+), 63 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -76,6 +76,15 @@ struct writeback_control {
 	nodemask_t *nodes;		/* Nodemask to writeback */
 };
 
+struct dirty_limits {
+	long thresh_background;
+	long thresh_dirty;
+	long thresh_bdi_dirty;
+	unsigned long nr_dirty;
+	unsigned long nr_unstable;
+	unsigned long nr_writeback;
+};
+
 /*
  * fs/fs-writeback.c
  */	
@@ -127,8 +136,8 @@ struct file;
 int dirty_writeback_centisecs_handler(struct ctl_table *, int, struct file *,
 				      void __user *, size_t *, loff_t *);
 
-void get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
-		 struct backing_dev_info *bdi);
+int get_dirty_limits(struct dirty_limits *dl, struct backing_dev_info *bdi,
+		     nodemask_t *nodes);
 
 void page_writeback_init(void);
 void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -24,11 +24,9 @@ static void bdi_debug_init(void)
 static int bdi_debug_stats_show(struct seq_file *m, void *v)
 {
 	struct backing_dev_info *bdi = m->private;
-	long background_thresh;
-	long dirty_thresh;
-	long bdi_thresh;
+	struct dirty_limits dl;
 
-	get_dirty_limits(&background_thresh, &dirty_thresh, &bdi_thresh, bdi);
+	get_dirty_limits(&dl, bdi, NULL);
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 	seq_printf(m,
@@ -39,9 +37,9 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   "BackgroundThresh: %8lu kB\n",
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
-		   K(bdi_thresh),
-		   K(dirty_thresh),
-		   K(background_thresh));
+		   K(dl.thresh_bdi_dirty),
+		   K(dl.thresh_dirty),
+		   K(dl.thresh_background));
 #undef K
 
 	return 0;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -320,13 +320,16 @@ EXPORT_SYMBOL(bdi_set_max_ratio);
  * clamping level.
  */
 
-static unsigned long highmem_dirtyable_memory(unsigned long total)
+static unsigned long highmem_dirtyable_memory(nodemask_t *nodes,
+					      unsigned long total)
 {
 #ifdef CONFIG_HIGHMEM
 	int node;
 	unsigned long x = 0;
 
-	for_each_node_state(node, N_HIGH_MEMORY) {
+	if (!nodes)
+		nodes = &node_states[N_HIGH_MEMORY];
+	for_each_node_mask(node, *nodes) {
 		struct zone *z =
 			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
 
@@ -357,21 +360,75 @@ unsigned long determine_dirtyable_memory(void)
 	x = global_page_state(NR_FREE_PAGES) + global_lru_pages();
 
 	if (!vm_highmem_is_dirtyable)
-		x -= highmem_dirtyable_memory(x);
+		x -= highmem_dirtyable_memory(NULL, x);
 
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
-void
-get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
-		 struct backing_dev_info *bdi)
+static void populate_global_dirty_limits(struct dirty_limits *dl,
+					 unsigned long *dirtyable_memory,
+					 unsigned long *nr_mapped)
+{
+	dl->nr_dirty = global_page_state(NR_FILE_DIRTY);
+	dl->nr_unstable = global_page_state(NR_UNSTABLE_NFS);
+	dl->nr_writeback = global_page_state(NR_WRITEBACK);
+	*dirtyable_memory = determine_dirtyable_memory();
+	*nr_mapped = global_page_state(NR_FILE_MAPPED) +
+		     global_page_state(NR_ANON_PAGES);
+}
+
+static void populate_nodemask_dirty_limits(struct dirty_limits *dl,
+					   unsigned long *dirtyable_memory,
+					   unsigned long *nr_mapped,
+					   nodemask_t *nodes)
+{
+	int node;
+
+	/*
+	 * We do not disregard highmem because all nodes (except maybe node 0)
+	 * have either all memory in HIGHMEM (32-bit) or all memory in
+	 * non-HIGHMEM (64-bit).  If we would disregard highmem, then cpuset
+	 * throttling would not work on 32-bit.
+	 */
+	for_each_node_mask(node, *nodes) {
+		if (!node_online(node))
+			continue;
+		dl->nr_dirty += node_page_state(node, NR_FILE_DIRTY);
+		dl->nr_unstable += node_page_state(node, NR_UNSTABLE_NFS);
+		dl->nr_writeback += node_page_state(node, NR_WRITEBACK);
+		*dirtyable_memory +=
+			node_page_state(node, NR_ACTIVE_ANON) +
+			node_page_state(node, NR_ACTIVE_FILE) +
+			node_page_state(node, NR_INACTIVE_ANON) +
+			node_page_state(node, NR_INACTIVE_FILE) +
+			node_page_state(node, NR_FREE_PAGES);
+		*nr_mapped +=
+			node_page_state(node, NR_FILE_MAPPED);
+			node_page_state(node, NR_ANON_PAGES);
+	}
+	*dirtyable_memory -= highmem_dirtyable_memory(nodes, *dirtyable_memory);
+}
+
+int
+get_dirty_limits(struct dirty_limits *dl, struct backing_dev_info *bdi,
+		 nodemask_t *nodes)
 {
 	int background_ratio;		/* Percentages */
 	int dirty_ratio;
 	long background;
 	long dirty;
-	unsigned long available_memory = determine_dirtyable_memory();
+	unsigned long dirtyable_memory = 0;
+	unsigned long nr_mapped = 0;
 	struct task_struct *tsk;
+	int is_subset;
+
+	memset(dl, 0, sizeof(struct dirty_limits));
+	is_subset = nodes && !nodes_subset(node_online_map, *nodes);
+	if (unlikely(is_subset))
+		populate_nodemask_dirty_limits(dl, &dirtyable_memory,
+					       &nr_mapped, nodes);
+	else
+		populate_global_dirty_limits(dl, &dirtyable_memory, &nr_mapped);
 
 	dirty_ratio = vm_dirty_ratio;
 	if (dirty_ratio < 5)
@@ -381,15 +438,15 @@ get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
 	if (background_ratio >= dirty_ratio)
 		background_ratio = dirty_ratio / 2;
 
-	background = (background_ratio * available_memory) / 100;
-	dirty = (dirty_ratio * available_memory) / 100;
+	background = (background_ratio * dirtyable_memory) / 100;
+	dirty = (dirty_ratio * dirtyable_memory) / 100;
 	tsk = current;
 	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
 		background += background / 4;
 		dirty += dirty / 4;
 	}
-	*pbackground = background;
-	*pdirty = dirty;
+	dl->thresh_background = background;
+	dl->thresh_dirty = dirty;
 
 	if (bdi) {
 		u64 bdi_dirty;
@@ -407,10 +464,11 @@ get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
 		if (bdi_dirty > (dirty * bdi->max_ratio) / 100)
 			bdi_dirty = dirty * bdi->max_ratio / 100;
 
-		*pbdi_dirty = bdi_dirty;
-		clip_bdi_dirty_limit(bdi, dirty, pbdi_dirty);
-		task_dirty_limit(current, pbdi_dirty);
+		dl->thresh_bdi_dirty = bdi_dirty;
+		clip_bdi_dirty_limit(bdi, dirty, &dl->thresh_bdi_dirty);
+		task_dirty_limit(current, &dl->thresh_bdi_dirty);
 	}
+	return is_subset;
 }
 
 /*
@@ -424,9 +482,7 @@ static void balance_dirty_pages(struct address_space *mapping)
 {
 	long nr_reclaimable, bdi_nr_reclaimable;
 	long nr_writeback, bdi_nr_writeback;
-	long background_thresh;
-	long dirty_thresh;
-	long bdi_thresh;
+	struct dirty_limits dl;
 	unsigned long pages_written = 0;
 	unsigned long write_chunk = sync_writeback_pages();
 
@@ -441,17 +497,16 @@ static void balance_dirty_pages(struct address_space *mapping)
 			.range_cyclic	= 1,
 		};
 
-		get_dirty_limits(&background_thresh, &dirty_thresh,
-				&bdi_thresh, bdi);
-
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
-		nr_writeback = global_page_state(NR_WRITEBACK);
+		if (get_dirty_limits(&dl, bdi, &cpuset_current_mems_allowed))
+			wbc.nodes = &cpuset_current_mems_allowed;
+		nr_reclaimable = dl.nr_dirty + dl.nr_unstable;
+		nr_writeback = dl.nr_writeback;
 
 		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
 		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
 
-		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
+		if (bdi_nr_reclaimable + bdi_nr_writeback <=
+							dl.thresh_bdi_dirty)
 			break;
 
 		/*
@@ -460,7 +515,7 @@ static void balance_dirty_pages(struct address_space *mapping)
 		 * when the bdi limits are ramping up.
 		 */
 		if (nr_reclaimable + nr_writeback <
-				(background_thresh + dirty_thresh) / 2)
+				(dl.thresh_background + dl.thresh_dirty) / 2)
 			break;
 
 		if (!bdi->dirty_exceeded)
@@ -475,8 +530,12 @@ static void balance_dirty_pages(struct address_space *mapping)
 		if (bdi_nr_reclaimable) {
 			writeback_inodes(&wbc);
 			pages_written += write_chunk - wbc.nr_to_write;
-			get_dirty_limits(&background_thresh, &dirty_thresh,
-				       &bdi_thresh, bdi);
+			get_dirty_limits(&dl, bdi,
+					 &cpuset_current_mems_allowed);
+			nr_reclaimable = dl.nr_dirty + dl.nr_unstable;
+			if (nr_reclaimable + dl.nr_writeback <=
+							dl.thresh_dirty)
+				break;
 		}
 
 		/*
@@ -489,7 +548,7 @@ static void balance_dirty_pages(struct address_space *mapping)
 		 * actually dirty; with m+n sitting in the percpu
 		 * deltas.
 		 */
-		if (bdi_thresh < 2*bdi_stat_error(bdi)) {
+		if (dl.thresh_bdi_dirty < 2*bdi_stat_error(bdi)) {
 			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
 			bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
 		} else if (bdi_nr_reclaimable) {
@@ -497,7 +556,8 @@ static void balance_dirty_pages(struct address_space *mapping)
 			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
-		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
+		if (bdi_nr_reclaimable + bdi_nr_writeback <=
+							dl.thresh_bdi_dirty)
 			break;
 		if (pages_written >= write_chunk)
 			break;		/* We've done our duty */
@@ -505,8 +565,8 @@ static void balance_dirty_pages(struct address_space *mapping)
 		congestion_wait(WRITE, HZ/10);
 	}
 
-	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
-			bdi->dirty_exceeded)
+	if (bdi->dirty_exceeded && nr_reclaimable + dl.nr_writeback <=
+							dl.thresh_dirty)
 		bdi->dirty_exceeded = 0;
 
 	if (writeback_in_progress(bdi))
@@ -521,10 +581,9 @@ static void balance_dirty_pages(struct address_space *mapping)
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
 	if ((laptop_mode && pages_written) ||
-			(!laptop_mode && (global_page_state(NR_FILE_DIRTY)
-					  + global_page_state(NR_UNSTABLE_NFS)
-					  > background_thresh)))
-		pdflush_operation(background_writeout, 0, NULL);
+		(!laptop_mode && (nr_reclaimable > dl.thresh_background)))
+		pdflush_operation(background_writeout, 0,
+				  &cpuset_current_mems_allowed);
 }
 
 void set_page_dirty_balance(struct page *page, int page_mkwrite)
@@ -581,22 +640,20 @@ EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
 void throttle_vm_writeout(gfp_t gfp_mask)
 {
-	long background_thresh;
-	long dirty_thresh;
+	struct dirty_limits dl;
 
         for ( ; ; ) {
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
+		get_dirty_limits(&dl, NULL, &node_states[N_HIGH_MEMORY]);
 
-                /*
-                 * Boost the allowable dirty threshold a bit for page
-                 * allocators so they don't get DoS'ed by heavy writers
-                 */
-                dirty_thresh += dirty_thresh / 10;      /* wheeee... */
+		/*
+		 * Boost the allowable dirty threshold a bit for page
+		 * allocators so they don't get DoS'ed by heavy writers
+		 */
+		dl.thresh_dirty += dl.thresh_dirty / 10; /* wheeee... */
 
-                if (global_page_state(NR_UNSTABLE_NFS) +
-			global_page_state(NR_WRITEBACK) <= dirty_thresh)
-                        	break;
-                congestion_wait(WRITE, HZ/10);
+		if (dl.nr_unstable + dl.nr_writeback <= dl.thresh_dirty)
+			break;
+		congestion_wait(WRITE, HZ/10);
 
 		/*
 		 * The caller might hold locks which can prevent IO completion
@@ -612,7 +669,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
  * writeback at least _min_pages, and keep writing until the amount of dirty
  * memory is less than the background threshold, or until we're all clean.
  */
-static void background_writeout(unsigned long _min_pages, nodemask_t *unused)
+static void background_writeout(unsigned long _min_pages, nodemask_t *nodes)
 {
 	long min_pages = _min_pages;
 	struct writeback_control wbc = {
@@ -625,13 +682,12 @@ static void background_writeout(unsigned long _min_pages, nodemask_t *unused)
 	};
 
 	for ( ; ; ) {
-		long background_thresh;
-		long dirty_thresh;
+		struct dirty_limits dl;
 
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
-		if (global_page_state(NR_FILE_DIRTY) +
-			global_page_state(NR_UNSTABLE_NFS) < background_thresh
-				&& min_pages <= 0)
+		if (get_dirty_limits(&dl, NULL, nodes))
+			wbc.nodes = nodes;
+		if (dl.nr_dirty + dl.nr_unstable < dl.thresh_background &&
+		    min_pages <= 0)
 			break;
 		wbc.more_io = 0;
 		wbc.encountered_congestion = 0;

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [patch 4/7] mm: cpuset aware reclaim writeout
  2008-10-30 19:23 [patch 0/7] cpuset writeback throttling David Rientjes
                   ` (2 preceding siblings ...)
  2008-10-30 19:23 ` [patch 3/7] mm: make page writeback obey cpuset constraints David Rientjes
@ 2008-10-30 19:23 ` David Rientjes
  2008-10-30 19:23 ` [patch 5/7] mm: throttle writeout with cpuset awareness David Rientjes
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 46+ messages in thread
From: David Rientjes @ 2008-10-30 19:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Nick Piggin, Peter Zijlstra, Paul Menage,
	Derek Fults, linux-kernel

From: Christoph Lameter <cl@linux-foundation.org>

During direct reclaim we traverse down a zonelist and are carefully
checking each zone if it's a member of the active cpuset.  But then we
call pdflush without enforcing the same restrictions.  In a larger system
this may have the effect of a massive amount of pages being dirtied and
then either

 - no writeout occurs because global dirty limits have not been reached,
   or

 - writeout starts randomly for some dirty inode in the system.  pdflush
   may just write out data for nodes in another cpuset and miss doing
   proper dirty handling for the current cpuset.

In both cases, dirty pages in the zones of interest may not be affected
and writeout may not occur as necessary.

Fix that by restricting pdflush to the active cpuset.  Writeout will occur
from direct reclaim the same way as without a cpuset.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Menage <menage@google.com>
Cc: Derek Fults <dfults@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/vmscan.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1604,7 +1604,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		 */
 		if (total_scanned > sc->swap_cluster_max +
 					sc->swap_cluster_max / 2) {
-			wakeup_pdflush(laptop_mode ? 0 : total_scanned, NULL);
+			wakeup_pdflush(laptop_mode ? 0 : total_scanned,
+				       &cpuset_current_mems_allowed);
 			sc->may_writepage = 1;
 		}
 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [patch 5/7] mm: throttle writeout with cpuset awareness
  2008-10-30 19:23 [patch 0/7] cpuset writeback throttling David Rientjes
                   ` (3 preceding siblings ...)
  2008-10-30 19:23 ` [patch 4/7] mm: cpuset aware reclaim writeout David Rientjes
@ 2008-10-30 19:23 ` David Rientjes
  2008-10-30 19:23 ` [patch 6/7] cpusets: per cpuset dirty ratios David Rientjes
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 46+ messages in thread
From: David Rientjes @ 2008-10-30 19:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Nick Piggin, Peter Zijlstra, Paul Menage,
	Derek Fults, linux-kernel

From: Christoph Lameter <cl@linux-foundation.org>

This bases the VM throttling from the reclaim path on the dirty ratio of
the cpuset.  Note that a cpuset is only effective if shrink_zone is called
from direct reclaim.

kswapd has a cpuset context that includes the whole machine.  VM
throttling will only work during synchrononous reclaim and not from
kswapd.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Menage <menage@google.com>
Cc: Derek Fults <dfults@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/writeback.h |    2 +-
 mm/page-writeback.c       |    4 ++--
 mm/vmscan.c               |    2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -114,7 +114,7 @@ static inline void inode_sync_wait(struct inode *inode)
 int wakeup_pdflush(long nr_pages, nodemask_t *nodes);
 void laptop_io_completion(void);
 void laptop_sync_completion(void);
-void throttle_vm_writeout(gfp_t gfp_mask);
+void throttle_vm_writeout(nodemask_t *nodes, gfp_t gfp_mask);
 
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -638,12 +638,12 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
-void throttle_vm_writeout(gfp_t gfp_mask)
+void throttle_vm_writeout(nodemask_t *nodes, gfp_t gfp_mask)
 {
 	struct dirty_limits dl;
 
         for ( ; ; ) {
-		get_dirty_limits(&dl, NULL, &node_states[N_HIGH_MEMORY]);
+		get_dirty_limits(&dl, NULL, nodes);
 
 		/*
 		 * Boost the allowable dirty threshold a bit for page
diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1466,7 +1466,7 @@ static unsigned long shrink_zone(int priority, struct zone *zone,
 	else if (!scan_global_lru(sc))
 		shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
 
-	throttle_vm_writeout(sc->gfp_mask);
+	throttle_vm_writeout(&cpuset_current_mems_allowed, sc->gfp_mask);
 	return nr_reclaimed;
 }
 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [patch 6/7] cpusets: per cpuset dirty ratios
  2008-10-30 19:23 [patch 0/7] cpuset writeback throttling David Rientjes
                   ` (4 preceding siblings ...)
  2008-10-30 19:23 ` [patch 5/7] mm: throttle writeout with cpuset awareness David Rientjes
@ 2008-10-30 19:23 ` David Rientjes
  2008-10-30 19:23 ` [patch 7/7] cpusets: update documentation for writeback throttling David Rientjes
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 46+ messages in thread
From: David Rientjes @ 2008-10-30 19:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Nick Piggin, Peter Zijlstra, Paul Menage,
	Derek Fults, linux-kernel

From: Christoph Lameter <cl@linux-foundation.org>

This implements dirty ratios per cpuset.  Two new files are added to the
cpuset directories:

dirty_background_ratio	Percentage at which background writeback starts

dirty_ratio		Percentage at which the application is throttled
			and we start synchrononous writeout

Both variables are set to -1 by default which means that the global
limits (/proc/sys/vm/dirty_background_ratio and /proc/sys/vm/dirty_ratio)
are used for a cpuset.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Menage <menage@google.com>
Cc: Derek Fults <dfults@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/cpuset.h |    7 ++++
 kernel/cpuset.c        |   91 ++++++++++++++++++++++++++++++++++++++++++++++++
 mm/page-writeback.c    |   12 ++++---
 3 files changed, 105 insertions(+), 5 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -76,6 +76,8 @@ static inline int cpuset_do_slab_mem_spread(void)
 
 extern void cpuset_track_online_nodes(void);
 
+extern void cpuset_get_current_dirty_ratios(int *background, int *throttle);
+
 extern int current_cpuset_is_being_rebound(void);
 
 extern void rebuild_sched_domains(void);
@@ -183,6 +185,11 @@ static inline int cpuset_do_slab_mem_spread(void)
 
 static inline void cpuset_track_online_nodes(void) {}
 
+static inline void cpuset_get_current_dirty_ratios(int *background,
+						   int *throttle)
+{
+}
+
 static inline int current_cpuset_is_being_rebound(void)
 {
 	return 0;
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -53,6 +53,7 @@
 #include <linux/time.h>
 #include <linux/backing-dev.h>
 #include <linux/sort.h>
+#include <linux/writeback.h>
 
 #include <asm/uaccess.h>
 #include <asm/atomic.h>
@@ -105,6 +106,9 @@ struct cpuset {
 
 	/* used for walking a cpuset heirarchy */
 	struct list_head stack_list;
+
+	int dirty_background_ratio;
+	int cpuset_dirty_ratio;
 };
 
 /* Retrieve the cpuset for a cgroup */
@@ -197,6 +201,8 @@ static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
 	.cpus_allowed = CPU_MASK_ALL,
 	.mems_allowed = NODE_MASK_ALL,
+	.dirty_background_ratio = -1,
+	.cpuset_dirty_ratio = -1,
 };
 
 /*
@@ -1198,6 +1204,42 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	return 0;
 }
 
+static int update_int(int *cs_int, int val, int min, int max)
+{
+	if (val < min || val > max)
+		return -EINVAL;
+	mutex_lock(&callback_mutex);
+	*cs_int = val;
+	mutex_unlock(&callback_mutex);
+	return 0;
+}
+
+static u64 get_dirty_background_ratio(struct cpuset *cs)
+{
+	int ret;
+
+	mutex_lock(&callback_mutex);
+	ret = cs->dirty_background_ratio;
+	mutex_unlock(&callback_mutex);
+
+	if (ret == -1)
+		ret = dirty_background_ratio;
+	return (u64)ret;
+}
+
+static u64 get_dirty_ratio(struct cpuset *cs)
+{
+	int ret;
+
+	mutex_lock(&callback_mutex);
+	ret = cs->cpuset_dirty_ratio;
+	mutex_unlock(&callback_mutex);
+
+	if (ret == -1)
+		ret = vm_dirty_ratio;
+	return (u64)ret;
+}
+
 /*
  * Frequency meter - How fast is some event occurring?
  *
@@ -1362,6 +1404,8 @@ typedef enum {
 	FILE_MEMORY_PRESSURE,
 	FILE_SPREAD_PAGE,
 	FILE_SPREAD_SLAB,
+	FILE_DIRTY_BACKGROUND_RATIO,
+	FILE_DIRTY_RATIO,
 } cpuset_filetype_t;
 
 static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
@@ -1424,6 +1468,12 @@ static int cpuset_write_s64(struct cgroup *cgrp, struct cftype *cft, s64 val)
 	case FILE_SCHED_RELAX_DOMAIN_LEVEL:
 		retval = update_relax_domain_level(cs, val);
 		break;
+	case FILE_DIRTY_BACKGROUND_RATIO:
+		retval = update_int(&cs->dirty_background_ratio, val, -1, 100);
+		break;
+	case FILE_DIRTY_RATIO:
+		retval = update_int(&cs->cpuset_dirty_ratio, val, -1, 100);
+		break;
 	default:
 		retval = -EINVAL;
 		break;
@@ -1551,6 +1601,10 @@ static u64 cpuset_read_u64(struct cgroup *cont, struct cftype *cft)
 		return is_spread_page(cs);
 	case FILE_SPREAD_SLAB:
 		return is_spread_slab(cs);
+	case FILE_DIRTY_BACKGROUND_RATIO:
+		return get_dirty_background_ratio(cs);
+	case FILE_DIRTY_RATIO:
+		return get_dirty_ratio(cs);
 	default:
 		BUG();
 	}
@@ -1658,6 +1712,20 @@ static struct cftype files[] = {
 		.write_u64 = cpuset_write_u64,
 		.private = FILE_SPREAD_SLAB,
 	},
+
+	{
+		.name = "dirty_background_ratio",
+		.read_u64 = cpuset_read_u64,
+		.write_s64 = cpuset_write_s64,
+		.private = FILE_DIRTY_BACKGROUND_RATIO,
+	},
+
+	{
+		.name = "dirty_ratio",
+		.read_u64 = cpuset_read_u64,
+		.write_s64 = cpuset_write_s64,
+		.private = FILE_DIRTY_RATIO,
+	},
 };
 
 static struct cftype cft_memory_pressure_enabled = {
@@ -1753,6 +1821,8 @@ static struct cgroup_subsys_state *cpuset_create(
 	cs->mems_generation = cpuset_mems_generation++;
 	fmeter_init(&cs->fmeter);
 	cs->relax_domain_level = -1;
+	cs->dirty_background_ratio = parent->dirty_background_ratio;
+	cs->cpuset_dirty_ratio = parent->cpuset_dirty_ratio;
 
 	cs->parent = parent;
 	number_of_cpusets++;
@@ -2021,6 +2091,27 @@ void cpuset_track_online_nodes(void)
 }
 #endif
 
+/*
+ * Determine the dirty ratios for the currently active cpuset
+ */
+void cpuset_get_current_dirty_ratios(int *background, int *throttle)
+{
+	if (task_cs(current) == &top_cpuset) {
+		*background = top_cpuset.dirty_background_ratio;
+		*throttle = top_cpuset.cpuset_dirty_ratio;
+	} else {
+		rcu_read_lock();
+		*background = task_cs(current)->dirty_background_ratio;
+		*throttle = task_cs(current)->cpuset_dirty_ratio;
+		rcu_read_unlock();
+	}
+
+	if (*background == -1)
+		*background = dirty_background_ratio;
+	if (*throttle == -1)
+		*throttle = vm_dirty_ratio;
+}
+
 /**
  * cpuset_init_smp - initialize cpus_allowed
  *
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -424,17 +424,19 @@ get_dirty_limits(struct dirty_limits *dl, struct backing_dev_info *bdi,
 
 	memset(dl, 0, sizeof(struct dirty_limits));
 	is_subset = nodes && !nodes_subset(node_online_map, *nodes);
-	if (unlikely(is_subset))
+	if (unlikely(is_subset)) {
 		populate_nodemask_dirty_limits(dl, &dirtyable_memory,
 					       &nr_mapped, nodes);
-	else
+		cpuset_get_current_dirty_ratios(&background_ratio,
+						&dirty_ratio);
+	} else {
 		populate_global_dirty_limits(dl, &dirtyable_memory, &nr_mapped);
+		dirty_ratio = vm_dirty_ratio;
+		background_ratio = dirty_background_ratio;
+	}
 
-	dirty_ratio = vm_dirty_ratio;
 	if (dirty_ratio < 5)
 		dirty_ratio = 5;
-
-	background_ratio = dirty_background_ratio;
 	if (background_ratio >= dirty_ratio)
 		background_ratio = dirty_ratio / 2;
 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [patch 7/7] cpusets: update documentation for writeback throttling
  2008-10-30 19:23 [patch 0/7] cpuset writeback throttling David Rientjes
                   ` (5 preceding siblings ...)
  2008-10-30 19:23 ` [patch 6/7] cpusets: per cpuset dirty ratios David Rientjes
@ 2008-10-30 19:23 ` David Rientjes
  2008-10-30 21:08 ` [patch 0/7] cpuset " Dave Chinner
  2008-11-04 20:47 ` Andrew Morton
  8 siblings, 0 replies; 46+ messages in thread
From: David Rientjes @ 2008-10-30 19:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Nick Piggin, Peter Zijlstra, Paul Menage,
	Derek Fults, Randy Dunlap, linux-kernel

Update Documentation/cpusets.txt for per-cpuset writeback throttling.

Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Menage <menage@google.com>
Cc: Derek Fults <dfults@sgi.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 Documentation/cpusets.txt          |   58 +++++++++++++++++++++++++++--------
 Documentation/filesystems/proc.txt |    6 ++++
 2 files changed, 50 insertions(+), 14 deletions(-)

diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt
--- a/Documentation/cpusets.txt
+++ b/Documentation/cpusets.txt
@@ -14,15 +14,17 @@ CONTENTS:
 =========
 
 1. Cpusets
-  1.1 What are cpusets ?
-  1.2 Why are cpusets needed ?
-  1.3 How are cpusets implemented ?
-  1.4 What are exclusive cpusets ?
-  1.5 What is memory_pressure ?
-  1.6 What is memory spread ?
-  1.7 What is sched_load_balance ?
-  1.8 What is sched_relax_domain_level ?
-  1.9 How do I use cpusets ?
+  1.1  What are cpusets ?
+  1.2  Why are cpusets needed ?
+  1.3  How are cpusets implemented ?
+  1.4  What are exclusive cpusets ?
+  1.5  What is memory_pressure ?
+  1.6  What is memory spread ?
+  1.7  What is sched_load_balance ?
+  1.8  What is sched_relax_domain_level ?
+  1.9  What is dirty_background_ratio ?
+  1.10 What is dirty_ratio ?
+  1.11 How do I use cpusets ?
 2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Adding/removing cpus
@@ -148,6 +150,8 @@ into the rest of the kernel, none in performance critical paths:
    Memory Nodes by what's allowed in that tasks cpuset.
  - in page_alloc.c, to restrict memory to allowed nodes.
  - in vmscan.c, to restrict page recovery to the current cpuset.
+ - in pdflush.c, to restrict pdflush's nodes of interest to those
+   of the current task's cpuset.
 
 You should mount the "cgroup" filesystem type in order to enable
 browsing and modifying the cpusets presently known to the kernel.  No
@@ -572,9 +576,34 @@ If your situation is:
 then increasing 'sched_relax_domain_level' would benefit you.
 
 
-1.9 How do I use cpusets ?
+1.9 What is dirty_background_ratio ?
+------------------------------------
+
+The per-cpuset percentage of dirtyable memory (free pages + mapped
+pages + file cache, not including locked or huge pages), the number of
+pages at which the pdflush background writeout daemon will start writing
+out dirtied data.
+
+The default value is /proc/sys/vm/dirty_background_ratio.  Set the
+cpuset's dirty_background_ratio to -1 to accept this system-wide
+default.
+
+
+1.10 What is dirty_ratio ?
 --------------------------
 
+The per-cpuset percentage of dirtyable memory (free pages + mapped
+pages + file cache, not including locked or huge pages), the number of
+pages at which a process generating disk writes will itself start
+writing out dirtied data.
+
+The default value is /proc/sys/vm/dirty_ratio.  Set the cpuset's
+dirty_ratio to -1 to accept this system-wide default.
+
+
+1.11 How do I use cpusets ?
+---------------------------
+
 In order to minimize the impact of cpusets on critical kernel
 code, such as the scheduler, and due to the fact that the kernel
 does not support one task updating the memory placement of another
@@ -716,10 +745,11 @@ Now you want to do something with this cpuset.
 
 In this directory you can find several files:
 # ls
-cpu_exclusive  memory_migrate      mems                      tasks
-cpus           memory_pressure     notify_on_release
-mem_exclusive  memory_spread_page  sched_load_balance
-mem_hardwall   memory_spread_slab  sched_relax_domain_level
+cpu_exclusive		mem_hardwall		mems
+cpus			memory_migrate		notify_on_release
+dirty_background_ratio	memory_pressure		sched_load_balance
+dirty_ratio		memory_spread_page	sched_relax_domain_level
+mem_exclusive		memory_spread_slab	tasks
 
 Reading them will give you information about the state of this cpuset:
 the CPUs and Memory Nodes it can use, the processes that are using
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1389,6 +1389,9 @@ pages + file cache, not including locked pages and HugePages), the number of
 pages at which the pdflush background writeback daemon will start writing out
 dirty data.
 
+This ratio can be superseded for memory belonging to a cpuset; see
+Documentation/cpusets.txt for more information.
+
 dirty_ratio
 -----------------
 
@@ -1397,6 +1400,9 @@ pages + file cache, not including locked pages and HugePages), the number of
 pages at which a process which is generating disk writes will itself start
 writing out dirty data.
 
+This ratio can be superseded for memory belonging to a cpuset; see
+Documentation/cpusets.txt for more information.
+
 dirty_writeback_centisecs
 -------------------------
 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-10-30 19:23 [patch 0/7] cpuset writeback throttling David Rientjes
                   ` (6 preceding siblings ...)
  2008-10-30 19:23 ` [patch 7/7] cpusets: update documentation for writeback throttling David Rientjes
@ 2008-10-30 21:08 ` Dave Chinner
  2008-10-30 21:33   ` Christoph Lameter
  2008-11-04 20:47 ` Andrew Morton
  8 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2008-10-30 21:08 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Christoph Lameter, Nick Piggin, Peter Zijlstra,
	Paul Menage, Derek Fults, linux-kernel

On Thu, Oct 30, 2008 at 12:23:10PM -0700, David Rientjes wrote:
> Andrew,
> 
> This is the revised cpuset writeback throttling patchset posted to LKML 
> on Tuesday, October 27.
> 
> The comments from Peter Zijlstra have been addressed.  His concurrent 
> page cache patchset is not currently in -mm, so we can still serialize 
> updating a struct address_space's dirty_nodes on its tree_lock.  When his 
> patchset is merged, the patch at the end of this message can be used to 
> introduce the necessary synchronization.
> 
> This patchset applies nicely to 2.6.28-rc2-mm1 with the exception of the 
> first patch due to the alloc_inode() refactoring to inode_init_always() in
> e9110864c440736beb484c2c74dedc307168b14e from linux-next and additions to 
> include/linux/cpuset.h from 
> oom-print-triggering-tasks-cpuset-and-mems-allowed.patch (oops :).
> 
> Please consider this for inclusion in the -mm tree.
> 
> A simple way of testing this change is to create a large file that exceeds 
> the amount of memory allocated to a specific cpuset.  Then, mmap and 
> modify the large file (such as in the following program) while running a 
> latency sensitive task in a disjoint cpuset.  Notice the writeout 
> throttling that doesn't interfere with the latency sensitive task.

What sort of validation/regression testing has this been through?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-10-30 21:08 ` [patch 0/7] cpuset " Dave Chinner
@ 2008-10-30 21:33   ` Christoph Lameter
  2008-10-30 22:03     ` Dave Chinner
  0 siblings, 1 reply; 46+ messages in thread
From: Christoph Lameter @ 2008-10-30 21:33 UTC (permalink / raw)
  To: Dave Chinner
  Cc: David Rientjes, Andrew Morton, Nick Piggin, Peter Zijlstra,
	Paul Menage, Derek Fults, linux-kernel

On Fri, 31 Oct 2008, Dave Chinner wrote:

> What sort of validation/regression testing has this been through?

There were several tests run more than a year ago by me on large SGI 
machines. Then there was Solomita who did various tests and posts of 
the patchset over the last year.



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-10-30 21:33   ` Christoph Lameter
@ 2008-10-30 22:03     ` Dave Chinner
  2008-10-31 13:47       ` Christoph Lameter
  2008-10-31 16:36       ` David Rientjes
  0 siblings, 2 replies; 46+ messages in thread
From: Dave Chinner @ 2008-10-30 22:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Rientjes, Andrew Morton, Nick Piggin, Peter Zijlstra,
	Paul Menage, Derek Fults, linux-kernel

On Thu, Oct 30, 2008 at 04:33:34PM -0500, Christoph Lameter wrote:
> On Fri, 31 Oct 2008, Dave Chinner wrote:
>
>> What sort of validation/regression testing has this been through?
>
> There were several tests run more than a year ago by me on large SGI  
> machines. Then there was Solomita who did various tests and posts of the 
> patchset over the last year.

Yes, I know about those tests at SGI - they were to demonstrate the
feature worked (i.e. proof-of-concept demonstration), not regression
test the change.

This is a fairly major change in behaviour to the writeback path on
NUMA systems and so has the potential to introduce subtle new
issues. Hence I'm asking about the level of testing and exposure
it has had. It doesn't really sound like it has had much coverage
to me....

I'm concerned right now because Nick and others (including myself)
have recently found lots of nasty data integrity problems in the
writeback path that we are currently trying to test fixes for.
It's not a good time to introduce new behaviours as that will
definitely perturb any regression testing we are running....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-10-30 22:03     ` Dave Chinner
@ 2008-10-31 13:47       ` Christoph Lameter
  2008-10-31 16:36       ` David Rientjes
  1 sibling, 0 replies; 46+ messages in thread
From: Christoph Lameter @ 2008-10-31 13:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: David Rientjes, Andrew Morton, Nick Piggin, Peter Zijlstra,
	Paul Menage, Derek Fults, linux-kernel

On Fri, 31 Oct 2008, Dave Chinner wrote:

> Yes, I know about those tests at SGI - they were to demonstrate the
> feature worked (i.e. proof-of-concept demonstration), not regression
> test the change.

I did some regression tests at that time before publishing the initial 
patchsets. That was mainly using standard tests (AIM7). There have been 
some changes since then though.

> This is a fairly major change in behaviour to the writeback path on
> NUMA systems and so has the potential to introduce subtle new
> issues. Hence I'm asking about the level of testing and exposure
> it has had. It doesn't really sound like it has had much coverage
> to me....

Sure we need more testing. Solomita did some testing as evident from his 
messages. See f.e. 
http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-09/msg02939.html

> I'm concerned right now because Nick and others (including myself)
> have recently found lots of nasty data integrity problems in the
> writeback path that we are currently trying to test fixes for.
> It's not a good time to introduce new behaviours as that will
> definitely perturb any regression testing we are running....

The writeback throttling that is addressed in the patchset mainly changes 
the calculation of the limits that determine when writeback is started and 
select inodes based on their page allocation on certain nodes. That is 
relatively independent of the locking scheme for writeback.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-10-30 22:03     ` Dave Chinner
  2008-10-31 13:47       ` Christoph Lameter
@ 2008-10-31 16:36       ` David Rientjes
  1 sibling, 0 replies; 46+ messages in thread
From: David Rientjes @ 2008-10-31 16:36 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Lameter, Andrew Morton, Nick Piggin, Peter Zijlstra,
	Paul Menage, Derek Fults, linux-kernel

On Fri, 31 Oct 2008, Dave Chinner wrote:

> This is a fairly major change in behaviour to the writeback path on
> NUMA systems and so has the potential to introduce subtle new
> issues. Hence I'm asking about the level of testing and exposure
> it has had. It doesn't really sound like it has had much coverage
> to me....
> 

I've regression tested it to ensure that there is no significant 
degradation for unconstrained (non-cpuset) configurations, which is 
identical to single system-wide cpuset configurations with ratios that 
match the global sysctls.

So there should be no noticeable difference for non-cpuset users or users 
who attach all system tasks to the root cpuset with this new feature.

To simulate large NUMA systems, numa=fake=32 was used on my small 4G 
machine for 32 system nodes.  A 4G file was created from /dev/zero and a 
non-iterative version of my aforementioned test program (see the end of 
this message) was used to mmap(), dirty, and msync().  This was done with 
Linus' latest -git both with and without this patchset.

Although only 32 nodes were online, these kernels both had a 
CONFIG_NODES_SHIFT of 8 to use the patchset's slowpath (the kmalloc() for 
a struct address_space's dirty_nodes) since MAX_NUMNODES > BITS_PER_LONG.

Without this patchset (five iterations):

	real	2m43.824s   2m41.842s   2m44.103s   2m42.176s   2m43.692s
	user	0m18.946s   0m18.875s   0m18.747s   0m18.969s   0m19.091s
	sys	0m7.722s    0m7.817s    0m7.618s    0m7.635s    0m7.450s

	Device:        tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
	sda         376.68        18.60        17.90       4269       4108
	sda         367.37        18.13        17.45       4269       4108
	sda         388.59        19.19        18.46       4269       4107
	sda         399.44        19.73        18.98       4269       4106
	sda         407.15        20.08        19.33       4271       4110

With this patchset (five iterations):

	real	2m46.337s   2m44.321s   2m46.406s   2m43.237s   2m47.165s
	user	0m19.014s   0m18.985s   0m19.146s   0m18.778s   0m18.937s
	sys	0m7.008s    0m7.180s    0m7.067s    0m7.116s    0m7.188s

	Device:        tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
	sda         399.10        19.71        18.97       4269       4109
	sda         396.85        19.55        18.78       4272       4111
	sda         404.72        19.96        19.23       4270       4112
	sda         403.51        19.93        19.18       4269       4108
	sda         396.33        19.54        18.80       4276       4114

> I'm concerned right now because Nick and others (including myself)
> have recently found lots of nasty data integrity problems in the
> writeback path that we are currently trying to test fixes for.
> It's not a good time to introduce new behaviours as that will
> definitely perturb any regression testing we are running....
> 

Do we need a development freeze of the writeback path for features that 
would be targeted to 2.6.29 at the earliest?

Nasty data integrity problems seem like something that would be addressed 
during the -rc stage.

		David
---
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
	void *addr;
	unsigned long length;
	unsigned long i;
	int fd;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <filename> <length>\n",
			argv[0]);
		exit(1);
	}

	fd = open(argv[1], O_RDWR, 0644);
	if (fd < 0) {
		fprintf(stderr, "Cannot open file %s\n", argv[1]);
		exit(1);
	}

	length = strtoul(argv[2], NULL, 0);
	if (!length) {
		fprintf(stderr, "Invalid length %s\n", argv[2]);
		exit(1);
	}

	addr = mmap(0, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
		    0);
	if (addr == MAP_FAILED) {
		fprintf(stderr, "mmap() failed\n");
		exit(1);
	}

	for (i = 0; i < length; i++)
		(*(char *)(addr + i))++;
	msync(addr, length, MS_SYNC);
	return 0;
}

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-10-30 19:23 [patch 0/7] cpuset writeback throttling David Rientjes
                   ` (7 preceding siblings ...)
  2008-10-30 21:08 ` [patch 0/7] cpuset " Dave Chinner
@ 2008-11-04 20:47 ` Andrew Morton
  2008-11-04 20:53   ` Peter Zijlstra
  8 siblings, 1 reply; 46+ messages in thread
From: Andrew Morton @ 2008-11-04 20:47 UTC (permalink / raw)
  To: David Rientjes; +Cc: cl, npiggin, peterz, menage, dfults, linux-kernel

On Thu, 30 Oct 2008 12:23:10 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> This is the revised cpuset writeback throttling patchset

I'm all confused about why this is a cpuset thing rather than a cgroups
thing.  What are the relationships here?

I mean, writeback throttling _should_ operate upon a control group
(more specifically: a memcg), yes?  I guess you're assuming a 1:1
relationship here?


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-04 20:47 ` Andrew Morton
@ 2008-11-04 20:53   ` Peter Zijlstra
  2008-11-04 20:58     ` Christoph Lameter
                       ` (2 more replies)
  0 siblings, 3 replies; 46+ messages in thread
From: Peter Zijlstra @ 2008-11-04 20:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: David Rientjes, cl, npiggin, menage, dfults, linux-kernel

On Tue, 2008-11-04 at 12:47 -0800, Andrew Morton wrote:
> On Thu, 30 Oct 2008 12:23:10 -0700 (PDT)
> David Rientjes <rientjes@google.com> wrote:
> 
> > This is the revised cpuset writeback throttling patchset
> 
> I'm all confused about why this is a cpuset thing rather than a cgroups
> thing.  What are the relationships here?
> 
> I mean, writeback throttling _should_ operate upon a control group
> (more specifically: a memcg), yes?  I guess you're assuming a 1:1
> relationship here?

I think the main reason is that we have per-node vmstats so the cpuset
extention is relatively easy. Whereas we do not currently maintain
vmstats on a cgroup level - although I imagine that could be remedied.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-04 20:53   ` Peter Zijlstra
@ 2008-11-04 20:58     ` Christoph Lameter
  2008-11-04 21:10     ` David Rientjes
  2008-11-04 21:16     ` Andrew Morton
  2 siblings, 0 replies; 46+ messages in thread
From: Christoph Lameter @ 2008-11-04 20:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, David Rientjes, npiggin, menage, dfults, linux-kernel

On Tue, 4 Nov 2008, Peter Zijlstra wrote:

> I think the main reason is that we have per-node vmstats so the cpuset
> extention is relatively easy. Whereas we do not currently maintain
> vmstats on a cgroup level - although I imagine that could be remedied.

We have a separate LRU for the control groups so we are on the way 
there.



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 1/7] cpusets: add dirty map to struct address_space
  2008-10-30 19:23 ` [patch 1/7] cpusets: add dirty map to struct address_space David Rientjes
@ 2008-11-04 21:09   ` Andrew Morton
  2008-11-04 21:20     ` Christoph Lameter
  0 siblings, 1 reply; 46+ messages in thread
From: Andrew Morton @ 2008-11-04 21:09 UTC (permalink / raw)
  To: David Rientjes; +Cc: cl, npiggin, peterz, menage, dfults, linux-kernel

On Thu, 30 Oct 2008 12:23:11 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> From: Christoph Lameter <cl@linux-foundation.org>
> 
> In a NUMA system it is helpful to know where the dirty pages of a mapping
> are located.  That way we will be able to implement writeout for
> applications that are constrained to a portion of the memory of the
> system as required by cpusets.
> 
> This patch implements the management of dirty node maps for an address
> space through the following functions:
> 
> cpuset_clear_dirty_nodes(mapping)	Clear the map of dirty nodes
> 
> cpuset_update_nodes(mapping, page)	Record a node in the dirty nodes
> 					map
> 
> cpuset_init_dirty_nodes(mapping)	Initialization of the map
> 
> 
> The dirty map may be stored either directly in the mapping (for NUMA
> systems with less then BITS_PER_LONG nodes) or separately allocated for
> systems with a large number of nodes (f.e. ia64 with 1024 nodes).
> 
> Updating the dirty map may involve allocating it first for large
> configurations.  Therefore, we protect the allocation and setting of a
> node in the map through the tree_lock.  The tree_lock is already taken
> when a page is dirtied so there is no additional locking overhead if we
> insert the updating of the nodemask there.
> 
> The dirty map is only cleared (or freed) when the inode is cleared.  At
> that point no pages are attached to the inode anymore and therefore it
> can be done without any locking. The dirty map therefore records all nodes
> that have been used for dirty pages by that inode until the inode is no
> longer used.
> 
> If the mapping for an inode's set of dirty nodes does not intersect the
> set under writeback, the inode is requeued for later.  This is the
> mechanism for enforcing the writeback control's set of nodes policy.
> 
> ...
> 
> +/*
> + * We need macros since struct address_space is not defined yet

sigh.  Thanks for adding the comment.

> + */
> +#if MAX_NUMNODES <= BITS_PER_LONG
> +#define cpuset_init_dirty_nodes(__mapping)				\
> +	(__mapping)->dirty_nodes = NODE_MASK_NONE
> +
> +#define cpuset_update_dirty_nodes(__mapping, __page)			\
> +	node_set(page_to_nid(__page), (__mapping)->dirty_nodes);
> +
> +#define cpuset_clear_dirty_nodes(__mapping)				\
> +	(__mapping)->dirty_nodes = NODE_MASK_NONE
> +
> +#define cpuset_intersects_dirty_nodes(__mapping, __nodemask_ptr)	\
> +	(!(__nodemask_ptr) ||						\
> +	 nodes_intersects((__mapping)->dirty_nodes, *(__nodemask_ptr)))

This macro can evaluate __nodemask_ptr one or two times, possibly
causing code bloat, performance loss or bugs.  Copying it into a local
would fix that.

> +#else
> +struct address_space;

When forward-declaring structures like this I'd suggest that the
declaration be put right at the top of the header file, outside ifdefs.

Because putting it inside an ifdef is just inviting compilation errors
later on when someone assumes that address_space was declared and
didn't test all config options.

And put it at the top of the file because some unobservant person might
later add more code to the header ahead of the address_space
declaration which needs address_space.  So they go and add a second
forward declaration of the same thing.  We've had that happen before.

> +#define cpuset_init_dirty_nodes(__mapping)				\
> +	(__mapping)->dirty_nodes = NULL

I think there are ways in which this sort of macro can cause
compilation errors or even bugs.  I forget all the scenarios, but we do
know that whapping a `do { ...  } while (0)' around it fixes them.

> +extern void cpuset_update_dirty_nodes(struct address_space *mapping,
> +				      struct page *page);
> +extern void cpuset_clear_dirty_nodes(struct address_space *mapping);
> +extern int cpuset_intersects_dirty_nodes(struct address_space *mapping,
> +					 nodemask_t *mask);
> +#endif
> +
>  #else /* !CONFIG_CPUSETS */
>  
>  static inline int cpuset_init_early(void) { return 0; }
> @@ -163,6 +193,27 @@ static inline void rebuild_sched_domains(void)
>  	partition_sched_domains(1, NULL, NULL);
>  }
>  
> +struct address_space;
> +
> +static inline void cpuset_init_dirty_nodes(struct address_space *mapping)
> +{
> +}
> +
> +static inline void cpuset_update_dirty_nodes(struct address_space *mapping,
> +					     struct page *page)
> +{
> +}

OK, what's ging on here?  I count _three_ definitions of
cpuset_update_dirty_nodes().  One is a macro, one is a static inline
stub, one is an out-of-line C function.

What's up with that?  Maybe `#if MAX_NUMNODES > BITS_PER_LONG' added
the third leg?

> +static inline void cpuset_clear_dirty_nodes(struct address_space *mapping)
> +{
> +}
> +
> +static inline int cpuset_intersects_dirty_nodes(struct address_space *mapping,
> +						nodemask_t *mask)
> +{
> +	return 1;
> +}
> +
>  #endif /* !CONFIG_CPUSETS */
>  
>  #endif /* _LINUX_CPUSET_H */
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -544,6 +544,13 @@ struct address_space {
>  	spinlock_t		private_lock;	/* for use by the address_space */
>  	struct list_head	private_list;	/* ditto */
>  	struct address_space	*assoc_mapping;	/* ditto */
> +#ifdef CONFIG_CPUSETS
> +#if MAX_NUMNODES <= BITS_PER_LONG
> +	nodemask_t		dirty_nodes;	/* nodes with dirty pages */
> +#else
> +	nodemask_t		*dirty_nodes;	/* pointer to mask, if dirty */
> +#endif
> +#endif
>  } __attribute__((aligned(sizeof(long))));

eek.  Increasing the size of the address_space (and hence of the inode)
is a moderately big deal - there can be millions of these in memory.

Which gets me thinking...

Many inodes don't have any pagecache.  ext3 directories, anything which
was read via stat(), for a start.  Would it be worthwhile optimising
for this?

Options would include

- remove inode.i_data and dynamically allocate it.

- move your dirty_nodes thing into radix_tree_node, dynamically
  allocate address_space.page_tree (and tree_lock).

- etc.

There's a little project for someone to be going on with.


>  	/*
>  	 * On most architectures that alignment is already the case; but
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -72,6 +72,8 @@ struct writeback_control {
>  	 * so we use a single control to update them
>  	 */
>  	unsigned no_nrwrite_index_update:1;
> +
> +	nodemask_t *nodes;		/* Nodemask to writeback */

This one doesn't get ifdefs?

>  };
>  
>  /*
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -16,6 +16,7 @@
>   *  2006 Rework by Paul Menage to use generic cgroups
>   *  2008 Rework of the scheduler domains and CPU hotplug handling
>   *       by Max Krasnyansky
> + *  2008 Cpuset writeback by Christoph Lameter
>   *
>   *  This file is subject to the terms and conditions of the GNU General Public
>   *  License.  See the file COPYING in the main directory of the Linux
> @@ -2323,6 +2324,58 @@ int cpuset_mem_spread_node(void)
>  }
>  EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
>  
> +#if MAX_NUMNODES > BITS_PER_LONG
> +/*
> + * Special functions for NUMA systems with a large number of nodes.  The
> + * nodemask is pointed to from the address_space structure.  The attachment of
> + * the dirty_nodes nodemask is protected by the tree_lock.  The nodemask is
> + * freed only when the inode is cleared (and therefore unused, thus no locking
> + * is necessary).
> + */
> +void cpuset_update_dirty_nodes(struct address_space *mapping,
> +			       struct page *page)
> +{
> +	nodemask_t *nodes = mapping->dirty_nodes;
> +	int node = page_to_nid(page);
> +
> +	if (!nodes) {
> +		nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);

erk, OK, called from __set_page_dirty, needs to be atomic.

What are the consequences when this allocation fails?

I prefer `foo = kmalloc(sizeof(*foo), ...)', personally.  It avoids
duplicating information and means that PoorOldReviewer doesn't have to
crawl back through the code checking that NastyYoungCoder got the type
right.



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-04 20:53   ` Peter Zijlstra
  2008-11-04 20:58     ` Christoph Lameter
@ 2008-11-04 21:10     ` David Rientjes
  2008-11-04 21:16     ` Andrew Morton
  2 siblings, 0 replies; 46+ messages in thread
From: David Rientjes @ 2008-11-04 21:10 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andrew Morton, cl, npiggin, menage, dfults, linux-kernel

On Tue, 4 Nov 2008, Peter Zijlstra wrote:

> I think the main reason is that we have per-node vmstats so the cpuset
> extention is relatively easy. Whereas we do not currently maintain
> vmstats on a cgroup level - although I imagine that could be remedied.
> 

That's correct, the ZVC values from each node assigned to a particular 
cpuset is used to determine the dirty ratios.  We don't currently have the 
infrastructure to do that with any other cgroup.

So extracting this out to a separate cgroup is fine, but it's still only 
applicable to cpusets at this time.  Let me know what the prerequisites 
are for -mm inclusion.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-04 20:53   ` Peter Zijlstra
  2008-11-04 20:58     ` Christoph Lameter
  2008-11-04 21:10     ` David Rientjes
@ 2008-11-04 21:16     ` Andrew Morton
  2008-11-04 21:21       ` Peter Zijlstra
  2 siblings, 1 reply; 46+ messages in thread
From: Andrew Morton @ 2008-11-04 21:16 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: rientjes, cl, npiggin, menage, dfults, linux-kernel

On Tue, 04 Nov 2008 21:53:08 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, 2008-11-04 at 12:47 -0800, Andrew Morton wrote:
> > On Thu, 30 Oct 2008 12:23:10 -0700 (PDT)
> > David Rientjes <rientjes@google.com> wrote:
> > 
> > > This is the revised cpuset writeback throttling patchset
> > 
> > I'm all confused about why this is a cpuset thing rather than a cgroups
> > thing.  What are the relationships here?
> > 
> > I mean, writeback throttling _should_ operate upon a control group
> > (more specifically: a memcg), yes?  I guess you're assuming a 1:1
> > relationship here?
> 
> I think the main reason is that we have per-node vmstats so the cpuset
> extention is relatively easy. Whereas we do not currently maintain
> vmstats on a cgroup level - although I imagine that could be remedied.

It didn't look easy to me - it added a lot more code in places which are
already wicked complex.

I'm trying to understand where this is all coming from and what fits
into where.  Fiddling with a cpuset's mems_allowed for purposes of
memory partitioning is all nasty 2007 technology, isn't it?  Does a raw
cpuset-based control such as this have a future?



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 1/7] cpusets: add dirty map to struct address_space
  2008-11-04 21:09   ` Andrew Morton
@ 2008-11-04 21:20     ` Christoph Lameter
  2008-11-04 21:42       ` Andrew Morton
  0 siblings, 1 reply; 46+ messages in thread
From: Christoph Lameter @ 2008-11-04 21:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, npiggin, peterz, menage, dfults, linux-kernel

On Tue, 4 Nov 2008, Andrew Morton wrote:

>> +#ifdef CONFIG_CPUSETS
>> +#if MAX_NUMNODES <= BITS_PER_LONG
>> +	nodemask_t		dirty_nodes;	/* nodes with dirty pages */
>> +#else
>> +	nodemask_t		*dirty_nodes;	/* pointer to mask, if dirty */
>> +#endif
>> +#endif
>>  } __attribute__((aligned(sizeof(long))));
>
> eek.  Increasing the size of the address_space (and hence of the inode)
> is a moderately big deal - there can be millions of these in memory.

Well this is adding only a single word to the inode structure.

>> @@ -72,6 +72,8 @@ struct writeback_control {
>>  	 * so we use a single control to update them
>>  	 */
>>  	unsigned no_nrwrite_index_update:1;
>> +
>> +	nodemask_t *nodes;		/* Nodemask to writeback */
>
> This one doesn't get ifdefs?

The structure is typically allocated temporarily on the stack.

>> +	nodemask_t *nodes = mapping->dirty_nodes;
>> +	int node = page_to_nid(page);
>> +
>> +	if (!nodes) {
>> +		nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);
>
> erk, OK, called from __set_page_dirty, needs to be atomic.
>
> What are the consequences when this allocation fails?

Dirty tracking will not occur. All nodes are assumed to be dirty.

We discussed this earlier 
http://www.ussg.iu.edu/hypermail/linux/kernel/0709.1/2291.html


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-04 21:16     ` Andrew Morton
@ 2008-11-04 21:21       ` Peter Zijlstra
  2008-11-04 21:50         ` Andrew Morton
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2008-11-04 21:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: rientjes, cl, npiggin, menage, dfults, linux-kernel

On Tue, 2008-11-04 at 13:16 -0800, Andrew Morton wrote:
> On Tue, 04 Nov 2008 21:53:08 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Tue, 2008-11-04 at 12:47 -0800, Andrew Morton wrote:
> > > On Thu, 30 Oct 2008 12:23:10 -0700 (PDT)
> > > David Rientjes <rientjes@google.com> wrote:
> > > 
> > > > This is the revised cpuset writeback throttling patchset
> > > 
> > > I'm all confused about why this is a cpuset thing rather than a cgroups
> > > thing.  What are the relationships here?
> > > 
> > > I mean, writeback throttling _should_ operate upon a control group
> > > (more specifically: a memcg), yes?  I guess you're assuming a 1:1
> > > relationship here?
> > 
> > I think the main reason is that we have per-node vmstats so the cpuset
> > extention is relatively easy. Whereas we do not currently maintain
> > vmstats on a cgroup level - although I imagine that could be remedied.
> 
> It didn't look easy to me - it added a lot more code in places which are
> already wicked complex.
> 
> I'm trying to understand where this is all coming from and what fits
> into where.  Fiddling with a cpuset's mems_allowed for purposes of
> memory partitioning is all nasty 2007 technology, isn't it?  Does a raw
> cpuset-based control such as this have a future?

Yes, cpusets are making a come-back on the embedded multi-core Real-Time
side. Folks love to isolate stuff..

Not saying I really like it...

Also, there seems to be talk about node aware pdflush from the
filesystems folks, not sure we need cpusets for that, but this does seem
to add some node information into it.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 1/7] cpusets: add dirty map to struct address_space
  2008-11-04 21:20     ` Christoph Lameter
@ 2008-11-04 21:42       ` Andrew Morton
  0 siblings, 0 replies; 46+ messages in thread
From: Andrew Morton @ 2008-11-04 21:42 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: rientjes, npiggin, peterz, menage, dfults, linux-kernel

On Tue, 4 Nov 2008 15:20:56 -0600 (CST)
Christoph Lameter <cl@linux-foundation.org> wrote:

> On Tue, 4 Nov 2008, Andrew Morton wrote:
> 
> >> +#ifdef CONFIG_CPUSETS
> >> +#if MAX_NUMNODES <= BITS_PER_LONG
> >> +	nodemask_t		dirty_nodes;	/* nodes with dirty pages */
> >> +#else
> >> +	nodemask_t		*dirty_nodes;	/* pointer to mask, if dirty */
> >> +#endif
> >> +#endif
> >>  } __attribute__((aligned(sizeof(long))));
> >
> > eek.  Increasing the size of the address_space (and hence of the inode)
> > is a moderately big deal - there can be millions of these in memory.
> 
> Well this is adding only a single word to the inode structure.

multiplied by millions.

> >> @@ -72,6 +72,8 @@ struct writeback_control {
> >>  	 * so we use a single control to update them
> >>  	 */
> >>  	unsigned no_nrwrite_index_update:1;
> >> +
> >> +	nodemask_t *nodes;		/* Nodemask to writeback */
> >
> > This one doesn't get ifdefs?
> 
> The structure is typically allocated temporarily on the stack.

An ifdef also helps catch the presence of unnecesary code.

> >> +	nodemask_t *nodes = mapping->dirty_nodes;
> >> +	int node = page_to_nid(page);
> >> +
> >> +	if (!nodes) {
> >> +		nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);
> >
> > erk, OK, called from __set_page_dirty, needs to be atomic.
> >
> > What are the consequences when this allocation fails?
> 
> Dirty tracking will not occur. All nodes are assumed to be dirty.

It won't oops?

> We discussed this earlier 
> http://www.ussg.iu.edu/hypermail/linux/kernel/0709.1/2291.html

Well, if it isn't in the changelog and isn't in code comments, we get
to discuss it again.

A great amount of mailing list discussion is a Huge Honking Big Fat
Hint that the original code was insufficently understandable.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-04 21:21       ` Peter Zijlstra
@ 2008-11-04 21:50         ` Andrew Morton
  2008-11-04 22:17           ` Christoph Lameter
  0 siblings, 1 reply; 46+ messages in thread
From: Andrew Morton @ 2008-11-04 21:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rientjes, cl, npiggin, menage, dfults, linux-kernel, containers

On Tue, 04 Nov 2008 22:21:50 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, 2008-11-04 at 13:16 -0800, Andrew Morton wrote:
> > On Tue, 04 Nov 2008 21:53:08 +0100
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > On Tue, 2008-11-04 at 12:47 -0800, Andrew Morton wrote:
> > > > On Thu, 30 Oct 2008 12:23:10 -0700 (PDT)
> > > > David Rientjes <rientjes@google.com> wrote:
> > > > 
> > > > > This is the revised cpuset writeback throttling patchset
> > > > 
> > > > I'm all confused about why this is a cpuset thing rather than a cgroups
> > > > thing.  What are the relationships here?
> > > > 
> > > > I mean, writeback throttling _should_ operate upon a control group
> > > > (more specifically: a memcg), yes?  I guess you're assuming a 1:1
> > > > relationship here?
> > > 
> > > I think the main reason is that we have per-node vmstats so the cpuset
> > > extention is relatively easy. Whereas we do not currently maintain
> > > vmstats on a cgroup level - although I imagine that could be remedied.
> > 
> > It didn't look easy to me - it added a lot more code in places which are
> > already wicked complex.
> > 
> > I'm trying to understand where this is all coming from and what fits
> > into where.  Fiddling with a cpuset's mems_allowed for purposes of
> > memory partitioning is all nasty 2007 technology, isn't it?  Does a raw
> > cpuset-based control such as this have a future?
> 
> Yes, cpusets are making a come-back on the embedded multi-core Real-Time
> side. Folks love to isolate stuff..
> 
> Not saying I really like it...
> 
> Also, there seems to be talk about node aware pdflush from the
> filesystems folks, not sure we need cpusets for that, but this does seem
> to add some node information into it.

Sorry, but I'm not seeing enough solid justification here for merging a
fairly large amount of fairly tricksy code into core kernel.  Code
which, afaict, is heading in the opposite direction from where we've
all been going for a year or two.

What are the alternatives here?  What do we need to do to make
throttling a per-memcg thing?


The patchset is badly misnamed, btw.  It doesn't throttle writeback -
in fact several people are working on IO bandwidth controllers and
calling this thing "writeback throttling" risks confusion.

What we're in fact throttling is rate-of-memory-dirtying.  The last
thing we want to throttle is writeback - we want it to go as fast as
possible!

Only I can't think of a suitable handy-dandy moniker for this concept.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-04 21:50         ` Andrew Morton
@ 2008-11-04 22:17           ` Christoph Lameter
  2008-11-04 22:35             ` Andrew Morton
  0 siblings, 1 reply; 46+ messages in thread
From: Christoph Lameter @ 2008-11-04 22:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, rientjes, npiggin, menage, dfults, linux-kernel,
	containers

On Tue, 4 Nov 2008, Andrew Morton wrote:

> What are the alternatives here?  What do we need to do to make
> throttling a per-memcg thing?

Add statistics to the memcg lru and then you need some kind of sets of 
memcgs that are represented by bitmaps or so attached to an inode.

> The patchset is badly misnamed, btw.  It doesn't throttle writeback -
> in fact several people are working on IO bandwidth controllers and
> calling this thing "writeback throttling" risks confusion.

It is limiting dirty pages and throttling the dirty rate of applications 
in a  NUMA system (same procedure as we do in non NUMA). The excessive 
dirtying without this patchset can cause OOMs to occur on NUMA systems.

> What we're in fact throttling is rate-of-memory-dirtying.  The last
> thing we want to throttle is writeback - we want it to go as fast as
> possible!

We want to limit the amount of dirty pages such that I/O is progressing 
with optimal speeds for an application that significantly dirties memory. 
Other processes need to be still able to do their work without being 
swapped out because of excessive memory use for dirty pages by the first 
process.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-04 22:17           ` Christoph Lameter
@ 2008-11-04 22:35             ` Andrew Morton
  2008-11-04 22:52               ` Christoph Lameter
  0 siblings, 1 reply; 46+ messages in thread
From: Andrew Morton @ 2008-11-04 22:35 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: peterz, rientjes, npiggin, menage, dfults, linux-kernel, containers

On Tue, 4 Nov 2008 16:17:52 -0600 (CST)
Christoph Lameter <cl@linux-foundation.org> wrote:

> On Tue, 4 Nov 2008, Andrew Morton wrote:
> 
> > What are the alternatives here?  What do we need to do to make
> > throttling a per-memcg thing?
> 
> Add statistics to the memcg lru and then you need some kind of sets of 
> memcgs that are represented by bitmaps or so attached to an inode.
> 
> > The patchset is badly misnamed, btw.  It doesn't throttle writeback -
> > in fact several people are working on IO bandwidth controllers and
> > calling this thing "writeback throttling" risks confusion.
> 
> It is limiting dirty pages and throttling the dirty rate of applications 
> in a  NUMA system (same procedure as we do in non NUMA). The excessive 
> dirtying without this patchset can cause OOMs to occur on NUMA systems.

yup.

To fix this with a memcg-based throttling, the operator would need to
be able to create memcg's which have pages only from particular nodes. 
(That's a bit indirect relative to what they want to do, but is
presumably workable).

But do we even have that capability now?


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-04 22:35             ` Andrew Morton
@ 2008-11-04 22:52               ` Christoph Lameter
  2008-11-04 23:36                 ` Andrew Morton
  0 siblings, 1 reply; 46+ messages in thread
From: Christoph Lameter @ 2008-11-04 22:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: peterz, rientjes, npiggin, menage, dfults, linux-kernel, containers

On Tue, 4 Nov 2008, Andrew Morton wrote:

> To fix this with a memcg-based throttling, the operator would need to
> be able to create memcg's which have pages only from particular nodes.
> (That's a bit indirect relative to what they want to do, but is
> presumably workable).

The system would need to have the capability to find the memcg groups that 
have dirty pages for a certain inode. Files are not constrained to nodes 
or memcg groups.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-04 22:52               ` Christoph Lameter
@ 2008-11-04 23:36                 ` Andrew Morton
  2008-11-05  1:31                   ` KAMEZAWA Hiroyuki
  2008-11-05  2:45                   ` Christoph Lameter
  0 siblings, 2 replies; 46+ messages in thread
From: Andrew Morton @ 2008-11-04 23:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: peterz, rientjes, npiggin, menage, dfults, linux-kernel, containers

On Tue, 4 Nov 2008 16:52:48 -0600 (CST)
Christoph Lameter <cl@linux-foundation.org> wrote:

> On Tue, 4 Nov 2008, Andrew Morton wrote:
> 
> > To fix this with a memcg-based throttling, the operator would need to
> > be able to create memcg's which have pages only from particular nodes.
> > (That's a bit indirect relative to what they want to do, but is
> > presumably workable).
> 
> The system would need to have the capability to find the memcg groups that 
> have dirty pages for a certain inode. Files are not constrained to nodes 
> or memcg groups.

Ah, we're talking about different things.

In a memcg implementation what we would implement is "throttle
page-dirtying tasks in this memcg when the memcg's dirty memory reaches
40% of its total".

But that doesn't solve the problem which this patchset is trying to
solve, which is "don't let all the memory in all this group of nodes
get dirty".


Yes?  Someone help me out here.  I don't yet have my head around the
overlaps and incompatibilities here.  Perhaps the containers guys will
wake up and put their thinking caps on?



What happens if cpuset A uses nodes 0,1,2,3,4,5,6,7,8,9 and cpuset B
uses nodes 0,1?  Can activity in cpuset A cause ooms in cpuset B?


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-04 23:36                 ` Andrew Morton
@ 2008-11-05  1:31                   ` KAMEZAWA Hiroyuki
  2008-11-05  3:09                     ` Andrew Morton
  2008-11-05  2:45                   ` Christoph Lameter
  1 sibling, 1 reply; 46+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-05  1:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, npiggin, dfults, linux-kernel, rientjes,
	containers, menage

On Tue, 4 Nov 2008 15:36:10 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 4 Nov 2008 16:52:48 -0600 (CST)
> Christoph Lameter <cl@linux-foundation.org> wrote:
> 
> > On Tue, 4 Nov 2008, Andrew Morton wrote:
> > 
> > > To fix this with a memcg-based throttling, the operator would need to
> > > be able to create memcg's which have pages only from particular nodes.
> > > (That's a bit indirect relative to what they want to do, but is
> > > presumably workable).
> > 
> > The system would need to have the capability to find the memcg groups that 
> > have dirty pages for a certain inode. Files are not constrained to nodes 
> > or memcg groups.
> 
> Ah, we're talking about different things.
> 
> In a memcg implementation what we would implement is "throttle
> page-dirtying tasks in this memcg when the memcg's dirty memory reaches
> 40% of its total".
> 
yes. Andrea posted that.


> But that doesn't solve the problem which this patchset is trying to
> solve, which is "don't let all the memory in all this group of nodes
> get dirty".
> 
yes. but this patch doesn't help the case you mentioned below.

> 
> Yes?  Someone help me out here.  I don't yet have my head around the
> overlaps and incompatibilities here.  Perhaps the containers guys will
> wake up and put their thinking caps on?
> 
> 
> 
> What happens if cpuset A uses nodes 0,1,2,3,4,5,6,7,8,9 and cpuset B
> uses nodes 0,1?  Can activity in cpuset A cause ooms in cpuset B?
> 
For help this, per-node-dirty-ratio-throttoling is necessary.

Shouldn't we just have a new parameter as /proc/sys/vm/dirty_ratio_per_node.

/proc/sys/vm/dirty_ratio works for throttling the whole system dirty pages.
/proc/sys/vm/dirty_ratio_per_node works for throttling dirty pages in a node.

Implementation will not be difficult and works enough against OOM.

Thanks,
-Kame









^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-04 23:36                 ` Andrew Morton
  2008-11-05  1:31                   ` KAMEZAWA Hiroyuki
@ 2008-11-05  2:45                   ` Christoph Lameter
  2008-11-05  3:05                     ` Andrew Morton
  1 sibling, 1 reply; 46+ messages in thread
From: Christoph Lameter @ 2008-11-05  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: peterz, rientjes, npiggin, menage, dfults, linux-kernel, containers

On Tue, 4 Nov 2008, Andrew Morton wrote:

> In a memcg implementation what we would implement is "throttle
> page-dirtying tasks in this memcg when the memcg's dirty memory reaches
> 40% of its total".

Right that is similar to what this patch does for cpusets. A memcg 
implementation would need to figure out if we are currently part of a 
memcg and then determine the percentage of memory that is dirty.

That is one aspect. When performing writeback then we need to figure out 
which inodes have dirty pages in the memcg and we need to start writeout 
on those inodes and not on others that have their dirty pages elsewhere. 
There are two components of this that are in this patch and that would 
also have to be implemented for a memcg.

> But that doesn't solve the problem which this patchset is trying to
> solve, which is "don't let all the memory in all this group of nodes
> get dirty".

This patch would solve the problem if the calculation of the dirty pages 
would consider the active memcg and be able to determine the amount of 
dirty pages (through some sort of additional memcg counters). That is just 
the first part though. The second part of finding the inodes that have 
dirty pages for writeback would require an association between memcgs and 
inodes.

> What happens if cpuset A uses nodes 0,1,2,3,4,5,6,7,8,9 and cpuset B
> uses nodes 0,1?  Can activity in cpuset A cause ooms in cpuset B?

Yes if the activities of cpuset A cause all pages to be dirtied in cpuset 
B and then cpuset B attempts to do writeback. This will fail to acquire 
enough memory for writeback and make reclaim impossible.

Typically cpusets are not overlapped like that but used to segment the 
system.

The system would work correctly if the dirty ratio calculation would be 
done on all overlapping cpusets/memcg groups that contain nodes from 
which allocations are permitted.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-05  2:45                   ` Christoph Lameter
@ 2008-11-05  3:05                     ` Andrew Morton
  2008-11-05  4:31                       ` KAMEZAWA Hiroyuki
  2008-11-05 13:52                       ` Christoph Lameter
  0 siblings, 2 replies; 46+ messages in thread
From: Andrew Morton @ 2008-11-05  3:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: peterz, rientjes, npiggin, menage, dfults, linux-kernel, containers

On Tue, 4 Nov 2008 20:45:17 -0600 (CST) Christoph Lameter <cl@linux-foundation.org> wrote:

> On Tue, 4 Nov 2008, Andrew Morton wrote:
> 
> > In a memcg implementation what we would implement is "throttle
> > page-dirtying tasks in this memcg when the memcg's dirty memory reaches
> > 40% of its total".
> 
> Right that is similar to what this patch does for cpusets. A memcg 
> implementation would need to figure out if we are currently part of a 
> memcg and then determine the percentage of memory that is dirty.
> 
> That is one aspect. When performing writeback then we need to figure out 
> which inodes have dirty pages in the memcg and we need to start writeout 
> on those inodes and not on others that have their dirty pages elsewhere. 
> There are two components of this that are in this patch and that would 
> also have to be implemented for a memcg.

Doable.  lru->page->mapping->host is a good start.

> > But that doesn't solve the problem which this patchset is trying to
> > solve, which is "don't let all the memory in all this group of nodes
> > get dirty".
> 
> This patch would solve the problem if the calculation of the dirty pages 
> would consider the active memcg and be able to determine the amount of 
> dirty pages (through some sort of additional memcg counters). That is just 
> the first part though. The second part of finding the inodes that have 
> dirty pages for writeback would require an association between memcgs and 
> inodes.

We presently have that via the LRU.  It has holes, but so does this per-cpuset
scheme.

> > What happens if cpuset A uses nodes 0,1,2,3,4,5,6,7,8,9 and cpuset B
> > uses nodes 0,1?  Can activity in cpuset A cause ooms in cpuset B?
> 
> Yes if the activities of cpuset A cause all pages to be dirtied in cpuset 
> B and then cpuset B attempts to do writeback. This will fail to acquire 
> enough memory for writeback and make reclaim impossible.
> 
> Typically cpusets are not overlapped like that but used to segment the 
> system.
> 
> The system would work correctly if the dirty ratio calculation would be 
> done on all overlapping cpusets/memcg groups that contain nodes from 
> which allocations are permitted.

That.


Generally, I worry that this is a specific fix to a specific problem
encountered on specific machines with specific setups and specific
workloads, and that it's just all too low-level and myopic.

And now we're back in the usual position where there's existing code and
everyone says it's terribly wonderful and everyone is reluctant to step
back and look at the big picture.  Am I wrong?


Plus: we need per-memcg dirty-memory throttling, and this is more
important than per-cpuset, I suspect.  How will the (already rather
buggy) code look once we've stuffed both of them in there?


I agree that there's a problem here, although given the amount of time
that it's been there, I suspect that it is a very small problem. 
Someone please convince me that in three years time we will agree that
merging this fix to that problem was a correct decision?



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-05  1:31                   ` KAMEZAWA Hiroyuki
@ 2008-11-05  3:09                     ` Andrew Morton
  0 siblings, 0 replies; 46+ messages in thread
From: Andrew Morton @ 2008-11-05  3:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Christoph Lameter, npiggin, dfults, linux-kernel, rientjes,
	containers, menage

On Wed, 5 Nov 2008 10:31:23 +0900 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> > 
> > Yes?  Someone help me out here.  I don't yet have my head around the
> > overlaps and incompatibilities here.  Perhaps the containers guys will
> > wake up and put their thinking caps on?
> > 
> > 
> > 
> > What happens if cpuset A uses nodes 0,1,2,3,4,5,6,7,8,9 and cpuset B
> > uses nodes 0,1?  Can activity in cpuset A cause ooms in cpuset B?
> > 
> For help this, per-node-dirty-ratio-throttoling is necessary.
> 
> Shouldn't we just have a new parameter as /proc/sys/vm/dirty_ratio_per_node.

I guess that would work.  But it is a general solution and will be less
efficient for the particular setups which are triggering this problem.

> /proc/sys/vm/dirty_ratio works for throttling the whole system dirty pages.
> /proc/sys/vm/dirty_ratio_per_node works for throttling dirty pages in a node.
> 
> Implementation will not be difficult and works enough against OOM.

Yup.  Just track per-node dirtiness and walk the LRU when it is over
threshold.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-05  3:05                     ` Andrew Morton
@ 2008-11-05  4:31                       ` KAMEZAWA Hiroyuki
  2008-11-10  9:02                         ` Andrea Righi
  2008-11-05 13:52                       ` Christoph Lameter
  1 sibling, 1 reply; 46+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-05  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, peterz, rientjes, npiggin, menage, dfults,
	linux-kernel, containers, righi.andrea

On Tue, 4 Nov 2008 19:05:05 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:
> Generally, I worry that this is a specific fix to a specific problem
> encountered on specific machines with specific setups and specific
> workloads, and that it's just all too low-level and myopic.
> 
> And now we're back in the usual position where there's existing code and
> everyone says it's terribly wonderful and everyone is reluctant to step
> back and look at the big picture.  Am I wrong?
> 
> 
> Plus: we need per-memcg dirty-memory throttling, and this is more
> important than per-cpuset, I suspect.  How will the (already rather
> buggy) code look once we've stuffed both of them in there?
> 
> 
IIUC, Andrea Righ posted 2 patches around dirty_ratio. (added him to CC:)
in early October.

  (1) patch for adding dirty_ratio_pcm. (1/100000)
  (2) per-memcg dirty ratio. (maybe this..http://lkml.org/lkml/2008/9/12/121)
 
(1) should be just posted again.

Because we have changed page_cgroup implementation, (2) should be reworked.
"rework" itself will not be very difficult.
(.... we tend to be stick to "what interface is the best" discussion ;) 

But memcg itself is not so weak against dirty_pages because we don't call
try_to_free_pages() becasue of memory shortage but because of memory limitation.

BTW, in my current stack, followings are queued.
   a. handle SwapCache in proper way in memcg.
   b. handle swap_cgroup (if configured)
   c. make LRU handling easier

For making per-memcg dirty_ratio sane, (a) should go ahead. I do (a) now.
If Andrea seems to be too busy, I'll schedule dirty_ratio-for-memcg as my work.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-05  3:05                     ` Andrew Morton
  2008-11-05  4:31                       ` KAMEZAWA Hiroyuki
@ 2008-11-05 13:52                       ` Christoph Lameter
  2008-11-05 18:41                         ` Andrew Morton
  1 sibling, 1 reply; 46+ messages in thread
From: Christoph Lameter @ 2008-11-05 13:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: peterz, rientjes, npiggin, menage, dfults, linux-kernel, containers

On Tue, 4 Nov 2008, Andrew Morton wrote:

>> That is one aspect. When performing writeback then we need to figure out
>> which inodes have dirty pages in the memcg and we need to start writeout
>> on those inodes and not on others that have their dirty pages elsewhere.
>> There are two components of this that are in this patch and that would
>> also have to be implemented for a memcg.
>
> Doable.  lru->page->mapping->host is a good start.

The block layer has a list of inodes that are dirty. From that we need to 
select ones that will improve the situation from the cpuset/memcg. How 
does the LRU come into this?

>> This patch would solve the problem if the calculation of the dirty pages
>> would consider the active memcg and be able to determine the amount of
>> dirty pages (through some sort of additional memcg counters). That is just
>> the first part though. The second part of finding the inodes that have
>> dirty pages for writeback would require an association between memcgs and
>> inodes.
>
> We presently have that via the LRU.  It has holes, but so does this per-cpuset
> scheme.

How do I get to the LRU from the dirtied list of inodes?

> Generally, I worry that this is a specific fix to a specific problem
> encountered on specific machines with specific setups and specific
> workloads, and that it's just all too low-level and myopic.
>
> And now we're back in the usual position where there's existing code and
> everyone says it's terribly wonderful and everyone is reluctant to step
> back and look at the big picture.  Am I wrong?

Well everyone is just reluctant to do work it seems. Thus they fall back 
to a solution that I provided when memcg groups were not yet available. It 
would be best if someone could find a general scheme or generalize this 
patchset.

> Plus: we need per-memcg dirty-memory throttling, and this is more
> important than per-cpuset, I suspect.  How will the (already rather
> buggy) code look once we've stuffed both of them in there?

The basics will still be the same

1. One need to establish the dirty ratio of memcgs and monitor them.
2. There needs to be mechanism to perform writeout on the right inodes.

> I agree that there's a problem here, although given the amount of time
> that it's been there, I suspect that it is a very small problem.

It used to be only a problem for NUMA systems. Now its also a problem for 
memcgs.

> Someone please convince me that in three years time we will agree that
> merging this fix to that problem was a correct decision?

At the mininum: It provides a basis on top of which memcg support 
can be developed. There are likely major modifications needed to VM 
statistics to get there for memcg groups.





^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-05 13:52                       ` Christoph Lameter
@ 2008-11-05 18:41                         ` Andrew Morton
  2008-11-05 20:21                           ` Christoph Lameter
  0 siblings, 1 reply; 46+ messages in thread
From: Andrew Morton @ 2008-11-05 18:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: peterz, rientjes, npiggin, menage, dfults, linux-kernel, containers

On Wed, 5 Nov 2008 07:52:44 -0600 (CST)
Christoph Lameter <cl@linux-foundation.org> wrote:

> On Tue, 4 Nov 2008, Andrew Morton wrote:
> 
> >> That is one aspect. When performing writeback then we need to figure out
> >> which inodes have dirty pages in the memcg and we need to start writeout
> >> on those inodes and not on others that have their dirty pages elsewhere.
> >> There are two components of this that are in this patch and that would
> >> also have to be implemented for a memcg.
> >
> > Doable.  lru->page->mapping->host is a good start.
> 
> The block layer has a list of inodes that are dirty. From that we need to 
> select ones that will improve the situation from the cpuset/memcg. How 
> does the LRU come into this?

In the simplest case, dirty-memory throttling can just walk the LRU
writing back pages in the same way that kswapd does.

There would probably be performance benefits in doing
address_space-ordered writeback, so the dirty-memory throttling could
pick a dirty page off the LRU, go find its inode and then feed that
into __sync_single_inode().

> >> This patch would solve the problem if the calculation of the dirty pages
> >> would consider the active memcg and be able to determine the amount of
> >> dirty pages (through some sort of additional memcg counters). That is just
> >> the first part though. The second part of finding the inodes that have
> >> dirty pages for writeback would require an association between memcgs and
> >> inodes.
> >
> > We presently have that via the LRU.  It has holes, but so does this per-cpuset
> > scheme.
> 
> How do I get to the LRU from the dirtied list of inodes?

Don't need it.

It'll be approximate and has obvious scenarios of great inaccuraracy
but it'll suffice for the workloads which this patchset addresses.



It sounds like any memcg-based approach just won't be suitable for the
people who are hitting this problem.

But _are_ people hitting this problem?  I haven't seen any real-looking
reports in ages.  Is there some workaround?  If so, what is it?  How
serious is this problem now?


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-05 18:41                         ` Andrew Morton
@ 2008-11-05 20:21                           ` Christoph Lameter
  2008-11-05 20:31                             ` Andrew Morton
  0 siblings, 1 reply; 46+ messages in thread
From: Christoph Lameter @ 2008-11-05 20:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: peterz, rientjes, npiggin, menage, dfults, linux-kernel, containers

On Wed, 5 Nov 2008, Andrew Morton wrote:

> > > Doable.  lru->page->mapping->host is a good start.
> >
> > The block layer has a list of inodes that are dirty. From that we need to
> > select ones that will improve the situation from the cpuset/memcg. How
> > does the LRU come into this?
>
> In the simplest case, dirty-memory throttling can just walk the LRU
> writing back pages in the same way that kswapd does.

That means running reclaim. But we are only interested in getting rid of
dirty pages. Plus the filesystem guys have repeatedly pointed out that
page sized I/O to random places in a file is not a good thing to do. There
was actually talk of stopping kswapd from writing out pages!

> There would probably be performance benefits in doing
> address_space-ordered writeback, so the dirty-memory throttling could
> pick a dirty page off the LRU, go find its inode and then feed that
> into __sync_single_inode().

We cannot call into the writeback functions for an inode from a reclaim
context. We can write back single pages but not a range of pages from an
inode due to various locking issues (see discussion on slab defrag
patchset).

> > How do I get to the LRU from the dirtied list of inodes?
>
> Don't need it.
>
> It'll be approximate and has obvious scenarios of great inaccuraracy
> but it'll suffice for the workloads which this patchset addresses.

Sounds like a wild hack that runs against known limitations in terms
of locking etc.

> It sounds like any memcg-based approach just won't be suitable for the
> people who are hitting this problem.

Why not? If you can determine which memcgs an inode has dirty pages on
then the same scheme as proposed here will work.

> But _are_ people hitting this problem?  I haven't seen any real-looking
> reports in ages.  Is there some workaround?  If so, what is it?  How
> serious is this problem now?

Are there people who are actually having memcg based solutions deployed?
No enterprise release includes it yet so I guess that there is not much of
a use yet.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-05 20:21                           ` Christoph Lameter
@ 2008-11-05 20:31                             ` Andrew Morton
  2008-11-05 20:40                               ` Christoph Lameter
  0 siblings, 1 reply; 46+ messages in thread
From: Andrew Morton @ 2008-11-05 20:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: peterz, rientjes, npiggin, menage, dfults, linux-kernel, containers

On Wed, 5 Nov 2008 14:21:47 -0600 (CST)
Christoph Lameter <cl@linux-foundation.org> wrote:

> On Wed, 5 Nov 2008, Andrew Morton wrote:
> 
> > > > Doable.  lru->page->mapping->host is a good start.
> > >
> > > The block layer has a list of inodes that are dirty. From that we need to
> > > select ones that will improve the situation from the cpuset/memcg. How
> > > does the LRU come into this?
> >
> > In the simplest case, dirty-memory throttling can just walk the LRU
> > writing back pages in the same way that kswapd does.
> 
> That means running reclaim. But we are only interested in getting rid of
> dirty pages. Plus the filesystem guys have repeatedly pointed out that
> page sized I/O to random places in a file is not a good thing to do. There
> was actually talk of stopping kswapd from writing out pages!

They don't have to be reclaimed.

> > There would probably be performance benefits in doing
> > address_space-ordered writeback, so the dirty-memory throttling could
> > pick a dirty page off the LRU, go find its inode and then feed that
> > into __sync_single_inode().
> 
> We cannot call into the writeback functions for an inode from a reclaim
> context. We can write back single pages but not a range of pages from an
> inode due to various locking issues (see discussion on slab defrag
> patchset).

We're not in a reclaim context.  We're in sys_write() context.

> > But _are_ people hitting this problem?  I haven't seen any real-looking
> > reports in ages.  Is there some workaround?  If so, what is it?  How
> > serious is this problem now?
> 
> Are there people who are actually having memcg based solutions deployed?
> No enterprise release includes it yet so I guess that there is not much of
> a use yet.

If you know the answer then please provide it.  If you don't, please
say "I don't know".


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-05 20:31                             ` Andrew Morton
@ 2008-11-05 20:40                               ` Christoph Lameter
  2008-11-05 20:56                                 ` Andrew Morton
  0 siblings, 1 reply; 46+ messages in thread
From: Christoph Lameter @ 2008-11-05 20:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: peterz, rientjes, npiggin, menage, dfults, linux-kernel, containers

On Wed, 5 Nov 2008, Andrew Morton wrote:

> > That means running reclaim. But we are only interested in getting rid of
> > dirty pages. Plus the filesystem guys have repeatedly pointed out that
> > page sized I/O to random places in a file is not a good thing to do. There
> > was actually talk of stopping kswapd from writing out pages!
>
> They don't have to be reclaimed.

Well the LRU is used for reclaim. If you step over it then its using the
existing reclaim logic in vmscan.c right?

> > > There would probably be performance benefits in doing
> > > address_space-ordered writeback, so the dirty-memory throttling could
> > > pick a dirty page off the LRU, go find its inode and then feed that
> > > into __sync_single_inode().
> >
> > We cannot call into the writeback functions for an inode from a reclaim
> > context. We can write back single pages but not a range of pages from an
> > inode due to various locking issues (see discussion on slab defrag
> > patchset).
>
> We're not in a reclaim context.  We're in sys_write() context.

Dirtying a page can occur from a variety of kernel contexts.

> > > But _are_ people hitting this problem?  I haven't seen any real-looking
> > > reports in ages.  Is there some workaround?  If so, what is it?  How
> > > serious is this problem now?
> >
> > Are there people who are actually having memcg based solutions deployed?
> > No enterprise release includes it yet so I guess that there is not much of
> > a use yet.
>
> If you know the answer then please provide it.  If you don't, please
> say "I don't know".

I thought we were talking about memcg related reports. I have dealt with
scores of the cpuset related ones in my prior job.

Workarounds are:

1. Reduce the global dirty ratios so that the number of dirty pages in a
cpuset cannot become too high.

2. Do not create small cpusets where the system can dirty all pages.

3. Find other ways to limit the dirty pages (run sync once in a while or
so).

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-05 20:40                               ` Christoph Lameter
@ 2008-11-05 20:56                                 ` Andrew Morton
  2008-11-05 21:28                                   ` Christoph Lameter
                                                     ` (2 more replies)
  0 siblings, 3 replies; 46+ messages in thread
From: Andrew Morton @ 2008-11-05 20:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: peterz, rientjes, npiggin, menage, dfults, linux-kernel, containers

On Wed, 5 Nov 2008 14:40:05 -0600 (CST)
Christoph Lameter <cl@linux-foundation.org> wrote:

> On Wed, 5 Nov 2008, Andrew Morton wrote:
> 
> > > That means running reclaim. But we are only interested in getting rid of
> > > dirty pages. Plus the filesystem guys have repeatedly pointed out that
> > > page sized I/O to random places in a file is not a good thing to do. There
> > > was actually talk of stopping kswapd from writing out pages!
> >
> > They don't have to be reclaimed.
> 
> Well the LRU is used for reclaim. If you step over it then its using the
> existing reclaim logic in vmscan.c right?

Only if you use it that way.

I imagine that a suitable implementation would start IO on the page
then move it to the other end of the LRU.  ie: treat it as referenced. 
Pretty simple stuff.

If we were to do writeout on the page's inode instead then we'd need
to move the page out of the way somehow, presumably by rotating it.

It's all workable outable.

> > > > There would probably be performance benefits in doing
> > > > address_space-ordered writeback, so the dirty-memory throttling could
> > > > pick a dirty page off the LRU, go find its inode and then feed that
> > > > into __sync_single_inode().
> > >
> > > We cannot call into the writeback functions for an inode from a reclaim
> > > context. We can write back single pages but not a range of pages from an
> > > inode due to various locking issues (see discussion on slab defrag
> > > patchset).
> >
> > We're not in a reclaim context.  We're in sys_write() context.
> 
> Dirtying a page can occur from a variety of kernel contexts.

This writeback will occur from one quite specific place:
balance_dirty_pages().  That's called from sys_write() and pagefaults. 
Other scruffy places like splice too.

But none of that matters - the fact is that we're _already_ doing
writeback from balance_dirty_pages().  All we're talking about here is
alternative schemes for looking up the pages to write.

> > > > But _are_ people hitting this problem?  I haven't seen any real-looking
> > > > reports in ages.  Is there some workaround?  If so, what is it?  How
> > > > serious is this problem now?
> > >
> > > Are there people who are actually having memcg based solutions deployed?
> > > No enterprise release includes it yet so I guess that there is not much of
> > > a use yet.
> >
> > If you know the answer then please provide it.  If you don't, please
> > say "I don't know".
> 
> I thought we were talking about memcg related reports. I have dealt with
> scores of the cpuset related ones in my prior job.
> 
> Workarounds are:
> 
> 1. Reduce the global dirty ratios so that the number of dirty pages in a
> cpuset cannot become too high.

That would be less than the smallest node's memory capacity, I guess.

> 2. Do not create small cpusets where the system can dirty all pages.
> 
> 3. Find other ways to limit the dirty pages (run sync once in a while or
> so).

hm, OK.


See, here's my problem: we have a pile of new code which fixes some
problem.  But the problem seems to be fairly small - it only affects a
small number of sophisticated users and they already have workarounds
in place.

So the world wouldn't end if we just didn't merge it.  Those users
stick with their workarounds and the kernel remains simpler and
smaller.

How do we work out which is the best choice here?  I don't have enough
information to do this.



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-05 20:56                                 ` Andrew Morton
@ 2008-11-05 21:28                                   ` Christoph Lameter
  2008-11-05 21:55                                   ` Paul Menage
  2008-11-05 22:04                                   ` David Rientjes
  2 siblings, 0 replies; 46+ messages in thread
From: Christoph Lameter @ 2008-11-05 21:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: peterz, rientjes, npiggin, menage, dfults, linux-kernel,
	containers, travis, steiner

On Wed, 5 Nov 2008, Andrew Morton wrote:

> See, here's my problem: we have a pile of new code which fixes some
> problem.  But the problem seems to be fairly small - it only affects a
> small number of sophisticated users and they already have workarounds
> in place.

Well yes... Great situation with those workarounds.

> So the world wouldn't end if we just didn't merge it.  Those users
> stick with their workarounds and the kernel remains simpler and
> smaller.
>
> How do we work out which is the best choice here?  I don't have enough
> information to do this.

Not sure how to treat this. I am not involved with large NUMA at this
point. So the people who are interested need to speak up if they want
this.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-05 20:56                                 ` Andrew Morton
  2008-11-05 21:28                                   ` Christoph Lameter
@ 2008-11-05 21:55                                   ` Paul Menage
  2008-11-05 22:04                                   ` David Rientjes
  2 siblings, 0 replies; 46+ messages in thread
From: Paul Menage @ 2008-11-05 21:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, peterz, rientjes, npiggin, dfults,
	linux-kernel, containers

On Wed, Nov 5, 2008 at 12:56 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
>>
>> 1. Reduce the global dirty ratios so that the number of dirty pages in a
>> cpuset cannot become too high.
>
> That would be less than the smallest node's memory capacity, I guess.

Even that doesn't work - if there's a single global limit on dirty
pages, then any cpuset/cgroup with access to enough memory can exhaust
that limit and cause other processes to block when they try to write
to disk. You need independent dirty counts to avoid that, whether it
be per-node or per-cgroup.

Paul

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-05 20:56                                 ` Andrew Morton
  2008-11-05 21:28                                   ` Christoph Lameter
  2008-11-05 21:55                                   ` Paul Menage
@ 2008-11-05 22:04                                   ` David Rientjes
  2008-11-06  1:34                                     ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 46+ messages in thread
From: David Rientjes @ 2008-11-05 22:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, peterz, npiggin, Paul Menage, dfults,
	linux-kernel, containers, Andrea Righi

On Wed, 5 Nov 2008, Andrew Morton wrote:

> See, here's my problem: we have a pile of new code which fixes some
> problem.  But the problem seems to be fairly small - it only affects a
> small number of sophisticated users and they already have workarounds
> in place.
> 

The workarounds, while restrictive of how you configure your cpusets, are 
indeed effective.

> So the world wouldn't end if we just didn't merge it.  Those users
> stick with their workarounds and the kernel remains simpler and
> smaller.
> 

Agreed.  This patchset is admittedly from a different time when cpusets 
was the only relevant extension that needed to be done.

> How do we work out which is the best choice here?  I don't have enough
> information to do this.
> 

If we are to support memcg-specific dirty ratios, that requires the 
aforementioned statistics to be collected so that the calculation is even 
possible.  The series at 

	http://marc.info/?l=linux-kernel&m=122123225006571
	http://marc.info/?l=linux-kernel&m=122123241106902

is a step in that direction, although I'd prefer to see NR_UNSTABLE_NFS to 
be extracted separately from MEM_CGROUP_STAT_FILE_DIRTY so 
throttle_vm_writeout() can also use the new statistics.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-05 22:04                                   ` David Rientjes
@ 2008-11-06  1:34                                     ` KAMEZAWA Hiroyuki
  2008-11-06 20:35                                       ` David Rientjes
  0 siblings, 1 reply; 46+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-06  1:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, npiggin, Christoph Lameter, dfults, linux-kernel,
	containers, Paul Menage, Andrea Righi

On Wed, 5 Nov 2008 14:04:42 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> > So the world wouldn't end if we just didn't merge it.  Those users
> > stick with their workarounds and the kernel remains simpler and
> > smaller.
> > 
> 
> Agreed.  This patchset is admittedly from a different time when cpusets 
> was the only relevant extension that needed to be done.
> 
BTW, what is the problem this patch wants to fix ?
  1. avoid slow-down of memory allocation by triggering write-out earlier.
  2. avoid OOM by throttoling dirty pages.

About 1, memcg's diry_ratio can help if mounted as
   mount -t cgroup none /somewhere/  -o cpuset,memory
(If the user can accept overheads of memcg.)
If implemented.

About 2, A Google guy posted OOM handler cgroup to linux-mm.

> > How do we work out which is the best choice here?  I don't have enough
> > information to do this.
> > 
> 
> If we are to support memcg-specific dirty ratios, that requires the 
> aforementioned statistics to be collected so that the calculation is even 
> possible.  The series at 
> 
> 	http://marc.info/?l=linux-kernel&m=122123225006571
> 	http://marc.info/?l=linux-kernel&m=122123241106902
> 
yes. we(memcg) need this kind of.

> is a step in that direction, although I'd prefer to see NR_UNSTABLE_NFS to 
> be extracted separately from MEM_CGROUP_STAT_FILE_DIRTY so 
> throttle_vm_writeout() can also use the new statistics.
> 
Thank you for input.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-06  1:34                                     ` KAMEZAWA Hiroyuki
@ 2008-11-06 20:35                                       ` David Rientjes
  0 siblings, 0 replies; 46+ messages in thread
From: David Rientjes @ 2008-11-06 20:35 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, npiggin, Christoph Lameter, dfults, linux-kernel,
	containers, Paul Menage, Andrea Righi

On Thu, 6 Nov 2008, KAMEZAWA Hiroyuki wrote:

> > Agreed.  This patchset is admittedly from a different time when cpusets 
> > was the only relevant extension that needed to be done.
> > 
> BTW, what is the problem this patch wants to fix ?
>   1. avoid slow-down of memory allocation by triggering write-out earlier.
>   2. avoid OOM by throttoling dirty pages.
> 
> About 1, memcg's diry_ratio can help if mounted as
>    mount -t cgroup none /somewhere/  -o cpuset,memory
> (If the user can accept overheads of memcg.)
> If implemented.
> 

Yeah, it needs to be generalized to its own cgroup so that it doesn't 
depend on both CONFIG_CPUSETS or CONFIG_CGROUP_MEM_RES_CTLR.  If we get 
the dirty and writeback page statistics added to memcg, this becomes much 
simpler.

> About 2, A Google guy posted OOM handler cgroup to linux-mm.
> 

Yeah, this could enable one of the workarounds that Christoph earlier 
described: the oom handler has the ability to notify userspace and allows 
it to defer invoking the oom killer if there's an alternative way to 
remedy the situation.  So the oom handler posted to linux-mm could work by 
doing a sync anytime it ran low on memory, but the objective of this 
patchset is different.

The idea here is to implement per-cpuset (and now per-memcg) dirty and 
background dirty ratios to avoid using the global sysctls.  This is 
currently problematic for users of cpusets who divide their machine for 
batches of tasks, usually for NUMA optimizations: a cpuset, for example, 
can represent 40% of the system's memory and if the global dirty ratio is 
set to 50%, we still won't begin writeback even if all the memory in the 
cpuset is dirty.

> > If we are to support memcg-specific dirty ratios, that requires the 
> > aforementioned statistics to be collected so that the calculation is even 
> > possible.  The series at 
> > 
> > 	http://marc.info/?l=linux-kernel&m=122123225006571
> > 	http://marc.info/?l=linux-kernel&m=122123241106902
> > 
> yes. we(memcg) need this kind of.
> 

Andrea, what's the status of the patch to add dirty and writeback 
statistics to memcg?  I don't see it in the October 30 mmotm or any 
followup discussion on it.

> > is a step in that direction, although I'd prefer to see NR_UNSTABLE_NFS to 
> > be extracted separately from MEM_CGROUP_STAT_FILE_DIRTY so 
> > throttle_vm_writeout() can also use the new statistics.
> > 

Is this possible in a second version?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-05  4:31                       ` KAMEZAWA Hiroyuki
@ 2008-11-10  9:02                         ` Andrea Righi
  2008-11-10 10:02                           ` David Rientjes
  0 siblings, 1 reply; 46+ messages in thread
From: Andrea Righi @ 2008-11-10  9:02 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, rientjes
  Cc: Andrew Morton, Christoph Lameter, peterz, npiggin, menage,
	dfults, linux-kernel, containers

On 2008-11-05 05:31, KAMEZAWA Hiroyuki wrote:
> On Tue, 4 Nov 2008 19:05:05 -0800
> Andrew Morton <akpm@linux-foundation.org> wrote:
>> Generally, I worry that this is a specific fix to a specific problem
>> encountered on specific machines with specific setups and specific
>> workloads, and that it's just all too low-level and myopic.
>>
>> And now we're back in the usual position where there's existing code and
>> everyone says it's terribly wonderful and everyone is reluctant to step
>> back and look at the big picture.  Am I wrong?
>>
>>
>> Plus: we need per-memcg dirty-memory throttling, and this is more
>> important than per-cpuset, I suspect.  How will the (already rather
>> buggy) code look once we've stuffed both of them in there?
>>
>>
> IIUC, Andrea Righ posted 2 patches around dirty_ratio. (added him to CC:)
> in early October.
> 
>   (1) patch for adding dirty_ratio_pcm. (1/100000)
>   (2) per-memcg dirty ratio. (maybe this..http://lkml.org/lkml/2008/9/12/121)
>  
> (1) should be just posted again.
> 
> Because we have changed page_cgroup implementation, (2) should be reworked.
> "rework" itself will not be very difficult.
> (.... we tend to be stick to "what interface is the best" discussion ;) 
> 
> But memcg itself is not so weak against dirty_pages because we don't call
> try_to_free_pages() becasue of memory shortage but because of memory limitation.
> 
> BTW, in my current stack, followings are queued.
>    a. handle SwapCache in proper way in memcg.
>    b. handle swap_cgroup (if configured)
>    c. make LRU handling easier
> 
> For making per-memcg dirty_ratio sane, (a) should go ahead. I do (a) now.
> If Andrea seems to be too busy, I'll schedule dirty_ratio-for-memcg as my work.
> 

Hi Kame,

sorry for my late. If it's not too late tonight I'll rebase and test (1)
to 2.6.28-rc2-mm1 and start to rework on (2), also considering the
David's suggestion (split NR_UNSTABLE_NFS from NR_FILE_DIRTY).

-Andrea

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-10  9:02                         ` Andrea Righi
@ 2008-11-10 10:02                           ` David Rientjes
  0 siblings, 0 replies; 46+ messages in thread
From: David Rientjes @ 2008-11-10 10:02 UTC (permalink / raw)
  To: Andrea Righi
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Christoph Lameter, peterz,
	npiggin, menage, dfults, linux-kernel, containers

On Mon, 10 Nov 2008, Andrea Righi wrote:

> > IIUC, Andrea Righ posted 2 patches around dirty_ratio. (added him to CC:)
> > in early October.
> > 
> >   (1) patch for adding dirty_ratio_pcm. (1/100000)
> >   (2) per-memcg dirty ratio. (maybe this..http://lkml.org/lkml/2008/9/12/121)
> >  
> > (1) should be just posted again.
> > 
> > Because we have changed page_cgroup implementation, (2) should be reworked.
> > "rework" itself will not be very difficult.
> > (.... we tend to be stick to "what interface is the best" discussion ;) 
> > 
> > But memcg itself is not so weak against dirty_pages because we don't call
> > try_to_free_pages() becasue of memory shortage but because of memory limitation.
> > 
> > BTW, in my current stack, followings are queued.
> >    a. handle SwapCache in proper way in memcg.
> >    b. handle swap_cgroup (if configured)
> >    c. make LRU handling easier
> > 
> > For making per-memcg dirty_ratio sane, (a) should go ahead. I do (a) now.
> > If Andrea seems to be too busy, I'll schedule dirty_ratio-for-memcg as my work.
> > 
> 
> Hi Kame,
> 
> sorry for my late. If it's not too late tonight I'll rebase and test (1)
> to 2.6.28-rc2-mm1 and start to rework on (2), also considering the
> David's suggestion (split NR_UNSTABLE_NFS from NR_FILE_DIRTY).
> 

The dirty throttling change only depends on patch 1/2 from (2) above, 
which adds the necessary statistics to the memcg for calculating dirty 
ratios.  Patch 2/2 from that series will need to be moved out to a 
separate cgroup as the general consensus from this discussion has 
indicated is necessary.

If it's possible to get a rebased patch 1/2 to add the statistics to the 
memory controller, I'll take it and move the all dirty throttling to a 
separate cgroup so it supports both cpusets and memcg.  Thanks!

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [patch 4/7] mm: cpuset aware reclaim writeout
  2008-10-28 16:08 [patch 1/7] cpusets: add dirty map to struct address_space David Rientjes
@ 2008-10-28 16:08 ` David Rientjes
  0 siblings, 0 replies; 46+ messages in thread
From: David Rientjes @ 2008-10-28 16:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Nick Piggin, Peter Zijlstra, Paul Menage,
	Derek Fults, linux-kernel

From: Christoph Lameter <cl@linux-foundation.org>

During direct reclaim we traverse down a zonelist and are carefully
checking each zone if it's a member of the active cpuset.  But then we
call pdflush without enforcing the same restrictions.  In a larger system
this may have the effect of a massive amount of pages being dirtied and
then either

 - no writeout occurs because global dirty limits have not been reached,
   or

 - writeout starts randomly for some dirty inode in the system.  pdflush
   may just write out data for nodes in another cpuset and miss doing
   proper dirty handling for the current cpuset.

In both cases, dirty pages in the zones of interest may not be affected
and writeout may not occur as necessary.

Fix that by restricting pdflush to the active cpuset.  Writeout will occur
from direct reclaim the same way as without a cpuset.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Menage <menage@google.com>
Cc: Derek Fults <dfults@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/vmscan.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1604,7 +1604,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		 */
 		if (total_scanned > sc->swap_cluster_max +
 					sc->swap_cluster_max / 2) {
-			wakeup_pdflush(laptop_mode ? 0 : total_scanned, NULL);
+			wakeup_pdflush(laptop_mode ? 0 : total_scanned,
+				       &cpuset_current_mems_allowed);
 			sc->may_writepage = 1;
 		}
 

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2008-11-10 10:03 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-10-30 19:23 [patch 0/7] cpuset writeback throttling David Rientjes
2008-10-30 19:23 ` [patch 1/7] cpusets: add dirty map to struct address_space David Rientjes
2008-11-04 21:09   ` Andrew Morton
2008-11-04 21:20     ` Christoph Lameter
2008-11-04 21:42       ` Andrew Morton
2008-10-30 19:23 ` [patch 2/7] pdflush: allow the passing of a nodemask parameter David Rientjes
2008-10-30 19:23 ` [patch 3/7] mm: make page writeback obey cpuset constraints David Rientjes
2008-10-30 19:23 ` [patch 4/7] mm: cpuset aware reclaim writeout David Rientjes
2008-10-30 19:23 ` [patch 5/7] mm: throttle writeout with cpuset awareness David Rientjes
2008-10-30 19:23 ` [patch 6/7] cpusets: per cpuset dirty ratios David Rientjes
2008-10-30 19:23 ` [patch 7/7] cpusets: update documentation for writeback throttling David Rientjes
2008-10-30 21:08 ` [patch 0/7] cpuset " Dave Chinner
2008-10-30 21:33   ` Christoph Lameter
2008-10-30 22:03     ` Dave Chinner
2008-10-31 13:47       ` Christoph Lameter
2008-10-31 16:36       ` David Rientjes
2008-11-04 20:47 ` Andrew Morton
2008-11-04 20:53   ` Peter Zijlstra
2008-11-04 20:58     ` Christoph Lameter
2008-11-04 21:10     ` David Rientjes
2008-11-04 21:16     ` Andrew Morton
2008-11-04 21:21       ` Peter Zijlstra
2008-11-04 21:50         ` Andrew Morton
2008-11-04 22:17           ` Christoph Lameter
2008-11-04 22:35             ` Andrew Morton
2008-11-04 22:52               ` Christoph Lameter
2008-11-04 23:36                 ` Andrew Morton
2008-11-05  1:31                   ` KAMEZAWA Hiroyuki
2008-11-05  3:09                     ` Andrew Morton
2008-11-05  2:45                   ` Christoph Lameter
2008-11-05  3:05                     ` Andrew Morton
2008-11-05  4:31                       ` KAMEZAWA Hiroyuki
2008-11-10  9:02                         ` Andrea Righi
2008-11-10 10:02                           ` David Rientjes
2008-11-05 13:52                       ` Christoph Lameter
2008-11-05 18:41                         ` Andrew Morton
2008-11-05 20:21                           ` Christoph Lameter
2008-11-05 20:31                             ` Andrew Morton
2008-11-05 20:40                               ` Christoph Lameter
2008-11-05 20:56                                 ` Andrew Morton
2008-11-05 21:28                                   ` Christoph Lameter
2008-11-05 21:55                                   ` Paul Menage
2008-11-05 22:04                                   ` David Rientjes
2008-11-06  1:34                                     ` KAMEZAWA Hiroyuki
2008-11-06 20:35                                       ` David Rientjes
  -- strict thread matches above, loose matches on Subject: below --
2008-10-28 16:08 [patch 1/7] cpusets: add dirty map to struct address_space David Rientjes
2008-10-28 16:08 ` [patch 4/7] mm: cpuset aware reclaim writeout David Rientjes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).