LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [patch 1/7] cpusets: add dirty map to struct address_space
@ 2008-10-28 16:08 David Rientjes
  2008-10-28 16:08 ` [patch 2/7] pdflush: allow the passing of a nodemask parameter David Rientjes
                   ` (7 more replies)
  0 siblings, 8 replies; 30+ messages in thread
From: David Rientjes @ 2008-10-28 16:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Nick Piggin, Peter Zijlstra, Paul Menage,
	Derek Fults, linux-kernel

From: Christoph Lameter <cl@linux-foundation.org>

In a NUMA system it is helpful to know where the dirty pages of a mapping
are located.  That way we will be able to implement writeout for
applications that are constrained to a portion of the memory of the
system as required by cpusets.

This patch implements the management of dirty node maps for an address
space through the following functions:

cpuset_clear_dirty_nodes(mapping)	Clear the map of dirty nodes

cpuset_update_nodes(mapping, page)	Record a node in the dirty nodes
					map

cpuset_init_dirty_nodes(mapping)	Initialization of the map


The dirty map may be stored either directly in the mapping (for NUMA
systems with less then BITS_PER_LONG nodes) or separately allocated for
systems with a large number of nodes (f.e. ia64 with 1024 nodes).

Updating the dirty map may involve allocating it first for large
configurations.  Therefore, we protect the allocation and setting of a
node in the map through the tree_lock.  The tree_lock is already taken
when a page is dirtied so there is no additional locking overhead if we
insert the updating of the nodemask there.

The dirty map is only cleared (or freed) when the inode is cleared.  At
that point no pages are attached to the inode anymore and therefore it
can be done without any locking. The dirty map therefore records all nodes
that have been used for dirty pages by that inode until the inode is no
longer used.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Menage <menage@google.com>
Cc: Derek Fults <dfults@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 fs/buffer.c               |    2 +
 fs/fs-writeback.c         |    7 ++++++
 fs/inode.c                |    3 ++
 include/linux/cpuset.h    |   51 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h        |    7 ++++++
 include/linux/writeback.h |    2 +
 kernel/cpuset.c           |   53 +++++++++++++++++++++++++++++++++++++++++++++
 mm/page-writeback.c       |    2 +
 8 files changed, 127 insertions(+), 0 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -41,6 +41,7 @@
 #include <linux/bitops.h>
 #include <linux/mpage.h>
 #include <linux/bit_spinlock.h>
+#include <linux/cpuset.h>
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
 
@@ -719,6 +720,7 @@ static int __set_page_dirty(struct page *page,
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
+	cpuset_update_dirty_nodes(mapping, page);
 	spin_unlock_irq(&mapping->tree_lock);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -23,6 +23,7 @@
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
 #include <linux/buffer_head.h>
+#include <linux/cpuset.h>
 #include "internal.h"
 
 
@@ -487,6 +488,12 @@ void generic_sync_sb_inodes(struct super_block *sb,
 			continue;		/* blockdev has wrong queue */
 		}
 
+		if (!cpuset_intersects_dirty_nodes(mapping, wbc->nodes)) {
+			/* No node pages under writeback */
+			requeue_io(inode);
+			continue;
+		}
+
 		/* Was this inode dirtied after sync_sb_inodes was called? */
 		if (time_after(inode->dirtied_when, start))
 			break;
diff --git a/fs/inode.c b/fs/inode.c
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -22,6 +22,7 @@
 #include <linux/bootmem.h>
 #include <linux/inotify.h>
 #include <linux/mount.h>
+#include <linux/cpuset.h>
 
 /*
  * This is needed for the following functions:
@@ -167,6 +168,7 @@ static struct inode *alloc_inode(struct super_block *sb)
 		mapping->assoc_mapping = NULL;
 		mapping->backing_dev_info = &default_backing_dev_info;
 		mapping->writeback_index = 0;
+		cpuset_init_dirty_nodes(mapping);
 
 		/*
 		 * If the block_device provides a backing_dev_info for client
@@ -271,6 +273,7 @@ void clear_inode(struct inode *inode)
 		bd_forget(inode);
 	if (S_ISCHR(inode->i_mode) && inode->i_cdev)
 		cd_forget(inode);
+	cpuset_clear_dirty_nodes(inode->i_mapping);
 	inode->i_state = I_CLEAR;
 }
 
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -80,6 +80,36 @@ extern int current_cpuset_is_being_rebound(void);
 
 extern void rebuild_sched_domains(void);
 
+/*
+ * We need macros since struct address_space is not defined yet
+ */
+#if MAX_NUMNODES <= BITS_PER_LONG
+#define cpuset_init_dirty_nodes(__mapping)				\
+	(__mapping)->dirty_nodes = NODE_MASK_NONE
+
+#define cpuset_update_dirty_nodes(__mapping, __page)			\
+	node_set(page_to_nid(__page), (__mapping)->dirty_nodes);
+
+#define cpuset_clear_dirty_nodes(__mapping)				\
+	(__mapping)->dirty_nodes = NODE_MASK_NONE
+
+#define cpuset_intersects_dirty_nodes(__mapping, __nodemask_ptr)	\
+	(!(__nodemask_ptr) ||						\
+	 nodes_intersects((__mapping)->dirty_nodes, *(__nodemask_ptr)))
+
+#else
+struct address_space;
+
+#define cpuset_init_dirty_nodes(__mapping)				\
+	(__mapping)->dirty_nodes = NULL
+
+extern void cpuset_update_dirty_nodes(struct address_space *mapping,
+				      struct page *page);
+extern void cpuset_clear_dirty_nodes(struct address_space *mapping);
+extern int cpuset_intersects_dirty_nodes(struct address_space *mapping,
+					 nodemask_t *mask);
+#endif
+
 #else /* !CONFIG_CPUSETS */
 
 static inline int cpuset_init_early(void) { return 0; }
@@ -163,6 +193,27 @@ static inline void rebuild_sched_domains(void)
 	partition_sched_domains(1, NULL, NULL);
 }
 
+struct address_space;
+
+static inline void cpuset_init_dirty_nodes(struct address_space *mapping)
+{
+}
+
+static inline void cpuset_update_dirty_nodes(struct address_space *mapping,
+					     struct page *page)
+{
+}
+
+static inline void cpuset_clear_dirty_nodes(struct address_space *mapping)
+{
+}
+
+static inline int cpuset_intersects_dirty_nodes(struct address_space *mapping,
+						nodemask_t *mask)
+{
+	return 1;
+}
+
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -544,6 +544,13 @@ struct address_space {
 	spinlock_t		private_lock;	/* for use by the address_space */
 	struct list_head	private_list;	/* ditto */
 	struct address_space	*assoc_mapping;	/* ditto */
+#ifdef CONFIG_CPUSETS
+#if MAX_NUMNODES <= BITS_PER_LONG
+	nodemask_t		dirty_nodes;	/* nodes with dirty pages */
+#else
+	nodemask_t		*dirty_nodes;	/* pointer to mask, if dirty */
+#endif
+#endif
 } __attribute__((aligned(sizeof(long))));
 	/*
 	 * On most architectures that alignment is already the case; but
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -72,6 +72,8 @@ struct writeback_control {
 	 * so we use a single control to update them
 	 */
 	unsigned no_nrwrite_index_update:1;
+
+	nodemask_t *nodes;		/* Nodemask to writeback */
 };
 
 /*
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -16,6 +16,7 @@
  *  2006 Rework by Paul Menage to use generic cgroups
  *  2008 Rework of the scheduler domains and CPU hotplug handling
  *       by Max Krasnyansky
+ *  2008 Cpuset writeback by Christoph Lameter
  *
  *  This file is subject to the terms and conditions of the GNU General Public
  *  License.  See the file COPYING in the main directory of the Linux
@@ -2323,6 +2324,58 @@ int cpuset_mem_spread_node(void)
 }
 EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
 
+#if MAX_NUMNODES > BITS_PER_LONG
+/*
+ * Special functions for NUMA systems with a large number of nodes.  The
+ * nodemask is pointed to from the address_space structure.  The attachment of
+ * the dirty_nodes nodemask is protected by the tree_lock.  The nodemask is
+ * freed only when the inode is cleared (and therefore unused, thus no locking
+ * is necessary).
+ */
+void cpuset_update_dirty_nodes(struct address_space *mapping,
+			       struct page *page)
+{
+	nodemask_t *nodes = mapping->dirty_nodes;
+	int node = page_to_nid(page);
+
+	if (!nodes) {
+		nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);
+		if (!nodes)
+			return;
+
+		*nodes = NODE_MASK_NONE;
+		mapping->dirty_nodes = nodes;
+	}
+	node_set(node, *nodes);
+}
+
+void cpuset_clear_dirty_nodes(struct address_space *mapping)
+{
+	nodemask_t *nodes = mapping->dirty_nodes;
+
+	if (nodes) {
+		mapping->dirty_nodes = NULL;
+		kfree(nodes);
+	}
+}
+
+/*
+ * Called without tree_lock.  The nodemask is only freed when the inode is
+ * cleared and therefore this is safe.
+ */
+int cpuset_intersects_dirty_nodes(struct address_space *mapping,
+				  nodemask_t *mask)
+{
+	nodemask_t *dirty_nodes = mapping->dirty_nodes;
+
+	if (!mask)
+		return 1;
+	if (!dirty_nodes)
+		return 0;
+	return nodes_intersects(*dirty_nodes, *mask);
+}
+#endif
+
 /**
  * cpuset_mems_allowed_intersects - Does @tsk1's mems_allowed intersect @tsk2's?
  * @tsk1: pointer to task_struct of some task.
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -34,6 +34,7 @@
 #include <linux/syscalls.h>
 #include <linux/buffer_head.h>
 #include <linux/pagevec.h>
+#include <linux/cpuset.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -1104,6 +1105,7 @@ int __set_page_dirty_nobuffers(struct page *page)
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
+		cpuset_update_dirty_nodes(mapping, page);
 		spin_unlock_irq(&mapping->tree_lock);
 		if (mapping->host) {
 			/* !PageAnon && !swapper_space */

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch 2/7] pdflush: allow the passing of a nodemask parameter
  2008-10-28 16:08 [patch 1/7] cpusets: add dirty map to struct address_space David Rientjes
@ 2008-10-28 16:08 ` David Rientjes
  2008-10-28 16:08 ` [patch 3/7] mm: make page writeback obey cpuset constraints David Rientjes
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: David Rientjes @ 2008-10-28 16:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Nick Piggin, Peter Zijlstra, Paul Menage,
	Derek Fults, linux-kernel

From: Christoph Lameter <cl@linux-foundation.org>

If we want to support nodemask specific writeout, then we need a way to
communicate the set of nodes that an operation should affect.

So add a nodemask_t parameter to the pdflush functions and also store
the nodemask in the pdflush control structure.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Menage <menage@google.com>
Cc: Derek Fults <dfults@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 fs/buffer.c               |    2 +-
 fs/super.c                |    4 ++--
 fs/sync.c                 |    8 ++++----
 include/linux/writeback.h |    5 +++--
 mm/page-writeback.c       |   18 +++++++++---------
 mm/pdflush.c              |   15 ++++++++++++---
 mm/vmscan.c               |    2 +-
 7 files changed, 32 insertions(+), 22 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -364,7 +364,7 @@ static void free_more_memory(void)
 	struct zone *zone;
 	int nid;
 
-	wakeup_pdflush(1024);
+	wakeup_pdflush(1024, NULL);
 	yield();
 
 	for_each_online_node(nid) {
diff --git a/fs/super.c b/fs/super.c
--- a/fs/super.c
+++ b/fs/super.c
@@ -646,7 +646,7 @@ int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
 	return 0;
 }
 
-static void do_emergency_remount(unsigned long foo)
+static void do_emergency_remount(unsigned long foo, nodemask_t *bar)
 {
 	struct super_block *sb;
 
@@ -674,7 +674,7 @@ static void do_emergency_remount(unsigned long foo)
 
 void emergency_remount(void)
 {
-	pdflush_operation(do_emergency_remount, 0);
+	pdflush_operation(do_emergency_remount, 0, NULL);
 }
 
 /*
diff --git a/fs/sync.c b/fs/sync.c
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -21,9 +21,9 @@
  * sync everything.  Start out by waking pdflush, because that writes back
  * all queues in parallel.
  */
-static void do_sync(unsigned long wait)
+static void do_sync(unsigned long wait, nodemask_t *unused)
 {
-	wakeup_pdflush(0);
+	wakeup_pdflush(0, NULL);
 	sync_inodes(0);		/* All mappings, inodes and their blockdevs */
 	DQUOT_SYNC(NULL);
 	sync_supers();		/* Write the superblocks */
@@ -38,13 +38,13 @@ static void do_sync(unsigned long wait)
 
 asmlinkage long sys_sync(void)
 {
-	do_sync(1);
+	do_sync(1, NULL);
 	return 0;
 }
 
 void emergency_sync(void)
 {
-	pdflush_operation(do_sync, 0);
+	pdflush_operation(do_sync, 0, NULL);
 }
 
 /*
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -102,7 +102,7 @@ static inline void inode_sync_wait(struct inode *inode)
 /*
  * mm/page-writeback.c
  */
-int wakeup_pdflush(long nr_pages);
+int wakeup_pdflush(long nr_pages, nodemask_t *nodes);
 void laptop_io_completion(void);
 void laptop_sync_completion(void);
 void throttle_vm_writeout(gfp_t gfp_mask);
@@ -143,7 +143,8 @@ balance_dirty_pages_ratelimited(struct address_space *mapping)
 typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
 				void *data);
 
-int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0);
+int pdflush_operation(void (*fn)(unsigned long, nodemask_t *nodes),
+		      unsigned long arg0, nodemask_t *nodes);
 int generic_writepages(struct address_space *mapping,
 		       struct writeback_control *wbc);
 int write_cache_pages(struct address_space *mapping,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -106,7 +106,7 @@ EXPORT_SYMBOL(laptop_mode);
 /* End of sysctl-exported parameters */
 
 
-static void background_writeout(unsigned long _min_pages);
+static void background_writeout(unsigned long _min_pages, nodemask_t *nodes);
 
 /*
  * Scale the writeback cache size proportional to the relative writeout speeds.
@@ -528,7 +528,7 @@ static void balance_dirty_pages(struct address_space *mapping)
 			(!laptop_mode && (global_page_state(NR_FILE_DIRTY)
 					  + global_page_state(NR_UNSTABLE_NFS)
 					  > background_thresh)))
-		pdflush_operation(background_writeout, 0);
+		pdflush_operation(background_writeout, 0, NULL);
 }
 
 void set_page_dirty_balance(struct page *page, int page_mkwrite)
@@ -616,7 +616,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
  * writeback at least _min_pages, and keep writing until the amount of dirty
  * memory is less than the background threshold, or until we're all clean.
  */
-static void background_writeout(unsigned long _min_pages)
+static void background_writeout(unsigned long _min_pages, nodemask_t *unused)
 {
 	long min_pages = _min_pages;
 	struct writeback_control wbc = {
@@ -658,12 +658,12 @@ static void background_writeout(unsigned long _min_pages)
  * the whole world.  Returns 0 if a pdflush thread was dispatched.  Returns
  * -1 if all pdflush threads were busy.
  */
-int wakeup_pdflush(long nr_pages)
+int wakeup_pdflush(long nr_pages, nodemask_t *nodes)
 {
 	if (nr_pages == 0)
 		nr_pages = global_page_state(NR_FILE_DIRTY) +
 				global_page_state(NR_UNSTABLE_NFS);
-	return pdflush_operation(background_writeout, nr_pages);
+	return pdflush_operation(background_writeout, nr_pages, nodes);
 }
 
 static void wb_timer_fn(unsigned long unused);
@@ -687,7 +687,7 @@ static DEFINE_TIMER(laptop_mode_wb_timer, laptop_timer_fn, 0, 0);
  * older_than_this takes precedence over nr_to_write.  So we'll only write back
  * all dirty pages if they are all attached to "old" mappings.
  */
-static void wb_kupdate(unsigned long arg)
+static void wb_kupdate(unsigned long arg, nodemask_t *unused)
 {
 	unsigned long oldest_jif;
 	unsigned long start_jif;
@@ -746,18 +746,18 @@ int dirty_writeback_centisecs_handler(ctl_table *table, int write,
 
 static void wb_timer_fn(unsigned long unused)
 {
-	if (pdflush_operation(wb_kupdate, 0) < 0)
+	if (pdflush_operation(wb_kupdate, 0, NULL) < 0)
 		mod_timer(&wb_timer, jiffies + HZ); /* delay 1 second */
 }
 
-static void laptop_flush(unsigned long unused)
+static void laptop_flush(unsigned long unused, nodemask_t *unused2)
 {
 	sys_sync();
 }
 
 static void laptop_timer_fn(unsigned long unused)
 {
-	pdflush_operation(laptop_flush, 0);
+	pdflush_operation(laptop_flush, 0, NULL);
 }
 
 /*
diff --git a/mm/pdflush.c b/mm/pdflush.c
--- a/mm/pdflush.c
+++ b/mm/pdflush.c
@@ -83,10 +83,12 @@ static unsigned long last_empty_jifs;
  */
 struct pdflush_work {
 	struct task_struct *who;	/* The thread */
-	void (*fn)(unsigned long);	/* A callback function */
+	void (*fn)(unsigned long, nodemask_t *); /* A callback function */
 	unsigned long arg0;		/* An argument to the callback */
 	struct list_head list;		/* On pdflush_list, when idle */
 	unsigned long when_i_went_to_sleep;
+	int have_nodes;			/* Nodes were specified */
+	nodemask_t nodes;		/* Nodes to writeback */
 };
 
 static int __pdflush(struct pdflush_work *my_work)
@@ -124,7 +126,8 @@ static int __pdflush(struct pdflush_work *my_work)
 		}
 		spin_unlock_irq(&pdflush_lock);
 
-		(*my_work->fn)(my_work->arg0);
+		(*my_work->fn)(my_work->arg0,
+			       my_work->have_nodes ? &my_work->nodes : NULL);
 
 		/*
 		 * Thread creation: For how long have there been zero
@@ -198,7 +201,8 @@ static int pdflush(void *dummy)
  * Returns zero if it indeed managed to find a worker thread, and passed your
  * payload to it.
  */
-int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0)
+int pdflush_operation(void (*fn)(unsigned long, nodemask_t *),
+		      unsigned long arg0, nodemask_t *nodes)
 {
 	unsigned long flags;
 	int ret = 0;
@@ -217,6 +221,11 @@ int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0)
 			last_empty_jifs = jiffies;
 		pdf->fn = fn;
 		pdf->arg0 = arg0;
+		if (nodes) {
+			pdf->nodes = *nodes;
+			pdf->have_nodes = 1;
+		} else
+			pdf->have_nodes = 0;
 		wake_up_process(pdf->who);
 	}
 	spin_unlock_irqrestore(&pdflush_lock, flags);
diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1400,7 +1400,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		 */
 		if (total_scanned > sc->swap_cluster_max +
 					sc->swap_cluster_max / 2) {
-			wakeup_pdflush(laptop_mode ? 0 : total_scanned);
+			wakeup_pdflush(laptop_mode ? 0 : total_scanned, NULL);
 			sc->may_writepage = 1;
 		}
 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch 3/7] mm: make page writeback obey cpuset constraints
  2008-10-28 16:08 [patch 1/7] cpusets: add dirty map to struct address_space David Rientjes
  2008-10-28 16:08 ` [patch 2/7] pdflush: allow the passing of a nodemask parameter David Rientjes
@ 2008-10-28 16:08 ` David Rientjes
  2008-10-28 17:31   ` Peter Zijlstra
  2008-10-28 17:32   ` Peter Zijlstra
  2008-10-28 16:08 ` [patch 4/7] mm: cpuset aware reclaim writeout David Rientjes
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 30+ messages in thread
From: David Rientjes @ 2008-10-28 16:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Nick Piggin, Peter Zijlstra, Paul Menage,
	Derek Fults, linux-kernel

From: Christoph Lameter <cl@linux-foundation.org>

Currently dirty throttling does not work properly in a cpuset.

If, for example, a cpuset contains only 1/10th of available memory then
all of the memory of a cpuset can be dirtied without any writes being
triggered.  If all of the cpuset's memory is dirty then only 10% of total
memory is dirty.  The background writeback threshold is usually set at
10% and the synchrononous threshold at 40%.  So we are still below the
global limits while the dirty ratio in the cpuset is 100%!  Writeback
throttling and background writeout do not work at all in such scenarios.

This patch makes dirty writeout cpuset aware.  When determining the dirty
limits in get_dirty_limits(), we calculate values based on the nodes that
are reachable from the current process (that has been dirtying the page).
Then we can trigger writeout based on the dirty ratio of the memory in
the cpuset.

We trigger writeout in a a cpuset specific way.  We go through the dirty
inodes and search for inodes that have dirty pages on the nodes of the
active cpuset.  If an inode fulfills that requirement then we begin
writeout of the dirty pages of that inode.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Menage <menage@google.com>
Cc: Derek Fults <dfults@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/cpuset.h    |   14 +++++
 include/linux/writeback.h |   13 ++++-
 kernel/cpuset.c           |   36 +++++++++++++
 mm/backing-dev.c          |   12 ++---
 mm/page-writeback.c       |  126 +++++++++++++++++++++++++-------------------
 5 files changed, 138 insertions(+), 63 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -12,6 +12,7 @@
 #include <linux/cpumask.h>
 #include <linux/nodemask.h>
 #include <linux/cgroup.h>
+#include <linux/writeback.h>
 
 #ifdef CONFIG_CPUSETS
 
@@ -110,6 +111,11 @@ extern int cpuset_intersects_dirty_nodes(struct address_space *mapping,
 					 nodemask_t *mask);
 #endif
 
+int cpuset_populate_dirty_limits(struct dirty_limits *dl,
+				 unsigned long *dirtyable_memory,
+				 unsigned long *nr_mapped,
+				 const nodemask_t *nodes);
+
 #else /* !CONFIG_CPUSETS */
 
 static inline int cpuset_init_early(void) { return 0; }
@@ -214,6 +220,14 @@ static inline int cpuset_intersects_dirty_nodes(struct address_space *mapping,
 	return 1;
 }
 
+static inline int cpuset_populate_dirty_limits(struct dirty_limits *dl,
+					       unsigned long *dirtyable_memory,
+					       unsigned long *nr_mapped,
+					       const nodemask_t *nodes)
+{
+	return 0;
+}
+
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -76,6 +76,15 @@ struct writeback_control {
 	nodemask_t *nodes;		/* Nodemask to writeback */
 };
 
+struct dirty_limits {
+	long thresh_background;
+	long thresh_dirty;
+	long thresh_bdi_dirty;
+	unsigned long nr_dirty;
+	unsigned long nr_unstable;
+	unsigned long nr_writeback;
+};
+
 /*
  * fs/fs-writeback.c
  */	
@@ -127,8 +136,8 @@ struct file;
 int dirty_writeback_centisecs_handler(struct ctl_table *, int, struct file *,
 				      void __user *, size_t *, loff_t *);
 
-void get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
-		 struct backing_dev_info *bdi);
+int get_dirty_limits(struct dirty_limits *dl, struct backing_dev_info *bdi,
+		     nodemask_t *nodes);
 
 void page_writeback_init(void);
 void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2375,6 +2375,42 @@ int cpuset_intersects_dirty_nodes(struct address_space *mapping,
 }
 #endif
 
+/*
+ * Calculate the limits relative to the current cpuset
+ *
+ * We do not disregard highmem because all nodes (except maybe node 0) have
+ * either all memory in HIGHMEM (32-bit) or all memory in non-HIGHMEM (64-bit).
+ * If we would disregard highmem, then cpuset throttling would not work on
+ * 32-bit.
+ */
+int cpuset_populate_dirty_limits(struct dirty_limits *dl,
+				 unsigned long *dirtyable_memory,
+				 unsigned long *nr_mapped,
+				 const nodemask_t *nodes)
+{
+	int node;
+
+	if (likely(!nodes || nodes_subset(node_online_map, *nodes)))
+		return 0;
+	for_each_node_mask(node, *nodes) {
+		if (!node_online(node))
+			continue;
+		dl->nr_dirty += node_page_state(node, NR_FILE_DIRTY);
+		dl->nr_unstable += node_page_state(node, NR_UNSTABLE_NFS);
+		dl->nr_writeback += node_page_state(node, NR_WRITEBACK);
+		dirtyable_memory +=
+			node_page_state(node, NR_ACTIVE_ANON) +
+			node_page_state(node, NR_ACTIVE_FILE) +
+			node_page_state(node, NR_INACTIVE_ANON) +
+			node_page_state(node, NR_INACTIVE_FILE) +
+			node_page_state(node, NR_FREE_PAGES);
+		nr_mapped +=
+			node_page_state(node, NR_FILE_MAPPED) +
+			node_page_state(node, NR_ANON_PAGES);
+	}
+	return 1;
+}
+
 /**
  * cpuset_mems_allowed_intersects - Does @tsk1's mems_allowed intersect @tsk2's?
  * @tsk1: pointer to task_struct of some task.
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -24,11 +24,9 @@ static void bdi_debug_init(void)
 static int bdi_debug_stats_show(struct seq_file *m, void *v)
 {
 	struct backing_dev_info *bdi = m->private;
-	long background_thresh;
-	long dirty_thresh;
-	long bdi_thresh;
+	struct dirty_limits dl;
 
-	get_dirty_limits(&background_thresh, &dirty_thresh, &bdi_thresh, bdi);
+	get_dirty_limits(&dl, bdi, NULL);
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 	seq_printf(m,
@@ -39,9 +37,9 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   "BackgroundThresh: %8lu kB\n",
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
-		   K(bdi_thresh),
-		   K(dirty_thresh),
-		   K(background_thresh));
+		   K(dl.thresh_bdi_dirty),
+		   K(dl.thresh_dirty),
+		   K(dl.thresh_background));
 #undef K
 
 	return 0;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -320,13 +320,16 @@ EXPORT_SYMBOL(bdi_set_max_ratio);
  * clamping level.
  */
 
-static unsigned long highmem_dirtyable_memory(unsigned long total)
+static unsigned long highmem_dirtyable_memory(nodemask_t *nodes,
+					      unsigned long total)
 {
 #ifdef CONFIG_HIGHMEM
 	int node;
 	unsigned long x = 0;
 
-	for_each_node_state(node, N_HIGH_MEMORY) {
+	if (!nodes)
+		nodes = &node_states[N_HIGH_MEMORY];
+	for_each_node_mask(node, *nodes) {
 		struct zone *z =
 			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
 
@@ -357,21 +360,37 @@ unsigned long determine_dirtyable_memory(void)
 	x = global_page_state(NR_FREE_PAGES) + global_lru_pages();
 
 	if (!vm_highmem_is_dirtyable)
-		x -= highmem_dirtyable_memory(x);
+		x -= highmem_dirtyable_memory(NULL, x);
 
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
-void
-get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
-		 struct backing_dev_info *bdi)
+int
+get_dirty_limits(struct dirty_limits *dl, struct backing_dev_info *bdi,
+		 nodemask_t *nodes)
 {
 	int background_ratio;		/* Percentages */
 	int dirty_ratio;
 	long background;
 	long dirty;
-	unsigned long available_memory = determine_dirtyable_memory();
+	unsigned long dirtyable_memory = 0;
+	unsigned long nr_mapped = 0;
 	struct task_struct *tsk;
+	int is_subset;
+
+	memset(dl, 0, sizeof(struct dirty_limits));
+	is_subset = cpuset_populate_dirty_limits(dl, &dirtyable_memory,
+						 &nr_mapped, nodes);
+	if (!is_subset) {
+		dl->nr_dirty = global_page_state(NR_FILE_DIRTY);
+		dl->nr_unstable = global_page_state(NR_UNSTABLE_NFS);
+		dl->nr_writeback = global_page_state(NR_WRITEBACK);
+		dirtyable_memory = determine_dirtyable_memory();
+		nr_mapped = global_page_state(NR_FILE_MAPPED) +
+			global_page_state(NR_ANON_PAGES);
+	} else
+		dirtyable_memory -= highmem_dirtyable_memory(nodes,
+							dirtyable_memory);
 
 	dirty_ratio = vm_dirty_ratio;
 	if (dirty_ratio < 5)
@@ -381,15 +400,15 @@ get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
 	if (background_ratio >= dirty_ratio)
 		background_ratio = dirty_ratio / 2;
 
-	background = (background_ratio * available_memory) / 100;
-	dirty = (dirty_ratio * available_memory) / 100;
+	background = (background_ratio * dirtyable_memory) / 100;
+	dirty = (dirty_ratio * dirtyable_memory) / 100;
 	tsk = current;
 	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
 		background += background / 4;
 		dirty += dirty / 4;
 	}
-	*pbackground = background;
-	*pdirty = dirty;
+	dl->thresh_background = background;
+	dl->thresh_dirty = dirty;
 
 	if (bdi) {
 		u64 bdi_dirty;
@@ -407,10 +426,11 @@ get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
 		if (bdi_dirty > (dirty * bdi->max_ratio) / 100)
 			bdi_dirty = dirty * bdi->max_ratio / 100;
 
-		*pbdi_dirty = bdi_dirty;
-		clip_bdi_dirty_limit(bdi, dirty, pbdi_dirty);
-		task_dirty_limit(current, pbdi_dirty);
+		dl->thresh_bdi_dirty = bdi_dirty;
+		clip_bdi_dirty_limit(bdi, dirty, &dl->thresh_bdi_dirty);
+		task_dirty_limit(current, &dl->thresh_bdi_dirty);
 	}
+	return is_subset;
 }
 
 /*
@@ -424,9 +444,7 @@ static void balance_dirty_pages(struct address_space *mapping)
 {
 	long nr_reclaimable, bdi_nr_reclaimable;
 	long nr_writeback, bdi_nr_writeback;
-	long background_thresh;
-	long dirty_thresh;
-	long bdi_thresh;
+	struct dirty_limits dl;
 	unsigned long pages_written = 0;
 	unsigned long write_chunk = sync_writeback_pages();
 
@@ -441,17 +459,16 @@ static void balance_dirty_pages(struct address_space *mapping)
 			.range_cyclic	= 1,
 		};
 
-		get_dirty_limits(&background_thresh, &dirty_thresh,
-				&bdi_thresh, bdi);
-
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
-		nr_writeback = global_page_state(NR_WRITEBACK);
+		if (get_dirty_limits(&dl, bdi, &cpuset_current_mems_allowed))
+			wbc.nodes = &cpuset_current_mems_allowed;
+		nr_reclaimable = dl.nr_dirty + dl.nr_unstable;
+		nr_writeback = dl.nr_writeback;
 
 		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
 		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
 
-		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
+		if (bdi_nr_reclaimable + bdi_nr_writeback <=
+							dl.thresh_bdi_dirty)
 			break;
 
 		/*
@@ -460,7 +477,7 @@ static void balance_dirty_pages(struct address_space *mapping)
 		 * when the bdi limits are ramping up.
 		 */
 		if (nr_reclaimable + nr_writeback <
-				(background_thresh + dirty_thresh) / 2)
+				(dl.thresh_background + dl.thresh_dirty) / 2)
 			break;
 
 		if (!bdi->dirty_exceeded)
@@ -475,8 +492,12 @@ static void balance_dirty_pages(struct address_space *mapping)
 		if (bdi_nr_reclaimable) {
 			writeback_inodes(&wbc);
 			pages_written += write_chunk - wbc.nr_to_write;
-			get_dirty_limits(&background_thresh, &dirty_thresh,
-				       &bdi_thresh, bdi);
+			get_dirty_limits(&dl, bdi,
+					 &cpuset_current_mems_allowed);
+			nr_reclaimable = dl.nr_dirty + dl.nr_unstable;
+			if (nr_reclaimable + dl.nr_writeback <=
+							dl.thresh_dirty)
+				break;
 		}
 
 		/*
@@ -489,7 +510,7 @@ static void balance_dirty_pages(struct address_space *mapping)
 		 * actually dirty; with m+n sitting in the percpu
 		 * deltas.
 		 */
-		if (bdi_thresh < 2*bdi_stat_error(bdi)) {
+		if (dl.thresh_bdi_dirty < 2*bdi_stat_error(bdi)) {
 			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
 			bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
 		} else if (bdi_nr_reclaimable) {
@@ -497,7 +518,8 @@ static void balance_dirty_pages(struct address_space *mapping)
 			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
-		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
+		if (bdi_nr_reclaimable + bdi_nr_writeback <=
+							dl.thresh_bdi_dirty)
 			break;
 		if (pages_written >= write_chunk)
 			break;		/* We've done our duty */
@@ -505,8 +527,8 @@ static void balance_dirty_pages(struct address_space *mapping)
 		congestion_wait(WRITE, HZ/10);
 	}
 
-	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
-			bdi->dirty_exceeded)
+	if (bdi->dirty_exceeded && nr_reclaimable + dl.nr_writeback <=
+							dl.thresh_dirty)
 		bdi->dirty_exceeded = 0;
 
 	if (writeback_in_progress(bdi))
@@ -521,10 +543,9 @@ static void balance_dirty_pages(struct address_space *mapping)
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
 	if ((laptop_mode && pages_written) ||
-			(!laptop_mode && (global_page_state(NR_FILE_DIRTY)
-					  + global_page_state(NR_UNSTABLE_NFS)
-					  > background_thresh)))
-		pdflush_operation(background_writeout, 0, NULL);
+		(!laptop_mode && (nr_reclaimable > dl.thresh_background)))
+		pdflush_operation(background_writeout, 0,
+				  &cpuset_current_mems_allowed);
 }
 
 void set_page_dirty_balance(struct page *page, int page_mkwrite)
@@ -581,22 +602,20 @@ EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
 void throttle_vm_writeout(gfp_t gfp_mask)
 {
-	long background_thresh;
-	long dirty_thresh;
+	struct dirty_limits dl;
 
         for ( ; ; ) {
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
+		get_dirty_limits(&dl, NULL, &node_states[N_HIGH_MEMORY]);
 
-                /*
-                 * Boost the allowable dirty threshold a bit for page
-                 * allocators so they don't get DoS'ed by heavy writers
-                 */
-                dirty_thresh += dirty_thresh / 10;      /* wheeee... */
+		/*
+		 * Boost the allowable dirty threshold a bit for page
+		 * allocators so they don't get DoS'ed by heavy writers
+		 */
+		dl.thresh_dirty += dl.thresh_dirty / 10; /* wheeee... */
 
-                if (global_page_state(NR_UNSTABLE_NFS) +
-			global_page_state(NR_WRITEBACK) <= dirty_thresh)
-                        	break;
-                congestion_wait(WRITE, HZ/10);
+		if (dl.nr_unstable + dl.nr_writeback <= dl.thresh_dirty)
+			break;
+		congestion_wait(WRITE, HZ/10);
 
 		/*
 		 * The caller might hold locks which can prevent IO completion
@@ -612,7 +631,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
  * writeback at least _min_pages, and keep writing until the amount of dirty
  * memory is less than the background threshold, or until we're all clean.
  */
-static void background_writeout(unsigned long _min_pages, nodemask_t *unused)
+static void background_writeout(unsigned long _min_pages, nodemask_t *nodes)
 {
 	long min_pages = _min_pages;
 	struct writeback_control wbc = {
@@ -625,13 +644,12 @@ static void background_writeout(unsigned long _min_pages, nodemask_t *unused)
 	};
 
 	for ( ; ; ) {
-		long background_thresh;
-		long dirty_thresh;
+		struct dirty_limits dl;
 
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
-		if (global_page_state(NR_FILE_DIRTY) +
-			global_page_state(NR_UNSTABLE_NFS) < background_thresh
-				&& min_pages <= 0)
+		if (get_dirty_limits(&dl, NULL, nodes))
+			wbc.nodes = nodes;
+		if (dl.nr_dirty + dl.nr_unstable < dl.thresh_background &&
+		    min_pages <= 0)
 			break;
 		wbc.more_io = 0;
 		wbc.encountered_congestion = 0;

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch 4/7] mm: cpuset aware reclaim writeout
  2008-10-28 16:08 [patch 1/7] cpusets: add dirty map to struct address_space David Rientjes
  2008-10-28 16:08 ` [patch 2/7] pdflush: allow the passing of a nodemask parameter David Rientjes
  2008-10-28 16:08 ` [patch 3/7] mm: make page writeback obey cpuset constraints David Rientjes
@ 2008-10-28 16:08 ` David Rientjes
  2008-10-28 16:08 ` [patch 5/7] mm: throttle writeout with cpuset awareness David Rientjes
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: David Rientjes @ 2008-10-28 16:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Nick Piggin, Peter Zijlstra, Paul Menage,
	Derek Fults, linux-kernel

From: Christoph Lameter <cl@linux-foundation.org>

During direct reclaim we traverse down a zonelist and are carefully
checking each zone if it's a member of the active cpuset.  But then we
call pdflush without enforcing the same restrictions.  In a larger system
this may have the effect of a massive amount of pages being dirtied and
then either

 - no writeout occurs because global dirty limits have not been reached,
   or

 - writeout starts randomly for some dirty inode in the system.  pdflush
   may just write out data for nodes in another cpuset and miss doing
   proper dirty handling for the current cpuset.

In both cases, dirty pages in the zones of interest may not be affected
and writeout may not occur as necessary.

Fix that by restricting pdflush to the active cpuset.  Writeout will occur
from direct reclaim the same way as without a cpuset.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Menage <menage@google.com>
Cc: Derek Fults <dfults@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/vmscan.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1604,7 +1604,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		 */
 		if (total_scanned > sc->swap_cluster_max +
 					sc->swap_cluster_max / 2) {
-			wakeup_pdflush(laptop_mode ? 0 : total_scanned, NULL);
+			wakeup_pdflush(laptop_mode ? 0 : total_scanned,
+				       &cpuset_current_mems_allowed);
 			sc->may_writepage = 1;
 		}
 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch 5/7] mm: throttle writeout with cpuset awareness
  2008-10-28 16:08 [patch 1/7] cpusets: add dirty map to struct address_space David Rientjes
                   ` (2 preceding siblings ...)
  2008-10-28 16:08 ` [patch 4/7] mm: cpuset aware reclaim writeout David Rientjes
@ 2008-10-28 16:08 ` David Rientjes
  2008-10-28 16:08 ` [patch 6/7] cpusets: per cpuset dirty ratios David Rientjes
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: David Rientjes @ 2008-10-28 16:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Nick Piggin, Peter Zijlstra, Paul Menage,
	Derek Fults, linux-kernel

From: Christoph Lameter <cl@linux-foundation.org>

This bases the VM throttling from the reclaim path on the dirty ratio of
the cpuset.  Note that a cpuset is only effective if shrink_zone is called
from direct reclaim.

kswapd has a cpuset context that includes the whole machine.  VM
throttling will only work during synchrononous reclaim and not from
kswapd.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Menage <menage@google.com>
Cc: Derek Fults <dfults@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/writeback.h |    2 +-
 mm/page-writeback.c       |    4 ++--
 mm/vmscan.c               |    2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -114,7 +114,7 @@ static inline void inode_sync_wait(struct inode *inode)
 int wakeup_pdflush(long nr_pages, nodemask_t *nodes);
 void laptop_io_completion(void);
 void laptop_sync_completion(void);
-void throttle_vm_writeout(gfp_t gfp_mask);
+void throttle_vm_writeout(nodemask_t *nodes, gfp_t gfp_mask);
 
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -600,12 +600,12 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
-void throttle_vm_writeout(gfp_t gfp_mask)
+void throttle_vm_writeout(nodemask_t *nodes, gfp_t gfp_mask)
 {
 	struct dirty_limits dl;
 
         for ( ; ; ) {
-		get_dirty_limits(&dl, NULL, &node_states[N_HIGH_MEMORY]);
+		get_dirty_limits(&dl, NULL, nodes);
 
 		/*
 		 * Boost the allowable dirty threshold a bit for page
diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1466,7 +1466,7 @@ static unsigned long shrink_zone(int priority, struct zone *zone,
 	else if (!scan_global_lru(sc))
 		shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
 
-	throttle_vm_writeout(sc->gfp_mask);
+	throttle_vm_writeout(&cpuset_current_mems_allowed, sc->gfp_mask);
 	return nr_reclaimed;
 }
 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch 6/7] cpusets: per cpuset dirty ratios
  2008-10-28 16:08 [patch 1/7] cpusets: add dirty map to struct address_space David Rientjes
                   ` (3 preceding siblings ...)
  2008-10-28 16:08 ` [patch 5/7] mm: throttle writeout with cpuset awareness David Rientjes
@ 2008-10-28 16:08 ` David Rientjes
  2008-10-30  6:59   ` Paul Menage
  2008-10-30  8:44   ` Peter Zijlstra
  2008-10-28 16:08 ` [patch 7/7] cpusets: update documentation for writeback throttling David Rientjes
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 30+ messages in thread
From: David Rientjes @ 2008-10-28 16:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Nick Piggin, Peter Zijlstra, Paul Menage,
	Derek Fults, linux-kernel

From: Christoph Lameter <cl@linux-foundation.org>

This implements dirty ratios per cpuset.  Two new files are added to the
cpuset directories:

dirty_background_ratio	Percentage at which background writeback starts

dirty_ratio		Percentage at which the application is throttled
			and we start synchrononous writeout

Both variables are set to -1 by default which means that the global
limits (/proc/sys/vm/dirty_background_ratio and /proc/sys/vm/dirty_ratio)
are used for a cpuset.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Menage <menage@google.com>
Cc: Derek Fults <dfults@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/cpuset.h |    7 ++++
 kernel/cpuset.c        |   87 ++++++++++++++++++++++++++++++++++++++++++++++++
 mm/page-writeback.c    |   10 +++--
 3 files changed, 100 insertions(+), 4 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -77,6 +77,8 @@ static inline int cpuset_do_slab_mem_spread(void)
 
 extern void cpuset_track_online_nodes(void);
 
+extern void cpuset_get_current_dirty_ratios(int *background, int *throttle);
+
 extern int current_cpuset_is_being_rebound(void);
 
 extern void rebuild_sched_domains(void);
@@ -189,6 +191,11 @@ static inline int cpuset_do_slab_mem_spread(void)
 
 static inline void cpuset_track_online_nodes(void) {}
 
+static inline void cpuset_get_current_dirty_ratios(int *background,
+						   int *throttle)
+{
+}
+
 static inline int current_cpuset_is_being_rebound(void)
 {
 	return 0;
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -105,6 +105,9 @@ struct cpuset {
 
 	/* used for walking a cpuset heirarchy */
 	struct list_head stack_list;
+
+	int dirty_background_ratio;
+	int cpuset_dirty_ratio;
 };
 
 /* Retrieve the cpuset for a cgroup */
@@ -197,6 +200,8 @@ static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
 	.cpus_allowed = CPU_MASK_ALL,
 	.mems_allowed = NODE_MASK_ALL,
+	.dirty_background_ratio = -1,
+	.cpuset_dirty_ratio = -1,
 };
 
 /*
@@ -1198,6 +1203,42 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	return 0;
 }
 
+static int update_int(int *cs_int, int val, int min, int max)
+{
+	if (val < min || val > max)
+		return -EINVAL;
+	mutex_lock(&callback_mutex);
+	*cs_int = val;
+	mutex_unlock(&callback_mutex);
+	return 0;
+}
+
+static u64 get_dirty_background_ratio(struct cpuset *cs)
+{
+	int ret;
+
+	mutex_lock(&callback_mutex);
+	ret = cs->dirty_background_ratio;
+	mutex_unlock(&callback_mutex);
+
+	if (ret == -1)
+		ret = dirty_background_ratio;
+	return (u64)ret;
+}
+
+static u64 get_dirty_ratio(struct cpuset *cs)
+{
+	int ret;
+
+	mutex_lock(&callback_mutex);
+	ret = cs->cpuset_dirty_ratio;
+	mutex_unlock(&callback_mutex);
+
+	if (ret == -1)
+		ret = vm_dirty_ratio;
+	return (u64)ret;
+}
+
 /*
  * Frequency meter - How fast is some event occurring?
  *
@@ -1362,6 +1403,8 @@ typedef enum {
 	FILE_MEMORY_PRESSURE,
 	FILE_SPREAD_PAGE,
 	FILE_SPREAD_SLAB,
+	FILE_DIRTY_BACKGROUND_RATIO,
+	FILE_DIRTY_RATIO,
 } cpuset_filetype_t;
 
 static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
@@ -1424,6 +1467,12 @@ static int cpuset_write_s64(struct cgroup *cgrp, struct cftype *cft, s64 val)
 	case FILE_SCHED_RELAX_DOMAIN_LEVEL:
 		retval = update_relax_domain_level(cs, val);
 		break;
+	case FILE_DIRTY_BACKGROUND_RATIO:
+		retval = update_int(&cs->dirty_background_ratio, val, -1, 100);
+		break;
+	case FILE_DIRTY_RATIO:
+		retval = update_int(&cs->cpuset_dirty_ratio, val, -1, 100);
+		break;
 	default:
 		retval = -EINVAL;
 		break;
@@ -1551,6 +1600,10 @@ static u64 cpuset_read_u64(struct cgroup *cont, struct cftype *cft)
 		return is_spread_page(cs);
 	case FILE_SPREAD_SLAB:
 		return is_spread_slab(cs);
+	case FILE_DIRTY_BACKGROUND_RATIO:
+		return get_dirty_background_ratio(cs);
+	case FILE_DIRTY_RATIO:
+		return get_dirty_ratio(cs);
 	default:
 		BUG();
 	}
@@ -1658,6 +1711,20 @@ static struct cftype files[] = {
 		.write_u64 = cpuset_write_u64,
 		.private = FILE_SPREAD_SLAB,
 	},
+
+	{
+		.name = "dirty_background_ratio",
+		.read_u64 = cpuset_read_u64,
+		.write_s64 = cpuset_write_s64,
+		.private = FILE_DIRTY_BACKGROUND_RATIO,
+	},
+
+	{
+		.name = "dirty_ratio",
+		.read_u64 = cpuset_read_u64,
+		.write_s64 = cpuset_write_s64,
+		.private = FILE_DIRTY_RATIO,
+	},
 };
 
 static struct cftype cft_memory_pressure_enabled = {
@@ -1753,6 +1820,8 @@ static struct cgroup_subsys_state *cpuset_create(
 	cs->mems_generation = cpuset_mems_generation++;
 	fmeter_init(&cs->fmeter);
 	cs->relax_domain_level = -1;
+	cs->dirty_background_ratio = parent->dirty_background_ratio;
+	cs->cpuset_dirty_ratio = parent->cpuset_dirty_ratio;
 
 	cs->parent = parent;
 	number_of_cpusets++;
@@ -2021,6 +2090,24 @@ void cpuset_track_online_nodes(void)
 }
 #endif
 
+/*
+ * Determine the dirty ratios for the currently active cpuset
+ */
+void cpuset_get_current_dirty_ratios(int *background, int *throttle)
+{
+	mutex_lock(&callback_mutex);
+	task_lock(current);
+	*background = task_cs(current)->dirty_background_ratio;
+	*throttle = task_cs(current)->cpuset_dirty_ratio;
+	task_unlock(current);
+	mutex_unlock(&callback_mutex);
+
+	if (*background == -1)
+		*background = dirty_background_ratio;
+	if (*throttle == -1)
+		*throttle = vm_dirty_ratio;
+}
+
 /**
  * cpuset_init_smp - initialize cpus_allowed
  *
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -388,15 +388,17 @@ get_dirty_limits(struct dirty_limits *dl, struct backing_dev_info *bdi,
 		dirtyable_memory = determine_dirtyable_memory();
 		nr_mapped = global_page_state(NR_FILE_MAPPED) +
 			global_page_state(NR_ANON_PAGES);
-	} else
+		dirty_ratio = vm_dirty_ratio;
+		background_ratio = dirty_background_ratio;
+	} else {
 		dirtyable_memory -= highmem_dirtyable_memory(nodes,
 							dirtyable_memory);
+		cpuset_get_current_dirty_ratios(&background_ratio,
+						&dirty_ratio);
+	}
 
-	dirty_ratio = vm_dirty_ratio;
 	if (dirty_ratio < 5)
 		dirty_ratio = 5;
-
-	background_ratio = dirty_background_ratio;
 	if (background_ratio >= dirty_ratio)
 		background_ratio = dirty_ratio / 2;
 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch 7/7] cpusets: update documentation for writeback throttling
  2008-10-28 16:08 [patch 1/7] cpusets: add dirty map to struct address_space David Rientjes
                   ` (4 preceding siblings ...)
  2008-10-28 16:08 ` [patch 6/7] cpusets: per cpuset dirty ratios David Rientjes
@ 2008-10-28 16:08 ` David Rientjes
  2008-10-30 16:06   ` Christoph Lameter
  2008-10-28 17:37 ` [patch 1/7] cpusets: add dirty map to struct address_space Peter Zijlstra
  2008-10-28 17:46 ` Peter Zijlstra
  7 siblings, 1 reply; 30+ messages in thread
From: David Rientjes @ 2008-10-28 16:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Nick Piggin, Peter Zijlstra, Paul Menage,
	Derek Fults, Randy Dunlap, linux-kernel

Update Documentation/cpusets.txt for per-cpuset writeback throttling.

Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Menage <menage@google.com>
Cc: Derek Fults <dfults@sgi.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/cpusets.txt |   58 ++++++++++++++++++++++++++++++++++-----------
 1 files changed, 44 insertions(+), 14 deletions(-)

diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt
--- a/Documentation/cpusets.txt
+++ b/Documentation/cpusets.txt
@@ -14,15 +14,17 @@ CONTENTS:
 =========
 
 1. Cpusets
-  1.1 What are cpusets ?
-  1.2 Why are cpusets needed ?
-  1.3 How are cpusets implemented ?
-  1.4 What are exclusive cpusets ?
-  1.5 What is memory_pressure ?
-  1.6 What is memory spread ?
-  1.7 What is sched_load_balance ?
-  1.8 What is sched_relax_domain_level ?
-  1.9 How do I use cpusets ?
+  1.1  What are cpusets ?
+  1.2  Why are cpusets needed ?
+  1.3  How are cpusets implemented ?
+  1.4  What are exclusive cpusets ?
+  1.5  What is memory_pressure ?
+  1.6  What is memory spread ?
+  1.7  What is sched_load_balance ?
+  1.8  What is sched_relax_domain_level ?
+  1.9  What is dirty_background_ratio ?
+  1.10 What is dirty_ratio ?
+  1.11 How do I use cpusets ?
 2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Adding/removing cpus
@@ -148,6 +150,8 @@ into the rest of the kernel, none in performance critical paths:
    Memory Nodes by what's allowed in that tasks cpuset.
  - in page_alloc.c, to restrict memory to allowed nodes.
  - in vmscan.c, to restrict page recovery to the current cpuset.
+ - in pdflush.c, to restrict pdflush's nodes of interest to those
+   of the current task's cpuset.
 
 You should mount the "cgroup" filesystem type in order to enable
 browsing and modifying the cpusets presently known to the kernel.  No
@@ -572,9 +576,34 @@ If your situation is:
 then increasing 'sched_relax_domain_level' would benefit you.
 
 
-1.9 How do I use cpusets ?
+1.9 What is dirty_background_ratio ?
+------------------------------------
+
+The per-cpuset percentage of dirtyable memory (free pages + mapped
+pages + file cache, not including locked or huge pages), the number of
+pages at which the pdflush background writeout daemon will start writing
+out dirtied data.
+
+The default value is /proc/sys/vm/dirty_background_ratio.  Set the
+cpuset's dirty_background_ratio to -1 to accept this system-wide
+default.
+
+
+1.10 What is dirty_ratio ?
 --------------------------
 
+The per-cpuset percentage of dirtyable memory (free pages + mapped
+pages + file cache, not including locked or huge pages), the number of
+pages at which a process generating disk writes will itself start
+writing out dirtied data.
+
+The default value is /proc/sys/vm/dirty_ratio.  Set the cpuset's
+dirty_ratio to -1 to accept this system-wide default.
+
+
+1.11 How do I use cpusets ?
+---------------------------
+
 In order to minimize the impact of cpusets on critical kernel
 code, such as the scheduler, and due to the fact that the kernel
 does not support one task updating the memory placement of another
@@ -716,10 +745,11 @@ Now you want to do something with this cpuset.
 
 In this directory you can find several files:
 # ls
-cpu_exclusive  memory_migrate      mems                      tasks
-cpus           memory_pressure     notify_on_release
-mem_exclusive  memory_spread_page  sched_load_balance
-mem_hardwall   memory_spread_slab  sched_relax_domain_level
+cpu_exclusive		mem_hardwall		mems
+cpus			memory_migrate		notify_on_release
+dirty_background_ratio	memory_pressure		sched_load_balance
+dirty_ratio		memory_spread_page	sched_relax_domain_level
+mem_exclusive		memory_spread_slab	tasks
 
 Reading them will give you information about the state of this cpuset:
 the CPUs and Memory Nodes it can use, the processes that are using

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 3/7] mm: make page writeback obey cpuset constraints
  2008-10-28 16:08 ` [patch 3/7] mm: make page writeback obey cpuset constraints David Rientjes
@ 2008-10-28 17:31   ` Peter Zijlstra
  2008-10-28 19:16     ` David Rientjes
  2008-10-28 17:32   ` Peter Zijlstra
  1 sibling, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2008-10-28 17:31 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Christoph Lameter, Nick Piggin, Paul Menage,
	Derek Fults, linux-kernel

On Tue, 2008-10-28 at 09:08 -0700, David Rientjes wrote:

> +/*
> + * Calculate the limits relative to the current cpuset
> + *
> + * We do not disregard highmem because all nodes (except maybe node 0) have
> + * either all memory in HIGHMEM (32-bit) or all memory in non-HIGHMEM (64-bit).
> + * If we would disregard highmem, then cpuset throttling would not work on
> + * 32-bit.
> + */
> +int cpuset_populate_dirty_limits(struct dirty_limits *dl,
> +				 unsigned long *dirtyable_memory,
> +				 unsigned long *nr_mapped,
> +				 const nodemask_t *nodes)
> +{
> +	int node;
> +
> +	if (likely(!nodes || nodes_subset(node_online_map, *nodes)))
> +		return 0;
> +	for_each_node_mask(node, *nodes) {
> +		if (!node_online(node))
> +			continue;
> +		dl->nr_dirty += node_page_state(node, NR_FILE_DIRTY);
> +		dl->nr_unstable += node_page_state(node, NR_UNSTABLE_NFS);
> +		dl->nr_writeback += node_page_state(node, NR_WRITEBACK);
> +		dirtyable_memory +=

*dirtyable_memory perhaps?

> +			node_page_state(node, NR_ACTIVE_ANON) +
> +			node_page_state(node, NR_ACTIVE_FILE) +
> +			node_page_state(node, NR_INACTIVE_ANON) +
> +			node_page_state(node, NR_INACTIVE_FILE) +
> +			node_page_state(node, NR_FREE_PAGES);
> +		nr_mapped +=

idem?

> +			node_page_state(node, NR_FILE_MAPPED) +
> +			node_page_state(node, NR_ANON_PAGES);
> +	}
> +	return 1;
> +}




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 3/7] mm: make page writeback obey cpuset constraints
  2008-10-28 16:08 ` [patch 3/7] mm: make page writeback obey cpuset constraints David Rientjes
  2008-10-28 17:31   ` Peter Zijlstra
@ 2008-10-28 17:32   ` Peter Zijlstra
  2008-10-28 19:18     ` David Rientjes
  1 sibling, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2008-10-28 17:32 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Christoph Lameter, Nick Piggin, Paul Menage,
	Derek Fults, linux-kernel

On Tue, 2008-10-28 at 09:08 -0700, David Rientjes wrote:

> +	is_subset = cpuset_populate_dirty_limits(dl, &dirtyable_memory,
> +						 &nr_mapped, nodes);
> +	if (!is_subset) {
> +		dl->nr_dirty = global_page_state(NR_FILE_DIRTY);
> +		dl->nr_unstable = global_page_state(NR_UNSTABLE_NFS);
> +		dl->nr_writeback = global_page_state(NR_WRITEBACK);
> +		dirtyable_memory = determine_dirtyable_memory();
> +		nr_mapped = global_page_state(NR_FILE_MAPPED) +
> +			global_page_state(NR_ANON_PAGES);
> +	} else
> +		dirtyable_memory -= highmem_dirtyable_memory(nodes,
> +							dirtyable_memory);

Why not fold that all into cpuset_populate_dirty_limits() ?


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 1/7] cpusets: add dirty map to struct address_space
  2008-10-28 16:08 [patch 1/7] cpusets: add dirty map to struct address_space David Rientjes
                   ` (5 preceding siblings ...)
  2008-10-28 16:08 ` [patch 7/7] cpusets: update documentation for writeback throttling David Rientjes
@ 2008-10-28 17:37 ` Peter Zijlstra
  2008-10-28 20:48   ` David Rientjes
  2008-10-28 17:46 ` Peter Zijlstra
  7 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2008-10-28 17:37 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Christoph Lameter, Nick Piggin, Paul Menage,
	Derek Fults, linux-kernel

On Tue, 2008-10-28 at 09:08 -0700, David Rientjes wrote:

> This patch implements the management of dirty node maps for an address
> space through the following functions:
> 
> cpuset_clear_dirty_nodes(mapping)	Clear the map of dirty nodes
> 
> cpuset_update_nodes(mapping, page)	Record a node in the dirty nodes
> 					map
> 
> cpuset_init_dirty_nodes(mapping)	Initialization of the map
> 
> 
> The dirty map may be stored either directly in the mapping (for NUMA
> systems with less then BITS_PER_LONG nodes) or separately allocated for
> systems with a large number of nodes (f.e. ia64 with 1024 nodes).
> 
> Updating the dirty map may involve allocating it first for large
> configurations.  Therefore, we protect the allocation and setting of a
> node in the map through the tree_lock.  The tree_lock is already taken
> when a page is dirtied so there is no additional locking overhead if we
> insert the updating of the nodemask there.

I find this usage of tree lock most bothersome, as my concurrent
pagecache patches take the lock out. In which case this _does_ cause
extra locking overhead.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 1/7] cpusets: add dirty map to struct address_space
  2008-10-28 16:08 [patch 1/7] cpusets: add dirty map to struct address_space David Rientjes
                   ` (6 preceding siblings ...)
  2008-10-28 17:37 ` [patch 1/7] cpusets: add dirty map to struct address_space Peter Zijlstra
@ 2008-10-28 17:46 ` Peter Zijlstra
  2008-10-28 19:19   ` David Rientjes
  7 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2008-10-28 17:46 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Christoph Lameter, Nick Piggin, Paul Menage,
	Derek Fults, linux-kernel

On Tue, 2008-10-28 at 09:08 -0700, David Rientjes wrote:
> @@ -487,6 +488,12 @@ void generic_sync_sb_inodes(struct super_block *sb,
>                         continue;               /* blockdev has wrong queue */
>                 }
>  
> +               if (!cpuset_intersects_dirty_nodes(mapping, wbc->nodes)) {
> +                       /* No node pages under writeback */
> +                       requeue_io(inode);
> +                       continue;
> +               }

So, aside from all the dirty limit and passing node masks around, this
is the magic bit?

I totally missed it first time around, a short mention in the changelog
might not be undeserved.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 3/7] mm: make page writeback obey cpuset constraints
  2008-10-28 17:31   ` Peter Zijlstra
@ 2008-10-28 19:16     ` David Rientjes
  0 siblings, 0 replies; 30+ messages in thread
From: David Rientjes @ 2008-10-28 19:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Christoph Lameter, Nick Piggin, Paul Menage,
	Derek Fults, linux-kernel

On Tue, 28 Oct 2008, Peter Zijlstra wrote:

> > +int cpuset_populate_dirty_limits(struct dirty_limits *dl,
> > +				 unsigned long *dirtyable_memory,
> > +				 unsigned long *nr_mapped,
> > +				 const nodemask_t *nodes)
> > +{
> > +	int node;
> > +
> > +	if (likely(!nodes || nodes_subset(node_online_map, *nodes)))
> > +		return 0;
> > +	for_each_node_mask(node, *nodes) {
> > +		if (!node_online(node))
> > +			continue;
> > +		dl->nr_dirty += node_page_state(node, NR_FILE_DIRTY);
> > +		dl->nr_unstable += node_page_state(node, NR_UNSTABLE_NFS);
> > +		dl->nr_writeback += node_page_state(node, NR_WRITEBACK);
> > +		dirtyable_memory +=
> 
> *dirtyable_memory perhaps?
> 

Indeed, fixed.

> > +			node_page_state(node, NR_ACTIVE_ANON) +
> > +			node_page_state(node, NR_ACTIVE_FILE) +
> > +			node_page_state(node, NR_INACTIVE_ANON) +
> > +			node_page_state(node, NR_INACTIVE_FILE) +
> > +			node_page_state(node, NR_FREE_PAGES);
> > +		nr_mapped +=
> 
> idem?
> 

Likewise.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 3/7] mm: make page writeback obey cpuset constraints
  2008-10-28 17:32   ` Peter Zijlstra
@ 2008-10-28 19:18     ` David Rientjes
  2008-10-30  8:42       ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: David Rientjes @ 2008-10-28 19:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Christoph Lameter, Nick Piggin, Paul Menage,
	Derek Fults, linux-kernel

On Tue, 28 Oct 2008, Peter Zijlstra wrote:

> > +	is_subset = cpuset_populate_dirty_limits(dl, &dirtyable_memory,
> > +						 &nr_mapped, nodes);
> > +	if (!is_subset) {
> > +		dl->nr_dirty = global_page_state(NR_FILE_DIRTY);
> > +		dl->nr_unstable = global_page_state(NR_UNSTABLE_NFS);
> > +		dl->nr_writeback = global_page_state(NR_WRITEBACK);
> > +		dirtyable_memory = determine_dirtyable_memory();
> > +		nr_mapped = global_page_state(NR_FILE_MAPPED) +
> > +			global_page_state(NR_ANON_PAGES);
> > +	} else
> > +		dirtyable_memory -= highmem_dirtyable_memory(nodes,
> > +							dirtyable_memory);
> 
> Why not fold that all into cpuset_populate_dirty_limits() ?
> 

cpuset_populate_dirty_limits() is a no-op on !CONFIG_CPUSETS kernels.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 1/7] cpusets: add dirty map to struct address_space
  2008-10-28 17:46 ` Peter Zijlstra
@ 2008-10-28 19:19   ` David Rientjes
  0 siblings, 0 replies; 30+ messages in thread
From: David Rientjes @ 2008-10-28 19:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Christoph Lameter, Nick Piggin, Paul Menage,
	Derek Fults, linux-kernel

On Tue, 28 Oct 2008, Peter Zijlstra wrote:

> > @@ -487,6 +488,12 @@ void generic_sync_sb_inodes(struct super_block *sb,
> >                         continue;               /* blockdev has wrong queue */
> >                 }
> >  
> > +               if (!cpuset_intersects_dirty_nodes(mapping, wbc->nodes)) {
> > +                       /* No node pages under writeback */
> > +                       requeue_io(inode);
> > +                       continue;
> > +               }
> 
> So, aside from all the dirty limit and passing node masks around, this
> is the magic bit?
> 
> I totally missed it first time around, a short mention in the changelog
> might not be undeserved.
> 

Added, thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 1/7] cpusets: add dirty map to struct address_space
  2008-10-28 17:37 ` [patch 1/7] cpusets: add dirty map to struct address_space Peter Zijlstra
@ 2008-10-28 20:48   ` David Rientjes
  2008-10-29  1:13     ` David Rientjes
  0 siblings, 1 reply; 30+ messages in thread
From: David Rientjes @ 2008-10-28 20:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Christoph Lameter, Nick Piggin, Paul Menage,
	Derek Fults, linux-kernel

On Tue, 28 Oct 2008, Peter Zijlstra wrote:

> > This patch implements the management of dirty node maps for an address
> > space through the following functions:
> > 
> > cpuset_clear_dirty_nodes(mapping)	Clear the map of dirty nodes
> > 
> > cpuset_update_nodes(mapping, page)	Record a node in the dirty nodes
> > 					map
> > 
> > cpuset_init_dirty_nodes(mapping)	Initialization of the map
> > 
> > 
> > The dirty map may be stored either directly in the mapping (for NUMA
> > systems with less then BITS_PER_LONG nodes) or separately allocated for
> > systems with a large number of nodes (f.e. ia64 with 1024 nodes).
> > 
> > Updating the dirty map may involve allocating it first for large
> > configurations.  Therefore, we protect the allocation and setting of a
> > node in the map through the tree_lock.  The tree_lock is already taken
> > when a page is dirtied so there is no additional locking overhead if we
> > insert the updating of the nodemask there.
> 
> I find this usage of tree lock most bothersome, as my concurrent
> pagecache patches take the lock out. In which case this _does_ cause
> extra locking overhead.
> 

Yeah, if we don't serialize with tree_lock then we'll need to protect the 
attachment of mapping->dirty_nodes with a new spinlock in struct 
address_space (and only for configs where MAX_NUMNODES > BITS_PER_LONG). 
That locking overhead is negligible when mapping->dirty_nodes is non-NULL 
since there's no requirement to protect the setting of the node in the 
nodemask.

Are your concurrent pagecache patches in the latest mmotm?  If so, I can 
rebase this entire patchset off that.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 1/7] cpusets: add dirty map to struct address_space
  2008-10-28 20:48   ` David Rientjes
@ 2008-10-29  1:13     ` David Rientjes
  2008-10-29  2:24       ` David Rientjes
  2008-10-30  8:38       ` Peter Zijlstra
  0 siblings, 2 replies; 30+ messages in thread
From: David Rientjes @ 2008-10-29  1:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Christoph Lameter, Nick Piggin, Paul Menage,
	Derek Fults, linux-kernel

On Tue, 28 Oct 2008, David Rientjes wrote:

> Yeah, if we don't serialize with tree_lock then we'll need to protect the 
> attachment of mapping->dirty_nodes with a new spinlock in struct 
> address_space (and only for configs where MAX_NUMNODES > BITS_PER_LONG). 
> That locking overhead is negligible when mapping->dirty_nodes is non-NULL 
> since there's no requirement to protect the setting of the node in the 
> nodemask.
> 
> Are your concurrent pagecache patches in the latest mmotm?  If so, I can 
> rebase this entire patchset off that.

We're still taking mapping->tree_lock in both __set_page_dirty() and 
__set_page_dirty_nobuffers() in today's mmotm.

When tree_lock is removed with your patchset, we can add a spinlock to 
protect mapping->dirty_nodes when MAX_NUMNODES > BITS_PER_LONG.

Would you like to fold this patch into your series (which assumes we're 
not taking mapping->tree_lock in either of the two callers above)?
---
diff --git a/fs/inode.c b/fs/inode.c
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -223,6 +223,9 @@ void inode_init_once(struct inode *inode)
 	INIT_LIST_HEAD(&inode->inotify_watches);
 	mutex_init(&inode->inotify_mutex);
 #endif
+#if MAX_NUMNODES > BITS_PER_LONG
+	spin_lock_init(&inode->i_data.dirty_nodes_lock);
+#endif
 }
 
 EXPORT_SYMBOL(inode_init_once);
diff --git a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -554,6 +554,7 @@ struct address_space {
 	nodemask_t		dirty_nodes;	/* nodes with dirty pages */
 #else
 	nodemask_t		*dirty_nodes;	/* pointer to mask, if dirty */
+	spinlock_t		dirty_nodes_lock; /* protects the above */
 #endif
 #endif
 } __attribute__((aligned(sizeof(long))));
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2413,10 +2413,7 @@ EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
 #if MAX_NUMNODES > BITS_PER_LONG
 /*
  * Special functions for NUMA systems with a large number of nodes.  The
- * nodemask is pointed to from the address_space structure.  The attachment of
- * the dirty_nodes nodemask is protected by the tree_lock.  The nodemask is
- * freed only when the inode is cleared (and therefore unused, thus no locking
- * is necessary).
+ * nodemask is pointed to from the address_space structure.
  */
 void cpuset_update_dirty_nodes(struct address_space *mapping,
 			       struct page *page)
@@ -2424,14 +2421,18 @@ void cpuset_update_dirty_nodes(struct address_space *mapping,
 	nodemask_t *nodes = mapping->dirty_nodes;
 	int node = page_to_nid(page);
 
+	spin_lock_irq(&mapping->dirty_nodes_lock);
 	if (!nodes) {
 		nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);
-		if (!nodes)
+		if (!nodes) {
+			spin_unlock_irq(&mapping->dirty_nodes_lock);
 			return;
+		}
 
 		*nodes = NODE_MASK_NONE;
 		mapping->dirty_nodes = nodes;
 	}
+	spin_unlock_irq(&mapping->dirty_nodes_lock);
 	node_set(node, *nodes);
 }
 
@@ -2446,8 +2447,8 @@ void cpuset_clear_dirty_nodes(struct address_space *mapping)
 }
 
 /*
- * Called without tree_lock.  The nodemask is only freed when the inode is
- * cleared and therefore this is safe.
+ * The nodemask is only freed when the inode is cleared and therefore this
+ * requires no locking.
  */
 int cpuset_intersects_dirty_nodes(struct address_space *mapping,
 				  nodemask_t *mask)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 1/7] cpusets: add dirty map to struct address_space
  2008-10-29  1:13     ` David Rientjes
@ 2008-10-29  2:24       ` David Rientjes
  2008-10-30  8:38       ` Peter Zijlstra
  1 sibling, 0 replies; 30+ messages in thread
From: David Rientjes @ 2008-10-29  2:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Christoph Lameter, Nick Piggin, Paul Menage,
	Derek Fults, linux-kernel

On Tue, 28 Oct 2008, David Rientjes wrote:

> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -2413,10 +2413,7 @@ EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
>  #if MAX_NUMNODES > BITS_PER_LONG
>  /*
>   * Special functions for NUMA systems with a large number of nodes.  The
> - * nodemask is pointed to from the address_space structure.  The attachment of
> - * the dirty_nodes nodemask is protected by the tree_lock.  The nodemask is
> - * freed only when the inode is cleared (and therefore unused, thus no locking
> - * is necessary).
> + * nodemask is pointed to from the address_space structure.
>   */
>  void cpuset_update_dirty_nodes(struct address_space *mapping,
>  			       struct page *page)
> @@ -2424,14 +2421,18 @@ void cpuset_update_dirty_nodes(struct address_space *mapping,
>  	nodemask_t *nodes = mapping->dirty_nodes;
>  	int node = page_to_nid(page);
>  
> +	spin_lock_irq(&mapping->dirty_nodes_lock);
>  	if (!nodes) {
>  		nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);
> -		if (!nodes)
> +		if (!nodes) {
> +			spin_unlock_irq(&mapping->dirty_nodes_lock);
>  			return;
> +		}
>  
>  		*nodes = NODE_MASK_NONE;
>  		mapping->dirty_nodes = nodes;
>  	}
> +	spin_unlock_irq(&mapping->dirty_nodes_lock);
>  	node_set(node, *nodes);
>  }
>  

Would need to be

	spin_lock_irq(&mapping->dirty_nodes_lock);
	nodes = mapping->dirty_nodes;
	if (!nodes) {
		nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);
		if (!nodes) {
			spin_unlock_irq(&mapping->dirty_nodes_lock);
			return;
		}
		...
	}
	spin_unlock_irq(&mapping->dirty_nodes_lock);

but this shouldn't add much locking overhead without mapping->tree_lock.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 6/7] cpusets: per cpuset dirty ratios
  2008-10-28 16:08 ` [patch 6/7] cpusets: per cpuset dirty ratios David Rientjes
@ 2008-10-30  6:59   ` Paul Menage
  2008-10-30  8:48     ` David Rientjes
  2008-10-30 15:28     ` Christoph Lameter
  2008-10-30  8:44   ` Peter Zijlstra
  1 sibling, 2 replies; 30+ messages in thread
From: Paul Menage @ 2008-10-30  6:59 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Christoph Lameter, Nick Piggin, Peter Zijlstra,
	Derek Fults, linux-kernel

On Tue, Oct 28, 2008 at 9:08 AM, David Rientjes <rientjes@google.com> wrote:
> From: Christoph Lameter <cl@linux-foundation.org>
>
> This implements dirty ratios per cpuset.  Two new files are added to the
> cpuset directories:

Wouldn't this be equally applicable to cgroups using the memory
controller rather than cpusets? Might it make sense to have a common
subsystem that could be used by either of them?

Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 1/7] cpusets: add dirty map to struct address_space
  2008-10-29  1:13     ` David Rientjes
  2008-10-29  2:24       ` David Rientjes
@ 2008-10-30  8:38       ` Peter Zijlstra
  1 sibling, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2008-10-30  8:38 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Christoph Lameter, Nick Piggin, Paul Menage,
	Derek Fults, linux-kernel

On Tue, 2008-10-28 at 18:13 -0700, David Rientjes wrote:
> On Tue, 28 Oct 2008, David Rientjes wrote:
> 
> > Yeah, if we don't serialize with tree_lock then we'll need to protect the 
> > attachment of mapping->dirty_nodes with a new spinlock in struct 
> > address_space (and only for configs where MAX_NUMNODES > BITS_PER_LONG). 
> > That locking overhead is negligible when mapping->dirty_nodes is non-NULL 
> > since there's no requirement to protect the setting of the node in the 
> > nodemask.
> > 
> > Are your concurrent pagecache patches in the latest mmotm?  If so, I can 
> > rebase this entire patchset off that.
> 
> We're still taking mapping->tree_lock in both __set_page_dirty() and 
> __set_page_dirty_nobuffers() in today's mmotm.
> 
> When tree_lock is removed with your patchset, we can add a spinlock to 
> protect mapping->dirty_nodes when MAX_NUMNODES > BITS_PER_LONG.
> 
> Would you like to fold this patch into your series (which assumes we're 
> not taking mapping->tree_lock in either of the two callers above)?

Thanks!, I was working on cleaning up the patches to submit again
soonish. 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 3/7] mm: make page writeback obey cpuset constraints
  2008-10-28 19:18     ` David Rientjes
@ 2008-10-30  8:42       ` Peter Zijlstra
  2008-10-30  9:10         ` David Rientjes
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2008-10-30  8:42 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Christoph Lameter, Nick Piggin, Paul Menage,
	Derek Fults, linux-kernel

On Tue, 2008-10-28 at 12:18 -0700, David Rientjes wrote:
> On Tue, 28 Oct 2008, Peter Zijlstra wrote:
> 
> > > +	is_subset = cpuset_populate_dirty_limits(dl, &dirtyable_memory,
> > > +						 &nr_mapped, nodes);
> > > +	if (!is_subset) {
> > > +		dl->nr_dirty = global_page_state(NR_FILE_DIRTY);
> > > +		dl->nr_unstable = global_page_state(NR_UNSTABLE_NFS);
> > > +		dl->nr_writeback = global_page_state(NR_WRITEBACK);
> > > +		dirtyable_memory = determine_dirtyable_memory();
> > > +		nr_mapped = global_page_state(NR_FILE_MAPPED) +
> > > +			global_page_state(NR_ANON_PAGES);
> > > +	} else
> > > +		dirtyable_memory -= highmem_dirtyable_memory(nodes,
> > > +							dirtyable_memory);
> > 
> > Why not fold that all into cpuset_populate_dirty_limits() ?
> > 
> 
> cpuset_populate_dirty_limits() is a no-op on !CONFIG_CPUSETS kernels.

Right, humm. Maybe introduce a populate_dirty_limits() and differentiate
that between CONFIG_CPUSETS and not, and make it do everything.

That would get rid of this fudge I think, no?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 6/7] cpusets: per cpuset dirty ratios
  2008-10-28 16:08 ` [patch 6/7] cpusets: per cpuset dirty ratios David Rientjes
  2008-10-30  6:59   ` Paul Menage
@ 2008-10-30  8:44   ` Peter Zijlstra
  2008-10-30  9:03     ` David Rientjes
  1 sibling, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2008-10-30  8:44 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Christoph Lameter, Nick Piggin, Paul Menage,
	Derek Fults, linux-kernel

On Tue, 2008-10-28 at 09:08 -0700, David Rientjes wrote:

> +/*
> + * Determine the dirty ratios for the currently active cpuset
> + */
> +void cpuset_get_current_dirty_ratios(int *background, int *throttle)
> +{
> +	mutex_lock(&callback_mutex);
> +	task_lock(current);
> +	*background = task_cs(current)->dirty_background_ratio;
> +	*throttle = task_cs(current)->cpuset_dirty_ratio;
> +	task_unlock(current);
> +	mutex_unlock(&callback_mutex);
> +
> +	if (*background == -1)
> +		*background = dirty_background_ratio;
> +	if (*throttle == -1)
> +		*throttle = vm_dirty_ratio;
> +}

That's rather an awful lot of locking to read just two integers.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 6/7] cpusets: per cpuset dirty ratios
  2008-10-30  6:59   ` Paul Menage
@ 2008-10-30  8:48     ` David Rientjes
  2008-10-30 15:28     ` Christoph Lameter
  1 sibling, 0 replies; 30+ messages in thread
From: David Rientjes @ 2008-10-30  8:48 UTC (permalink / raw)
  To: Paul Menage
  Cc: Andrew Morton, Christoph Lameter, Nick Piggin, Peter Zijlstra,
	Derek Fults, linux-kernel

On Wed, 29 Oct 2008, Paul Menage wrote:

> Wouldn't this be equally applicable to cgroups using the memory
> controller rather than cpusets? Might it make sense to have a common
> subsystem that could be used by either of them?
> 

This change give us two functional advantages for cpusets that aren't 
possible with the memcontroller.

First, it allows for cpu isolation.  I see this as more of a NUMA issue 
than memory controller issue since we're throttling dirty writeout for 
current's set of allowable nodes in the reclaim path to the set of cpus 
that pdflush can use.  This allows a couple of things:

 - disjoint (and cpu-exclusive) cpusets with latency sensitive tasks are
   unaffected by a flood of synchronous dirty writeback, and

 - we preserve NUMA locality and its optimizations for pages under
   writeback.

Second, it allows for targeted dirty ratios.  The memcontroller doesn't 
allow us to determine if, for instance, the background dirty ratio has 
been exceeded for a set of tasks assigned to a specific cgroup since we 
can't determine its amount of dirtyable memory or NR_FILE_DIRTY pages 
(they're stored as ZVC values in the VM and aren't collected by the memory 
controller).

		David

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 6/7] cpusets: per cpuset dirty ratios
  2008-10-30  8:44   ` Peter Zijlstra
@ 2008-10-30  9:03     ` David Rientjes
  2008-10-30  9:34       ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: David Rientjes @ 2008-10-30  9:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Christoph Lameter, Nick Piggin, Paul Menage,
	Derek Fults, linux-kernel

On Thu, 30 Oct 2008, Peter Zijlstra wrote:

> On Tue, 2008-10-28 at 09:08 -0700, David Rientjes wrote:
> 
> > +/*
> > + * Determine the dirty ratios for the currently active cpuset
> > + */
> > +void cpuset_get_current_dirty_ratios(int *background, int *throttle)
> > +{
> > +	mutex_lock(&callback_mutex);
> > +	task_lock(current);
> > +	*background = task_cs(current)->dirty_background_ratio;
> > +	*throttle = task_cs(current)->cpuset_dirty_ratio;
> > +	task_unlock(current);
> > +	mutex_unlock(&callback_mutex);
> > +
> > +	if (*background == -1)
> > +		*background = dirty_background_ratio;
> > +	if (*throttle == -1)
> > +		*throttle = vm_dirty_ratio;
> > +}
> 
> That's rather an awful lot of locking to read just two integers.
> 

As far as I know, task_lock(current) is required to dereference 
task_cs(current) and callback_mutex is required to ensure its the same 
cpuset.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 3/7] mm: make page writeback obey cpuset constraints
  2008-10-30  8:42       ` Peter Zijlstra
@ 2008-10-30  9:10         ` David Rientjes
  2008-10-30  9:34           ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: David Rientjes @ 2008-10-30  9:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Christoph Lameter, Nick Piggin, Paul Menage,
	Derek Fults, linux-kernel

On Thu, 30 Oct 2008, Peter Zijlstra wrote:

> On Tue, 2008-10-28 at 12:18 -0700, David Rientjes wrote:
> > On Tue, 28 Oct 2008, Peter Zijlstra wrote:
> > 
> > > > +	is_subset = cpuset_populate_dirty_limits(dl, &dirtyable_memory,
> > > > +						 &nr_mapped, nodes);
> > > > +	if (!is_subset) {
> > > > +		dl->nr_dirty = global_page_state(NR_FILE_DIRTY);
> > > > +		dl->nr_unstable = global_page_state(NR_UNSTABLE_NFS);
> > > > +		dl->nr_writeback = global_page_state(NR_WRITEBACK);
> > > > +		dirtyable_memory = determine_dirtyable_memory();
> > > > +		nr_mapped = global_page_state(NR_FILE_MAPPED) +
> > > > +			global_page_state(NR_ANON_PAGES);
> > > > +	} else
> > > > +		dirtyable_memory -= highmem_dirtyable_memory(nodes,
> > > > +							dirtyable_memory);
> > > 
> > > Why not fold that all into cpuset_populate_dirty_limits() ?
> > > 
> > 
> > cpuset_populate_dirty_limits() is a no-op on !CONFIG_CPUSETS kernels.
> 
> Right, humm. Maybe introduce a populate_dirty_limits() and differentiate
> that between CONFIG_CPUSETS and not, and make it do everything.
> 
> That would get rid of this fudge I think, no?
> 

I agree it would look much cleaner and I actually did that originally, but 
it requires adding #ifdef's for CONFIG_CPUSETS to mm/page-writeback.c 
since highmem_dirtyable_memory() is static.

There's actually nothing cpuset-specific about 
cpuset_populate_dirty_limits() and it could be used for !CONFIG_CPUSETS 
kernels, but there's a performance benefit of using the global ZVC values 
for the page stats as opposed to iterating over each node and adding their 
respective values.  So is_subset is only non-zero when the passed nodemask 
excludes system nodes (the cpuset case), and we must iterate the node ZVC 
values.

So perhaps the solution is to introduce populate_nodemask_dirty_limits() 
and populate_global_dirty_limits() both in mm/page-writeback.c?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 6/7] cpusets: per cpuset dirty ratios
  2008-10-30  9:03     ` David Rientjes
@ 2008-10-30  9:34       ` Peter Zijlstra
  2008-10-30 10:02         ` David Rientjes
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2008-10-30  9:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Christoph Lameter, Nick Piggin, Paul Menage,
	Derek Fults, linux-kernel

On Thu, 2008-10-30 at 02:03 -0700, David Rientjes wrote:
> On Thu, 30 Oct 2008, Peter Zijlstra wrote:
> 
> > On Tue, 2008-10-28 at 09:08 -0700, David Rientjes wrote:
> > 
> > > +/*
> > > + * Determine the dirty ratios for the currently active cpuset
> > > + */
> > > +void cpuset_get_current_dirty_ratios(int *background, int *throttle)
> > > +{
> > > +	mutex_lock(&callback_mutex);
> > > +	task_lock(current);
> > > +	*background = task_cs(current)->dirty_background_ratio;
> > > +	*throttle = task_cs(current)->cpuset_dirty_ratio;
> > > +	task_unlock(current);
> > > +	mutex_unlock(&callback_mutex);
> > > +
> > > +	if (*background == -1)
> > > +		*background = dirty_background_ratio;
> > > +	if (*throttle == -1)
> > > +		*throttle = vm_dirty_ratio;
> > > +}
> > 
> > That's rather an awful lot of locking to read just two integers.
> > 
> 
> As far as I know, task_lock(current) is required to dereference 
> task_cs(current) and callback_mutex is required to ensure its the same 
> cpuset.

Since we read these things for every evaluation, getting it wrong isn't
too harmful.

So I would suggest just enough locking to ensure we don't reference any
NULL pointers and such.

IIRC the cpuset stuff is RCU freed, so some racy read should be
possible, no?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 3/7] mm: make page writeback obey cpuset constraints
  2008-10-30  9:10         ` David Rientjes
@ 2008-10-30  9:34           ` Peter Zijlstra
  0 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2008-10-30  9:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Christoph Lameter, Nick Piggin, Paul Menage,
	Derek Fults, linux-kernel

On Thu, 2008-10-30 at 02:10 -0700, David Rientjes wrote:

> So perhaps the solution is to introduce populate_nodemask_dirty_limits() 
> and populate_global_dirty_limits() both in mm/page-writeback.c?

Sure, sounds good.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 6/7] cpusets: per cpuset dirty ratios
  2008-10-30  9:34       ` Peter Zijlstra
@ 2008-10-30 10:02         ` David Rientjes
  0 siblings, 0 replies; 30+ messages in thread
From: David Rientjes @ 2008-10-30 10:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Christoph Lameter, Nick Piggin, Paul Menage,
	Derek Fults, linux-kernel

On Thu, 30 Oct 2008, Peter Zijlstra wrote:

> > > > +/*
> > > > + * Determine the dirty ratios for the currently active cpuset
> > > > + */
> > > > +void cpuset_get_current_dirty_ratios(int *background, int *throttle)
> > > > +{
> > > > +	mutex_lock(&callback_mutex);
> > > > +	task_lock(current);
> > > > +	*background = task_cs(current)->dirty_background_ratio;
> > > > +	*throttle = task_cs(current)->cpuset_dirty_ratio;
> > > > +	task_unlock(current);
> > > > +	mutex_unlock(&callback_mutex);
> > > > +
> > > > +	if (*background == -1)
> > > > +		*background = dirty_background_ratio;
> > > > +	if (*throttle == -1)
> > > > +		*throttle = vm_dirty_ratio;
> > > > +}
> > > 
> > > That's rather an awful lot of locking to read just two integers.
> > > 
> > 
> > As far as I know, task_lock(current) is required to dereference 
> > task_cs(current) and callback_mutex is required to ensure its the same 
> > cpuset.
> 
> Since we read these things for every evaluation, getting it wrong isn't
> too harmful.
> 
> So I would suggest just enough locking to ensure we don't reference any
> NULL pointers and such.
> 
> IIRC the cpuset stuff is RCU freed, so some racy read should be
> possible, no?
> 

Ah, that sounds reasonable.  We'll no longer require callback_mutex if we 
accept races when current attaches to another cpuset here.  We'll need 
rcu_read_lock() to safely dereference task_cs(current) unless it's 
top_cpuset, but that's much better than callback_mutex and spinning on 
task_lock(current).

Thanks!

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 6/7] cpusets: per cpuset dirty ratios
  2008-10-30  6:59   ` Paul Menage
  2008-10-30  8:48     ` David Rientjes
@ 2008-10-30 15:28     ` Christoph Lameter
  1 sibling, 0 replies; 30+ messages in thread
From: Christoph Lameter @ 2008-10-30 15:28 UTC (permalink / raw)
  To: Paul Menage
  Cc: David Rientjes, Andrew Morton, Nick Piggin, Peter Zijlstra,
	Derek Fults, linux-kernel

On Wed, 29 Oct 2008, Paul Menage wrote:

> Wouldn't this be equally applicable to cgroups using the memory
> controller rather than cpusets? Might it make sense to have a common
> subsystem that could be used by either of them?

Yes that would be useful.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 7/7] cpusets: update documentation for writeback throttling
  2008-10-28 16:08 ` [patch 7/7] cpusets: update documentation for writeback throttling David Rientjes
@ 2008-10-30 16:06   ` Christoph Lameter
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Lameter @ 2008-10-30 16:06 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Nick Piggin, Peter Zijlstra, Paul Menage,
	Derek Fults, Randy Dunlap, linux-kernel


Reviewed-by: Christoph Lameter <cl@linux-foundation.org>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch 5/7] mm: throttle writeout with cpuset awareness
  2008-10-30 19:23 [patch 0/7] cpuset writeback throttling David Rientjes
@ 2008-10-30 19:23 ` David Rientjes
  0 siblings, 0 replies; 30+ messages in thread
From: David Rientjes @ 2008-10-30 19:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Nick Piggin, Peter Zijlstra, Paul Menage,
	Derek Fults, linux-kernel

From: Christoph Lameter <cl@linux-foundation.org>

This bases the VM throttling from the reclaim path on the dirty ratio of
the cpuset.  Note that a cpuset is only effective if shrink_zone is called
from direct reclaim.

kswapd has a cpuset context that includes the whole machine.  VM
throttling will only work during synchrononous reclaim and not from
kswapd.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Menage <menage@google.com>
Cc: Derek Fults <dfults@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/writeback.h |    2 +-
 mm/page-writeback.c       |    4 ++--
 mm/vmscan.c               |    2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -114,7 +114,7 @@ static inline void inode_sync_wait(struct inode *inode)
 int wakeup_pdflush(long nr_pages, nodemask_t *nodes);
 void laptop_io_completion(void);
 void laptop_sync_completion(void);
-void throttle_vm_writeout(gfp_t gfp_mask);
+void throttle_vm_writeout(nodemask_t *nodes, gfp_t gfp_mask);
 
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -638,12 +638,12 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
-void throttle_vm_writeout(gfp_t gfp_mask)
+void throttle_vm_writeout(nodemask_t *nodes, gfp_t gfp_mask)
 {
 	struct dirty_limits dl;
 
         for ( ; ; ) {
-		get_dirty_limits(&dl, NULL, &node_states[N_HIGH_MEMORY]);
+		get_dirty_limits(&dl, NULL, nodes);
 
 		/*
 		 * Boost the allowable dirty threshold a bit for page
diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1466,7 +1466,7 @@ static unsigned long shrink_zone(int priority, struct zone *zone,
 	else if (!scan_global_lru(sc))
 		shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
 
-	throttle_vm_writeout(sc->gfp_mask);
+	throttle_vm_writeout(&cpuset_current_mems_allowed, sc->gfp_mask);
 	return nr_reclaimed;
 }
 

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2008-10-30 19:24 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-10-28 16:08 [patch 1/7] cpusets: add dirty map to struct address_space David Rientjes
2008-10-28 16:08 ` [patch 2/7] pdflush: allow the passing of a nodemask parameter David Rientjes
2008-10-28 16:08 ` [patch 3/7] mm: make page writeback obey cpuset constraints David Rientjes
2008-10-28 17:31   ` Peter Zijlstra
2008-10-28 19:16     ` David Rientjes
2008-10-28 17:32   ` Peter Zijlstra
2008-10-28 19:18     ` David Rientjes
2008-10-30  8:42       ` Peter Zijlstra
2008-10-30  9:10         ` David Rientjes
2008-10-30  9:34           ` Peter Zijlstra
2008-10-28 16:08 ` [patch 4/7] mm: cpuset aware reclaim writeout David Rientjes
2008-10-28 16:08 ` [patch 5/7] mm: throttle writeout with cpuset awareness David Rientjes
2008-10-28 16:08 ` [patch 6/7] cpusets: per cpuset dirty ratios David Rientjes
2008-10-30  6:59   ` Paul Menage
2008-10-30  8:48     ` David Rientjes
2008-10-30 15:28     ` Christoph Lameter
2008-10-30  8:44   ` Peter Zijlstra
2008-10-30  9:03     ` David Rientjes
2008-10-30  9:34       ` Peter Zijlstra
2008-10-30 10:02         ` David Rientjes
2008-10-28 16:08 ` [patch 7/7] cpusets: update documentation for writeback throttling David Rientjes
2008-10-30 16:06   ` Christoph Lameter
2008-10-28 17:37 ` [patch 1/7] cpusets: add dirty map to struct address_space Peter Zijlstra
2008-10-28 20:48   ` David Rientjes
2008-10-29  1:13     ` David Rientjes
2008-10-29  2:24       ` David Rientjes
2008-10-30  8:38       ` Peter Zijlstra
2008-10-28 17:46 ` Peter Zijlstra
2008-10-28 19:19   ` David Rientjes
2008-10-30 19:23 [patch 0/7] cpuset writeback throttling David Rientjes
2008-10-30 19:23 ` [patch 5/7] mm: throttle writeout with cpuset awareness David Rientjes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).