LKML Archive on lore.kernel.org
* [patch 1/7] cpusets: add dirty map to struct address_space
@ 2008-10-28 16:08 David Rientjes
From: David Rientjes @ 2008-10-28 16:08 UTC
  To: Andrew Morton
  Cc: Christoph Lameter, Nick Piggin, Peter Zijlstra, Paul Menage,
	Derek Fults, linux-kernel

From: Christoph Lameter <cl@linux-foundation.org>

In a NUMA system it is helpful to know on which nodes the dirty pages of a
mapping are located.  With that information we can implement writeout for
applications that are constrained by cpusets to a portion of the system's
memory.

This patch implements the management of dirty node maps for an address
space through the following functions:

cpuset_clear_dirty_nodes(mapping)	Clear the map of dirty nodes

cpuset_update_dirty_nodes(mapping, page)	Record a node in the dirty
					nodes map

cpuset_init_dirty_nodes(mapping)	Initialization of the map


The dirty map may be stored either directly in the mapping (for NUMA
systems with fewer than BITS_PER_LONG nodes) or separately allocated for
systems with a large number of nodes (e.g. ia64 with 1024 nodes).

For large configurations, updating the dirty map may first require
allocating it.  We therefore protect the allocation and the setting of a
node in the map with the tree_lock.  The tree_lock is already taken when a
page is dirtied, so recording the node there adds no locking overhead.

The dirty map is only cleared (or freed) when the inode is cleared.  At
that point no pages are attached to the inode anymore, so this can be done
without any locking.  The dirty map thus records all nodes that have held
dirty pages for the inode over its lifetime.
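
For reference, here is an illustrative sketch of how the calls fit together
over an inode's lifetime.  The individual call sites mirror the hunks below;
the wrapper function itself is invented for illustration and is not part of
the patch.

#include <linux/cpuset.h>
#include <linux/fs.h>

static void dirty_nodes_lifecycle(struct address_space *mapping,
				  struct page *page, nodemask_t *wb_nodes)
{
	/* alloc_inode(): start with an empty dirty map */
	cpuset_init_dirty_nodes(mapping);

	/* __set_page_dirty*(): record the page's node under tree_lock */
	spin_lock_irq(&mapping->tree_lock);
	cpuset_update_dirty_nodes(mapping, page);
	spin_unlock_irq(&mapping->tree_lock);

	/*
	 * generic_sync_sb_inodes(): skip inodes that have no dirty pages
	 * on the nodes of interest (wbc->nodes); the real caller requeues
	 * the inode instead of returning
	 */
	if (!cpuset_intersects_dirty_nodes(mapping, wb_nodes))
		return;

	/* clear_inode(): drop (and, on large systems, free) the map */
	cpuset_clear_dirty_nodes(mapping);
}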

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Menage <menage@google.com>
Cc: Derek Fults <dfults@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 fs/buffer.c               |    2 +
 fs/fs-writeback.c         |    7 ++++++
 fs/inode.c                |    3 ++
 include/linux/cpuset.h    |   51 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h        |    7 ++++++
 include/linux/writeback.h |    2 +
 kernel/cpuset.c           |   53 +++++++++++++++++++++++++++++++++++++++++++++
 mm/page-writeback.c       |    2 +
 8 files changed, 127 insertions(+), 0 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -41,6 +41,7 @@
 #include <linux/bitops.h>
 #include <linux/mpage.h>
 #include <linux/bit_spinlock.h>
+#include <linux/cpuset.h>
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
 
@@ -719,6 +720,7 @@ static int __set_page_dirty(struct page *page,
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
+	cpuset_update_dirty_nodes(mapping, page);
 	spin_unlock_irq(&mapping->tree_lock);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -23,6 +23,7 @@
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
 #include <linux/buffer_head.h>
+#include <linux/cpuset.h>
 #include "internal.h"
 
 
@@ -487,6 +488,12 @@ void generic_sync_sb_inodes(struct super_block *sb,
 			continue;		/* blockdev has wrong queue */
 		}
 
+		if (!cpuset_intersects_dirty_nodes(mapping, wbc->nodes)) {
+			/* No node pages under writeback */
+			requeue_io(inode);
+			continue;
+		}
+
 		/* Was this inode dirtied after sync_sb_inodes was called? */
 		if (time_after(inode->dirtied_when, start))
 			break;
diff --git a/fs/inode.c b/fs/inode.c
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -22,6 +22,7 @@
 #include <linux/bootmem.h>
 #include <linux/inotify.h>
 #include <linux/mount.h>
+#include <linux/cpuset.h>
 
 /*
  * This is needed for the following functions:
@@ -167,6 +168,7 @@ static struct inode *alloc_inode(struct super_block *sb)
 		mapping->assoc_mapping = NULL;
 		mapping->backing_dev_info = &default_backing_dev_info;
 		mapping->writeback_index = 0;
+		cpuset_init_dirty_nodes(mapping);
 
 		/*
 		 * If the block_device provides a backing_dev_info for client
@@ -271,6 +273,7 @@ void clear_inode(struct inode *inode)
 		bd_forget(inode);
 	if (S_ISCHR(inode->i_mode) && inode->i_cdev)
 		cd_forget(inode);
+	cpuset_clear_dirty_nodes(inode->i_mapping);
 	inode->i_state = I_CLEAR;
 }
 
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -80,6 +80,36 @@ extern int current_cpuset_is_being_rebound(void);
 
 extern void rebuild_sched_domains(void);
 
+/*
+ * We need macros since struct address_space is not defined yet
+ */
+#if MAX_NUMNODES <= BITS_PER_LONG
+#define cpuset_init_dirty_nodes(__mapping)				\
+	(__mapping)->dirty_nodes = NODE_MASK_NONE
+
+#define cpuset_update_dirty_nodes(__mapping, __page)			\
+	node_set(page_to_nid(__page), (__mapping)->dirty_nodes);
+
+#define cpuset_clear_dirty_nodes(__mapping)				\
+	(__mapping)->dirty_nodes = NODE_MASK_NONE
+
+#define cpuset_intersects_dirty_nodes(__mapping, __nodemask_ptr)	\
+	(!(__nodemask_ptr) ||						\
+	 nodes_intersects((__mapping)->dirty_nodes, *(__nodemask_ptr)))
+
+#else
+struct address_space;
+
+#define cpuset_init_dirty_nodes(__mapping)				\
+	(__mapping)->dirty_nodes = NULL
+
+extern void cpuset_update_dirty_nodes(struct address_space *mapping,
+				      struct page *page);
+extern void cpuset_clear_dirty_nodes(struct address_space *mapping);
+extern int cpuset_intersects_dirty_nodes(struct address_space *mapping,
+					 nodemask_t *mask);
+#endif
+
 #else /* !CONFIG_CPUSETS */
 
 static inline int cpuset_init_early(void) { return 0; }
@@ -163,6 +193,27 @@ static inline void rebuild_sched_domains(void)
 	partition_sched_domains(1, NULL, NULL);
 }
 
+struct address_space;
+
+static inline void cpuset_init_dirty_nodes(struct address_space *mapping)
+{
+}
+
+static inline void cpuset_update_dirty_nodes(struct address_space *mapping,
+					     struct page *page)
+{
+}
+
+static inline void cpuset_clear_dirty_nodes(struct address_space *mapping)
+{
+}
+
+static inline int cpuset_intersects_dirty_nodes(struct address_space *mapping,
+						nodemask_t *mask)
+{
+	return 1;
+}
+
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -544,6 +544,13 @@ struct address_space {
 	spinlock_t		private_lock;	/* for use by the address_space */
 	struct list_head	private_list;	/* ditto */
 	struct address_space	*assoc_mapping;	/* ditto */
+#ifdef CONFIG_CPUSETS
+#if MAX_NUMNODES <= BITS_PER_LONG
+	nodemask_t		dirty_nodes;	/* nodes with dirty pages */
+#else
+	nodemask_t		*dirty_nodes;	/* pointer to mask, if dirty */
+#endif
+#endif
 } __attribute__((aligned(sizeof(long))));
 	/*
 	 * On most architectures that alignment is already the case; but
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -72,6 +72,8 @@ struct writeback_control {
 	 * so we use a single control to update them
 	 */
 	unsigned no_nrwrite_index_update:1;
+
+	nodemask_t *nodes;		/* Nodemask to writeback */
 };
 
 /*
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -16,6 +16,7 @@
  *  2006 Rework by Paul Menage to use generic cgroups
  *  2008 Rework of the scheduler domains and CPU hotplug handling
  *       by Max Krasnyansky
+ *  2008 Cpuset writeback by Christoph Lameter
  *
  *  This file is subject to the terms and conditions of the GNU General Public
  *  License.  See the file COPYING in the main directory of the Linux
@@ -2323,6 +2324,58 @@ int cpuset_mem_spread_node(void)
 }
 EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
 
+#if MAX_NUMNODES > BITS_PER_LONG
+/*
+ * Special functions for NUMA systems with a large number of nodes.  The
+ * nodemask is pointed to from the address_space structure.  The attachment of
+ * the dirty_nodes nodemask is protected by the tree_lock.  The nodemask is
+ * freed only when the inode is cleared (and therefore unused, thus no locking
+ * is necessary).
+ */
+void cpuset_update_dirty_nodes(struct address_space *mapping,
+			       struct page *page)
+{
+	nodemask_t *nodes = mapping->dirty_nodes;
+	int node = page_to_nid(page);
+
+	if (!nodes) {
+		nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);
+		if (!nodes)
+			return;
+
+		*nodes = NODE_MASK_NONE;
+		mapping->dirty_nodes = nodes;
+	}
+	node_set(node, *nodes);
+}
+
+void cpuset_clear_dirty_nodes(struct address_space *mapping)
+{
+	nodemask_t *nodes = mapping->dirty_nodes;
+
+	if (nodes) {
+		mapping->dirty_nodes = NULL;
+		kfree(nodes);
+	}
+}
+
+/*
+ * Called without tree_lock.  The nodemask is only freed when the inode is
+ * cleared and therefore this is safe.
+ */
+int cpuset_intersects_dirty_nodes(struct address_space *mapping,
+				  nodemask_t *mask)
+{
+	nodemask_t *dirty_nodes = mapping->dirty_nodes;
+
+	if (!mask)
+		return 1;
+	if (!dirty_nodes)
+		return 0;
+	return nodes_intersects(*dirty_nodes, *mask);
+}
+#endif
+
 /**
  * cpuset_mems_allowed_intersects - Does @tsk1's mems_allowed intersect @tsk2's?
  * @tsk1: pointer to task_struct of some task.
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -34,6 +34,7 @@
 #include <linux/syscalls.h>
 #include <linux/buffer_head.h>
 #include <linux/pagevec.h>
+#include <linux/cpuset.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -1104,6 +1105,7 @@ int __set_page_dirty_nobuffers(struct page *page)
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
+		cpuset_update_dirty_nodes(mapping, page);
 		spin_unlock_irq(&mapping->tree_lock);
 		if (mapping->host) {
 			/* !PageAnon && !swapper_space */

* [patch 0/7] cpuset writeback throttling
@ 2008-10-30 19:23 David Rientjes
From: David Rientjes @ 2008-10-30 19:23 UTC
  To: Andrew Morton
  Cc: Christoph Lameter, Nick Piggin, Peter Zijlstra, Paul Menage,
	Derek Fults, linux-kernel

Andrew,

This is the revised cpuset writeback throttling patchset posted to LKML
on Tuesday, October 28.

The comments from Peter Zijlstra have been addressed.  His concurrent 
page cache patchset is not currently in -mm, so we can still serialize 
updating a struct address_space's dirty_nodes on its tree_lock.  When his 
patchset is merged, the patch at the end of this message can be used to 
introduce the necessary synchronization.

This patchset applies cleanly to 2.6.28-rc2-mm1, with the exception of the
first patch, which conflicts with the alloc_inode() refactoring into
inode_init_always() in e9110864c440736beb484c2c74dedc307168b14e from
linux-next and with the additions to include/linux/cpuset.h from
oom-print-triggering-tasks-cpuset-and-mems-allowed.patch (oops :).

Please consider this for inclusion in the -mm tree.

A simple way of testing this change is to create a large file that exceeds
the amount of memory allocated to a specific cpuset, then mmap and modify
the large file (such as in the following program) while running a
latency-sensitive task in a disjoint cpuset.  Notice that writeout is
throttled within the writer's cpuset and does not interfere with the
latency-sensitive task.  A sketch of one possible cpuset setup follows the
program.

#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
	void *addr;
	unsigned long length;
	unsigned long i;
	int fd;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <filename> <length>\n",
			argv[0]);
		exit(1);
	}

	fd = open(argv[1], O_RDWR, 0644);
	if (fd < 0) {
		fprintf(stderr, "Cannot open file %s\n", argv[1]);
		exit(1);
	}

	length = strtoul(argv[2], NULL, 0);
	if (!length) {
		fprintf(stderr, "Invalid length %s\n", argv[2]);
		exit(1);
	}

	addr = mmap(0, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
		    0);
	if (addr == MAP_FAILED) {
		fprintf(stderr, "mmap() failed\n");
		exit(1);
	}

	for (;;) {
		for (i = 0; i < length; i++)
			(*(char *)(addr + i))++;
		msync(addr, length, MS_SYNC);
	}
	return 0;
}
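
For completeness, here is an illustrative sketch (not part of the patchset)
of confining the program above to a small cpuset.  It assumes the cpuset
filesystem is mounted at /dev/cpuset, dedicates cpu 0 and memory node 0 to
the writer, and execs the program above as ./writer; the cpuset name, the
cpu and node numbers, and the binary path are placeholders only.  The
latency-sensitive task would be placed in a disjoint cpuset the same way.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* write a short string to a cpuset control file, exiting on failure */
static void write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, val, strlen(val)) != (ssize_t)strlen(val)) {
		perror(path);
		exit(1);
	}
	close(fd);
}

int main(int argc, char **argv)
{
	char buf[32];

	if (argc != 3) {
		fprintf(stderr, "usage: %s <filename> <length>\n", argv[0]);
		exit(1);
	}

	/* create a cpuset restricted to cpu 0 and memory node 0 */
	if (mkdir("/dev/cpuset/writer", 0755) && errno != EEXIST) {
		perror("/dev/cpuset/writer");
		exit(1);
	}
	write_str("/dev/cpuset/writer/cpus", "0");
	write_str("/dev/cpuset/writer/mems", "0");

	/* move this task into the cpuset, then exec the writer above */
	snprintf(buf, sizeof(buf), "%d", getpid());
	write_str("/dev/cpuset/writer/tasks", buf);

	execl("./writer", "writer", argv[1], argv[2], (char *)NULL);
	perror("execl");
	return 1;
}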



Once the struct address_space's tree_lock is removed, the following patch
can be applied to protect the attachment of mapping->dirty_nodes.
---
diff --git a/fs/inode.c b/fs/inode.c
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -223,6 +223,9 @@ void inode_init_once(struct inode *inode)
 	INIT_LIST_HEAD(&inode->inotify_watches);
 	mutex_init(&inode->inotify_mutex);
 #endif
+#if MAX_NUMNODES > BITS_PER_LONG
+	spin_lock_init(&inode->i_data.dirty_nodes_lock);
+#endif
 }
 
 EXPORT_SYMBOL(inode_init_once);
diff --git a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -554,6 +554,7 @@ struct address_space {
 	nodemask_t		dirty_nodes;	/* nodes with dirty pages */
 #else
 	nodemask_t		*dirty_nodes;	/* pointer to mask, if dirty */
+	spinlock_t		dirty_nodes_lock; /* protects the above */
 #endif
 #endif
 } __attribute__((aligned(sizeof(long))));
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2413,25 +2413,27 @@ EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
 #if MAX_NUMNODES > BITS_PER_LONG
 /*
  * Special functions for NUMA systems with a large number of nodes.  The
- * nodemask is pointed to from the address_space structure.  The attachment of
- * the dirty_nodes nodemask is protected by the tree_lock.  The nodemask is
- * freed only when the inode is cleared (and therefore unused, thus no locking
- * is necessary).
+ * nodemask is pointed to from the address_space structure.
  */
 void cpuset_update_dirty_nodes(struct address_space *mapping,
 			       struct page *page)
 {
-	nodemask_t *nodes = mapping->dirty_nodes;
+	nodemask_t *nodes;
 	int node = page_to_nid(page);
 
+	spin_lock_irq(&mapping->dirty_nodes_lock);
+	nodes = mapping->dirty_nodes;
 	if (!nodes) {
 		nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);
-		if (!nodes)
+		if (!nodes) {
+			spin_unlock_irq(&mapping->dirty_nodes_lock);
 			return;
+		}
 
 		*nodes = NODE_MASK_NONE;
 		mapping->dirty_nodes = nodes;
 	}
+	spin_unlock_irq(&mapping->dirty_nodes_lock);
 	node_set(node, *nodes);
 }
 
@@ -2446,8 +2448,8 @@ void cpuset_clear_dirty_nodes(struct address_space *mapping)
 }
 
 /*
- * Called without tree_lock.  The nodemask is only freed when the inode is
- * cleared and therefore this is safe.
+ * The nodemask is only freed when the inode is cleared and therefore this
+ * requires no locking.
  */
 int cpuset_intersects_dirty_nodes(struct address_space *mapping,
 				  nodemask_t *mask)
