LKML Archive on lore.kernel.org
* [PATCH] change global zonelist order v4 [0/2]
@ 2007-04-27  5:45 KAMEZAWA Hiroyuki
  2007-04-27  6:04 ` [PATCH] change global zonelist order v4 [1/2] change zonelist ordering KAMEZAWA Hiroyuki
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-04-27  5:45 UTC (permalink / raw)
  To: LKML; +Cc: Linux-MM, AKPM, Christoph Lameter, Lee.Schermerhorn

Hi, this is version 4, including Lee Schermerhorn's rework and
automatic configuration at boot time.

(This patch is reworked from V2, so the V3 changelog is skipped.)

ChangeLog V2 -> V4
- automatic configuration is added.
- automatic configuration is now the default.
- relaxed_zone_order is renamed to numa_zonelist_order;
  you can specify the values "default", "node", and "zone".
- clean-up from Lee Schermerhorn.
- the patch is separated into "base" and "autoconfiguration algorithm".

Changelog from V1 -> V2
- the sysctl name is changed to relaxed_zone_order.
- NORMAL->NORMAL->....->DMA->DMA->DMA order (new ordering) is now the default.
  NORMAL->DMA->NORMAL->DMA order (old ordering) is optional.
- added a boot option to set relaxed_zone_order; ia64 is supported now.
- added documentation.


Please don't hesitate to rework this if you have a good plan.
I'll be offline next week because my office will be closed.
Lee-san, please Ack or Sign-off if the patches seem O.K.

I think my autoconfiguration logic is reasonable to some extent, but it
may need some discussion.  It can easily be rewritten by an additional patch.

Thanks.
-Kame



* [PATCH] change global zonelist order v4 [1/2] change zonelist ordering.
  2007-04-27  5:45 [PATCH] change global zonelist order v4 [0/2] KAMEZAWA Hiroyuki
@ 2007-04-27  6:04 ` KAMEZAWA Hiroyuki
  2007-04-30 16:12   ` Lee Schermerhorn
  2007-04-27  6:17 ` [PATCH] change global zonelist order v4 [2/2] auto configuration KAMEZAWA Hiroyuki
  2007-05-04  5:47 ` [PATCH] change global zonelist order v4 [0/2] Andrew Morton
  2 siblings, 1 reply; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-04-27  6:04 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, akpm, clameter, Lee.Schermerhorn


Make the zonelist creation policy selectable from sysctl (v4).
Automatic configuration itself is provided by the next patch.

[Description]
Assume a 2-node NUMA system where only node(0) has ZONE_DMA.
(ia64's ZONE_DMA is below 4GB, like x86_64's ZONE_DMA32.)

In this case, the current default zonelist order (for node(0)) is

Node(0)'s NORMAL -> Node(0)'s DMA -> Node(1)'s NORMAL.

This means node(0)'s DMA will be used before node(1)'s NORMAL.

This patch changes the *default* zone order to

Node(0)'s NORMAL -> Node(1)'s NORMAL -> Node(0)'s DMA.

But if node(0)'s memory is too small (near or below 4GB), node(0)'s
processes have to allocate memory from node(1) even if there is free
memory on node(0).  Some applications/users will dislike this.
This patch adds a knob to change the zonelist ordering.

[What this patch adds]

command:
% echo N > /proc/sys/vm/numa_zonelist_order

will rebuild the zonelists in the following order (old style, NODE order):

Node(0)'s NORMAL -> Node(0)'s DMA -> Node(1)'s NORMAL.

This puts more priority on locality.

command:
% echo Z > /proc/sys/vm/numa_zonelist_order

will rebuild the zonelists in the following order (new style, ZONE order):

Node(0)'s NORMAL -> Node(1)'s NORMAL -> Node(0)'s DMA.

This puts more priority on zone type.
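As a hedged illustration of the two orderings above (a toy userspace
sketch, not the kernel code), the 2-node layout from the description --
node 0 with DMA and NORMAL, node 1 with NORMAL only -- can be hard-coded
and the two fallback lists built side by side:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy model: 2 nodes, zone 0 = DMA, zone 1 = NORMAL.
 * present[node][zone] marks populated zones; node 1 has no DMA. */
static const int present[2][2] = { {1, 1}, {0, 1} };

static void append_zone(char *buf, int node, int zone)
{
	char tmp[32];
	sprintf(tmp, "N%d-%s ", node, zone ? "NORMAL" : "DMA");
	strcat(buf, tmp);
}

/* Node order: nodes by distance, highest zone type first within each. */
static void build_node_order(char *buf)
{
	int node, zone;

	buf[0] = '\0';
	for (node = 0; node < 2; node++)
		for (zone = 1; zone >= 0; zone--)
			if (present[node][zone])
				append_zone(buf, node, zone);
}

/* Zone order: highest zone type first, all nodes within each zone. */
static void build_zone_order(char *buf)
{
	int node, zone;

	buf[0] = '\0';
	for (zone = 1; zone >= 0; zone--)
		for (node = 0; node < 2; node++)
			if (present[node][zone])
				append_zone(buf, node, zone);
}
```

build_node_order() yields "N0-NORMAL N0-DMA N1-NORMAL " and
build_zone_order() yields "N0-NORMAL N1-NORMAL N0-DMA ", matching the
echo N / echo Z orderings shown above.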

You can also specify this option as a boot parameter.

Because the autoconfig function does nothing yet, the default is "node" order.

Tested on an ia64 2-node NUMA machine; works well.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 Documentation/kernel-parameters.txt |   10 +
 Documentation/sysctl/vm.txt         |   32 ++++++
 include/linux/mmzone.h              |    5 
 kernel/sysctl.c                     |    9 +
 mm/page_alloc.c                     |  185 ++++++++++++++++++++++++++++++++----
 5 files changed, 225 insertions(+), 16 deletions(-)

Index: linux-2.6.21-rc7-mm2/kernel/sysctl.c
===================================================================
--- linux-2.6.21-rc7-mm2.orig/kernel/sysctl.c
+++ linux-2.6.21-rc7-mm2/kernel/sysctl.c
@@ -893,6 +893,15 @@ static ctl_table vm_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one_hundred,
 	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "numa_zonelist_order",
+		.data		= &numa_zonelist_order,
+		.maxlen		= NUMA_ZONELIST_ORDER_LEN,
+		.mode		= 0644,
+		.proc_handler	= &numa_zonelist_order_handler,
+		.strategy	= &sysctl_string,
+	},
 #endif
 #if defined(CONFIG_X86_32) || \
    (defined(CONFIG_SUPERH) && defined(CONFIG_VSYSCALL))
Index: linux-2.6.21-rc7-mm2/mm/page_alloc.c
===================================================================
--- linux-2.6.21-rc7-mm2.orig/mm/page_alloc.c
+++ linux-2.6.21-rc7-mm2/mm/page_alloc.c
@@ -2024,7 +2024,8 @@ void show_free_areas(void)
  * Add all populated zones of a node to the zonelist.
  */
 static int __meminit build_zonelists_node(pg_data_t *pgdat,
-			struct zonelist *zonelist, int nr_zones, enum zone_type zone_type)
+			struct zonelist *zonelist, int nr_zones,
+			enum zone_type zone_type)
 {
 	struct zone *zone;
 
@@ -2045,7 +2046,7 @@ static int __meminit build_zonelists_nod
 
 #ifdef CONFIG_NUMA
 #define MAX_NODE_LOAD (num_online_nodes())
-static int __meminitdata node_load[MAX_NUMNODES];
+static int node_load[MAX_NUMNODES];
 /**
  * find_next_best_node - find the next node that should appear in a given node's fallback list
  * @node: node whose fallback list we're appending
@@ -2060,7 +2061,7 @@ static int __meminitdata node_load[MAX_N
  * on them otherwise.
  * It returns -1 if no node is found.
  */
-static int __meminit find_next_best_node(int node, nodemask_t *used_node_mask)
+static int find_next_best_node(int node, nodemask_t *used_node_mask)
 {
 	int n, val;
 	int min_val = INT_MAX;
@@ -2106,13 +2107,124 @@ static int __meminit find_next_best_node
 	return best_node;
 }
 
-static void __meminit build_zonelists(pg_data_t *pgdat)
+/*
+ * numa_zonelist_order:
+ *  0 = automatic detection of better ordering.
+ *  1 = order by ([node] distance, -zonetype)
+ *  2 = order by (-zonetype, [node] distance)
+ */
+#define ZONELIST_ORDER_AUTO	0
+#define ZONELIST_ORDER_NODE	1
+#define ZONELIST_ORDER_ZONE	2
+static int zonelist_order = 0;
+
+/*
+ * command line option "numa_zonelist_order"
+ *	= "[dD]efault"|"0"	- default, automatic configuration.
+ *	= "[nN]ode"|"1" 	- order by node locality,
+ *         			  then zone within node.
+ *	= "[zZ]one"|"2" - order by zone, then by locality within zone
+ */
+char numa_zonelist_order[NUMA_ZONELIST_ORDER_LEN] = "default";
+
+static int __parse_numa_zonelist_order(char *s)
+{
+	if (*s == 'd' || *s == 'D' || *s == '0') {
+		strncpy(numa_zonelist_order, "default",
+					NUMA_ZONELIST_ORDER_LEN);
+		zonelist_order = ZONELIST_ORDER_AUTO;
+	} else if (*s == 'n' || *s == 'N' || *s == '1') {
+		strncpy(numa_zonelist_order, "node",
+					NUMA_ZONELIST_ORDER_LEN);
+		zonelist_order = ZONELIST_ORDER_NODE;
+	} else if (*s == 'z' || *s == 'Z' || *s == '2') {
+		strncpy(numa_zonelist_order, "zone",
+					NUMA_ZONELIST_ORDER_LEN);
+		zonelist_order = ZONELIST_ORDER_ZONE;
+	} else {
+		printk(KERN_WARNING
+			"Ignoring invalid numa_zonelist_order value:  "
+			"%s\n", s);
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static __init int setup_numa_zonelist_order(char *s)
+{
+	if (s)
+		return __parse_numa_zonelist_order(s);
+	return 0;
+}
+early_param("numa_zonelist_order", setup_numa_zonelist_order);
+
+/*
+ * Build zonelists ordered by node and zones within node.
+ * This results in maximum locality--normal zone overflows into local
+ * DMA zone, if any--but risks exhausting DMA zone.
+ */
+static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
 {
-	int j, node, local_node;
 	enum zone_type i;
-	int prev_node, load;
+	int j;
+	struct zonelist *zonelist;
+
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		zonelist = pgdat->node_zonelists + i;
+		for (j = 0; zonelist->zones[j] != NULL; j++);
+
+ 		j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
+		zonelist->zones[j] = NULL;
+	}
+}
+
+/*
+ * Build zonelists ordered by zone and nodes within zones.
+ * This results in conserving DMA zone[s] until all Normal memory is
+ * exhausted, but results in overflowing to remote node while memory
+ * may still exist in local DMA zone.
+ */
+static int node_order[MAX_NUMNODES];
+
+static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
+{
+	enum zone_type i;
+	int pos, j, node;
+	int zone_type;		/* needs to be signed */
+	struct zone *z;
 	struct zonelist *zonelist;
+
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		zonelist = pgdat->node_zonelists + i;
+		pos = 0;
+		for (zone_type = i; zone_type >= 0; zone_type--) {
+			for (j = 0; j < nr_nodes; j++) {
+				node = node_order[j];
+				z = &NODE_DATA(node)->node_zones[zone_type];
+				if (populated_zone(z))
+					zonelist->zones[pos++] = z;
+			}
+		}
+		zonelist->zones[pos] = NULL;
+	}
+}
+
+static int estimate_zonelist_order(void)
+{
+	/* dummy, just select node order. */
+	return ZONELIST_ORDER_NODE;
+}
+
+
+
+static void build_zonelists(pg_data_t *pgdat)
+{
+	int j, node, load;
+	enum zone_type i;
 	nodemask_t used_mask;
+	int local_node, prev_node;
+	struct zonelist *zonelist;
+	int ordering;
 
 	/* initialize zonelists */
 	for (i = 0; i < MAX_NR_ZONES; i++) {
@@ -2120,11 +2232,18 @@ static void __meminit build_zonelists(pg
 		zonelist->zones[0] = NULL;
 	}
 
+	ordering = zonelist_order;
+	if (ordering == ZONELIST_ORDER_AUTO)
+		ordering = estimate_zonelist_order();
 	/* NUMA-aware ordering of nodes */
 	local_node = pgdat->node_id;
 	load = num_online_nodes();
 	prev_node = local_node;
 	nodes_clear(used_mask);
+
+	memset(node_order, 0, sizeof(node_order));
+	j = 0;
+
 	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
 		int distance = node_distance(local_node, node);
 
@@ -2140,18 +2259,20 @@ static void __meminit build_zonelists(pg
 		 * So adding penalty to the first node in same
 		 * distance group to make it round-robin.
 		 */
-
 		if (distance != node_distance(local_node, prev_node))
-			node_load[node] += load;
+			node_load[node] = load;
+
 		prev_node = node;
 		load--;
-		for (i = 0; i < MAX_NR_ZONES; i++) {
-			zonelist = pgdat->node_zonelists + i;
-			for (j = 0; zonelist->zones[j] != NULL; j++);
+		if (ordering == ZONELIST_ORDER_NODE)	/* default */
+			build_zonelists_in_node_order(pgdat, node);
+		else
+			node_order[j++] = node;	/* remember order */
+	}
 
-	 		j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
-			zonelist->zones[j] = NULL;
-		}
+	if (ordering == ZONELIST_ORDER_ZONE) {
+		/* calculate node order -- i.e., DMA last! */
+		build_zonelists_in_zone_order(pgdat, j);
 	}
 }
 
@@ -2173,6 +2294,37 @@ static void __meminit build_zonelist_cac
 	}
 }
 
+/*
+ * sysctl handler for numa_zonelist_order
+ */
+int numa_zonelist_order_handler(ctl_table *table, int write,
+		struct file *file, void __user *buffer, size_t *length,
+		loff_t *ppos)
+{
+	char saved_string[NUMA_ZONELIST_ORDER_LEN];
+	int ret;
+
+	if (write)
+		strncpy(saved_string, (char*)table->data,
+			NUMA_ZONELIST_ORDER_LEN);
+	ret = proc_dostring(table, write, file, buffer, length, ppos);
+	if (ret)
+		return ret;
+	if (write) {
+		int oldval = zonelist_order;
+		if (__parse_numa_zonelist_order((char*)table->data)) {
+			/*
+			 * bogus value.  restore saved string
+			 */
+			strncpy((char*)table->data, saved_string,
+				NUMA_ZONELIST_ORDER_LEN);
+			zonelist_order = oldval;
+		} else if (oldval != zonelist_order)
+			build_all_zonelists();
+	}
+	return 0;
+}
+
 #else	/* CONFIG_NUMA */
 
 static void __meminit build_zonelists(pg_data_t *pgdat)
@@ -2222,7 +2374,7 @@ static void __meminit build_zonelist_cac
 #endif	/* CONFIG_NUMA */
 
 /* return values int ....just for stop_machine_run() */
-static int __meminit __build_all_zonelists(void *dummy)
+static int __build_all_zonelists(void *dummy)
 {
 	int nid;
 
@@ -2233,12 +2385,13 @@ static int __meminit __build_all_zonelis
 	return 0;
 }
 
-void __meminit build_all_zonelists(void)
+void build_all_zonelists(void)
 {
 	if (system_state == SYSTEM_BOOTING) {
 		__build_all_zonelists(NULL);
 		cpuset_init_current_mems_allowed();
 	} else {
+		memset(node_load, 0, sizeof(node_load));
 		/* we have to stop all cpus to guaranntee there is no user
 		   of zonelist */
 		stop_machine_run(__build_all_zonelists, NULL, NR_CPUS);
Index: linux-2.6.21-rc7-mm2/include/linux/mmzone.h
===================================================================
--- linux-2.6.21-rc7-mm2.orig/include/linux/mmzone.h
+++ linux-2.6.21-rc7-mm2/include/linux/mmzone.h
@@ -608,6 +608,11 @@ int sysctl_min_unmapped_ratio_sysctl_han
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 			struct file *, void __user *, size_t *, loff_t *);
 
+extern int numa_zonelist_order_handler(struct ctl_table *, int,
+			struct file *, void __user *, size_t *, loff_t *);
+extern char numa_zonelist_order[];
+#define NUMA_ZONELIST_ORDER_LEN 16	/* string buffer size */
+
 #include <linux/topology.h>
 /* Returns the number of the current Node. */
 #ifndef numa_node_id
Index: linux-2.6.21-rc7-mm2/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.21-rc7-mm2.orig/Documentation/kernel-parameters.txt
+++ linux-2.6.21-rc7-mm2/Documentation/kernel-parameters.txt
@@ -1500,6 +1500,16 @@ and is between 256 and 4096 characters. 
 			Format: <reboot_mode>[,<reboot_mode2>[,...]]
 			See arch/*/kernel/reboot.c or arch/*/kernel/process.c			
 
+	numa_zonelist_order [KNL,BOOT]
+			Select the memory allocation zonelist order for NUMA
+			platforms.  Default is automatic configuration.
+			"Node order" orders the zonelists by node [locality],
+			then by zone within each node.  "Zone order" orders
+			the zonelists by zone, then by node within each zone.
+			This moves the DMA zone, if any, to the end of the
+			allocation lists.
+			See also Documentation/sysctl/vm.txt
+
 	reserve=	[KNL,BUGS] Force the kernel to ignore some iomem area
 
 	reservetop=	[X86-32]
Index: linux-2.6.21-rc7-mm2/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.21-rc7-mm2.orig/Documentation/sysctl/vm.txt
+++ linux-2.6.21-rc7-mm2/Documentation/sysctl/vm.txt
@@ -34,6 +34,7 @@ Currently, these files are in /proc/sys/
 - swap_prefetch
 - readahead_ratio
 - readahead_hit_rate
+- numa_zonelist_order
 
 ==============================================================
 
@@ -275,3 +276,34 @@ Possible values can be:
 The larger value, the more capabilities, with more possible overheads.
 
 The default value is 1.
+
+=============================================================
+
+numa_zonelist_order
+
+This sysctl is only for NUMA.
+
+numa_zonelist_order selects the order of the memory allocation zonelists.
+"Node order" orders the zonelists by node, then by zone within each node.
+The default is automatic configuration.  Specify "[Dd]efault" or "0"
+to request automatic configuration.
+
+ For example, assume a 2-node NUMA system.  The "node order" kernel
+memory allocation order on Node(0) will be:
+
+	Node(0)NORMAL -> Node(0)DMA -> Node(1)NORMAL -> Node(1)DMA(if any)
+
+Thus, allocations that request Node(0) NORMAL may overflow onto Node(0)DMA
+first.  This provides maximum locality, but risks exhausting all of DMA
+memory while NORMAL memory exists elsewhere on the system.  This can result
+in OOM-KILL in ZONE_DMA.  Specify "[Nn]ode" or "1" to request node order.
+
+If numa_zonelist_order is set to "zone" order, the kernel memory allocation
+order on Node(0) becomes:
+
+	Node(0)NORMAL -> Node(1)NORMAL -> Node(0)DMA -> Node(1)DMA(if any)
+
+In this mode, DMA memory will be used in place of NORMAL memory only when
+all NORMAL zones are exhausted.  Specify "[Zz]one" or "2" for zone order.
+
+The default value is 0.



* [PATCH] change global zonelist order v4 [2/2] auto configuration
  2007-04-27  5:45 [PATCH] change global zonelist order v4 [0/2] KAMEZAWA Hiroyuki
  2007-04-27  6:04 ` [PATCH] change global zonelist order v4 [1/2] change zonelist ordering KAMEZAWA Hiroyuki
@ 2007-04-27  6:17 ` KAMEZAWA Hiroyuki
  2007-04-30 16:26   ` Lee Schermerhorn
  2007-05-04  5:47 ` [PATCH] change global zonelist order v4 [0/2] Andrew Morton
  2 siblings, 1 reply; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-04-27  6:17 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, akpm, clameter, Lee.Schermerhorn

Add automatic zone-ordering configuration.

This function selects ZONELIST_ORDER_NODE when:

- there is no ZONE_DMA/DMA32, or memory exists only in ZONE_DMA/DMA32
  (e.g. ppc, which has only a DMA zone), or
- the size of ZONE_DMA/DMA32 > (system total memory)/2, or
- there is a node A such that
	* node A's total memory > (system total memory)/(num_of_nodes + 1), and
	* node A's ZONE_DMA/DMA32 occupies more than 60% of node A's memory.

Otherwise, ZONELIST_ORDER_ZONE is selected.

Note: a user can override this ordering with the boot option.
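The selection rule above can be sketched as a standalone userspace
function (a hedged illustration of the heuristic, not the kernel
implementation; pick_order(), node_total[] and node_dma[] are invented
names standing in for per-node present_pages sums):

```c
#include <assert.h>

#define ORDER_NODE 1
#define ORDER_ZONE 2

/* Hypothetical sketch of the auto-selection rule described above. */
static int pick_order(const unsigned long *node_total,
		      const unsigned long *node_dma, int nr_nodes)
{
	unsigned long total = 0, dma = 0, avg;
	int n;

	for (n = 0; n < nr_nodes; n++) {
		total += node_total[n];
		dma += node_dma[n];
	}
	/* No DMA zone at all, or DMA/DMA32 is more than half of memory. */
	if (dma == 0 || dma > total / 2)
		return ORDER_NODE;

	/* A slightly pessimistic per-node average size. */
	avg = total / (nr_nodes + 1);
	for (n = 0; n < nr_nodes; n++) {
		if (node_total[n] > avg &&	/* ignore unbalanced nodes */
		    node_dma[n] > node_total[n] * 60 / 100)
			return ORDER_NODE;
	}
	return ORDER_ZONE;
}
```

For a balanced 2-node machine with a small DMA share this returns
ORDER_ZONE; if one well-sized node is mostly DMA, or there is no
DMA zone at all, it returns ORDER_NODE.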

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 mm/page_alloc.c |   44 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

Index: linux-2.6.21-rc7-mm2/mm/page_alloc.c
===================================================================
--- linux-2.6.21-rc7-mm2.orig/mm/page_alloc.c	2007-04-27 15:39:49.000000000 +0900
+++ linux-2.6.21-rc7-mm2/mm/page_alloc.c	2007-04-27 15:55:51.000000000 +0900
@@ -2211,8 +2211,50 @@
 
 static int estimate_zonelist_order(void)
 {
-	/* dummy, just select node order. */
-	return ZONELIST_ORDER_NODE;
+	int nid, zone_type;
+	unsigned long low_kmem_size, total_size;
+	struct zone *z;
+	unsigned long average_size;
+	/* ZONE_DMA and ZONE_DMA32 can be very small areas in the system.
+	   If they are really small and used heavily,
+	   the system can fall into OOM very easily.
+	   This function detects the ZONE_DMA/DMA32 size and configures
+	   the zone ordering. */
+	/* Is there ZONE_NORMAL ? (ex. ppc has only DMA zone..) */
+	low_kmem_size = 0;
+	total_size = 0;
+	for_each_online_node(nid) {
+		for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
+			z = &NODE_DATA(nid)->node_zones[zone_type];
+			if (populated_zone(z)) {
+				if (zone_type < ZONE_NORMAL)
+					low_kmem_size += z->present_pages;
+				total_size += z->present_pages;
+			}
+		}
+	}
+	if (!low_kmem_size ||  /* there is no DMA area. */
+	    low_kmem_size > total_size/2)	/* DMA/DMA32 is big. */
+		return ZONELIST_ORDER_NODE;
+	/* look into each node's config, where all processes start... */
+	/* average size... a bit smaller than the real average size */
+	average_size = total_size / (num_online_nodes() + 1);
+	for_each_online_node(nid) {
+		low_kmem_size = 0;
+		total_size = 0;
+		for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
+			z = &NODE_DATA(nid)->node_zones[zone_type];
+			if (populated_zone(z)) {
+				if (zone_type < ZONE_NORMAL)
+					low_kmem_size += z->present_pages;
+				total_size += z->present_pages;
+			}
+		}
+		if (total_size > average_size && /* ignore unbalanced node */
+		    low_kmem_size > total_size * 60/100)
+			return ZONELIST_ORDER_NODE;
+	}
+	return ZONELIST_ORDER_ZONE;
 }
 
 



* Re: [PATCH] change global zonelist order v4 [1/2] change zonelist ordering.
  2007-04-27  6:04 ` [PATCH] change global zonelist order v4 [1/2] change zonelist ordering KAMEZAWA Hiroyuki
@ 2007-04-30 16:12   ` Lee Schermerhorn
  0 siblings, 0 replies; 13+ messages in thread
From: Lee Schermerhorn @ 2007-04-30 16:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel, linux-mm, akpm, clameter

On Fri, 2007-04-27 at 15:04 +0900, KAMEZAWA Hiroyuki wrote:
> [changelog snipped]

sysctl documentation could use a bit of cleanup [spelling, inconsistent
statements re:  default ordering], but this can be addressed in a
separate patch.  Code looks/tested OK.

Acked-by:  Lee.Schermerhorn <lee.schermerhorn@hp.com>
> 
> ---
>  Documentation/kernel-parameters.txt |   10 +
>  Documentation/sysctl/vm.txt         |   32 ++++++
>  include/linux/mmzone.h              |    5 
>  kernel/sysctl.c                     |    9 +
>  mm/page_alloc.c                     |  185 ++++++++++++++++++++++++++++++++----
>  5 files changed, 225 insertions(+), 16 deletions(-)
> 
> Index: linux-2.6.21-rc7-mm2/kernel/sysctl.c
> ===================================================================
> --- linux-2.6.21-rc7-mm2.orig/kernel/sysctl.c
> +++ linux-2.6.21-rc7-mm2/kernel/sysctl.c
> @@ -893,6 +893,15 @@ static ctl_table vm_table[] = {
>  		.extra1		= &zero,
>  		.extra2		= &one_hundred,
>  	},
> +	{
> +		.ctl_name	= CTL_UNNUMBERED,
> +		.procname	= "numa_zonelist_order",
> +		.data		= &numa_zonelist_order,
> +		.maxlen		= NUMA_ZONELIST_ORDER_LEN,
> +		.mode		= 0644,
> +		.proc_handler	= &numa_zonelist_order_handler,
> +		.strategy	= &sysctl_string,
> +	},
>  #endif
>  #if defined(CONFIG_X86_32) || \
>     (defined(CONFIG_SUPERH) && defined(CONFIG_VSYSCALL))
> Index: linux-2.6.21-rc7-mm2/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.21-rc7-mm2.orig/mm/page_alloc.c
> +++ linux-2.6.21-rc7-mm2/mm/page_alloc.c
> @@ -2024,7 +2024,8 @@ void show_free_areas(void)
>   * Add all populated zones of a node to the zonelist.
>   */
>  static int __meminit build_zonelists_node(pg_data_t *pgdat,
> -			struct zonelist *zonelist, int nr_zones, enum zone_type zone_type)
> +			struct zonelist *zonelist, int nr_zones,
> +			enum zone_type zone_type)
>  {
>  	struct zone *zone;
>  
> @@ -2045,7 +2046,7 @@ static int __meminit build_zonelists_nod
>  
>  #ifdef CONFIG_NUMA
>  #define MAX_NODE_LOAD (num_online_nodes())
> -static int __meminitdata node_load[MAX_NUMNODES];
> +static int node_load[MAX_NUMNODES];
>  /**
>   * find_next_best_node - find the next node that should appear in a given node's fallback list
>   * @node: node whose fallback list we're appending
> @@ -2060,7 +2061,7 @@ static int __meminitdata node_load[MAX_N
>   * on them otherwise.
>   * It returns -1 if no node is found.
>   */
> -static int __meminit find_next_best_node(int node, nodemask_t *used_node_mask)
> +static int find_next_best_node(int node, nodemask_t *used_node_mask)
>  {
>  	int n, val;
>  	int min_val = INT_MAX;
> @@ -2106,13 +2107,124 @@ static int __meminit find_next_best_node
>  	return best_node;
>  }
>  
> -static void __meminit build_zonelists(pg_data_t *pgdat)
> +/*
> + * numa_zonelist_order:
> + *  0 = automatic detection of better ordering.
> + *  1 = order by ([node] distance, -zonetype)
> + *  2 = order by (-zonetype, [node] distance)
> + */
> +#define ZONELIST_ORDER_AUTO	0
> +#define ZONELIST_ORDER_NODE	1
> +#define ZONELIST_ORDER_ZONE	2
> +static int zonelist_order = 0;
> +
> +/*
> + * command line option "numa_zonelist_order"
> + *	= "[dD]efault | "0"	- default, automatic configuration.
> + *	= "[nN]ode"|"1" 	- order by node locality,
> + *         			  then zone within node.
> + *	= "[zZ]one"|"2" - order by zone, then by locality within zone
> + */
> +char numa_zonelist_order[NUMA_ZONELIST_ORDER_LEN] = "default";
> +
> +static int __parse_numa_zonelist_order(char *s)
> +{
> +	if (*s == 'd' || *s == 'D' || *s == '0') {
> +		strncpy(numa_zonelist_order, "default",
> +					NUMA_ZONELIST_ORDER_LEN);
> +		zonelist_order = ZONELIST_ORDER_AUTO;
> +	} else if (*s == 'n' || *s == 'N' || *s == '1') {
> +		strncpy(numa_zonelist_order, "node",
> +					NUMA_ZONELIST_ORDER_LEN);
> +		zonelist_order = ZONELIST_ORDER_NODE;
> +	} else if (*s == 'z' || *s == 'Z' || *s == '2') {
> +		strncpy(numa_zonelist_order, "zone",
> +					NUMA_ZONELIST_ORDER_LEN);
> +		zonelist_order = ZONELIST_ORDER_ZONE;
> +	} else {
> +		printk(KERN_WARNING
> +			"Ignoring invalid numa_zonelist_order value:  "
> +			"%s\n", s);
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
> +static __init int setup_numa_zonelist_order(char *s)
> +{
> +	if (s)
> +		return __parse_numa_zonelist_order(s);
> +	return 0;
> +}
> +early_param("numa_zonelist_order", setup_numa_zonelist_order);
> +
> +/*
> + * Build zonelists ordered by node and zones within node.
> + * This results in maximum locality--normal zone overflows into local
> + * DMA zone, if any--but risks exhausting DMA zone.
> + */
> +static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
>  {
> -	int j, node, local_node;
>  	enum zone_type i;
> -	int prev_node, load;
> +	int j;
> +	struct zonelist *zonelist;
> +
> +	for (i = 0; i < MAX_NR_ZONES; i++) {
> +		zonelist = pgdat->node_zonelists + i;
> +		for (j = 0; zonelist->zones[j] != NULL; j++);
> +
> + 		j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
> +		zonelist->zones[j] = NULL;
> +	}
> +}
> +
> +/*
> + * Build zonelists ordered by zone and nodes within zones.
> + * This results in conserving DMA zone[s] until all Normal memory is
> + * exhausted, but results in overflowing to remote node while memory
> + * may still exist in local DMA zone.
> + */
> +static int node_order[MAX_NUMNODES];
> +
> +static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
> +{
> +	enum zone_type i;
> +	int pos, j, node;
> +	int zone_type;		/* needs to be signed */
> +	struct zone *z;
>  	struct zonelist *zonelist;
> +
> +	for (i = 0; i < MAX_NR_ZONES; i++) {
> +		zonelist = pgdat->node_zonelists + i;
> +		pos = 0;
> +		for (zone_type = i; zone_type >= 0; zone_type--) {
> +			for (j = 0; j < nr_nodes; j++) {
> +				node = node_order[j];
> +				z = &NODE_DATA(node)->node_zones[zone_type];
> +				if (populated_zone(z))
> +					zonelist->zones[pos++] = z;
> +			}
> +		}
> +		zonelist->zones[pos] = NULL;
> +	}
> +}
> +
> +static int estimate_zonelist_order(void)
> +{
> +	/* dummy, just select node order. */
> +	return ZONELIST_ORDER_NODE;
> +}
> +
> +
> +
> +static void build_zonelists(pg_data_t *pgdat)
> +{
> +	int j, node, load;
> +	enum zone_type i;
>  	nodemask_t used_mask;
> +	int local_node, prev_node;
> +	struct zonelist *zonelist;
> +	int ordering;
>  
>  	/* initialize zonelists */
>  	for (i = 0; i < MAX_NR_ZONES; i++) {
> @@ -2120,11 +2232,18 @@ static void __meminit build_zonelists(pg
>  		zonelist->zones[0] = NULL;
>  	}
>  
> +	ordering = zonelist_order;
> +	if (ordering == ZONELIST_ORDER_AUTO)
> +		ordering = estimate_zonelist_order();
>  	/* NUMA-aware ordering of nodes */
>  	local_node = pgdat->node_id;
>  	load = num_online_nodes();
>  	prev_node = local_node;
>  	nodes_clear(used_mask);
> +
> +	memset(node_order, 0, sizeof(node_order));
> +	j = 0;
> +
>  	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
>  		int distance = node_distance(local_node, node);
>  
> @@ -2140,18 +2259,20 @@ static void __meminit build_zonelists(pg
>  		 * So adding penalty to the first node in same
>  		 * distance group to make it round-robin.
>  		 */
> -
>  		if (distance != node_distance(local_node, prev_node))
> -			node_load[node] += load;
> +			node_load[node] = load;
> +
>  		prev_node = node;
>  		load--;
> -		for (i = 0; i < MAX_NR_ZONES; i++) {
> -			zonelist = pgdat->node_zonelists + i;
> -			for (j = 0; zonelist->zones[j] != NULL; j++);
> +		if (ordering == ZONELIST_ORDER_NODE)	/* default */
> +			build_zonelists_in_node_order(pgdat, node);
> +		else
> +			node_order[j++] = node;	/* remember order */
> +	}
>  
> -	 		j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
> -			zonelist->zones[j] = NULL;
> -		}
> +	if (ordering == ZONELIST_ORDER_ZONE) {
> +		/* calculate node order -- i.e., DMA last! */
> +		build_zonelists_in_zone_order(pgdat, j);
>  	}
>  }
>  
> @@ -2173,6 +2294,37 @@ static void __meminit build_zonelist_cac
>  	}
>  }
>  
> +/*
> + * sysctl handler for numa_zonelist_order
> + */
> +int numa_zonelist_order_handler(ctl_table *table, int write,
> +		struct file *file, void __user *buffer, size_t *length,
> +		loff_t *ppos)
> +{
> +	char saved_string[NUMA_ZONELIST_ORDER_LEN];
> +	int ret;
> +
> +	if (write)
> +		strncpy(saved_string, (char*)table->data,
> +			NUMA_ZONELIST_ORDER_LEN);
> +	ret = proc_dostring(table, write, file, buffer, length, ppos);
> +	if (ret)
> +		return ret;
> +	if (write) {
> +		int oldval = zonelist_order;
> +		if (__parse_numa_zonelist_order((char*)table->data)) {
> +			/*
> +			 * bogus value.  restore saved string
> +			 */
> +			strncpy((char*)table->data, saved_string,
> +				NUMA_ZONELIST_ORDER_LEN);
> +			zonelist_order = oldval;
> +		} else if (oldval != zonelist_order)
> +			build_all_zonelists();
> +	}
> +	return 0;
> +}
> +
>  #else	/* CONFIG_NUMA */
>  
>  static void __meminit build_zonelists(pg_data_t *pgdat)
> @@ -2222,7 +2374,7 @@ static void __meminit build_zonelist_cac
>  #endif	/* CONFIG_NUMA */
>  
>  /* return values int ....just for stop_machine_run() */
> -static int __meminit __build_all_zonelists(void *dummy)
> +static int __build_all_zonelists(void *dummy)
>  {
>  	int nid;
>  
> @@ -2233,12 +2385,13 @@ static int __meminit __build_all_zonelis
>  	return 0;
>  }
>  
> -void __meminit build_all_zonelists(void)
> +void build_all_zonelists(void)
>  {
>  	if (system_state == SYSTEM_BOOTING) {
>  		__build_all_zonelists(NULL);
>  		cpuset_init_current_mems_allowed();
>  	} else {
> +		memset(node_load, 0, sizeof(node_load));
>  		/* we have to stop all cpus to guaranntee there is no user
>  		   of zonelist */
>  		stop_machine_run(__build_all_zonelists, NULL, NR_CPUS);
> Index: linux-2.6.21-rc7-mm2/include/linux/mmzone.h
> ===================================================================
> --- linux-2.6.21-rc7-mm2.orig/include/linux/mmzone.h
> +++ linux-2.6.21-rc7-mm2/include/linux/mmzone.h
> @@ -608,6 +608,11 @@ int sysctl_min_unmapped_ratio_sysctl_han
>  int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
>  			struct file *, void __user *, size_t *, loff_t *);
>  
> +extern int numa_zonelist_order_handler(struct ctl_table *, int,
> +			struct file *, void __user *, size_t *, loff_t *);
> +extern char numa_zonelist_order[];
> +#define NUMA_ZONELIST_ORDER_LEN 16	/* string buffer size */
> +
>  #include <linux/topology.h>
>  /* Returns the number of the current Node. */
>  #ifndef numa_node_id
> Index: linux-2.6.21-rc7-mm2/Documentation/kernel-parameters.txt
> ===================================================================
> --- linux-2.6.21-rc7-mm2.orig/Documentation/kernel-parameters.txt
> +++ linux-2.6.21-rc7-mm2/Documentation/kernel-parameters.txt
> @@ -1500,6 +1500,16 @@ and is between 256 and 4096 characters. 
>  			Format: <reboot_mode>[,<reboot_mode2>[,...]]
>  			See arch/*/kernel/reboot.c or arch/*/kernel/process.c			
>  
> +	numa_zonelist_order [KNL,BOOT]
> +			Select memory allocation zonelist order for NUMA
> +			platform.  Default is automatic configuration.
> +			"Node order" orders the zonelists by node [locality],
> +			then zones within nodes.  "Zone order" orders the
> +			zonelists by zone, then nodes within the zone.
> +			This moves the DMA zone, if any, to the end of the
> +			allocation lists.
> +			See also Documentation/sysctl/vm.txt
> +
>  	reserve=	[KNL,BUGS] Force the kernel to ignore some iomem area
>  
>  	reservetop=	[X86-32]
> Index: linux-2.6.21-rc7-mm2/Documentation/sysctl/vm.txt
> ===================================================================
> --- linux-2.6.21-rc7-mm2.orig/Documentation/sysctl/vm.txt
> +++ linux-2.6.21-rc7-mm2/Documentation/sysctl/vm.txt
> @@ -34,6 +34,7 @@ Currently, these files are in /proc/sys/
>  - swap_prefetch
>  - readahead_ratio
>  - readahead_hit_rate
> +- numa_zonelist_order
>  
>  ==============================================================
>  
> @@ -275,3 +276,34 @@ Possible values can be:
>  The larger value, the more capabilities, with more possible overheads.
>  
>  The default value is 1.
> +
> +=============================================================
> +
> +numa_zonelist_order
> +
> +This sysctl is only for NUMA.
> +
> +numa_zonelist_order selects the order of the memory allocation zonelists.
> +The default order [a.k.a. "node order"] orders the zonelists by node, the
                                                                         then
> +by zone within each node. The default is automatic configuration.
> +Specify "[Dd]fault" or "0" to request automatic configuration.
           "[Dd]efault"  but maybe should be "[Aa]uto" based on Kame's
rework and to avoid confusion w/rt node order being default? ...
> +
> +For example, assume a 2-node NUMA system.  The "node order" kernel memory
> +allocation order on Node(0) will be:
> +
> +	Node(0)NORMAL -> Node(0)DMA -> Node(1)NORMAL -> Node(1)DMA(if any)
> +
> +Thus, allocations that request Node(0) NORMAL may overflow onto Node(0)DMA
> +first.  This provides maximum locality, but risks exhausting all of DMA
> +memory while NORMAL memory exists elsewhere on the system.  This can result
> +in OOM-KILL in ZONE_DMA.  Specify "[Zz]one" or "2" to request zone order.
> +
> +If numa_zonelist_order is set to "zone" order, the kernel memory allocation
> +order on Node(0) becomes:
> +
> +	Node(0)NORMAL -> Node(1)NORMAL -> Node(0)DMA -> Node(1)DMA(if any)
> +
> +In this mode, DMA memory will be used in place of NORMAL memory only when
> +all NORMAL zones are exhausted.  Specify "[Nn]ode" or "1" for node order.
> +
> +The default value is 0.
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] change global zonelist order v4 [2/2] auto configuration
  2007-04-27  6:17 ` [PATCH] change global zonelist order v4 [2/2] auto configuration KAMEZAWA Hiroyuki
@ 2007-04-30 16:26   ` Lee Schermerhorn
  0 siblings, 0 replies; 13+ messages in thread
From: Lee Schermerhorn @ 2007-04-30 16:26 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel, linux-mm, akpm, clameter

On Fri, 2007-04-27 at 15:17 +0900, KAMEZAWA Hiroyuki wrote:
> Add auto zone ordering configuration.
> 
> This function will select ZONE_ORDER_NODE when
> 
> - There are only ZONE_DMA or ZONE_DMA32.
> (or) size of (ZONE_DMA/DMA32) > (System Total Memory)/2
> (or) Assume Node(A):
> 	* Node(A)'s total memory > (System Total Memory)/(num_of_nodes + 1)
> 	(and) Node(A)'s ZONE_DMA/DMA32 occupies more than 60% of Node(A)'s memory.
> 
> otherwise, ZONE_ORDER_ZONE is selected.
> 
> Note: a user can specifiy this ordering from boot option.
                   specify
> 
> Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Minor editorial [spelling, ...] comments.

Acked-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
> ---
>  mm/page_alloc.c |   44 +++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 43 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6.21-rc7-mm2/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.21-rc7-mm2.orig/mm/page_alloc.c	2007-04-27 15:39:49.000000000 +0900
> +++ linux-2.6.21-rc7-mm2/mm/page_alloc.c	2007-04-27 15:55:51.000000000 +0900
> @@ -2211,8 +2211,50 @@
>  
>  static int estimate_zonelist_order(void)
>  {
> -	/* dummy, just select node order. */
> -	return ZONELIST_ORDER_NODE;
> +	int nid, zone_type;
> +	unsigned long low_kmem_size, total_size;
> +	struct zone *z;
> +	int average_size;
> +	/* ZONE_DMA and ZONE_DMA32 can be a very small area in the system.
> +	   If they are really small and used heavily,
> +	   the system can fall into OOM very easily.
> +	   This function detect ZONE_DMA/DMA32 size and confgigure
                           detects                        configures
> +	   zone ordering */
> +	/* Is there ZONE_NORMAL? (e.g., ppc has only a DMA zone.) */
> +	low_kmem_size = 0;
> +	total_size = 0;
> +	for_each_online_node(nid) {
> +		for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
> +			z = &NODE_DATA(nid)->node_zones[zone_type];
> +			if (populated_zone(z)) {
> +				if (zone_type < ZONE_NORMAL)
> +					low_kmem_size += z->present_pages;
> +				total_size += z->present_pages;
> +			}
> +		}
> +	}
> +	if (!low_kmem_size ||  /* there is no DMA area. */
> +	    low_kmem_size > total_size/2) /* DMA/DMA32 is big. */
> +		return ZONELIST_ORDER_NODE;
> +	/* look into each node's config, where all processes start... */
> +	/* average size..a bit smaller than real average size */
> +	average_size = total_size / (num_online_nodes() + 1);
> +	for_each_online_node(nid) {
> +		low_kmem_size = 0;
> +		total_size = 0;
> +		for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
> +			z = &NODE_DATA(nid)->node_zones[zone_type];
> +			if (populated_zone(z)) {
> +				if (zone_type < ZONE_NORMAL)
> +					low_kmem_size += z->present_pages;
> +				total_size += z->present_pages;
> +			}
> +		}
> +		if (total_size > average_size && /* ignore unbalanced node */
> +		    low_kmem_size > total_size * 60/100)
> +			return ZONELIST_ORDER_NODE;
> +	}
> +	return ZONELIST_ORDER_ZONE;
>  }
>  
> 
> 



* Re: [PATCH] change global zonelist order v4 [0/2]
  2007-04-27  5:45 [PATCH] change global zonelist order v4 [0/2] KAMEZAWA Hiroyuki
  2007-04-27  6:04 ` [PATCH] change global zonelist order v4 [1/2] change zonelist ordering KAMEZAWA Hiroyuki
  2007-04-27  6:17 ` [PATCH] change global zonelist order v4 [2/2] auto configuration KAMEZAWA Hiroyuki
@ 2007-05-04  5:47 ` Andrew Morton
  2007-05-04 15:26   ` Jesse Barnes
  2007-05-04 17:12   ` Lee Schermerhorn
  2 siblings, 2 replies; 13+ messages in thread
From: Andrew Morton @ 2007-05-04  5:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: LKML, Linux-MM, Christoph Lameter, Lee.Schermerhorn

On Fri, 27 Apr 2007 14:45:30 +0900 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> Hi, this is version 4. including Lee Schermerhon's good rework.
> and automatic configuration at boot time.

hm, this adds rather a lot of code.  Have we established that it's worth
it?

And it's complex - how do poor users know what to do with this new control?


This:

+ *	= "[dD]efault | "0"	- default, automatic configuration.
+ *	= "[nN]ode"|"1" 	- order by node locality,
+ *         			  then zone within node.
+ *	= "[zZ]one"|"2" - order by zone, then by locality within zone

seems a bit excessive.  I think just the 0/1/2 plus documentation would
suffice?


I haven't followed this discussion very closely I'm afraid.  If we came up
with a good reason why Linux needs this feature then could someone please
(re)describe it?

Thanks.


* Re: [PATCH] change global zonelist order v4 [0/2]
  2007-05-04  5:47 ` [PATCH] change global zonelist order v4 [0/2] Andrew Morton
@ 2007-05-04 15:26   ` Jesse Barnes
  2007-05-04 16:18     ` Christoph Lameter
  2007-05-04 17:12   ` Lee Schermerhorn
  1 sibling, 1 reply; 13+ messages in thread
From: Jesse Barnes @ 2007-05-04 15:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, LKML, Linux-MM, Christoph Lameter, Lee.Schermerhorn

On Thursday, May 03, 2007, Andrew Morton wrote:
> On Fri, 27 Apr 2007 14:45:30 +0900 KAMEZAWA Hiroyuki 
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > Hi, this is version 4. including Lee Schermerhon's good rework.
> > and automatic configuration at boot time.
>
> hm, this adds rather a lot of code.  Have we established that it's worth
> it?
>
> And it's complex - how do poor users know what to do with this new
> control?
>
>
> This:
>
> + *	= "[dD]efault | "0"	- default, automatic configuration.
> + *	= "[nN]ode"|"1" 	- order by node locality,
> + *         			  then zone within node.
> + *	= "[zZ]one"|"2" - order by zone, then by locality within zone
>
> seems a bit excessive.  I think just the 0/1/2 plus documentation would
> suffice?
>
>
> I haven't followed this discussion very closely I'm afraid.  If we came
> up with a good reason why Linux needs this feature then could someone
> please (re)describe it?

I think the idea is to avoid exhausting ZONE_DMA on some NUMA boxes by 
ordering the fallback list first by zone, then by node distance (e.g. 
ZONE_NORMAL of local node, then ZONE_NORMAL of next nearest node etc., 
followed by ZONE_DMA of local node, ZONE_DMA of next nearest node, etc.).

As for documentation, it would be good if the "default" behavior was 
described as well (it's mostly by node first, then by zone iirc, but has a 
few other tweaks).

Another option would be to make this behavior automatic if both ZONE_DMA 
and ZONE_NORMAL had pages.  I initially wrote this stuff with the idea 
that machines that really needed it would have all their memory in 
ZONE_DMA, but obviously that's not the case, so some more smarts are 
needed.

Jesse



* Re: [PATCH] change global zonelist order v4 [0/2]
  2007-05-04 15:26   ` Jesse Barnes
@ 2007-05-04 16:18     ` Christoph Lameter
  2007-05-04 17:24       ` Lee Schermerhorn
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Lameter @ 2007-05-04 16:18 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, LKML, Linux-MM, Lee.Schermerhorn

On Fri, 4 May 2007, Jesse Barnes wrote:

> I think the idea is to avoid exhausting ZONE_DMA on some NUMA boxes by 
> ordering the fallback list first by zone, then by node distance (e.g. 
> ZONE_NORMAL of local node, then ZONE_NORMAL of next nearest node etc., 
> followed by ZONE_DMA of local node, ZONE_DMA of next nearest node, etc.).

Maybe it would be cleaner to set up a DMA and a DMA32 "node" and define 
them at a certain distance from the rest of the nodes that only contain 
ZONE_NORMAL (or the zone that is replicated on all nodes). Then we would 
have that effect without reworking zonelist generation. Plus, in the long 
run, we may then be able to get to one zone per node, avoiding the 
difficulties coming from zone fallback altogether.

> Another option would be to make this behavior automatic if both ZONE_DMA 
> and ZONE_NORMAL had pages.  I initially wrote this stuff with the idea 
> that machines that really needed it would have all their memory in 
> ZONE_DMA, but obviously that's not the case, so some more smarts are 
> needed.

I think what would work is to first set up nodes that use the highest zone. 
Then add virtual nodes for the lower zones that may only exist on a single 
node.

I.e. a 4 node x86_64 box may have

Node
0	ZONE_NORMAL
1	ZONE_NORMAL
2	ZONE_NORMAL
3	ZONE_NORMAL
4	ZONE_DMA32
5	[additional ZONE_DMA32 if zone DMA32 is split over multiple nodes]
6	ZONE_DMA

The SLIT information can be used to control how the nodes fallback to the 
DMA32 nodes on 4 and 5. Node 6 would be given a very high SLIT distance so 
that it would be used only if an actual __GFP_DMA occurs or the system 
really runs into memory difficulties.



* Re: [PATCH] change global zonelist order v4 [0/2]
  2007-05-04  5:47 ` [PATCH] change global zonelist order v4 [0/2] Andrew Morton
  2007-05-04 15:26   ` Jesse Barnes
@ 2007-05-04 17:12   ` Lee Schermerhorn
  1 sibling, 0 replies; 13+ messages in thread
From: Lee Schermerhorn @ 2007-05-04 17:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, LKML, Linux-MM, Christoph Lameter, Eric Whitney

On Thu, 2007-05-03 at 22:47 -0700, Andrew Morton wrote:
> On Fri, 27 Apr 2007 14:45:30 +0900 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > Hi, this is version 4. including Lee Schermerhon's good rework.
> > and automatic configuration at boot time.
> 
> hm, this adds rather a lot of code.  Have we established that it's worth
> it?

See below.  Something is needed here on some platforms.  The current
zonelist ordering results in some unfortunate behavior on some
platforms.


> 
> And it's complex - how do poor users know what to do with this new control?
> 
Kame's autoconfig seems to be doing the right thing for our platform.
Might not be the case for other platforms, or some workloads on them.  I
suppose the documentation in sysctl.txt could be expanded to describe
when you might want to select a non-default setting, should we decide to
provide that capability.

> 
> This:
> 
> + *	= "[dD]efault | "0"	- default, automatic configuration.
> + *	= "[nN]ode"|"1" 	- order by node locality,
> + *         			  then zone within node.
> + *	= "[zZ]one"|"2" - order by zone, then by locality within zone
> 
> seems a bit excessive.  I think just the 0/1/2 plus documentation would
> suffice?

I agree, but I was considering dropping the "0/1/2" in favor of the more
descriptive [IMO] values ;-).

> 
> 
> I haven't followed this discussion very closely I'm afraid.  If we came up
> with a good reason why Linux needs this feature then could someone please
> (re)describe it?

Kame originally described the need for it in:

	http://marc.info/?l=linux-mm&m=117747120307559&w=4

I chimed in with support as we have a similar need for our cell-based
ia64 platforms:

	http://marc.info/?l=linux-mm&m=117760331328012&w=4

I can easily consume all of DMA on our platforms [configured as 100%
"cell local memory" -- always leaves some "cache-line interleaved" at
phys addr zero => ZONE_DMA] by allocating, e.g., a shared memory segment
of size > 1 node's memory + size of ZONE_DMA.  This occurs because the
node containing zone DMA is always second in a node's ZONE_NORMAL zonelist
[after the node itself, assuming it has memory].  Then, any driver that
requests memory from ZONE_DMA will be denied, resulting in IO errors,
death of hald [maybe that's a feature? ;-)], ...

I guess I would be happy with Kame's V3 patch that unconditionally
changes the order to be zone first--i.e., ZONE_NORMAL for all nodes
before ZONE_DMA*:

	http://marc.info/?l=linux-mm&m=117758484122663&w=4

However, this patch apparently crossed in the mail with Christoph's
observation that making the new order [zone order] the default w/o any
option wouldn't be appropriate for some configurations:

	http://marc.info/?l=linux-mm&m=117760245022005&w=4

Meanwhile, I was factoring out common code in Kame's V1/V2 patch and
adding the "excessive" user interface to the boot parameter/sysctl.
After some additional rework, Kame posted this as V4--the one you're
questioning.

If we decide to proceed with this, I have another "cleanup" patch that
eliminates some redundant "estimating of zone order" [autoconfig] and
reports what order was chosen in the "Build %d zonelists..." message.


Lee



* Re: [PATCH] change global zonelist order v4 [0/2]
  2007-05-04 16:18     ` Christoph Lameter
@ 2007-05-04 17:24       ` Lee Schermerhorn
  2007-05-04 17:28         ` Christoph Lameter
  0 siblings, 1 reply; 13+ messages in thread
From: Lee Schermerhorn @ 2007-05-04 17:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Jesse Barnes, Andrew Morton, KAMEZAWA Hiroyuki, LKML, Linux-MM

On Fri, 2007-05-04 at 09:18 -0700, Christoph Lameter wrote:
> On Fri, 4 May 2007, Jesse Barnes wrote:
> 
> > I think the idea is to avoid exhausting ZONE_DMA on some NUMA boxes by 
> > ordering the fallback list first by zone, then by node distance (e.g. 
> > ZONE_NORMAL of local node, then ZONE_NORMAL of next nearest node etc., 
> > followed by ZONE_DMA of local node, ZONE_DMA of next nearest node, etc.).
> 
> Maybe it would be cleaner to set up a DMA and a DMA32 "node" and define 
> them at a certain distance from the rest of the nodes that only contain 
> ZONE_NORMAL (or the zone that is replicated on all nodes). Then we would 
> have that effect without reworking zonelist generation. Plus, in the long 
> run, we may then be able to get to one zone per node, avoiding the 
> difficulties coming from zone fallback altogether.
> 
> > Another option would be to make this behavior automatic if both ZONE_DMA 
> > and ZONE_NORMAL had pages.  I initially wrote this stuff with the idea 
> > that machines that really needed it would have all their memory in 
> > ZONE_DMA, but obviously that's not the case, so some more smarts are 
> > needed.
> 
> I think what would work is to first set up nodes that use the highest zone. 
> Then add virtual nodes for the lower zones that may only exist on a single 
> node.
> 
> I.e. a 4 node x86_64 box may have
> 
> Node
> 0	ZONE_NORMAL
> 1	ZONE_NORMAL
> 2	ZONE_NORMAL
> 3	ZONE_NORMAL
> 4	ZONE_DMA32
> 5	[additional ZONE_DMA32 if zone DMA32 is split over multiple nodes]
> 6	ZONE_DMA
> 
> The SLIT information can be used to control how the nodes fallback to the 
> DMA32 nodes on 4 and 5. Node 6 would be given a very high SLIT distance so 
> that it would be used only if an actual __GFP_DMA occurs or the system 
> really runs into memory difficulties.

Hmmm...  "serious hackery", indeed!  ;-)

Lee



* Re: [PATCH] change global zonelist order v4 [0/2]
  2007-05-04 17:24       ` Lee Schermerhorn
@ 2007-05-04 17:28         ` Christoph Lameter
  2007-05-04 17:36           ` Jesse Barnes
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Lameter @ 2007-05-04 17:28 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Jesse Barnes, Andrew Morton, KAMEZAWA Hiroyuki, LKML, Linux-MM

On Fri, 4 May 2007, Lee Schermerhorn wrote:

> Hmmm...  "serious hackery", indeed!  ;-)

Maybe on the arch level but minimal changes to core code.
And it is a step towards avoiding zones in NUMA.



* Re: [PATCH] change global zonelist order v4 [0/2]
  2007-05-04 17:28         ` Christoph Lameter
@ 2007-05-04 17:36           ` Jesse Barnes
  2007-05-04 18:03             ` Christoph Lameter
  0 siblings, 1 reply; 13+ messages in thread
From: Jesse Barnes @ 2007-05-04 17:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Lee Schermerhorn, Andrew Morton, KAMEZAWA Hiroyuki, LKML, Linux-MM

On Friday, May 04, 2007, Christoph Lameter wrote:
> On Fri, 4 May 2007, Lee Schermerhorn wrote:
> > Hmmm...  "serious hackery", indeed!  ;-)
>
> Maybe on the arch level but minimal changes to core code.
> And it is a step towards avoiding zones in NUMA.

You mentioned that if node 0 has a small ZONE_NORMAL and the ZONE_DMA for 
the system, defaulting to using ZONE_NORMAL on all nodes first would be a 
bad idea.  Is that really true?  Maybe for ZONE_DMA32 it is since that 
first node could have a few gigs of memory, but for regular ZONE_DMA it's 
probably the right thing to do...

So aside from the comment issues Lee already pointed out, I think 
Kamezawa-san's patch from 
http://marc.info/?l=linux-mm&m=117758484122663&w=4 seems reasonable.

Jesse


* Re: [PATCH] change global zonelist order v4 [0/2]
  2007-05-04 17:36           ` Jesse Barnes
@ 2007-05-04 18:03             ` Christoph Lameter
  0 siblings, 0 replies; 13+ messages in thread
From: Christoph Lameter @ 2007-05-04 18:03 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Lee Schermerhorn, Andrew Morton, KAMEZAWA Hiroyuki, LKML, Linux-MM

On Fri, 4 May 2007, Jesse Barnes wrote:

> You mentioned that if node 0 has a small ZONE_NORMAL and the ZONE_DMA for 
> the system, defaulting to using ZONE_NORMAL on all nodes first would be a 
> bad idea.  Is that really true?  Maybe for ZONE_DMA32 it is since that 
> first node could have a few gigs of memory, but for regular ZONE_DMA it's 
> probably the right thing to do...

If the fallback sequence is, e.g., Node 0 NORMAL (500M), Node 1 NORMAL (4G), 
Node 2 NORMAL (4G), ... many more ..., Node 0 DMA32 (~4G), Node 0 DMA, then 
memory is frequently going to be placed suboptimally for allocations from 
processes running on node 0, because node 0 is memory starved. 
Allocations will be made from node 1, which may create a shortage there 
that spills over in turn. It could be a cascade effect, because the 
symmetry in memory is no longer there.

The proposal to create an additional node may solve that to some extent by 
placing the DMA node nearer to node 0.

Maybe the best approach is to leave things as is and just be careful with 
I/O to 32 bits? I do not think there is an easy solution. A 64-bit NUMA 
platform should have I/O that is 64-bit capable and not restricted to DMA 
zones.

> So aside from the comment issues Lee already pointed out, I think 
> Kamezawa-san's patch from 
> http://marc.info/?l=linux-mm&m=117758484122663&w=4 seems reasonable.

If we are going to do this then the patch needs to be fine tuned first and 
the impact on core code needs to be minimized. I want to make really sure 
that platforms without DMA zones work right, if zones are empty it should 
work right and weird x86_64 combinations of NORMAL, DMA and DMA32 
distributed over various nodes would need to be covered and tested first.

How will this affect NUMAQ (32-bit NUMA), where we have HIGHMEM on 
(most) nodes and NORMAL/DMA on node 0?

