LKML Archive on lore.kernel.org
* [PATCH 0/4] De-couple sysfs memory directories from memory sections
@ 2011-01-20 16:36 Nathan Fontenot
  2011-01-20 16:43 ` [PATCH 1/4] Allow memory blocks to span multiple " Nathan Fontenot
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Nathan Fontenot @ 2011-01-20 16:36 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-mm, linuxppc-dev, linux-kernel, KAMEZAWA Hiroyuki, Robin Holt

This is a re-send of the remaining patches that did not make it
into the last kernel release for de-coupling sysfs memory
directories from memory sections.  The first three patches of the
previous set went in, and these are the remaining patches that
need to be applied.

The patches decouple the concept that a single memory section corresponds
to a single directory in /sys/devices/system/memory/.  On systems
with large amounts of memory (1+ TB) there are performance issues
related to creating the large number of sysfs directories.  For
a powerpc machine with 1 TB of memory we are creating 63,000+
directories.  This is resulting in boot times of around 45-50
minutes for systems with 1 TB of memory and 8+ hours for systems
with 2 TB of memory.  With this patch set applied I am now seeing
boot times of 5 minutes or less.
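
As a rough check on these numbers, the directory count is simply the
amount of memory divided by the memory block size.  The sketch below is
illustrative only: count_memory_dirs() is a hypothetical helper, and the
16 MiB section size and 256 MiB block size used in the note after it are
assumptions, not figures taken from this message.

```c
#include <stdint.h>

/* Hypothetical helper: number of memoryXXX directories needed to cover
 * total_bytes when each directory spans block_bytes.  Rounds up so a
 * partial trailing block still gets a directory. */
uint64_t count_memory_dirs(uint64_t total_bytes, uint64_t block_bytes)
{
	if (block_bytes == 0)
		return 0;	/* avoid dividing by zero on bad input */
	return (total_bytes + block_bytes - 1) / block_bytes;
}
```

Under those assumptions, 1 TiB with one directory per 16 MiB section
gives 65,536 directories, the same order of magnitude as the 63,000+
reported above, while one directory per 256 MiB block gives 4,096.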

The root of this issue is in sysfs directory creation.  Every time
a directory is created, a string compare is done against sibling
directories (see sysfs_find_dirent()) to ensure we do not create
duplicates.  The list of directory nodes in sysfs is kept as an
unsorted list, so each creation takes longer as the number of
directories grows.
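
To make the cost concrete: if every creation scans all existing
siblings, creating n directories takes 0 + 1 + ... + (n - 1) =
n(n - 1)/2 comparisons in total, so the total work grows quadratically
with the directory count.  The sketch below only counts those
comparisons.  It is a model of the behavior described here, not the
sysfs code, and total_sibling_compares() is a made-up name.

```c
#include <stdint.h>

/* Model only: the i-th directory created is compared against the i
 * siblings that already exist, so the total is 0 + 1 + ... + (n - 1). */
uint64_t total_sibling_compares(uint64_t n)
{
	uint64_t total = 0;
	uint64_t i;

	for (i = 0; i < n; i++)
		total += i;
	return total;
}
```

For example, 65,536 directories would need about 2.1 billion string
comparisons in this model, while 4,096 directories would need about
8.4 million.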

The solution implemented by this patch set is to allow a single
directory in sysfs to span multiple memory sections.  This is
controlled by an optional architecturally defined function
memory_block_size_bytes().  The default definition of this
routine returns a memory block size equal to the memory section
size.  This keeps the layout of the sysfs memory directories, as
seen by userspace, the same as it is today.

For architectures that define their own version of this routine,
as is done for powerpc and x86 in this patchset, the view in userspace
would change such that each memoryXXX directory would span
multiple memory sections.  The number of sections spanned would
depend on the value reported by memory_block_size_bytes().
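
Userspace can discover the block size through
/sys/devices/system/memory/block_size_bytes, which patch 1/4 changes to
report the memory block size, printed as hexadecimal without a 0x
prefix.  A minimal sketch of parsing that value follows.
parse_block_size() is a hypothetical helper, and the error handling is
kept deliberately simple.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse the contents of block_size_bytes, which the
 * kernel prints with "%lx\n" (hex digits, no "0x" prefix).
 * Returns 0 on malformed input. */
unsigned long long parse_block_size(const char *text)
{
	char *end;
	unsigned long long val;

	errno = 0;
	val = strtoull(text, &end, 16);
	if (errno != 0 || end == text)
		return 0;	/* overflow or no digits at all */
	if (*end != '\n' && *end != '\0')
		return 0;	/* trailing garbage */
	return val;
}
```

A caller would typically read the file into a small buffer with
fopen()/fgets() and pass the buffer to this function.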

-Nathan Fontenot


* [PATCH 1/4] Allow memory blocks to span multiple memory sections
  2011-01-20 16:36 [PATCH 0/4] De-couple sysfs memory directories from memory sections Nathan Fontenot
@ 2011-01-20 16:43 ` Nathan Fontenot
  2011-01-20 16:44 ` [PATCH 2/4] Update phys_index to [start|end]_section_nr Nathan Fontenot
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Nathan Fontenot @ 2011-01-20 16:43 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-mm, linuxppc-dev, linux-kernel, KAMEZAWA Hiroyuki, Robin Holt

Update the memory sysfs code such that each sysfs memory directory is now
considered a memory block that can span multiple memory sections.  The
default size of each memory block is one memory section
(1 << SECTION_SIZE_BITS bytes), which maintains the current behavior of
having a single memory section per memory block (i.e. one sysfs directory
per memory section).

Architectures that want memory blocks to span multiple memory sections
need only define their own memory_block_size_bytes() routine.
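
The core code does not trust the architecture's value blindly:
get_memory_block_size() in the diff below falls back to the section size
unless the reported size is a power of two and at least one section.
The same check is sketched here as standalone code, with the section
size passed in as a parameter (an assumption made so that it can run
outside the kernel).  Both function names are invented for this sketch.

```c
/* Standalone version of the check; min_block stands in for
 * MIN_MEMORY_BLOCK_SIZE (1 << SECTION_SIZE_BITS in the kernel). */
unsigned long checked_block_size(unsigned long requested,
				 unsigned long min_block)
{
	/* x & (x - 1) is zero only when x has at most one bit set */
	if ((requested & (requested - 1)) || requested < min_block)
		return min_block;
	return requested;
}

/* Number of memory sections covered by one memory block. */
unsigned long sections_in_block(unsigned long block_size,
				unsigned long min_block)
{
	return block_size / min_block;
}
```

Because an accepted size is a power of two no smaller than the section
size, the division in sections_in_block() is always exact.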

Update the memory hotplug documentation to describe the new behavior of
memory blocks as exposed in sysfs.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Reviewed-by: Robin Holt <holt@sgi.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 Documentation/memory-hotplug.txt |   47 +++++++----
 drivers/base/memory.c            |  155 +++++++++++++++++++++++++++------------
 2 files changed, 139 insertions(+), 63 deletions(-)

Index: linux-2.6/Documentation/memory-hotplug.txt
===================================================================
--- linux-2.6.orig/Documentation/memory-hotplug.txt	2011-01-05 10:08:16.000000000 -0600
+++ linux-2.6/Documentation/memory-hotplug.txt	2011-01-05 10:17:37.000000000 -0600
@@ -126,36 +126,51 @@ config options.
 --------------------------------
 4 sysfs files for memory hotplug
 --------------------------------
-All sections have their device information under /sys/devices/system/memory as
+All sections have their device information in sysfs.  Each section is part of
+a memory block under /sys/devices/system/memory as
 
 /sys/devices/system/memory/memoryXXX
-(XXX is section id.)
+(XXX is the section id.)
 
-Now, XXX is defined as start_address_of_section / section_size.
+Now, XXX is defined as (start_address_of_section / section_size) of the first
+section contained in the memory block.  The files 'phys_index' and
+'end_phys_index' under each directory report the beginning and end section id's
+for the memory block covered by the sysfs directory.  It is expected that all
+memory sections in this range are present and no memory holes exist in the
+range. Currently there is no way to determine if there is a memory hole, but
+the existence of one should not affect the hotplug capabilities of the memory
+block.
 
 For example, assume 1GiB section size. A device for a memory starting at
 0x100000000 is /sys/device/system/memory/memory4
 (0x100000000 / 1Gib = 4)
 This device covers address range [0x100000000 ... 0x140000000)
 
-Under each section, you can see 4 files.
+Under each section, you can see 4 or 5 files, the end_phys_index file being
+a recent addition and not present on older kernels.
 
-/sys/devices/system/memory/memoryXXX/phys_index
+/sys/devices/system/memory/memoryXXX/start_phys_index
+/sys/devices/system/memory/memoryXXX/end_phys_index
 /sys/devices/system/memory/memoryXXX/phys_device
 /sys/devices/system/memory/memoryXXX/state
 /sys/devices/system/memory/memoryXXX/removable
 
-'phys_index' : read-only and contains section id, same as XXX.
-'state'      : read-write
-               at read:  contains online/offline state of memory.
-               at write: user can specify "online", "offline" command
-'phys_device': read-only: designed to show the name of physical memory device.
-               This is not well implemented now.
-'removable'  : read-only: contains an integer value indicating
-               whether the memory section is removable or not
-               removable.  A value of 1 indicates that the memory
-               section is removable and a value of 0 indicates that
-               it is not removable.
+'phys_index'      : read-only and contains section id of the first section
+		    in the memory block, same as XXX.
+'end_phys_index'  : read-only and contains section id of the last section
+		    in the memory block.
+'state'           : read-write
+                    at read:  contains online/offline state of memory.
+                    at write: user can specify "online", "offline" command
+                    which will be performed on all sections in the block.
+'phys_device'     : read-only: designed to show the name of physical memory
+                    device.  This is not well implemented now.
+'removable'       : read-only: contains an integer value indicating
+                    whether the memory block is removable or not
+                    removable.  A value of 1 indicates that the memory
+                    block is removable and a value of 0 indicates that
+                    it is not removable. A memory block is removable only if
+                    every section in the block is removable.
 
 NOTE:
   These directories/files appear after physical memory hotplug phase.
Index: linux-2.6/drivers/base/memory.c
===================================================================
--- linux-2.6.orig/drivers/base/memory.c	2011-01-05 10:08:16.000000000 -0600
+++ linux-2.6/drivers/base/memory.c	2011-01-05 10:17:37.000000000 -0600
@@ -30,6 +30,14 @@
 static DEFINE_MUTEX(mem_sysfs_mutex);
 
 #define MEMORY_CLASS_NAME	"memory"
+#define MIN_MEMORY_BLOCK_SIZE	(1 << SECTION_SIZE_BITS)
+
+static int sections_per_block;
+
+static inline int base_memory_block_id(int section_nr)
+{
+	return section_nr / sections_per_block;
+}
 
 static struct sysdev_class memory_sysdev_class = {
 	.name = MEMORY_CLASS_NAME,
@@ -84,28 +92,47 @@ EXPORT_SYMBOL(unregister_memory_isolate_
  * register_memory - Setup a sysfs device for a memory block
  */
 static
-int register_memory(struct memory_block *memory, struct mem_section *section)
+int register_memory(struct memory_block *memory)
 {
 	int error;
 
 	memory->sysdev.cls = &memory_sysdev_class;
-	memory->sysdev.id = __section_nr(section);
+	memory->sysdev.id = memory->phys_index / sections_per_block;
 
 	error = sysdev_register(&memory->sysdev);
 	return error;
 }
 
 static void
-unregister_memory(struct memory_block *memory, struct mem_section *section)
+unregister_memory(struct memory_block *memory)
 {
 	BUG_ON(memory->sysdev.cls != &memory_sysdev_class);
-	BUG_ON(memory->sysdev.id != __section_nr(section));
 
 	/* drop the ref. we got in remove_memory_block() */
 	kobject_put(&memory->sysdev.kobj);
 	sysdev_unregister(&memory->sysdev);
 }
 
+unsigned long __weak memory_block_size_bytes(void)
+{
+	return MIN_MEMORY_BLOCK_SIZE;
+}
+
+static unsigned long get_memory_block_size(void)
+{
+	unsigned long block_sz;
+
+	block_sz = memory_block_size_bytes();
+
+	/* Validate block_sz is a power of 2 and not less than section size */
+	if ((block_sz & (block_sz - 1)) || (block_sz < MIN_MEMORY_BLOCK_SIZE)) {
+		WARN_ON(1);
+		block_sz = MIN_MEMORY_BLOCK_SIZE;
+	}
+
+	return block_sz;
+}
+
 /*
  * use this as the physical section index that this memsection
  * uses.
@@ -116,7 +143,7 @@ static ssize_t show_mem_phys_index(struc
 {
 	struct memory_block *mem =
 		container_of(dev, struct memory_block, sysdev);
-	return sprintf(buf, "%08lx\n", mem->phys_index);
+	return sprintf(buf, "%08lx\n", mem->phys_index / sections_per_block);
 }
 
 /*
@@ -125,13 +152,16 @@ static ssize_t show_mem_phys_index(struc
 static ssize_t show_mem_removable(struct sys_device *dev,
 			struct sysdev_attribute *attr, char *buf)
 {
-	unsigned long start_pfn;
-	int ret;
+	unsigned long i, pfn;
+	int ret = 1;
 	struct memory_block *mem =
 		container_of(dev, struct memory_block, sysdev);
 
-	start_pfn = section_nr_to_pfn(mem->phys_index);
-	ret = is_mem_section_removable(start_pfn, PAGES_PER_SECTION);
+	for (i = 0; i < sections_per_block; i++) {
+		pfn = section_nr_to_pfn(mem->phys_index + i);
+		ret &= is_mem_section_removable(pfn, PAGES_PER_SECTION);
+	}
+
 	return sprintf(buf, "%d\n", ret);
 }
 
@@ -184,17 +214,14 @@ int memory_isolate_notify(unsigned long
  * OK to have direct references to sparsemem variables in here.
  */
 static int
-memory_block_action(struct memory_block *mem, unsigned long action)
+memory_section_action(unsigned long phys_index, unsigned long action)
 {
 	int i;
-	unsigned long psection;
 	unsigned long start_pfn, start_paddr;
 	struct page *first_page;
 	int ret;
-	int old_state = mem->state;
 
-	psection = mem->phys_index;
-	first_page = pfn_to_page(psection << PFN_SECTION_SHIFT);
+	first_page = pfn_to_page(phys_index << PFN_SECTION_SHIFT);
 
 	/*
 	 * The probe routines leave the pages reserved, just
@@ -207,8 +234,8 @@ memory_block_action(struct memory_block
 				continue;
 
 			printk(KERN_WARNING "section number %ld page number %d "
-				"not reserved, was it already online? \n",
-				psection, i);
+				"not reserved, was it already online?\n",
+				phys_index, i);
 			return -EBUSY;
 		}
 	}
@@ -219,18 +246,13 @@ memory_block_action(struct memory_block
 			ret = online_pages(start_pfn, PAGES_PER_SECTION);
 			break;
 		case MEM_OFFLINE:
-			mem->state = MEM_GOING_OFFLINE;
 			start_paddr = page_to_pfn(first_page) << PAGE_SHIFT;
 			ret = remove_memory(start_paddr,
 					    PAGES_PER_SECTION << PAGE_SHIFT);
-			if (ret) {
-				mem->state = old_state;
-				break;
-			}
 			break;
 		default:
-			WARN(1, KERN_WARNING "%s(%p, %ld) unknown action: %ld\n",
-					__func__, mem, action, action);
+			WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: "
+			     "%ld\n", __func__, phys_index, action, action);
 			ret = -EINVAL;
 	}
 
@@ -240,7 +262,8 @@ memory_block_action(struct memory_block
 static int memory_block_change_state(struct memory_block *mem,
 		unsigned long to_state, unsigned long from_state_req)
 {
-	int ret = 0;
+	int i, ret = 0;
+
 	mutex_lock(&mem->state_mutex);
 
 	if (mem->state != from_state_req) {
@@ -248,8 +271,22 @@ static int memory_block_change_state(str
 		goto out;
 	}
 
-	ret = memory_block_action(mem, to_state);
-	if (!ret)
+	if (to_state == MEM_OFFLINE)
+		mem->state = MEM_GOING_OFFLINE;
+
+	for (i = 0; i < sections_per_block; i++) {
+		ret = memory_section_action(mem->phys_index + i, to_state);
+		if (ret)
+			break;
+	}
+
+	if (ret) {
+		for (i = 0; i < sections_per_block; i++)
+			memory_section_action(mem->phys_index + i,
+					      from_state_req);
+
+		mem->state = from_state_req;
+	} else
 		mem->state = to_state;
 
 out:
@@ -262,20 +299,15 @@ store_mem_state(struct sys_device *dev,
 		struct sysdev_attribute *attr, const char *buf, size_t count)
 {
 	struct memory_block *mem;
-	unsigned int phys_section_nr;
 	int ret = -EINVAL;
 
 	mem = container_of(dev, struct memory_block, sysdev);
-	phys_section_nr = mem->phys_index;
-
-	if (!present_section_nr(phys_section_nr))
-		goto out;
 
 	if (!strncmp(buf, "online", min((int)count, 6)))
 		ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE);
 	else if(!strncmp(buf, "offline", min((int)count, 7)))
 		ret = memory_block_change_state(mem, MEM_OFFLINE, MEM_ONLINE);
-out:
+
 	if (ret)
 		return ret;
 	return count;
@@ -315,7 +347,7 @@ static ssize_t
 print_block_size(struct sysdev_class *class, struct sysdev_class_attribute *attr,
 		 char *buf)
 {
-	return sprintf(buf, "%lx\n", (unsigned long)PAGES_PER_SECTION * PAGE_SIZE);
+	return sprintf(buf, "%lx\n", get_memory_block_size());
 }
 
 static SYSDEV_CLASS_ATTR(block_size_bytes, 0444, print_block_size, NULL);
@@ -444,6 +476,7 @@ struct memory_block *find_memory_block_h
 	struct sys_device *sysdev;
 	struct memory_block *mem;
 	char name[sizeof(MEMORY_CLASS_NAME) + 9 + 1];
+	int block_id = base_memory_block_id(__section_nr(section));
 
 	kobj = hint ? &hint->sysdev.kobj : NULL;
 
@@ -451,7 +484,7 @@ struct memory_block *find_memory_block_h
 	 * This only works because we know that section == sysdev->id
 	 * slightly redundant with sysdev_register()
 	 */
-	sprintf(&name[0], "%s%d", MEMORY_CLASS_NAME, __section_nr(section));
+	sprintf(&name[0], "%s%d", MEMORY_CLASS_NAME, block_id);
 
 	kobj = kset_find_obj_hinted(&memory_sysdev_class.kset, name, kobj);
 	if (!kobj)
@@ -476,26 +509,27 @@ struct memory_block *find_memory_block(s
 	return find_memory_block_hinted(section, NULL);
 }
 
-static int add_memory_block(int nid, struct mem_section *section,
-			unsigned long state, enum mem_add_context context)
+static int init_memory_block(struct memory_block **memory,
+			     struct mem_section *section, unsigned long state)
 {
-	struct memory_block *mem = kzalloc(sizeof(*mem), GFP_KERNEL);
+	struct memory_block *mem;
 	unsigned long start_pfn;
+	int scn_nr;
 	int ret = 0;
 
+	mem = kzalloc(sizeof(*mem), GFP_KERNEL);
 	if (!mem)
 		return -ENOMEM;
 
-	mutex_lock(&mem_sysfs_mutex);
-
-	mem->phys_index = __section_nr(section);
+	scn_nr = __section_nr(section);
+	mem->phys_index = base_memory_block_id(scn_nr) * sections_per_block;
 	mem->state = state;
 	mem->section_count++;
 	mutex_init(&mem->state_mutex);
 	start_pfn = section_nr_to_pfn(mem->phys_index);
 	mem->phys_device = arch_get_memory_phys_device(start_pfn);
 
-	ret = register_memory(mem, section);
+	ret = register_memory(mem);
 	if (!ret)
 		ret = mem_create_simple_file(mem, phys_index);
 	if (!ret)
@@ -504,8 +538,29 @@ static int add_memory_block(int nid, str
 		ret = mem_create_simple_file(mem, phys_device);
 	if (!ret)
 		ret = mem_create_simple_file(mem, removable);
+
+	*memory = mem;
+	return ret;
+}
+
+static int add_memory_section(int nid, struct mem_section *section,
+			unsigned long state, enum mem_add_context context)
+{
+	struct memory_block *mem;
+	int ret = 0;
+
+	mutex_lock(&mem_sysfs_mutex);
+
+	mem = find_memory_block(section);
+	if (mem) {
+		mem->section_count++;
+		kobject_put(&mem->sysdev.kobj);
+	} else
+		ret = init_memory_block(&mem, section, state);
+
 	if (!ret) {
-		if (context == HOTPLUG)
+		if (context == HOTPLUG &&
+		    mem->section_count == sections_per_block)
 			ret = register_mem_sect_under_node(mem, nid);
 	}
 
@@ -528,8 +583,10 @@ int remove_memory_block(unsigned long no
 		mem_remove_simple_file(mem, state);
 		mem_remove_simple_file(mem, phys_device);
 		mem_remove_simple_file(mem, removable);
-		unregister_memory(mem, section);
-	}
+		unregister_memory(mem);
+		kfree(mem);
+	} else
+		kobject_put(&mem->sysdev.kobj);
 
 	mutex_unlock(&mem_sysfs_mutex);
 	return 0;
@@ -541,7 +598,7 @@ int remove_memory_block(unsigned long no
  */
 int register_new_memory(int nid, struct mem_section *section)
 {
-	return add_memory_block(nid, section, MEM_OFFLINE, HOTPLUG);
+	return add_memory_section(nid, section, MEM_OFFLINE, HOTPLUG);
 }
 
 int unregister_memory_section(struct mem_section *section)
@@ -560,12 +617,16 @@ int __init memory_dev_init(void)
 	unsigned int i;
 	int ret;
 	int err;
+	unsigned long block_sz;
 
 	memory_sysdev_class.kset.uevent_ops = &memory_uevent_ops;
 	ret = sysdev_class_register(&memory_sysdev_class);
 	if (ret)
 		goto out;
 
+	block_sz = get_memory_block_size();
+	sections_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
+
 	/*
 	 * Create entries for memory sections that were found
 	 * during boot and have been initialized
@@ -573,8 +634,8 @@ int __init memory_dev_init(void)
 	for (i = 0; i < NR_MEM_SECTIONS; i++) {
 		if (!present_section_nr(i))
 			continue;
-		err = add_memory_block(0, __nr_to_section(i), MEM_ONLINE,
-				       BOOT);
+		err = add_memory_section(0, __nr_to_section(i), MEM_ONLINE,
+					 BOOT);
 		if (!ret)
 			ret = err;
 	}



* [PATCH 2/4] Update phys_index to [start|end]_section_nr
  2011-01-20 16:36 [PATCH 0/4] De-couple sysfs memory directories from memory sections Nathan Fontenot
  2011-01-20 16:43 ` [PATCH 1/4] Allow memory blocks to span multiple " Nathan Fontenot
@ 2011-01-20 16:44 ` Nathan Fontenot
  2011-01-20 16:45 ` [PATCH 3/4]Define memory_block_size_bytes for powerpc/pseries Nathan Fontenot
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Nathan Fontenot @ 2011-01-20 16:44 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-mm, linuxppc-dev, linux-kernel, KAMEZAWA Hiroyuki, Robin Holt

Update the 'phys_index' property of the memory_block struct to be
called start_section_nr, and add an end_section_nr property.  The
data tracked here is the same but the updated naming is more in line
with what is stored here, namely the first and last section number
that the memory block spans.
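
The arithmetic behind these two fields is small enough to show on its
own.  The sketch mirrors base_memory_block_id() and the
start_section_nr/end_section_nr assignments in init_memory_block().
The struct and function names are invented for illustration, and
sections_per_block is passed in rather than being a file-scope variable.

```c
struct block_range {
	unsigned long block_id;		/* value used as sysdev.id */
	unsigned long start_section_nr;	/* first section in the block */
	unsigned long end_section_nr;	/* last section in the block */
};

/* Map a section number to the memory block that contains it. */
struct block_range block_range_for_section(unsigned long section_nr,
					   unsigned long sections_per_block)
{
	struct block_range r;

	r.block_id = section_nr / sections_per_block;
	r.start_section_nr = r.block_id * sections_per_block;
	r.end_section_nr = r.start_section_nr + sections_per_block - 1;
	return r;
}
```

With sections_per_block equal to 1, all three values collapse to the
section number itself, which is the pre-patch behavior.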

The names presented to userspace remain the same, phys_index for
start_section_nr and end_phys_index for end_section_nr, to avoid breaking
anything in userspace.

This also updates the node sysfs code to be aware of the new capability for
a memory block to contain multiple memory sections and be aware of the memory
block structure name changes (start_section_nr).  This requires an additional
parameter to unregister_mem_sect_under_nodes so that we know which memory
section of the memory block to unregister.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Reviewed-by: Robin Holt <holt@sgi.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 drivers/base/memory.c  |   41 +++++++++++++++++++++++++++++++----------
 drivers/base/node.c    |   12 ++++++++----
 include/linux/memory.h |    3 ++-
 include/linux/node.h   |    6 ++++--
 4 files changed, 45 insertions(+), 17 deletions(-)

Index: linux-2.6/drivers/base/memory.c
===================================================================
--- linux-2.6.orig/drivers/base/memory.c	2011-01-20 08:20:54.000000000 -0600
+++ linux-2.6/drivers/base/memory.c	2011-01-20 08:20:56.000000000 -0600
@@ -97,7 +97,7 @@ int register_memory(struct memory_block
 	int error;
 
 	memory->sysdev.cls = &memory_sysdev_class;
-	memory->sysdev.id = memory->phys_index / sections_per_block;
+	memory->sysdev.id = memory->start_section_nr / sections_per_block;
 
 	error = sysdev_register(&memory->sysdev);
 	return error;
@@ -138,12 +138,26 @@ static unsigned long get_memory_block_si
  * uses.
  */
 
-static ssize_t show_mem_phys_index(struct sys_device *dev,
+static ssize_t show_mem_start_phys_index(struct sys_device *dev,
 			struct sysdev_attribute *attr, char *buf)
 {
 	struct memory_block *mem =
 		container_of(dev, struct memory_block, sysdev);
-	return sprintf(buf, "%08lx\n", mem->phys_index / sections_per_block);
+	unsigned long phys_index;
+
+	phys_index = mem->start_section_nr / sections_per_block;
+	return sprintf(buf, "%08lx\n", phys_index);
+}
+
+static ssize_t show_mem_end_phys_index(struct sys_device *dev,
+			struct sysdev_attribute *attr, char *buf)
+{
+	struct memory_block *mem =
+		container_of(dev, struct memory_block, sysdev);
+	unsigned long phys_index;
+
+	phys_index = mem->end_section_nr / sections_per_block;
+	return sprintf(buf, "%08lx\n", phys_index);
 }
 
 /*
@@ -158,7 +172,7 @@ static ssize_t show_mem_removable(struct
 		container_of(dev, struct memory_block, sysdev);
 
 	for (i = 0; i < sections_per_block; i++) {
-		pfn = section_nr_to_pfn(mem->phys_index + i);
+		pfn = section_nr_to_pfn(mem->start_section_nr + i);
 		ret &= is_mem_section_removable(pfn, PAGES_PER_SECTION);
 	}
 
@@ -275,14 +289,15 @@ static int memory_block_change_state(str
 		mem->state = MEM_GOING_OFFLINE;
 
 	for (i = 0; i < sections_per_block; i++) {
-		ret = memory_section_action(mem->phys_index + i, to_state);
+		ret = memory_section_action(mem->start_section_nr + i,
+					    to_state);
 		if (ret)
 			break;
 	}
 
 	if (ret) {
 		for (i = 0; i < sections_per_block; i++)
-			memory_section_action(mem->phys_index + i,
+			memory_section_action(mem->start_section_nr + i,
 					      from_state_req);
 
 		mem->state = from_state_req;
@@ -330,7 +345,8 @@ static ssize_t show_phys_device(struct s
 	return sprintf(buf, "%d\n", mem->phys_device);
 }
 
-static SYSDEV_ATTR(phys_index, 0444, show_mem_phys_index, NULL);
+static SYSDEV_ATTR(phys_index, 0444, show_mem_start_phys_index, NULL);
+static SYSDEV_ATTR(end_phys_index, 0444, show_mem_end_phys_index, NULL);
 static SYSDEV_ATTR(state, 0644, show_mem_state, store_mem_state);
 static SYSDEV_ATTR(phys_device, 0444, show_phys_device, NULL);
 static SYSDEV_ATTR(removable, 0444, show_mem_removable, NULL);
@@ -522,17 +538,21 @@ static int init_memory_block(struct memo
 		return -ENOMEM;
 
 	scn_nr = __section_nr(section);
-	mem->phys_index = base_memory_block_id(scn_nr) * sections_per_block;
+	mem->start_section_nr =
+			base_memory_block_id(scn_nr) * sections_per_block;
+	mem->end_section_nr = mem->start_section_nr + sections_per_block - 1;
 	mem->state = state;
 	mem->section_count++;
 	mutex_init(&mem->state_mutex);
-	start_pfn = section_nr_to_pfn(mem->phys_index);
+	start_pfn = section_nr_to_pfn(mem->start_section_nr);
 	mem->phys_device = arch_get_memory_phys_device(start_pfn);
 
 	ret = register_memory(mem);
 	if (!ret)
 		ret = mem_create_simple_file(mem, phys_index);
 	if (!ret)
+		ret = mem_create_simple_file(mem, end_phys_index);
+	if (!ret)
 		ret = mem_create_simple_file(mem, state);
 	if (!ret)
 		ret = mem_create_simple_file(mem, phys_device);
@@ -575,11 +595,12 @@ int remove_memory_block(unsigned long no
 
 	mutex_lock(&mem_sysfs_mutex);
 	mem = find_memory_block(section);
+	unregister_mem_sect_under_nodes(mem, __section_nr(section));
 
 	mem->section_count--;
 	if (mem->section_count == 0) {
-		unregister_mem_sect_under_nodes(mem);
 		mem_remove_simple_file(mem, phys_index);
+		mem_remove_simple_file(mem, end_phys_index);
 		mem_remove_simple_file(mem, state);
 		mem_remove_simple_file(mem, phys_device);
 		mem_remove_simple_file(mem, removable);
Index: linux-2.6/drivers/base/node.c
===================================================================
--- linux-2.6.orig/drivers/base/node.c	2011-01-20 08:20:03.000000000 -0600
+++ linux-2.6/drivers/base/node.c	2011-01-20 08:20:56.000000000 -0600
@@ -375,8 +375,10 @@ int register_mem_sect_under_node(struct
 		return -EFAULT;
 	if (!node_online(nid))
 		return 0;
-	sect_start_pfn = section_nr_to_pfn(mem_blk->phys_index);
-	sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
+
+	sect_start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
+	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
+	sect_end_pfn += PAGES_PER_SECTION - 1;
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
 		int page_nid;
 
@@ -400,7 +402,8 @@ int register_mem_sect_under_node(struct
 }
 
 /* unregister memory section under all nodes that it spans */
-int unregister_mem_sect_under_nodes(struct memory_block *mem_blk)
+int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
+				    unsigned long phys_index)
 {
 	NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL);
 	unsigned long pfn, sect_start_pfn, sect_end_pfn;
@@ -412,7 +415,8 @@ int unregister_mem_sect_under_nodes(stru
 	if (!unlinked_nodes)
 		return -ENOMEM;
 	nodes_clear(*unlinked_nodes);
-	sect_start_pfn = section_nr_to_pfn(mem_blk->phys_index);
+
+	sect_start_pfn = section_nr_to_pfn(phys_index);
 	sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
 		int nid;
Index: linux-2.6/include/linux/memory.h
===================================================================
--- linux-2.6.orig/include/linux/memory.h	2011-01-20 08:18:22.000000000 -0600
+++ linux-2.6/include/linux/memory.h	2011-01-20 08:20:56.000000000 -0600
@@ -21,7 +21,8 @@
 #include <linux/mutex.h>
 
 struct memory_block {
-	unsigned long phys_index;
+	unsigned long start_section_nr;
+	unsigned long end_section_nr;
 	unsigned long state;
 	int section_count;
 
Index: linux-2.6/include/linux/node.h
===================================================================
--- linux-2.6.orig/include/linux/node.h	2011-01-20 08:18:22.000000000 -0600
+++ linux-2.6/include/linux/node.h	2011-01-20 08:20:56.000000000 -0600
@@ -39,7 +39,8 @@ extern int register_cpu_under_node(unsig
 extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
 extern int register_mem_sect_under_node(struct memory_block *mem_blk,
 						int nid);
-extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk);
+extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
+					   unsigned long phys_index);
 
 #ifdef CONFIG_HUGETLBFS
 extern void register_hugetlbfs_with_node(node_registration_func_t doregister,
@@ -67,7 +68,8 @@ static inline int register_mem_sect_unde
 {
 	return 0;
 }
-static inline int unregister_mem_sect_under_nodes(struct memory_block *mem_blk)
+static inline int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
+						  unsigned long phys_index)
 {
 	return 0;
 }




* [PATCH 3/4]Define memory_block_size_bytes for powerpc/pseries
  2011-01-20 16:36 [PATCH 0/4] De-couple sysfs memory directories from memory sections Nathan Fontenot
  2011-01-20 16:43 ` [PATCH 1/4] Allow memory blocks to span multiple " Nathan Fontenot
  2011-01-20 16:44 ` [PATCH 2/4] Update phys_index to [start|end]_section_nr Nathan Fontenot
@ 2011-01-20 16:45 ` Nathan Fontenot
  2011-02-06 23:39   ` Benjamin Herrenschmidt
  2011-01-20 16:45 ` [PATCH 0/4] De-couple sysfs memory directories from memory sections Greg KH
  2011-01-20 16:46 ` [PATCH 4/4] Define memory_block_size_bytes for x86_64 with CONFIG_X86_UV Nathan Fontenot
  4 siblings, 1 reply; 15+ messages in thread
From: Nathan Fontenot @ 2011-01-20 16:45 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-mm, linuxppc-dev, linux-kernel, KAMEZAWA Hiroyuki, Robin Holt

Define a version of memory_block_size_bytes() for powerpc/pseries such that
a memory block spans an entire lmb.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Reviewed-by: Robin Holt <holt@sgi.com>

---
 arch/powerpc/platforms/pseries/hotplug-memory.c |   66 +++++++++++++++++++-----
 1 file changed, 53 insertions(+), 13 deletions(-)

Index: linux-2.6/arch/powerpc/platforms/pseries/hotplug-memory.c
===================================================================
--- linux-2.6.orig/arch/powerpc/platforms/pseries/hotplug-memory.c	2011-01-20 08:18:21.000000000 -0600
+++ linux-2.6/arch/powerpc/platforms/pseries/hotplug-memory.c	2011-01-20 08:21:07.000000000 -0600
@@ -17,6 +17,54 @@
 #include <asm/pSeries_reconfig.h>
 #include <asm/sparsemem.h>
 
+static unsigned long get_memblock_size(void)
+{
+	struct device_node *np;
+	unsigned int memblock_size = 0;
+
+	np = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
+	if (np) {
+		const unsigned long *size;
+
+		size = of_get_property(np, "ibm,lmb-size", NULL);
+		memblock_size = size ? *size : 0;
+
+		of_node_put(np);
+	} else {
+		unsigned int memzero_size = 0;
+		const unsigned int *regs;
+
+		np = of_find_node_by_path("/memory@0");
+		if (np) {
+			regs = of_get_property(np, "reg", NULL);
+			memzero_size = regs ? regs[3] : 0;
+			of_node_put(np);
+		}
+
+		if (memzero_size) {
+			/* We now know the size of memory@0, use this to find
+			 * the first memoryblock and get its size.
+			 */
+			char buf[64];
+
+			sprintf(buf, "/memory@%x", memzero_size);
+			np = of_find_node_by_path(buf);
+			if (np) {
+				regs = of_get_property(np, "reg", NULL);
+				memblock_size = regs ? regs[3] : 0;
+				of_node_put(np);
+			}
+		}
+	}
+
+	return memblock_size;
+}
+
+unsigned long memory_block_size_bytes(void)
+{
+	return get_memblock_size();
+}
+
 static int pseries_remove_memblock(unsigned long base, unsigned int memblock_size)
 {
 	unsigned long start, start_pfn;
@@ -127,30 +175,22 @@ static int pseries_add_memory(struct dev
 
 static int pseries_drconf_memory(unsigned long *base, unsigned int action)
 {
-	struct device_node *np;
-	const unsigned long *lmb_size;
+	unsigned long memblock_size;
 	int rc;
 
-	np = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
-	if (!np)
+	memblock_size = get_memblock_size();
+	if (!memblock_size)
 		return -EINVAL;
 
-	lmb_size = of_get_property(np, "ibm,lmb-size", NULL);
-	if (!lmb_size) {
-		of_node_put(np);
-		return -EINVAL;
-	}
-
 	if (action == PSERIES_DRCONF_MEM_ADD) {
-		rc = memblock_add(*base, *lmb_size);
+		rc = memblock_add(*base, memblock_size);
 		rc = (rc < 0) ? -EINVAL : 0;
 	} else if (action == PSERIES_DRCONF_MEM_REMOVE) {
-		rc = pseries_remove_memblock(*base, *lmb_size);
+		rc = pseries_remove_memblock(*base, memblock_size);
 	} else {
 		rc = -EINVAL;
 	}
 
-	of_node_put(np);
 	return rc;
 }
 



* Re: [PATCH 0/4] De-couple sysfs memory directories from memory sections
  2011-01-20 16:36 [PATCH 0/4] De-couple sysfs memory directories from memory sections Nathan Fontenot
                   ` (2 preceding siblings ...)
  2011-01-20 16:45 ` [PATCH 3/4]Define memory_block_size_bytes for powerpc/pseries Nathan Fontenot
@ 2011-01-20 16:45 ` Greg KH
  2011-01-20 16:51   ` Nathan Fontenot
  2011-01-20 17:09   ` Dave Hansen
  2011-01-20 16:46 ` [PATCH 4/4] Define memory_block_size_bytes for x86_64 with CONFIG_X86_UV Nathan Fontenot
  4 siblings, 2 replies; 15+ messages in thread
From: Greg KH @ 2011-01-20 16:45 UTC (permalink / raw)
  To: Nathan Fontenot
  Cc: linux-mm, linuxppc-dev, linux-kernel, KAMEZAWA Hiroyuki, Robin Holt

On Thu, Jan 20, 2011 at 10:36:40AM -0600, Nathan Fontenot wrote:
> The root of this issue is in sysfs directory creation. Every time
> a directory is created a string compare is done against sibling
> directories ( see sysfs_find_dirent() ) to ensure we do not create 
> duplicates.  The list of directory nodes in sysfs is kept as an
> unsorted list which results in this being an exponentially longer
> operation as the number of directories are created.

Again, are you sure about this?  I thought we resolved this issue in the
past, but you were going to check it.  Did you?

thanks,

greg k-h


* [PATCH 4/4] Define memory_block_size_bytes for x86_64 with CONFIG_X86_UV
  2011-01-20 16:36 [PATCH 0/4] De-couple sysfs memory directories from memory sections Nathan Fontenot
                   ` (3 preceding siblings ...)
  2011-01-20 16:45 ` [PATCH 0/4] De-couple sysfs memory directories from memory sections Greg KH
@ 2011-01-20 16:46 ` Nathan Fontenot
  4 siblings, 0 replies; 15+ messages in thread
From: Nathan Fontenot @ 2011-01-20 16:46 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-mm, linuxppc-dev, linux-kernel, KAMEZAWA Hiroyuki, Robin Holt

Define a version of memory_block_size_bytes for x86_64 when CONFIG_X86_UV is
set.

Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Jack Steiner <steiner@sgi.com>
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

---
 arch/x86/mm/init_64.c |   14 ++++++++++++++
 1 file changed, 14 insertions(+)

Index: linux-2.6/arch/x86/mm/init_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_64.c	2011-01-20 08:18:20.000000000 -0600
+++ linux-2.6/arch/x86/mm/init_64.c	2011-01-20 08:21:10.000000000 -0600
@@ -51,6 +51,7 @@
 #include <asm/numa.h>
 #include <asm/cacheflush.h>
 #include <asm/init.h>
+#include <asm/uv/uv.h>
 
 static int __init parse_direct_gbpages_off(char *arg)
 {
@@ -908,6 +909,19 @@ const char *arch_vma_name(struct vm_area
 	return NULL;
 }
 
+#ifdef CONFIG_X86_UV
+#define MIN_MEMORY_BLOCK_SIZE   (1 << SECTION_SIZE_BITS)
+
+unsigned long memory_block_size_bytes(void)
+{
+	if (is_uv_system()) {
+		printk(KERN_INFO "UV: memory block size 2GB\n");
+		return 2UL * 1024 * 1024 * 1024;
+	}
+	return MIN_MEMORY_BLOCK_SIZE;
+}
+#endif
+
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 /*
  * Initialise the sparsemem vmemmap using huge-pages at the PMD level.




* Re: [PATCH 0/4] De-couple sysfs memory directories from memory sections
  2011-01-20 16:45 ` [PATCH 0/4] De-couple sysfs memory directories from memory sections Greg KH
@ 2011-01-20 16:51   ` Nathan Fontenot
  2011-01-20 17:25     ` Greg KH
  2011-01-20 17:09   ` Dave Hansen
  1 sibling, 1 reply; 15+ messages in thread
From: Nathan Fontenot @ 2011-01-20 16:51 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-mm, linuxppc-dev, linux-kernel, KAMEZAWA Hiroyuki, Robin Holt

On 01/20/2011 10:45 AM, Greg KH wrote:
> On Thu, Jan 20, 2011 at 10:36:40AM -0600, Nathan Fontenot wrote:
>> The root of this issue is in sysfs directory creation. Every time
>> a directory is created a string compare is done against sibling
>> directories ( see sysfs_find_dirent() ) to ensure we do not create 
>> duplicates.  The list of directory nodes in sysfs is kept as an
>> unsorted list which results in this being an exponentially longer
>> operation as the number of directories are created.
> 
> Again, are you sure about this?  I thought we resolved this issue in the
> past, but you were going to check it.  Did you?
> 

Yes, the string compare is still present in the sysfs code.  There was
discussion around this sometime last year when I sent a patch out that
stored the directory entries in something other than a linked list.
That patch was rejected but it was agreed that something should be done.

-Nathan
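
A minimal sketch of the cost being described, assuming an unsorted singly
linked list of siblings and a duplicate check before every insert. The
names (dirent_node, find_dirent, add_dirent, count_compares) are
hypothetical and are not the sysfs implementation. Because every new name
is compared against every existing sibling, the total number of compares in
this sketch grows roughly with the square of the directory count.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a directory entry; not the kernel's struct. */
struct dirent_node {
	char name[32];
	struct dirent_node *next;
};

/* Walk the unsorted sibling list, counting every string compare. */
static struct dirent_node *find_dirent(struct dirent_node *head,
				       const char *name,
				       unsigned long *compares)
{
	for (; head; head = head->next) {
		(*compares)++;
		if (strcmp(head->name, name) == 0)
			return head;
	}
	return NULL;
}

/* Add a name after checking for duplicates; 0 on success, -1 otherwise. */
static int add_dirent(struct dirent_node **head, const char *name,
		      unsigned long *compares)
{
	struct dirent_node *node;

	if (find_dirent(*head, name, compares))
		return -1;

	node = malloc(sizeof(*node));
	if (!node)
		return -1;
	snprintf(node->name, sizeof(node->name), "%s", name);
	node->next = *head;
	*head = node;
	return 0;
}

/* Create ndirs uniquely named entries and return the compares performed. */
unsigned long count_compares(unsigned int ndirs)
{
	struct dirent_node *head = NULL, *next;
	unsigned long compares = 0;
	char name[32];
	unsigned int i;

	for (i = 0; i < ndirs; i++) {
		snprintf(name, sizeof(name), "memory%u", i);
		add_dirent(&head, name, &compares);
	}
	for (; head; head = next) {
		next = head->next;
		free(head);
	}
	return compares;
}
```

For n unique names the sketch performs n * (n - 1) / 2 compares, so
count_compares(1000) is 499500.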


* Re: [PATCH 0/4] De-couple sysfs memory directories from memory sections
  2011-01-20 16:45 ` [PATCH 0/4] De-couple sysfs memory directories from memory sections Greg KH
  2011-01-20 16:51   ` Nathan Fontenot
@ 2011-01-20 17:09   ` Dave Hansen
  1 sibling, 0 replies; 15+ messages in thread
From: Dave Hansen @ 2011-01-20 17:09 UTC (permalink / raw)
  To: Greg KH
  Cc: Nathan Fontenot, linux-mm, linuxppc-dev, linux-kernel,
	KAMEZAWA Hiroyuki, Robin Holt

On Thu, 2011-01-20 at 08:45 -0800, Greg KH wrote:
> On Thu, Jan 20, 2011 at 10:36:40AM -0600, Nathan Fontenot wrote:
> > The root of this issue is in sysfs directory creation. Every time
> > a directory is created a string compare is done against sibling
> > directories ( see sysfs_find_dirent() ) to ensure we do not create 
> > duplicates.  The list of directory nodes in sysfs is kept as an
> > unsorted list which results in this being an exponentially longer
> > operation as the number of directories are created.
> 
> Again, are you sure about this?  I thought we resolved this issue in the
> past, but you were going to check it.  Did you?

Just to be clear, simply reducing the number of kobjects can make these
patches worthwhile on their own.  I originally figured that the
SECTION_SIZE would go up over time as systems got larger, and _that_
would keep the number of sections and number of sysfs objects down.
Well, that turned out to be wrong, and we're eating up a ton of memory
now.  We can't fix the SECTION_SIZE easily, but we can reduce the number
of kobjects that we need to track the sections.  *That* is the main
benefit I see from these patches.

I think there's a problem worth fixing, even ignoring the directory
creation issue (if it still exists).

-- Dave
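
The reduction in tracked objects can be estimated with simple arithmetic.
This is a rough sketch only: the helper name memory_dirs_needed is
hypothetical, and the 16 MiB section size and 256 MiB block size used in
the note below are assumptions for illustration, not values taken from
this thread.

```c
#include <stdint.h>

/*
 * Number of memoryXXX directories needed to cover total_bytes when each
 * directory spans block_bytes.  Rounds up; returns 0 if block_bytes is 0.
 */
uint64_t memory_dirs_needed(uint64_t total_bytes, uint64_t block_bytes)
{
	if (block_bytes == 0)
		return 0;
	return total_bytes / block_bytes + (total_bytes % block_bytes != 0);
}
```

With those assumed sizes, 1 TiB of memory needs 65536 directories at one
directory per 16 MiB section, and 4096 directories at one per 256 MiB
block.  With the 2 GB block size used for UV systems in patch 4/4, the
same 1 TiB needs 512.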



* Re: [PATCH 0/4] De-couple sysfs memory directories from memory sections
  2011-01-20 16:51   ` Nathan Fontenot
@ 2011-01-20 17:25     ` Greg KH
  0 siblings, 0 replies; 15+ messages in thread
From: Greg KH @ 2011-01-20 17:25 UTC (permalink / raw)
  To: Nathan Fontenot
  Cc: linux-mm, linuxppc-dev, linux-kernel, KAMEZAWA Hiroyuki, Robin Holt

On Thu, Jan 20, 2011 at 10:51:44AM -0600, Nathan Fontenot wrote:
> On 01/20/2011 10:45 AM, Greg KH wrote:
> > On Thu, Jan 20, 2011 at 10:36:40AM -0600, Nathan Fontenot wrote:
> >> The root of this issue is in sysfs directory creation. Every time
> >> a directory is created a string compare is done against sibling
> >> directories ( see sysfs_find_dirent() ) to ensure we do not create 
> >> duplicates.  The list of directory nodes in sysfs is kept as an
> >> unsorted list which results in this being an exponentially longer
> >> operation as the number of directories are created.
> > 
> > Again, are you sure about this?  I thought we resolved this issue in the
> > past, but you were going to check it.  Did you?
> > 
> 
> Yes, the string compare is still present in the sysfs code.  There was
> discussion around this sometime last year when I sent a patch out that
> stored the directory entries in something other than a linked list.
> That patch was rejected but it was agreed that something should be done.

Ah, ok, thanks for verifying.

greg k-h


* Re: [PATCH 3/4]Define memory_block_size_bytes for powerpc/pseries
  2011-01-20 16:45 ` [PATCH 3/4]Define memory_block_size_bytes for powerpc/pseries Nathan Fontenot
@ 2011-02-06 23:39   ` Benjamin Herrenschmidt
  2011-02-07  1:42     ` Greg KH
  0 siblings, 1 reply; 15+ messages in thread
From: Benjamin Herrenschmidt @ 2011-02-06 23:39 UTC (permalink / raw)
  To: Nathan Fontenot
  Cc: Greg KH, linux-mm, linuxppc-dev, linux-kernel, KAMEZAWA Hiroyuki,
	Robin Holt

On Thu, 2011-01-20 at 10:45 -0600, Nathan Fontenot wrote:
> Define a version of memory_block_size_bytes() for powerpc/pseries such that
> a memory block spans an entire lmb.
> 
> Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
> Reviewed-by: Robin Holt <holt@sgi.com>

Hi Nathan !

Is somebody from -mm picking the rest of the series ? This patch as well
or shall I wait for the first two to go in and then pick that one in
-powerpc ?

Cheers,
Ben.

> ---
>  arch/powerpc/platforms/pseries/hotplug-memory.c |   66 +++++++++++++++++++-----
>  1 file changed, 53 insertions(+), 13 deletions(-)
> 
> Index: linux-2.6/arch/powerpc/platforms/pseries/hotplug-memory.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/platforms/pseries/hotplug-memory.c	2011-01-20 08:18:21.000000000 -0600
> +++ linux-2.6/arch/powerpc/platforms/pseries/hotplug-memory.c	2011-01-20 08:21:07.000000000 -0600
> @@ -17,6 +17,54 @@
>  #include <asm/pSeries_reconfig.h>
>  #include <asm/sparsemem.h>
>  
> +static unsigned long get_memblock_size(void)
> +{
> +	struct device_node *np;
> +	unsigned int memblock_size = 0;
> +
> +	np = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
> +	if (np) {
> +		const unsigned long *size;
> +
> +		size = of_get_property(np, "ibm,lmb-size", NULL);
> +		memblock_size = size ? *size : 0;
> +
> +		of_node_put(np);
> +	} else {
> +		unsigned int memzero_size = 0;
> +		const unsigned int *regs;
> +
> +		np = of_find_node_by_path("/memory@0");
> +		if (np) {
> +			regs = of_get_property(np, "reg", NULL);
> +			memzero_size = regs ? regs[3] : 0;
> +			of_node_put(np);
> +		}
> +
> +		if (memzero_size) {
> +			/* We now know the size of memory@0, use this to find
> +			 * the first memoryblock and get its size.
> +			 */
> +			char buf[64];
> +
> +			sprintf(buf, "/memory@%x", memzero_size);
> +			np = of_find_node_by_path(buf);
> +			if (np) {
> +				regs = of_get_property(np, "reg", NULL);
> +				memblock_size = regs ? regs[3] : 0;
> +				of_node_put(np);
> +			}
> +		}
> +	}
> +
> +	return memblock_size;
> +}
> +
> +unsigned long memory_block_size_bytes(void)
> +{
> +	return get_memblock_size();
> +}
> +
>  static int pseries_remove_memblock(unsigned long base, unsigned int memblock_size)
>  {
>  	unsigned long start, start_pfn;
> @@ -127,30 +175,22 @@ static int pseries_add_memory(struct dev
>  
>  static int pseries_drconf_memory(unsigned long *base, unsigned int action)
>  {
> -	struct device_node *np;
> -	const unsigned long *lmb_size;
> +	unsigned long memblock_size;
>  	int rc;
>  
> -	np = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
> -	if (!np)
> +	memblock_size = get_memblock_size();
> +	if (!memblock_size)
>  		return -EINVAL;
>  
> -	lmb_size = of_get_property(np, "ibm,lmb-size", NULL);
> -	if (!lmb_size) {
> -		of_node_put(np);
> -		return -EINVAL;
> -	}
> -
>  	if (action == PSERIES_DRCONF_MEM_ADD) {
> -		rc = memblock_add(*base, *lmb_size);
> +		rc = memblock_add(*base, memblock_size);
>  		rc = (rc < 0) ? -EINVAL : 0;
>  	} else if (action == PSERIES_DRCONF_MEM_REMOVE) {
> -		rc = pseries_remove_memblock(*base, *lmb_size);
> +		rc = pseries_remove_memblock(*base, memblock_size);
>  	} else {
>  		rc = -EINVAL;
>  	}
>  
> -	of_node_put(np);
>  	return rc;
>  }
>  
> 




* Re: [PATCH 3/4]Define memory_block_size_bytes for powerpc/pseries
  2011-02-06 23:39   ` Benjamin Herrenschmidt
@ 2011-02-07  1:42     ` Greg KH
  0 siblings, 0 replies; 15+ messages in thread
From: Greg KH @ 2011-02-07  1:42 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Nathan Fontenot, linux-mm, linuxppc-dev, linux-kernel,
	KAMEZAWA Hiroyuki, Robin Holt

On Mon, Feb 07, 2011 at 10:39:23AM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2011-01-20 at 10:45 -0600, Nathan Fontenot wrote:
> > Define a version of memory_block_size_bytes() for powerpc/pseries such that
> > a memory block spans an entire lmb.
> > 
> > Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
> > Reviewed-by: Robin Holt <holt@sgi.com>
> 
> Hi Nathan !
> 
> Is somebody from -mm picking the rest of the series ? This patch as well
> or shall I wait for the first two to go in and then pick that one in
> -powerpc ?

I took all of these in my tree already, is that ok?

thanks,

greg k-h


* Re: [PATCH 0/4] De-couple sysfs memory directories from memory sections
  2011-01-10 18:47   ` Nathan Fontenot
@ 2011-01-10 19:11     ` Robin Holt
  0 siblings, 0 replies; 15+ messages in thread
From: Robin Holt @ 2011-01-10 19:11 UTC (permalink / raw)
  To: Nathan Fontenot
  Cc: Greg KH, linux-kernel, linuxppc-dev, linux-mm, KAMEZAWA Hiroyuki,
	Robin Holt

> >> The root of this issue is in sysfs directory creation. Every time
> >> a directory is created a string compare is done against all sibling
> >> directories to ensure we do not create duplicates.  The list of
> >> directory nodes in sysfs is kept as an unsorted list which results
> >> in this being an exponentially longer operation as the number of
> >> directories are created.
> > 
> > Are you sure this is still an issue?  I thought we solved this last
> > kernel or so with a simple patch?
> 
> I'll go back and look at this again.

What I recall fixing is the symbolic linking from the node* to the
memory section.  In that case, we cached the most recent mem section
and since they always were added sequentially, the cache saved a rescan.

Of course, I could be remembering something completely unrelated.

Robin
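
The caching idea recalled above can be sketched as a one-entry hint on a
list that is filled in ascending order.  This is an illustration under
that assumption only; the types and functions (section_entry,
section_list, find_section, sequential_lookup_cost) are hypothetical and
are not the kernel code being remembered.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical list entry, kept in ascending section_nr order. */
struct section_entry {
	unsigned long section_nr;
	struct section_entry *next;
};

struct section_list {
	struct section_entry *head;
	struct section_entry *hint;	/* most recently found entry */
	unsigned long nodes_examined;	/* bookkeeping for the sketch */
};

/* Look up nr, optionally starting from the cached hint instead of head. */
struct section_entry *find_section(struct section_list *list,
				   unsigned long nr, int use_hint)
{
	struct section_entry *e = list->head;

	if (use_hint && list->hint && list->hint->section_nr <= nr)
		e = list->hint;

	for (; e; e = e->next) {
		list->nodes_examined++;
		if (e->section_nr == nr) {
			list->hint = e;
			return e;
		}
	}
	return NULL;
}

/* Look up sections 0..n-1 in order; return how many nodes were examined. */
unsigned long sequential_lookup_cost(unsigned long n, int use_hint)
{
	struct section_list list = { NULL, NULL, 0 };
	struct section_entry *entries;
	unsigned long i;

	if (n == 0)
		return 0;
	entries = calloc(n, sizeof(*entries));
	if (!entries)
		return 0;

	for (i = 0; i < n; i++) {
		entries[i].section_nr = i;
		entries[i].next = (i + 1 < n) ? &entries[i + 1] : NULL;
	}
	list.head = entries;

	for (i = 0; i < n; i++)
		find_section(&list, i, use_hint);

	free(entries);
	return list.nodes_examined;
}
```

For eight sequential lookups the sketch examines 36 nodes without the
hint and 15 with it, since each lookup after the first starts one entry
before its target.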


* Re: [PATCH 0/4] De-couple sysfs memory directories from memory sections
  2011-01-10 18:44 ` Greg KH
@ 2011-01-10 18:47   ` Nathan Fontenot
  2011-01-10 19:11     ` Robin Holt
  0 siblings, 1 reply; 15+ messages in thread
From: Nathan Fontenot @ 2011-01-10 18:47 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-kernel, linuxppc-dev, linux-mm, KAMEZAWA Hiroyuki, Robin Holt

On 01/10/2011 12:44 PM, Greg KH wrote:
> On Mon, Jan 10, 2011 at 12:08:56PM -0600, Nathan Fontenot wrote:
>> This is a re-send of the remaining patches that did not make it
>> into the last kernel release for de-coupling sysfs memory
>> directories from memory sections.  The first three patches of the
>> previous set went in, and this is the remaining patches that
>> need to be applied.
> 
> Well, it's a bit late right now, as we are merging stuff that is already
> in our trees, and we are busy with that, so this is likely to be ignored
> until after .38-rc1 is out.
> 
> So, care to resend this after .38-rc1 is out so people can pay attention
> to it?

I was afraid of this. I didn't get a chance to get it out sooner but thought
I would send it out anyway.

> 
> 
>> The root of this issue is in sysfs directory creation. Every time
>> a directory is created a string compare is done against all sibling
>> directories to ensure we do not create duplicates.  The list of
>> directory nodes in sysfs is kept as an unsorted list which results
>> in this being an exponentially longer operation as the number of
>> directories are created.
> 
> Are you sure this is still an issue?  I thought we solved this last
> kernel or so with a simple patch?

I'll go back and look at this again.

thanks,
-Nathan


* Re: [PATCH 0/4] De-couple sysfs memory directories from memory sections
  2011-01-10 18:08 [PATCH 0/4] De-couple sysfs memory directories from memory sections Nathan Fontenot
@ 2011-01-10 18:44 ` Greg KH
  2011-01-10 18:47   ` Nathan Fontenot
  0 siblings, 1 reply; 15+ messages in thread
From: Greg KH @ 2011-01-10 18:44 UTC (permalink / raw)
  To: Nathan Fontenot
  Cc: linux-kernel, linuxppc-dev, linux-mm, KAMEZAWA Hiroyuki, Robin Holt

On Mon, Jan 10, 2011 at 12:08:56PM -0600, Nathan Fontenot wrote:
> This is a re-send of the remaining patches that did not make it
> into the last kernel release for de-coupling sysfs memory
> directories from memory sections.  The first three patches of the
> previous set went in, and this is the remaining patches that
> need to be applied.

Well, it's a bit late right now, as we are merging stuff that is already
in our trees, and we are busy with that, so this is likely to be ignored
until after .38-rc1 is out.

So, care to resend this after .38-rc1 is out so people can pay attention
to it?


> The root of this issue is in sysfs directory creation. Every time
> a directory is created a string compare is done against all sibling
> directories to ensure we do not create duplicates.  The list of
> directory nodes in sysfs is kept as an unsorted list which results
> in this being an exponentially longer operation as the number of
> directories are created.

Are you sure this is still an issue?  I thought we solved this last
kernel or so with a simple patch?

thanks,

greg k-h


* [PATCH 0/4] De-couple sysfs memory directories from memory sections
@ 2011-01-10 18:08 Nathan Fontenot
  2011-01-10 18:44 ` Greg KH
  0 siblings, 1 reply; 15+ messages in thread
From: Nathan Fontenot @ 2011-01-10 18:08 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-kernel, linuxppc-dev, linux-mm, KAMEZAWA Hiroyuki, Robin Holt

This is a re-send of the remaining patches that did not make it
into the last kernel release for de-coupling sysfs memory
directories from memory sections.  The first three patches of the
previous set went in, and these are the remaining patches that
need to be applied.

The patches remove the assumption that a single memory
section corresponds to a single directory in
/sys/devices/system/memory/.  On systems
with large amounts of memory (1+ TB) there are performance issues
related to creating the large number of sysfs directories.  For
a powerpc machine with 1 TB of memory we are creating 63,000+
directories.  This is resulting in boot times of around 45-50
minutes for systems with 1 TB of memory and 8 hours for systems
with 2 TB of memory.  With this patch set applied I am now seeing
boot times of 5 minutes or less.

The root of this issue is in sysfs directory creation. Every time
a directory is created a string compare is done against all sibling
directories to ensure we do not create duplicates.  The list of
directory nodes in sysfs is kept as an unsorted list, so each
creation scans every existing sibling and the total cost grows
quadratically with the number of directories.

The solution implemented by this patch set is to allow a single
directory in sysfs to span multiple memory sections.  This is
controlled by an optional architecturally defined function
memory_block_size_bytes().  The default definition of this
routine returns a memory block size equal to the memory section
size.  This keeps the layout of the sysfs memory directories, as
seen from userspace, the same as it is today.

For architectures that define their own version of this routine,
as is done for powerpc and x86 in this patchset, the view in userspace
changes such that each memoryXXX directory spans multiple memory
sections.  The number of sections spanned depends on the value
returned by memory_block_size_bytes().

-Nathan Fontenot
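
A minimal userspace sketch of the mechanism described above, assuming the
default is provided as a weak symbol that an architecture can override.
The weak attribute is a GCC/Clang extension, the SECTION_SIZE_BITS value
of 27 is an assumption for illustration, and sections_per_block() is a
hypothetical helper rather than code from the patches.

```c
/* Assumed value for illustration; the real value is per-architecture. */
#define SECTION_SIZE_BITS	27
#define MIN_MEMORY_BLOCK_SIZE	(1UL << SECTION_SIZE_BITS)

/*
 * Default definition: one memory block per memory section.  Declared weak
 * so that an architecture-specific definition elsewhere can replace it.
 */
__attribute__((weak)) unsigned long memory_block_size_bytes(void)
{
	return MIN_MEMORY_BLOCK_SIZE;
}

/*
 * Hypothetical helper: how many sections one memoryXXX directory spans.
 * Falls back to 1 if the block size is not a whole multiple of the
 * section size.
 */
unsigned long sections_per_block(unsigned long block_bytes)
{
	if (block_bytes < MIN_MEMORY_BLOCK_SIZE ||
	    block_bytes % MIN_MEMORY_BLOCK_SIZE)
		return 1;
	return block_bytes / MIN_MEMORY_BLOCK_SIZE;
}
```

With the default, sections_per_block(memory_block_size_bytes()) is 1, so
userspace sees the same layout as before.  With a 2 GB block and the
assumed 128 MB section size, each directory would span 16 sections.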


