From: Greg KH <gregkh@suse.de>
To: linux-kernel@vger.kernel.org, stable@kernel.org
Cc: Justin Forbes <jmforbes@linuxtx.org>,
	Zwane Mwaikambo <zwane@arm.linux.org.uk>,
	"Theodore Ts'o" <tytso@mit.edu>,
	Randy Dunlap <rdunlap@xenotime.net>,
	Dave Jones <davej@redhat.com>,
	Chuck Wolber <chuckw@quantumlinux.com>,
	Chris Wedgwood <reviews@ml.cw.f00f.org>,
	Michael Krufky <mkrufky@linuxtv.org>,
	Chuck Ebbert <cebbert@redhat.com>,
	Domenico Andreoli <cavokz@gmail.com>, Willy Tarreau <w@1wt.eu>,
	Rodrigo Rubira Branco <rbranco@la.checkpoint.com>,
	Jake Edge <jake@lwn.net>, Eugene Teo <eteo@redhat.com>,
	torvalds@linux-foundation.org, akpm@linux-foundation.org,
	alan@lxorguk.ukuu.org.uk,
	Jon Tollefson <kniht@linux.vnet.ibm.com>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>
Subject: [patch 28/57] powerpc: Reserve in bootmem lmb reserved regions that cross NUMA nodes
Date: Tue, 4 Nov 2008 15:31:59 -0800
Message-ID: <20081104233159.GC659@suse.de>
In-Reply-To: <20081104233028.GA659@suse.de>

[-- Attachment #1: powerpc-reserve-in-bootmem-lmb-reserved-regions-that-cross-numa-nodes.patch --]
[-- Type: text/plain, Size: 6658 bytes --]

2.6.27-stable review patch.  If anyone has any objections, please let us know.

------------------
From: Jon Tollefson <kniht@linux.vnet.ibm.com>

commit 8f64e1f2d1e09267ac926e15090fd505c1c0cbcb upstream

If there are multiple memory blocks reserved via lmb_reserve() that are at
contiguous addresses and on different NUMA nodes, we lose track of which
address ranges to reserve in bootmem on which node.  I discovered this
when I recently got to try 16GB huge pages on a system with more than 2 nodes.

When scanning the device tree in early boot we call lmb_reserve() with
the addresses of the 16G pages that we find so that the memory doesn't
get used for something else.  For example the addresses for the pages
could be 4000000000, 4400000000, 4800000000, 4C00000000, etc - 8 pages,
one on each of eight nodes.  In the lmb after all the pages have been
reserved it will look something like the following:

lmb_dump_all:
    memory.cnt            = 0x2
    memory.size           = 0x3e80000000
    memory.region[0x0].base       = 0x0
                      .size     = 0x1e80000000
    memory.region[0x1].base       = 0x4000000000
                      .size     = 0x2000000000
    reserved.cnt          = 0x5
    reserved.size         = 0x3e80000000
    reserved.region[0x0].base       = 0x0
                      .size     = 0x7b5000
    reserved.region[0x1].base       = 0x2a00000
                      .size     = 0x78c000
    reserved.region[0x2].base       = 0x328c000
                      .size     = 0x43000
    reserved.region[0x3].base       = 0xf4e8000
                      .size     = 0xb18000
    reserved.region[0x4].base       = 0x4000000000
                      .size     = 0x2000000000
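
As an illustration of why the eight pages show up as one entry, here is a
minimal sketch of a reserve list that merges adjacent ranges.  It is not the
lmb code; the toy_* names and the merge-with-last-entry rule are assumptions
made for the example.  Eight 16G pages starting at 0x4000000000 add up to
8 * 0x400000000 = 0x2000000000, matching reserved.region[0x4] above.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_REGIONS 16

struct toy_region {
	uint64_t base;
	uint64_t size;
};

struct toy_regions {
	unsigned int cnt;
	struct toy_region region[TOY_MAX_REGIONS];
};

/* Append [base, base + size), merging with the last entry when adjacent. */
static void toy_reserve(struct toy_regions *r, uint64_t base, uint64_t size)
{
	if (r->cnt > 0) {
		struct toy_region *last = &r->region[r->cnt - 1];

		if (last->base + last->size == base) {
			last->size += size;	/* contiguous: grow in place */
			return;
		}
	}
	assert(r->cnt < TOY_MAX_REGIONS);
	r->region[r->cnt].base = base;
	r->region[r->cnt].size = size;
	r->cnt++;
}

/* Reserve eight 16G pages starting at 0x4000000000, as in the example. */
static void toy_reserve_huge_pages(struct toy_regions *r)
{
	const uint64_t page_size = 16ULL << 30;	/* 0x400000000 */
	int i;

	for (i = 0; i < 8; i++)
		toy_reserve(r, 0x4000000000ULL + i * page_size, page_size);
}
```

Calling toy_reserve_huge_pages() on a zeroed struct toy_regions leaves a
single entry, so the node boundaries between the pages are no longer visible
in the list.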

The reserved.region[0x4] contains the 16G pages.  In
arch/powerpc/mm/numa.c: do_init_bootmem() we loop through each of the
node numbers looking for the reserved regions that belong to the
particular node.  That loop is not able to identify region 0x4 as being
a part of each of the 8 nodes, because it assumes that a reserved region
is only on a single node.
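
A small model of the per-node filter that this patch removes shows the
effect.  The node layout is an assumption made for the example (eight nodes,
each owning one 16G slice starting at 0x4000000000), toy_addr_to_nid() is a
hypothetical stand-in for early_pfn_to_nid(), and byte addresses are used
instead of page frame numbers to keep it short.

```c
#include <stdint.h>

#define TOY_NODES 8
#define TOY_BASE  0x4000000000ULL	/* start of the reserved region */
#define TOY_SLICE 0x400000000ULL	/* 16G owned by each node (assumed) */

/* Hypothetical stand-in for early_pfn_to_nid(), on byte addresses. */
static int toy_addr_to_nid(uint64_t addr)
{
	return (int)((addr - TOY_BASE) / TOY_SLICE);
}

/* The old filter: skip the region unless its first or last byte is on nid. */
static int toy_old_check_passes(int nid, uint64_t physbase, uint64_t size)
{
	if (toy_addr_to_nid(physbase) != nid &&
	    toy_addr_to_nid(physbase + size - 1) != nid)
		return 0;	/* the removed code did "continue" here */
	return 1;
}

/* Count how many of the eight nodes would reserve anything at all. */
static int toy_count_reserving_nodes(void)
{
	int nid, count = 0;

	for (nid = 0; nid < TOY_NODES; nid++)
		count += toy_old_check_passes(nid, TOY_BASE,
					      TOY_NODES * TOY_SLICE);
	return count;
}
```

Under these assumptions only the first and the last node pass the filter;
the six nodes in between never see the region, so their 16G pages would not
be reserved in bootmem.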

This patch takes out the reserved region loop from inside
the loop that goes over each node.  It looks up the active region containing
the start of the reserved region.  If it extends past that active region then
it adjusts the size and gets the next active region containing it.
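
The walk described above can be sketched outside the kernel roughly as
follows.  This is only an illustration under stated assumptions, not the
code in the diff below: the toy_* types and functions are hypothetical, a
plain array stands in for the active region table, and byte addresses are
used instead of page frame numbers.  The sketch recomputes the length of
each piece on every step.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_active_region {
	uint64_t start;	/* inclusive */
	uint64_t end;	/* exclusive */
	int nid;
};

struct toy_piece {
	uint64_t base;
	uint64_t size;
	int nid;
};

/* Find the active region containing addr; NULL if none does. */
static const struct toy_active_region *
toy_find_region(const struct toy_active_region *ar, int n, uint64_t addr)
{
	int i;

	for (i = 0; i < n; i++)
		if (ar[i].start <= addr && addr < ar[i].end)
			return &ar[i];
	return NULL;
}

/*
 * Split [base, base + size) at active region boundaries.
 * Writes at most max pieces to out and returns how many were written.
 */
static int toy_split_reserved(const struct toy_active_region *ar, int n,
			      uint64_t base, uint64_t size,
			      struct toy_piece *out, int max)
{
	uint64_t end = base + size;
	int cnt = 0;

	while (base < end && cnt < max) {
		const struct toy_active_region *r;
		uint64_t piece_end;

		r = toy_find_region(ar, n, base);
		if (!r)
			break;	/* no active region covers this address */

		/* trim the piece to the end of the current active region */
		piece_end = end < r->end ? end : r->end;
		out[cnt].base = base;
		out[cnt].size = piece_end - base;
		out[cnt].nid = r->nid;
		cnt++;

		base = piece_end;	/* continue in the next active region */
	}
	return cnt;
}
```

With eight 16G active regions on nodes 0 to 7 starting at 0x4000000000, the
single reserved region from the dump comes out as eight pieces, one per node,
each of which could then be handed to that node's bootmem allocator.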

Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 arch/powerpc/mm/numa.c |  108 ++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 80 insertions(+), 28 deletions(-)

--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -89,6 +89,46 @@ static int __cpuinit fake_numa_create_ne
 	return 0;
 }
 
+/*
+ * get_active_region_work_fn - A helper function for get_node_active_region
+ *	Returns datax set to the start_pfn and end_pfn if they contain
+ *	the initial value of datax->start_pfn between them
+ * @start_pfn: start page(inclusive) of region to check
+ * @end_pfn: end page(exclusive) of region to check
+ * @datax: comes in with ->start_pfn set to value to search for and
+ *	goes out with active range if it contains it
+ * Returns 1 if search value is in range else 0
+ */
+static int __init get_active_region_work_fn(unsigned long start_pfn,
+					unsigned long end_pfn, void *datax)
+{
+	struct node_active_region *data;
+	data = (struct node_active_region *)datax;
+
+	if (start_pfn <= data->start_pfn && end_pfn > data->start_pfn) {
+		data->start_pfn = start_pfn;
+		data->end_pfn = end_pfn;
+		return 1;
+	}
+	return 0;
+
+}
+
+/*
+ * get_node_active_region - Return active region containing start_pfn
+ * @start_pfn: The page to return the region for.
+ * @node_ar: Returned set to the active region containing start_pfn
+ */
+static void __init get_node_active_region(unsigned long start_pfn,
+		       struct node_active_region *node_ar)
+{
+	int nid = early_pfn_to_nid(start_pfn);
+
+	node_ar->nid = nid;
+	node_ar->start_pfn = start_pfn;
+	work_with_active_regions(nid, get_active_region_work_fn, node_ar);
+}
+
 static void __cpuinit map_cpu_to_node(int cpu, int node)
 {
 	numa_cpu_lookup_table[cpu] = node;
@@ -837,38 +877,50 @@ void __init do_init_bootmem(void)
 				  start_pfn, end_pfn);
 
 		free_bootmem_with_active_regions(nid, end_pfn);
+	}
 
-		/* Mark reserved regions on this node */
-		for (i = 0; i < lmb.reserved.cnt; i++) {
-			unsigned long physbase = lmb.reserved.region[i].base;
-			unsigned long size = lmb.reserved.region[i].size;
-			unsigned long start_paddr = start_pfn << PAGE_SHIFT;
-			unsigned long end_paddr = end_pfn << PAGE_SHIFT;
-
-			if (early_pfn_to_nid(physbase >> PAGE_SHIFT) != nid &&
-			    early_pfn_to_nid((physbase+size-1) >> PAGE_SHIFT) != nid)
-				continue;
-
-			if (physbase < end_paddr &&
-			    (physbase+size) > start_paddr) {
-				/* overlaps */
-				if (physbase < start_paddr) {
-					size -= start_paddr - physbase;
-					physbase = start_paddr;
-				}
-
-				if (size > end_paddr - physbase)
-					size = end_paddr - physbase;
-
-				dbg("reserve_bootmem %lx %lx\n", physbase,
-				    size);
-				reserve_bootmem_node(NODE_DATA(nid), physbase,
-						     size, BOOTMEM_DEFAULT);
-			}
+	/* Mark reserved regions */
+	for (i = 0; i < lmb.reserved.cnt; i++) {
+		unsigned long physbase = lmb.reserved.region[i].base;
+		unsigned long size = lmb.reserved.region[i].size;
+		unsigned long start_pfn = physbase >> PAGE_SHIFT;
+		unsigned long end_pfn = ((physbase + size) >> PAGE_SHIFT);
+		struct node_active_region node_ar;
+
+		get_node_active_region(start_pfn, &node_ar);
+		while (start_pfn < end_pfn) {
+			/*
+			 * if reserved region extends past active region
+			 * then trim size to active region
+			 */
+			if (end_pfn > node_ar.end_pfn)
+				size = (node_ar.end_pfn << PAGE_SHIFT)
+					- (start_pfn << PAGE_SHIFT);
+			dbg("reserve_bootmem %lx %lx nid=%d\n", physbase, size,
+				node_ar.nid);
+			reserve_bootmem_node(NODE_DATA(node_ar.nid), physbase,
+						size, BOOTMEM_DEFAULT);
+			/*
+			 * if reserved region is contained in the active region
+			 * then done.
+			 */
+			if (end_pfn <= node_ar.end_pfn)
+				break;
+
+			/*
+			 * reserved region extends past the active region
+			 *   get next active region that contains this
+			 *   reserved region
+			 */
+			start_pfn = node_ar.end_pfn;
+			physbase = start_pfn << PAGE_SHIFT;
+			get_node_active_region(start_pfn, &node_ar);
 		}
 
-		sparse_memory_present_with_active_regions(nid);
 	}
+
+	for_each_online_node(nid)
+		sparse_memory_present_with_active_regions(nid);
 }
 
 void __init paging_init(void)

-- 
